
Visual Data Mining Techniques

This document provides an overview of visual data mining techniques. It discusses how large amounts of data are now routinely collected and stored, but exploring and analyzing this data has become difficult. Visualization techniques can help address this by directly involving users in the data mining process. The document classifies visualization techniques based on the type of data to be visualized, the visualization technique used, and interaction techniques. It provides examples of popular techniques like geometric transformations, dense pixel displays, iconic displays, and stacked displays. The goal of visual data mining is to combine automated data mining algorithms with human visualization and analysis to more effectively explore large datasets.


First publ. in: Visualization Handbook / ed. by Charles D. Hansen ... Amsterdam : Elsevier, 2004, pp. 831-843

43 Visual Data-Mining Techniques*

DANIEL A. KEIM, MIKE SIPS


University of Konstanz, Germany
MIHAEL ANKERST
The Boeing Company, USA

43.1 Introduction

Never before in history has data been generated at such high volumes as it is today. Exploring and analyzing the vast volumes of data has become increasingly difficult. Information visualization and visual data mining can help to deal with the flood of information. The advantage of visual data exploration is that the user is directly involved in the data-mining process. A large number of information visualization techniques have been developed over the last few years to support the exploration of large datasets. In this chapter, we provide an overview of information visualization and visual data-mining techniques and illustrate them using a few examples.

The progress made in hardware technology allows today's computer systems to store very large amounts of data. Researchers from the University of California, Berkeley estimate that every year about 1 exabyte (1 million terabytes) of data is generated, of which a large portion is available in digital form. This means that in the next three years more data will be generated than in all of human history to date. The data is often automatically recorded via sensors and monitoring systems. Even simple transactions of everyday life, such as paying by credit card or using the telephone, are typically recorded by computers. Usually many parameters are recorded, resulting in data with high dimensionality. The data is collected because people believe that it is a potential source of valuable information, providing a competitive advantage (at some point). Finding the valuable information hidden in the data, however, is a difficult task. With today's data-management systems, it is possible to view only small portions of the data. If the data is presented textually, the amount of data that can be displayed is in the range of about one hundred data items, but this is like a drop in the ocean when you are dealing with datasets containing millions of data items. Without the possibility of adequately exploring the large amounts of data that have been collected because of their potential usefulness, the data becomes useless and the databases become data 'dumps.'

Information visualization focuses on datasets lacking inherent 2D or 3D semantics and therefore also lacking a standard mapping of the abstract data onto the physical screen space. There are a number of well-known techniques for visualizing such datasets, such as x-y plots, line plots, and histograms. These techniques are useful for data exploration but are limited to relatively small and low-dimensional datasets. In the last few years, a large number of novel information visualization techniques have been developed, allowing visualizations of multidimensional datasets without inherent 2D or 3D semantics. Good overviews of the approaches can be found in a number of recent books [8,38,38,28]. The techniques can be classified based on three

*An earlier version of this paper, with a focus on visualization techniques and their classification (see section I), has been published in [21].

Konstanzer Online-Publikations-System (KOPS)


URN: http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-69689
URL: http://kops.ub.uni-konstanz.de/volltexte/2009/6968

Figure 43.1 Classification of information visualization techniques along three axes. Data to be visualized: one-dimensional, two-dimensional, multidimensional, text/web, hierarchies/graphs, algorithm/software. Visualization technique: standard 2D/3D display, geometrically transformed display, iconic display, dense pixel display, stacked display. Interaction technique: standard, projection, filtering, zoom, distortion, link & brush.

criteria [20] (Fig. 43.1): the data to be visualized, the visualization technique, and the interaction technique used.

The data type to be visualized [32] may be 1D data, such as temporal (time-series) data; 2D data, such as geographical maps; multidimensional data, such as relational tables; text and hypertext, such as news articles and web documents; hierarchies and graphs, such as telephone calls and web documents; or algorithms and software.

The visualization technique used may be classified as standard 2D/3D displays, such as bar charts and x-y plots; geometrically transformed displays, such as the hyperbolic plane [36] (Fig. 43.2a) and parallel coordinates [18]; icon-based displays, such as Chernoff faces [9] and stick figures [24,23] (Fig. 43.2c); dense pixel displays, such as the recursive pattern [4] (Fig. 43.2b) and circle segments [5]; and stacked displays, such as treemaps [31,19] (Fig. 43.2d) and dimensional stacking [37].

The third dimension of the classification is the interaction technique used. Interaction techniques allow users to directly navigate and modify the visualizations, as well as select subsets of the data for further operations. Examples include dynamic projection, interactive filtering, interactive zooming, interactive distortion, and interactive linking and brushing. Note that the three dimensions of our classification (data type to be visualized, visualization technique, and interaction technique) can be assumed to be orthogonal. Orthogonality means that any of the visualization techniques may be used in conjunction with any of the interaction techniques for any data type. Note also that a specific system may be designed to support different data types and that it may use a combination of visualization and interaction techniques. More details can be found in Keim and Ward [21].

43.2 Methodology of Visual Data Mining

The data analyst typically first specifies some parameters to restrict the search space; data mining is then performed automatically by an algorithm, and finally the patterns found by the automatic data-mining algorithm are presented to the data analyst on the screen. For data mining to be effective, it is important to include the human in the data exploration process and combine the flexibility, creativity, and general knowledge of the human with the enormous storage capacity and the computational power of today's computers. Since there is a huge

Figure 43.2 Some popular information visualization techniques. (a) Geometrically transformed displays: interactive visualization of high-dimensional data using the hyperbolic plane [36]; genre separation in movie space (red "x" marks science fiction, black "D" marks animation, and green "+" marks movies belonging to both genres). © ACM. (b) Dense pixel displays: Recursive Pattern [4], based on a generic back-and-forth recursive arrangement schema, represents each data value as a colored pixel and each attribute in a separate subwindow (the example visualization shows the stock prices for Dow Jones, Gold, IBM, and US Dollar for almost seven consecutive years; the seven vertical bars correspond to the seven years (level (3) patterns) and the subdivision of the bars to the 12 months within each year (level (2) patterns); the coloring maps high attribute values (stock prices) to light colors and low attribute values to dark colors). (c) Iconic displays: Stick Figures [24,23], a visualization of multidimensional data using properties of angle and/or length of the limbs (US Census data: median household income and age of householder). (d) Stacked displays: TreeMaps [31,19], splitting the screen into rectangles in alternating horizontal and vertical directions in each level (the example visualization shows a hierarchical file system of a large hard disk).
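The alternating horizontal and vertical splitting described for panel (d) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of [31,19]: the nested-dictionary tree format and the names `total_size` and `treemap` are invented for the example, and all leaf sizes are assumed to be positive.

```python
def total_size(node):
    """Size of a leaf, or the summed size of all leaves below an inner node."""
    children = node.get("children")
    if not children:
        return node["size"]
    return sum(total_size(child) for child in children)


def treemap(node, x, y, w, h, depth=0):
    """Return (name, x, y, width, height) rectangles for all leaves.

    The split direction alternates with the tree level: children are laid out
    side by side along x on even levels and stacked along y on odd levels.
    """
    children = node.get("children")
    if not children:
        return [(node["name"], x, y, w, h)]
    rects = []
    total = total_size(node)
    offset = 0.0  # fraction of the parent rectangle already used
    for child in children:
        share = total_size(child) / total
        if depth % 2 == 0:
            rects += treemap(child, x + offset * w, y, w * share, h, depth + 1)
        else:
            rects += treemap(child, x, y + offset * h, w, h * share, depth + 1)
        offset += share
    return rects


# Hypothetical file system: one file "a" and a directory "b" with two files.
disk = {"name": "/", "children": [
    {"name": "a", "size": 2},
    {"name": "b", "children": [{"name": "c", "size": 1},
                               {"name": "d", "size": 1}]},
]}
rectangles = treemap(disk, 0, 0, 100, 100)
```

Each leaf receives an area proportional to its size, so large files stand out at a glance while the nesting of the rectangles mirrors the directory hierarchy.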

number of patterns generated by an automatic data-mining algorithm in textual form, it is almost impossible for the human to interpret and evaluate the patterns in detail and extract interesting knowledge and general characteristics. Visual data mining aims at integrating the human in the data-mining process and applying human perceptual abilities to the analysis of large datasets available in today's computer systems. Presenting data in an interactive, graphical form often fosters new insights, encouraging the formation and validation of new hypotheses to the end of better problem-solving and gaining deeper domain knowledge.

Visual data exploration usually follows a three-step process: overview first, zoom and filter, and then details-on-demand (which has been called the Information Seeking Mantra [32]). First, the data analyst needs to get an overview of the data. In the overview, the data analyst identifies interesting patterns or groups in the data and focuses on one or more of them. For analyzing the patterns, the data analyst needs to drill down and access details of the data. Visualization technology may be used for all three steps of the data exploration process. Visualization techniques are useful for showing an overview of the data, allowing the data analyst to identify interesting subsets. In this step, it is important to keep the overview visualization while focusing on the subset using another visualization technique. An alternative is to distort the overview visualization in order to focus on the interesting subsets. This can be performed by dedicating a larger percentage of the display to the interesting subsets while decreasing screen utilization for uninteresting data. To further explore the interesting subsets, the data analyst needs a drill-down capability in order to observe the details about the data. Note that visualization technology not only provides the base visualization techniques for all three steps but also bridges the gaps between the steps.

Visual data mining can be seen as a hypothesis-generation process; the visualizations of the data allow the data analyst to gain insight into the data and come up with new hypotheses. The verification of the hypotheses can also be done via data visualization, but may also be accomplished by automatic techniques from statistics, pattern recognition, or machine learning. As a result, visual data mining usually allows faster data exploration and often provides better results, especially in cases where automatic data-mining algorithms fail. In addition, visual data exploration techniques provide a much higher degree of user satisfaction and confidence in the findings of the exploration. This fact leads to a high demand for visual exploration techniques and makes them indispensable in conjunction with automatic exploration techniques.

Visual data mining is based on an automatic part, the data-mining algorithm, and an interactive part, the visualization technique. There are three common approaches to integrating the human in the data exploration process, each realizing a different kind of visual data mining (Fig. 43.3):

• Preceding Visualization (PV): Data is visualized in some visual form before running a data-mining algorithm. By interacting with the raw data, the data analyst has full control over the analysis in the search space. Interesting patterns are discovered by exploring the data.

• Subsequent Visualization (SV): An automatic data-mining algorithm performs the data-mining task by extracting patterns from a given dataset. These patterns are visualized to make them interpretable for the data analyst. Subsequent visualizations enable the data analyst to specify feedback. Based on the visualization, the data analyst may want to return to the data-mining algorithm and use different input parameters to obtain better results.

• Tightly Integrated Visualization (TIV): An automatic data-mining algorithm performs an analysis of the data but does not produce the final results. A visualization technique is used to present the intermediate results of the data exploration process. The combination of some automatic data-mining algorithms and visualization techniques enables specified user feedback for the next data-mining run. Then, the data analyst identifies the interesting patterns in the visualization of the intermediate results based on his domain knowledge. A motivation of this approach is to achieve independence of the data-mining algorithms from the application. A given automatic data-mining algorithm can be very useful in one domain but may have drawbacks in some other domain. Since there is no automatic data-mining algorithm (with one parameter setting) suitable for all application domains, tightly integrated

Figure 43.3 Overview of different approaches of human involvement. Three pipelines lead from data to knowledge: Preceding Visualization (PV), Subsequent Visualization (SV), and Tightly Integrated Visualization (TIV), in which visualization and interaction accompany the data-mining algorithm from step 1 to step n.
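The feedback loop of the tightly integrated approach can be sketched as a small driver function. This is only an illustrative skeleton under stated assumptions: the function `tightly_integrated_mining`, its callback parameters, and the toy threshold example are all hypothetical and do not come from any system discussed in this chapter.

```python
def tightly_integrated_mining(data, mining_step, visualize, get_feedback,
                              max_rounds=10):
    """Alternate automatic mining steps with analyst feedback."""
    params = {}
    result = None
    for _ in range(max_rounds):
        result = mining_step(data, params)  # automatic part: intermediate result
        visualize(result)                   # interactive part: show it
        feedback = get_feedback(result)     # analyst adjusts parameters or accepts
        if feedback is None:
            break
        params.update(feedback)
    return result


# Toy stand-ins for the three callbacks (made up for this example).
def split_by_threshold(data, params):
    threshold = params.get("threshold", 0)
    return {"threshold": threshold,
            "low": [v for v in data if v <= threshold],
            "high": [v for v in data if v > threshold]}


def raise_threshold_if_empty(result):
    # The "analyst" asks for a higher threshold while the low group is empty.
    if not result["low"]:
        return {"threshold": result["threshold"] + 5}
    return None


shown = []  # stands in for the screen
final = tightly_integrated_mining([1, 2, 3, 10, 11, 12, 13], split_by_threshold,
                                  shown.append, raise_threshold_if_empty)
```

Preceding and subsequent visualization correspond to the degenerate cases in which the visualization happens only before the first mining step or only after the last one.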

visualization leads to a better understanding of the data and the extracted patterns.

In addition to the direct involvement of the human, the main advantages of visual data exploration over automatic data-mining techniques are the following:

• Visual data exploration can easily deal with highly nonhomogeneous and noisy data.

• Visual data exploration is intuitive and requires no understanding of complex mathematical or statistical algorithms or parameters.

• Visualization can provide a qualitative overview of the data, allowing data phenomena to be isolated for further quantitative analysis.

Visual data-mining techniques have proven to be of high value in exploratory data analysis and have a high potential for exploring large databases. Visual data exploration is especially useful when little is known about the data and the exploration goals are vague. Since the data analyst is directly involved in the exploration process, shifting and adjusting the exploration goals is done automatically if necessary. In the next sections, we show that integrating the human in the data-mining process and applying human perceptual abilities to the analysis of large datasets can help to provide more effective results in important data-mining application domains, such as the mining of association rules, clustering, classification, and text retrieval.

43.3 Association Rules

The goal of association rule generation is to find interesting patterns and trends in transaction databases. Association rules are statistical relations between two or more items in the dataset. In a supermarket basket application, associations express the relations between items that are bought together. It is, for example, interesting if we find out that in 70% of the cases when people buy bread, they also buy milk. Association rules tell us that the presence of some items in a transaction implies the presence of other items in the same transaction with a certain probability, called confidence. A second

important parameter is the support of an association rule, which is defined as the percentage of transactions in which the items co-occur.

Let $I = \{i_1, \ldots, i_n\}$ be a set of items and let $D$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq I$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subseteq I$, $Y \subseteq I$, and $X, Y \neq \emptyset$. The confidence $c$ is defined as the percentage of transactions that contain $Y$, given $X$. The support is the percentage of transactions that contain both $X$ and $Y$. For given support and confidence levels, there are efficient algorithms to determine all association rules [1]. A problem, however, is that the resulting set of association rules is usually very large, especially for low support and confidence levels. Using higher support and confidence levels may not be effective, since useful rules may then be overlooked.

Visualization techniques have been used to overcome this problem and to allow an interactive selection of good support and confidence levels. Fig. 43.4 shows SGI MineSet's Rule Visualizer [17], which maps the left- and right-hand sides of the rules to the x- and y-axes of the plot and shows the confidence as the height of the bars and the support as the height of the discs. The color of the bars shows the interestingness of the rule. Using the visualization, the user is able to see groups of related rules and the impact of different confidence and support levels. The number of rules that can be visualized, however, is limited, and the visualization does not support combinations of items on the left- or right-hand side of the association rules.

Fig. 43.5 shows two alternative visualizations called mosaic and double-decker plots [15]. The basic idea is to partition a rectangle on the y-axis according to one attribute and make the regions proportional to the sum of the corresponding data values. Compared to bar charts, mosaic plots use the height of the bars instead of the width to show the parameter value. Then each resulting area is split in the same way according to a second attribute. The coloring reflects the percentage of data items that fulfill a third attribute. The visualization shows the support and confidence values of all rules of the form $X_1 X_2 \Rightarrow Y$. Mosaic plots are restricted to two attributes on the left side of the association rule. Double-decker plots can be used to show more than two attributes on the left side. The idea is to show a hierarchy of attributes on the bottom (Heineken, Coke, chicken, in the example shown in Fig. 43.5) corresponding to the left-hand side of the association rules; the

Figure 43.4 MineSet's Association Rule Visualizer [17] maps the left- and right-hand sides of the rules to the x- and y-axes of the plot and shows the confidence as the height of the bars and the support as the height of the discs; the color of the bars shows the interestingness of the rule (the example visualization shows market basket data for customer buying patterns). © SGI
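The two quantities that the rule visualizer maps to bars and discs follow directly from the definitions of support and confidence given in the text. The following is a minimal sketch, assuming transactions are represented as Python sets; the function names and the small basket dataset are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = frozenset(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of the transactions containing `lhs` that also contain `rhs`."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        return 0.0  # the rule never applies
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs)


# Hypothetical supermarket baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

rule_support = support(baskets, {"bread", "milk"})         # 3 of 5 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 3 of 4 bread baskets
```

Lowering the support or confidence threshold admits more rules, which is exactly why the resulting rule sets grow large and benefit from interactive visual selection.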

Figure 43.5 Association rule visualization [15] partitions a rectangle on the y-axis according to one attribute and makes the regions proportional to the sum of the corresponding data values. © ACM. (a) Mosaic plot: 2D mosaic plot of attributes Ax1 and Ax2 (with regions labeled P(x12 and x21) and P(x12 and x21 and y2)); highlighting shows up in the mosaic plot as a third dimension. (b) Double-decker plot: the example visualization shows a hierarchy of supermarket basket items: Heineken, Coke, chicken, and sardines.
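The geometry of a double-decker plot can be derived from simple counts: one bar per combination of left-hand-side items, with a width proportional to the size of the matching subset and a highlighted share showing how often the additional item occurs in it. The sketch below is a simplified reading of the description in the text, not the construction of [15]; the function name and the basket data are assumptions made for the example.

```python
from itertools import product


def double_decker_bars(transactions, lhs_items, target):
    """Return (pattern, width, highlighted) for every presence/absence pattern.

    pattern     -- tuple of booleans, one per item in lhs_items
    width       -- share of all transactions matching the pattern
    highlighted -- share of the matching transactions that contain `target`
    """
    n = len(transactions)
    bars = []
    for pattern in product([False, True], repeat=len(lhs_items)):
        subset = [t for t in transactions
                  if all((item in t) == present
                         for item, present in zip(lhs_items, pattern))]
        width = len(subset) / n
        highlighted = (sum(1 for t in subset if target in t) / len(subset)
                       if subset else 0.0)
        bars.append((pattern, width, highlighted))
    return bars


# Hypothetical baskets using the item names from the figure.
baskets = [
    {"heineken", "coke", "sardines"},
    {"heineken", "coke"},
    {"heineken"},
    {"coke", "chicken", "sardines"},
    {"chicken"},
    set(),
]
bars = double_decker_bars(baskets, ["heineken", "coke"], "sardines")
```

Because the patterns partition the transactions, the bar widths always sum to one, and the number of bars doubles with every item added to the left-hand side.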

bars on the top correspond to the number of items in the corresponding subset of the database and therefore visualize the support of the rule. The colored areas in the bars correspond to the percentage of data transactions that contain an additional item (sardines, in Fig. 43.5) and therefore correspond to the confidence. Other approaches to association rule visualization include graphs with nodes corresponding to items and arrows corresponding to implications, as used in DBMiner [16], and association matrix visualizations to cluster related rules [12].

43.4 Classification

Classification is the process of developing a classification model based on a training dataset with known class labels. To construct the classification model, the attributes of the training dataset are analyzed and an accurate description or model of the classes based on the attributes available in the dataset is developed. The class descriptions are then used to classify data for which the class labels are unknown. Classification is sometimes also called supervised learning because the training set is used to teach the system how to classify the data. There are many algorithms for solving classification tasks. The most popular approaches are algorithms that inductively construct decision trees. Examples are ID3 [25], CART [7], ID5 [34,35], C4.5 [26], SLIQ [22], and SPRINT [30]. In addition, there are approaches that use neural networks, genetic algorithms, or Bayesian networks to solve the classification problem. Since most algorithms work as black-box approaches, it is often difficult to understand and optimize the decision model. Problems such as over-fitting or tree pruning are difficult to tackle.

Visualization techniques can help to overcome these problems. The decision tree visualizer in SGI's MineSet system [17] shows an overview of the decision tree together with important parameters such as the attribute value distributions. The system allows an interactive selection of the attributes shown and helps the user understand the decision tree. A more sophisticated approach that also helps in decision tree construction is visual classification, as proposed by Ankerst et al. [3]. The basic idea is to show each attribute value by a colored pixel and arrange them in bars. The pixels of each attribute bar are sorted separately and the attribute with the purest value distribution is selected as the split attribute of the decision tree. The procedure is repeated until all leaves correspond to pure classes. An example of the decision tree resulting from this process is shown in Fig. 43.7. Compared to a standard visualization of a decision tree, additional information is provided that is helpful for explaining and analyzing the decision tree, namely

• Size of the nodes (number of training records corresponding to the node)

Figure 43.6 MineSet's Decision Tree Visualizer [17] displays decision trees as 3D landscapes; each node contains bars whose height, color, and disk correspond to important parameters. © SGI

Figure 43.7 Visual Classification [3] shows each attribute value by a colored pixel and arranges them in bars (the example shows a visualization of a decision tree for the DNA segment training data from the Statlog benchmark, which has 19 attributes). © ACM
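The step of sorting each attribute and picking the one with the purest value distribution can be approximated numerically. In visual classification the analyst judges purity by looking at the sorted pixel bars; the sketch below substitutes the Gini index as an automatic purity measure, which is an assumption made for illustration rather than the method of [3]. All function names and the toy table are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure partition)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Sort records by attribute value and return (threshold, weighted_gini)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # no split at all is the baseline
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, score)
    return best


def purest_attribute(table, labels):
    """Name of the attribute whose best split gives the purest partitions."""
    return min(table, key=lambda name: best_split(table[name], labels)[1])


# Hypothetical training data: "age" separates the classes, "height" does not.
table = {"age": [1, 2, 3, 4], "height": [1, 3, 2, 4]}
labels = ["a", "a", "b", "b"]
split_attribute = purest_attribute(table, labels)
```

Repeating the selection on each resulting partition until every partition is pure yields a decision tree of the kind shown in Fig. 43.7.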

• Quality of the split (purity of the resulting partitions)

• Class distribution (frequency and location of the training instances of all classes).

Some of this information might also be provided by annotating the standard visualization of a decision tree (for example, annotating the nodes with the number of records or the Gini index), but this approach clearly fails for more complex information such as the class distribution. In general, visualizations can help us to better understand the classification models and to easily interact with the classification algorithms in order to optimize the model generation and classification process.

43.5 Clustering

Clustering is the process of finding a partitioning of the dataset into homogeneous subsets called clusters. Unlike classification, clustering is unsupervised learning. This means that the classes are unknown and no training set with class labels is available. A wide range of clustering

algorithms have been proposed in the literature, including density-based methods such as kernel density estimation [29] and linkage-based methods [6]. Most algorithms use assumptions about the properties of the clusters that are either used as defaults or have to be given as input parameters. Depending on the parameter values, the user gets differing clustering results. In 2D or 3D space, the impact of different algorithms and parameter settings can easily be explored using simple visualizations of the resulting clusters (for example, x-y plots), but in higher-dimensional space the impact is much more difficult to understand. Some higher-dimensional techniques try to determine 2D or 3D projections of the data that retain the properties of the high-dimensional clusters as much as possible [39]. Fig. 43.8 shows a 3D projection of a dataset consisting of five clusters.

Figure 43.8 Visualization based on a projection into 3D space [39]: 3D cluster-guided projection, where the 3D subspace is determined by centroids of 4 clusters 0, 1, 3, 5. © ACM

While this approach works well with low- to medium-dimensional datasets, it is difficult to apply to large high-dimensional datasets, especially if the clusters are not clearly separated and the dataset also contains noise (data that does not belong to any cluster). In this case, more sophisticated visualization techniques are needed to guide the clustering process, select the right clustering model, and adjust the parameter values appropriately. An example of a system that uses visualization techniques to help in high-dimensional clustering is OPTICS [2]. The idea of OPTICS (Ordering Points To Identify the Clustering Structure) is to create a 1D ordering of the database representing its density-based clustering structure. Fig. 43.9 shows a 2D example dataset together with its reachability distance plot. Intuitively, points within a cluster are close in the generated 1D ordering and their reachability distance shown in Fig. 43.9 is similar. Jumping to another cluster results in higher reachability distances. The idea works for data of arbitrary dimension. The reachability plot provides a visualization of the inherent clustering structure and is therefore valuable for

Figure 43.9 OPTICS visual clustering [2]. (a) Example dataset. (b) Reachability plot: objects are on the x-axis with their reachability values on the y-axis. © ACM
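The 1D ordering and the reachability values plotted in Fig. 43.9 can be reproduced on small inputs with a simplified version of the idea. The sketch below is not the algorithm of [2] as published: it assumes no distance cutoff, evaluates all pairwise distances, and takes the core distance of a point to be the distance to its `min_pts`-th nearest other point (one possible convention). The function name and the sample points are invented for the example, and the input must contain more than `min_pts` points.

```python
import heapq
import math


def optics_ordering(points, min_pts):
    """Return [(index, reachability)] in the order the points are processed."""
    n = len(points)

    def dist(i, j):
        return math.dist(points[i], points[j])

    # Core distance: distance to the min_pts-th nearest other point.
    core = []
    for i in range(n):
        neighbours = sorted(dist(i, j) for j in range(n) if j != i)
        core.append(neighbours[min_pts - 1])

    reach = [math.inf] * n
    processed = [False] * n
    ordering = []
    for start in range(n):
        if processed[start]:
            continue
        heap = [(math.inf, start)]
        while heap:
            _, i = heapq.heappop(heap)
            if processed[i]:
                continue  # stale entry: the point was reached more cheaply
            processed[i] = True
            ordering.append((i, reach[i]))
            for j in range(n):
                if not processed[j]:
                    candidate = max(core[i], dist(i, j))
                    if candidate < reach[j]:
                        reach[j] = candidate
                        heapq.heappush(heap, (candidate, j))
    return ordering


# Two well-separated groups of four points each (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
plot = optics_ordering(points, min_pts=2)
```

Plotting the reachability values in processing order gives low, flat valleys for the points of one group and a single high bar where the ordering jumps to the other group.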



understanding the clustering and guiding the clustering process.

Another interesting approach is the HD-Eye system [14]. The HD-Eye system considers the clustering problem a partitioning problem and supports a tight integration of advanced clustering algorithms and state-of-the-art visualization techniques, allowing the user to directly interact in the crucial steps of the clustering process. The crucial steps are the selection of dimensions to be considered, the selection of the clustering paradigm, and the partitioning of the dataset. Novel visualization techniques are employed to help the user identify the most interesting projections and subsets as well as the best separators for partitioning the data. Fig. 43.10 shows an example screenshot of the HD-Eye system with its basic visual components for cluster separation. The separator tree represents the clustering model produced so far in the clustering process. The abstract iconic displays (top-right and bottom-middle in Fig. 43.10) visualize the partitioning potential of a large number of projections. The properties are based on histogram information of the point density in the projected space. The number of data points belonging to the maximum corresponds to the color of the icon. The color follows a given color table ranging from dark colors for large maxima to bright colors for small maxima. The measure of how well a maximum is separated from the others corresponds to the shape of the icon, and the degree of separation varies from sharp spikes for well-separated maxima to blunt spikes for badly separated maxima. The color- and curve-based point density displays present the density of the data and allow a better understanding of the data distribution, which is crucial for an effective partitioning of the data. The visualizations are used to decide which dimensions are used for the partitioning. In addition, the partitioning can be specified interactively

Figure 43.10 HD-Eye screenshot [14] showing different visualizations of projections and the separator tree. Clockwise from the top: separator tree, iconic representation of 1D projections, 1D projection histogram, 1D color-based density plots, iconic representation of multidimensional projections, and color-based 2D density plot (the example visualization shows a large molecular biology dataset). © IEEE
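The histogram information behind the 1D projection displays can be illustrated with a small example that bins a projection and looks for the deepest valley between two density maxima as a candidate separator. The valley-depth score used here is an assumption chosen for simplicity; it is not the separation measure of the HD-Eye system [14], and the function names and data are hypothetical.

```python
def histogram(values, bins, lo, hi):
    """Count how many values fall into each of `bins` equal-width intervals."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in values:
        index = min(int((value - lo) / width), bins - 1)  # hi goes to last bin
        counts[index] += 1
    return counts


def best_separator(counts):
    """Return (bin_index, depth) of the deepest valley between two maxima.

    depth is the lower of the highest counts on either side of the bin minus
    the count in the bin itself; (None, 0) means no valley was found.
    """
    best_bin, best_depth = None, 0
    for i in range(1, len(counts) - 1):
        depth = min(max(counts[:i]), max(counts[i + 1:])) - counts[i]
        if depth > best_depth:
            best_bin, best_depth = i, depth
    return best_bin, best_depth


# Made-up 1D projection with two dense regions and a sparse gap between them.
projection = [1, 1, 2, 3, 3, 8, 8, 9, 9, 7]
counts = histogram(projection, bins=5, lo=0, hi=10)
separator = best_separator(counts)
```

A projection with a deep valley has high partitioning potential, whereas a projection whose histogram has a single maximum yields no separator at all.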

directly within the visualizations, allowing the user to define nonlinear partitionings.

43.6 Text

With the growing importance of electronic media for storing and exchanging text documents, there is also a growing interest in tools that can help us find and sort information included in the text documents. Text documents are semistructured data, in that they are neither completely unstructured nor completely structured. For example, a document may contain some structured fields, such as title, authors, publication date, length, and category, as well as largely unstructured text components, such as abstract and content. Text mining is the process of finding patterns in text databases, and may be defined as the process of analyzing text to extract information from it. Text mining recognizes that complete understanding of natural-language text, a long-standing goal of computer science, is not immediately attainable and focuses on extracting a small amount of information from text with high reliability. The goals of the text-mining process are automatic document clustering/categorization, assignment of keywords to text documents, topic identification and tracking in ordered (time) sequences of text documents, searching documents based on the content categories and not only keywords, generation and analysis of user profiles based on the usage of text databases, and other related problems. A wide range of automatic text-mining algorithms have been proposed in the literature over the last few decades [10,11].

An interesting visual data-mining approach is ThemeRiver [13]. The ThemeRiver visualization depicts thematic variations over time within a large collection of documents. The thematic changes are shown in the context of a timeline and corresponding external events. The document collection's timeline, selected thematic content, and thematic strength are indicated by the river's directed flow, composition, and changing width, respectively. The directed flow from left to right is interpreted as movement through time, and the horizontal distance between two points on the river defines a time

Figure 43.11 ThemeRiver [13]: visualization of thematic changes in documents (example visualization shows Castro data from
November 1959 through June 1961). ß IEEE.
824

techniques can be useful in solving this problem.


Visual data exploration has a high potential, and
many applications such as fraud detection and
data mining can use information visualization
technology for improved data analysis.
Avenues for future work include the tight
integration of visualization techniques with trad-
itional techniques from such disciplines as stat-
istics, machine learning, operations research,
and simulation. Integration of visualization
techniques and these more established methods
would combine fast automatic data-mining algo-
rithms with the intuitive power of the human
mind, improving the quality and speed of the
data-mining process. Visual data-mining tech-
niques also need to be tightly integrated with
the systems used to manage the vast amounts of
relational and semistructured information, in-
cluding database management and data ware-
Figure 43.12 Shape based Visual Interface for Text Re
trieval [27]: shape based visualization of query results house systems. The ultimate goal is to bring the
(example visualization shows the result for the key words power of visualization technology to every desk-
lion, sheep, mouse, and wolf ). ß ACM top to allow a better, faster, and more intuitive
exploration of very large data resources. This will
interval. At any point in time, the vertical dis- not only be valuable in an economic sense but
tance, or width, of the river indicates the collect- will also stimulate and delight the user.
ive strength of the selected themes. Colored
‘‘currents’’ flowing within the river represent
individual themes. A current’s vertical width References
narrows or broadens to indicate decreases or 1. R. Agarwal, H. Mannila, R. Srikant, H. Toivo
increases in the strength of the individual theme. nen, and A. Verkamo. Fast discovery of associ
Another interesting approach is the shape- ation rules. Advances in Knowledge Discovery and
Data Mining, pages 307 328, 1996.
based visual interface for text retrieval [27]. 2. M. Ankerst, M. Breunig, H. Kriegel, and J.
This exploratorion system uses procedurally Sander. OPTICS: Ordering points to identify
generated shapes coupled with an underlying the clustering structure. Proc. ACM SIGMOD
text retrieval engine. Traditional text-based ’99, Int. Conf on Management of Data, Phila
queries and summarization are enhanced with delphia, PA, pages 49 60, 1999.
3. M. Ankerst, M. Ester, and H. Kriegel. Towards
a visual interface based on 3D shapes (glyphs). an effective cooperation of the computer and the
The interface allows visualization of multidi- user for classification. SIGKDD Int. Conf. On
mensional relationships among documents and Knowledge Discovery & Data Mining (KDD
perception of more information than with con- 2000), Boston, MA, pages 179 188, 2000.
ventional text-based interfaces. 4. M. Ankerst, D. A. Keim, and H. P. Kriegel.
Recursive pattern: A technique for visualizing
very large amounts of data. In Proc. Visualization
’95, Atlanta, GA, pages 279 286, 1995.
43.7 Conclusion 5. M. Ankerst, D. A. Keim, and H. P. Kriegel. Circle
segments: A technique for visually exploring large
The exploration of large datasets is an important multidimensional data sets. In Visualization ’96,
but difficult problem. Information visualization Hot Topic Session, San Francisco, CA, 1996.
6. H. H. Bock. Automatic Classification. Vandenhoeck and Ruprecht, Göttingen, 1974.
7. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
8. S. Card, J. Mackinlay, and B. Shneiderman. Readings in Information Visualization. Morgan Kaufmann, 1999.
9. H. Chernoff. The use of faces to represent points in k-dimensional space graphically. Journal Amer. Statistical Association, 68:361-368, 1973.
10. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
11. D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
12. M. Hao, M. Hsu, U. Dayal, S. F. Wei, T. Sprenger, and T. Holenstein. Market basket analysis visualization on a spherical surface. Visual Data Exploration and Analysis Conference, San Jose, CA, 2001.
13. S. Havre, B. Hetzler, L. Nowell, and P. Whitney. ThemeRiver: Visualizing thematic changes in large document collections. Transactions on Visualization and Computer Graphics, 2001.
14. A. Hinneburg, D. Keim, and M. Wawryniuk. HD-Eye: Visual Mining of High-dimensional Data. IEEE Computer Graphics and Applications, 19(5), 1999.
15. H. Hofmann, A. Siebes, and A. Wilhelm. Visualizing association rules with interactive mosaic plots. SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, 2000.
16. D. T. Inc. Dbminer. http://www.dbminer.com, 2001.
17. S. G. Inc. Mineset. http://www.sgi.com/software/mineset, 2001.
18. A. Inselberg and B. Dimsdale. Parallel coordinates: A tool for visualizing multi-dimensional geometry. In Proc. Visualization '90, San Francisco, CA, pages 361-370, 1990.
19. B. Johnson and B. Shneiderman. Treemaps: A space-filling approach to the visualization of hierarchical information. In Proc. Visualization '91 Conf., pages 284-291, 1991.
20. D. Keim. Visual exploration of large databases. Communications of the ACM, 44(8):38-44, 2001.
21. D. Keim and M. Ward. Visual Data Mining Techniques, Book Chapter in: Intelligent Data Analysis, an Introduction by D. Hand and M. Berthold. Springer Verlag, 2nd edition, 2002.
22. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. Conf. on Extending Database Technology (EDBT), Avignon, France, 1996.
23. R. M. Pickett. Visual Analyses of Texture in the Detection and Recognition of Objects. Academic Press, New York, 1970.
24. R. M. Pickett and G. G. Grinstein. Iconographic displays for visualizing multidimensional data. In Proc. IEEE Conf. on Systems, Man and Cybernetics, IEEE Press, Piscataway, NJ, pages 514-519, 1988.
25. J. R. Quinlan. Induction of decision trees. Machine Learning, pages 81-106, 1986.
26. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, Los Altos, CA, 1993.
27. R. M. Rohrer, J. L. Sibert, and D. S. Ebert. A shape-based visual interface for text retrieval. IEEE Computer Graphics and Applications, 19(5):40-47, 1999.
28. H. Schumann and W. Müller. Visualisierung: Grundlagen und allgemeine Methoden. Springer, 2000.
29. D. W. Scott. Multivariate Density Estimation. Wiley and Sons, 1992.
30. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. Conf. on Very Large Databases, 1996.
31. B. Shneiderman. Tree visualization with tree-maps: A 2D space-filling approach. ACM Transactions on Graphics, 11(1):92-99, 1992.
32. B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Visual Languages, 1996.
33. B. Spence. Information Visualization. Pearson Education Higher Education publishers, UK, 2000.
34. P. E. Utgoff. Incremental induction of decision trees. Machine Learning, 4:161-186, 1989.
35. P. E. Utgoff, N. C. Berkman, and J. A. Clouse. Decision tree induction based on efficient tree restructuring. Machine Learning, 29:5-44, 1997.
36. J. Walter and H. Ritter. On interactive visualization of high-dimensional data using the hyperbolic plane. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 123-131, 2002.
37. M. O. Ward. Xmdvtool: Integrating multiple methods for visualizing multivariate data. In Proc. Visualization '94, Washington, DC, pages 326-336, 1994.
38. C. Ware. Information Visualization: Perception for Design. Morgan Kaufmann, 2000.
39. L. Yan. Interactive exploration of very large relational data sets through 3D dynamic projections. SIGKDD Int. Conf. on Knowledge Discovery & Data Mining (KDD 2000), Boston, MA, 2000.
