02 Data
— Chapter 2 —
◼ Data Visualization
◼ Summary
Types of Data Sets
◼ Record
◼ Relational records
◼ Data matrix, e.g., numerical matrix, crosstabs
◼ Document data: text documents: term-frequency vector
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
◼ Transaction data
◼ Graph and network
◼ Dimensionality
◼ Curse of dimensionality
◼ Sparsity
◼ Only presence counts
◼ Resolution
◼ Patterns depend on the scale
◼ Distribution
◼ Centrality and dispersion
Data Objects
◼ Binary
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV
positive)
◼ Ordinal
◼ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
◼ Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in °C or °F, calendar dates
◼ No true zero-point
◼ Ratio
◼ Inherent zero-point
◼ We can speak of one value as being a multiple
(or ratio) of another value
(10 K is twice as high as 5 K).
◼ e.g., temperature in Kelvin, length, counts,
monetary quantities
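The interval/ratio distinction can be checked numerically. The following is a minimal sketch, assuming the standard Celsius-to-Kelvin offset of 273.15; the function name `celsius_to_kelvin` is illustrative and not part of the slides. Differences carry over between the two scales, but ratios are only meaningful on the scale with a true zero point.

```python
def celsius_to_kelvin(c):
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return c + 273.15

# Ratio of two Kelvin readings: meaningful, because 0 K is a true zero.
kelvin_ratio = 10.0 / 5.0

# Physical ratio of 10 °C to 5 °C, computed on the Kelvin scale.
# It is close to 1, not 2, so "10 °C is twice 5 °C" is not a valid statement.
true_ratio = celsius_to_kelvin(10.0) / celsius_to_kelvin(5.0)

# Differences are meaningful on both scales and agree with each other.
diff_celsius = 10.0 - 5.0
diff_kelvin = celsius_to_kelvin(10.0) - celsius_to_kelvin(5.0)

print(f"10 K / 5 K = {kelvin_ratio}")
print(f"10 °C vs 5 °C as a physical ratio = {true_ratio:.4f}")
print(f"difference: {diff_celsius} °C = {diff_kelvin:.2f} K")
```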
Discrete vs. Continuous Attributes
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., the set of words in a collection of documents
◼ Sometimes, represented as integer variables
Basic Statistical Descriptions of Data
◼ Motivation
◼ To better understand the data: central tendency,
variation and spread
◼ Data dispersion characteristics
◼ median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
◼ Data dispersion: analyzed with multiple granularities
of precision
◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
◼ Folding measures into numerical dimensions
◼ Boxplot or quantile analysis on the transformed cube
Measuring the Central Tendency
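◼ Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$
Note: n is sample size and N is population size.
◼ Weighted arithmetic mean:
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
◼ Trimmed mean: chopping extreme values
◼ Median: the middle value of the sorted data (the average of the two
middle values when n is even)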
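The measures of central tendency listed above can be computed in a few lines. This is a minimal sketch using only the Python standard library; the helper names `weighted_mean` and `trimmed_mean`, the parameter `k` (how many values to chop from each end), and the sample values are illustrative assumptions, not taken from the slides.

```python
from statistics import mean, median

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    if 2 * k >= len(values):
        raise ValueError("k removes every value")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up sample with one extreme value (110).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:        ", mean(data))
print("median:      ", median(data))
print("trimmed mean:", trimmed_mean(data, k=1))
print("weighted mean of [1, 2, 3] with weights [1, 1, 2]:",
      weighted_mean([1, 2, 3], [1, 1, 2]))
```

The extreme value pulls the mean above the median; trimming one value from each end moves the result back toward the bulk of the data.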
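◼ Variance (sample vs. population):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$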
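The two algebraically equivalent forms of the sample variance, and the population variance, can be compared numerically. This is a sketch under the usual definitions (divisor n − 1 for a sample, N for a population); the function names and the data are illustrative, and the standard library's `statistics.variance` and `statistics.pvariance` are used only as a cross-check.

```python
from statistics import variance, pvariance

def sample_variance(xs):
    """Definition form: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computational form: [sum of x^2 - (sum of x)^2 / n] / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

def population_variance(xs):
    """Population form: mean of x^2 minus the squared mean."""
    N = len(xs)
    mu = sum(xs) / N
    return sum(x * x for x in xs) / N - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

print("s^2 (definition):", sample_variance(data))
print("s^2 (shortcut):  ", sample_variance_shortcut(data))
print("sigma^2:         ", population_variance(data))
```

The shortcut form needs only one pass over the data, but it can lose precision when the values are large relative to their spread.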
Boxplot Analysis
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Histogram Analysis
◼ Histogram: Graph display of tabulated frequencies, shown as bars
◼ It shows what proportion of cases fall into each of several categories
◼ Differs from a bar chart in that it is the area of the bar that denotes
the value, not the height as in bar charts, a crucial distinction when the
categories are not of uniform width
◼ The categories are usually specified as non-overlapping intervals of
some variable. The categories (bars) must be adjacent
[Figure: histogram; x-axis from 10000 to 90000, y-axis from 0 to 40]
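The point that area, not height, carries the value matters as soon as the bins have different widths. The sketch below is one way to compute bar heights so that each bar's area equals the proportion of cases in its bin; the function name `histogram_density`, the bin edges, and the data are assumptions made for illustration.

```python
def histogram_density(values, edges):
    """Return (counts, heights) for adjacent, non-overlapping bins.

    heights[b] = proportion in bin b / width of bin b, so that
    height * width (the bar's area) equals the proportion of cases.
    Bins are half-open [lo, hi), except the last one, which is closed.
    Values outside the outer edges are ignored.
    """
    n = len(values)
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[b + 1]):
                counts[b] += 1
                break
    heights = [c / n / (edges[b + 1] - edges[b]) for b, c in enumerate(counts)]
    return counts, heights

values = [1, 2, 2, 3, 5, 6, 8, 12, 15, 19]   # made-up data
edges = [0, 4, 10, 20]                       # bins of width 4, 6 and 10

counts, heights = histogram_density(values, edges)
areas = [h * (edges[b + 1] - edges[b]) for b, h in enumerate(heights)]

for b in range(len(counts)):
    print(f"[{edges[b]}, {edges[b + 1]}): count={counts[b]}, "
          f"height={heights[b]:.3f}, area={areas[b]:.2f}")
```

The last two bins hold the same number of cases, yet the wider one is drawn lower, which is exactly the distinction from a bar chart.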
Histograms Often Tell More than Boxplots
Quantile Plot
◼ Displays all of the data (allowing the user to assess both
the overall behavior and unusual occurrences)
◼ Plots quantile information
◼ For data xi sorted in increasing order, fi
indicates that approximately 100 fi% of the data are
below or equal to the value xi
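The (fi, xi) pairs of a quantile plot can be generated as follows. The slide does not give a formula for fi, so this sketch assumes the common convention fi = (i − 0.5) / n for the i-th smallest value; the function name `quantile_points` and the data are illustrative.

```python
def quantile_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_points([7, 1, 5, 3])  # made-up data

for f, x in points:
    print(f"f = {f:.3f}  ->  x = {x}")
```

Plotting x against f for every data point gives the quantile plot; nothing is binned or summarized away.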
Scatter plot
◼ Provides a first look at bivariate data to see clusters of
points, outliers, etc.
◼ Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Chapter 2: Getting to Know Your Data
◼ Data Visualization
◼ Summary
Data Visualization
◼ Why data visualization?
◼ Gain insight into an information space by mapping data onto graphical
primitives
◼ Provide qualitative overview of large data sets
◼ Search for patterns, trends, structure, irregularities, relationships among
data
◼ Help find interesting regions and suitable parameters for further
quantitative analysis
◼ Provide a visual proof of computer representations derived
◼ Categorization of visualization methods:
◼ Pixel-oriented visualization techniques
◼ Geometric projection visualization techniques
◼ Icon-based visualization techniques
◼ Hierarchical visualization techniques
◼ Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
◼ For a data set of m dimensions, create m windows on the screen, one
for each dimension
◼ The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
◼ The colors of the pixels reflect the corresponding values
(a) Income (b) Credit Limit (c) transaction volume (d) age
Laying Out Pixels in Circle Segments
◼ To save space and show the connections among multiple dimensions,
space filling is often done in a circle segment
Landscapes
[Figure: news articles visualized as a landscape. Used by permission of B. Wright, Visible Decisions Inc.]
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figure
[Figure: census data shown as stick figures encoding age, income, gender, education, etc.]
◼ A 5-piece stick figure (1 body and 4 limbs with different angle/length)
◼ Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs
◼ Look at texture pattern
Hierarchical Visualization Techniques
Dimensional Stacking
[Figure: dimensional stacking; axes labeled attribute 1, attribute 2, attribute 3, attribute 4]
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Worlds-within-Worlds
◼ Assign the function and two most important parameters to innermost
world
◼ Fix all other parameters at constant values - draw other (1 or 2 or 3
dimensional worlds choosing these as the axes)
◼ Software that uses this paradigm
◼ N–vision: Dynamic
interaction through data
glove and stereo
displays, including
rotation, scaling (inner)
and translation
(inner/outer)
◼ Auto Visual: Static
interaction by means of
queries
Tree-Map
◼ Screen-filling method which uses a hierarchical partitioning
of the screen into regions depending on the attribute values
◼ The x- and y-dimension of the screen are partitioned
alternately according to the attribute values (classes)
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg
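The alternating x/y partitioning can be sketched in a few lines. This is a simplified "slice-and-dice" layout under stated assumptions: each node is a hypothetical `(name, size, children)` tuple, the screen is split in proportion to `size` rather than by attribute classes, and the function name `treemap` is illustrative. It is not the layout code behind the figures referenced here.

```python
def treemap(node, x, y, w, h, depth=0):
    """Return (name, x, y, width, height) rectangles for every leaf.

    Even depths split the rectangle along x, odd depths along y,
    each child receiving a share proportional to its size.
    """
    name, size, children = node
    if not children:
        return [(name, x, y, w, h)]
    rects = []
    total = sum(child[1] for child in children)
    offset = 0.0  # fraction of the parent already handed out
    for child in children:
        share = child[1] / total
        if depth % 2 == 0:   # partition the x-dimension
            rects += treemap(child, x + offset * w, y, share * w, h, depth + 1)
        else:                # partition the y-dimension
            rects += treemap(child, x, y + offset * h, w, share * h, depth + 1)
        offset += share
    return rects

# Hypothetical hierarchy: a root with leaf "a" and subtree "b".
tree = ("root", 8, [
    ("a", 4, []),
    ("b", 4, [("b1", 1, []), ("b2", 3, [])]),
])

rects = treemap(tree, 0.0, 0.0, 8.0, 4.0)
for rect in rects:
    print(rect)
```

Every leaf's area is proportional to its size, and the leaves together fill the whole 8 × 4 screen.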
Tree-Map of a File System (Shneiderman)
InfoCube
Three-D Cone Trees
◼ 3D cone tree visualization technique works
well for up to a thousand nodes or so
◼ First build a 2D circle tree that arranges its
nodes in concentric circles centered on the
root node
◼ Cannot avoid overlaps when projected to
2D
◼ G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
◼ Graph from Nadeau Software Consulting
website: Visualize a social network data set
that models the way an infection spreads
from one person to the next
Ack.: http://nadeausoftware.com/articles/visualization
Visualizing Complex Data and Relations
◼ Visualizing non-numerical data: text and social networks
◼ Tag cloud: visualizing user-generated tags
◼ The importance of
tag is represented
by font size/color
◼ Besides text data,
there are also
methods to visualize
relationships, such as
visualizing social
networks
Similarity and Dissimilarity
◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike
Data Matrix and Dissimilarity Matrix
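◼ Data matrix
◼ n data points with p dimensions
◼ Two modes
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
◼ Dissimilarity matrix
◼ n data points, but registers only the distance
◼ A triangular matrix
◼ Single mode
$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$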
Proximity Measure for Nominal Attributes
Proximity Measure for Binary Attributes
◼ A contingency table for binary data
[Table: contingency table of Object i (rows) against Object j (columns)]
Dissimilarity between Binary Variables
◼ Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N
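The example table can be turned into pairwise dissimilarities. The sketch below rests on assumptions that are not stated in this excerpt: the values Y and P are coded as 1 and N as 0, gender is treated as a symmetric attribute and left out, and the remaining attributes are compared with the standard asymmetric binary dissimilarity d = (r + s) / (q + r + s), where q counts 1/1 matches, r counts 1/0 mismatches and s counts 0/1 mismatches. The function names are illustrative.

```python
def encode(row):
    """Code Y and P as 1 and N as 0 (assumed coding)."""
    return [1 if v in ("Y", "P") else 0 for v in row]

def asymmetric_binary_dissimilarity(a, b):
    """d = (r + s) / (q + r + s); 0/0 matches are ignored as unimportant."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    if q + r + s == 0:
        return 0.0  # illustrative choice when both objects are all zeros
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1 .. Test-4 from the table (Gender omitted).
patients = {
    "Jack": encode(["Y", "N", "P", "N", "N", "N"]),
    "Mary": encode(["Y", "N", "P", "N", "P", "N"]),
    "Jim":  encode(["Y", "P", "N", "N", "N", "N"]),
}

d_jack_mary = asymmetric_binary_dissimilarity(patients["Jack"], patients["Mary"])
d_jack_jim = asymmetric_binary_dissimilarity(patients["Jack"], patients["Jim"])
d_jim_mary = asymmetric_binary_dissimilarity(patients["Jim"], patients["Mary"])

print(f"d(Jack, Mary) = {d_jack_mary:.2f}")
print(f"d(Jack, Jim)  = {d_jack_jim:.2f}")
print(f"d(Jim, Mary)  = {d_jim_mary:.2f}")
```

Under these assumptions Jack and Mary are the most similar pair, and Jim and Mary the least similar.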
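◼ standardized measure (z-score): $z_{if} = \frac{x_{if} - m_f}{s_f}$
where $m_f$ is the mean of attribute f and $s_f$ is its mean absolute deviation
◼ Using mean absolute deviation is more robust than using standard
deviation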
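The robustness claim can be illustrated on a small sample with one outlier. This is a sketch, assuming m_f is the attribute mean and s_f the mean absolute deviation around it; the function names and data are illustrative, and `statistics.pstdev` supplies the standard deviation for comparison.

```python
from statistics import pstdev

def zscores_mean_abs_dev(values):
    """z = (x - m) / s, with s the mean absolute deviation from the mean m."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    return [(x - m) / s for x in values]

def zscores_std_dev(values):
    """z = (x - m) / sigma, with sigma the population standard deviation."""
    m = sum(values) / len(values)
    sigma = pstdev(values)
    return [(x - m) / sigma for x in values]

data = [2, 4, 4, 6, 14]  # made-up values; 14 is the outlier

z_mad = zscores_mean_abs_dev(data)
z_std = zscores_std_dev(data)

print("mean absolute deviation:", [round(z, 3) for z in z_mad])
print("standard deviation:     ", [round(z, 3) for z in z_std])
```

Because deviations are not squared, the outlier inflates the mean absolute deviation less than the standard deviation, so its z-score stays larger and it remains easier to spot.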
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix
(with Euclidean Distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
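The lower-triangular dissimilarity matrix for the four points can be recomputed from the data matrix. This is a minimal sketch using `math.dist` (Python 3.8+) for the Euclidean distance; the function name `dissimilarity_matrix` is illustrative. Row i holds d(i, 1) through d(i, i), so the diagonal is 0 and the symmetric upper half is omitted. For instance, d(x3, x1) evaluates to √5 ≈ 2.24.

```python
from math import dist  # Euclidean distance, Python 3.8+

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def dissimilarity_matrix(objects, distance):
    """Lower-triangular matrix: row i lists d(i, j) for j = 1 .. i."""
    names = list(objects)
    return [
        [distance(objects[names[i]], objects[names[j]]) for j in range(i + 1)]
        for i in range(len(names))
    ]

matrix = dissimilarity_matrix(points, dist)

for name, row in zip(points, matrix):
    print(name, " ".join(f"{d:5.2f}" for d in row))
```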
Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: A popular distance measure
$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
◼ Properties
◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
◼ d(i, j) = d(j, i) (Symmetry)
◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
◼ A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
◼ E.g., the Hamming distance: the number of bits that are
different between two binary vectors
Example: Minkowski Distance
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Dissimilarity Matrices
Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0
Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
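The three matrices differ only in the order h of the Minkowski distance. The sketch below parameterizes the distance by h and treats h = ∞ as the supremum (largest coordinate difference); the function name `minkowski` and the use of `float("inf")` to select the supremum case are illustrative choices.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h; h = float('inf') gives the supremum."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if h == float("inf"):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}
names = list(points)

for label, h in (("Manhattan (L1)", 1), ("Euclidean (L2)", 2),
                 ("Supremum", float("inf"))):
    print(label)
    for i, row_name in enumerate(names):
        # Lower triangle only: d(i, j) for j up to and including i.
        row = [minkowski(points[row_name], points[names[j]], h)
               for j in range(i + 1)]
        print(" ", row_name, " ".join(f"{d:5.2f}" for d in row))
```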
Ordinal Variables
Attributes of Mixed Type
Cosine Similarity
◼ A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Example: Cosine Similarity
◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product and ||d|| is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 0.94
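The worked example can be reproduced directly from the two term-frequency vectors. This is a minimal sketch using only the standard library; the function name `cosine_similarity` is illustrative.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = sqrt(sum(x * x for x in a))
    length_b = sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(f"cos(d1, d2) = {similarity:.2f}")
```

Only terms that occur in both documents contribute to the dot product, which is why the measure suits sparse term-frequency vectors.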
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image.
◼ Gain insight into the data by:
◼ Basic statistical data description: central tendency, dispersion,
graphical displays
◼ Data visualization: map data onto graphical primitives
◼ Measure data similarity
◼ Above steps are the beginning of data preprocessing.
◼ Many methods have been developed but still an active area of research.