Getting to Know Your Data
Types of Data Sets

Record:
- Relational records
- Data matrix, e.g., numerical matrix, crosstabs
- Document data: text documents represented as term-frequency vectors, e.g.:

             team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1   3     0      5     0     2      6    0     2      0       2
  Document 2   0     7      0     2     1      0    0     3      0       0
  Document 3   0     1      0     0     1      2    2     0      3       0

- Transaction data, e.g.:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

Graph and network:
- World Wide Web
- Social or information networks
- Molecular structures

Ordered:
- Video data: sequence of images
- Temporal data: time-series
- Sequential data: transaction sequences
- Genetic sequence data

Spatial, image and multimedia:
- Spatial data: maps
- Image data
- Video data
Data Objects
Types of attributes:
- Nominal
- Binary
- Ordinal
- Numeric (quantitative): interval-scaled or ratio-scaled
Attribute Types
Nominal: categories, states, or "names of things"
- E.g., hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes
- Values have no meaningful order and are not quantitative
- Mathematical operations on the values of nominal attributes are not meaningful
Binary
- A nominal attribute with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important, e.g., gender
- Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
  - Convention: assign 1 to the most important outcome (e.g., HIV positive)
Attribute Types
Ordinal
- Values have a meaningful order (ranking), but the magnitude between successive values is not known
- E.g., size = {small, medium, large}, grades, army rankings
- Central tendency can be represented by the mode and the median, but the mean cannot be defined
Numeric Attribute Types
Numeric
- Quantitative: a measurable quantity, represented in integer or real values
- Can be interval-scaled or ratio-scaled
Interval-scaled
- Measured on a scale of equal-sized units
- Values have order and allow us to compare and quantify the difference between values
- E.g., temperature in °C or °F, calendar dates
- No true zero-point: 0 °C or 0 °F does not indicate "no temperature"
- Ratios are therefore not valid, but since the values are numeric, the mean can be computed; the median and mode are also valid
Numeric Attribute Types
Ratio-scaled
- Inherent zero-point
- Values are ordered and can be treated as multiples (ratios) of one another; differences, mean, mode, and median can all be computed
- We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K, since 0 K = −273.15 °C)
- E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
Another way of organizing attributes is as discrete or continuous.
Discrete attribute
- Has only a finite or countably infinite set of values
- E.g., zip codes, or the set of words in a collection of documents
- Sometimes represented as integer variables
Continuous attribute
- Has real numbers as attribute values
- Typically represented as floating-point variables
Getting to Know Your Data
- Data Visualization
- Summary
Basic Statistical Descriptions of Data
Motivation
- To better understand the data: central tendency, variation, and spread
Central tendency
- The location of the middle or center of the data distribution
Dispersion
- How the data are spread out
- Range, quartiles, interquartile range, five-number summary
Measuring the Central Tendency
Mean (algebraic measure
Note: n is sample size. 1 n
Weighted arithmetic mean: x xi
n i 1
Problem
A small number of extreme values can corrupt the n
mean w x i i
Trimmed mean: chopping extreme values x i 1
n
w
i 1
i
13
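A minimal Python sketch (not from the slides) of the three variants above: arithmetic mean, weighted arithmetic mean, and a trimmed mean that chops a fraction of extreme values from each end. The data values are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, frac=0.1):
    xs = sorted(xs)
    k = int(len(xs) * frac)                 # number of values dropped from each end
    return mean(xs[k:len(xs) - k] if k else xs)

data = [1, 2, 2, 3, 4, 5, 100]              # a single extreme value corrupts the mean
print(round(mean(data), 2))                 # 16.71
print(round(trimmed_mean(data, 0.2), 2))    # 3.2 after chopping 1 and 100
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```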
Median
- Middle value if there is an odd number of values, or the average of the middle two values otherwise
- Estimated by interpolation for grouped data, where the median interval is the interval whose cumulative frequency is >= n/2:
$\text{median} = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$
where
- $L_1$: lower boundary of the median interval (21)
- n: number of values in the entire data set (3194)
- $(\sum freq)_l$: sum of the frequencies of all intervals lower than the median interval (950)
- $freq_{median}$: frequency of the median interval (1500)
- width: width of the median interval (30)
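Plugging the example values above into the interpolation formula gives approximately
$\text{median} \approx 21 + \frac{3194/2 - 950}{1500} \times 30 = 21 + \frac{647}{1500} \times 30 \approx 33.9$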
Measuring the Central Tendency
Mode
- Value that occurs most frequently in the data
- Based on the number of modes present in the data set: unimodal, bimodal, trimodal
- For unimodal numeric data that are moderately skewed, we have the empirical formula: mean − mode ≈ 3 × (mean − median)
Symmetric vs. Skewed Data
- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
Measuring the Dispersion of Data
- Variance and standard deviation: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$, where σ (the standard deviation) is the square root of the variance
- Quartiles, outliers, and boxplots
  - Quartiles: points taken at regular intervals of a sorted data distribution, splitting it into equal-size, consecutive sets
  - Q1 (25th percentile), Q3 (75th percentile)
  - Interquartile range: IQR = Q3 − Q1
  - Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
Boxplot Analysis
Example
Suppose we have a set of values for salary (in dollars):
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Draw a boxplot and find:
- Median?
- Q1?
- Q3?
- IQR?
- Outliers?
(A sketch that computes these follows.)
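A minimal Python sketch (not from the slides) computing the five-number summary and outliers for the salary example above, using the "median of each half" convention for Q1 and Q3; other quartile conventions (e.g., interpolation) give slightly different values.

```python
def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
s = sorted(salaries)
n = len(s)

med = median(s)
q1 = median(s[: n // 2])             # median of the lower half
q3 = median(s[(n + 1) // 2:])        # median of the upper half
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in s if x < low_fence or x > high_fence]

print(med, q1, q3, iqr, outliers)    # 54.0 48.5 66.5 18.0 [110]
```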
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i$% of the data are below or equal to the value $x_i$
Scatter plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Similarity and Dissimilarity
Proximity refers to a similarity or dissimilarity measure.
Similarity
- Numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0, 1]
Dissimilarity (e.g., distance)
- Numerical measure of how different two data objects are
- Lower when objects are more alike
- Minimum dissimilarity is often 0
Data Matrix and Dissimilarity Matrix
Data structures
1. Data matrix
- Stores the n data objects with p attributes in the form of a relational table; it is an n-by-p matrix:
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
Data Matrix and Dissimilarity Matrix
Data structures (continued)
2. Dissimilarity matrix
- Stores the collection of proximities that are available for all pairs of the n objects, as an n-by-n triangular matrix:
$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
- d(i, j): the difference or dissimilarity between objects i and j
- d(i, j) = 0 when the objects are the same
- d(i, j) = d(j, i)
- Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix
- Measures of similarity can be expressed in terms of dissimilarity, e.g., sim(i, j) = 1 − d(i, j)
Proximity Measure for Nominal Attributes
Simple matching:
$d(i, j) = \frac{p - m}{p}$
where m is the number of attributes on which objects i and j match, and p is the total number of attributes describing the objects.
Dissimilarity: Nominal data

Obj. ID  attr-1  attr-2
101      A       X
102      B       X
103      A       Y
104      C       Z

Calculate the dissimilarity matrix using both attributes (p = 2):

0
1/2   0
1/2   1     0
1     1     1     0

(A sketch that computes this matrix follows.)
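A minimal Python sketch (not from the slides) computing the nominal dissimilarity matrix d(i, j) = (p − m) / p for the four objects above.

```python
objects = {
    101: ("A", "X"),
    102: ("B", "X"),
    103: ("A", "Y"),
    104: ("C", "Z"),
}

def nominal_dissimilarity(a, b):
    p = len(a)                                # total number of attributes
    m = sum(x == y for x, y in zip(a, b))     # number of matching attributes
    return (p - m) / p

ids = sorted(objects)
for i in ids:
    print(i, [nominal_dissimilarity(objects[i], objects[j]) for j in ids if j <= i])
# 101 [0.0]
# 102 [0.5, 0.0]
# 103 [0.5, 1.0, 0.0]
# 104 [1.0, 1.0, 1.0, 0.0]
```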
Proximity Measure for Binary Attributes
Dissimilarity on Numeric Data: Minkowski Distance
Minkowski distance:
$d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
- d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
- d(i, j) = d(j, i) (symmetry)
- d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
  E.g., the Hamming distance: the number of bits that are different between two binary vectors
- h = 2: Euclidean (L2 norm) distance
- h → ∞: supremum (L∞ norm) distance, the maximum difference between any attributes of the objects
(All three appear in the example that follows.)
Example: Minkowski Distance

Data (the scatter plot of the four points is omitted):

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity matrices:

Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
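A minimal Python sketch (not from the slides) reproducing entries of the dissimilarity matrices above for h = 1 (Manhattan), h = 2 (Euclidean), and the supremum limit.

```python
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def minkowski(a, b, h):
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

print(minkowski(points["x1"], points["x2"], 1))             # 5.0  (Manhattan)
print(round(minkowski(points["x1"], points["x2"], 2), 2))   # 3.61 (Euclidean)
print(supremum(points["x1"], points["x2"]))                 # 3    (supremum)
```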
Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Ordinal attributes can be treated like interval-scaled attributes:
  - Suppose f is an attribute from a set of ordinal attributes, with $M_f$ ordered states; the value of f for the i-th object is $x_{if}$
  - Replace each $x_{if}$ by its rank $r_{if} \in \{1, ..., M_f\}$
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - Compute the dissimilarity using methods for interval-scaled variables
Ordinal variables example

Obj. ID  attr-1
101      Excellent
102      Fair
103      Good
104      Excellent

Steps to find the dissimilarity measure:
1. Replace states with ranks. Here $M_f$ = 3, and the ranks 3, 2, 1 are assigned to excellent, good, and fair, respectively.
2. Normalize the ranking by mapping the ranks onto [0, 1] using $z_{if} = \frac{r_{if} - 1}{M_f - 1}$:
   Fair → (1 − 1)/(3 − 1) = 0, Good → 0.5, Excellent → 1
3. Find the dissimilarity measure (Euclidean, Manhattan, etc.), as in the sketch below.
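A minimal Python sketch (not from the slides) that normalizes the ordinal ranks from the example above onto [0, 1] and computes the pairwise dissimilarities |z_i − z_j|.

```python
ranks = {"Fair": 1, "Good": 2, "Excellent": 3}
M_f = len(ranks)

objects = {101: "Excellent", 102: "Fair", 103: "Good", 104: "Excellent"}

# z_if = (r_if - 1) / (M_f - 1)
z = {oid: (ranks[state] - 1) / (M_f - 1) for oid, state in objects.items()}

ids = sorted(z)
for i in ids:
    print(i, [abs(z[i] - z[j]) for j in ids if j <= i])
# 101 [0.0]
# 102 [1.0, 0.0]
# 103 [0.5, 0.5, 0.0]
# 104 [0.0, 1.0, 0.5, 0.0]
```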
Attributes of Mixed Type
A database may contain attributes of all types: nominal, symmetric binary, asymmetric binary, numeric, ordinal.
One may use a weighted formula to combine their effects:
$d(i, j) = \frac{\sum_{f=1}^{p}\delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p}\delta_{ij}^{(f)}}$
where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$.
The per-attribute contribution $d_{ij}^{(f)}$ depends on the type of f:
- f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
- f is numeric: use the normalized distance
- f is ordinal: treat the values as ranks and use $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{M_f - 1}$
Mixed type example

Obj. ID  Attr-1 (nominal)  Attr-2 (ordinal)  Attr-3 (numeric)
101      Code A            Excellent         45
102      Code B            Fair              22
103      Code C            Good              64
104      Code A            Excellent         28

Resulting dissimilarity matrix (a sketch reproducing it follows):

0
0.85  0
0.65  0.83  0
0.13  0.71  0.79  0
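A minimal Python sketch (not from the slides) that reproduces the mixed-type dissimilarity matrix above, combining one nominal, one ordinal, and one numeric attribute with equal weight (every δ equals 1 here, since nothing is missing).

```python
z_rank = {"Fair": 0.0, "Good": 0.5, "Excellent": 1.0}   # normalized ordinal ranks

data = {
    101: ("Code A", "Excellent", 45),
    102: ("Code B", "Fair", 22),
    103: ("Code C", "Good", 64),
    104: ("Code A", "Excellent", 28),
}

num_values = [v[2] for v in data.values()]
num_range = max(num_values) - min(num_values)            # 64 - 22 = 42

def mixed_dissim(a, b):
    d_nom = 0.0 if a[0] == b[0] else 1.0                 # nominal attribute
    d_ord = abs(z_rank[a[1]] - z_rank[b[1]])             # ordinal attribute
    d_num = abs(a[2] - b[2]) / num_range                 # normalized numeric attribute
    return (d_nom + d_ord + d_num) / 3

ids = sorted(data)
for i in ids:
    print(i, [round(mixed_dissim(data[i], data[j]), 2) for j in ids if j <= i])
# 101 [0.0]
# 102 [0.85, 0.0]
# 103 [0.65, 0.83, 0.0]
# 104 [0.13, 0.71, 0.79, 0.0]
```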
Cosine Similarity
- A document can be represented by a term-frequency vector (as in the document data example earlier)
- cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||), where • indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = (17)^0.5 ≈ 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
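A minimal Python sketch (not from the slides) reproducing the cosine similarity computation for the two term-frequency vectors above.

```python
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(d1, d2))      # 25
norm1 = math.sqrt(sum(a * a for a in d1))     # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))     # sqrt(17) ≈ 4.123

print(round(dot / (norm1 * norm2), 2))        # 0.94
```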
Data Reduction: Dimensionality Reduction
Curse of dimensionality
- When dimensionality increases, data become increasingly sparse
- Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
- The number of possible combinations of subspaces grows exponentially
Dimensionality reduction
- Avoids the curse of dimensionality
- Helps eliminate irrelevant features and reduce noise
- Reduces the time and space required in data mining / learning
- Allows easier visualization
Dimensionality reduction techniques
- Wavelet transforms
- Principal Component Analysis
- Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space
Background for PCA
Suppose the attributes are A1 and A2, and we have n training examples. The x's denote values of A1 and the y's denote values of A2 over the training examples.
Variance of an attribute:
$\mathrm{var}(A_1) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
Covariance of two attributes:
$\mathrm{cov}(A_1, A_2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Covariance matrix: suppose we have n attributes A1, ..., An. Then
$C^{n \times n} = (c_{i,j}), \text{ where } c_{i,j} = \mathrm{cov}(A_i, A_j)$
Covariance Calculation
Example covariance matrix for two attributes H and M:
$\begin{pmatrix} \mathrm{cov}(H,H) & \mathrm{cov}(H,M) \\ \mathrm{cov}(M,H) & \mathrm{cov}(M,M) \end{pmatrix} = \begin{pmatrix} \mathrm{var}(H) & 104.5 \\ 104.5 & \mathrm{var}(M) \end{pmatrix} = \begin{pmatrix} 47.7 & 104.5 \\ 104.5 & 370 \end{pmatrix}$
Eigenvectors
Let M be an n×n matrix.
- v is an eigenvector of M if Mv = λv for some scalar λ (the corresponding eigenvalue)
- Then M(av) = λ(av) for any scalar a, so any nonzero multiple of an eigenvector is also an eigenvector; thus you can always choose eigenvectors of length 1: $v_1^2 + \cdots + v_n^2 = 1$
- Eigenvectors can only be found for a square matrix, and not every square matrix has (real) eigenvectors. For a symmetric n×n matrix such as a covariance matrix, there are n eigenvectors and they are orthogonal to one another.
Since Mv = λv, where v is an eigenvector of M and λ is the eigenvalue of M corresponding to v, we have Mv − λv = 0, i.e., (M − λI)v = 0 for a nonzero v, which implies that (M − λI) cannot have an inverse.
Therefore det(M − λI) = 0.
Example: let
$M = \begin{pmatrix} 1 & 3 \\ 3 & 5 \end{pmatrix}$
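A minimal sketch (not from the slides) that checks this numerically with NumPy, assuming the 2×2 example matrix reconstructed above (M = [[1, 3], [3, 5]]).

```python
import numpy as np

M = np.array([[1.0, 3.0],
              [3.0, 5.0]])

# eigenvalues are the roots of det(M - lambda*I) = lambda^2 - 6*lambda - 4 = 0,
# i.e., 3 +/- sqrt(13) (approximately 6.61 and -0.61; order may vary)
eigvals, eigvecs = np.linalg.eig(M)
print(eigvals)
print(eigvecs[:, 0])   # unit-length eigenvector for the first eigenvalue
```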
Principal Component Analysis (Steps)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data:
- Normalize the input data so that each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., the principal components
- Each input data vector is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing "significance" or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
PCA steps
1. Given the original data set S = {x1, ..., xk}, produce a new data set by subtracting the mean of each attribute Ai from every xi (i.e., center the data). A sketch of the full procedure follows.
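A minimal Python sketch (not from the slides) of the PCA steps above: center the data, form the covariance matrix, take its eigenvectors, and project onto the top-k components. The toy data set is hypothetical.

```python
import numpy as np

# hypothetical 2-attribute data set
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)            # step 1: subtract each attribute's mean
cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]          # sort components by decreasing variance
components = eigvecs[:, order]

k = 1                                      # keep only the strongest component
reduced = X_centered @ components[:, :k]   # project the data onto the new space
print(reduced.shape)                       # (6, 1)
```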
Attribute Subset Selection
Another way to reduce the dimensionality of data.
Redundant attributes
- Duplicate much or all of the information contained in one or more other attributes
Irrelevant attributes
- Contain no information that is useful for the data mining task at hand
- E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
Heuristic Search in Attribute Selection
Heuristic methods of attribute subset selection:
- Stepwise forward selection: starts with an empty set; at each step, the best of the original attributes is determined and added to the reduced set (see the sketch after this list)
- Stepwise backward elimination: starts with the full set of attributes; at each step, it removes the worst attribute remaining in the set
- Combination of forward selection and backward elimination: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
- Decision tree induction: algorithms such as ID3, C4.5, and CART; all attributes that do not appear in the tree are assumed to be irrelevant
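A minimal Python sketch (not from the slides) of greedy stepwise forward selection. The `score(subset)` function is a hypothetical evaluation measure (e.g., cross-validated accuracy of a model trained on that attribute subset).

```python
def forward_selection(all_attrs, score):
    """Greedily add the attribute that most improves `score` until none helps."""
    selected = []
    best_score = float("-inf")
    while True:
        candidates = [a for a in all_attrs if a not in selected]
        if not candidates:
            break
        scored = [(score(selected + [a]), a) for a in candidates]
        top_score, top_attr = max(scored)
        if top_score <= best_score:
            break                    # no remaining attribute improves the score
        selected.append(top_attr)
        best_score = top_score
    return selected

# hypothetical demo: only attributes "a" and "b" are useful
useful = {"a", "b"}
demo_score = lambda subset: len(set(subset) & useful) - 0.01 * len(subset)
print(forward_selection(["a", "b", "c", "d"], demo_score))   # ['b', 'a']
```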
Histograms
- Another data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Singleton buckets: each bucket represents only a single attribute-value / frequency pair
- Partitioning rules (a bucketing sketch follows):
  - Equal-width: the width of each bucket range is uniform
  - Equal-frequency (equal-depth): each bucket contains roughly the same number of samples
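A minimal Python sketch (not from the slides) of equal-width bucketing that stores the count and mean of each bucket; the price list is hypothetical.

```python
def equal_width_histogram(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)   # clamp the maximum value
        buckets[idx].append(v)
    return [(lo + i * width, lo + (i + 1) * width,
             len(b), sum(b) / len(b) if b else None)
            for i, b in enumerate(buckets)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15, 18, 20, 21, 25, 30]
for low, high, count, mean in equal_width_histogram(prices, 3):
    print(f"[{low:.1f}, {high:.1f}): count={count}, mean={mean}")
```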
Data Redundancy
- An attribute is redundant if it can be derived from another attribute or set of attributes
- Redundancies can be detected by correlation analysis, which tells how strongly one attribute's values vary with those of another
- For nominal attributes: chi-square test
- For numeric attributes: correlation coefficient or covariance
Chi-square test (for Nominal data)
The Chi-square test is intended to test how likely it is
that an observed distribution is due to chance.
It measures how well the observed distribution of
data fits with the distribution that is expected if the
variables are independent.
A Chi-square test is designed to analyze categorical
data.
Chi-square test (for Nominal data)
- Let attribute A have c distinct values a1, a2, …, ac and attribute B have r distinct values b1, b2, …, br
- A contingency table: a1, a2, …, ac form the columns and b1, b2, …, br form the rows
- Χ² (chi-square) test:
$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where $o_{ij}$ is the observed (actual) count and $e_{ij}$ is the expected count of the joint event $(A = a_j, B = b_i)$ under independence
- The larger the Χ² value, the more likely the variables are related
- The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example
(Contingency table with columns "Play chess", "Not play chess", and "Sum (row)"; the cell counts are not reproduced here.)
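A minimal Python sketch (not from the slides) of the Χ² computation for a 2×2 contingency table. The counts below are hypothetical placeholders, since the original table values are not shown in this text.

```python
observed = [[250, 200],    # hypothetical counts for row 1
            [50, 1000]]    # hypothetical counts for row 2

row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
total = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total   # expected count under independence
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))   # 507.93 for these counts
```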
Chi-Square
- A chi-squared test is used for hypothesis testing on categorical variables.
- The null hypothesis, denoted H0, represents a general or default position. Examples: no relationship exists between two measured parameters A and B; a certain treatment has no effect.
- Degrees of freedom (d.f.): the number of independent ways in which a dynamical system can move without violating any constraint imposed on it. In statistics, d.f. = n − 1, where n is the number of variables.
- Level of significance: a statistical assessment of whether the results obtained from an experiment are due to some error or simply to chance.
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations.
Normalization
Normalization: scale attribute values to fall within a smaller, specified range.
Min-max normalization to [new_min_A, new_max_A]:
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation of attribute A):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
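A minimal Python sketch (not from the slides) reproducing the two normalization examples above.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```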