
Getting to Know Your Data

1
Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

2
Types of Data Sets
 Record
  Relational records
  Data matrix, e.g., numerical matrix, crosstabs
  Document data: text documents represented as term-frequency vectors
  Transaction data
 Graph and network
  World Wide Web
  Social or information networks
  Molecular structures
 Ordered
  Video data: sequence of images
  Temporal data: time-series
  Sequential data: transaction sequences
  Genetic sequence data
 Spatial, image and multimedia
  Spatial data: maps
  Image data
  Video data

Example term-frequency vectors (rows are documents, columns are terms):

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3     0      5     0     2      6    0     2      0        2
Document 2     0     7      0     2     1      0    0     3      0        0
Document 3     0     1      0     0     1      2    2     0      3        0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
3
Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, or tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
4
Attributes
 Attribute (also called dimension in data warehousing, feature in machine learning, or variable in statistics): a data field representing a characteristic or feature of a data object.
  E.g., customer_ID, name, address

 Types of attributes:
 Nominal

 Binary

 Numeric: quantitative

 Interval-scaled

 Ratio-scaled

5
Attribute Types
 Nominal: categories, states, or “names of things”
  E.g., Hair_color = {auburn, black, blond, brown, grey, red, white}
  marital status, occupation, ID numbers, zip codes
  No meaningful order; not quantitative
  Mathematical operations on the values of a nominal attribute are not meaningful
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV positive)

6
Attribute Types
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings
 Central tendency can be represented by – mode and
median
 But ‘mean’ cannot be defined

7
Numeric Attribute Types
 Numeric
 Quantitative: a measurable quantity, represented by integer or real values
  Can be interval-scaled or ratio-scaled
 Interval scaled
 Measured on a scale of equal-sized units
 Allow us to compare and quantify the difference between
values
 Values have order
  E.g., temperature in °C or °F, calendar dates
  No true zero-point
   Because 0 °C or 0 °F does not indicate "no temperature"
  Ratios are not valid, but since the values are numeric the mean can be computed; the median and mode are also valid
8
Numeric Attribute Types
 Ratio
 Inherent zero-point
 Values are ordered and ratios are meaningful; we can also compute the difference, mean, mode, and median
  We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K, since 0 K = -273.15 °C).
  e.g., temperature in Kelvin, length, counts, monetary quantities

9
Discrete vs. Continuous Attributes
 Another way of organizing attributes – discrete and continuous
 Discrete Attribute
 Has only a finite or countably infinite set of values

 E.g., zip codes, profession, or the set of words in a collection

of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of discrete attributes

 Continuous Attribute
 Has real numbers as attribute values

 E.g., temperature, height, or weight

 Practically, real values can only be measured and represented

using a finite number of digits


 Continuous attributes are typically represented as floating-point

variables
10
Getting to Know Your Data

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

11
Basic Statistical Descriptions of Data
 Motivation
 To better understand the data: central tendency, variation
and spread
 Central tendency
  Location of the middle or center of a data distribution
 Dispersion
  How the data are spread out
  Range, quartiles, interquartile range, five-number summary

12
Measuring the Central Tendency
 Mean (algebraic measure). Note: n is the sample size.

   x̄ = (1/n) Σ_{i=1..n} x_i

 Weighted arithmetic mean:

   x̄ = Σ_{i=1..n} w_i x_i / Σ_{i=1..n} w_i

 Problem
  A small number of extreme values can corrupt the mean
  Trimmed mean: chop the extreme values before averaging
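As a quick illustration (not part of the slides), here is a minimal Python sketch of the plain, weighted, and trimmed mean, assuming NumPy and SciPy are available; the sample values and weights are made up.

```python
import numpy as np
from scipy import stats

x = np.array([10, 12, 11, 13, 12, 95])   # made-up sample with one extreme value
w = np.array([1, 1, 1, 1, 1, 0.1])        # made-up weights

mean = x.mean()                           # (1/n) * sum(x_i)
weighted_mean = np.average(x, weights=w)  # sum(w_i * x_i) / sum(w_i)
trimmed_mean = stats.trim_mean(x, 0.2)    # drop 20% of values from each tail

print(mean, weighted_mean, trimmed_mean)
```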

13
Median
 Middle value if there is an odd number of values, or the average of the middle two values otherwise
 Estimated by interpolation (for grouped data):

   median = L1 + ((n/2 - (Σ freq)_l) / freq_median) × width

  L1 - lower boundary of the median interval (21)
  n - number of values in the entire data set (3194)
  (Σ freq)_l - sum of the frequencies of all intervals below the median interval (950)
  freq_median - frequency of the median interval (1500)
  width - width of the median interval (30)
  The median interval is the interval where the cumulative frequency first reaches n/2
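Plugging in the numbers quoted above, a short Python check of the interpolation:

```python
L1, n, cum_freq_below, freq_median, width = 21, 3194, 950, 1500, 30

median = L1 + ((n / 2 - cum_freq_below) / freq_median) * width
print(round(median, 2))   # approximately 33.94
```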

14
Measuring the Central Tendency

 Mode
 Value that occurs most frequently in the data
 Based on number of modes present in data set –
 Unimodal, bimodal, trimodal
 For unimodal numeric data, we have the empirical formula:

   mean - mode ≈ 3 × (mean - median)

15
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively skewed, and negatively skewed data

(figure: symmetric, positively skewed, and negatively skewed distributions)

16


Measuring the Dispersion of Data
 Variance and standard deviation (sample: s, population: σ)
  Variance (algebraic, scalable computation):

    σ² = (1/N) Σ_{i=1..N} (x_i - μ)²  =  (1/N) Σ_{i=1..N} x_i²  -  μ²

  Standard deviation s (or σ) is the square root of the variance s² (or σ²)
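A minimal NumPy sketch of the population and sample versions; the values reuse the salary example that appears later in the deck:

```python
import numpy as np

x = np.array([30.0, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

pop_var = x.var()             # (1/N) * sum((x_i - mu)^2)
sample_var = x.var(ddof=1)    # divides by N - 1 instead of N
pop_std = np.sqrt(pop_var)

print(pop_var, sample_var, pop_std)
```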

17
Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: points taken at regular intervals of the sorted data, splitting it into equal-sized, consecutive sets
  Q1 (25th percentile), Q3 (75th percentile)

 Inter-quartile range: IQR = Q3 – Q1


 Five number summary: fuller summary of shape of
distribution
 min, Q1, median, Q3, max
 Boxplot: ends of the box are the quartiles; median is marked;
add whiskers (lines outside the box), and plot outliers individually
 Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1

18
Boxplot Analysis

 Five-number summary of a distribution


 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extended to
Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually

19
Example
 Suppose we have a set of values for salary (in dollars) as
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110

 Draw a Boxplot
 Median ?

 Q1 ?

 Q3 ?

 IQR ?

 Outlier ?
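One way to check the answers, sketched in Python with NumPy/Matplotlib (note that NumPy's quartile convention may differ slightly from the hand method taught in class):

```python
import numpy as np
import matplotlib.pyplot as plt

salary = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

q1, median, q3 = np.percentile(salary, [25, 50, 75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salary[(salary < low) | (salary > high)]

print(median, q1, q3, iqr, outliers)   # 110 falls outside the 1.5*IQR fences

plt.boxplot(salary)                    # box, median line, whiskers, outlier points
plt.show()
```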

20
Visualization of Data Dispersion: 3-D Boxplots

21


Properties of Normal Distribution Curve

 The normal (distribution) curve


 From μ–σ to μ+σ: contains about 68% of the measurements

(μ: mean, σ: standard deviation)


 From μ–2σ to μ+2σ: contains about 95% of it

 From μ–3σ to μ+3σ: contains about 99.7% of it

22
Graphic Displays of Basic Statistical Descriptions

 Boxplot: graphic display of five-number summary


 Histogram: x-axis shows the values, y-axis represents the frequencies
 Quantile plot: each value x_i is paired with f_i, indicating that approximately 100*f_i % of the data are <= x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
 Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
23
Histogram Analysis
 Histogram: graph display of tabulated frequencies, shown as bars
 It shows what proportion of cases fall into each of several categories
 Differs from a bar chart in that it is the area of the bar that denotes the value, not the height as in bar charts; a crucial distinction when the categories are not of uniform width
 The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent

(figure: example histogram with bins from 10000 to 90000 on the x-axis and frequencies up to 40 on the y-axis)
24
Histograms Often Tell More than Boxplots

 The two histograms


shown in the left may
have the same boxplot
representation
 The same values for:
min, Q1, median, Q3,
max
 But they have rather
different data
distributions

25
Quantile Plot
 Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
 Plots quantile information
  For data x_i sorted in increasing order, f_i indicates that approximately 100*f_i % of the data are below or equal to the value x_i

26


Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
 View: Is there a shift in going from one distribution to another?
 Example shows unit price of items sold at Branch 1 vs. Branch 2 for
each quantile. Unit prices of items sold at Branch 1 tend to be lower
than those at Branch 2.

27
Scatter plot
 Provides a first look at bivariate data to see clusters of points,
outliers, etc
 Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

28
Positively and Negatively Correlated Data

 The left half fragment is positively correlated
 The right half is negatively correlated

29
Uncorrelated Data

30
Similarity and Dissimilarity
 Proximity refers to a similarity or dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity (e.g., distance)
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Minimum dissimilarity is often 0

31
Data Matrix and Dissimilarity Matrix
 Data structures
1. Data matrix
  Stores the n data objects with p attributes in the form of a relational table
  It is an n-by-p matrix:

      [ x_11 ... x_1f ... x_1p ]
      [  ...      ...      ... ]
      [ x_i1 ... x_if ... x_ip ]
      [  ...      ...      ... ]
      [ x_n1 ... x_nf ... x_np ]

32
Data Matrix and Dissimilarity Matrix
 Data structures
2. Dissimilarity matrix
  This structure stores the collection of proximities that are available for all pairs of the n objects
  It is an n-by-n triangular matrix:

      [ 0                                 ]
      [ d(2,1)   0                        ]
      [ d(3,1)   d(3,2)   0               ]
      [   :        :      :               ]
      [ d(n,1)   d(n,2)   ...   ...   0   ]

  d(i, j): difference (dissimilarity) between objects i and j
  d(i, j) = 0 means the objects are the same
  d(i, j) = d(j, i)
  Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix
  Measures of similarity can be expressed in terms of dissimilarity, e.g.
    sim(i, j) = 1 - d(i, j)

33
Proximity Measure for Nominal Attributes

 Let the number of states of a nominal attribute be M.


 Simple matching
  m: number of matches (number of attributes for which i and j are in the same state)
  p: total number of attributes

   d(i, j) = (p - m) / p

34
Dissimilarity- Nominal data

Obj. ID   attr-1   attr-2
101       A        X
102       B        X
103       A        Y
104       C        Z

Calculate the dissimilarity matrix using simple matching (p = 2):

   [ 0                      ]
   [ 1/2    0               ]
   [ 1/2    1      0        ]
   [ 1      1      1     0  ]
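A small Python sketch of the simple-matching computation for this table (written here for illustration; not part of the slides):

```python
import numpy as np

objects = [("A", "X"), ("B", "X"), ("A", "Y"), ("C", "Z")]   # objects 101-104
p = 2  # number of nominal attributes

n = len(objects)
d = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        m = sum(a == b for a, b in zip(objects[i], objects[j]))  # matches
        d[i, j] = (p - m) / p

print(d)   # d(2,1) = 0.5, d(3,1) = 0.5, d(3,2) = 1.0, d(4,*) = 1.0
```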

35
Proximity Measure for Binary Attributes
 Binary objects have only two states, 0 and 1
 A contingency table for binary data, comparing object i (rows) with object j (columns):

               object j
                1    0
   object i 1   q    r
            0   s    t

  where
   q - number of attributes equal to 1 for both objects
   r - number of attributes equal to 1 for object i and 0 for object j
   s - number of attributes equal to 0 for object i and 1 for object j
   t - number of attributes equal to 0 for both objects

 Symmetric binary dissimilarity:

   d(i, j) = (r + s) / (q + r + s + t)
36
Proximity Measure for Binary Attributes

 Asymmetric binary dissimilarity
  The number of negative matches (t) is considered unimportant:

   d(i, j) = (r + s) / (q + r + s)

 Asymmetric binary similarity between objects i and j:

   sim(i, j) = q / (q + r + s) = 1 - d(i, j)

 The Jaccard coefficient (similarity measure for asymmetric binary variables) is this same quantity
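A short illustrative Python sketch of these counts and the resulting measures for two made-up binary vectors:

```python
import numpy as np

i = np.array([1, 0, 1, 1, 0, 0, 1])   # made-up binary attribute vectors
j = np.array([1, 0, 0, 1, 0, 1, 1])

q = int(np.sum((i == 1) & (j == 1)))   # 1 for both
r = int(np.sum((i == 1) & (j == 0)))   # 1 for i, 0 for j
s = int(np.sum((i == 0) & (j == 1)))   # 0 for i, 1 for j
t = int(np.sum((i == 0) & (j == 0)))   # 0 for both

symmetric_d = (r + s) / (q + r + s + t)
asymmetric_d = (r + s) / (q + r + s)
jaccard_sim = q / (q + r + s)          # equals 1 - asymmetric_d

print(symmetric_d, asymmetric_d, jaccard_sim)
```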
37
Dissimilarity on Numeric Data: Minkowski Distance
 Minkowski distance: a popular distance measure

   d(i, j) = (|x_i1 - x_j1|^h + |x_i2 - x_j2|^h + ... + |x_ip - x_jp|^h)^(1/h)

  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
 Properties
  d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  d(i, j) = d(j, i) (symmetry)
  d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
 A distance that satisfies these properties is a metric

38
Special Cases of Minkowski Distance
 h = 1: Manhattan (city block, L1 norm) distance
  E.g., the Hamming distance: the number of bits that differ between two binary vectors

   d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

 h = 2: Euclidean (L2 norm) distance

   d(i, j) = √(|x_i1 - x_j1|² + |x_i2 - x_j2|² + ... + |x_ip - x_jp|²)

 h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  This is the maximum difference between any component (attribute) of the vectors

39
Example: Minkowski Distance
Dissimilarity Matrices

Data:
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

(figure: scatter plot of the four points in the attribute-1 / attribute-2 plane)

Manhattan (L1):
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2):
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞):
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
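These matrices can be reproduced with SciPy; a minimal sketch (not part of the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])   # x1..x4 from the example

manhattan = cdist(X, X, metric="cityblock")      # h = 1
euclidean = cdist(X, X, metric="euclidean")      # h = 2
supremum  = cdist(X, X, metric="chebyshev")      # h -> infinity

print(np.round(euclidean, 2))
```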
40
Ordinal Variables
 An ordinal variable can be discrete or continuous
 Order is important, e.g., rank
 Ordinal attributes can be treated like interval-scaled attributes. Suppose f is an attribute from a set of ordinal attributes, with M_f ordered states; the value of f for the i-th object is x_if.
  Replace each x_if by its rank r_if ∈ {1, ..., M_f}
  Map the range of each variable onto [0, 1] by replacing the rank of the i-th object in the f-th variable by

    z_if = (r_if - 1) / (M_f - 1)

  Compute the dissimilarity using methods for interval-scaled variables

41
Ordinal variables

Obj. ID   attr-1
101       Excellent
102       Fair
103       Good
104       Excellent

Steps to find the dissimilarity measure (sketched in code below):
1. Replace states with ranks. Here, Mf = 3, and the assigned ranks are 3, 2, 1 for excellent, good, and fair, respectively.
2. Normalize the ranking by mapping each rank onto [0.0, 1.0] using z_if = (r_if - 1) / (M_f - 1):
   Fair -> (1 - 1)/(3 - 1) = 0
   Good -> 0.5
   Excellent -> 1
3. Compute a dissimilarity measure (Euclidean, Manhattan, etc.) on the normalized values.
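A compact Python sketch of these three steps for the table above (illustrative only):

```python
import numpy as np

values = ["Excellent", "Fair", "Good", "Excellent"]       # objects 101-104
ranks = {"Fair": 1, "Good": 2, "Excellent": 3}
Mf = 3

z = np.array([(ranks[v] - 1) / (Mf - 1) for v in values])  # 1.0, 0.0, 0.5, 1.0

# Manhattan dissimilarity between every pair of objects
d = np.abs(z[:, None] - z[None, :])
print(d)
```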

42
Attributes of Mixed Type
 A database may contain all attribute types
  Nominal, symmetric binary, asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects (p attributes):

   d(i, j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f)  /  Σ_{f=1..p} δ_ij^(f)

  δ_ij^(f) = 0 if either (1) x_if or x_jf is missing, or (2) x_if = x_jf = 0 and attribute f is asymmetric binary; otherwise δ_ij^(f) = 1
 f is binary or nominal:
  d_ij^(f) = 0 if x_if = x_jf, or d_ij^(f) = 1 otherwise
 f is numeric: use the normalized distance
   d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf)
 f is ordinal
  Compute ranks r_if and z_if = (r_if - 1) / (M_f - 1)
  Treat z_if as interval-scaled
43
Mixed type example

Obj. ID   Attr-1   Attr-2      Attr-3
101       Code A   Excellent   45
102       Code B   Fair        22
103       Code C   Good        64
104       Code A   Excellent   28

• Dissimilarity matrix for Attribute 1 (nominal):

   [ 0                    ]
   [ 1     0              ]
   [ 1     1     0        ]
   [ 0     1     1    0   ]

Mixed type example

Obj. ID   Attr-1   Attr-2      Attr-3
101       Code A   Excellent   45
102       Code B   Fair        22
103       Code C   Good        64
104       Code A   Excellent   28

• Dissimilarity matrix for Attribute 2 (ordinal)
• Let the rank of Excellent be 3, Good be 2, and Fair be 1
• Normalize the ranks (Excellent -> 1.0, Good -> 0.5, Fair -> 0) and use the (one-dimensional) Euclidean distance:

   [ 0                      ]
   [ 1.0    0               ]
   [ 0.5    0.5    0        ]
   [ 0      1.0    0.5   0  ]

Mixed type example

Obj. ID   Attr-1   Attr-2      Attr-3
101       Code A   Excellent   45
102       Code B   Fair        22
103       Code C   Good        64
104       Code A   Excellent   28

• Dissimilarity matrix for Attribute 3 (numeric)
• Max = 64, Min = 22
• Use the normalized distance |x_if - x_jf| / (max - min):

   [ 0                          ]
   [ 0.55    0                  ]
   [ 0.45    1.0     0          ]
   [ 0.40    0.14    0.86    0  ]

• Using the dissimilarity matrices of the three attributes, calculate the overall dissimilarity.
• δij = 1 for each of the three attributes, so each entry is the average of the three per-attribute dissimilarities
• e.g., d(3,1) = (1(1) + 1(0.5) + 1(0.45)) / 3 = 0.65

   [ 0                          ]
   [ 0.85    0                  ]
   [ 0.65    0.83    0          ]
   [ 0.13    0.71    0.79    0  ]
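The combination step can be sketched in Python; this illustrative snippet reuses the three per-attribute dissimilarity matrices computed above and equal weights δ = 1:

```python
import numpy as np

# Per-attribute dissimilarity matrices from the example (objects 101-104)
d_nominal = np.array([[0, 1, 1, 0],
                      [1, 0, 1, 1],
                      [1, 1, 0, 1],
                      [0, 1, 1, 0]], dtype=float)
d_ordinal = np.array([[0, 1.0, 0.5, 0],
                      [1.0, 0, 0.5, 1.0],
                      [0.5, 0.5, 0, 0.5],
                      [0, 1.0, 0.5, 0]])
d_numeric = np.array([[0, 0.55, 0.45, 0.40],
                      [0.55, 0, 1.0, 0.14],
                      [0.45, 1.0, 0, 0.86],
                      [0.40, 0.14, 0.86, 0]])

# delta_ij = 1 for every attribute here, so the overall measure is a plain average
d_overall = (d_nominal + d_ordinal + d_numeric) / 3
print(np.round(d_overall, 2))    # d(3,1) = 0.65, d(4,1) = 0.13, ...
```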

Cosine Similarity

 A document can be represented by thousands of attributes, each recording the


frequency of a particular word (such as keywords) or phrase in the document.

 Other vector objects: gene features in micro-arrays, …


  Applications: information retrieval, biological taxonomy, gene feature mapping, ...
 Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
   cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
   where · indicates the vector dot product and ||d|| is the length of vector d

48
Example: Cosine Similarity

 cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
   where · indicates the vector dot product and ||d|| is the length of vector d

 Ex: Find the similarity between documents 1 and 2.

   d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
   d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

   d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
   ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
   ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = 17^0.5 = 4.123
   cos(d1, d2) = 25 / (6.481 * 4.123) = 0.94
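The same computation in a few lines of NumPy (illustrative):

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 2))   # 0.94
```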

49
Data Reduction : Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining / learning
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

50
Principal Component Analysis (PCA)
 Find a projection that captures the largest amount of variation in data
 The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space

(figure: 2-D data points with the principal axes drawn in the x1-x2 plane)
51
Background for PCA
 Suppose attributes are A1 and A2, and we have n
training examples. x’s denote values of A1 and y’s
denote values of A2 over the training examples.

 Variance of an attribute:

   var(A1) = Σ_{i=1..n} (x_i - x̄)² / (n - 1)

 Covariance of two attributes:

   cov(A1, A2) = Σ_{i=1..n} (x_i - x̄)(y_i - ȳ) / (n - 1)

 If the covariance is positive, both dimensions tend to increase together. If it is negative, as one increases the other decreases. If it is zero, the two attributes are uncorrelated (they vary independently of each other).
Covariance matrix

 Covariance matrix
 Suppose we have n attributes, A1, ..., An.

 Covariance matrix:

   C_{n x n} = (c_ij), where c_ij = cov(A_i, A_j)

54
Covariance Calculation
 Covariance matrix for two attributes H and M:

   [ cov(H,H)  cov(H,M) ]   =   [ var(H)   104.5  ]   =   [  47.7   104.5 ]
   [ cov(M,H)  cov(M,M) ]       [ 104.5    var(M) ]       [ 104.5   370   ]

56
Eigenvectors
 Let M be an n×n matrix.
 v is an eigenvector of M if M·v = λ·v
 λ is called the eigenvalue associated with v

Eigenvectors
 For any eigenvector v of M and scalar a,

   M·(a·v) = λ·(a·v)

 Thus you can always choose eigenvectors of length 1:

   v_1² + ... + v_n² = 1

 Eigenvectors can only be found for a square matrix, and not every square matrix has eigenvectors. If an n×n matrix M has any eigenvectors, it has n of them, and (for a symmetric matrix such as a covariance matrix) they are orthogonal to one another.
 Thus eigenvectors can be used as a new basis for an n-dimensional vector space.

58
 Since M·v = λ·v, where v is an eigenvector of M and λ is the eigenvalue of M corresponding to the vector v,
  i.e. M·v - λ·v = 0,
  (M - λI)·v = 0 for a nonzero v, which implies (M - λI) cannot have an inverse
  Therefore det(M - λI) = 0

 Let M = [  1   -3 ]
         [ -3    5 ]
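For checking such exercises numerically, NumPy's eigendecomposition can be used; a minimal sketch (the matrix entries follow the reconstruction above, where the signs are an assumption):

```python
import numpy as np

M = np.array([[1.0, -3.0],
              [-3.0, 5.0]])     # signs assumed; see the slide above

eigenvalues, eigenvectors = np.linalg.eig(M)   # solves det(M - lambda*I) = 0
print(eigenvalues)                             # roots of lambda^2 - 6*lambda - 4 = 0
print(eigenvectors)                            # unit-length eigenvectors (columns)
```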

59
Principal Component Analysis (Steps)
 Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
 Normalize input data: Each attribute falls within the same range
 Compute k orthonormal (unit) vectors, i.e., principal components
 Each input data (vector) is a linear combination of the k principal
component vectors
 The principal components are sorted in order of decreasing
“significance” or strength
 Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a
good approximation of the original data)
 Works for numeric data only
60
PCA steps
1. Given the original data set S = {x1, ..., xk}, produce a new set by subtracting the mean of each attribute Ai from that attribute's values.

   Mean of the original data: (1.81, 1.91)    Mean after centering: (0, 0)


PCA steps
2. Calculate the covariance matrix of the mean-centered data (here a 2×2 matrix over the attributes x and y).

3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix.
PCA steps
4. Order the eigenvectors by eigenvalue, highest to lowest:

   v1 = (-0.677873399, -0.735178956),  eigenvalue λ1 = 1.28402771
   v2 = (-0.735178956,  0.677873399),  eigenvalue λ2 = 0.0490833989

 In general, you get n components. To reduce the dimensionality to p, ignore the n - p components at the bottom of the list.
PCA steps
Construct the new feature vector:
   FeatureVector = (v1, v2, ..., vp)

   FeatureVector  = [ -0.677873399   -0.735178956 ]
                    [ -0.735178956    0.677873399 ]

   or the reduced-dimension feature vector:

   FeatureVector2 = [ -0.677873399 ]
                    [ -0.735178956 ]
PCA steps

5. Deriving the new dataset
  FinalData = RowFeatureVector x RowDataAdjust
  RowFeatureVector
   The matrix with the eigenvectors in its rows (the feature vector transposed)
  RowDataAdjust
   The mean-adjusted data, transposed
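The whole pipeline can be sketched in a few lines of NumPy. The slides' data table did not survive extraction, so the 2-D dataset below is an assumption: it follows the widely used PCA tutorial example whose mean (1.81, 1.91) and eigenvalues (1.284, 0.049) match the numbers quoted on the slides.

```python
import numpy as np

# Illustrative 2-D dataset (assumed; see note above)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

# Step 1: subtract the mean of each attribute
adjusted = data - data.mean(axis=0)

# Step 2: covariance matrix (attributes as variables)
cov = np.cov(adjusted, rowvar=False)

# Steps 3-4: eigenvectors/eigenvalues, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the first principal component (dimensionality 2 -> 1)
final_data = adjusted @ eigvecs[:, :1]

print(eigvals)          # ~[1.284, 0.049]
print(final_data[:3])   # first three projected values
```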

65
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in one or
more other attributes
 Irrelevant attributes
 Contain no information that is useful for the data mining
task at hand
 E.g., students' ID is often irrelevant to the task of predicting
students' GPA

66
Heuristic Search in Attribute Selection

 There are 2d possible attribute combinations of d attributes


 Typical heuristic attribute selection methods:
 Best single attribute
1. Determined by statistical significance (assume attributes
are independent of one another)
2. Information gain

67
Heuristic Search in Attribute Selection
 Heuristic methods of attribute subset selection
 Stepwise forward selection

 Starts with empty set, the best of original attributes is determined and
added to the reduced set
 Stepwise backward elimination
 Starts with full set of attributes. At each step, it removes the worst
attribute remaining in the set
 Combination of forward selection and backward elimination
  At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
 Decision tree induction
  Algorithms such as ID3, C4.5, and CART
  All attributes that do not appear in the tree are assumed to be irrelevant

68
Histograms
 Another data reduction technique
 Divide data into buckets and store average (sum) for
each bucket
 Singleton buckets
  Each bucket represents only a single attribute-value/frequency pair
 Partitioning rules
 Equal width

 The width of bucket range is uniform


 Equal Frequency
 Roughly, frequency of each bucket is constant (bucket
contains roughly same number of data samples)
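A small NumPy sketch of the two partitioning rules on an illustrative (made-up) price list:

```python
import numpy as np

prices = np.array([5, 8, 10, 12, 14, 14, 15, 18, 21, 25, 28, 30])   # made-up data

# Equal-width: bucket ranges have uniform width
counts, edges = np.histogram(prices, bins=4)
print(edges, counts)

# Equal-frequency: bucket boundaries chosen so each bucket holds roughly the same count
boundaries = np.quantile(prices, [0, 0.25, 0.5, 0.75, 1.0])
print(boundaries)
```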
69
Histogram (example figure)

70
Data Redundancy
 An attribute is redundant, if it can be derived from
another attribute or set of attributes
 Redundancies can be detected by correlation analysis
 Tells how one attribute’s values vary from those of

another
 For nominal attributes
 Chi-square test

 Numeric attributes
 Correlation coefficient or covariance

71
Chi-square test (for Nominal data)
 The Chi-square test is intended to test how likely it is that an observed distribution is due to chance.
 It measures how well the observed distribution of the data fits the distribution that is expected if the variables are independent.
 A Chi-square test is designed to analyze categorical data.

72
Chi-square test (for Nominal data)
 Let attribute A have c distinct values a1, a2, ..., ac and attribute B have r distinct values b1, b2, ..., br
 A contingency table:
  a1, a2, ..., ac form the columns
  b1, b2, ..., br form the rows
 The expected count is calculated as

   e_ij = count(A = a_i) × count(B = b_j) / n

   e.g., e_11 = 300 × 450 / 1500 = 90

                            Play chess   Not play chess   Sum (row)
 Like science fiction          250            200            450
 Not like science fiction       50           1000           1050
 Sum (col.)                    300           1200           1500

73


Correlation Analysis (Nominal Data)

 Χ² (chi-square) test:

   Χ² = Σ_{i=1..r} Σ_{j=1..c} (o_ij - e_ij)² / e_ij

 The larger the Χ² value, the more likely the variables are related
 The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count

74
Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
 Like science fiction        250 (90)       200 (360)        450
 Not like science fiction     50 (210)     1000 (840)       1050
 Sum (col.)                  300            1200             1500

 Χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

   Χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

 With (2-1)×(2-1) = 1 degree of freedom and significance level 0.001, the critical Χ² value is 10.828. Since 507.93 > 10.828, the null hypothesis that the variables are independent is rejected.
 It shows that science_fiction and play_chess are correlated in this group.
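The same test in SciPy, as a quick check (illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],     # like science fiction
                     [50, 1000]])    # does not like science fiction

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof, p_value)            # ~507.93 with 1 degree of freedom
print(expected)                      # [[90, 360], [210, 840]]
```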

75
Chi-Square
 A chi-squared test is used for hypothesis testing on categorical variables.
 The null hypothesis represents a general or default position.
 Examples
  No relationship occurs between two measured parameters A and B
  A certain treatment has no effect. The null hypothesis is denoted by H0.
 Degrees of freedom (d.f.)
  The number of independent ways in which a dynamical system can move without violating any constraint imposed on it is called its degrees of freedom.
  In statistics, d.f. = n - 1, where n is the number of variables
 Level of significance
  A statistical assessment of whether the results of an experiment are due to some error or simply to chance.

76
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson's product-moment coefficient):

   r_A,B = Σ_{i=1..n} (a_i - Ā)(b_i - B̄) / ((n - 1) σ_A σ_B) = (Σ_{i=1..n} a_i b_i  -  n Ā B̄) / ((n - 1) σ_A σ_B)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(a_i b_i) is the sum of the AB cross-products.
 If r_A,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
 r_A,B = 0: independent (uncorrelated); r_A,B < 0: negatively correlated
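A quick NumPy check with made-up values:

```python
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # made-up attribute A
b = np.array([1.0, 3.0, 5.0, 9.0, 11.0])   # made-up attribute B

r = np.corrcoef(a, b)[0, 1]                # Pearson's product-moment coefficient
print(r)                                   # > 0, so A and B are positively correlated
```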

77
Normalization
 Normalization: scale values to fall within a smaller, specified range
 Min-max normalization to [new_min_A, new_max_A]:

   v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A

  Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

   ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716

 Z-score normalization (μ: mean, σ: standard deviation):

   v' = (v - μ_A) / σ_A

  Ex. Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to

   (73,600 - 54,000) / 16,000 = 1.225
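Both normalizations as a short NumPy sketch over an illustrative income array (values other than 73,600 and the stated range endpoints are made up):

```python
import numpy as np

income = np.array([12_000, 40_000, 54_000, 73_600, 98_000], dtype=float)

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization with the slide's mean and standard deviation
mu, sigma = 54_000, 16_000
z_score = (income - mu) / sigma

print(min_max)   # 73,600 maps to ~0.716
print(z_score)   # 73,600 maps to 1.225
```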

78
