Getting to Know Your Data
Types of Data Sets

Record:
- Relational records
- Data matrix, e.g., numerical matrix, crosstabs
- Document data: text documents represented as term-frequency vectors, e.g.:

             team  coach  play  ball  score  game  win  lost  timeout  season
  Document 1   3     0      5     0     2      6    0     2      0       2
  Document 2   0     7      0     2     1      0    0     3      0       0
  Document 3   0     1      0     0     1      2    2     0      3       0

- Transaction data, e.g.:

  TID  Items
  1    Bread, Coke, Milk
  2    Beer, Bread
  3    Beer, Coke, Diaper, Milk
  4    Beer, Bread, Diaper, Milk
  5    Coke, Diaper, Milk

Graph and network:
- World Wide Web
- Social or information networks
- Molecular structures

Ordered:
- Video data: sequence of images
- Temporal data: time-series
- Sequential data: transaction sequences
- Genetic sequence data

Spatial, image and multimedia:
- Spatial data: maps
- Image data
- Video data
Data Objects
Types of attributes:
- Nominal
- Binary
- Ordinal
- Numeric (quantitative): interval-scaled or ratio-scaled
Attribute Types
Nominal: categories, states, or "names of things"
- E.g., hair_color = {auburn, black, blond, brown, grey, red, white}; marital status, occupation, ID numbers, zip codes
- Values have no meaningful order and are not quantitative
- Mathematical operations on the values of nominal attributes are not meaningful
Binary
- A nominal attribute with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important, e.g., gender
- Asymmetric binary: outcomes not equally important, e.g., a medical test (positive vs. negative)
  - Convention: assign 1 to the most important outcome (e.g., HIV positive)
Attribute Types
Ordinal
- Values have a meaningful order (ranking), but the magnitude between successive values is not known
- E.g., size = {small, medium, large}, grades, army rankings
- Central tendency can be represented by the mode and the median, but the mean cannot be defined
Numeric Attribute Types
Numeric
- Quantitative: a measurable quantity, represented in integer or real values
- Can be interval-scaled or ratio-scaled
Interval-scaled
- Measured on a scale of equal-sized units
- Values have order and allow us to compare and quantify the difference between values
- E.g., temperature in °C or °F, calendar dates
- No true zero-point: 0 °C or 0 °F does not indicate "no temperature"
- Ratios are therefore not valid, but since the values are numeric, the mean can be computed; the median and mode are also valid
Numeric Attribute Types
Ratio-scaled
- Inherent zero-point
- Values are ordered and can be treated as multiples (ratios) of one another; differences, mean, mode, and median can all be computed
- We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K, since 0 K = −273.15 °C)
- E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
Another way of organizing attributes is as discrete or continuous.
Discrete attribute
- Has only a finite or countably infinite set of values
- E.g., zip codes, or the set of words in a collection of documents
- Sometimes represented as integer variables
Continuous attribute
- Has real numbers as attribute values
- Typically represented as floating-point variables
Getting to Know Your Data
- Data Visualization
- Summary
Basic Statistical Descriptions of Data
Motivation
- To better understand the data: central tendency, variation, and spread
Central tendency
- The location of the middle or center of the data distribution
Dispersion
- How the data are spread out
- Range, quartiles, interquartile range, five-number summary
Measuring the Central Tendency
Mean (algebraic measure
Note: n is sample size. 1 n
Weighted arithmetic mean: x xi
n i 1
Problem
A small number of extreme values can corrupt the n
mean w x i i
Trimmed mean: chopping extreme values x i 1
n
w
i 1
i
13
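A minimal Python sketch (not from the slides) of the three variants above: arithmetic mean, weighted arithmetic mean, and a trimmed mean that chops a fraction of extreme values from each end. The data values are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, ws):
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)

def trimmed_mean(xs, frac=0.1):
    xs = sorted(xs)
    k = int(len(xs) * frac)                 # number of values dropped from each end
    return mean(xs[k:len(xs) - k] if k else xs)

data = [1, 2, 2, 3, 4, 5, 100]              # a single extreme value corrupts the mean
print(round(mean(data), 2))                 # 16.71
print(round(trimmed_mean(data, 0.2), 2))    # 3.2 after chopping 1 and 100
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```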
Median
- Middle value if there is an odd number of values, or the average of the middle two values otherwise
- Estimated by interpolation for grouped data, where the median interval is the interval whose cumulative frequency is >= n/2:
$\text{median} = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$
where
- $L_1$: lower boundary of the median interval (21)
- n: number of values in the entire data set (3194)
- $(\sum freq)_l$: sum of the frequencies of all intervals lower than the median interval (950)
- $freq_{median}$: frequency of the median interval (1500)
- width: width of the median interval (30)
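Plugging the example values above into the interpolation formula gives approximately
$\text{median} \approx 21 + \frac{3194/2 - 950}{1500} \times 30 = 21 + \frac{647}{1500} \times 30 \approx 33.9$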
Measuring the Central Tendency
Mode
- Value that occurs most frequently in the data
- Based on the number of modes present in the data set: unimodal, bimodal, trimodal
- For unimodal numeric data that are moderately skewed, we have the empirical formula: mean − mode ≈ 3 × (mean − median)
Symmetric vs. Skewed Data
- Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
Measuring the Dispersion of Data
- Variance and standard deviation: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$, where σ (the standard deviation) is the square root of the variance
- Quartiles, outliers, and boxplots
  - Quartiles: points taken at regular intervals of a sorted data distribution, splitting it into equal-size, consecutive sets
  - Q1 (25th percentile), Q3 (75th percentile)
  - Interquartile range: IQR = Q3 − Q1
  - Outlier: usually, a value more than 1.5 × IQR below Q1 or above Q3
Boxplot Analysis
Example
Suppose we have a set of values for salary (in dollars):
30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
Draw a boxplot and find:
- Median?
- Q1?
- Q3?
- IQR?
- Outliers?
(A sketch that computes these follows.)
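A minimal Python sketch (not from the slides) computing the five-number summary and outliers for the salary example above, using the "median of each half" convention for Q1 and Q3; other quartile conventions (e.g., interpolation) give slightly different values.

```python
def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
s = sorted(salaries)
n = len(s)

med = median(s)
q1 = median(s[: n // 2])             # median of the lower half
q3 = median(s[(n + 1) // 2:])        # median of the upper half
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in s if x < low_fence or x > high_fence]

print(med, q1, q3, iqr, outliers)    # 54.0 48.5 66.5 18.0 [110]
```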
Visualization of Data Dispersion: 3-D Boxplots
Graphic Displays of Basic Statistical Descriptions
Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information: for data $x_i$ sorted in increasing order, $f_i$ indicates that approximately $100 f_i$% of the data are below or equal to the value $x_i$
Scatter plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data
Uncorrelated Data
Similarity and Dissimilarity
Proximity refers to a similarity or dissimilarity measure.
Similarity
- Numerical measure of how alike two data objects are
- Value is higher when objects are more alike
- Often falls in the range [0, 1]
Dissimilarity (e.g., distance)
- Numerical measure of how different two data objects are
- Lower when objects are more alike
- Minimum dissimilarity is often 0
Data Matrix and Dissimilarity Matrix
Data structures
1. Data matrix
- Stores the n data objects with p attributes in the form of a relational table; it is an n-by-p matrix:
$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \cdots & \cdots & \cdots & \cdots & \cdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$
Data Matrix and Dissimilarity Matrix
Data structures (continued)
2. Dissimilarity matrix
- Stores the collection of proximities that are available for all pairs of the n objects, as an n-by-n triangular matrix:
$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$
- d(i, j): the difference or dissimilarity between objects i and j
- d(i, j) = 0 when the objects are the same
- d(i, j) = d(j, i)
- Many clustering and nearest-neighbor algorithms operate on a dissimilarity matrix
- Measures of similarity can be expressed in terms of dissimilarity, e.g., sim(i, j) = 1 − d(i, j)
Proximity Measure for Nominal Attributes
Simple matching:
$d(i, j) = \frac{p - m}{p}$
where m is the number of attributes on which objects i and j match, and p is the total number of attributes describing the objects.
Dissimilarity: Nominal data

Obj. ID  attr-1  attr-2
101      A       X
102      B       X
103      A       Y
104      C       Z

Calculate the dissimilarity matrix using both attributes (p = 2):

0
1/2   0
1/2   1     0
1     1     1     0

(A sketch that computes this matrix follows.)
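A minimal Python sketch (not from the slides) computing the nominal dissimilarity matrix d(i, j) = (p − m) / p for the four objects above.

```python
objects = {
    101: ("A", "X"),
    102: ("B", "X"),
    103: ("A", "Y"),
    104: ("C", "Z"),
}

def nominal_dissimilarity(a, b):
    p = len(a)                                # total number of attributes
    m = sum(x == y for x, y in zip(a, b))     # number of matching attributes
    return (p - m) / p

ids = sorted(objects)
for i in ids:
    print(i, [nominal_dissimilarity(objects[i], objects[j]) for j in ids if j <= i])
# 101 [0.0]
# 102 [0.5, 0.0]
# 103 [0.5, 1.0, 0.0]
# 104 [1.0, 1.0, 1.0, 0.0]
```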
Proximity Measure for Binary Attributes
Dissimilarity on Numeric Data: Minkowski Distance
Minkowski distance:
$d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
where i = (x_i1, x_i2, …, x_ip) and j = (x_j1, x_j2, …, x_jp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
Properties
- d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
- d(i, j) = d(j, i) (symmetry)
- d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
- h = 1: Manhattan (city block, L1 norm) distance
  $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
  E.g., the Hamming distance: the number of bits that are different between two binary vectors
- h = 2: Euclidean (L2 norm) distance
- h → ∞: supremum (L∞ norm) distance, the maximum difference between any attributes of the objects
(All three appear in the example that follows.)
Example: Minkowski Distance

Data (the scatter plot of the four points is omitted):

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

Dissimilarity matrices:

Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0

Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)
L∞   x1    x2    x3    x4
x1   0
x2   3     0
x3   2     5     0
x4   3     1     5     0
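A minimal Python sketch (not from the slides) reproducing entries of the dissimilarity matrices above for h = 1 (Manhattan), h = 2 (Euclidean), and the supremum limit.

```python
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

def minkowski(a, b, h):
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

print(minkowski(points["x1"], points["x2"], 1))             # 5.0  (Manhattan)
print(round(minkowski(points["x1"], points["x2"], 2), 2))   # 3.61 (Euclidean)
print(supremum(points["x1"], points["x2"]))                 # 3    (supremum)
```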
Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Ordinal attributes can be treated like interval-scaled attributes:
  - Suppose f is an attribute from a set of ordinal attributes, with $M_f$ ordered states; the value of f for the i-th object is $x_{if}$
  - Replace each $x_{if}$ by its rank $r_{if} \in \{1, ..., M_f\}$
  - Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - Compute the dissimilarity using methods for interval-scaled variables
Ordinal variables example

Obj. ID  attr-1
101      Excellent
102      Fair
103      Good
104      Excellent

Steps to find the dissimilarity measure:
1. Replace states with ranks. Here $M_f$ = 3, and the ranks 3, 2, 1 are assigned to excellent, good, and fair, respectively.
2. Normalize the ranking by mapping the ranks onto [0, 1] using $z_{if} = \frac{r_{if} - 1}{M_f - 1}$:
   Fair → (1 − 1)/(3 − 1) = 0, Good → 0.5, Excellent → 1
3. Find the dissimilarity measure (Euclidean, Manhattan, etc.), as in the sketch below.
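A minimal Python sketch (not from the slides) that normalizes the ordinal ranks from the example above onto [0, 1] and computes the pairwise dissimilarities |z_i − z_j|.

```python
ranks = {"Fair": 1, "Good": 2, "Excellent": 3}
M_f = len(ranks)

objects = {101: "Excellent", 102: "Fair", 103: "Good", 104: "Excellent"}

# z_if = (r_if - 1) / (M_f - 1)
z = {oid: (ranks[state] - 1) / (M_f - 1) for oid, state in objects.items()}

ids = sorted(z)
for i in ids:
    print(i, [abs(z[i] - z[j]) for j in ids if j <= i])
# 101 [0.0]
# 102 [1.0, 0.0]
# 103 [0.5, 0.5, 0.0]
# 104 [0.0, 1.0, 0.5, 0.0]
```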
Attributes of Mixed Type
A database may contain attributes of all types: nominal, symmetric binary, asymmetric binary, numeric, ordinal.
One may use a weighted formula to combine their effects:
$d(i, j) = \frac{\sum_{f=1}^{p}\delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p}\delta_{ij}^{(f)}}$
where the indicator $\delta_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing, or (2) $x_{if} = x_{jf} = 0$ and attribute f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$.
The per-attribute contribution $d_{ij}^{(f)}$ depends on the type of f:
- f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
- f is numeric: use the normalized distance
- f is ordinal: treat the values as ranks and use $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{M_f - 1}$
Mixed type example

Obj. ID  Attr-1 (nominal)  Attr-2 (ordinal)  Attr-3 (numeric)
101      Code A            Excellent         45
102      Code B            Fair              22
103      Code C            Good              64
104      Code A            Excellent         28

Resulting dissimilarity matrix (a sketch reproducing it follows):

0
0.85  0
0.65  0.83  0
0.13  0.71  0.79  0
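A minimal Python sketch (not from the slides) that reproduces the mixed-type dissimilarity matrix above, combining one nominal, one ordinal, and one numeric attribute with equal weight (every δ equals 1 here, since nothing is missing).

```python
z_rank = {"Fair": 0.0, "Good": 0.5, "Excellent": 1.0}   # normalized ordinal ranks

data = {
    101: ("Code A", "Excellent", 45),
    102: ("Code B", "Fair", 22),
    103: ("Code C", "Good", 64),
    104: ("Code A", "Excellent", 28),
}

num_values = [v[2] for v in data.values()]
num_range = max(num_values) - min(num_values)            # 64 - 22 = 42

def mixed_dissim(a, b):
    d_nom = 0.0 if a[0] == b[0] else 1.0                 # nominal attribute
    d_ord = abs(z_rank[a[1]] - z_rank[b[1]])             # ordinal attribute
    d_num = abs(a[2] - b[2]) / num_range                 # normalized numeric attribute
    return (d_nom + d_ord + d_num) / 3

ids = sorted(data)
for i in ids:
    print(i, [round(mixed_dissim(data[i], data[j]), 2) for j in ids if j <= i])
# 101 [0.0]
# 102 [0.85, 0.0]
# 103 [0.65, 0.83, 0.0]
# 104 [0.13, 0.71, 0.79, 0.0]
```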
Cosine Similarity
- A document can be represented by a term-frequency vector (as in the document data example earlier)
- cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||), where • indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5×5 + 0×0 + 3×3 + 0×0 + 2×2 + 0×0 + 0×0 + 2×2 + 0×0 + 0×0)^0.5 = (42)^0.5 ≈ 6.481
||d2|| = (3×3 + 0×0 + 2×2 + 0×0 + 1×1 + 1×1 + 0×0 + 1×1 + 0×0 + 1×1)^0.5 = (17)^0.5 ≈ 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
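A minimal Python sketch (not from the slides) reproducing the cosine similarity computation for the two term-frequency vectors above.

```python
import math

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(d1, d2))      # 25
norm1 = math.sqrt(sum(a * a for a in d1))     # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))     # sqrt(17) ≈ 4.123

print(round(dot / (norm1 * norm2), 2))        # 0.94
```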
Data Reduction: Dimensionality Reduction
Curse of dimensionality
- When dimensionality increases, data become increasingly sparse
- Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
- The number of possible combinations of subspaces grows exponentially
Dimensionality reduction
- Avoids the curse of dimensionality
- Helps eliminate irrelevant features and reduce noise
- Reduces the time and space required in data mining / learning
- Allows easier visualization
Dimensionality reduction techniques
- Wavelet transforms
- Principal Component Analysis
- Supervised and nonlinear techniques (e.g., feature selection)
Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance
matrix, and these eigenvectors define the new space
Background for PCA
Suppose the attributes are A1 and A2, and we have n training examples. The x's denote values of A1 and the y's denote values of A2 over the training examples.
Variance of an attribute:
$\mathrm{var}(A_1) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$
Covariance of two attributes:
$\mathrm{cov}(A_1, A_2) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
Covariance matrix: suppose we have n attributes A1, ..., An. Then
$C^{n \times n} = (c_{i,j}), \text{ where } c_{i,j} = \mathrm{cov}(A_i, A_j)$
Covariance Calculation
Example covariance matrix for two attributes H and M:
$\begin{pmatrix} \mathrm{cov}(H,H) & \mathrm{cov}(H,M) \\ \mathrm{cov}(M,H) & \mathrm{cov}(M,M) \end{pmatrix} = \begin{pmatrix} \mathrm{var}(H) & 104.5 \\ 104.5 & \mathrm{var}(M) \end{pmatrix} = \begin{pmatrix} 47.7 & 104.5 \\ 104.5 & 370 \end{pmatrix}$
Eigenvectors
Let M be an n×n matrix.
- v is an eigenvector of M if Mv = λv for some scalar λ (the corresponding eigenvalue)
- Then M(av) = λ(av) for any scalar a, so any nonzero multiple of an eigenvector is also an eigenvector; thus you can always choose eigenvectors of length 1: $v_1^2 + \cdots + v_n^2 = 1$
- Eigenvectors can only be found for a square matrix, and not every square matrix has (real) eigenvectors. For a symmetric n×n matrix such as a covariance matrix, there are n eigenvectors and they are orthogonal to one another.
Since Mv = λv, where v is an eigenvector of M and λ is the eigenvalue of M corresponding to v, we have Mv − λv = 0, i.e., (M − λI)v = 0 for a nonzero v, which implies that (M − λI) cannot have an inverse.
Therefore det(M − λI) = 0.
Example: let
$M = \begin{pmatrix} 1 & 3 \\ 3 & 5 \end{pmatrix}$
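A minimal sketch (not from the slides) that checks this numerically with NumPy, assuming the 2×2 example matrix reconstructed above (M = [[1, 3], [3, 5]]).

```python
import numpy as np

M = np.array([[1.0, 3.0],
              [3.0, 5.0]])

# eigenvalues are the roots of det(M - lambda*I) = lambda^2 - 6*lambda - 4 = 0,
# i.e., 3 +/- sqrt(13) (approximately 6.61 and -0.61; order may vary)
eigvals, eigvecs = np.linalg.eig(M)
print(eigvals)
print(eigvecs[:, 0])   # unit-length eigenvector for the first eigenvalue
```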
Principal Component Analysis (Steps)
Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data:
- Normalize the input data so that each attribute falls within the same range
- Compute k orthonormal (unit) vectors, i.e., the principal components
- Each input data vector is a linear combination of the k principal component vectors
- The principal components are sorted in order of decreasing "significance" or strength
- Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
Works for numeric data only
PCA steps
1. Given the original data set S = {x1, ..., xk}, produce a new data set by subtracting the mean of each attribute Ai from every xi (i.e., center the data). A sketch of the full procedure follows.
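A minimal Python sketch (not from the slides) of the PCA steps above: center the data, form the covariance matrix, take its eigenvectors, and project onto the top-k components. The toy data set is hypothetical.

```python
import numpy as np

# hypothetical 2-attribute data set
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])

X_centered = X - X.mean(axis=0)            # step 1: subtract each attribute's mean
cov = np.cov(X_centered, rowvar=False)     # covariance matrix of the attributes
eigvals, eigvecs = np.linalg.eigh(cov)     # eigen-decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]          # sort components by decreasing variance
components = eigvecs[:, order]

k = 1                                      # keep only the strongest component
reduced = X_centered @ components[:, :k]   # project the data onto the new space
print(reduced.shape)                       # (6, 1)
```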
Attribute Subset Selection
Another way to reduce the dimensionality of data.
Redundant attributes
- Duplicate much or all of the information contained in one or more other attributes
Irrelevant attributes
- Contain no information that is useful for the data mining task at hand
- E.g., a student's ID is often irrelevant to the task of predicting the student's GPA
Heuristic Search in Attribute Selection
Heuristic methods of attribute subset selection:
- Stepwise forward selection: starts with an empty set; at each step, the best of the original attributes is determined and added to the reduced set (see the sketch after this list)
- Stepwise backward elimination: starts with the full set of attributes; at each step, it removes the worst attribute remaining in the set
- Combination of forward selection and backward elimination: at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
- Decision tree induction: algorithms such as ID3, C4.5, and CART; all attributes that do not appear in the tree are assumed to be irrelevant
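A minimal Python sketch (not from the slides) of greedy stepwise forward selection. The `score(subset)` function is a hypothetical evaluation measure (e.g., cross-validated accuracy of a model trained on that attribute subset).

```python
def forward_selection(all_attrs, score):
    """Greedily add the attribute that most improves `score` until none helps."""
    selected = []
    best_score = float("-inf")
    while True:
        candidates = [a for a in all_attrs if a not in selected]
        if not candidates:
            break
        scored = [(score(selected + [a]), a) for a in candidates]
        top_score, top_attr = max(scored)
        if top_score <= best_score:
            break                    # no remaining attribute improves the score
        selected.append(top_attr)
        best_score = top_score
    return selected

# hypothetical demo: only attributes "a" and "b" are useful
useful = {"a", "b"}
demo_score = lambda subset: len(set(subset) & useful) - 0.01 * len(subset)
print(forward_selection(["a", "b", "c", "d"], demo_score))   # ['b', 'a']
```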
Histograms
- Another data reduction technique
- Divide the data into buckets and store the average (or sum) for each bucket
- Singleton buckets: each bucket represents only a single attribute-value / frequency pair
- Partitioning rules (a bucketing sketch follows):
  - Equal-width: the width of each bucket range is uniform
  - Equal-frequency (equal-depth): each bucket contains roughly the same number of samples
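A minimal Python sketch (not from the slides) of equal-width bucketing that stores the count and mean of each bucket; the price list is hypothetical.

```python
def equal_width_histogram(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)   # clamp the maximum value
        buckets[idx].append(v)
    return [(lo + i * width, lo + (i + 1) * width,
             len(b), sum(b) / len(b) if b else None)
            for i, b in enumerate(buckets)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15, 18, 20, 21, 25, 30]
for low, high, count, mean in equal_width_histogram(prices, 3):
    print(f"[{low:.1f}, {high:.1f}): count={count}, mean={mean}")
```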
Data Redundancy
- An attribute is redundant if it can be derived from another attribute or set of attributes
- Redundancies can be detected by correlation analysis, which tells how strongly one attribute's values vary with those of another
- For nominal attributes: chi-square test
- For numeric attributes: correlation coefficient or covariance
Chi-square test (for Nominal data)
The Chi-square test is intended to test how likely it is
that an observed distribution is due to chance.
It measures how well the observed distribution of
data fits with the distribution that is expected if the
variables are independent.
A Chi-square test is designed to analyze categorical
data.
Chi-square test (for Nominal data)
- Let attribute A have c distinct values a1, a2, …, ac and attribute B have r distinct values b1, b2, …, br
- A contingency table: a1, a2, …, ac form the columns and b1, b2, …, br form the rows
- Χ² (chi-square) test:
$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$
where $o_{ij}$ is the observed (actual) count and $e_{ij}$ is the expected count of the joint event $(A = a_j, B = b_i)$ under independence
- The larger the Χ² value, the more likely the variables are related
- The cells that contribute the most to the Χ² value are those whose actual count is very different from the expected count
Chi-Square Calculation: An Example
(Contingency table with columns "Play chess", "Not play chess", and "Sum (row)"; the cell counts are not reproduced here.)
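A minimal Python sketch (not from the slides) of the Χ² computation for a 2×2 contingency table. The counts below are hypothetical placeholders, since the original table values are not shown in this text.

```python
observed = [[250, 200],    # hypothetical counts for row 1
            [50, 1000]]    # hypothetical counts for row 2

row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
total = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_sums[i] * col_sums[j] / total   # expected count under independence
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))   # 507.93 for these counts
```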
Chi-Square
- A chi-squared test is used for hypothesis testing on categorical variables.
- The null hypothesis, denoted H0, represents a general or default position. Examples: no relationship exists between two measured parameters A and B; a certain treatment has no effect.
- Degrees of freedom (d.f.): the number of independent ways in which a dynamical system can move without violating any constraint imposed on it. In statistics, d.f. = n − 1, where n is the number of variables.
- Level of significance: a statistical assessment of whether the results obtained from an experiment are due to some error or simply to chance.
Correlation Analysis (Numeric Data)
Correlation coefficient (Pearson's product-moment coefficient):
$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, and $\sigma_A$ and $\sigma_B$ are the respective standard deviations.
Normalization
Normalization: scale attribute values to fall within a smaller, specified range.
Min-max normalization to [new_min_A, new_max_A]:
$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$
Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
Z-score normalization (μ: mean, σ: standard deviation of attribute A):
$v' = \frac{v - \mu_A}{\sigma_A}$
Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
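A minimal Python sketch (not from the slides) reproducing the two normalization examples above.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
```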