DWDM - Unit 3 - Data Preprocessing
Prepared by
Prof. T. M. Kodinariya
Department of Computer Engineering
Government Engineering College - Rajkot
1
Measuring the Central Tendency
• Mean
• Weighted arithmetic Mean
• Median
• Mode
2
Statistics: Mean
The arithmetic mean is commonly called the average.
Mean or Average is defined as the sum of all the given elements divided by the total
number of elements.
Formula:
Mean = sum of elements / number of elements
= (a1 + a2 + a3 + ... + an) / n
3
Statistics: Mean
• Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, ..., N.
• The weights reflect the significance, importance, or occurrence frequency attached
to their respective values.
• The weighted arithmetic mean is then (w1·x1 + w2·x2 + ... + wN·xN) / (w1 + w2 + ... + wN).
4
Statistics: Mean
• Although the mean is the single most useful quantity for describing a data set, it is
not always the best way of measuring the center of the data.
• A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
Even a small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by that
of a few highly paid managers. Similarly, the average score of a class in an exam
could be pulled down quite a bit by a few very low scores.
• The trimmed mean is the mean obtained after chopping off values at the
high and low extremes.
• For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both ends as this
can result in the loss of valuable information.
5
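A rough sketch (not from the slides) of how a trimmed mean can be computed; the salary list and the trimming fractions are made-up illustration values.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping trim_fraction of the values at each end (e.g., 2%)."""
    data = sorted(values)
    k = int(len(data) * trim_fraction)          # how many values to drop at each end
    trimmed = data[k:len(data) - k] if k else data
    return sum(trimmed) / len(trimmed)

salaries = [30, 32, 35, 36, 38, 40, 41, 42, 45, 300]   # in $1000s; 300 is an outlier
print(sum(salaries) / len(salaries))    # plain mean = 63.9, pulled up by the outlier
print(trimmed_mean(salaries, 0.10))     # 10% trim drops 30 and 300, giving 38.625
```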
Statistics: Mode
Mode Definition:
Mode is the most frequently occurring value in a frequency distribution.
Example: consider the distribution 3, 5, 7, 11, 3, 11, 11.
In this distribution,
the number 11 occurs 3 times,
the number 3 occurs 2 times,
the number 5 occurs 1 time,
the number 7 occurs 1 time.
So the value with the most occurrences is 11, and it is the mode of this distribution.
Mode = 11
6
Statistics: Mode
• Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then there is no mode.
7
Statistics: Median
Arithmetic Median Definition:
The median is the middle value of the given numbers (or distribution) when they are
arranged in ascending order. When the number of elements is even, the median is the
average of the two middle elements.
8
Statistics: Median
Example 2: To find the median of 4, 5, 7, 2, 1, 8 [Even]
Step 1: Count the total numbers given. There are 6 elements, so the count is even.
Step 2: Arrange the numbers in ascending order: 1, 2, 4, 5, 7, 8.
Step 3: The two middle elements are 4 and 5, so Median = (4 + 5) / 2 = 4.5.
9
Measuring the Dispersion of Data
• The degree to which numerical data tend to spread is called the dispersion,
or variance of the data.
• The most common measures of data dispersion are
• range,
• five-number summary (based on quartiles),
• the interquartile range,
• the standard deviation
• Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying outliers.
10
Measuring the Dispersion of Data
• The range of the set is the difference between the largest (max()) and
smallest (min()) values.
• The kth percentile of a set of data in numerical order is the value xi having
the property that k percent of the data entries lie at or below xi.
• The median is the 50th percentile.
11
Statistics: Variance & Standard Deviation
Variance
The average of the squared differences from the mean:
σ² = Σ(X − X̄)² / N
Standard Deviation:
σ = √[ Σ(X − X̄)² / N ]
12
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
13
Now we calculate each dog's difference from the mean: 206, 76, −224, 36, −94.
Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704
14
Variance: σ² = 21,704
Standard Deviation: σ = √21,704 ≈ 147 mm
So, using the Standard Deviation we have a "standard" way of knowing what is normal,
and what is extra large or extra small.
15
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
1. The minimum value of a data set is the least value in the set.
2. The first quartile, denoted by Q1 , is the median of the lower
half of the data set. This means that about 25% of the
numbers in the data set lie below Q1 and about 75% lie
above Q1 .
16
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
3. The median of a data set is the number that, when the set is
put into increasing order, divides the data into two equal
parts.
17
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
the upper half of the data is: {13, 14, 18, 21}, so Q3 = (14 + 18) / 2 = 16.
18
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
19
Five Number Summary: Exercise
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
Find the five-number summary for the data set {3, 7, 8, 5, 12, 14, 21, 15, 18, 14}.
20
Example: calculation of Mean, Mode,
Median, Standard deviation, variance and
five number summary
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
21
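For the exercise above, the statistics can be checked with a short script; this is an illustrative sketch, and the quartile convention (medians of the lower and upper halves) follows the five-number-summary method described earlier, so other conventions may give slightly different Q1/Q3.

```python
import statistics

data = sorted([34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37])

def median(vals):
    n = len(vals)
    return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2

mean = sum(data) / len(data)
med = median(data)
modes = statistics.multimode(data)    # every value occurs once here, so there is no real mode
variance = sum((x - mean) ** 2 for x in data) / len(data)   # population variance
std_dev = variance ** 0.5
lower, upper = data[:len(data) // 2], data[(len(data) + 1) // 2:]
five_number = (data[0], median(lower), med, median(upper), data[-1])

print(mean, med, modes, variance, std_dev, five_number)
```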
Box-Plot
• The box and whiskers chart shows you how your data is spread out.
• Five pieces of information (the “five number summary“) are generally included in the
chart:
1. The minimum (the smallest number in the data set). The minimum is shown at the far
left of the chart, at the end of the left “whisker.”
2. First quartile, Q1, is the far left of the box (or the far right of the left whisker).
3. The median is shown as a line in the center of the box.
4. Third quartile, Q3, shown at the far right of the box (at the far left of the right whisker).
5. The maximum (the largest number in the data set), shown at the far right of the chart, at the end of the right “whisker.”
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Draw a number line that will include the smallest and the largest data.
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Draw three vertical lines at the lower quartile (12), median (22) and the upper quartile (36),
just above the number line.
Join the lines for the lower quartile and the upper quartile to form a box.
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Draw a line from the smallest value (5) to the left side of the box and draw a line from the right
side of the box to the biggest value (53).
Interpreting a Box Plot
An outlier, is one that appears to deviate markedly from other members of the data set
in which it occurs.
Q1 = 3, Q3 = 12.5
IQR = 12.5 - 3 = 9.5.
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
IQR = Q3 - Q1 = 89 - 70 = 19.
Lower limit = Q1 - 1.5 · IQR = 70 - 1.5 *19 = 41.5
Upper limit = Q3 + 1.5 · IQR = 89 + 1.5 * 19 = 117.5
Lower adjacent value = 58
Upper adjacent value = 97
Since 24 lies outside the lower and upper limit, it is a potential outlier.
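The 1.5 × IQR rule can be automated as below. Note that quartile conventions differ: the slide's Q1 = 70 and Q3 = 89 come from an interpolation-based method, while the median-of-halves convention in this sketch gives Q1 = 71 and Q3 = 88; either way, 24 falls below the lower limit and is flagged as a potential outlier.

```python
data = sorted([24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97])

def median(vals):
    n = len(vals)
    return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2

n = len(data)
q1 = median(data[:n // 2])              # median of the lower half
q3 = median(data[(n + 1) // 2:])        # median of the upper half
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(q1, q3, iqr, lower_limit, upper_limit, outliers)   # 24 is the only potential outlier
```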
Box Plot Example
Min = 0
Q1 = 1.5
Median = 3
Q3 = 6
Max = 6
Box Plot Example
• Find Q1, Q2, and Q3 for the following data set. Identify any outliers, and draw a box-
and-whisker plot. {5,40,42,46,48,49,50,50,52,53,55,56,58,75,102}
Scatter Plot
Age of the Child    Height
3                   2.3
4                   2.7
5                   3.1
6                   3.6
7                   3.8
8                   4
9                   4.3
10                  4.5
Scatter Plot
[Figure: scatter plot with an outlier highlighted.]
Scatter Plot
54
Why Is Data Dirty?
Reasons for missing values / incomplete data:
• Data that were inconsistent with other recorded data may have
been deleted
55
Why Is Data Noisy?
57
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization and concept hierarchy generation
– Part of data reduction but with particular importance, especially for
numerical data
58
Forms of data preprocessing
60
Agenda
61
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– smooth out noisy data while identifying outliers
• Binning
• Clustering
• regression
62
Data Cleaning
63
RowID Variable_1 Variable_2 Variable_3
1 12.34 aa 12
2 34 33
3 44 cc 22
4 -433 ff
5 66
6 43 dd 33
7 #NAME? 66
8 6743 df 22
9 3 fg
10 4 88
11 3 g 55
12 34 dd 22
Handling missing values
1 12.34 aa 12
2 34 UNKNOWN 33
3 44 cc 22
4 -433 ff 41.9
6 43 dd 33
8 6743 df 22
9 3 fg 41.9
10 4 UNKNOWN 88
11 3 g 55
12 34 dd 22
64
Data Cleaning
• Use the attribute mean for all samples belonging to the same
class as the given tuple
– For example, if classifying customers according to credit risk, replace
the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
65
Data Cleaning
66
Handling Missing Values
• It is important to note that, in some cases, a missing value may not imply an error
in the data!
• For example, when applying for a credit card, candidates may be asked to supply
their driver’s license number. Candidates who do not have a driver’s license may
naturally leave this field blank. Forms should allow respondents to specify values
such as “not applicable”.
• Software routines may also be used to uncover other null values, such as “don’t
know”, “?”, or “none”.
• Ideally, each attribute should have one or more rules regarding the null condition.
The rules may specify whether or not nulls are allowed, and/or how such values
should be handled or transformed.
• Fields may also be intentionally left blank if they are to be provided in a later step
of the business process.
• Hence, although we can try our best to clean the data after it is seized, good design
of databases and of data entry procedures should help minimize the number of
missing values or errors in the first place.
67
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– smooth out noisy data while identifying outliers
• Binning
• Clustering
• regression
68
Data Cleaning: Noisy data
Noisy Data
• What is noise?
• Random error or variance in a measured variable.
69
Data Cleaning: Noisy data
Binning
• First sort data and partition into bins or buckets
• And then smooth out the data by consulting its
neighborhood, i.e., the values around it.
• Binning methods
• smooth by bin means,
• smooth by bin median,
• smooth by bin boundaries
70
Data Cleaning: Noisy data
Binning
• In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
• In smoothing by bin medians, each bin value is
replaced by the bin median.
• In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the
bin boundaries.
– Each bin value is then replaced by the closest boundary
value.
– In general, the larger the width, the greater the effect of the
smoothing.
71
Binning techniques
1. Equi-Depth (Equi-Frequency):
The algorithm divides the data into k groups such that
each group contains approximately the same number
of values.
2. Equi-Width:
The simplest binning approach is to partition the
range of the variable into k equal-width intervals.
The interval width is simply the range [A, B] of the
variable divided by k:
W = (max − min) / k
and the interval boundaries are:
min + W, min + 2W, ..., min + (k − 1)W
72
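A minimal sketch of the two partitioning rules just described; the data vector is an illustrative example (the same values are used in the equi-width example a couple of slides ahead).

```python
def equal_width_bins(values, k):
    """Partition values into k bins of equal interval width W = (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)   # the last bin is closed on the right
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, k):
    """Partition the sorted values into k bins containing (roughly) equal counts."""
    data = sorted(values)
    size = len(data) // k
    return [data[i * size:(i + 1) * size] if i < k - 1 else data[i * size:]
            for i in range(k)]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(data, 3))   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```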
Example
73
Data Cleaning: Noisy data
Binning
• Alternatively, bins may be equal-width,
• where the interval range of values in each bin is
constant.
75
Binning using equi-width (k = 3, W = (34 − 4) / 3 = 10), smoothing by means:

Var1 value:    4   8   9   15  21  21  24     25     26     28     29     34
Binned_Var1:   7   7   7   19  19  19  27.67  27.67  27.67  27.67  27.67  27.67

Intervals (From, To)    #records
[4, 14)                 3
[14, 24)                3
[24, 34]                6

Bin 1: [4, 14)  -> {4, 8, 9},                  mean = 7
Bin 2: [14, 24) -> {15, 21, 21},               mean = 19
Bin 3: [24, 34] -> {24, 25, 26, 28, 29, 34},   mean = 27.67
76
Data Cleaning: Noisy data
Question?
• The age values for the data tuples are (in increasing order)
• 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
77
Solution
• The median (middle value of the ordered set,
as the number of values in the set is odd) of
the data is: 25
78
Solution
• Step 1: Sort the data
• Step 2: Partition the data into equal-frequency
bins of size 3.
79
GTU Question
• What is noise? Explain data smoothing
methods as noise removal technique to divide
given data into bins of size 3 by bin partition
(equal frequency), by bin means, by bin
medians and by bin boundaries.
• Consider the data: 10, 2, 19, 18, 20, 18, 25, 28,
22 [Summer 2017, Winter 2018 – 7 marks]
80
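Not an official answer, but one way to sketch the three smoothing methods on the question's data with equal-frequency bins of size 3 (ties between the two bin boundaries are sent to the lower boundary here):

```python
data = sorted([10, 2, 19, 18, 20, 18, 25, 28, 22])       # [2, 10, 18, 18, 19, 20, 22, 25, 28]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]    # equal-frequency bins of size 3

by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)           # [[2, 10, 18], [18, 19, 20], [22, 25, 28]]
print(by_means)       # [[10.0, 10.0, 10.0], [19.0, 19.0, 19.0], [25.0, 25.0, 25.0]]
print(by_medians)     # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print(by_boundaries)  # [[2, 2, 18], [18, 18, 20], [22, 22, 28]]
```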
GTU Question
• Suppose a group of sales price records has
been sorted as follows:
• 6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204,
232 Partition them into three bins by
equal-frequency (equi-depth)
partitioning method. Perform data
smoothing by bin mean.
• [winter 2017 – 3 marks]
81
Data Cleaning: Noisy data
Clustering
• Outliers may be detected by clustering, where similar values
are organized into groups, or “clusters.”
• Intuitively, values that fall outside of the set of clusters may be
considered outliers
82
Data Cleaning: Noisy data
Cluster Analysis
83
Data Cleaning: Noisy data
Regression
• Data can be smoothed by fitting the data to a function, such as
with regression.
• Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to
predict the other.
• Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are
fit to a multidimensional surface.
84
Regression
[Figure: salary (y) plotted against age (x), with a fitted regression line y = x + 1 passing through the point (X1, Y1).]
85
Regression Analysis
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– They are estimated by applying the least-squares criterion to the known values of
Y1, Y2, …, and X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
86
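A small sketch of estimating w and b by the least-squares criterion; the (age, salary) pairs below are made-up illustration data, not taken from the slides.

```python
# Least-squares estimates for Y = w*X + b.
ages = [23, 25, 28, 32, 36, 41, 45, 50]
salaries = [24, 27, 31, 36, 40, 46, 50, 56]        # in $1000s (illustrative data)

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(salaries) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, salaries)) \
    / sum((x - mean_x) ** 2 for x in ages)
b = mean_y - w * mean_x

print(f"salary = {w:.3f} * age + {b:.3f}")
print("predicted salary at age 30:", round(w * 30 + b, 1))
```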
Agenda
98
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
99
Data Integration
101
Data integration
• For numerical attributes, we can evaluate the correlation between two attributes, A
and B, by computing the correlation coefficient (also known as Pearson’s product
moment coefficient):
r(A,B) = Σ (ai − Ā)(bi − B̄) / (N · σA · σB)
• For categorical attributes, the χ² (chi-square) statistic is used instead:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency (i.e., actual count) of the joint event
(Ai, Bj) and eij is the expected frequency of (Ai, Bj).
103
Data integration
• The expected frequency is calculated as
eij = ( count(A = ai) × count(B = bj) ) / N
• where N is the number of data tuples, count(A = ai) is the number of tuples having
value ai for A, and count(B = bj) is the number of tuples having value bj for B.
• The χ² statistic tests the hypothesis that A and B are independent. The test
is based on a significance level, with (r-1)(c-1) degrees of freedom.
• If the hypothesis can be rejected, then we say that A and B are statistically related
or associated.
(In the contingency table of the example that follows, the numbers in
parentheses are the expected frequencies.)
104
Data integration
• Suppose that a group of 1,500 people was surveyed. The gender of each person
was noted. Each person was polled as to whether their preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading.
105
Data integration
• For this 2 x 2 table, the
degrees of freedom are
(2-1)(2-1) = 1. For 1
degree of freedom, the
χ² value needed to reject
the hypothesis at the
0.001 significance level is
10.828
• Since our computed
value is above this, we
can reject the hypothesis
that gender and
preferred reading are
independent and
conclude that the two
attributes are (strongly)
correlated for the given
group of people.
106
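The contingency table itself is not reproduced in these notes, so the observed counts below are assumed, following the usual textbook version of this 1,500-person survey; the sketch shows how the expected frequencies and the χ² statistic quoted above are computed.

```python
# Observed counts (assumed illustration):
#            fiction   non-fiction
# male          250            50      (row total  300)
# female        200          1000      (row total 1200)
observed = [[250, 50], [200, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))

print(expected)               # [[90.0, 210.0], [360.0, 840.0]]
print(round(chi_square, 2))   # 507.93 > 10.828, so independence is rejected at the 0.001 level
```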
Data integration
• In addition to detecting redundancies between attributes, duplication
should also be detected at the tuple level (e.g., where there are two or
more identical tuples for a given unique data entry case).
• The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy
107
Data integration
• Note that correlation does not imply causality.
• That is, if A and B are correlated, this does not necessarily
imply that A causes B or that B causes A.
• For example, in analyzing a demographic database, we may
find that attributes representing the number of hospitals and
the number of car thefts in a region are correlated.
• This does not mean that one causes the other. Both are
actually causally linked to a third attribute, namely, population.
• For categorical (discrete) data, a correlation relationship
between two attributes, A and B, can be discovered by a chi-
square test.
108
Data integration
The following issues arise:
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units, different currency
– For a hotel chain, the price of rooms in different cities may involve not
only different currencies but also different services (such as free
breakfast) and taxes.
– An attribute in one system may be recorded at a lower level of
abstraction than the “same” attribute in another. For example, the total
sales in one database may refer to one branch of All Electronics, while an
attribute of the same name in another database may refer to the total
sales for All Electronics stores in a given region.
109
Data Transformation
• Smoothing: remove noise from data (such as binning,
regression, and clustering.)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
110
Data Transformation: Normalization
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
113
Data Transformation: Normalization
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
• Suppose that the recorded values of A range from −986 to 917.
• The maximum absolute value of A is 986.
• Dividing each value by 1,000 (i.e., j = 3), −986 normalizes to −0.986 and 917 normalizes to 0.917.
114
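A compact sketch of the three normalization methods, applied to the $73,600 income example for min-max and decimal scaling; the mean and standard deviation passed to the z-score call are the figures from the GTU salary question later in this unit, and the z-score formula v' = (v − mean) / std_dev is the standard one (its slide is not reproduced here).

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std_dev):
    return (v - mean) / std_dev

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / (10 ** j) >= 1:     # smallest j such that Max(|v'|) < 1
        j += 1
    return v / (10 ** j)

print(min_max(73_600, 12_000, 98_000))       # 0.716
print(z_score(73_600, 54_000, 16_000))       # 1.225
print(decimal_scaling(917, 986))             # 0.917  (A ranges from -986 to 917, so j = 3)
```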
Question:
(a) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
115
Question:
Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000
Mean = (200 + 300 + 400 + 600 + 1000) / 5 = 500
Variance = [(200 − 500)² + (300 − 500)² + (400 − 500)² + (600 − 500)² + (1000 − 500)²] / 5 = 80,000
Standard deviation σ = √80,000 ≈ 282.84
116
Question:
Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
117
GTU Question
118
GTU Question
119
Data Transformation: attribute construction
121
Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
• Solution?
– Data reduction…
122
Data Reduction
Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results
123
Data Reduction Techniques
• Data cube aggregation
• Attribute subset selection: where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed
• step-wise forward selection
• step-wise backward elimination
• combining forward selection and backward elimination
• decision-tree induction
• Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which need to store
only the model parameters instead of the actual data) or nonparametric methods such
as clustering, sampling, and the use of histograms
128
Attribute Subset Selection
• The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting
probability distribution of the data classes is as
close as possible to the original distribution
obtained using all attributes.
• Advantage
– Faster execution
– It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns
easier to understand (i.e., the simplicity measure improves)
129
How can we find a ‘good’ subset of the original attributes?”
Brute-force approach:
• For n attributes, there are 2^n possible subsets.
• Try all possible feature subsets as input to the data mining algorithm.
• This is prohibitively time-consuming as n grows.
130
Heuristic methods: step-wise forward selection
131
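The figure for this slide is not reproduced; as an illustrative sketch, greedy step-wise forward selection starts from an empty attribute set and repeatedly adds whichever remaining attribute improves an evaluation score the most. The score function below is a hypothetical stand-in for whatever criterion (e.g., cross-validated accuracy or information gain) is actually used.

```python
def forward_selection(attributes, score, max_features=None):
    """Greedy step-wise forward selection; score(subset) returns 'higher is better'."""
    selected, remaining = [], list(attributes)
    best_score = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate_scores = {a: score(selected + [a]) for a in remaining}
        best_attr = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_attr] <= best_score:
            break                             # no remaining attribute improves the score
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score = candidate_scores[best_attr]
    return selected

# Toy scoring function: pretend only A1, A4 and A6 are relevant (small penalty per attribute).
relevant = {"A1", "A4", "A6"}
def toy_score(subset):
    return len(relevant & set(subset)) - 0.01 * len(subset)

print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))  # ['A1', 'A4', 'A6']
```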
Heuristic methods: step-wise backward Elimination
132
Heuristic methods: combining forward selection and backward
elimination
133
Heuristic methods: Decision Tree Induction
134
Heuristic methods: Decision Tree Induction
135
Data Reduction Techniques
– Data cube aggregation
– Attribute subset selection
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
– Numerosity reduction
Regression and log –linear model
Histogram, Clustering, Sampling…..
– Dimensionality reduction / Data Compression
Wavelet transforms
Principal components analysis.
– Discretization and concept hierarchy generation
140
Numerosity Reduction
141
Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear
function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
142
Parametric Data Reduction: Linear Regression
Linear regression
• In (simple) linear regression, the data are modeled to fit a straight line.
• For example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
Y=wX+b
• where the variance of y is assumed to be constant.
• In the context of data mining, x and y are numerical database
attributes.
• The coefficients, w and b (called regression coefficients), specify the
slope of the line and the Y-intercept, respectively.
• These coefficients can be solved for by the method of least squares,
which minimizes the error between the actual line separating the
data and the estimate of the line.
143
Linear Regression: Example
145
Log-Linear Models
Log-linear models:
Log(y)= b0 + b1 log(X1 )+ b2 log(X2 )
• Log-linear models approximate discrete multidimensional probability
distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes), can
consider each tuple as a point in an n-dimensional space.
• Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower
dimensional spaces.
• Log-linear models are therefore also useful for dimensionality reduction
(since the lower-dimensional points together typically occupy less space than
the original data points) and data smoothing (since aggregate estimates in the
lower-dimensional space are less subject to sampling variations than the
estimates in the higher-dimensional space)
146
147
Histograms
148
Histograms
To further reduce the data, it is common to have each bucket
denote a continuous range of values for the given attribute.
150
“How are the buckets determined and the attribute values partitioned?”
Equal-width:
– It divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
151
“How are the buckets determined and the attribute values partitioned?”
Histogram types
• Equal-width histograms:
– It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately
same number of samples
• V-optimal:
– It considers all histogram types for a given number of buckets and
chooses the one with the least variance.
154
V-Optimal
Take a simple set of data, for example, a list of integers:
1, 3, 4, 7, 2, 8, 3, 6, 3, 6, 8, 2, 1, 6, 3, 5, 3, 4, 7, 2, 6, 7, 2
155
(1, 2), (2, 4), (3, 5), (4, 2), (5, 1), (6, 4), (7, 3), (8, 2)
Option 1: Bucket 1 holds values 1–4, Bucket 2 holds values 5–8.
Bucket 1:
Average frequency 3.25
Weighted variance 2.28
Bucket 2:
Average frequency 2.5
Weighted variance 2.19
Sum of Weighted Variance 4.47
156
(1, 2), (2, 4), (3, 5), (4, 2), (5, 1), (6, 4), (7, 3), (8, 2)
Option 2: Bucket 1 holds values 1–2, Bucket 2 holds values 3–8.
Bucket 1:
Average frequency 3
Weighted variance 1.41
Bucket 2:
Average frequency 2.83
Weighted variance 3.29
Sum of Weighted Variance 4.70
The first choice is better
157
“How are the buckets determined and the attribute values partitioned?”
Histogram types
• Equal-width histograms:
– It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately
same number of samples
• V-optimal:
– It considers all histogram types for a given number of buckets and
chooses the one with the least variance.
• MaxDiff:
– After sorting the data to be approximated, it defines the borders of
the buckets at points where the adjacent values have the maximum
difference
• Example: split 1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32 into three buckets.
MaxDiff places the bucket borders at the two largest adjacent gaps, 27−18 (= 9) and 14−9 (= 5),
giving the buckets {1, 1, 4, 5, 5, 7, 9}, {14, 16, 18} and {27, 30, 30, 32}.
158
Histograms
• V-Optimal and MaxDiff histograms tend to be
the most accurate and practical.
• Histograms are highly effective at approximating
both sparse and dense data, as well as highly
skewed and uniform data.
• Multidimensional histograms can capture
dependencies between attributes.
159
Clustering
• Partition data set into clusters, and store cluster representation
only
• Quality of clusters measured by their diameter (max distance
between any two objects in the cluster) or centroid distance (avg.
distance of each cluster object from its centroid)
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-
dimensional index tree structures (B+-tree, R-tree, quad-tree, etc))
• There are many choices of clustering definitions and clustering
algorithms (further details later)
161
Sampling
• Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
• Sampling is used in data mining because processing the entire set
of data of interest is too expensive or time consuming.
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
164
Sampling …
• The key principle for effective sampling is the
following:
– using a sample will work almost as well as using
the entire data sets, if the sample is representative
165
Types of Sampling
Simple random sampling
There is an equal probability of selecting
any particular item
166
Sampling: With or without Replacement
[Figure: from the raw data, a simple random sample of tuples is drawn without replacement (SRSWOR) and with replacement (SRSWR).]
167
Sampling: With or without Replacement
168
Types of Sampling (continue)
Cluster Sample
Let D = total number of tuples
M = number of mutually disjoint clusters into which the tuples of D are grouped
s = number of clusters drawn from the M clusters (an SRS of s < M clusters)
170
Types of Sampling (continue)
Stratified sample
Let D = total number of tuples
• If D is divided into mutually disjoint parts called strata, a
stratified sample of D is generated by obtaining an SRS at each
stratum.
• This helps ensure a representative sample, especially when the
data are skewed.
• For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group. In
this way, the age group having the smallest number of
customers will be sure to be represented.
171
Stratified sample
172
Advantage of sampling
An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the data set size.
173
Question
Sketch examples of each of the following sampling techniques:
SRSWOR,
SRSWR,
cluster sampling,
stratified sampling.
Use samples of size 5 and
the strata “youth,” “middle-aged,” and “senior.”
The age values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40,
45, 46, 52, 70.
174
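An illustrative way to generate such samples with Python's random module; the stratum cut-offs (youth < 25, middle-aged 25–45, senior > 45) and the per-stratum sample sizes are assumptions made for this sketch.

```python
import random

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

srswor = random.sample(ages, 5)                   # simple random sample WITHOUT replacement
srswr = [random.choice(ages) for _ in range(5)]   # simple random sample WITH replacement

# Cluster sample: group the tuples into M = 3 disjoint clusters, then draw s = 1 cluster.
clusters = [ages[i:i + 9] for i in range(0, len(ages), 9)]
cluster_sample = random.choice(clusters)

# Stratified sample: an SRS from each stratum (age cut-offs are assumed for this sketch).
strata = {
    "youth": [a for a in ages if a < 25],
    "middle-aged": [a for a in ages if 25 <= a <= 45],
    "senior": [a for a in ages if a > 45],
}
stratified = {name: random.sample(group, min(2, len(group))) for name, group in strata.items()}

print(srswor, srswr, cluster_sample, stratified, sep="\n")
```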
Answer
175
Answer
176
Answer
177
Answer
178
Data Reduction Techniques
– Data cube aggregation
– Attribute subset selection
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
– Numerosity reduction
Regression and log –linear model
Histogram, Clustering, Sampling…..
– Dimensionality reduction / Data Compression
Wavelet transforms
Principal components analysis.
– Discretization and concept hierarchy generation
179
Data Compression
180
Data Compression
181
Wavelet Transforms
• Discrete wavelet transform (DWT): linear signal processing
• In this, a data vector D is transformed into a numerically different vector
D’ of wavelet coefficients. Both vectors have the same length.
• If both vectors have the same length, how is the reduction achieved?
182
Wavelet Transforms
• Wavelet transformed data can be truncated.
• Compressed approximation: store only a small fraction of the strongest
of the wavelet coefficients
• For example, all wavelet coefficients larger than some user-specified threshold
can be retained; the remaining coefficients are set to zero.
• Similar to discrete Fourier transform (DFT), but better lossy
compression.
• That is, if the same number of coefficients is retained for a DWT and a
DFT of a given data vector, the DWT version will provide a more
accurate approximation of the original data.
• Hence, for an equivalent approximation, the DWT requires less space
than the DFT.
• Unlike the DFT, wavelets are quite localized in space, contributing to the
conservation of local detail.
183
Wavelet Transforms
• Method (hierarchical pyramid algorithm):
– Length, L, must be an integer power of 2 (padding with 0s, when necessary)
– Each transform has 2 functions:
• smoothing (e.g., sum, weighted avg.), weighted difference
– Applies to pairs of data, resulting in two sets of data of length L/2
– Applies the two functions recursively, until reaches the desired length
– A selection of values from the data sets obtained in the above iterations is designated as
the wavelet coefficients of the transformed data
185
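A minimal sketch of the pairwise smoothing/differencing pyramid for the (unnormalized) Haar wavelet; production DWT code applies extra scaling, but this shows the recursive structure described above. The input vector is the standard textbook example.

```python
def haar_transform(data):
    """Unnormalized Haar DWT via the hierarchical pyramid algorithm.

    The length must be a power of 2 (pad with zeros beforehand if it is not).
    Returns [overall average] + detail coefficients, coarsest first.
    """
    coeffs = []
    current = list(data)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs        # keep finer-level details toward the end
        current = averages               # recurse on the smoothed, half-length vector
    return current + coeffs

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]; truncating small coefficients
# to zero gives the compressed approximation.
```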
Example
Second iteration
186
Example
Third iteration
187
Example
188
Principal Component Analysis (PCA)
Principal Component Analysis, or PCA, is an unsupervised
dimensionality-reduction method that is used to reduce the
dimensionality of large data sets, by transforming a large set of
variables into a smaller one that still contains most of the
information in the large set.
It transforms the variables into a new set of variables called
principal components.
These principal components are linear combinations of the original
variables and are orthogonal, i.e., uncorrelated with
each other (the correlation between any pair of components is zero).
Principal Component Analysis
Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables.
These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within
the initial variables is squeezed or compressed into the first components.
So, the idea is that 10-dimensional data gives us 10 principal components, but
PCA tries to put the maximum possible information in the first component, then
the maximum remaining information in the second, and so on, until 10 principal
components are generated.
Organizing information in principal components this way allows us to reduce
dimensionality without losing much information, by discarding the components
with low information and treating the remaining components as the new variables.
Objective of PCA
1. The new features are distinct i.e. the covariance between the
new features (in case of PCA, they are the principal
components) is 0.
2. The principal components are generated in order of the
variability in the data that it captures. Hence, the first
principal component should capture the maximum variability,
the second one should capture the next highest variability etc.
3. The sum of the variance of the new features / the principal
components should be equal to the sum of the variance of
the original features.
Principal Component Analysis
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a
good approximation of the original data)
• Works for numeric data only
192
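A compact numpy sketch of these steps (center the attributes, compute the covariance matrix, sort the eigenvectors by decreasing eigenvalue, project onto the top k); the 2-D input matrix is made-up illustration data rather than the slides' worked example.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k strongest principal components."""
    X_centered = X - X.mean(axis=0)                   # normalize: center each attribute
    cov = np.cov(X_centered, rowvar=False)            # covariance matrix of the attributes
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # symmetric matrix -> real eigenpairs
    order = np.argsort(eigenvalues)[::-1]             # sort by decreasing "significance"
    components = eigenvectors[:, order[:k]]           # k orthonormal principal components
    return X_centered @ components, components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
scores, components = pca(X, k=1)      # reduce the 2-D data to its strongest component
print(components.T)                   # direction of maximum variance
print(scores.ravel())                 # the data expressed in that 1-D basis
```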
EXAMPLE: PRINCIPAL COMPONENT ANALYSIS (Feature Extraction)
193
PRINCIPAL COMPONENT ANALYSIS (Feature Extraction)
Φi = Γi − Ψ, i = 1, 2, ..., M   (each training image Γi minus the mean image Ψ)
194
PRINCIPAL COMPONENT ANALYSIS (Feature Extraction)
Step 6: Represent each face image using eigenfaces:
WK = UK^T Φi,   K = 1, 2, ..., M
where U1, U2, ..., UM are the eigenvectors and λ1, λ2, ..., λM are the
eigenvalues corresponding to those eigenvectors.
The eigenvectors are sorted in descending order of eigenvalues.
195
Data Reduction Techniques
– Data cube aggregation
– Attribute subset selection
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
– Dimensionality reduction / Data Compression
Wavelet transforms
Principal components analysis.
– Numerosity reduction
Regression and log –linear model
Histogram, Clustering, Sampling…..
– Discretization and concept hierarchy generation
196
Discretization and Concept Hierarchy
• Discretization
– reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
– Interval labels can then be used to replace actual data values.
– A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
• Concept Hierarchies
– reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
197
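As a tiny illustration of climbing such a hierarchy (the interval labels and cut-offs below are assumptions, not taken from the slide):

```python
def age_concept(age):
    """Replace a numeric age with a higher-level concept label (illustrative cut-offs)."""
    if age < 25:
        return "young"
    elif age <= 45:
        return "middle-aged"
    return "senior"

ages = [13, 22, 25, 33, 35, 46, 70]
print([(a, age_concept(a)) for a in ages])
# [(13, 'young'), (22, 'young'), (25, 'middle-aged'), ..., (70, 'senior')]
```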
Introduction to Dimensionality Reduction
228
The Curse of Dimensionality
229
Benefits of applying Dimensionality Reduction
There are two ways to apply the dimension reduction technique, which
are given below:
1. Feature selection is the process of selecting the subset of the
relevant features and leaving out the irrelevant features present in
a dataset to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.
2. Feature extraction is the process of transforming the space
containing many dimensions into space with fewer dimensions.
This approach is useful when we want to keep the whole
information but use fewer resources while processing the
information.
• Principal Component Analysis
• Linear Discriminant Analysis
• Kernel PCA
• Quadratic Discriminant Analysis
231
CUR Decomposition
CUR matrix decomposition is a low-rank matrix decomposition
algorithm that is explicitly expressed in terms of a small number of actual
columns and/or actual rows of the data matrix.
If A is sparse, then the two large matrices C and R are also sparse; only the
middle matrix U is dense, but this matrix is small, so its density does not
hurt too much.
232
CUR Decomposition
A ≈ C U R
where U is the pseudo-inverse of
the intersection of C and R
233
Definition of CUR Decomposition
234
Computing U
[Figure: W is the submatrix of A at the intersection of the chosen columns C and rows R.]
U = W+  (the Moore–Penrose pseudo-inverse of W)
235
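A hedged numpy sketch of this step: given chosen column indices (for C) and row indices (for R), W is their intersection inside A and U is its Moore–Penrose pseudo-inverse, so that C·U·R approximates A. The random, probability-proportional selection and the sqrt(r × probability) scaling described next are omitted, and the ratings matrix below is assumed (the standard textbook example consistent with the "sum of squares = 243" figure in the example that follows).

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Build C, U, R from chosen columns/rows of A; U is the pseudo-inverse of W."""
    C = A[:, col_idx]                      # actual columns of A
    R = A[row_idx, :]                      # actual rows of A
    W = A[np.ix_(row_idx, col_idx)]        # intersection of the chosen rows and columns
    U = np.linalg.pinv(W)                  # U = W+  (Moore-Penrose pseudo-inverse)
    return C, U, R

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

# Columns 1 and 3 play the role of the "Alien" and "Casablanca" columns in the example.
C, U, R = cur_from_indices(A, col_idx=[1, 3], row_idx=[3, 5])
print(np.round(C @ U @ R, 2))   # reproduces A exactly here, since A has rank 2
```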
Choosing Columns and Rows properly
236
Example: CUR Decomposition
The sum of the squares of the elements of the matrix M is 243.
Suppose we first select Alien and then Casablanca as the columns of C. We scale each
column by dividing its elements by sqrt(r × the probability of selecting that column).
Alien: Casablanca :
divided by = 0.430
238
Example: CUR Decomposition
Let r= 2
R Matrix:
239
Example: CUR Decomposition
Selection of U Matrix:
240
Example: CUR Decomposition
Selection of U Matrix:
2. Calculate SVD of W
241
Example: CUR Decomposition
242
Another Example: CUR Decomposition
243
Feature Selection
Filter Method: uses variable-ranking techniques to order and select
the variables; here the selection of features is independent of the
classifier used. Examples: Information Gain, Chi-square test,
Variance Threshold, etc.
Wrapper Method: Wrappers require some method to search the
space of all possible subsets of features, assessing their quality by
learning and evaluating a classifier with that feature subset. The
feature selection process is based on a specific machine learning
algorithm that we are trying to fit on a given dataset. Example:
Forward feature selection, backward feature selection, Recursive
feature Elimination
Embedded Methods: Embedded methods check the different
training iterations of the machine learning model and evaluate the
importance of each feature. Common techniques of embedded
methods include LASSO regularization (L1), Elastic Net, etc.
245
GTU Question
Minimum salary is 20,000 Rs and maximum salary is 1,70,000 Rs. Map the salary 1,00,000 Rs onto the
new range of (60,000, 2,60,000) Rs using the min-max normalization method. [4 marks]
If the mean salary is 54,000 Rs and the standard deviation is 16,000 Rs, then find the z-score value of
a 73,600 Rs salary. [3 Marks]
Explain Mean, Median, Mode, Variance, Standard Deviation & five number summary with
suitable database example. [ 7 Marks]
List and describe methods for handling missing values in data cleaning. [3 marks]
What is noise? Explain data smoothing methods as noise removal technique to divide given
data into bins of size 3 by bin partition (equal frequency), by bin means, by bin medians and
by bin boundaries. Consider the data: 10, 2, 19, 18, 20, 18, 25, 28, 22 [ 7 marks]
Enlist the preprocessing steps with example. Explain procedure of any technique of
preprocessing [7 marks]
What is noise? Explain binning methods for data smoothing [4 marks]
Explain various data normalization techniques. [7 marks]
Enlist data reduction strategies and explain any two [7 Marks]
Explain the following data normalization techniques: (i) min-max normalization and (ii)
decimal scaling [7 marks]
In real-world data, tuples with missing values for some attributes are a common occurrence.
Describe various methods for handling this problem. [7 marks]
In data pre-processing why we need data smoothing? Discuss data smoothing by Binning. [7
marks]
246
GTU Question
Explain the need for data smoothing during pre-processing and
discuss data smoothing by Binning [7 marks]
Use min-max normalization method to normalize the following group of data by setting min
= 0 and max = 1 200, 300, 400, 600, 1000 [ 3 marks]
Describe various methods for handling missing data values [4 marks]
Suppose a group of sales price records has been sorted as follows: 6, 9, 12, 13, 15, 25, 50, 70,
72, 92, 204, 232 Partition them into three bins by equal-frequency (equi-depth) partitioning
method. Perform data smoothing by bin mean. [3 marks]
Explain the following terms: Numerosity reduction, Data Integration, Data transformation [3
marks]
Explain sampling methods for data reduction. [7 marks]
Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 23, 29, 35, 41, 44, 53, 62, 69, 72 Use
min-max normalization to transform the value 45 for age onto the range [0.0, 1.0] [4 marks]
Define noise data. Enlist the reasons for the presence of noise in data collection. Explain the
methods to deal with noise. [7 marks]
247