DWDM - Unit 3 - Data Preprocessing
Prepared by
Prof. T. M. Kodinariya
Department of Computer Engineering
Government Engineering College - Rajkot
1
Measuring the Central Tendency
• Mean
• Weighted arithmetic Mean
• Median
• Mode
2
Statistics: Mean
The arithmetic mean is commonly called the average.
Mean or Average is defined as the sum of all the given elements divided by the total
number of elements.
Formula:
Mean = sum of elements / number of elements
= (a1 + a2 + a3 + ... + an) / n
3
Statistics: Mean
• Sometimes, each value xi in a set may be associated with a weight wi, for i = 1, ..., N.
• The weights reflect the significance, importance, or occurrence frequency attached
to their respective values.
• The weighted arithmetic mean is then (w1·x1 + w2·x2 + ... + wN·xN) / (w1 + w2 + ... + wN).
4
Statistics: Mean
• Although the mean is the single most useful quantity for describing a data set, it is
not always the best way of measuring the center of the data.
• A major problem with the mean is its sensitivity to extreme (e.g., outlier) values.
Even a small number of extreme values can corrupt the mean.
• For example, the mean salary at a company may be substantially pushed up by that
of a few highly paid managers. Similarly, the average score of a class in an exam
could be pulled down quite a bit by a few very low scores.
• The trimmed mean is the mean obtained after chopping off values at the
high and low extremes.
• For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean.
• We should avoid trimming too large a portion (such as 20%) at both ends as this
can result in the loss of valuable information.
5
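A rough sketch (not from the slides) of how a trimmed mean can be computed; the salary list and the trimming fractions are made-up illustration values.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping trim_fraction of the values at each end (e.g., 2%)."""
    data = sorted(values)
    k = int(len(data) * trim_fraction)          # how many values to drop at each end
    trimmed = data[k:len(data) - k] if k else data
    return sum(trimmed) / len(trimmed)

salaries = [30, 32, 35, 36, 38, 40, 41, 42, 45, 300]   # in $1000s; 300 is an outlier
print(sum(salaries) / len(salaries))    # plain mean = 63.9, pulled up by the outlier
print(trimmed_mean(salaries, 0.10))     # 10% trim drops 30 and 300, giving 38.625
```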
Statistics: Mode
Mode Definition:
Mode is the most frequently occurring value in a frequency distribution.
Example: consider the distribution 3, 5, 7, 11, 3, 11, 11.
In this distribution,
the number 11 occurs 3 times,
the number 3 occurs 2 times,
the number 5 occurs 1 time,
the number 7 occurs 1 time.
So the value with the most occurrences is 11, and it is the mode of this distribution.
Mode = 11
6
Statistics: Mode
• Data sets with one, two, or three modes are respectively called unimodal, bimodal,
and trimodal.
• In general, a data set with two or more modes is multimodal.
• At the other extreme, if each data value occurs only once, then there is no mode.
7
Statistics: Median
Arithmetic Median Definition:
The median is the middle value of the given numbers (or distribution) when they are
arranged in ascending order. When the number of elements is even, the median is the
average of the two middle elements.
8
Statistics: Median
Example 2: To find the median of 4, 5, 7, 2, 1, 8 [Even]
Step 1: Count the total numbers given. There are 6 elements, so the count is even.
Step 2: Arrange the numbers in ascending order: 1, 2, 4, 5, 7, 8.
Step 3: The two middle elements are 4 and 5, so Median = (4 + 5) / 2 = 4.5.
9
Measuring the Dispersion of Data
• The degree to which numerical data tend to spread is called the dispersion,
or variance of the data.
• The most common measures of data dispersion are
• range,
• five-number summary (based on quartiles),
• the interquartile range,
• the standard deviation
• Boxplots can be plotted based on the five-number summary and are a
useful tool for identifying outliers.
10
Measuring the Dispersion of Data
• The range of the set is the difference between the largest (max()) and
smallest (min()) values.
• The kth percentile of a set of data in numerical order is the value xi having
the property that k percent of the data entries lie at or below xi.
• The median is the 50th percentile.
11
Statistics: Variance & Standard Deviation
Variance
The average of the squared differences from the mean:
σ² = Σ(X − X̄)² / N
Standard Deviation:
σ = √[ Σ(X − X̄)² / N ]
12
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
13
Now we calculate each dog's difference from the mean: 206, 76, −224, 36, −94.
Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108,520 / 5 = 21,704
14
Variance: σ² = 21,704
Standard Deviation: σ = √21,704 ≈ 147 mm
So, using the Standard Deviation we have a "standard" way of knowing what is normal,
and what is extra large or extra small.
15
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
1. The minimum value of a data set is the least value in the set.
2. The first quartile, denoted by Q1 , is the median of the lower
half of the data set. This means that about 25% of the
numbers in the data set lie below Q1 and about 75% lie
above Q1 .
16
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
3. The median of a data set is the number that, when the set is
put into increasing order, divides the data into two equal
parts.
17
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
the upper half of the data is: {13, 14, 18, 21}, so Q3 = (14 + 18) / 2 = 16.
18
Five Number Summary
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
19
Five Number Summary: Exercise
• The five-number summary of a data set consists of the five
numbers determined by computing the minimum, Q1 ,
median, Q3 , and maximum of the data set.
Find the five-number summary for the data set {3, 7, 8, 5, 12, 14, 21, 15, 18, 14}.
20
Example: calculation of Mean, Mode,
Median, Standard deviation, variance and
five number summary
34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37
21
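For the exercise above, the statistics can be checked with a short script; this is an illustrative sketch, and the quartile convention (medians of the lower and upper halves) follows the five-number-summary method described earlier, so other conventions may give slightly different Q1/Q3.

```python
import statistics

data = sorted([34, 47, 1, 15, 57, 24, 20, 11, 19, 50, 28, 37])

def median(vals):
    n = len(vals)
    return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2

mean = sum(data) / len(data)
med = median(data)
modes = statistics.multimode(data)    # every value occurs once here, so there is no real mode
variance = sum((x - mean) ** 2 for x in data) / len(data)   # population variance
std_dev = variance ** 0.5
lower, upper = data[:len(data) // 2], data[(len(data) + 1) // 2:]
five_number = (data[0], median(lower), med, median(upper), data[-1])

print(mean, med, modes, variance, std_dev, five_number)
```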
Box-Plot
• The box and whiskers chart shows you how your data is spread out.
• Five pieces of information (the “five number summary“) are generally included in the
chart:
1. The minimum (the smallest number in the data set). The minimum is shown at the far
left of the chart, at the end of the left “whisker.”
2. First quartile, Q1, is the far left of the box (or the far right of the left whisker).
3. The median is shown as a line in the center of the box.
4. Third quartile, Q3, shown at the far right of the box (at the far left of the right whisker).
5. The maximum (the largest number in the data set), shown at the far right of the chart, at the end of the right “whisker.”
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Draw a number line that will include the smallest and the largest data.
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Draw three vertical lines at the lower quartile (12), median (22) and the upper quartile (36),
just above the number line.
Join the lines for the lower quartile and the upper quartile to form a box.
Example: Box-Plot
Construct a box plot for the following data: 12, 5, 22, 30, 7, 36, 14, 42, 15, 53, 25
Draw a line from the smallest value (5) to the left side of the box and draw a line from the right
side of the box to the biggest value (53).
Interpreting a Box Plot
An outlier, is one that appears to deviate markedly from other members of the data set
in which it occurs.
Q1 = 3, Q3 = 12.5
IQR = 12.5 - 3 = 9.5.
24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97
IQR = Q3 - Q1 = 89 - 70 = 19.
Lower limit = Q1 - 1.5 · IQR = 70 - 1.5 *19 = 41.5
Upper limit = Q3 + 1.5 · IQR = 89 + 1.5 * 19 = 117.5
Lower adjacent value = 58
Upper adjacent value = 97
Since 24 lies outside the lower and upper limit, it is a potential outlier.
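The 1.5 × IQR rule can be automated as below. Note that quartile conventions differ: the slide's Q1 = 70 and Q3 = 89 come from an interpolation-based method, while the median-of-halves convention in this sketch gives Q1 = 71 and Q3 = 88; either way, 24 falls below the lower limit and is flagged as a potential outlier.

```python
data = sorted([24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97])

def median(vals):
    n = len(vals)
    return vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2

n = len(data)
q1 = median(data[:n // 2])              # median of the lower half
q3 = median(data[(n + 1) // 2:])        # median of the upper half
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_limit or x > upper_limit]
print(q1, q3, iqr, lower_limit, upper_limit, outliers)   # 24 is the only potential outlier
```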
Box Plot Example
Min = 0
Q1 = 1.5
Median = 3
Q3 = 6
Max = 6
Box Plot Example
• Find Q1, Q2, and Q3 for the following data set. Identify any outliers, and draw a box-
and-whisker plot. {5,40,42,46,48,49,50,50,52,53,55,56,58,75,102}
Scatter Plot
Age of the Child    Height
3                   2.3
4                   2.7
5                   3.1
6                   3.6
7                   3.8
8                   4
9                   4.3
10                  4.5
Scatter Plot
[Figure: scatter plot with an outlier highlighted.]
Scatter Plot
54
Why Is Data Dirty?
Reasons for missing values / incomplete data:
• Data that were inconsistent with other recorded data may have
been deleted
55
Why Is Data Noisy?
57
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization and aggregation
• Data reduction
– Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization and concept hierarchy generation
– Part of data reduction but with particular importance, especially for
numerical data
58
Forms of data preprocessing
60
Agenda
61
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– smooth out noisy data while identifying outliers
• Binning
• Clustering
• regression
62
Data Cleaning
63
RowID Variable_1 Variable_2 Variable_3
1 12.34 aa 12
2 34 33
3 44 cc 22
4 -433 ff
5 66
6 43 dd 33
7 #NAME? 66
8 6743 df 22
9 3 fg
10 4 88
11 3 g 55
12 34 dd 22
Handling missing values
1 12.34 aa 12
2 34 UNKNOWN 33
3 44 cc 22
4 -433 ff 41.9
6 43 dd 33
8 6743 df 22
9 3 fg 41.9
10 4 UNKNOWN 88
11 3 g 55
12 34 dd 22
64
Data Cleaning
• Use the attribute mean for all samples belonging to the same
class as the given tuple
– For example, if classifying customers according to credit risk, replace
the missing value with the average income value for customers in the
same credit risk category as that of the given tuple.
65
Data Cleaning
66
Handling Missing Values
• It is important to note that, in some cases, a missing value may not imply an error
in the data!
• For example, when applying for a credit card, candidates may be asked to supply
their driver’s license number. Candidates who do not have a driver’s license may
naturally leave this field blank. Forms should allow respondents to specify values
such as “not applicable”.
• Software routines may also be used to uncover other null values, such as “don’t
know”, “?”, or “none”.
• Ideally, each attribute should have one or more rules regarding the null condition.
The rules may specify whether or not nulls are allowed, and/or how such values
should be handled or transformed.
• Fields may also be intentionally left blank if they are to be provided in a later step
of the business process.
• Hence, although we can try our best to clean the data after it is seized, good design
of databases and of data entry procedures should help minimize the number of
missing values or errors in the first place.
67
Data Cleaning
• Data cleaning tasks
– Fill in missing values
– smooth out noisy data while identifying outliers
• Binning
• Clustering
• regression
68
Data Cleaning: Noisy data
Noisy Data
• What is noise?
• Random error or variance in a measured variable.
69
Data Cleaning: Noisy data
Binning
• First sort data and partition into bins or buckets
• And then smooth out the data by consulting its
neighborhood, i.e., the values around it.
• Binning methods
• smooth by bin means,
• smooth by bin median,
• smooth by bin boundaries
70
Data Cleaning: Noisy data
Binning
• In smoothing by bin means, each value in a bin is
replaced by the mean value of the bin.
• In smoothing by bin medians, each bin value is
replaced by the bin median.
• In smoothing by bin boundaries, the minimum and
maximum values in a given bin are identified as the
bin boundaries.
– Each bin value is then replaced by the closest boundary
value.
– In general, the larger the width, the greater the effect of the
smoothing.
71
Binning techniques
1. Equi-Depth (Equi-Frequency):
The algorithm divides the data into k groups such that
each group contains approximately the same number
of values.
2. Equi-Width:
The simplest binning approach is to partition the
range of the variable into k equal-width intervals.
The interval width is simply the range [A, B] of the
variable divided by k:
W = (max − min) / k
and the interval boundaries are:
min + W, min + 2W, ..., min + (k − 1)W
72
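A minimal sketch of the two partitioning rules just described; the data vector is an illustrative example (the same values are used in the equi-width example a couple of slides ahead).

```python
def equal_width_bins(values, k):
    """Partition values into k bins of equal interval width W = (max - min) / k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), k - 1)   # the last bin is closed on the right
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, k):
    """Partition the sorted values into k bins containing (roughly) equal counts."""
    data = sorted(values)
    size = len(data) // k
    return [data[i * size:(i + 1) * size] if i < k - 1 else data[i * size:]
            for i in range(k)]

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(data, 3))   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(data, 3))   # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```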
Example
73
Data Cleaning: Noisy data
Binning
• Alternatively, bins may be equal-width,
• where the interval range of values in each bin is
constant.
75
Binning using equi-width (k = 3, W = (34 − 4) / 3 = 10), smoothing by means:

Var1 value:    4   8   9   15  21  21  24     25     26     28     29     34
Binned_Var1:   7   7   7   19  19  19  27.67  27.67  27.67  27.67  27.67  27.67

Intervals (From, To)    #records
[4, 14)                 3
[14, 24)                3
[24, 34]                6

Bin 1: [4, 14)  -> {4, 8, 9},                  mean = 7
Bin 2: [14, 24) -> {15, 21, 21},               mean = 19
Bin 3: [24, 34] -> {24, 25, 26, 28, 29, 34},   mean = 27.67
76
Data Cleaning: Noisy data
Question?
• The age values for the data tuples are (in increasing order)
• 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
77
Solution
• The median (middle value of the ordered set,
as the number of values in the set is odd) of
the data is: 25
78
Solution
• Step 1: Sort the data
• Step 2: Partition the data into equal-frequency
bins of size 3.
79
GTU Question
• What is noise? Explain data smoothing
methods as noise removal technique to divide
given data into bins of size 3 by bin partition
(equal frequency), by bin means, by bin
medians and by bin boundaries.
• Consider the data: 10, 2, 19, 18, 20, 18, 25, 28,
22 [Summer 2017, Winter 2018 – 7 marks]
80
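Not an official answer, but one way to sketch the three smoothing methods on the question's data with equal-frequency bins of size 3 (ties between the two bin boundaries are sent to the lower boundary here):

```python
data = sorted([10, 2, 19, 18, 20, 18, 25, 28, 22])       # [2, 10, 18, 18, 19, 20, 22, 25, 28]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]    # equal-frequency bins of size 3

by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
by_medians = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)           # [[2, 10, 18], [18, 19, 20], [22, 25, 28]]
print(by_means)       # [[10.0, 10.0, 10.0], [19.0, 19.0, 19.0], [25.0, 25.0, 25.0]]
print(by_medians)     # [[10, 10, 10], [19, 19, 19], [25, 25, 25]]
print(by_boundaries)  # [[2, 2, 18], [18, 18, 20], [22, 22, 28]]
```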
GTU Question
• Suppose a group of sales price records has
been sorted as follows:
• 6, 9, 12, 13, 15, 25, 50, 70, 72, 92, 204,
232 Partition them into three bins by
equal-frequency (equi-depth)
partitioning method. Perform data
smoothing by bin mean.
• [winter 2017 – 3 marks]
81
Data Cleaning: Noisy data
Clustering
• Outliers may be detected by clustering, where similar values
are organized into groups, or “clusters.”
• Intuitively, values that fall outside of the set of clusters may be
considered outliers
82
Data Cleaning: Noisy data
Cluster Analysis
83
Data Cleaning: Noisy data
Regression
• Data can be smoothed by fitting the data to a function, such as
with regression.
• Linear regression involves finding the “best” line to fit two
attributes (or variables), so that one attribute can be used to
predict the other.
• Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are
fit to a multidimensional surface.
84
Regression
[Figure: salary (y) plotted against age (x), with a fitted regression line y = x + 1 passing through the point (X1, Y1).]
85
Regression Analysis
• Linear regression: Y = w X + b
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– They are estimated by applying the least-squares criterion to the known values of
Y1, Y2, …, and X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
86
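A small sketch of estimating w and b by the least-squares criterion; the (age, salary) pairs below are made-up illustration data, not taken from the slides.

```python
# Least-squares estimates for Y = w*X + b.
ages = [23, 25, 28, 32, 36, 41, 45, 50]
salaries = [24, 27, 31, 36, 40, 46, 50, 56]        # in $1000s (illustrative data)

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(salaries) / n
w = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, salaries)) \
    / sum((x - mean_x) ** 2 for x in ages)
b = mean_y - w * mean_x

print(f"salary = {w:.3f} * age + {b:.3f}")
print("predicted salary at age 30:", round(w * 30 + b, 1))
```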
Agenda
98
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
99
Data Integration
101
Data integration
• For numerical attributes, we can evaluate the correlation between two attributes, A
and B, by computing the correlation coefficient (also known as Pearson’s product
moment coefficient):
r(A,B) = Σ (ai − Ā)(bi − B̄) / (N · σA · σB)
• For categorical attributes, the χ² (chi-square) statistic is used instead:
χ² = Σi Σj (oij − eij)² / eij
where oij is the observed frequency (i.e., actual count) of the joint event
(Ai, Bj) and eij is the expected frequency of (Ai, Bj).
103
Data integration
• The expected frequency is calculated as
eij = ( count(A = ai) × count(B = bj) ) / N
• where N is the number of data tuples, count(A = ai) is the number of tuples having
value ai for A, and count(B = bj) is the number of tuples having value bj for B.
• The χ² statistic tests the hypothesis that A and B are independent. The test
is based on a significance level, with (r-1)(c-1) degrees of freedom.
• If the hypothesis can be rejected, then we say that A and B are statistically related
or associated.
(In the contingency table of the example that follows, the numbers in
parentheses are the expected frequencies.)
104
Data integration
• Suppose that a group of 1,500 people was surveyed. The gender of each person
was noted. Each person was polled as to whether their preferred type of reading
material was fiction or nonfiction. Thus, we have two attributes, gender and
preferred reading.
105
Data integration
• For this 2 x 2 table, the
degrees of freedom are
(2-1)(2-1) = 1. For 1
degree of freedom, the
χ² value needed to reject
the hypothesis at the
0.001 significance level is
10.828
• Since our computed
value is above this, we
can reject the hypothesis
that gender and
preferred reading are
independent and
conclude that the two
attributes are (strongly)
correlated for the given
group of people.
106
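The contingency table itself is not reproduced in these notes, so the observed counts below are assumed, following the usual textbook version of this 1,500-person survey; the sketch shows how the expected frequencies and the χ² statistic quoted above are computed.

```python
# Observed counts (assumed illustration):
#            fiction   non-fiction
# male          250            50      (row total  300)
# female        200          1000      (row total 1200)
observed = [[250, 50], [200, 1000]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]
chi_square = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for i in range(2) for j in range(2))

print(expected)               # [[90.0, 210.0], [360.0, 840.0]]
print(round(chi_square, 2))   # 507.93 > 10.828, so independence is rejected at the 0.001 level
```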
Data integration
• In addition to detecting redundancies between attributes, duplication
should also be detected at the tuple level (e.g., where there are two or
more identical tuples for a given unique data entry case).
• The use of denormalized tables (often done to improve performance by
avoiding joins) is another source of data redundancy
107
Data integration
• Note that correlation does not imply causality.
• That is, if A and B are correlated, this does not necessarily
imply that A causes B or that B causes A.
• For example, in analyzing a demographic database, we may
find that attributes representing the number of hospitals and
the number of car thefts in a region are correlated.
• This does not mean that one causes the other. Both are
actually causally linked to a third attribute, namely, population.
• For categorical (discrete) data, a correlation relationship
between two attributes, A and B, can be discovered by a chi-
square test.
108
Data integration
The following issues arise:
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units, different currency
– For a hotel chain, the price of rooms in different cities may involve not
only different currencies but also different services (such as free
breakfast) and taxes.
– An attribute in one system may be recorded at a lower level of
abstraction than the “same” attribute in another. For example, the total
sales in one database may refer to one branch of All Electronics, while an
attribute of the same name in another database may refer to the total
sales for All Electronics stores in a given region.
109
Data Transformation
• Smoothing: remove noise from data (such as binning,
regression, and clustering.)
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scaled to fall within a small, specified
range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
• Attribute/feature construction
– New attributes constructed from the given ones
110
Data Transformation: Normalization
v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
113
Data Transformation: Normalization
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
• Suppose that the recorded values of A range from −986 to 917.
• The maximum absolute value of A is 986.
• Dividing each value by 1,000 (i.e., j = 3), −986 normalizes to −0.986 and 917 normalizes to 0.917.
114
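A compact sketch of the three normalization methods, applied to the $73,600 income example for min-max and decimal scaling; the mean and standard deviation passed to the z-score call are the figures from the GTU salary question later in this unit, and the z-score formula v' = (v − mean) / std_dev is the standard one (its slide is not reproduced here).

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std_dev):
    return (v - mean) / std_dev

def decimal_scaling(v, max_abs):
    j = 0
    while max_abs / (10 ** j) >= 1:     # smallest j such that Max(|v'|) < 1
        j += 1
    return v / (10 ** j)

print(min_max(73_600, 12_000, 98_000))       # 0.716
print(z_score(73_600, 54_000, 16_000))       # 1.225
print(decimal_scaling(917, 986))             # 0.917  (A ranges from -986 to 917, so j = 3)
```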
Question:
(a) Use min-max normalization to transform the value 35 for age onto the range
[0.0, 1.0].
(b) Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
115
Question:
Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000
Mean = (200 + 300 + 400 + 600 + 1000) / 5 = 500
Variance = [(200 − 500)² + (300 − 500)² + (400 − 500)² + (600 − 500)² + (1000 − 500)²] / 5 = 80,000
Standard deviation σ = √80,000 ≈ 282.84
116
Question:
Use the two methods below to normalize the following group of data:
200, 300, 400, 600, 1000
(a) min-max normalization by setting min = 0 and max = 1
(b) z-score normalization
117
GTU Question
118
GTU Question
119
Data Transformation: attribute construction
121
Data Reduction
• Problem:
Data Warehouse may store terabytes of data:
Complex data analysis/mining may take a very
long time to run on the complete data set
• Solution?
– Data reduction…
122
Data Reduction
Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results
123
Data Reduction Techniques
• Data cube aggregation
• Attribute subset selection: where irrelevant, weakly relevant, or
redundant attributes or dimensions may be detected and removed
• step-wise forward selection
• step-wise backward elimination
• combining forward selection and backward elimination
• decision-tree induction
• Numerosity reduction, where the data are replaced or estimated by
alternative, smaller data representations such as parametric models (which need to store
only the model parameters instead of the actual data) or nonparametric methods such
as clustering, sampling, and the use of histograms
128
Attribute Subset Selection
• The goal of attribute subset selection is to find a
minimum set of attributes such that the resulting
probability distribution of the data classes is as
close as possible to the original distribution
obtained using all attributes.
• Advantage
– Faster execution
– It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns
easier to understand (i.e., the simplicity measure improves)
129
How can we find a ‘good’ subset of the original attributes?”
Brute-force approach:
• For n attributes, there are 2^n possible subsets.
• Try all possible feature subsets as input to the data mining algorithm.
• This is prohibitively time-consuming as n grows.
130
Heuristic methods: step-wise forward selection
131
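The figure for this slide is not reproduced; as an illustrative sketch, greedy step-wise forward selection starts from an empty attribute set and repeatedly adds whichever remaining attribute improves an evaluation score the most. The score function below is a hypothetical stand-in for whatever criterion (e.g., cross-validated accuracy or information gain) is actually used.

```python
def forward_selection(attributes, score, max_features=None):
    """Greedy step-wise forward selection; score(subset) returns 'higher is better'."""
    selected, remaining = [], list(attributes)
    best_score = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate_scores = {a: score(selected + [a]) for a in remaining}
        best_attr = max(candidate_scores, key=candidate_scores.get)
        if candidate_scores[best_attr] <= best_score:
            break                             # no remaining attribute improves the score
        selected.append(best_attr)
        remaining.remove(best_attr)
        best_score = candidate_scores[best_attr]
    return selected

# Toy scoring function: pretend only A1, A4 and A6 are relevant (small penalty per attribute).
relevant = {"A1", "A4", "A6"}
def toy_score(subset):
    return len(relevant & set(subset)) - 0.01 * len(subset)

print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))  # ['A1', 'A4', 'A6']
```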
Heuristic methods: step-wise backward Elimination
132
Heuristic methods: combining forward selection and backward
elimination
133
Heuristic methods: Decision Tree Induction
134
Heuristic methods: Decision Tree Induction
135
Data Reduction Techniques
– Data cube aggregation
– Attribute subset selection
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
– Numerosity reduction
Regression and log –linear model
Histogram, Clustering, Sampling…..
– Dimensionality reduction / Data Compression
Wavelet transforms
Principal components analysis.
– Discretization and concept hierarchy generation
140
Numerosity Reduction
141
Parametric Data Reduction: Regression and
Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a linear
function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
142
Parametric Data Reduction: Linear Regression
Linear regression
• In (simple) linear regression, the data are modeled to fit a straight line.
• For example, a random variable, y (called a response variable), can be
modeled as a linear function of another random variable, x (called a
predictor variable), with the equation
Y=wX+b
• where the variance of y is assumed to be constant.
• In the context of data mining, x and y are numerical database
attributes.
• The coefficients, w and b (called regression coefficients), specify the
slope of the line and the Y-intercept, respectively.
• These coefficients can be solved for by the method of least squares,
which minimizes the error between the actual line separating the
data and the estimate of the line.
143
Linear Regression: Example
145
Log-Linear Models
Log-linear models:
Log(y)= b0 + b1 log(X1 )+ b2 log(X2 )
• Log-linear models approximate discrete multidimensional probability
distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes), can
consider each tuple as a point in an n-dimensional space.
• Log-linear models can be used to estimate the probability of each point in a
multidimensional space for a set of discretized attributes, based on a smaller
subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from lower
dimensional spaces.
• Log-linear models are therefore also useful for dimensionality reduction
(since the lower-dimensional points together typically occupy less space than
the original data points) and data smoothing (since aggregate estimates in the
lower-dimensional space are less subject to sampling variations than the
estimates in the higher-dimensional space)
146
147
Histograms
148
Histograms
To further reduce the data, it is common to have each bucket
denote a continuous range of values for the given attribute.
150
“How are the buckets determined and the attribute values partitioned?”
Equal-width:
– It divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B-A)/N.
– The most straightforward
– But outliers may dominate presentation
– Skewed data is not handled well.
151
“How are the buckets determined and the attribute values partitioned?”
Histogram types
• Equal-width histograms:
– It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately
same number of samples
• V-optimal:
– It considers all histogram types for a given number of buckets and
chooses the one with the least variance.
154
V-Optimal
Take a simple set of data, for example, a list of integers:
1, 3, 4, 7, 2, 8, 3, 6, 3, 6, 8, 2, 1, 6, 3, 5, 3, 4, 7, 2, 6, 7, 2
155
(1, 2), (2, 4), (3, 5), (4, 2), (5, 1), (6, 4), (7, 3), (8, 2)
Option 1: Bucket 1 holds values 1–4, Bucket 2 holds values 5–8.
Bucket 1:
Average frequency 3.25
Weighted variance 2.28
Bucket 2:
Average frequency 2.5
Weighted variance 2.19
Sum of Weighted Variance 4.47
156
(1, 2), (2, 4), (3, 5), (4, 2), (5, 1), (6, 4), (7, 3), (8, 2)
Option 2: Bucket 1 holds values 1–2, Bucket 2 holds values 3–8.
Bucket 1:
Average frequency 3
Weighted variance 1.41
Bucket 2:
Average frequency 2.83
Weighted variance 3.29
Sum of Weighted Variance 4.70
The first choice is better
157
“How are the buckets determined and the attribute values partitioned?”
Histogram types
• Equal-width histograms:
– It divides the range into N intervals of equal size
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately
same number of samples
• V-optimal:
– It considers all histogram types for a given number of buckets and
chooses the one with the least variance.
• MaxDiff:
– After sorting the data to be approximated, it defines the borders of
the buckets at points where the adjacent values have the maximum
difference
• Example: split 1, 1, 4, 5, 5, 7, 9, 14, 16, 18, 27, 30, 30, 32 into three buckets.
MaxDiff places the bucket borders at the two largest adjacent gaps, 27−18 (= 9) and 14−9 (= 5),
giving the buckets {1, 1, 4, 5, 5, 7, 9}, {14, 16, 18} and {27, 30, 30, 32}.
158
Histograms
• V-Optimal and MaxDiff histograms tend to be
the most accurate and practical.
• Histograms are highly effective at approximating
both sparse and dense data, as well as highly
skewed and uniform data.
• Multidimensional histograms can capture
dependencies between attributes.
159
Clustering
• Partition data set into clusters, and store cluster representation
only
• Quality of clusters measured by their diameter (max distance
between any two objects in the cluster) or centroid distance (avg.
distance of each cluster object from its centroid)
• Can be very effective if data is clustered but not if data is “smeared”
• Can have hierarchical clustering (possibly stored in multi-
dimensional index tree structures (B+-tree, R-tree, quad-tree, etc))
• There are many choices of clustering definitions and clustering
algorithms (further details later)
161
Sampling
• Sampling can be used as a data reduction technique because it
allows a large data set to be represented by a much smaller
random sample (or subset) of the data.
• Sampling is used in data mining because processing the entire set
of data of interest is too expensive or time consuming.
• Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
• Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling:
• Note: Sampling may not reduce database I/Os (page at a time)
164
Sampling …
• The key principle for effective sampling is the
following:
– using a sample will work almost as well as using
the entire data sets, if the sample is representative
165
Types of Sampling
Simple random sampling
There is an equal probability of selecting
any particular item
166
Sampling: With or without Replacement
[Figure: from the raw data, a simple random sample of tuples is drawn without replacement (SRSWOR) and with replacement (SRSWR).]
167
Sampling: With or without Replacement
168
Types of Sampling (continue)
Cluster Sample
Let D = total number of tuples
M = number of mutually disjoint clusters into which the tuples of D are grouped
s = number of clusters drawn from the M clusters (an SRS of s < M clusters)
170
Types of Sampling (continue)
Stratified sample
Let D = total number of tuples
• If D is divided into mutually disjoint parts called strata, a
stratified sample of D is generated by obtaining an SRS at each
stratum.
• This helps ensure a representative sample, especially when the
data are skewed.
• For example, a stratified sample may be obtained from customer
data, where a stratum is created for each customer age group. In
this way, the age group having the smallest number of
customers will be sure to be represented.
171
Stratified sample
172
Advantage of sampling
An advantage of sampling for data reduction is that the cost of
obtaining a sample is proportional to the size of the sample, s, as
opposed to N, the data set size.
173
Question
Sketch examples of each of the following sampling techniques:
SRSWOR,
SRSWR,
cluster sampling,
stratified sampling.
Use samples of size 5 and
the strata “youth,” “middle-aged,” and “senior.”
The age values for the data tuples are (in increasing order)
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40,
45, 46, 52, 70.
174
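An illustrative way to generate such samples with Python's random module; the stratum cut-offs (youth < 25, middle-aged 25–45, senior > 45) and the per-stratum sample sizes are assumptions made for this sketch.

```python
import random

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33,
        35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

srswor = random.sample(ages, 5)                   # simple random sample WITHOUT replacement
srswr = [random.choice(ages) for _ in range(5)]   # simple random sample WITH replacement

# Cluster sample: group the tuples into M = 3 disjoint clusters, then draw s = 1 cluster.
clusters = [ages[i:i + 9] for i in range(0, len(ages), 9)]
cluster_sample = random.choice(clusters)

# Stratified sample: an SRS from each stratum (age cut-offs are assumed for this sketch).
strata = {
    "youth": [a for a in ages if a < 25],
    "middle-aged": [a for a in ages if 25 <= a <= 45],
    "senior": [a for a in ages if a > 45],
}
stratified = {name: random.sample(group, min(2, len(group))) for name, group in strata.items()}

print(srswor, srswr, cluster_sample, stratified, sep="\n")
```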
Answer
175
Answer
176
Answer
177
Answer
178
Data Reduction Techniques
– Data cube aggregation
– Attribute subset selection
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
– Numerosity reduction
Regression and log –linear model
Histogram, Clustering, Sampling…..
– Dimensionality reduction / Data Compression
Wavelet transforms
Principal components analysis.
– Discretization and concept hierarchy generation
179
Data Compression
180
Data Compression
181
Wavelet Transforms
• Discrete wavelet transform (DWT): linear signal processing
• In this, a data vector D is transformed into a numerically different vector
D’ of wavelet coefficients. Both vectors have the same length.
• If both vectors have the same length, how is the reduction achieved?
182
Wavelet Transforms
• Wavelet transformed data can be truncated.
• Compressed approximation: store only a small fraction of the strongest
of the wavelet coefficients
• For example, all wavelet coefficients larger than some user-specified threshold
can be retained; the remaining coefficients are set to zero.
• Similar to discrete Fourier transform (DFT), but better lossy
compression.
• That is, if the same number of coefficients is retained for a DWT and a
DFT of a given data vector, the DWT version will provide a more
accurate approximation of the original data.
• Hence, for an equivalent approximation, the DWT requires less space
than the DFT.
• Unlike the DFT, wavelets are quite localized in space, contributing to the
conservation of local detail.
183
Wavelet Transforms
• Method (hierarchical pyramid algorithm):
– Length, L, must be an integer power of 2 (padding with 0s, when necessary)
– Each transform has 2 functions:
• smoothing (e.g., sum, weighted avg.), weighted difference
– Applies to pairs of data, resulting in two sets of data of length L/2
– Applies the two functions recursively, until reaches the desired length
– A selection of values from the data sets obtained in the above iterations is designated as
the wavelet coefficients of the transformed data
185
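A minimal sketch of the pairwise smoothing/differencing pyramid for the (unnormalized) Haar wavelet; production DWT code applies extra scaling, but this shows the recursive structure described above. The input vector is the standard textbook example.

```python
def haar_transform(data):
    """Unnormalized Haar DWT via the hierarchical pyramid algorithm.

    The length must be a power of 2 (pad with zeros beforehand if it is not).
    Returns [overall average] + detail coefficients, coarsest first.
    """
    coeffs = []
    current = list(data)
    while len(current) > 1:
        averages = [(current[i] + current[i + 1]) / 2 for i in range(0, len(current), 2)]
        details = [(current[i] - current[i + 1]) / 2 for i in range(0, len(current), 2)]
        coeffs = details + coeffs        # keep finer-level details toward the end
        current = averages               # recurse on the smoothed, half-length vector
    return current + coeffs

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]; truncating small coefficients
# to zero gives the compressed approximation.
```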
Example
Second iteration
186
Example
Third iteration
187
Example
188
Principal Component Analysis (PCA)
Principal Component Analysis, or PCA, is an unsupervised
dimensionality-reduction method that is used to reduce the
dimensionality of large data sets, by transforming a large set of
variables into a smaller one that still contains most of the
information in the large set.
It transforms the variables into a new set of variables called
principal components.
These principal components are linear combinations of the original
variables and are orthogonal, i.e., uncorrelated with
each other (the correlation between any pair of components is zero).
Principal Component Analysis
Principal components are new variables that are constructed as linear
combinations or mixtures of the initial variables.
These combinations are done in such a way that the new variables (i.e.,
principal components) are uncorrelated and most of the information within
the initial variables is squeezed or compressed into the first components.
So, the idea is that 10-dimensional data gives us 10 principal components, but
PCA tries to put the maximum possible information in the first component, then
the maximum remaining information in the second, and so on, until 10 principal
components are generated.
Organizing information in principal components this way allows us to reduce
dimensionality without losing much information, by discarding the components
with low information and treating the remaining components as the new variables.
Objective of PCA
1. The new features are distinct i.e. the covariance between the
new features (in case of PCA, they are the principal
components) is 0.
2. The principal components are generated in order of the
variability in the data that it captures. Hence, the first
principal component should capture the maximum variability,
the second one should capture the next highest variability etc.
3. The sum of the variance of the new features / the principal
components should be equal to the sum of the variance of
the original features.
Principal Component Analysis
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing “significance”
or strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a
good approximation of the original data)
• Works for numeric data only
192
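A compact numpy sketch of these steps (center the attributes, compute the covariance matrix, sort the eigenvectors by decreasing eigenvalue, project onto the top k); the 2-D input matrix is made-up illustration data rather than the slides' worked example.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k strongest principal components."""
    X_centered = X - X.mean(axis=0)                   # normalize: center each attribute
    cov = np.cov(X_centered, rowvar=False)            # covariance matrix of the attributes
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # symmetric matrix -> real eigenpairs
    order = np.argsort(eigenvalues)[::-1]             # sort by decreasing "significance"
    components = eigenvectors[:, order[:k]]           # k orthonormal principal components
    return X_centered @ components, components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
scores, components = pca(X, k=1)      # reduce the 2-D data to its strongest component
print(components.T)                   # direction of maximum variance
print(scores.ravel())                 # the data expressed in that 1-D basis
```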
EXAMPLE: PRINCIPAL COMPONENT ANALYSIS (Feature Extraction)
193
PRINCIPAL COMPONENT ANALYSIS (Feature Extraction)
Φi = Γi − Ψ, i = 1, 2, ..., M   (each training image Γi minus the mean image Ψ)
194
PRINCIPAL COMPONENT ANALYSIS (Feature Extraction)
Step 6: Represent each face image using eigenfaces:
WK = UK^T Φi,   K = 1, 2, ..., M
where U1, U2, ..., UM are the eigenvectors and λ1, λ2, ..., λM are the
eigenvalues corresponding to those eigenvectors.
The eigenvectors are sorted in descending order of eigenvalues.
195
Data Reduction Techniques
– Data cube aggregation
– Attribute subset selection
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
– Dimensionality reduction / Data Compression
Wavelet transforms
Principal components analysis.
– Numerosity reduction
Regression and log –linear model
Histogram, Clustering, Sampling…..
– Discretization and concept hierarchy generation
196
Discretization and Concept Hierarchy
• Discretization
– reduce the number of values for a given continuous attribute by
dividing the range of the attribute into intervals.
– Interval labels can then be used to replace actual data values.
– A concept hierarchy for a given numerical attribute defines a
discretization of the attribute.
• Concept Hierarchies
– reduce the data by collecting and replacing low level concepts
(such as numeric values for the attribute age) by higher level
concepts (such as young, middle-aged, or senior).
197
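As a tiny illustration of climbing such a hierarchy (the interval labels and cut-offs below are assumptions, not taken from the slide):

```python
def age_concept(age):
    """Replace a numeric age with a higher-level concept label (illustrative cut-offs)."""
    if age < 25:
        return "young"
    elif age <= 45:
        return "middle-aged"
    return "senior"

ages = [13, 22, 25, 33, 35, 46, 70]
print([(a, age_concept(a)) for a in ages])
# [(13, 'young'), (22, 'young'), (25, 'middle-aged'), ..., (70, 'senior')]
```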
Introduction to Dimensionality Reduction
228
The Curse of Dimensionality
229
Benefits of applying Dimensionality Reduction
There are two ways to apply the dimension reduction technique, which
are given below:
1. Feature selection is the process of selecting the subset of the
relevant features and leaving out the irrelevant features present in
a dataset to build a model of high accuracy. In other words, it is a
way of selecting the optimal features from the input dataset.
2. Feature extraction is the process of transforming the space
containing many dimensions into space with fewer dimensions.
This approach is useful when we want to keep the whole
information but use fewer resources while processing the
information.
• Principal Component Analysis
• Linear Discriminant Analysis
• Kernel PCA
• Quadratic Discriminant Analysis
231
CUR Decomposition
CUR matrix decomposition is a low-rank matrix decomposition
algorithm that is explicitly expressed in terms of a small number of actual
columns and/or actual rows of the data matrix.
If A is sparse, then the two large matrices C and R are also sparse; only the
middle matrix U is dense, but this matrix is small, so its density does not
hurt too much.
232
CUR Decomposition
A ≈ C U R
where U is the pseudo-inverse of
the intersection of C and R
233
Definition of CUR Decomposition
234
Computing U
[Figure: W is the submatrix of A at the intersection of the chosen columns C and rows R.]
U = W+  (the Moore–Penrose pseudo-inverse of W)
235
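A hedged numpy sketch of this step: given chosen column indices (for C) and row indices (for R), W is their intersection inside A and U is its Moore–Penrose pseudo-inverse, so that C·U·R approximates A. The random, probability-proportional selection and the sqrt(r × probability) scaling described next are omitted, and the ratings matrix below is assumed (the standard textbook example consistent with the "sum of squares = 243" figure in the example that follows).

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Build C, U, R from chosen columns/rows of A; U is the pseudo-inverse of W."""
    C = A[:, col_idx]                      # actual columns of A
    R = A[row_idx, :]                      # actual rows of A
    W = A[np.ix_(row_idx, col_idx)]        # intersection of the chosen rows and columns
    U = np.linalg.pinv(W)                  # U = W+  (Moore-Penrose pseudo-inverse)
    return C, U, R

A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

# Columns 1 and 3 play the role of the "Alien" and "Casablanca" columns in the example.
C, U, R = cur_from_indices(A, col_idx=[1, 3], row_idx=[3, 5])
print(np.round(C @ U @ R, 2))   # reproduces A exactly here, since A has rank 2
```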
Choosing Columns and Rows properly
236
Example: CUR Decomposition
The sum of the squares of the elements of the matrix M is 243.
Suppose we first select Alien and then Casablanca as the columns of C. We scale each
column by dividing its elements by sqrt(r × the probability of selecting that column).
Alien: Casablanca :
divided by = 0.430
238
Example: CUR Decomposition
Let r= 2
R Matrix:
239
Example: CUR Decomposition
Selection of U Matrix:
240
Example: CUR Decomposition
Selection of U Matrix:
2. Calculate SVD of W
241
Example: CUR Decomposition
242
Another Example: CUR Decomposition
243
Feature Selection
Filter Method: uses variable-ranking techniques to order and select
the variables; here the selection of features is independent of the
classifier used. Examples: Information Gain, Chi-square test,
Variance Threshold, etc.
Wrapper Method: Wrappers require some method to search the
space of all possible subsets of features, assessing their quality by
learning and evaluating a classifier with that feature subset. The
feature selection process is based on a specific machine learning
algorithm that we are trying to fit on a given dataset. Example:
Forward feature selection, backward feature selection, Recursive
feature Elimination
Embedded Methods: Embedded methods check the different
training iterations of the machine learning model and evaluate the
importance of each feature. Common techniques of embedded
methods include LASSO regularization (L1), Elastic Net, etc.
245
GTU Question
Minimum salary is 20,000 Rs and maximum salary is 1,70,000 Rs. Map the salary 1,00,000 Rs onto the
new range of (60,000, 2,60,000) Rs using the min-max normalization method. [4 marks]
If the mean salary is 54,000 Rs and the standard deviation is 16,000 Rs, then find the z-score value of
a 73,600 Rs salary. [3 Marks]
Explain Mean, Median, Mode, Variance, Standard Deviation & five number summary with
suitable database example. [ 7 Marks]
List and describe methods for handling missing values in data cleaning. [3 marks]
What is noise? Explain data smoothing methods as noise removal technique to divide given
data into bins of size 3 by bin partition (equal frequency), by bin means, by bin medians and
by bin boundaries. Consider the data: 10, 2, 19, 18, 20, 18, 25, 28, 22 [ 7 marks]
Enlist the preprocessing steps with example. Explain procedure of any technique of
preprocessing [7 marks]
What is noise? Explain binning methods for data smoothing [4 marks]
Explain various data normalization techniques. [7 marks]
Enlist data reduction strategies and explain any two [7 Marks]
Explain the following data normalization techniques: (i) min-max normalization and (ii)
decimal scaling [7 marks]
In real-world data, tuples with missing values for some attributes are a common occurrence.
Describe various methods for handling this problem. [7 marks]
In data pre-processing why we need data smoothing? Discuss data smoothing by Binning. [7
marks]
246
GTU Question
Explain the need for data smoothing during pre-processing and
discuss data smoothing by Binning [7 marks]
Use min-max normalization method to normalize the following group of data by setting min
= 0 and max = 1 200, 300, 400, 600, 1000 [ 3 marks]
Describe various methods for handling missing data values [4 marks]
Suppose a group of sales price records has been sorted as follows: 6, 9, 12, 13, 15, 25, 50, 70,
72, 92, 204, 232 Partition them into three bins by equal-frequency (equi-depth) partitioning
method. Perform data smoothing by bin mean. [3 marks]
Explain the following terms: Numerosity reduction, Data Integration, Data transformation [3
marks]
Explain sampling methods for data reduction. [7 marks]
Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 23, 29, 35, 41, 44, 53, 62, 69, 72 Use
min-max normalization to transform the value 45 for age onto the range [0.0, 1.0] [4 marks]
Define noise data. Enlist the reasons for the presence of noise in data collection. Explain the
methods to deal with noise. [7 marks]
247