Data Preprocessing

The document discusses data preprocessing, focusing on data cleaning, integration, reduction, and transformation, emphasizing the importance of data quality measures such as accuracy, completeness, consistency, timeliness, believability, and interpretability. It outlines major tasks in data preprocessing, including filling in missing values, smoothing noisy data, and resolving inconsistencies, as well as techniques for handling noisy and incomplete data. Additionally, it covers data integration challenges like entity identification and redundancy analysis, and introduces data reduction strategies to maintain data integrity while minimizing volume.

Uploaded by

yijac51850
Data Preprocessing

Data Cleaning, Data Integration, Data Reduction & Data Transformation
Data Quality
Measures for data quality:
There are many factors comprising data quality, including accuracy,
completeness, consistency, timeliness, believability, and interpretability.
1. Accuracy: whether the data are correct (inaccurate data have incorrect
attribute values).
• The data collection instruments used may be faulty. There may have been
human or computer errors occurring at data entry.
• Errors in data transmission can also occur, for example because of limited buffer size (buffer overflow).
• Duplicate tuples also require data cleaning.
2. Completeness: not recorded, unavailable, …
• Attributes of interest may not always be available.
• Data may not be included simply because they were not considered
important at the time of entry.
• Relevant data may not be recorded due to a misunderstanding or because
of equipment malfunctions.
3. Consistency: some modified but some not, dangling, …
• Data that were inconsistent with other recorded data may have been
deleted.
• Missing data, particularly for tuples with missing values for some
attributes, may need to be inferred.
4. Timeliness: timely update?
• The fact that the month-end data are not updated in a timely fashion has a
negative impact on the data quality.
5. Believability: how much are the data trusted to be correct?
• How much the data are trusted by users.
6. Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
1. Data cleaning
➢Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
2. Data integration
➢Integration of multiple databases, data cubes, or files
3. Data reduction
➢Dimensionality reduction
➢Numerosity reduction
➢Data compression
4. Data transformation and data discretization
➢Normalization
➢Concept hierarchy generation
1. Data cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error.
A. Incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
➢e.g., Occupation=“ ” (missing data)
B. Noisy: containing noise, errors, or outliers
➢e.g., Salary=“−10” (an error)
C. Inconsistent: containing discrepancies in codes or names, e.g.,
➢Age=“42”, Birthday=“03/07/2010”
➢Was rating “1, 2, 3”, now rating “A, B, C”
➢discrepancy between duplicate records
D. Intentional (e.g., disguised missing data)
➢Jan. 1 as everyone’s birthday?
A. Incomplete (Missing) Data
• Data is not always available
➢E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data.
• Missing data may be due to:
➢equipment malfunction
➢data that were inconsistent with other recorded data and were therefore deleted
➢data not entered due to misunderstanding
➢certain data not being considered important at the time of entry
➢history or changes of the data not being recorded
• Missing data may need to be inferred.
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per
attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant: e.g., “unknown”, a new class?!
• Use a measure of central tendency for the attribute (e.g., the mean or
median)
• Use the attribute mean for all samples belonging to the same class:
smarter
• Use the most probable value: inference-based such as Bayesian
formula or decision tree
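A minimal sketch of two of the strategies above, filling with the attribute mean and filling with the mean of the same class. The records, the attribute name `income`, and the class labels are made up for illustration and are not taken from the slides.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value.
records = [
    {"income": 50.0, "cls": "low_risk"},
    {"income": None, "cls": "low_risk"},
    {"income": 70.0, "cls": "low_risk"},
    {"income": 20.0, "cls": "high_risk"},
    {"income": None, "cls": "high_risk"},
]

def fill_with_global_mean(rows, attr):
    """Replace missing values of attr with the mean of all known values."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]

def fill_with_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean of the tuple's own class."""
    known = {}
    for r in rows:
        if r[attr] is not None:
            known.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(vals) for c, vals in known.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]

by_global = fill_with_global_mean(records, "income")
by_class = fill_with_class_mean(records, "income", "cls")
```

The class-based variant assumes every class has at least one known value; a class with none would need a fallback such as the global mean.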
B. Noisy Data
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to:
➢faulty data collection instruments
➢data entry problems
➢data transmission problems
➢technology limitation
➢inconsistency in naming convention
• Other data problems which require data cleaning:
➢duplicate records
➢incomplete data
➢inconsistent data
How to Handle Noisy Data?
Binning:
• First sort the data:
4, 8, 15, 21, 21, 24, 25, 28, 34
• Partition into (equal-frequency) bins.
✓then one can smooth by bin means
✓smooth by bin median
✓smooth by bin boundaries, etc.
➢Alternatively, bins may be equal width, where the interval range of values in each bin is constant.
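The binning steps can be sketched as follows for the sorted values 4, 8, 15, 21, 21, 24, 25, 28, 34. The bin size of 3 and the function names are assumptions chosen for illustration.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Ties go to the lower boundary (an arbitrary choice in this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With these inputs the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34], and the bin means are 9, 22, and 29.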
How to Handle Noisy Data?
• Regression
➢Data smoothing can also be done by regression, a technique that
conforms data values to a function.
➢Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
➢Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface.
• Outlier analysis
✓Outliers may be detected by clustering, for example, where similar
values are organized into groups, or “clusters.” Intuitively, values
that fall outside of the set of clusters may be considered outliers.
✓Detect and remove outliers
Data Cleaning as a Process
• Data discrepancy detection
➢Use metadata (e.g., domain, range, dependency, distribution)
➢Check field overloading: squeeze new attribute definitions into
unused (bit) portions of already defined attributes.
➢Check uniqueness rule: each value of the given attribute must be
different from all other values for that attribute
➢Check consecutive rule: there can be no missing values between the
lowest and highest values for the attribute, and all values
must also be unique
➢Check null rule: specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition, and how
such values should be handled.
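As a rough illustration, the unique, consecutive, and null rules can each be written as a small check. The set of null markers below is an assumption; in practice it would come from the attribute's metadata.

```python
# Assumed null markers for this sketch; real ones come from the metadata.
NULL_MARKERS = {"", "?", "N/A", "null"}

def violates_unique_rule(values):
    """True if any value of the attribute occurs more than once."""
    return len(values) != len(set(values))

def violates_consecutive_rule(values):
    """True if integer values have gaps between min and max or repeat."""
    return sorted(values) != list(range(min(values), max(values) + 1))

def null_positions(values):
    """Indexes of entries that indicate the null condition."""
    return [
        i for i, v in enumerate(values)
        if v is None or str(v).strip() in NULL_MARKERS
    ]
```

For example, `violates_consecutive_rule([1, 2, 4])` flags the missing value 3, and `null_positions(["a", "?", " ", None])` reports the last three entries.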
Data Cleaning as a Process
• Data discrepancy detection
• Use commercial tools
✓Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
✓Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering
to find outliers)
Data Cleaning as a Process
• Data migration and integration
• Data migration tools: allow simple transformations to be specified
such as to replace the string “gender” by “sex.”
• ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface (GUI).
• Integration of the two processes
• Iterative and interactive: Potter’s Wheel, for example, is a publicly
available data cleaning tool that integrates discrepancy detection
and transformation.
2. Data Integration
• Data integration Introduction:
✓Combines data from multiple sources into a coherent store.
✓Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set.
✓Integrate metadata from different sources
✓Semantic heterogeneity and structure of data pose great
challenges in data integration.
1) Entity Identification Problem
2) Redundancy and Correlation Analysis
3) Tuple Duplication
4) Data Value Conflict Detection and Resolution
Data Integration: Entity identification problem
• Identify real world entities from multiple data sources.
• Schema integration: e.g., customer_id & cust_number

• Metadata for each attribute include the name, meaning, data type,
and range of values permitted for the attribute, and null rules for
handling blank, zero, or null values.
• Such metadata can be used to help avoid errors in schema
integration.
Data Integration: Redundancy and Correlation Analysis
• Redundancy is another important issue in data integration. An
attribute (such as annual revenue, for instance) may be redundant if it
can be derived from another attribute or set of attributes.
• Inconsistencies in attribute or dimension naming can also cause
redundancies in the resulting data set.
• Some redundancies can be detected by correlation analysis. Given
two attributes, such analysis can measure how strongly one attribute
implies the other, based on the available data.
a) χ2 (chi-square) test for Nominal Data
b) Correlation Coefficient for Numeric Data
c) Covariance of Numeric Data
Data Integration: Redundancy and Correlation Analysis
a) χ2 (chi-square) test for Nominal Data
• Correlation relationship between two attributes, A and B, can be
discovered by a χ2 (chi-square) test.
• Suppose A has c distinct values, namely a1,a2,…,ac.
• B has r distinct values, namely b1, b2,…,br.
• The c values of A making up the columns and the r values of B making
up the rows.

*Note: Are gender and preferred reading correlated?
Data Integration: Redundancy and Correlation Analysis
a) χ2 (chi-square) test for Nominal Data
• The χ2 value (also known as the Pearson χ2 statistic) is computed as:
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
• where $o_{ij}$ is the observed frequency (i.e., actual count) of the joint
event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$, which can be
computed as:
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
• where n is the number of data tuples.
• count(A = ai) is the number of tuples having value ai for A.
• count(B = bj) is the number of tuples having value bj for B.
• In any row, the sum of the expected frequencies must equal the total observed frequency for that row, and the sum of the expected frequencies in any column must also equal the total observed frequency for that column.
Data Integration: Redundancy and Correlation Analysis
a) χ2 (chi-square) test for Nominal Data
• The expected frequencies for the example, computed as count(A = ai) × count(B = bj) / n:

count(A = ai) | count(B = bj) | product | n | expected frequency
300 | 450 | 135000 | 1500 | 90
1200 | 450 | 540000 | 1500 | 360
300 | 1050 | 315000 | 1500 | 210
1200 | 1050 | 1260000 | 1500 | 840
Data Integration: Redundancy and Correlation Analysis
a) χ2 (chi-square) test for Nominal Data
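• The χ2 statistic tests the hypothesis that A and B are independent. The test has (r − 1) × (c − 1) degrees of freedom; for a 2 × 2 table this is 1.
• The further the computed χ2 value lies above the critical value for those degrees of freedom, the stronger the evidence of correlation.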
[Table: chi-square distribution critical values by degrees of freedom]

Note: Since our computed value is above the tabulated critical value, we can reject the hypothesis that
gender and preferred reading are independent and conclude that the two
attributes are (strongly) correlated for the given group of people.
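The computation can be sketched in a few lines. The observed counts below are assumed for illustration, since the contingency table itself did not survive in this text. They were chosen so that the row and column totals are 300, 1200, 450, and 1050 with n = 1500, which yields the expected frequencies 90, 360, 210, and 840.

```python
# Assumed observed counts (gender, preferred reading); not from the text.
observed = {
    ("male", "fiction"): 250,
    ("female", "fiction"): 200,
    ("male", "non_fiction"): 50,
    ("female", "non_fiction"): 1000,
}

def chi_square(obs):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    n = sum(obs.values())
    a_totals, b_totals = {}, {}
    for (a, b), count in obs.items():
        a_totals[a] = a_totals.get(a, 0) + count
        b_totals[b] = b_totals.get(b, 0) + count
    # e_ij = count(A = a_i) * count(B = b_j) / n
    expected = {
        (a, b): a_totals[a] * b_totals[b] / n
        for a in a_totals for b in b_totals
    }
    stat = sum((obs.get(k, 0) - e) ** 2 / e for k, e in expected.items())
    dof = (len(a_totals) - 1) * (len(b_totals) - 1)
    return stat, dof, expected

stat, dof, expected = chi_square(observed)
```

Under these assumed counts the statistic is about 507.9 with 1 degree of freedom, far above any commonly tabulated critical value.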
Data Integration: Redundancy and Correlation Analysis
b) Correlation Coefficient for Numeric Data
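• We can evaluate the correlation between two attributes, A and B, by
computing the correlation coefficient (also known as Pearson’s
product moment coefficient):
$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}$$
• where n is the number of tuples, $a_i$ and $b_i$ are the respective values of
A and B in tuple i.
• $\bar{A}$ and $\bar{B}$ are the respective mean values of A and B.
• $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B.
• $\sum (a_i b_i)$ is the sum of the AB cross-product.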
Data Integration: Redundancy and Correlation Analysis
b) Correlation Coefficient for Numeric Data
• The value of rA,B always lies between −1 and +1.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
The higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship); this alone does not prove that they are independent.
• rA,B < 0: A and B are negatively correlated (A’s values decrease as B’s increase).
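A small sketch of the correlation coefficient, using the cross-product form and standard deviations that divide by n. The function name and the sample vectors are illustrative only.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations (divide by n).
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b))  # sum of the AB cross-product
    return (cross - n * mean_a * mean_b) / (n * sigma_a * sigma_b)

rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly positive
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly negative
```

The function assumes neither attribute is constant, because a zero standard deviation would make the ratio undefined.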
Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from –1 to 1]
Data Integration: Redundancy and Correlation Analysis
c) Covariance of Numeric Data
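• Consider two numeric attributes A and B, and a set of n observations
{(a1,b1),…,(an,bn)}. The mean values of A and B, respectively, are also
known as the expected values on A and B, that is,
$$E(A) = \bar{A} = \frac{\sum_{i=1}^{n} a_i}{n}, \qquad E(B) = \bar{B} = \frac{\sum_{i=1}^{n} b_i}{n}$$
• The covariance between A and B is defined as:
$$Cov(A,B) = E((A - \bar{A})(B - \bar{B})) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}$$
• If we compare $r_{A,B}$ (correlation coefficient) with covariance, we see that
$$r_{A,B} = \frac{Cov(A,B)}{\sigma_A \sigma_B}$$
• The covariance can also be shown as
$$Cov(A,B) = E(A \cdot B) - \bar{A}\bar{B}$$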
Data Integration: Redundancy and Correlation Analysis
c) Covariance of Numeric Data
• Therefore, given the positive covariance we can say that stock prices for both companies rise together.
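The worked stock-price table is not preserved in this text, so the sketch below uses assumed prices for two hypothetical companies. It computes the covariance both from the definition and from the shortcut E(A·B) − ĀB̄ to show that the two forms agree.

```python
def covariance(a, b):
    """Return the covariance computed two ways: definition and shortcut."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    by_definition = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    by_shortcut = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
    return by_definition, by_shortcut

# Assumed weekly stock prices for two companies (illustrative values only).
company_a = [6, 5, 4, 3, 2]
company_b = [20, 10, 14, 5, 5]
cov_def, cov_short = covariance(company_a, company_b)
```

For these assumed prices both forms give a covariance of 7, which is positive, so the two price series tend to rise together.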
Data Integration: Tuple Duplication
• In addition to detecting redundancies between attributes, duplication
should also be detected at the tuple level (e.g., where there are two
or more identical tuples for a given unique data entry case).
• For example, if a purchase order database contains attributes for the
purchaser’s name and address instead of a key to this information in a
purchaser database, discrepancies can occur, such as the same
purchaser’s name appearing with different addresses within the
purchase order database.
Data Integration: Data Value Conflict Detection and Resolution
• For the same real-world entity, attribute values from different sources may differ.
This may be due to differences in representation, scaling, or encoding.
• For instance, a weight attribute may be stored in metric units in one
system and British imperial units in another.
• For a hotel chain, the price of rooms in different cities may involve not
only different currencies but also different services (e.g., free
breakfast) and taxes.
• When exchanging information between schools, for example, each
school may have its own curriculum and grading scheme. One
university may adopt a quarter system, offer three courses on
database systems, and assign grades from A+ to F, whereas another
may adopt a semester system, offer two courses on databases, and
assign grades from 1 to 10.
3. Data Reduction
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
• Mining on the reduced data set should be more efficient yet produce
the same (or almost the same) analytical results.
• Data reduction strategies include
1) Dimensionality reduction
2) Numerosity reduction
3) Data compression.
3. Data Reduction
• Dimensionality reduction is the process of reducing the number of
random variables or attributes under consideration.
• Dimensionality reduction methods include wavelet transforms and
principal components analysis (PCA).
• Attribute subset selection is a method of dimensionality reduction in
which irrelevant, weakly relevant, or redundant attributes or
dimensions are detected and removed.
3. Data Reduction
• Numerosity reduction techniques replace the original data volume by
alternative, smaller forms of data representation.
• Numerosity reduction techniques are parametric or nonparametric.
a) For parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored, instead of
the actual data. (Outliers may also be stored.)
• Regression and log-linear models are examples.
b) Nonparametric methods for storing reduced representations of the
data include histograms, clustering, sampling, and data cube
aggregation.
3. Data Reduction
• Data Compression transformations are applied so as to obtain a
reduced or “compressed” representation of the original data.
• Lossless: If the original data can be reconstructed from the
compressed data without any information loss.
• Lossy: If we can reconstruct only an approximation of the original
data.
• There are several lossless algorithms for string compression; however,
they typically allow only limited data manipulation.
• Dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression.
1) Dimensionality Reduction: Wavelet Transforms
• The discrete wavelet transform (DWT) is a linear signal processing
technique used for multi-resolution analysis.
• When applied to a data vector X, it transforms X to a numerically different vector X′ of wavelet coefficients.
• The two vectors are of the same length.
• The usefulness lies in the fact that the wavelet transformed data can
be truncated.
• A compressed approximation of the data can be retained by storing
only a small fraction of the strongest of the wavelet coefficients.
• The resulting data representation is very sparse, so that operations
that can take advantage of data sparsity are computationally very fast
if performed in wavelet space.
1) Dimensionality Reduction: Wavelet Transforms
• The technique also works to remove noise without smoothing out the
main features of the data, making it effective for data cleaning as
well.
• Popular wavelet transforms include the Haar-2, Daubechies-4, and
Daubechies-6.
• The general procedure for applying a discrete wavelet transform uses
a hierarchical pyramid algorithm that halves the data at each
iteration, resulting in fast computational speed.
1) Dimensionality Reduction: Wavelet Transforms
• Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
• Method:
i. The length, L, of the input data vector must be an integer power of 2.
This condition can be met by padding the data vector with zeros as
necessary (L ≥ n).
ii. Each transform has 2 functions: smoothing: such as a weighted
average, difference: acts to bring out the detailed features of the
data.
iii.The two functions are applied to pairs of data points in X. This results
in two data sets of length L/2.
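iv. The two functions are applied recursively to the data sets obtained in the
previous iteration, until the resulting data sets reach the desired length.
v. Selected values are designated the wavelet coefficients of the
transformed data.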
1) Dimensionality Reduction: Wavelet Transforms
• Wavelets: A math tool for space-efficient hierarchical decomposition
of functions
• S = [2, 2, 0, 2, 3, 5, 4, 4] can be transformed to
S̄ = [2.75, −1.25, 0.5, 0, 0, −1, −1, 0]
• Compression: many small detail coefficients can be replaced by 0’s,
and only the significant coefficients are retained.
1) Dimensionality Reduction: Wavelet Transforms- Haar
[Figure: hierarchical decomposition structure (a.k.a. “error tree”) for the original frequency distribution 2, 2, 0, 2, 3, 5, 4, 4, showing the coefficients 2.75, −1.25, 0.5, 0, 0, −1, −1, 0 and their “supports”]
1) Dimensionality Reduction: Wavelet Transforms
• Example:
• S = [56, 40, 8, 24, 48, 48, 40, 16] can be transformed to S̄ = [35, −3, 16,
10, 8, −8, 0, 12] or S̄ = [35, 0, 16, 10, 8, 0, 0, 12]
• To get the original values back from an average and its coefficient:
➢First value = average + coefficient
➢Second value = average − coefficient
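A minimal sketch of the Haar transform as used in the two examples: pairwise averages for smoothing and half-differences for detail, applied recursively. The function names are illustrative, and the sketch assumes the input length is already a power of 2.

```python
def haar_transform(data):
    """Return [overall average, detail coefficients from coarse to fine]."""
    n = len(data)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    current = [float(v) for v in data]
    details = []
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        differences = [(current[i] - current[i + 1]) / 2 for i in pairs]
        details = differences + details  # coarser levels go in front
        current = averages
    return current + details

def haar_inverse(coeffs):
    """Rebuild the data: first = average + coeff, second = average - coeff."""
    current = [coeffs[0]]
    pos = 1
    while pos < len(coeffs):
        level = coeffs[pos:pos + len(current)]
        rebuilt = []
        for average, detail in zip(current, level):
            rebuilt.extend([average + detail, average - detail])
        pos += len(current)
        current = rebuilt
    return current

coeffs_a = haar_transform([2, 2, 0, 2, 3, 5, 4, 4])
coeffs_b = haar_transform([56, 40, 8, 24, 48, 48, 40, 16])
```

Setting small detail coefficients to 0 before calling `haar_inverse` gives the lossy, compressed approximation described above.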
1) Dimensionality Reduction: Wavelet Transforms
• Use hat-shape filters
• Emphasize region where points cluster
• Suppress weaker information in their boundaries
• Effective removal of outliers
• Insensitive to noise, insensitive to input order
• Multi-resolution
• Detect arbitrary shaped clusters at different scales
• Efficient
• Complexity O(N)
• Only applicable to low dimensional data
1) Dimensionality Reduction: PCA
• Principal components analysis (also called the Karhunen-Loeve, or K-L
method) searches for k n-dimensional orthogonal vectors that can
best be used to represent the data, where k ≤ n.
• Find a projection that captures the largest amount of variation in data
• The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: data points plotted on axes x1 and x2]
1) Dimensionality Reduction: PCA
• The original data are thus projected onto a much smaller space,
resulting in dimensionality reduction.
➢Normalize input data: Each attribute falls within the same range.
➢Compute k orthonormal (unit) vectors, i.e., principal components.
➢Each input data (vector) is a linear combination of the k principal
component vectors.
➢The principal components are sorted in order of decreasing
“significance” or strength.
➢Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
1) Dimensionality Reduction: PCA
• Steps for PCA:
• Step 1: Standardization (or Data normalization):
➢The input data are normalized, so that each attribute falls within the
same range.
➢Normalization helps ensure that attributes with large domains will not
dominate attributes with smaller domains.
✓ Calculate mean.
1) Dimensionality Reduction: PCA
• Steps for PCA:
• Step 2: Covariance matrix (Find Covariance Matrix to identify
correlation):
➢This step calculates the covariance matrix of the standardized data.
➢Covariance matrix shows how each variable is related to every other
variable in the dataset.
▪ If an entry of the covariance matrix is positive, then the two
corresponding variables are positively correlated.
▪ If an entry of the covariance matrix is negative, then the two
corresponding variables are inversely correlated.
▪ If an entry of the covariance matrix is zero, then the two corresponding
variables are not correlated.
1) Dimensionality Reduction: PCA
• Step 3: Calculate eigen value of the Covariance matrix:
➢Characteristic equation is used to find eigen values.
det(A- λI) = 0
➢where I is the identity matrix and det(B) is the determinant of the
matrix B.
➢For a 2 × 2 matrix, expanding the determinant gives a quadratic
equation in λ, which can be solved to find λ.
➢The solutions λ of the characteristic equation are the eigenvalues.
➢ An eigenvalue is a number representing the amount of variance
present in the data for a given direction.
➢N by N matrix has N eigen values.
1) Dimensionality Reduction: PCA
• Step 4: Find eigen vectors of the Covariance matrix:
[A- λI]X = 0 or AX= λX
➢The eigenvectors represent the directions in which the data varies
the most, while the eigenvalues represent the amount of variation
along each eigenvector.
➢Each eigenvector has its corresponding eigenvalue.
1) Dimensionality Reduction: PCA
• Step 5: Find the unit (or feature) eigenvectors:
$$\|X\| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2}$$
➢Find the magnitude ||X|| of the eigenvector X.
➢Divide each element of the eigenvector by ||X||.
➢At the end we get the final eigenvector E1. Similarly, we can find the final
eigenvector for every λ.
1) Dimensionality Reduction: PCA
• Step 6: Find Principal components:
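$$PC_i = E_i^T \cdot \begin{bmatrix} X_{1j} - \bar{X}_1 \\ X_{2j} - \bar{X}_2 \\ \vdots \\ X_{mj} - \bar{X}_m \end{bmatrix}$$
where $X_{kj}$ is the value of attribute $X_k$ for data point j and $\bar{X}_k$ is the mean of $X_k$.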
➢Transpose the final eigen vector and multiply with corresponding
values of the original given data.
➢After multiply both the matrix, we will get the principal component
corresponding to each point.
1) Dimensionality Reduction: PCA
• Step 7: Transform the data (Visualization):
➢The final step is to transform the original data into the lower-
dimensional space defined by the principal components.
➢The transformation does not modify the original data itself but
instead provides a new perspective to better represent the data.
1) Dimensionality Reduction: PCA
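• Example 1:

Example | X1 | X2
E1 | 4 | 11
E2 | 8 | 4
E3 | 13 | 5
E4 | 7 | 14

• Step 1: Standardization (or Data normalization):
➢Calculate mean: $\bar{X}_1 = 8$, $\bar{X}_2 = 8.5$
• Step 2: Find Covariance matrix (sample covariance, dividing by n − 1 = 3):
$$\begin{bmatrix} 14 & -11 \\ -11 & 23 \end{bmatrix}$$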
1) Dimensionality Reduction: PCA
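• Example 1 continued…
• Step 3: Calculate the eigenvalues of the covariance matrix $A = \begin{bmatrix} 14 & -11 \\ -11 & 23 \end{bmatrix}$:
$$\det(A - \lambda I) = 0$$
$$\begin{vmatrix} 14 - \lambda & -11 \\ -11 & 23 - \lambda \end{vmatrix} = 0$$
$$(14 - \lambda)(23 - \lambda) - (-11 \times -11) = 0$$
$$\lambda^2 - 37\lambda + 201 = 0$$
$$\lambda_1 = 30.3849, \qquad \lambda_2 = 6.6151$$
• λ1 and λ2 are our eigenvalues.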
1) Dimensionality Reduction: PCA
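• Example 1 continued…
• Step 4: Find the eigenvectors of the covariance matrix:
$$[A - \lambda I]X = 0 \quad \text{or} \quad AX = \lambda X$$
$$\begin{bmatrix} 14 - \lambda & -11 \\ -11 & 23 - \lambda \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} = 0$$
$$\begin{bmatrix} (14 - \lambda)X_1 - 11X_2 \\ -11X_1 + (23 - \lambda)X_2 \end{bmatrix} = 0$$
$$(14 - \lambda)X_1 - 11X_2 = 0 \quad \text{(equation 1)}$$
$$-11X_1 + (23 - \lambda)X_2 = 0 \quad \text{(equation 2)}$$
Simplification of equation 1:
$$\frac{X_1}{X_2} = \frac{11}{14 - \lambda}$$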
1) Dimensionality Reduction: PCA
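• Example 1 continued… (recall that $\frac{X_1}{X_2} = \frac{11}{14 - \lambda}$)
• Step 5: Find the unit (or feature) eigenvectors:
$$\|X\| = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} = \sqrt{11^2 + (14 - \lambda_1)^2} = 19.7348$$
$$E_1 = \begin{bmatrix} 11 / 19.7348 \\ (14 - \lambda_1) / 19.7348 \end{bmatrix} = \begin{bmatrix} 0.5574 \\ -0.8303 \end{bmatrix}$$
$$E_2 = \begin{bmatrix} 11 / 13.2490 \\ (14 - \lambda_2) / 13.2490 \end{bmatrix} = \begin{bmatrix} 0.8303 \\ 0.5574 \end{bmatrix}$$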
1) Dimensionality Reduction: PCA
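• Example 1 continued…
• Step 6: Find the principal components:
$$PC_1 = E_1^T \cdot \begin{bmatrix} X_1 - \bar{X}_1 \\ X_2 - \bar{X}_2 \end{bmatrix} = \begin{bmatrix} 0.5574 & -0.8303 \end{bmatrix} \cdot \begin{bmatrix} X_1 - \bar{X}_1 \\ X_2 - \bar{X}_2 \end{bmatrix}$$
$$PC_1 = 0.5574(X_1 - \bar{X}_1) - 0.8303(X_2 - \bar{X}_2)$$
For example E1: $PC_1 = 0.5574(4 - 8) - 0.8303(11 - 8.5) = -4.3053$

Example | X1 | X2 | PC1
E1 | 4 | 11 | −4.3053
E2 | 8 | 4 | 3.7363
E3 | 13 | 5 | 5.6930
E4 | 7 | 14 | −5.1240

with $\bar{X}_1 = 8$ and $\bar{X}_2 = 8.5$.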
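The steps of Example 1 can be checked with a short sketch that follows the same route: means, sample covariance with n − 1, the quadratic characteristic equation, and the unit eigenvector. It is written for the two-attribute case only and is not a general PCA routine; the variable names are illustrative.

```python
import math

# Data from Example 1: four examples with attributes X1 and X2.
data = [(4, 11), (8, 4), (13, 5), (7, 14)]
n = len(data)

# Step 1: means and mean-centered data.
mean1 = sum(x1 for x1, _ in data) / n
mean2 = sum(x2 for _, x2 in data) / n
centered = [(x1 - mean1, x2 - mean2) for x1, x2 in data]

# Step 2: sample covariance matrix [[s11, s12], [s12, s22]] (divide by n - 1).
s11 = sum(d1 * d1 for d1, _ in centered) / (n - 1)
s22 = sum(d2 * d2 for _, d2 in centered) / (n - 1)
s12 = sum(d1 * d2 for d1, d2 in centered) / (n - 1)

# Step 3: eigenvalues from lambda^2 - trace * lambda + det = 0.
trace = s11 + s22
det = s11 * s22 - s12 ** 2
root = math.sqrt(trace ** 2 - 4 * det)
lam1 = (trace + root) / 2
lam2 = (trace - root) / 2

def unit_eigenvector(lam):
    """Steps 4 and 5: solve (s11 - lam) x1 + s12 x2 = 0, then normalize.

    Assumes s12 != 0, as in this example.
    """
    v1, v2 = -s12, s11 - lam
    norm = math.hypot(v1, v2)
    return (v1 / norm, v2 / norm)

e1 = unit_eigenvector(lam1)
e2 = unit_eigenvector(lam2)

# Step 6: first principal component score of every example.
pc1 = [e1[0] * d1 + e1[1] * d2 for d1, d2 in centered]
```

Small differences in the last decimal place relative to the slide values come from rounding the intermediate results by hand.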
1) Dimensionality Reduction: Attribute Subset Selection
Data sets for analysis may contain hundreds of attributes, many of
which may be irrelevant to the mining task or redundant.
• Redundant attributes.
➢Duplicate much or all of the information contained in one or more
other attributes
➢E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
➢Contain no information that is useful for the data mining task at
hand
➢E.g., students' ID is often irrelevant to the task of predicting
students' GPA
• The added volume of irrelevant or redundant attributes can slow down
the mining process and can result in discovered patterns of poor quality.
1) Dimensionality Reduction: Attribute Subset Selection
• Attribute subset selection is also called feature subset selection.
• Attribute subset selection reduces the data set size by removing
irrelevant or redundant attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes.
• It reduces the number of attributes appearing in the discovered
patterns, helping to make the patterns easier to understand.
1) Dimensionality Reduction: Attribute Subset Selection
• For n attributes, there are 2^n possible subsets (each attribute is either taken or not taken).
• An exhaustive search for the optimal subset of attributes can be
prohibitively expensive, especially as n and the number of data
classes increase.
• So, heuristic methods that explore a reduced search space are
commonly used for attribute subset selection.
• Heuristic methods are typically greedy in that, while searching
through attribute space, they always make what looks to be the best
choice at the time.
• Their strategy is to make a locally optimal choice in the hope that this
will lead to a globally optimal solution.
• Such greedy methods are effective in practice and may come close to
estimating an optimal solution.
1) Dimensionality Reduction: Attribute Subset Selection
• The “best” (and “worst”) attributes are typically determined using
tests of statistical significance, which assume that the attributes are
independent of one another.
• Many other attribute evaluation measures can be used such as the
information gain measure used in building decision trees for
classification.
1) Dimensionality Reduction: Attribute Subset Selection
Figure: Greedy (heuristic) methods for attribute subset selection.
1) Dimensionality Reduction: Attribute Subset Selection
• 1. Stepwise forward selection: The procedure starts with an empty
set of attributes as the reduced set. The best of the original attributes
is determined and added to the reduced set. At each subsequent step,
the best of the remaining attributes is added to the set.
• 2. Stepwise backward elimination: The procedure starts with the full
set of attributes. At each step, it removes the worst attribute
remaining in the set.
• 3. Combination of forward selection and backward elimination: At each
step, the procedure selects the best attribute and removes the worst from
among the remaining attributes.
1) Dimensionality Reduction: Attribute Subset Selection
• 4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5,
and CART) were originally intended for classification.
➢Decision tree induction constructs a flowchart like structure where
each internal (non-leaf) node denotes a test on an attribute.
➢Each branch corresponds to an outcome of the test, and each
external (leaf) node denotes a class prediction.
➢At each node, the algorithm chooses the best attribute to partition
the data into individual classes.

• The stopping criteria for the methods may vary.


• The procedure may employ a threshold on the measure used to
determine when to stop the attribute selection process.
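A sketch of stepwise forward selection with a threshold as the stopping criterion. The scoring function, the attribute names A1 to A6, and their weights are all hypothetical; in practice the score would come from a statistical significance test or a measure such as information gain.

```python
def forward_selection(attributes, score, min_gain=0.0):
    """Greedily add the attribute that improves score(subset) the most.

    Stops when the best remaining attribute improves the score by no
    more than min_gain (the threshold on the evaluation measure).
    """
    selected = []
    best_score = score(selected)
    remaining = list(attributes)
    while remaining:
        candidates = [(score(selected + [a]), a) for a in remaining]
        candidate_score, candidate = max(candidates, key=lambda t: t[0])
        if candidate_score - best_score <= min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

# Hypothetical usefulness weights, with a small penalty per attribute.
WEIGHTS = {"A1": 0.5, "A2": 0.0, "A3": 0.0, "A4": 0.3, "A5": 0.0, "A6": 0.2}

def toy_score(subset):
    return sum(WEIGHTS[a] for a in subset) - 0.01 * len(subset)

reduced = forward_selection(list(WEIGHTS), toy_score)
```

Because each step only makes the locally best choice, the result is not guaranteed to be the globally optimal subset.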
2) Numerosity reduction: Regression and log-linear models
• In (simple) linear regression, the data are modeled to fit a straight line.
• Linear regression finds the linear relationship between the dependent
variable and the independent variable using a best-fit straight line.
• It aims to find the best-fitting line for predicting future values.
• The slope indicates the rate of change of the dependent variable per unit
change in the independent variable.
• The intercept indicates the value of the dependent variable when the
independent variable is zero.
2) Numerosity reduction: Regression and log-linear models
𝑦 = 𝑤𝑥 + 𝑏
• x is an Independent numeric attribute.
• y is a dependent numeric attribute.
• w is the gradient or slope of the line.
• b is the intercept or height at which line
crosses the y axis.
• The coefficients w and b are called regression coefficients.
2) Numerosity reduction: Regression and log-linear models
$$y = wx + b$$
$$w = \frac{Cov(x, y)}{Var(x)} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
$$b = \bar{y} - w \cdot \bar{x}$$
2) Numerosity reduction: Regression and log-linear models
$$y = wx + b$$

x | y | mean x | mean y | deviation x | deviation y | product of deviations | square of deviation x
8 | 10 | 12 | 16 | −4 | −6 | 24 | 16
10 | 13 | 12 | 16 | −2 | −3 | 6 | 4
12 | 16 | 12 | 16 | 0 | 0 | 0 | 0
14 | 19 | 12 | 16 | 2 | 3 | 6 | 4
16 | 22 | 12 | 16 | 4 | 6 | 24 | 16
Sum: 60 | 80 | | | | | 60 | 40

$$w = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{60}{40} = 1.5$$
$$b = \bar{y} - w \cdot \bar{x} = 16 - 1.5 \times 12 = -2$$
• If the value of x is 18, find the value of y.
• If the value of the independent attribute is 20, find the value of the
dependent attribute.
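The worked example above can be checked with a short sketch. It fits the line by the sum-of-deviations formula and then evaluates it at x = 18 and x = 20. The function name `fit_line` is an illustrative choice, not part of the original slides. Note that for numerosity reduction only the two coefficients w and b need to be stored instead of the data points.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b using the deviation formulas."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Numerator: sum of products of deviations; denominator: sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / sxx
    b = y_bar - w * x_bar
    return w, b

# Data from the worked example.
xs = [8, 10, 12, 14, 16]
ys = [10, 13, 16, 19, 22]

w, b = fit_line(xs, ys)
print(w, b)        # 1.5 -2.0
print(w * 18 + b)  # 25.0
print(w * 20 + b)  # 28.0
```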
2) Numerosity reduction: Regression and log-linear models
• Log-linear models approximate discrete multidimensional probability
distributions.
• Given a set of tuples in n dimensions (e.g., described by n attributes),
we can consider each tuple as a point in an n-dimensional space.
• Log-linear models can be used to estimate the probability of each
point in a multidimensional space for a set of discretized attributes,
based on a smaller subset of dimensional combinations.
• This allows a higher-dimensional data space to be constructed from
lower-dimensional spaces.
• Log-linear models use a backward elimination procedure to remove
higher-dimensional terms, and can therefore also be used to reduce
dimensionality.
2) Numerosity reduction: Histograms
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• The histogram for an attribute A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–value/frequency pair,
the buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given
attribute.
2) Numerosity reduction: Histograms
• Example: A list of prices for commonly sold items is given.
• The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10,
12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30.
• The figure shows a histogram for the data
using singleton buckets.
• To further reduce the data, it is
common to have each bucket denote a
continuous value range for the given
attribute.
• In the figure, each bucket represents a
different price range of width 10
(1–10, 11–20, 21–30).
2) Numerosity reduction: Histograms
• How are the buckets determined and the attribute values partitioned?
• There are several partitioning rules, including the following:
➢Equal-width: In an equal-width histogram, the width of each bucket
range is uniform (e.g., the width of 10 for the buckets in Figure).
➢Equal-frequency (or equal-depth): In an equal-frequency histogram,
the buckets are created so that, roughly, the frequency of each
bucket is constant (i.e., each bucket contains roughly the same
number of contiguous data samples).
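The two partitioning rules can be sketched on the price list from the example. This is a minimal illustration, assuming buckets that start at 1 for the equal-width case and contiguous, near-equal slices of the sorted data for the equal-frequency case. The function names are hypothetical.

```python
from collections import Counter

# Sorted price list from the example, written as value * frequency.
prices = ([1] * 2 + [5] * 5 + [8] * 2 + [10] * 4 + [12] + [14] * 3 + [15] * 6
          + [18] * 8 + [20] * 7 + [21] * 4 + [25] * 5 + [28] * 2 + [30] * 3)

def equal_width(values, width):
    """Count values per bucket of uniform width; the first bucket starts at 1."""
    counts = Counter((v - 1) // width for v in values)
    return {(k * width + 1, (k + 1) * width): counts[k] for k in sorted(counts)}

def equal_frequency(values, k):
    """Split the sorted values into k contiguous buckets of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]

singleton = Counter(prices)          # one bucket per distinct price
width_10 = equal_width(prices, 10)   # {(1, 10): 13, (11, 20): 25, (21, 30): 14}
depth_4 = equal_frequency(prices, 4)

print(width_10)
print([(b[0], b[-1], len(b)) for b in depth_4])
```

In this simple equal-frequency sketch, identical values can fall on both sides of a bucket boundary (for example, 18 appears in two adjacent buckets). A practical implementation would usually shift the boundary so that ties stay together.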
2) Numerosity reduction: Histograms
• Histograms are highly effective at approximating both sparse and
dense data, as well as highly skewed and uniform data.
• The histograms described before for single attributes can be extended
for multiple attributes.
• Multidimensional histograms can capture dependencies between
attributes.
• These histograms have been found effective in approximating data
with up to five attributes.
• More studies are needed regarding the effectiveness of
multidimensional histograms for high dimensionalities.
• Singleton buckets are useful for storing high-frequency outliers.
2) Numerosity reduction: Clustering
• Clustering techniques partition the
objects into groups, or clusters, so that
objects within a cluster are similar to
one another and dissimilar to objects in
other clusters.
• Similarity is commonly defined in terms
of how close the objects are in space,
based on a distance function.
• The quality of a cluster may be
represented by its diameter, the
maximum distance between any two
objects in the cluster.
2) Numerosity reduction: Clustering
• Figure shows a 2-D plot of customer
data with respect to customer locations
in a city. Three data clusters are visible.
• Types of clustering models:
i. Centroid models: K-Means, K-Medoids.
ii. Hierarchical clustering: Agglomerative
Hierarchical Clustering, BIRCH (Balanced
Iterative Reducing and Clustering using
Hierarchies).
iii. Distribution (probability-based) models:
Expectation-Maximization, Gaussian
distribution.
iv. Density models: DBSCAN (Density-Based
Spatial Clustering of Applications with
Noise), OPTICS (Ordering Points to Identify
the Clustering Structure).
2) Numerosity reduction: Clustering
• In data reduction, the cluster representations of the data are used to
replace the actual data.
• The effectiveness of this technique depends on the data’s nature.
• It is much more effective for data that can be organized into distinct
clusters than for smeared data.
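As a rough sketch of how a cluster representation can replace the actual data, the snippet below summarizes each cluster by its centroid, its diameter (maximum pairwise distance), and its size. The points and the cluster assignment are made-up illustrative values, and the clusters are assumed to have been found already by some clustering algorithm.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Component-wise mean of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def diameter(points):
    """Maximum distance between any two objects in the cluster."""
    return max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)

# Hypothetical 2-D customer locations, already grouped into clusters.
clusters = {
    "A": [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)],
    "B": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
}

# Reduced representation: one (centroid, diameter, size) triple per cluster.
summary = {name: (centroid(pts), diameter(pts), len(pts))
           for name, pts in clusters.items()}
for name, rep in summary.items():
    print(name, rep)
```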
2) Numerosity reduction: Sampling
• Sampling can be used as a data reduction technique because it allows
a large data set to be represented by a much smaller random data
sample (or subset).
• Suppose that a large data set, D, contains N tuples.
➢Simple random sample without replacement (SRSWOR) of size s:
This is created by drawing s of the N tuples from D, where the
probability of drawing any tuple in D is 1/N, that is, all tuples are
equally likely to be sampled.
➢Simple random sample with replacement
(SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from
D, it is recorded and then replaced. That is,
after a tuple is drawn, it is placed back in D
so that it may be drawn again.
2) Numerosity reduction: Sampling
➢Cluster sample: If the tuples in D are grouped into M mutually
disjoint clusters, then an SRS of s clusters can be obtained.
➢For example, tuples in a database are usually retrieved a page at a
time, so that each page can be considered a cluster.
➢A reduced data representation can be obtained by applying, say,
SRSWOR to the pages, resulting in a cluster sample of the tuples.
➢Other clustering criteria conveying rich semantics can also be
explored.
2) Numerosity reduction: Sampling
➢Stratified sample: D is divided into mutually disjoint parts called
strata.
➢A stratified sample of D is generated by obtaining an SRS at each
stratum.
➢Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data).
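The four sampling schemes can be sketched with Python's standard `random` module. This is a minimal illustration on a made-up data set of 100 integer "tuples". The page size, the stratum key, and the sampling fraction are assumptions chosen for the example.

```python
import random

def srswor(data, s, rng):
    """Simple random sample without replacement: s distinct tuples."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return [rng.choice(data) for _ in range(s)]

def cluster_sample(pages, s, rng):
    """SRSWOR of s whole clusters (here: pages); keep every tuple in them."""
    return [t for page in rng.sample(pages, s) for t in page]

def stratified_sample(data, key, fraction, rng):
    """Draw roughly the same fraction from every stratum (at least one tuple)."""
    strata = {}
    for t in data:
        strata.setdefault(key(t), []).append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)                               # fixed seed for repeatability
D = list(range(100))                                 # hypothetical data set, N = 100
pages = [D[i:i + 10] for i in range(0, len(D), 10)]  # 10 pages of 10 tuples each

s1 = srswor(D, 10, rng)
s2 = srswr(D, 10, rng)
s3 = cluster_sample(pages, 2, rng)
s4 = stratified_sample(D, key=lambda t: t % 4, fraction=0.2, rng=rng)
print(len(s1), len(s2), len(s3), len(s4))
```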
2) Numerosity reduction: Data Cube Aggregation
• For example, suppose the data consist of the sales per quarter for the
years 2008 to 2010.
• However, you are interested in the annual sales (total per year),
rather than the total per quarter.
• Thus, the data can be aggregated so that the resulting data
summarize the total sales per year instead of per quarter.
• The resulting data is smaller in volume, without loss of information
necessary for the analysis task.
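The aggregation step above can be sketched as a simple roll-up from quarters to years. The sales figures below are invented purely for illustration and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical quarterly sales, keyed by (year, quarter).
quarterly_sales = {
    (2008, "Q1"): 100, (2008, "Q2"): 200, (2008, "Q3"): 300, (2008, "Q4"): 400,
    (2009, "Q1"): 150, (2009, "Q2"): 250, (2009, "Q3"): 350, (2009, "Q4"): 450,
    (2010, "Q1"): 200, (2010, "Q2"): 300, (2010, "Q3"): 400, (2010, "Q4"): 500,
}

# Roll up along the time dimension: sum the four quarters of each year.
annual_sales = defaultdict(int)
for (year, _quarter), amount in quarterly_sales.items():
    annual_sales[year] += amount

print(dict(annual_sales))  # 12 quarterly values reduced to 3 annual totals
```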
4. Data Transformation
• In data transformation, the data are transformed or consolidated into
forms appropriate for mining, so that the resulting mining process may
be more efficient and the patterns found may be easier to understand.
• Strategies or methods for data transformation
➢1. Smoothing: which works to remove noise from the data.
Techniques include binning, regression, and clustering.
➢2. Attribute construction (or feature construction): where new
attributes are constructed and added from the given set of
attributes to help the mining process.
4. Data Transformation
➢ 3. Aggregation: where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total amounts.
This step is typically used in constructing a data cube for data
analysis at multiple abstraction levels.
➢ 4. Normalization: where the attribute data are scaled so as to fall
within a smaller range, such as −1.0 to 1.0, or 0.0 to 1.0.
➢ 5. Discretization: where the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior). The labels, in turn, can
be recursively organized into higher-level concepts, resulting in a
concept hierarchy for the numeric attribute.
4. Data Transformation
➢ 6. Concept hierarchy generation for nominal data: where
attributes such as street can be generalized to higher-level
concepts, like city or country. Many hierarchies for nominal
attributes are implicit within the database schema and can be
automatically defined at the schema definition level.
• There is much overlap between the major data preprocessing tasks.
The first three of these strategies were discussed earlier.
• Smoothing is a form of data cleaning.
• Attribute construction and aggregation were discussed under data
reduction.
• So, here we concentrate on the remaining three strategies.
4. Data Transformation: Data Transformation by Normalization
• The measurement unit used can affect the data analysis.
• For example, changing measurement units from meters to inches for
height, or from kilograms to pounds for weight, may lead to very
different results.
• To avoid the issues of measurement units, the data must be
normalized or standardized.
• This involves transforming the data to fall within a smaller or common
range such as [−1,1] or [0.0, 1.0].
• Normalizing the data attempts to give all attributes an equal weight.
• There are many methods for data normalization.
• We study min-max normalization, z-score normalization, and
normalization by decimal scaling.
• Let A be a numeric attribute with n observed values, v1, v2, …, vn.
4. Data Transformation: Data Transformation by Normalization
• Min-max normalization performs a linear transformation on the
original data.
• Suppose that minA and maxA are the minimum and maximum values
of an attribute, A.
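• Min-max normalization maps a value, vi, of A to v'i in the range
[new_minA, new_maxA] by computing
\[ v'_i = \frac{v_i - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
• For the attribute income, the minimum and maximum values are $12,000
and $98,000, respectively. We would like to map income to the range [0.0, 1.0].
• A given value of $73,600 for income is transformed to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = 0.716 \]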
4. Data Transformation: Data Transformation by Normalization
• z-score normalization (or zero-mean normalization), the values for an
attribute A, are normalized based on the mean (i.e., average) and
standard deviation.
• A value vi, of A is normalized to v'i by computing
\[ v'_i = \frac{v_i - \bar{A}}{\sigma_A} \]
• where \(\bar{A}\) and σA are the mean and standard deviation, respectively, of
attribute A.
• This method of normalization is useful when the actual minimum and
maximum of attribute A are unknown, or when there are outliers
that dominate the min-max normalization.
4. Data Transformation: Data Transformation by Normalization
• Suppose that the mean and standard deviation of the values for the
attribute income are $54,000 and $16,000, respectively.
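• With z-score normalization, a value of $73,600 for income is
transformed to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
• A variation of this z-score normalization replaces the standard
deviation by the mean absolute deviation of A. The mean absolute
deviation of A, denoted sA, is
\[ s_A = \frac{1}{n}\left(|v_1 - \bar{A}| + |v_2 - \bar{A}| + \cdots + |v_n - \bar{A}|\right) \]
so that a value vi is normalized to
\[ v'_i = \frac{v_i - \bar{A}}{s_A} \]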
4. Data Transformation: Data Transformation by Normalization
• The mean absolute deviation sA, is more robust to outliers than the
standard deviation, σA.
• When computing the mean absolute deviation, the deviations from
the mean (i.e., \(|x_i - \bar{x}|\)) are not squared; therefore, the effect of
outliers is somewhat reduced.
4. Data Transformation: Data Transformation by Normalization
• Normalization by decimal scaling normalizes by moving the decimal
point of values of attribute A.
• The number of decimal points moved depends on the maximum
absolute value of A.
• The value vi of A is normalized to v'i by computing
\[ v'_i = \frac{v_i}{10^j} \]
• where j is the smallest integer such that \(\max(|v'_i|) < 1\).
• Suppose the values of A range from −986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we
therefore divide each value by 1000 (i.e., j = 3) so that −986
normalizes to −0.986 and 917 normalizes to 0.917.
4. Data Transformation: Data Transformation by Normalization
• Note that normalization can change the original data quite a bit,
especially when using z-score normalization or decimal scaling.
• It is also necessary to save the normalization parameters (e.g., the
mean and standard deviation if using z-score normalization) so that
future data can be normalized in a uniform manner.
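The three normalization methods can be sketched as small functions and checked against the income and decimal-scaling examples above. The function names are illustrative. The parameters passed in (minimum, maximum, mean, standard deviation, and j) are exactly the normalization parameters that would need to be saved for future data.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Number of standard deviations that v lies from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making every |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(z_score(73600, 54000, 16000))            # 1.225
scaled, j = decimal_scaling([-986, 917])
print(scaled, j)                               # [-0.986, 0.917] 3
```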
4. Data Transformation: Data Transformation by Discretization
Discretization: Divide the range of a continuous attribute into intervals.
• Interval labels can then be used to replace actual data values.
• Reduce data size by discretization.
• Supervised vs. unsupervised.
• Split (top-down) vs. merge (bottom-up).
• Discretization can be performed recursively on an attribute.
• Prepare for further analysis, e.g., classification.
4. Data Transformation: Data Transformation by Discretization
a) Discretization by Binning (Unsupervised, Top-down)
b) Discretization by Histogram Analysis (Unsupervised, Top-down)
c) Discretization by Cluster Analysis (Unsupervised, Top-down split or Bottom-up merge)
d) Discretization by Decision Tree (Supervised, Top-down)
e) Discretization by Correlation Analysis (Supervised, Bottom-up)
4. Data Transformation: Data Transformation by Discretization
a) Discretization by Binning
• Binning is a top-down splitting technique based on a specified
number of bins.
• Binning methods for data smoothing were discussed previously. These
methods are also used as discretization methods for data reduction
and concept hierarchy generation.
• For example, attribute values can be discretized by applying equal-
width or equal-frequency binning, and then replacing each bin value
by the bin mean or median, as in smoothing by bin means or
smoothing by bin medians, respectively.
• These techniques can be applied recursively to the resulting partitions
to generate concept hierarchies.
• Binning does not use class information and is therefore an
unsupervised discretization technique.
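A minimal sketch of discretization by binning follows, assuming equal-frequency bins and replacement of each value by its bin mean. The nine price values are illustrative, and the helper names are hypothetical. No class labels are used anywhere, which is what makes the technique unsupervised.

```python
from statistics import mean

def equal_frequency_bins(values, k):
    """Split the sorted values into k contiguous bins of near-equal size."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(k)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def interval_labels(bins):
    """Label each bin by the range of values it covers."""
    return [f"{b[0]}-{b[-1]}" for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative values
bins = equal_frequency_bins(prices, 3)
print(bins)                       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_bin_means(bins))  # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(interval_labels(bins))      # ['4-15', '21-24', '25-34']
```

Applying `equal_frequency_bins` again to each resulting bin would give the next, finer level of a concept hierarchy.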
4. Data Transformation: Data Transformation by Discretization
b) Discretization by Histogram Analysis
• Like binning, histogram analysis is an unsupervised discretization
technique.
• A histogram partitions the values of an attribute, A, into disjoint
ranges called buckets or bins.
• In an equal-width histogram, the values are partitioned into equal-size
partitions or ranges.
• In an equal-frequency histogram, the values are partitioned so that,
ideally, each partition contains the same number of data tuples.
4. Data Transformation: Data Transformation by Discretization
b) Discretization by Histogram Analysis
• The histogram analysis algorithm can be applied recursively to each
partition in order to automatically generate a multilevel concept
hierarchy.
• A minimum interval size can also be used per level to control the
recursive procedure.
• This specifies the minimum width of a partition, or the minimum
number of values for each partition at each level.
• Histograms can also be partitioned based on cluster analysis of the
data distribution.
4. Data Transformation: Data Transformation by Discretization
c) Discretization by Cluster
• A clustering algorithm can be applied to discretize a numeric
attribute, A, by partitioning the values of A into clusters or groups.
• Clustering takes the distribution of A into consideration, as well as the
closeness of data points, and therefore is able to produce high-quality
discretization results.
• Clustering can be used to generate a concept hierarchy for A by
following either a top-down splitting strategy or a bottom-up merging
strategy, where each cluster forms a node of the concept hierarchy.
• Each initial cluster or partition may be further decomposed into
several sub-clusters, forming a lower level of the hierarchy.
4. Data Transformation: Data Transformation by Discretization
d) Discretization by Decision Tree
• Decision Tree techniques employ a top-down splitting approach.
• Discretization using a decision tree is a supervised technique.
• For example, we may have a data set of patient symptoms (the
attributes) where each patient has an associated diagnosis class label.
• Class distribution information is used in the calculation and
determination of split-points.
• Entropy is the most commonly used measure for split-points.
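To make the role of class information concrete, the sketch below chooses a single split-point for a numeric attribute by minimizing the weighted entropy of the two resulting partitions. The ages and diagnosis labels are made up, and using midpoints between adjacent distinct values as candidate split-points is an assumption of this sketch.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (split_point, weighted_entropy) with the lowest weighted entropy."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or info < best[1]:
            best = (split, info)
    return best

# Hypothetical patient ages with a diagnosis class label.
ages = [23, 25, 31, 38, 45, 52]
diagnosis = ["no", "no", "no", "yes", "yes", "yes"]

split_point, info = best_split(ages, diagnosis)
print(split_point)  # 34.5
```

Applying `best_split` recursively to each side of the split would produce further intervals, in the top-down manner described above.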
4. Data Transformation: Data Transformation by Discretization
e) Discretization by Correlation Analyses
• Measures of correlation can be used for discretization.
• ChiMerge is a χ2-based discretization method.
• ChiMerge is supervised and uses class information.
• ChiMerge employs a bottom-up approach by finding the best
neighboring intervals and then merging them to form larger intervals,
recursively.
• The relative class frequencies should be fairly consistent within an
interval.
• Therefore, if two adjacent intervals have a very similar distribution of
classes, then the intervals can be merged. Otherwise, they should
remain separate.
4. Data Transformation: Data Transformation by Discretization
e) Discretization by Correlation Analyses
• Each distinct value of a numeric attribute A is considered to be one
interval.
• χ2 tests are performed for every pair of adjacent intervals.
• Adjacent intervals with the least χ2 values are merged together,
because low χ2 values for a pair indicate similar class distributions.
• This merging process proceeds recursively until a predefined stopping
criterion is met.
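The merging loop can be sketched as follows. The stopping criterion used here (a maximum number of intervals) is an assumption chosen for brevity; a χ2 significance threshold is another common choice. The data and function names are illustrative only.

```python
def chi2(a, b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        for j in range(len(a)):
            expected = sum(row) * (a[j] + b[j]) / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (row[j] - expected) ** 2 / expected
    return stat

def chimerge(values, labels, classes, max_intervals):
    """Bottom-up merging of adjacent intervals with the lowest chi-square."""
    counts = {}
    for v, label in zip(values, labels):
        counts.setdefault(v, [0] * len(classes))[classes.index(label)] += 1
    # Start with one interval per distinct value: ([low, high], class counts).
    intervals = [([v, v], counts[v]) for v in sorted(counts)]
    while len(intervals) > max_intervals:
        scores = [chi2(intervals[i][1], intervals[i + 1][1])
                  for i in range(len(intervals) - 1)]
        i = scores.index(min(scores))  # most similar adjacent pair
        (low, _), c1 = intervals[i]
        (_, high), c2 = intervals[i + 1]
        intervals[i:i + 2] = [([low, high], [x + y for x, y in zip(c1, c2)])]
    return intervals

values = [1, 2, 3, 10, 11, 12]          # hypothetical attribute values
labels = ["a", "a", "a", "b", "b", "b"]  # hypothetical class labels
result = chimerge(values, labels, classes=["a", "b"], max_intervals=2)
print(result)  # [([1, 3], [3, 0]), ([10, 12], [0, 3])]
```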
4. Data Transformation: Concept Hierarchy Generation for Nominal Data
• Nominal attributes have a finite (but possibly large) number of
distinct values, with no ordering among the values.
• The concept hierarchies can be used to transform the data into
multiple levels of granularity.
• For example, data mining patterns regarding sales may be found
relating to specific regions or countries, in addition to individual
branch locations.
• There are four methods for generating concept hierarchies for nominal
data.
4. Data Transformation: Concept Hierarchy Generation for Nominal Data
a) Specification of a partial ordering of attributes explicitly at the
schema level by users or experts:
• Concept hierarchies for nominal attributes or dimensions typically
involve a group of attributes.
• A user or expert can easily define a concept hierarchy by specifying a
partial or total ordering of the attributes at the schema level.
• For example, a hierarchy can be defined by specifying the total
ordering among these attributes at the schema level such as
street < city < state < country.
4. Data Transformation: Concept Hierarchy Generation for Nominal Data
b) Specification of a portion of a hierarchy by explicit data grouping:
• In a large database, it is unrealistic to define an entire concept
hierarchy by explicit value enumeration.
• On the contrary, we can easily specify explicit groupings for a small
portion of intermediate-level data.
• For example, after specifying that state and country form a hierarchy
at the schema level, a user could define some intermediate levels
manually, such as
{Alberta, Saskatchewan, Manitoba} ⊂ prairies_Canada, and
{British Columbia, prairies_Canada} ⊂ Western_Canada
4. Data Transformation: Concept Hierarchy Generation for Nominal Data
c) Specification of a set of attributes, but not of their partial
ordering:
• Consider the observation that since higher-level concepts generally
cover several subordinate lower-level concepts, an attribute defining
a high concept level (e.g., country) will usually contain a smaller
number of distinct values than an attribute defining a lower concept
level (e.g., street).
• Based on this observation, a concept hierarchy can be automatically
generated based on the number of distinct values per attribute in the
given attribute set.
• The attribute with the most distinct values is placed at the lowest
hierarchy level.
• The lower the number of distinct values an attribute has, the higher
it is in the generated concept hierarchy.
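The distinct-value heuristic can be sketched in a few lines. The location rows below are invented for illustration, and `generate_hierarchy` is a hypothetical helper name.

```python
def generate_hierarchy(rows, attributes):
    """Order attributes from highest to lowest concept level by distinct count."""
    distinct = {a: len({row[a] for row in rows}) for a in attributes}
    # Fewer distinct values means a higher level in the hierarchy.
    ordering = sorted(attributes, key=lambda a: distinct[a])
    return ordering, distinct

# Hypothetical location tuples.
rows = [
    {"street": "1 Main St", "city": "Calgary",   "state": "Alberta",          "country": "Canada"},
    {"street": "2 Oak Ave", "city": "Calgary",   "state": "Alberta",          "country": "Canada"},
    {"street": "3 Elm Rd",  "city": "Edmonton",  "state": "Alberta",          "country": "Canada"},
    {"street": "4 Pine Dr", "city": "Vancouver", "state": "British Columbia", "country": "Canada"},
]

ordering, distinct = generate_hierarchy(rows, ["street", "city", "state", "country"])
print(" < ".join(reversed(ordering)))  # street < city < state < country
print(distinct)
```

Because this is only a heuristic, the generated ordering should be reviewed by a user or expert before it is relied on.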
4. Data Transformation: Concept Hierarchy Generation for Nominal Data
c) Specification of a set of attributes, but not of their partial
ordering:

Figure: Automatic generation of a schema concept hierarchy based on the
number of distinct attribute values.
4. Data Transformation: Concept Hierarchy Generation for Nominal Data
d) Specification of only a partial set of attributes:
• Sometimes a user can be careless when defining a hierarchy, or have
only a vague idea about what should be included in a hierarchy.
• Consequently, the user may have included only a small subset of the
relevant attributes in the hierarchy specification.
• For example, instead of including all of the hierarchically relevant
attributes for location, the user may have specified only street and
city.
• To handle such partially specified hierarchies, it is important to
embed data semantics in the database schema so that attributes with
tight semantic connections can be pinned together.
Summary
• Data quality is defined in terms of accuracy, completeness,
consistency, timeliness, believability, and interpretability. These
qualities are assessed based on the intended use of the data.
• Data cleaning routines attempt to fill in missing values, smooth out
noise while identifying outliers, and correct inconsistencies in the
data. Data cleaning is usually performed as an iterative two-step
process consisting of discrepancy detection and data transformation.
• Data integration combines data from multiple sources to form a
coherent data store. The resolution of semantic heterogeneity,
metadata, correlation analysis, tuple duplication detection, and data
conflict detection contribute to smooth data integration.
Summary
• Data reduction techniques obtain a reduced representation of the
data while minimizing the loss of information content.
• These include methods of dimensionality reduction, numerosity
reduction, and data compression.
• Dimensionality reduction reduces the number of random variables or
attributes under consideration.
• Methods include wavelet transforms, principal components analysis,
attribute subset selection, and attribute creation.
• Numerosity reduction methods use parametric or nonparametric
models to obtain smaller representations of the original data.
• Parametric models store only the model parameters instead of the
actual data. Examples include regression and log-linear models.
• Nonparametric methods include histograms, clustering, sampling, and
data cube aggregation.
Summary
• Data compression methods apply transformations to obtain a
reduced or “compressed” representation of the original data.
• The data reduction is lossless if the original data can be reconstructed
from the compressed data without any loss of information;
otherwise, it is lossy.
• Data transformation routines convert the data into appropriate forms
for mining.
• For example, in normalization, attribute data are scaled so as to fall
within a small range such as 0.0 to 1.0.
• Other examples are data discretization and concept hierarchy
generation.
Summary
• Data discretization transforms numeric data by mapping values to
interval or concept labels.
• Such methods can be used to automatically generate concept
hierarchies for the data, which allows for mining at multiple levels of
granularity.
• Discretization techniques include binning, histogram analysis, cluster
analysis, decision tree analysis, and correlation analysis.
• For nominal data, concept hierarchies may be generated based on
schema definitions as well as the number of distinct values per
attribute.
References
• Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and
Techniques, Morgan Kaufmann, 3rd Edition.
• PPT by Prof. Jiawei Han (Link)
• PCA (Link)
• The Mathematics Behind Principal Component Analysis by Akash Dubey
(three feature Example)(link).
• PCA by Josh Starmer (link). (For visualization)
• Linear and Multiple Linear Regression (Link)
