
Data Mining

Chapter 2: Data Preprocessing

Khwopa College of Engineering- Dr. Jhanak Parajuli


Data Preprocessing

❏ Why data preprocessing?

❏ What is Data cleaning ?

❏ Data integration and transformation

❏ Data reduction

❏ Discretization and concept hierarchy generation


Process of Knowledge Discovery
Selection → Pre-processing → Data mining → Post-processing

❏ Raw Data: collected in databases, flat files, cloud, or any other sources
❏ Target Data (Selection): only the selected data collected in databases, flat files, etc.
❏ Processed Data (Pre-processing): data is processed using different techniques such as normalization, feature selection, dimensionality reduction, data subsetting, etc.
❏ Data Patterns (Data mining): discover patterns in the data using different statistical and modern machine learning techniques, which help in predictive analytics
❏ Knowledge (Post-processing): visualize the data to obtain proper knowledge and make future business predictions
Why Data Preprocessing?

❏ Real world data is dirty


■ incomplete: missing attribute values or features, lacking desired attributes, or containing only aggregate data
■ noisy: containing errors or outliers
■ inconsistent: containing discrepancies in codes or names

❏ Garbage in, Garbage Out


■ Quality decisions must be based on quality data
■ Data warehouse needs consistent integration of quality data
■ Required for both OLAP and Data Mining!
Why are Data incomplete?

❏ Attributes of interest are not available at the time of recording data


❏ (e.g., customer information for sales transaction data)

❏ Data were not considered important at the time of transactions, so they were
not recorded!

❏ Data not recorded because of misunderstandings or malfunctions

❏ Data may have been recorded and later deleted!

❏ Missing/unknown values for some data


Why are Data noisy?

❏ Faulty instruments for data collection

❏ Human or computer errors

❏ Errors in data transmission

❏ Technology limitations (e.g., sensor data come at a faster rate than they can
be processed)

❏ Inconsistencies in naming conventions or data codes (e.g., 2/5/2002 could be


2 May 2002 or 5 Feb 2002)

❏ Duplicate tuples (e.g., records received twice) should also be removed


Data Preprocessing Tasks
❏ Data Cleaning
❏ Fill in missing values
❏ Smoothen noisy data
❏ Identify and remove outliers
❏ Resolve inconsistencies
❏ Data Integration
❏ Integration of multiple database, data cubes or files
❏ Data Transformation
❏ Data normalization/scaling samples to have unit norm
❏ Standardization/mean removal and variance scaling
❏ Aggregation
❏ Data Reduction
❏ Dimensionality Reduction
❏ Data Discretization/quantization/binning
❏ Finite grouping (K-bins discretization)
❏ Feature binarization
Data Preprocessing Tasks
❏ Data Cleaning
❏ Data Integration
❏ Data Transformation
❏ Data Reduction
❏ Data Discretization
Data Cleaning

❏ Fill in missing values/Imputation of missing values


❏ Identify outliers
❏ smoothen noisy data
❏ Resolve inconsistencies e.g. duplicate entries

[Example of dirty data: an extreme income value ("a mistake or a millionaire?"), missing values, and inconsistent duplicate entries]


Noise

❏ Noise modifies the original value.


❏ Noise creates chaos in the data and might break the desired patterns
Outliers

❏ Outliers are data objects with characteristics that are considerably different
than most of the other data objects in the data set
Handling Missing Values

❏ Eliminate data objects:


❏ This is done when the missing values are few, so that dropping them does not lose a lot of data points
❏ Estimate missing values:
❏ Use statistical measures such as mean or median of the attributes (preferred when the values
are numerical)
❏ Use attribute mean or median for all samples belonging to the same class (classification
problems)
❏ Use inference based approach such as Bayesian formula or decision based tree
❏ Ignore missing values during analysis:
❏ If missing values greatly outnumber the known values, filling them in may distort the analysis.
❏ Replace the missing values with possible values (a pandas sketch follows below):
❏ Fill in manually: a tedious task for a large dataset (not preferred)
❏ Use a global constant to fill in the missing values (e.g., 0, NaN, "Unknown", etc.) -- not preferred for all datasets
❏ forward fill: propagate the previous observed value forward into the missing slot (not preferred for all datasets)
❏ backward fill: fill the missing value with the next observed value (not preferred for all datasets)
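
A minimal pandas sketch of the strategies above; the dataframe and its values are hypothetical and only meant to illustrate the calls.

import numpy as np
import pandas as pd

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [23, 39, 45, np.nan],
                   "income": [24200, np.nan, 45390, 52000]})

print(df.dropna())           # eliminate data objects (rows) with missing values
print(df.fillna(df.mean()))  # estimate with the column mean
print(df.fillna(df.median()))# estimate with the column median
print(df.fillna(0))          # global constant (not preferred for all datasets)
print(df.ffill())            # forward fill: propagate the previous value
print(df.bfill())            # backward fill: use the next value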
Handling Missing Values

Age Income Religion Gender


23 24,200 Muslim M
39 ? Christian F
45 45,390 ? F

Fill missing values using aggregate functions (e.g., average) or probabilistic


estimates on global value distribution
E.g., put the average income here, or put the most probable income based
on the fact that the person is 39 years old
E.g., put the most frequent religion here
Handling Missing Values

Python → sklearn.impute.SimpleImputer
❏ The SimpleImputer class provides basic strategies for imputing missing values. Missing values can
be imputed with a provided constant value, or using the statistics (mean, median or most frequent)
of each column in which the missing values are located. This class also allows for different missing
values encodings.

❏ The SimpleImputer class also supports categorical data represented as string values or pandas
categoricals when using the 'most_frequent' or 'constant' strategy:
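
A short sketch of SimpleImputer along these lines; the numeric array and the religion column are illustrative only (they echo the example table on the next slide).

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns: impute missing entries with the column mean
X = np.array([[23, 24200.0], [39, np.nan], [45, 45390.0]])
imp_num = SimpleImputer(missing_values=np.nan, strategy="mean")
print(imp_num.fit_transform(X))

# Categorical column: impute with the most frequent value
df = pd.DataFrame({"religion": ["Muslim", "Christian", np.nan]})
imp_cat = SimpleImputer(strategy="most_frequent")
print(imp_cat.fit_transform(df))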
Handling Data Noise

❏ Binning
❏ sort data and partition into equi-depth and equi-width bins
❏ smooth by bin means, bin median, bin boundaries, etc.
❏ Regression
❏ smooth by fitting a regression function
❏ Clustering
❏ detect and remove outliers
❏ Combined Computer and Human Inspection
❏ detect suspicious values automatically and check by human
❏ Use concept hierarchies
❏ e.g., price value -> “expensive”
Handling Data Noise (Binning)

❏ Equal-width (Distance) binning


❏ Divides the range into N intervals of equal size
❏ Width of intervals: W = (max − min) / N
❏ Simple
❏ Outliers may dominate result
❏ Equal-depth (frequency) binning
❏ Divides the range into N intervals, each containing approximately same
number of records
❏ Skewed data is also handled well

Equal Depth Binning


Handling Data Noise (Binning)

❏ Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
❏ Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

❏ Smoothing by bin means:


- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

❏ Smoothing by bin boundaries: [4,15],[21,25],[26,34]


- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
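
A small pandas sketch of equal-depth binning and smoothing by bin means on the price values above. pd.qcut reproduces the same three bins; the exact bin means are 9, 22.75 and 29.25, which the slide rounds to 9, 23 and 29.

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-depth (frequency) binning: 3 bins with roughly the same number of records
bins = pd.qcut(prices, q=3, labels=False)

# Smoothing by bin means: replace each value by the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())

# Equal-width binning for comparison: 3 intervals of equal size
print(pd.cut(prices, bins=3).value_counts().sort_index())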
Handling Data Noise (Regression)

❏ Replace noisy data or missing data by predicted values


❏ Example: Linear Regression on missing continuous valued data
Handling Data Noise (Clustering)

❏ K-means Clustering is the most popular clustering technique:


❏ The K-Means algorithm clusters data by trying to separate samples in K groups of equal
variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares.

[Figure: K-means clusters with a few outlier points lying far from the cluster centers]
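
A minimal sketch of flagging outliers with K-means: cluster the data, then flag points that lie unusually far from their assigned cluster center. The synthetic data and the mean + 3*std distance threshold are assumptions made for illustration, not a fixed rule.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # cluster around (0, 0)
               rng.normal(8, 1, (50, 2)),      # cluster around (8, 8)
               [[20.0, 20.0]]])                # one obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)  # distance to assigned center

outliers = dist > dist.mean() + 3 * dist.std()  # assumed threshold
print(np.where(outliers)[0])                    # index 100 -> the injected outlier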
Combined Computer and Human Inspection

❏ Inconsistent data are handled by:


❏ Manual correction (expensive and tedious)
❏ Use routines designed to detect inconsistencies and manually correct
them. E.g., the routine may check global constraints (e.g., age > 10) or
functional dependencies
❏ Other inconsistencies (e.g., between names of the same attribute) can be
corrected during the data integration process
❏ Detect suspicious values automatically
❏ Use statistical formula to remove outliers
❏ By calculating the Z-score = (X - mean)/sigma; values with |Z-score| > 3 are treated as outliers
❏ By histogram plot
❏ By using boxplot or Interquartile range (IQR)
Boxplot

❏ Boxplot: Use boxplot to detect outliers and remove them
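
A short sketch of the Z-score rule from the previous slide and the IQR (boxplot) rule on a one-dimensional series. The income values are made up; 3 and 1.5 are the conventional thresholds.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(np.append(rng.normal(25_000, 2_000, 50), 10_000_000))  # a mistake or a millionaire?

# Z-score rule: |z| > 3 is treated as an outlier
z = (income - income.mean()) / income.std()
print(income[z.abs() > 3])

# IQR (boxplot) rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
print(income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)])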


Data Integration

❏ Combining data from multiple sources


❏ Multiple sources: Relational Database, NOSQL database, flat files etc.
❏ Schema integration: Integrating metadata from different sources
❏ metadata: data that describes the data; data descriptors
❏ Maintaining data dictionary
❏ data-dictionary: explanation of each of the data features
❏ Detecting and resolving data value conflicts
❏ values from different sources might be different
❏ determine the differences and deviation
❏ possible reasons: different representations, different scales, different time
zones
Handling Redundant Data

❏ Redundant data often occur when integrating multiple databases


■ The same attribute may have different names in different databases
■ One attribute may be a “derived” attribute in another table, e.g., annual
revenue

❏ Redundant data can be detected by correlation analysis

❏ Careful integration of the data from multiple sources will help to avoid or
reduce the redundant data and inconsistencies.

❏ Redundancies removal improves data mining speed and quality


Data Type Portability

SAX = Symbolic Aggregate Approximation

DWT = Discrete Wavelet Transform

DFT = Discrete Fourier Transform

MDS = Multidimensional Scaling

Reference: Charu C. Aggarwal, Data Mining: The Textbook


Assignment 2 (due 2 weeks - June 19, 2021)

1. For the given dataset, do the following data cleaning tasks. Use Python as the
programming language. Perform the tasks in your Jupyter notebook.
a. Load the data as a pandas dataframe
b. Find the number of missing values in each column
c. Delete rows with missing values
d. Fill the missing values with the mean
e. Fill the missing values with the median
f. Find outliers in the data
g. Use linear regression to remove and replace outliers
Data Transformation

❏ Data Transformation → Process in which data is consolidated or transformed
into other standard forms that are suitable for data mining and/or for applying
known machine learning algorithms for predictive analytics
❏ Normalization: the process of scaling individual samples to have unit norm
❏ What is a vector norm?

❏ Normalization Techniques:
❏ Decimal Scaling
❏ Min-Max scaling
❏ Z-score Normalization/Standardization

❏ Feature generation: Create new features from the given features


Why Normalization ?

❏ Speeds up some learning techniques (e.g., neural networks)

❏ Helps prevent attributes with large ranges from outweighing attributes with small ranges

❏ Example:
❏ income has range 3000-200000
❏ age has range 10-80
❏ gender has domain M/F
Normalization Techniques

❏ Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

❏ Min-Max Normalization: v' = (v - min) / (max - min), which maps v into [0, 1]

❏ Z-score Normalization/Standardization: v' = (v - mean) / standard deviation (see the sketch below)
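
A minimal numpy/scikit-learn sketch of the three techniques; the values are illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

v = np.array([200.0, 300.0, 400.0, 600.0, 986.0])

# Decimal scaling: v' = v / 10^j with j the smallest integer so that max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
print(v / 10 ** j)                               # [0.2 0.3 0.4 0.6 0.986]

# Min-max normalization to [0, 1]
print((v - v.min()) / (v.max() - v.min()))

# Z-score normalization / standardization
print((v - v.mean()) / v.std())

# The same min-max and z-score transforms via scikit-learn
X = v.reshape(-1, 1)
print(MinMaxScaler().fit_transform(X).ravel())
print(StandardScaler().fit_transform(X).ravel())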
Variable Transformation

❏ Simple mathematical formulation is applied to each value individually


❏ If x is a variable, then examples of such transformations include x^k,
log x, e^x, √x, 1/x, sin x, or |x|.
❏ In statistics, variable transformations, especially sqrt, log, and 1/x, are
often used to transform data that does not have a Gaussian (normal)
distribution into data that does.
❏ Variable transformations should be applied with caution since they
change the nature of the data.
❏ To help clarify the effect of a transformation, it is important to ask
questions such as the following:
❏ Does the order need to be maintained?
❏ Does the transformation apply to all values, especially negative
values and 0?
❏ What is the effect of the transformation on the values between 0
and 1?
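
A small numpy illustration of such transformations and of the "does the order need to be maintained?" question; the values are made up.

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0])   # strongly skewed positive values

print(np.log10(x))    # log compresses large values: [0. 1. 2. 3.]
print(np.sqrt(x))     # square root gives a milder compression
print(1.0 / x)        # 1/x reverses the order of positive values, so order is NOT maintained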
Sampling

❏ Used for selecting subset of data objects to be analyzed


❏ Sampling is done because it is too expensive or time consuming to process all the
data
❏ A sample is representative if it has the same properties as the original dataset.

❏ Choosing an appropriate sample size and sampling technique is necessary to
obtain a representative sample with high probability.
Sampling Approaches

❏ Simple random sampling:


❏ equal probability of selecting any item
❏ Two types:
❏ Sampling without replacement
❏ sampling with replacement
❏ Simple random sampling doesn’t properly represent an imbalanced
dataset
❏ Stratified Sampling:
❏ Sampling is done on a prespecified group of objects
❏ Each group is created so that the data is balanced in each group
❏ Equal number of data points are taken from each group
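
A minimal pandas sketch contrasting simple random and stratified sampling on a hypothetical imbalanced dataset (95% of records in class 0, 5% in class 1).

import pandas as pd

df = pd.DataFrame({"x": range(1000),
                   "label": [0] * 950 + [1] * 50})

# Simple random sampling (without replacement) may under-represent the rare class
simple = df.sample(n=100, random_state=0)
print(simple["label"].value_counts())

# Stratified sampling: take the same number of records from each group
stratified = df.groupby("label", group_keys=False).sample(n=50, random_state=0)
print(stratified["label"].value_counts())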
Loss of information with Sampling
Sampling Approaches

❏ Proper sampling size is difficult to obtain


❏ In that case, we use Progressive or adaptive sampling techniques
❏ Start with a small sample size and increase the size until a sample of
sufficient size has been obtained

❏ In predictive modeling, we analyze the model error/inaccuracy with an initial
set of sample points, then increase the number of samples and observe the
inaccuracy drop. Beyond some point, adding more samples brings little or no
further improvement; that point gives the optimal sample size.
Dimensionality Reduction

❏ As the name suggests, dimensionality reduction is the method of reducing the


dimensionality or the number of features in the dataset.
❏ E.g., consider a large collection of documents. To process such data, each
word is embedded as a vector, and the resulting data can have thousands of dimensions.
❏ In that case, data becomes increasingly sparse and it is very difficult to
analyze.
❏ Many data mining algorithms work better if the dimensionality is lower. e.g.
k-means clustering.
❏ Dimensionality reduction helps to understand the model better as it has fewer
attributes.
❏ Reducing irrelevant features reduces noise in the dataset.
The Curse of Dimensionality

❏ As the dimensionality increases, the data becomes sparse and difficult to


analyze. This is called the Curse of dimensionality.
❏ For classification, this can mean that there are not enough data objects to
allow the creation of a model that reliably assigns a class to all possible
objects.
❏ For clustering, the definitions of density and the distance between points,
which are critical for clustering, become less meaningful.
Linear Algebra Techniques for Dimensionality Reduction

1. Principal component analysis (PCA)


2. Singular Value Decomposition (SVD)

Principal Component Analysis:

❏ Used for dimensionality reduction in continuous data


❏ Given N data vectors in k dimensions, find c <= k orthogonal vectors that
best represent the data

Please follow attached notebooks for more on dimensionality reduction


Principal Component Analysis

X1, X2: original axes (attributes)
Y1, Y2: principal components; Y1 is the significant component (high variance)

[Figure: data points in the X1-X2 plane with the principal component axes Y1 and Y2 overlaid]

Order principal components by significance and eliminate weaker ones
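
A short scikit-learn sketch of PCA on hypothetical, strongly correlated 2-D data, keeping only the significant component.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(0, 5, 200)
X = np.column_stack([x1, 0.8 * x1 + rng.normal(0, 1, 200)])  # X2 is mostly a rescaled X1

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)              # Y1 carries most of the variance

X_reduced = PCA(n_components=1).fit_transform(X)  # keep only the significant component
print(X_reduced.shape)                            # (200, 1)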


Measures of Similarity and dissimilarity

❏ Similarity measurement is used in many data mining techniques


such as clustering, nearest neighbor classification and anomaly
detection.
❏ Once the similarity and dissimilarity measurement is known, we
may not even need the initial dataset
❏ data is transformed to similarity/dissimilarity (proximity) space
and analyzed
❏ How to measure the proximity between objects having only one
attribute ?
❏ How to measure the proximity between objects having multiple
attributes ?
Proximity Measures between two objects

❏ Similarity:
❏ numerical measure of the degree to which two objects are alike
❏ similarity = 0 → no similarity
❏ similarity = 1 → complete similarity
❏ dissimilarity:
❏ numerical measure of the degree to which two objects are different
❏ distance is sometimes used as a synonym for dissimilarity, but distance
refers only to a special class of dissimilarities.
❏ dissimilarities often fall in the interval [0,1], but sometimes range over
[0,∞)
❏ Similarity → dissimilarity
❏ If the similarity and dissimilarity fall in the interval [0,1], similarity = 1-
dissimilarity OR
❏ Similarity is the negative of dissimilarity
Proximity Measures for single attribute

❏ If the attribute is nominal:


❏ Nominal attributes only convey information about the distinctness of
objects
❏ Two objects either match or they don't
❏ if the values match: similarity = 1, dissimilarity = 0; otherwise:
similarity = 0, dissimilarity = 1
❏ If the attribute is ordinal:
❏ Information about order should be taken into account
❏ e.g. quality of a product (poor, fair, ok, good, wonderful)
❏ If product 1 is wonderful and P2 is good, they are more similar than if P1
is wonderful and P2 is fair
❏ Each ordinal value is mapped to a successive integer and the difference
between the two gives the measure of dissimilarity.
❏ (poor, fair, ok, good, wonderful) → (1, 2, 3, 4, 5), so for P1 = wonderful and
P2 = good, d(P1, P2) = 5 - 4 = 1
❏ If we want d(P1, P2) to fall in [0,1] → d(P1, P2)/(len(order) - 1) → 1/4
❏ Is it fair to give equal weight to each order ?
Proximity Measures for single attribute

❏ If the attribute is interval or ratio:


❏ dissimilarity between two objects = absolute difference between their
values
❏ dissimilarity range from 0 to infinity
❏ E.g., the difference between current weight and weight 6 months ago is expressed
in absolute numbers

Reference: Introduction to Data Mining , Tan, Steinbach , Kumar


Dissimilarities Measures between Data Objects

We consider the following measurement of dissimilarities

● Minkowski Distance: Given data x and y, the Minkowski distance is given
by the following formula:

d(x, y) = ( sum_{k=1..n} |x_k - y_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions and x_k and y_k are the
k-th attributes of x and y.

-- Named after Hermann Minkowski, a German mathematician


Dissimilarities Measures between Data Objects

● Manhattan Distance (Taxicab distance or City block distance): Given
data x and y, the Manhattan distance is calculated as:

d(x, y) = sum_{k=1..n} |x_k - y_k|

where n is the number of dimensions and x_k and y_k are the k-th attributes of x
and y.

● r = 1 in Minkowski distance
● Also called L1 norm
● It is the distance a car would drive in a city (e.g., Manhattan)
Dissimilarities Measures between Data Objects

● Euclidean Distance: Given data x and y, the Euclidean distance is
calculated as:

d(x, y) = sqrt( sum_{k=1..n} (x_k - y_k)^2 )

where n is the number of dimensions and x_k and y_k are the k-th attributes of x
and y.

● r = 2 in Minkowski distance
● Also called L2 norm
Dissimilarities Measures between Data Objects

● Supremum Distance: Given data x and y, the Supremum distance is
calculated as:

d(x, y) = max_{k=1..n} |x_k - y_k|

where n is the number of dimensions and x_k and y_k are the k-th attributes of x
and y.
● This is the maximum absolute difference over any single attribute, e.g. ||(1, −4, 5)||∞ =
max{|1|, |−4|, |5|} = 5
● r → ∞ in Minkowski distance
● Also called Lmax or L∞ norm
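
A minimal numpy/scipy sketch of these distances, reusing the vector (1, -4, 5) from the supremum example and the origin as the second point.

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, -4.0, 5.0])
y = np.zeros(3)

print(np.sum(np.abs(x - y)))             # Manhattan / L1 distance  -> 10.0
print(np.sqrt(np.sum((x - y) ** 2)))     # Euclidean / L2 distance  -> ~6.48
print(np.max(np.abs(x - y)))             # Supremum / L-infinity    -> 5.0

# The same values as special cases of the Minkowski distance
print(distance.minkowski(x, y, p=1))     # r = 1
print(distance.minkowski(x, y, p=2))     # r = 2
print(distance.chebyshev(x, y))          # r -> infinity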
Dissimilarities Measures between Data Objects

Reference: Introduction to Data Mining, Tan, Steinbach and Kumar


Properties of distance

● Positivity: d(x, y) >= 0 for all x and y, and d(x, y) = 0 only if x = y
● Symmetry: d(x, y) = d(y, x) for all x and y
● Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for all points x, y and z
● Measures that satisfy all three properties are known as metrics.


Do all dissimilarities satisfy all three properties?

❏ No - Such are called Non-metric dissimilarities


❏ E.g. Set differences
❏ A = {1, 2, 3, 4}, B = {2, 3, 4}, A - B = {1}, B - A = ∅

❏ If we define d(A,B) = size(A-B) , then the symmetry property is


violated
❏ This also violates triangle inequality property
❏ However if we define d(A,B) = size(A-B) + size(B-A) then symmetry
and triangle inequality property is preserved
Do all dissimilarities satisfy all three properties?

❏ Example 2: Time difference


❏ Let us say the distance between two times of day is measured as the number of
hours from the first time until the next occurrence of the second time on a
24-hour clock

❏ Then d(1, 2) = 1, but d(2, 1) = 23


❏ violates symmetry property and triangle inequality property
Similarities Measures between Data objects

❏ For similarity measures, the triangle inequality typically does not hold, but
positivity and symmetry typically do.

❏ There is no general analog of triangle inequality for similarity measures.


However, there are some similarity measures where the triangle inequality
holds. E.g. Cosine and Jaccard similarity measures
❏ In some similarity measures even the symmetry property does not hold. For
example, consider the confusion matrix of a classifier that distinguishes the
digit "0" from the letter "o": if we take the counts in the matrix as the
similarity, then s(0, o) = 40 but s(o, 0) = 30, which is not symmetric.
❏ Hence a symmetric similarity is constructed as:
s'(x, y) = s'(y, x) = (s(x, y) + s(y, x)) / 2

             Predicted 0    Predicted o
Actual 0         160             40
Actual o          30            170
Similarity Measures for Binary Data
❏ Simple Matching Coefficient:
❏ SMC = (f_11 + f_00) / (f_00 + f_01 + f_10 + f_11), where f_ab is the number
of attributes for which the first object has value a and the second has value b
❏ For the example data,


f_{00} = 0, f_{01} = 2, f_{10} = 1, f_{11} = 2 ,
SMC = ⅖ = 0.4
❏ This measure counts both presence and absence equally
Similarity Measures for Binary Data
❏ Jaccard Similarity Coefficient:
❏ J = f_11 / (f_01 + f_10 + f_11), i.e., 0-0 matches are ignored
❏ Used when the given binary attributes are asymmetric
❏ E.g. In a shop, items purchased by customer =1 and items not
purchased = 0. In that case, for each transaction 0 outnumbers 1. It
means f_{00} will be very high compared to f_{11}. If we use SMC,
this doesn’t capture the similarity measure truly.
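
A small numpy sketch contrasting SMC and Jaccard on two made-up binary vectors dominated by 0-0 matches, as in the market-basket example above.

import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0])   # items purchased by customer 1
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])   # items purchased by customer 2

f11 = np.sum((x == 1) & (y == 1))
f00 = np.sum((x == 0) & (y == 0))
f10 = np.sum((x == 1) & (y == 0))
f01 = np.sum((x == 0) & (y == 1))

smc = (f11 + f00) / (f11 + f00 + f10 + f01)   # counts 0-0 matches as agreement -> 0.8
jaccard = f11 / (f11 + f10 + f01)             # ignores 0-0 matches             -> 0.33
print(smc, jaccard)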
Cosine Similarity

❏ Used for non binary data


❏ Used for Document data
❏ If we want to measure similarity between two documents, the document is
represented by a vector, where each attribute is the frequency of the word
occurred in the document.
❏ Document processing also ignores certain common words
❏ Since the vectors are non-binary and can be sparse, we use cosine similarity
to measure the similarity between two documents.

Reference: Introduction to data mining, Tan, Steinbach and Kumar


Cosine Similarity

Cosine similarity is the cosine of the angle between x and y:

cos(x, y) = (x . y) / (||x|| ||y||)

Example: see the sketch below.
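
A minimal sketch of cosine similarity on two hypothetical term-frequency vectors.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])   # term frequencies of document 1
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])   # term frequencies of document 2

# cos(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)
cos = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(cos)                                      # ~0.31

print(cosine_similarity(d1.reshape(1, -1), d2.reshape(1, -1)))  # same value via scikit-learn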


Follow lecture videos, text-book and shared jupyter notebooks for examples and
other details
