Data Preprocessing Essentials
Data Preprocessing Essentials
Source: Chapter 3
book: Data Mining by Han and Kamber
CSE25.8: Elective-I
MCA II Sem
March-June 2022
Content
• Data cleaning
• Data integration
• Data reduction
• Data transformation
• Data discretization
1
Data pre-processing
• How can the data be preprocessed in order
Data Quality
• There are many factors comprising data quality,
including:
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability
– interpretability.
2
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
Data Cleaning
• Missing Values
• Noisy data
3
Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different
sources are different
– Possible reasons: different representations, different scales, e.g.,
metric vs. British units
4
Issues of Information Integration
• Data type differences
– Serial numbers may be represented by characters, numbers, fixed
v/s varying length at different sources
• Value differences
– The same concept may be represented by different constants at
different sources. E.g. color black can be given a code BL, and BL
may mean blue color at another source, etc.
• Semantic differences
– Terms may be given different interpretations at different sources.
E.g. One dealer may include trucks in Cars relation, another may
include station wagons and mini vans, etc.
• Missing values
– A source might not record information of a type that most of other
sources provide. E.g. one dealer might not record some minor
accessories at all, etc.
Data Transformation
• A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature construction
• New attributes constructed from the given ones
– Aggregation: Summarization, data cube construction
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: Concept hierarchy climbing
5
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., instrument faulty, human or computer error, transmission
error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
6
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same
class: smarter
– the most probable value: inference-based such as
Bayesian formula or decision tree
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
7
How to Handle Noisy Data?
• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g.,
deal with possible outliers)
Binning
8
Binning
• In general, the larger the width, the greater the effect of
the smoothing.
• Alternatively, bins may be equal width, where the interval
range of values in each bin is constant.
9
Data discrepancy detection
• Discrepancies can be caused by several factors
– Poorly designed data entry forms that have many optional fields
– Human error/deliberate error in data entry
– Data decay (e.g. outdated addresses)
– Inconsistent data representations (e.g. where a given attribute
can have different names in different databases)
• In order deal with data discrepancy, use knowledge we
may already have regarding properties of the data i.e.
meta data
• Meta data includes:
– domain and data type of each attribute
– Acceptable values for each attribute
– Range of length of values, do all values fall within the range; e.g.
values that are more than 2 standard deviations away from mean
for an attribute may be flagged as potential outliers.
Field overloading
• Field overloading is another source of errors that
typically results when developers squeeze new attribute
definitions into unused bit portions of already defined
attributes
10
Applying various rules
• A unique rule says that each value of given attribute
must be different from all other values for that attribute.
• A consecutive rule says there can be no missing values
between lowest and highest values for the attribute, and
all values must also be unique (e.g. check numbers)
• A null rule specifies use of blanks, question marks,
special characters, or other strings that may indicate null
condition (e.g. where a value for a given attribute is not
available)
11
Mitosheet
https://docs.trymito.io/getting-started/installing-mito
Data Reduction
• Data reduction strategies include dimensionality
reduction, numerosity reduction, and data
compression.
• Dimensionality reduction is the process of reducing
the number of random variables or attributes under
consideration. Dimensionality reduction methods
include wavelet transform etc.
12
Handling Redundancy in Data Integration
• Redundant data occur often when integration of multiple
databases
– Object identification: The same attribute or object may
have different names in different databases
– Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
• Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Data Normalization
• The measurement unit used can affect the data analysis.
• For example, changing measurement units from meters to
inches for height, or from kilograms to pounds for weight, may
lead to very different results.
• In general, expressing an attribute in smaller units will lead to
a larger range for that attribute, and thus tend to give such an
attribute greater effect or “weight.”
• To help avoid dependence on the choice of measurement
units, the data should be normalized or standardized.
• Normalizing the data attempts to give all
attributes an equal weight.
13
Data Normalisation
• where the attribute data are scaled so as
to fall within a smaller range, such as -1.0
to 1.0, or 0.0 to 1.0.
• to bring train and test data at same page
Data Normalization
• Normalization is particularly useful for classification algorithms
involving neural networks or distance measurements such as
nearest-neighbor classification and clustering.
• If using the neural network backpropagation algorithm for
classification, normalizing the input values for each attribute
measured in the training tuples will help speed up the learning
phase.
• For distance-based methods, normalization helps prevent
attributes with initially large ranges (e.g., income) from
outweighing attributes with initially smaller ranges (e.g.,
binary attributes).
• It is also useful when given no prior knowledge of the data.
• There are many methods for data normalization.
14
LP-Norm
• For normalizing data using LP norms, first we need to
calculate their norms.
• Norm of a vector represents some property of vector; e.g.
L2 norm of a vector denotes its length/magnitude.
• One important use of norm is to transform a given
vector into a unit-length vector
– i.e., making the magnitude of vector = 1, while still preserving its
direction.
• This is achieved by dividing each element in a vector by
its length i.e. its L2-norm.
• Another use of L1 & L2 norm:
– in computation of loss in regularized gradient descent algorithms.
– These are also used in famous Ridge and Lasso regression
algorithms.
LP-Norm
• The p-norm is a norm on suitable real vector spaces given
by the pth root of the sum (or integral) of the pth-powers of
the absolute values of the vector components.
• The general definition for the p-norm of a vector v that has
N elements is:
• https://www.journaldev.com/45324/norm-of-vector-python
where p is any positive real value, Inf, or -Inf. Some common
values of p are 1, 2, and Inf.
•If p is 1, then the resulting 1-norm is the sum of the absolute
values of the vector elements.
•If p is 2, then the resulting 2-norm gives the vector magnitude
or Euclidean length of the vector.
•If p is Inf, then
15
Rescaling
• Rescaling changes the distance between the
min and max values in a data set by stretching
or squeezing the points along the number line.
73,600 12,000
(1.0 0) 0 0.716
98,000 12,000
• Normalization by decimal scaling
v
v' Where j is the smallest integer such that Max(|ν’|) < 1
10 j
16
Standard Normal Distribution
• A standard normal distribution has mean = 0 and standard
deviation=1.
• Z-score normalization is useful in machine learning
settings since it can tell you:
– how far a data point is from the average of
the whole data set.
• It can be most appropriate when there are just a few
outliers, since it provides a simple way to compare a data
point to the norm.
Z-score
• z-scores measure the distance of a data point from the
mean in terms of the standard deviation.
• The standardized data set has mean=0 and standard
deviation 1, and retains the shape properties of the original
data set.
17
Z-score
• The z-scores of the data are preserved, so
the shape of the distribution remains the
same. v A
• E.g. v'
(μ: mean, σ: standard deviation): A
73,600 54,000
1.225
16,000
IQR
• The interquartile range (IQR) of a data set describes the
range of the middle 50% of values when the values are
sorted.
18
Normalisation
• It is necessary to save the normalization parameters
(e.g., the mean and standard deviation etc) so that
future data can be normalized in a uniform manner.
19
Redundancy and Correlation Analysis
• Given two attributes, such analysis can measure how strongly
one attribute implies the other, based on the available data.
• For nominal data, we use the χ2 (chi-square) test.
• For numeric attributes, we can use the correlation coefficient
and covariance, both of which access how one attribute’s
values vary from those of another.
Categorical Continuous-value
Expected
• The larger the χ 2 value, the more likely the variables are
related
• The cells that contribute the most to the χ 2 value are
those whose actual count is very different from the
expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
20
Expected value
• The expected value (EV) is an anticipated value for an
investment at some point in the future.
• In statistics and probability analysis, the expected
value is calculated by multiplying each of the possible
outcomes by the likelihood each outcome will occur and
then summing all of those values.
Degrees of freedom:
– that’s just the number of categories minus 1 i.e. (r-1) * (c-1)
The alpha level(α):
– chosen by you, or the researcher
– usual alpha level is 0.05 (5%), but you could also have other
levels like 0.01 or 0.10.
21
e.g.
• Suppose that a group of 1500 people was surveyed.
• The gender of each person was noted.
• Each person was polled as to whether his or her
preferred type of reading material was fiction or
nonfiction.
• Thus, we have two attributes, gender and preferred
reading.
• The observed frequency (or count) of each possible joint
event is summarized in the contingency table shown in
Table 3.1
22
Hypothesis that the attributes are not
correlated is rejected
• For this 2 X 2 table, the degrees of freedom are (2-1) X
(2-1).
Chi-distribution Table
23
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product
moment coefficient)
n n
(ai A)(bi B ) (ai bi ) n AB
rA, B i 1
i 1
(n 1) A B (n 1) A B
Scatter plots
showing the
similarity from
–1 to 1.
48
24
Heuristic Search in Attribute Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence
assumption: choose by significance tests
– Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking
50
25
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in
one or more other attributes
– E.g., purchase price of a product and the amount of
sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data
mining task at hand
– E.g., students' ID is often irrelevant to the task of
predicting students' GPA
x1
26
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal
component vectors
– The principal components are sorted in order of decreasing
“significance” or strength
– Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
• Works for numeric data only
27
Parametric Data Reduction: Regression
and Log-Linear Models
• Linear regression
– Data modeled to fit a straight line
– Often uses the least-square method to fit the line
• Multiple regression
– Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
• Log-linear model
– Approximates discrete multidimensional probability
distributions
y
Regression Analysis Y1
28
Regress Analysis and Log-Linear
• Linear regression: Y = w X + b Models
– Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
– Using the least squares criterion to the known values of Y1, Y2, …,
X1, X2, ….
• Multiple regression: Y = b0 + b1 X1 + b2 X2
– Many nonlinear functions can be transformed into the above
• Log-linear models:
– Approximate discrete multidimensional probability distributions
– Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset
of dimensional combinations
– Useful for dimensionality reduction and data smoothing
Histogram Analysis
• Divide data into buckets
and store average (sum) 40
35
for each bucket
30
• Partitioning rules:
25
– Equal-width: equal 20
bucket range 15
– Equal-frequency (or 10
equal-depth) 5
0
10000 30000 50000 70000 90000
29
Clustering
• Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
• Can be very effective if data is clustered but not if data is
“smeared”
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
• There are many choices of clustering definitions and
clustering algorithms
• Cluster analysis will be studied in depth in Chapter 10
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
– Supervised vs. unsupervised
– Split (top-down) vs. merge (bottom-up)
– Discretization can be performed recursively on an attribute
– Prepare for further analysis, e.g., classification
30
Data Discretization Methods
• Typical methods: All the methods can be applied
recursively
– Binning
• Top-down split, unsupervised
– Histogram analysis
• Top-down split, unsupervised
– Clustering analysis (unsupervised, top-down split or
bottom-up merge)
– Decision-tree analysis (supervised, top-down split)
– Correlation (e.g., 2) analysis (unsupervised, bottom-up
merge)
31
Simple Discretization: Binning
• Equal-width (distance) partitioning
– Divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 Binning Methods
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
for Data
* Smoothing by bin boundaries: Discretization
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
32
Summary
• Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
– Entity identification problem
– Remove redundancies
– Detect inconsistencies
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
33