Chapter 2
Data Preprocessing
Data Quality
• Real-world databases are highly susceptible to noise, missing values and inconsistent data because of their typically huge size and their likely origin from multiple, heterogeneous sources.
• Low-quality data will lead to low-quality mining results.
• Data preprocessing is required to handle the issues mentioned above.
• The methods for data preprocessing are organized into
– Data Cleaning
– Data Integration
– Data Transformation
– Data Reduction
– Data Discretization
Data Cleaning
• Mainly concerned with
– Filling in missing values
– Identifying outliers and smoothing out noisy data
– Correcting inconsistent data
– Eliminating duplicate data
Missing Data
• Data is not always available; many tuples may have no recorded values for several attributes such as age or income.
• Missing data may be due to:
• Equipment malfunction
• Data that was inconsistent with other recorded data and was therefore deleted
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing. Not effective when the percentage of missing values per attribute varies considerably.
• Fill in missing values manually: tedious and often infeasible.
• Use a global constant to fill in missing values.
• Use the attribute mean of tuples belonging to the same class to fill in missing values.
• Use the most probable value to fill in the missing value (a short sketch of these strategies follows below).
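A minimal sketch of these strategies in Python with pandas; the table, column names and fill choices below are illustrative assumptions, not part of the slides:

    import pandas as pd

    # Hypothetical table with a missing class label and missing ages
    df = pd.DataFrame({
        "class":  ["A", "A", "B", None, "B"],
        "age":    [23, None, 31, None, 45],
        "income": [50000, 62000, None, 58000, 61000],
    })

    # Ignore the tuple: drop rows whose class label is missing
    df_drop = df.dropna(subset=["class"])

    # Use a global constant to fill in missing values
    df_const = df.fillna({"age": -1, "income": -1})

    # Use the attribute mean of tuples belonging to the same class
    df_mean = df.copy()
    df_mean["age"] = df.groupby("class")["age"].transform(lambda s: s.fillna(s.mean()))

    print(df_mean)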
Noisy Data
• Noisy data contains errors caused by random variation in a measured variable.
• Incorrect attribute values may be due to:
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions
How to Handle Noisy Data
– Clustering: detect and remove outliers.
– Regression: smooth the data by fitting it to a regression function.
– Binning method: first sort the data and partition it into bins, then smooth each bin using its mean, median or boundary values (see the sketch below).
– Combined computer and human inspection: suspicious values are flagged automatically and then checked by a human.
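The binning step can be sketched as follows; the measurements and the number of bins are made-up values used only to show smoothing by bin means:

    import numpy as np

    # Hypothetical noisy measurements
    data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

    # Binning method: sort the data, partition it into equal-frequency bins,
    # then smooth every value by replacing it with its bin mean
    n_bins = 3
    bins = np.array_split(np.sort(data), n_bins)
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)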
Outliers
• Outliers are data points that are considerably dissimilar to, or inconsistent with, the remaining data.
• In most cases they are the result of noise, while in some cases they may actually carry valuable information.
Outliers can occur because of:
– Transient malfunction of measurement equipment
– Errors in data transmission or transcription
– Changes in system behaviour
– Data contamination from outside the population examined
– A flaw in the assumed theory
How to Handle Outliers
• There are three fundamental approaches to the problem of outlier detection:
• Type 1:
– Determine the outliers with no prior knowledge of the data. This is analogous to unsupervised learning (a simple sketch follows below).
• Type 2:
– Model both normality and abnormality. Analogous to supervised learning.
• Type 3:
– Model normality only. A semi-supervised learning approach.
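A minimal Type 1 style sketch using a simple z-score rule; the readings and the cut-off of 3 are illustrative assumptions, not prescribed by the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sensor readings: many ordinary values plus one gross error
    values = np.concatenate([rng.normal(loc=10.0, scale=0.5, size=30), [55.0]])

    # No labelled examples are used: flag points that lie far from the mean
    z_scores = (values - values.mean()) / values.std()
    outliers = values[np.abs(z_scores) > 3]
    print(outliers)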
Data Integration
• Combines data from multiple sources into a coherent store.
• Integrating metadata from different sources (schema integration) raises several problems:
– The entity identification problem
– Different sources recording different values for the same attribute
– Data redundancy
• These problems arise mainly because of different representations, different scales, and so on.
How to handle redundant data in data integration?
• Redundant data can often be detected by correlation analysis (see the sketch below).
• Step-wise and careful integration of data from multiple sources
may help to improve mining speed and quality.
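A rough sketch of correlation analysis with pandas; the attributes are invented, with one column that simply duplicates another on a different scale:

    import pandas as pd

    # Hypothetical integrated table: 'salary_usd' is just 'salary_eur' times a fixed rate
    df = pd.DataFrame({
        "age":        [23, 51, 45, 32, 29],
        "salary_eur": [30000, 42000, 61000, 70000, 39000],
        "salary_usd": [33000, 46200, 67100, 77000, 42900],
    })

    # Attribute pairs whose correlation is close to 1 are candidates for removal
    corr = df.corr()
    redundant = [(a, b) for a in corr.columns for b in corr.columns
                 if a < b and abs(corr.loc[a, b]) > 0.95]
    print(redundant)   # [('salary_eur', 'salary_usd')]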
Data Transformation
• Changing data from one form to another.
• Approaches:
– Smoothing: remove noise from the data.
– Aggregation: summarization of the data.
– Generalization: concept-hierarchy climbing.
– Normalization: scaling values to fall within a small, specified range.
Types of normalization:
– Min-max normalization:
• V' = ((V - min) / (max - min)) * (new_max - new_min) + new_min
– Z-score normalization:
• V' = (V - mean) / stand_dev
– Normalization by decimal scaling:
• V' = V / 10^j, where j is the smallest integer such that max(|V'|) < 1
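The three formulas can be checked with a short NumPy sketch; the values and the target range [0, 1] are arbitrary choices for illustration:

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max normalization to the new range [0, 1]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # Z-score normalization: subtract the mean, divide by the standard deviation
    v_zscore = (v - v.mean()) / v.std()

    # Decimal scaling: divide by 10^j, j being the smallest integer with max(|V'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    v_decimal = v / 10 ** j

    print(v_minmax, v_zscore, v_decimal, sep="\n")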
Data Aggregation:
Combining two or more attributes (or objects)
into a single attribute (or object).
• Purpose
– Data reduction: Reduce the number of attributes
or objects
– Change of scale: Cities aggregated into regions,
states, countries, etc
– More “stable” data: Aggregated data tends to have
less variability
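As a small illustration, the change of scale from cities to regions can be sketched with a pandas group-by; the sales figures and place names are invented:

    import pandas as pd

    # Hypothetical daily sales recorded per city
    sales = pd.DataFrame({
        "region": ["East", "East", "West", "West", "West"],
        "city":   ["CityA", "CityB", "CityC", "CityD", "CityC"],
        "amount": [120, 80, 200, 150, 90],
    })

    # Cities aggregated into regions: fewer rows, and the totals vary less than daily values
    by_region = sales.groupby("region")["amount"].sum()
    print(by_region)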
Data Reduction:
• A warehouse may store terabytes of data, so complex data mining may take a very long time to run on the complete data set.
• Data reduction is the process of obtaining a reduced representation of the data set that is much smaller in volume yet produces the same, or almost the same, analytical results.
• Different methods such as data sampling, dimensionality reduction, data cube aggregation, discretization and concept hierarchies are used for data reduction.
• Data compression can also be used, mostly for media files or data.
Data Sampling:
• Sampling is one of the main methods for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining for the same reason: processing the entire set of data of interest is too expensive or time consuming.
• A sample should be representative, i.e. it should have approximately the same properties as the original set of data.
Sampling types
• Simple random sampling: there is an equal probability of selecting any particular item.
• Sampling without replacement: as each item is selected, it is removed from the population.
• Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
• Stratified sampling: split the data into several partitions, then draw random samples from each partition (see the sketch after this list).
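A short pandas sketch of the sampling types; the toy table and sample sizes are assumptions made for illustration:

    import pandas as pd

    # Hypothetical population with an imbalanced 'class' attribute
    df = pd.DataFrame({"id": range(1, 11), "class": ["A"] * 7 + ["B"] * 3})

    # Simple random sampling without replacement
    without_repl = df.sample(n=4, replace=False, random_state=1)

    # Sampling with replacement: the same object can be picked more than once
    with_repl = df.sample(n=4, replace=True, random_state=1)

    # Stratified sampling: partition by 'class', then sample the same fraction from each partition
    stratified = df.groupby("class", group_keys=False).sample(frac=0.5, random_state=1)

    print(stratified)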
Data Discretization
• Converts continuous data into discrete data.
• Partitions data into different classes.
Two approaches are:
• Equal width (distance) partitioning:
– It divides the range into N intervals of equal size.
– If A and B are the lowest and the highest values of the attribute, the width of each interval will be W = (B - A) / N.
– The most straightforward approach to data discretization.
• Equal depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately the same number of samples.
– Gives good data scaling.
– Managing categorical attributes can be tricky.
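Both partitioning schemes are easy to try with pandas; the age values below are only an example:

    import pandas as pd

    ages = pd.Series([13, 15, 16, 16, 19, 20, 21, 22, 25, 25,
                      30, 33, 35, 36, 40, 45, 46, 52, 70])

    # Equal width partitioning: 3 intervals of equal size W = (B - A) / N
    equal_width = pd.cut(ages, bins=3)

    # Equal depth (frequency) partitioning: 3 intervals with roughly equal counts
    equal_depth = pd.qcut(ages, q=3)

    print(equal_width.value_counts().sort_index())
    print(equal_depth.value_counts().sort_index())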
OLAP
• OLAP stands for On-Line Analytical Processing.
• An OLAP cube is a data structure that allows fast analysis of data.
• OLAP tools were developed for multi-dimensional data analysis; they store their data in a special multi-dimensional format (the data cube) with no updating facility.
• An OLAP tool does not learn; it creates no new knowledge and cannot reach new solutions.
• Information of a multi-dimensional nature cannot be analysed easily when the table has the standard 2-D representation.
• A table with n independent attributes can be seen as an n-dimensional space.
• It is often necessary to explore the relationships between several dimensions, and standard relational databases are not very good at this.
OLAP operations:
(Figures illustrating the individual OLAP cube operations.)
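As a rough illustration (assuming the figures showed the usual cube operations such as roll-up, slice and dice), the following pandas sketch mimics them on an invented fact table:

    import pandas as pd

    # Hypothetical fact table with dimensions (quarter, city, item) and the measure 'sales'
    facts = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
        "city":    ["CityA", "CityB", "CityA", "CityB", "CityA", "CityB"],
        "item":    ["TV", "TV", "TV", "Phone", "Phone", "Phone"],
        "sales":   [100, 80, 120, 90, 60, 70],
    })

    # A small "cube": sales summed over every (quarter, city, item) cell
    cube = facts.pivot_table(values="sales", index=["quarter", "city"],
                             columns="item", aggfunc="sum")

    # Roll-up: climb the location hierarchy by aggregating the city dimension away
    rollup = facts.groupby(["quarter", "item"])["sales"].sum()

    # Slice: fix one dimension to a single value (quarter = "Q1")
    slice_q1 = facts[facts["quarter"] == "Q1"]

    # Dice: select a sub-cube on two or more dimensions
    dice = facts[(facts["quarter"] == "Q1") & (facts["item"] == "TV")]

    print(cube)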
OLTP (Online Transaction Processing)
• Used to carry out day-to-day business functions such as ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management).
• OLTP systems solved a critical business problem by automating daily business functions and running real-time reporting and analysis.
OLAP Vs OLTP
Facts               | OLTP                                        | OLAP
Source of data      | Operational data                            | Data warehouse (built from various databases)
Purpose of data     | Control and run fundamental business tasks  | Planning, problem solving and decision support
Queries             | Simple queries                              | Complex queries and algorithms
Processing speed    | Typically very fast                         | Depends on data size, techniques and algorithms
Space requirements  | Can be relatively small                     | Larger due to aggregated databases
Database design     | Highly normalized with many tables          | Typically denormalized with fewer tables; uses a star or snowflake schema
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are.
– Is lower when objects are more alike.
– Minimum dissimilarity is often 0
– Upper limit varies
Various Types of Similarity and Dissimilarity Measures
1. Jaccard Coefficient / Jaccard Distance
2. Cosine Similarity / Cosine Dissimilarity
3. Manhattan Distance
4. Euclidean Distance
5. Minkowski Distance
6. Hamming Similarity / Hamming Distance
JACCARD SIMILARITY
• Jaccard similarity measures the similarity
between two sets by comparing the size of
their intersection to the size of their union.
JACCARD DISTANCE
• Jaccard distance measures the dissimilarity between two sets and is defined as 1 minus the Jaccard similarity.
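Both quantities are straightforward to compute on Python sets; the two example sets below are made up:

    # Jaccard similarity and distance between two sets
    a = {"milk", "bread", "butter", "eggs"}
    b = {"milk", "bread", "jam"}

    jaccard_similarity = len(a & b) / len(a | b)   # |intersection| / |union|
    jaccard_distance = 1 - jaccard_similarity

    print(jaccard_similarity, jaccard_distance)    # 0.4 0.6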
COSINE SIMILARITY
• Cosine similarity measures the cosine of the angle between
two non-zero vectors in an inner product space. It tells us how
similar the directions of the vectors are, regardless of their
magnitude.
COSINE DISSIMILARITY
• Cosine dissimilarity is a measure of how different two vectors are in direction. It is the complement of cosine similarity:
• Cosine Dissimilarity = 1 - Cosine Similarity
• It ranges from 0 to 2:
• 0 indicates that the vectors point in the same direction.
• 1 indicates that the vectors are orthogonal.
• 2 indicates that the vectors are diametrically opposed.
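A small NumPy sketch of both measures; the two term-count vectors are invented:

    import numpy as np

    x = np.array([3.0, 2.0, 0.0, 5.0])
    y = np.array([1.0, 0.0, 0.0, 2.0])

    # Cosine of the angle between the vectors, ignoring their magnitudes
    cosine_similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cosine_dissimilarity = 1 - cosine_similarity

    print(cosine_similarity, cosine_dissimilarity)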
Manhattan Distance
• Manhattan (city-block) distance is the sum of the absolute differences between the corresponding elements of two vectors.
Euclidean Distance
• Euclidean distance measures the straight-line
distance between two points in Euclidean
space. It's the most intuitive way to quantify
how far apart two points (or vectors) are.
Minkowski Distance
• Minkowski distance generalizes the Manhattan and Euclidean distances: it is the p-th root of the sum of the absolute differences raised to the power p. Setting p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and letting p grow without bound gives the supremum distance.
Supremum Distance
• The supremum distance between two vectors
is the maximum absolute difference between
their corresponding elements.
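The Manhattan, Euclidean, Minkowski and supremum distances can be compared on one pair of vectors; the vectors and the choice p = 3 are arbitrary examples:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    manhattan = np.sum(np.abs(x - y))                  # sum of absolute differences
    euclidean = np.sqrt(np.sum((x - y) ** 2))          # straight-line distance
    p = 3
    minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # generalizes the two above
    supremum = np.max(np.abs(x - y))                   # largest single difference

    print(manhattan, euclidean, minkowski, supremum)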
SIMPLE MATCHING COEFFICIENT
• The Simple Matching Coefficient (SMC) is a similarity measure used for comparing two binary vectors. It calculates the proportion of matching attributes (both 0s and 1s) between the two vectors.
Dissimilarity of Symmetric Binary Attributes
• For symmetric binary attributes, the dissimilarity between two objects is the number of attributes on which the two objects disagree divided by the total number of attributes, i.e. 1 minus the Simple Matching Coefficient.
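A minimal sketch for two made-up binary vectors:

    import numpy as np

    p = np.array([1, 0, 0, 1, 1, 0])
    q = np.array([1, 1, 0, 1, 0, 0])

    matches = np.sum(p == q)        # positions that agree (0-0 or 1-1)
    smc = matches / len(p)          # Simple Matching Coefficient
    dissimilarity = 1 - smc         # dissimilarity for symmetric binary attributes

    print(smc, dissimilarity)       # 0.666... 0.333...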
Hamming Similarity
• Hamming similarity is a measure of similarity
between two strings (or binary vectors) of equal
length. It’s closely related to the Hamming
distance, which counts the number of positions
where the two strings differ.
• Hamming distance between two equal-length
strings is the number of positions at which the
corresponding symbols are different.
• Hamming similarity is often defined as the
proportion of positions where the two strings are
the same.
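Both quantities reduce to counting positions; the two strings below are arbitrary equal-length examples:

    s1 = "karolin"
    s2 = "kathrin"

    # Hamming distance: number of positions where the strings differ
    hamming_distance = sum(c1 != c2 for c1, c2 in zip(s1, s2))

    # Hamming similarity: proportion of positions where the strings agree
    hamming_similarity = 1 - hamming_distance / len(s1)

    print(hamming_distance, hamming_similarity)   # 3 0.571...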