Data Preprocessing
Contents
What and Why to Preprocess the Data
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Preprocessing
Data preprocessing is a data mining technique that transforms raw data into an understandable format.
Why Preprocess Data?
• Data in the real world is:
Incomplete: lacking values, certain attributes of interest, etc.
Noisy: containing errors or outliers
Inconsistent: lacking compatibility or similarity between two or more facts
Why Preprocess Data?
• No quality data, no quality mining:
Quality decisions must be based on quality data.
Measures of Data Quality
Accuracy
Completeness
Consistency
Timeliness
Value added
Interpretability
Accessibility, etc.
Data Preprocessing Techniques
Data Cleaning
Data Integration
Data Transformation
Data Reduction
What is Data?
A collection of data objects and their attributes. Each row is a data object; each column is an attribute; Cheat is the class attribute.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
There are different types of attributes:
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio
Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: = and ≠
Order: < and >
Addition: + and -
Multiplication: * and /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Data Types and Forms
Attribute-value data: a table of objects described by attributes A1, A2, …, An and a class C.
Data types:
numeric, categorical (see the hierarchy for their relationship)
static, dynamic (temporal)
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes. Variants include the data matrix, document data, and transaction data, described below.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
Data objects with a fixed set of numeric attributes can be considered as points in a multi-dimensional space, where each dimension represents a distinct attribute. Represent them by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
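A minimal sketch (assuming Python with NumPy) of storing the table above as a 2-by-5 data matrix, one row per object and one column per attribute:

import numpy as np

# m = 2 objects (rows), n = 5 attributes (columns):
# projection of x load, projection of y load, distance, load, thickness
data_matrix = np.array([
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
])

print(data_matrix.shape)  # (2, 5): m rows, n columns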
Document Data
Each document becomes a 'term' vector:
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
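A minimal sketch of building such term vectors in Python, using the vocabulary from the table above (the example document is illustrative, not from the slides):

from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(document):
    # Count how often each vocabulary term occurs in the document.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

print(term_vector("the team lost the game after a timeout"))
# -> [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]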
Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
For example, grocery store transactions.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
Examples:
World Wide Web (generic graphs and HTML links)
Molecular structures (e.g., the benzene molecule, C6H6)
(Figures: a generic graph with numbered nodes; the benzene molecule C6H6.)
Ordered Data
Sequences of transactions
Spatial data
Temporal data (e.g., average monthly temperature of land and ocean)
Spatio-temporal data
Sequential data
Genetic/genomic sequence data, e.g.:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
The Data Analysis Pipeline
Mining is not the only step in the analysis process:
Data → Preprocessing → Data Mining → Post-processing → Result
Preprocessing: real data is noisy, incomplete and inconsistent; data cleaning is required to make sense of the data
Techniques: sampling, dimensionality reduction, feature selection
Post-processing: make the data actionable and useful to the user
Statistical analysis of importance
Visualization
Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    <- a mistake or a millionaire?
6    No      NULL            60K             No     <- missing value
7    Yes     Divorced        220K            NULL   <- missing value
8    No      Single          85K             Yes
9    No      Married         90K             No     <- inconsistent duplicate entries
9    No      Single          90K             No     <- inconsistent duplicate entries
Noise
Noise refers to modification of original values.
Example: distortion of a person’s voice when talking on a poor phone.
(Figure: two sine waves, and the same two sine waves with noise added.)
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
They can help to:
detect new phenomena or discover unusual behavior in data
detect problems
Sample applications of outlier detection:
Fraud detection
Abnormal buying patterns can characterize credit card abuse
Medicine
Unusual symptoms or test results may indicate potential health problems of a patient
Public health
The occurrence of a particular disease
Sports statistics
Outstanding (in a positive as well as a negative sense) players may be identified as having abnormal parameter values
Detecting measurement errors
Data derived from sensors (e.g., in a given scientific experiment) may contain measurement errors
Abnormal values could provide an indication of a measurement error
“One person’s noise could be another person’s signal.”
Missing Values
Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values:
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values (weighted by their probabilities)
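A minimal sketch of the first three options in Python with pandas (the DataFrame and column names are illustrative, not from the slides):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "income": [50000, 60000, np.nan]})

dropped = df.dropna()          # eliminate data objects with missing values
filled = df.fillna(df.mean())  # estimate missing values, e.g., by the column mean
# Many computations simply ignore missing values during analysis:
print(df["income"].mean())     # mean over the non-missing values only -> 55000.0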
Duplicate Data
Data sets may include data objects that are duplicates, or almost duplicates, of one another.
This is a major issue when merging data from heterogeneous sources.
Examples:
The same person with multiple email addresses
Forms of Data Preprocessing
• Fill in missing values
• Smooth noisy data
• Remove outliers
• Resolve inconsistencies
• Normalization and aggregation
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Clustering:
detect and remove outliers
Combined computer and human inspection:
detect suspicious values and have a human check them
Regression:
smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: a uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well
Simple Discretization Methods: Binning
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Example
Attribute values (for one attribute, e.g., age):
0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning, for a bin width of e.g. 10:
Bin 1: 0, 4              [-∞, 10) bin
Bin 2: 12, 16, 16, 18    [10, 20) bin
Bin 3: 24, 26, 28        [20, +∞) bin
(-∞ denotes negative infinity, +∞ positive infinity)

Equi-frequency binning, for a bin density of e.g. 3:
Bin 1: 0, 4, 12          [-∞, 14) bin
Bin 2: 16, 16, 18        [14, 21) bin
Bin 3: 24, 26, 28        [21, +∞) bin
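A minimal sketch of both partitionings in Python, using the age values above (the helper code is ours, not from the slides):

values = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# Equi-width binning with width 10: bin index = value // 10
width_bins = {}
for v in values:
    width_bins.setdefault(v // 10, []).append(v)
print(width_bins)  # {0: [0, 4], 1: [12, 16, 16, 18], 2: [24, 26, 28]}

# Equi-frequency binning with 3 values per bin
k = 3
sorted_values = sorted(values)
freq_bins = [sorted_values[i:i + k] for i in range(0, len(sorted_values), k)]
print(freq_bins)   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]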
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
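A minimal sketch reproducing this example in Python (the helper code and the rounding of bin means are ours):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth, 4 per bin

# Smoothing by bin means: replace each value by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace each value by the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]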
Data Integration
Data integration combines data from multiple sources into a coherent store.
Detecting and resolving data value conflicts:
for the same real-world entity, attribute values from different sources may differ
possible reasons: different representations, different scales (e.g., metric vs. British units)
Handling Redundant Data
Redundant data often occur when multiple databases are integrated:
The same attribute may have different names in different databases
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality
Data Transformation
Transform or consolidate data into forms appropriate for mining:
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Daily sales data aggregated to compute monthly or annual amounts
Normalization: scale to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation
Aggregation: summarization, data cube construction
Daily sales data aggregated to compute monthly or annual amounts
(Figure: a data cube for sales.)
Data Transformation: Normalization
Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0.
Min-max normalization performs a linear transformation on the original data:

v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A

Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively. Map income to the range [0.0, 1.0].
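For instance, an income value of $73,600 maps to ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0.0) + 0.0 ≈ 0.716.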
Data Transformation: Normalization
z-score normalization (or zero-mean normalization):
The values of an attribute A are normalized based on the mean and standard deviation of A:

v' = (v - mean_A) / stand_dev_A

Example: Let the mean and standard deviation of the values for the attribute income be $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
Data Transformation: Normalization
Decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal points moved depends on the maximum absolute value of A:

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
Example: Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986.
To normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., j = 3)
-986 normalizes to -0.986 and 917 normalizes to 0.917.
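A minimal sketch of the three normalizations in Python, checked against the examples above (function names are ours):

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # linear transformation, as in the min-max formula above
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1
    return v / 10 ** j

print(min_max(73600, 12000, 98000))  # ~0.716
print(z_score(73600, 54000, 16000))  # 1.225
print(decimal_scaling(-986, 3))      # -0.986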
Data Reduction
Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the
complete data set
Data reduction
Obtains a reduced representation of the data set that is much
smaller in volume
but produces the same (or almost the same) analytical results
Data Reduction Strategies
Dimensionality reduction
Data compression
use encoding schemes to reduce the data set size
Numerosity reduction
data is replaced or estimated by alternative, smaller data representations
Sampling
Histograms
Clustering
Discretization and concept hierarchy generation
replace raw attribute values by ranges or higher conceptual levels
Histograms
A popular data reduction technique:
Divide data into buckets and store the average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
(Figure: a histogram over attribute values ranging from 10,000 to 100,000.)
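A minimal sketch of bucketing data into an equal-width histogram with NumPy (the data here is synthetic, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)  # synthetic attribute values

# 9 equal-width buckets over [10,000, 100,000); store only counts per bucket
counts, edges = np.histogram(prices, bins=9, range=(10_000, 100_000))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c}")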
Cluster Analysis
Partition data into clusters, and store only the cluster representation.
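A minimal sketch of this idea with scikit-learn's KMeans (an assumption; the slides do not name a clustering algorithm): keep only the cluster centroids instead of the full data set.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))  # 1,000 synthetic points in 2-D

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)
representation = kmeans.cluster_centers_  # 10 centroids instead of 1,000 points
print(representation.shape)               # (10, 2)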
Sampling
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Example: What is the average height of a person in Pakistan? We cannot measure the height of everybody.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Example: We have 1M documents. What fraction has at least 100 words in common? Computing the number of common words for all pairs requires 10^12 comparisons.
Example: What fraction of tweets in a year contain the word “Lahore”? At 300M tweets per day, with 100 characters on average: 86.5TB to store all tweets.
Sampling …
The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative.
A sample is representative if it has approximately the same property (of interest) as the original set of data; otherwise we say that the sample introduces some bias.
Types of Sampling
Simple random sampling:
There is an equal probability of selecting any particular item
Sampling without replacement:
As each item is selected, it is removed from the population
Sampling with replacement:
Objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once
This makes analytical computation of probabilities easier
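A minimal sketch of both variants with Python's standard library (the population is illustrative):

import random

population = list(range(100))
random.seed(0)

without_repl = random.sample(population, k=10)  # each item selected at most once
with_repl = random.choices(population, k=10)    # the same item may repeat
print(without_repl)
print(with_repl)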
Sampling
(Figure: raw data of 8,000 points, and random samples of 2,000 and 500 points.)
Discretization
Discretization divides the range of a continuous attribute into intervals:
Reduces data size
Interval labels can be used to replace actual data values
Discretization methods for numeric data:
Binning (sensitive to the user-specified number of bins and to outliers)
Histogram analysis
Clustering analysis
Segmentation by natural partitioning
Summary
Data preparation is a big issue for both warehousing and mining.
Data preparation includes:
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot of methods have been developed, but this is still an active area of research.