02 Data Preprocessing
2022/23
AGENDA
- Methodology in Data Mining
- What is data?
- Data preprocessing
1 What is methodology?
What is a methodology in DS?
- Framework for recording experience, allowing projects to be replicated.
- Aids project planning and management.
1 What is methodology?
CRISP-DM vs SEMMA vs OSEMN
- CRISP-DM: CRoss-Industry Standard Process for Data Mining
- SEMMA: Sample, Explore, Modify, Model, Assess
- OSEMN: Obtain, Scrub, Explore, Model, iNterpret
[Diagrams: the CRISP-DM cycle (Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment) and the SEMMA cycle (Sample, Explore, Modify, Model, Assess).]
1 Methodologies in Data Science
CRISP-DM
BUSINESS UNDERSTANDING
- Understanding of project objectives and requirements; definition of the data mining problem.
MODELLING
- Selection and application of modelling techniques; calibration of parameters.
1 Methodologies in Data Science
Other approaches: SEMMA
- SAMPLE: generate a representative sample of the data.
- EXPLORE: visualization and basic description of the data.
- MODIFY: select variables, transform variable representations.
- MODEL: use a variety of statistical and ML models.
- ASSESS: evaluate the model's accuracy and usefulness.
OSEMN
- It includes five stages: Obtaining the data, Scrubbing and cleaning the data, Exploring the data, Modeling the data, and iNterpreting the results.
- O: OBTAIN | S: SCRUB | E: EXPLORE | M: MODEL | N: INTERPRET
Each methodology provides a structured approach to data analysis, but they have different stages and approaches that emphasize different aspects of the data analysis process.
Apply models:
- Choose the modeling technique
- Build models
2 What is data?
What is data?
A collection of data objects and their attributes.

DATA TYPES
- NUMERIC: expressed on a numerical scale.
- CATEGORICAL: can take only a specific set of values.

Other kinds of data:
- Graph data: World Wide Web, molecular structures
- Spatial data
- Ordered data: temporal data, genetic sequence data
- Music, images, …
3 Preprocessing
What is preprocessing?
The process of turning raw data into a state that can be easily used by an algorithm.

Why preprocess data?
- Minimize GIGO (Garbage In, Garbage Out).
- Data preparation typically takes about 60% of the effort in a data mining project (Forbes).
3.1 Data cleaning

Customer_ID  Zip     Gender  Income      Age  Marital Status  Transaction Amount
1001         10048   M       75,000      C    M               5000
1002         J2S7K7  F       -40,000     40   W               4000
1003         90210           10,000,000  45   S               7000
1004         6269    M       50,000      0    S               1000
1005         55101   F       99,999      36   D               3000

Five-numeral U.S. Zip code?
- Not all countries use the same zip format: 10048 (U.S.) vs J2S7K7 (Canada).

Four-digit Zip code?
- Leading zero truncated (6269 vs 06269): the database stored the field as numeric and chopped off the leading zero.

Missing value in Gender.

Income equal to -$40,000
- Income less than 0? Value beyond the bounds expected for income, therefore an error.
- Caused by a data entry error? Canadian dollars or U.S. dollars?
- Discuss the anomaly with the database administrator.

Income equal to $10,000,000
- Assumed to measure gross annual income (possibly valid).
- Still considered an outlier (extreme data value).

Income equal to $99,999
- Other values are rounded to the nearest $5,000.
- Value may be completely valid, but can also represent a code used to denote missing values.
- Discuss the anomaly with the database administrator.

Age field contains "C"
- Other records have numeric values for this field.
- Record categorized into a group labeled C? Value must be resolved.

Age field contains 0
- Zero value used to indicate missingness/unknown value?
- Customer refused to provide their age?

Age field
- Date-type fields may become obsolete: store date of birth, then derive Age.

Marital Status field contains "S"
- What does this symbol mean? Does "S" imply single or separated?
- Discuss the anomaly with the database administrator.
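The checks described above can be expressed as simple pandas filters. A minimal sketch (the table reconstruction and column names are assumptions made for illustration, not a prescribed recipe):

```python
import pandas as pd

# Hypothetical reconstruction of the slide's customer table.
df = pd.DataFrame({
    "Customer_ID": [1001, 1002, 1003, 1004, 1005],
    "Zip": ["10048", "J2S7K7", "90210", "6269", "55101"],
    "Gender": ["M", "F", None, "M", "F"],
    "Income": [75_000, -40_000, 10_000_000, 50_000, 99_999],
    "Age": ["C", "40", "45", "0", "36"],
})

# Flag values outside the expected domain of each field.
bad_income = df[df["Income"] < 0]               # negative income
non_numeric_age = df[~df["Age"].str.isdigit()]  # e.g. "C"
missing_gender = df[df["Gender"].isna()]        # blank Gender field
short_zip = df[df["Zip"].str.len() < 5]         # truncated leading zero

print(bad_income["Customer_ID"].tolist())       # [1002]
print(non_numeric_age["Customer_ID"].tolist())  # [1001]
```

Each flagged record would then be discussed with the database administrator, as the slide suggests, rather than silently fixed.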
3.1 Data cleaning
Missing values

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36               Europe
1003         Married         42   1590        US
1004                         12   0           France
1005         Divorced        44   1370        Japan

Should we delete records containing missing values?
- Not necessarily the best approach.
- The pattern of missing values may be systematic: deleting those records creates a biased subset.
- Valuable information in the other fields is lost.
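Listwise deletion is one line of pandas, which is part of why it is tempting. A sketch on the slide's table (reconstructed, with None marking the missing cells):

```python
import pandas as pd

# The slide's example table; None marks missing values.
df = pd.DataFrame({
    "Customer_ID": [1001, 1002, 1003, 1004, 1005],
    "Marital_Status": ["Single", "Divorced", "Married", None, "Divorced"],
    "Age": [27, 36, 42, 12, 44],
    "Income": [1370, None, 1590, 0, 1370],
    "Origin": ["US", "Europe", "US", "France", "Japan"],
})

# Listwise deletion: drop every row with at least one missing value.
complete = df.dropna()
print(len(complete))  # 3 of the 5 records survive; the rest are discarded
```

Records 1002 and 1004 disappear entirely, taking their valid Age and Origin values with them — the information loss the slide warns about.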
3.1 Data cleaning
Missing values – Fill with constant

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   0.0         Europe
1003         Married         42   1590        US
1004         Missing         12   0           France
1005         Divorced        44   1370        Japan

- Missing numeric values replaced with 0.0.
- Missing categorical values replaced with "Missing".
3.1 Data cleaning
Missing values – Fill with mean / mode

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   1082.5      Europe
1003         Married         42   1590        US
1004         Divorced        12   0           France
1005         Divorced        44   1370        Japan

- Missing numeric values replaced with the mean (1082.5).
- Missing categorical values replaced with the mode (Divorced).

We will return to this in Data Cleaning: Identify Misclassifications & Incoherencies.
3.1 Data cleaning
Missing values – Fill with median

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   1370        Europe
1003         Married         42   1590        US
1004         Missing         12   0           France
1005         Divorced        44   1370        Japan

- Mean: 1082.5 | Median: 1370
- The median is more robust to extreme values (such as the 0 income) than the mean.
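The constant, mean, median, and mode strategies above all map onto pandas `fillna`. A sketch using the same example table (reconstruction assumed):

```python
import pandas as pd

# Same example table as the slides; None marks missing values.
df = pd.DataFrame({
    "Marital_Status": ["Single", "Divorced", "Married", None, "Divorced"],
    "Income": [1370.0, None, 1590.0, 0.0, 1370.0],
})

# Fill with a constant.
const = df.fillna({"Income": 0.0, "Marital_Status": "Missing"})

# Fill numeric with mean / median, categorical with mode.
mean_filled = df["Income"].fillna(df["Income"].mean())      # mean = 1082.5
median_filled = df["Income"].fillna(df["Income"].median())  # median = 1370.0
mode_filled = df["Marital_Status"].fillna(df["Marital_Status"].mode()[0])

print(mean_filled.tolist())  # [1370.0, 1082.5, 1590.0, 0.0, 1370.0]
print(mode_filled.tolist())  # ['Single', 'Divorced', 'Married', 'Divorced', 'Divorced']
```

Note that the statistic is computed only over the observed values, exactly as on the slides.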
3.1 Data cleaning
Missing values – Fill with random value
- Missing values are replaced with a value drawn at random from the variable's observed distribution.
- E.g., record 1004's missing Marital Status filled with "Single".
[Histogram of the observed Income values (0, 1370, 1590) from which random replacements would be drawn.]
3.1 Data cleaning
Missing values – Apply a predictive model

KNN Imputer
- Match a point with its closest k neighbors in a multi-dimensional space.
- Suppose we use just Age to estimate Income:
  - The nearest neighbor of {36, Missing} is {42, 1590}.
  - We fill the missing value in Income with 1590.

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   1590        Europe
1003         Married         42   1590        US
1004                         12   0           France
1005         Divorced        44   1370        Japan

Other predictive models can be used: Linear Regression, Decision Tree, Neural Network…
We can do the same to fill the missing value in Marital Status (after encoding the categorical data).
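The worked example above can be reproduced with scikit-learn's `KNNImputer`, which measures distance over the observed features only (here, effectively just Age):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Age and Income columns from the slide; np.nan marks the missing income.
X = np.array([
    [27, 1370.0],
    [36, np.nan],
    [42, 1590.0],
    [12, 0.0],
    [44, 1370.0],
])

# k=1: copy the value from the single nearest neighbour (Age 42 → 1590).
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 1590.0
```

With k > 1, the imputed value would be the (optionally distance-weighted) average of the k nearest neighbours' incomes instead.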
3.1 Data cleaning
Misclassifications & incoherencies
Question: How much will a customer with an income of 2250 € spend in my store?
[Regression of MONETARY SPENT on INCOME; the fitted model predicts y = 140 €.]
3.1 Data cleaning
Outliers
Question: How much will a customer with an income of 2250 € spend in my store?
Why should we care?
[The same regression fitted with outliers present in the data now predicts y = 98 € — the outliers distort the fitted model.]
3.1 Data cleaning
Outliers
Question: How many distinct groups of customers do I have? What is the profile of each one?
Why should we care?
[Scatter plots of INCOME vs MONETARY SPENT. Ignoring the outliers, clustering suggests that customers with medium and high income spend around 150 €. Treating the outliers properly reveals three profiles: customers with medium income spend around 120 €, customers with high income spend around 170 €, and an outlying group of high-income customers spends around 80 €.]
3.2 Data transformation
Scaling

MinMax normalization
- Guarantees that all data points lie within a given range.
- If we want to change the range to vary between a and b:
    v' = a + ((v − min_x) / (max_x − min_x)) × (b − a)
- The most common range is [0, 1]:
    v' = (v − min_x) / (max_x − min_x)

Z-score
    v' = (v − μ_x) / σ_x

Why should we care?

ID  Age  Income
1   20   980
2   37   1230
3   42   1020
4   18   890
5   34   1290
6   55   1720
7   62   1650
8   21   1020

[Scatter plots of Age vs Income before and after MinMax scaling; the apparent groupings of the records change once both variables are on a comparable scale.]
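Both formulas are a few lines of NumPy on the slide's Age/Income table (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same thing):

```python
import numpy as np

# Age and Income from the slide's table.
X = np.array([
    [20, 980], [37, 1230], [42, 1020], [18, 890],
    [34, 1290], [55, 1720], [62, 1650], [21, 1020],
], dtype=float)

# MinMax normalization to [0, 1]: v' = (v - min) / (max - min), per column.
mn, mx = X.min(axis=0), X.max(axis=0)
X_minmax = (X - mn) / (mx - mn)

# Z-score: v' = (v - mean) / std, per column.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax[:, 0].min(), X_minmax[:, 0].max())  # 0.0 1.0
```

After MinMax scaling both Age and Income span [0, 1], so neither variable dominates distance-based methods such as K-Means.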
3.2 Data transformation
Power transforms
Power transforms (e.g., log, square root, Box-Cox) reduce skewness and make distributions more symmetric.
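As an illustration (the skewed income data below is synthetic, not from the slides), scikit-learn's `PowerTransformer` fits a Box-Cox or Yeo-Johnson transform and standardizes the result:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# A right-skewed variable, e.g. income (synthetic lognormal data).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=7, sigma=0.5, size=(500, 1))

# Box-Cox requires strictly positive values; the default Yeo-Johnson
# also handles zeros and negatives.
pt = PowerTransformer(method="box-cox")
x_t = pt.fit_transform(x)

# The transformed variable is roughly symmetric, zero mean, unit variance.
print(round(float(x_t.mean()), 6))  # ~0.0
```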
3.2 Data transformation
Dummy Variables / Encoding

Original data:
Animal   Target
Cat      1
Hamster  0
Cat      0
Cat      1
Dog      1
Hamster  1
Cat      0
Dog      1
Cat      0
Dog      0

Encoding each category with the probability of Target = 1:
Animal   Target 0  Target 1  P(Target = 1)
Cat      3         2         0.4
Dog      1         2         0.67
Hamster  1         1         0.5

Encoded data:
Encoded Animal  Target
0.4             1
0.5             0
0.4             0
0.4             1
0.67            1
0.5             1
0.4             0
0.67            1
0.4             0
0.67            0
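Both encodings are short in pandas: `get_dummies` produces one 0/1 column per category, and a groupby mean reproduces the probability (target) encoding from the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Cat", "Hamster", "Cat", "Cat", "Dog",
               "Hamster", "Cat", "Dog", "Cat", "Dog"],
    "Target": [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
})

# Dummy (one-hot) variables: one 0/1 column per category.
dummies = pd.get_dummies(df["Animal"])

# Target (probability) encoding: replace each category with P(Target = 1).
probs = df.groupby("Animal")["Target"].mean()
df["Encoded_Animal"] = df["Animal"].map(probs)

print(probs.round(2).to_dict())  # {'Cat': 0.4, 'Dog': 0.67, 'Hamster': 0.5}
```

In practice, target encoding should be fit on training data only, otherwise the target leaks into the features.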
3.2 Data transformation
Binning / Discretizing
Why? What can we do?
- Equal width: divides the numerical variable into k categories of equal width.
- Equal frequency: divides the numerical variable into k categories, each containing n/k records.
- K-Means clustering: uses a clustering algorithm, such as K-Means, to automatically calculate the "optimal" partitioning.
[Number lines from 0 to 42 illustrating the three partitioning strategies.]
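The first two strategies map onto pandas `cut` and `qcut`; the sample values below are illustrative, not from the slides:

```python
import pandas as pd

x = pd.Series([1, 3, 5, 8, 9, 12, 14, 20, 27, 41])  # illustrative values

# Equal width: 3 bins of equal width over the observed range.
equal_width = pd.cut(x, bins=3)

# Equal frequency: 3 bins with roughly n/k records each.
equal_freq = pd.qcut(x, q=3)

print(equal_width.value_counts().tolist())  # skewed counts per bin
print(equal_freq.value_counts().tolist())   # roughly balanced counts
```

For the third strategy, scikit-learn's `KBinsDiscretizer(n_bins=3, strategy="kmeans")` computes the partitioning with K-Means.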
3.2 Data transformation
Binning / Discretizing
Customers with fewer than four calls to customer service had a lower churn rate than customers who had four or more calls to customer service.
[Bar chart: churn vs no-churn percentage by number of customer service calls (0–9); churn rises sharply at four or more calls.]
Bin the customer service calls variable into two classes:
- Low (fewer than four)
- High (four or more)
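This two-class binning is one `pd.cut` call (the call counts below are illustrative):

```python
import pandas as pd

calls = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # customer service calls

# Two classes: Low (fewer than four calls) vs High (four or more).
service_level = pd.cut(calls, bins=[-1, 3, calls.max()], labels=["Low", "High"])

print(service_level.tolist())  # ['Low', 'Low', 'Low', 'Low', 'High', ...]
```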
3.2 Data transformation
Reclassify
Logistic Regression and Decision Trees (C4.5) perform suboptimally when confronted with categorical variables that have a large number of levels.
What can we do? Reclassify the variable into a smaller number of categories.
3.3 Data reduction
Remove records
- Duplicate records lead to overweighting of the duplicated data values.

Remove features
Multicollinearity
- A condition where some of the variables are correlated with each other.
- Leads to instability in the solution space, possibly resulting in incoherent results.
- Inclusion of highly correlated variables overemphasizes particular components of the model.

Too many variables
- Unnecessarily complicate the interpretation of the analysis.
- Violate the principle of parsimony: one should consider keeping the number of variables to a size that can be easily interpreted.
- In supervised problems, can lead to overfitting.
- Curse of dimensionality.
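Both reduction steps have simple pandas sketches: deduplicate rows, then flag highly correlated feature pairs (the toy data and the 0.95 threshold are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [10, 20, 20, 30],  # perfectly correlated with "a"
    "c": [5, 1, 1, 7],
})

# Remove duplicate records so repeated rows are not overweighted.
df = df.drop_duplicates()

# Flag highly correlated feature pairs (possible multicollinearity).
corr = df.corr().abs()
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and corr.loc[i, j] > 0.95]
print(high)  # [('a', 'b')]
```

One of each flagged pair would typically be dropped (or the pair combined, e.g. via PCA).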
3.3 Data reduction
Remove features
Curse of dimensionality
- High-dimensional data lead to sparse data: as the number of features increases, complexity increases and data analysis tasks become significantly harder.
Possible solutions:
- Principal component analysis (PCA) (out of scope)
[Cartoon: https://xkcd.com/721/]