02 Data Preprocessing
2022/23
AGENDA
- Methodology in Data Mining
- What is data?
- Data preprocessing
1 What is methodology?
What is a methodology in DS?
- Framework for recording experience, allowing projects to be replicated.
- Aids project planning and management.
1 What is methodology?
CRISP-DM vs SEMMA vs OSEMN
- CRISP-DM: CRoss-Industry Standard Process for Data Mining
- SEMMA: Sample, Explore, Modify, Model, Assess
- OSEMN: Obtain, Scrub, Explore, Model, iNterpret
[Diagrams: the CRISP-DM cycle (Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment) and the SEMMA cycle (Sample, Explore, Modify, Model, Assess).]
1 Methodologies in Data Science
CRISP-DM
BUSINESS UNDERSTANDING
- Understanding of project objectives and requirements; definition of the data mining problem.
MODELLING
- Selection and application of modelling techniques; calibration of parameters.
1 Methodologies in Data Science
Other approaches: SEMMA
- SAMPLE: generate a representative sample of the data.
- EXPLORE: visualization and basic description of the data.
- MODIFY: select variables, transform variable representations.
- MODEL: use a variety of statistical and ML models.
- ASSESS: evaluate the model's accuracy and usefulness.
OSEMN
- It includes five stages: Obtaining the data, Scrubbing and cleaning the data, Exploring the data, Modeling the data, and iNterpreting the results.
- O: OBTAIN | S: SCRUB | E: EXPLORE | M: MODEL | N: INTERPRET
Each methodology provides a structured approach to data analysis, but they have different stages and approaches that emphasize different aspects of the data analysis process.
Apply models:
- Choose the modeling technique
- Build models
2 What is data?
What is data?
A collection of data objects and their attributes.

DATA TYPES
- NUMERIC: expressed on a numerical scale.
- CATEGORICAL: can take only a specific set of values.

Other kinds of data:
- Graph data: World Wide Web, molecular structures
- Spatial data
- Ordered data: temporal data, genetic sequence data
- Music, images, …
3 Preprocessing
What is preprocessing?
The process of turning raw data into a state that can be easily used by an algorithm.

Why preprocess data?
- Minimize GIGO (Garbage In, Garbage Out).
- Data preparation typically takes about 60% of the effort in a data mining project (Forbes).
3.1 Data cleaning

Customer_ID  Zip     Gender  Income      Age  Marital Status  Transaction Amount
1001         10048   M       75,000      C    M               5000
1002         J2S7K7  F       -40,000     40   W               4000
1003         90210           10,000,000  45   S               7000
1004         6269    M       50,000      0    S               1000
1005         55101   F       99,999      36   D               3000

Five-numeral U.S. Zip code?
- Not all countries use the same zip format: 10048 (U.S.) vs J2S7K7 (Canada).

Four-digit Zip code?
- Leading zero truncated (6269 vs 06269): the database stored the field as numeric and chopped off the leading zero.

Missing value in Gender.

Income equal to -$40,000
- Income less than 0? Value beyond the bounds expected for income, therefore an error.
- Caused by a data entry error? Canadian dollars or U.S. dollars?
- Discuss the anomaly with the database administrator.

Income equal to $10,000,000
- Assumed to measure gross annual income (possibly valid).
- Still considered an outlier (extreme data value).

Income equal to $99,999
- Other values are rounded to the nearest $5,000.
- Value may be completely valid, but can also represent a code used to denote missing values.
- Discuss the anomaly with the database administrator.

Age field contains "C"
- Other records have numeric values for this field.
- Record categorized into a group labeled C? Value must be resolved.

Age field contains 0
- Zero value used to indicate missingness/unknown value?
- Customer refused to provide their age?

Age field
- Date-type fields may become obsolete: store date of birth, then derive Age.

Marital Status field contains "S"
- What does this symbol mean? Does "S" imply single or separated?
- Discuss the anomaly with the database administrator.
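The checks described above can be expressed as simple pandas filters. A minimal sketch (the table reconstruction and column names are assumptions made for illustration, not a prescribed recipe):

```python
import pandas as pd

# Hypothetical reconstruction of the slide's customer table.
df = pd.DataFrame({
    "Customer_ID": [1001, 1002, 1003, 1004, 1005],
    "Zip": ["10048", "J2S7K7", "90210", "6269", "55101"],
    "Gender": ["M", "F", None, "M", "F"],
    "Income": [75_000, -40_000, 10_000_000, 50_000, 99_999],
    "Age": ["C", "40", "45", "0", "36"],
})

# Flag values outside the expected domain of each field.
bad_income = df[df["Income"] < 0]               # negative income
non_numeric_age = df[~df["Age"].str.isdigit()]  # e.g. "C"
missing_gender = df[df["Gender"].isna()]        # blank Gender field
short_zip = df[df["Zip"].str.len() < 5]         # truncated leading zero

print(bad_income["Customer_ID"].tolist())       # [1002]
print(non_numeric_age["Customer_ID"].tolist())  # [1001]
```

Each flagged record would then be discussed with the database administrator, as the slide suggests, rather than silently fixed.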
3.1 Data cleaning
Missing values

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36               Europe
1003         Married         42   1590        US
1004                         12   0           France
1005         Divorced        44   1370        Japan

Should we delete records containing missing values?
- Not necessarily the best approach.
- The pattern of missing values may be systematic: deleting those records creates a biased subset.
- Valuable information in the other fields is lost.
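Listwise deletion is one line of pandas, which is part of why it is tempting. A sketch on the slide's table (reconstructed, with None marking the missing cells):

```python
import pandas as pd

# The slide's example table; None marks missing values.
df = pd.DataFrame({
    "Customer_ID": [1001, 1002, 1003, 1004, 1005],
    "Marital_Status": ["Single", "Divorced", "Married", None, "Divorced"],
    "Age": [27, 36, 42, 12, 44],
    "Income": [1370, None, 1590, 0, 1370],
    "Origin": ["US", "Europe", "US", "France", "Japan"],
})

# Listwise deletion: drop every row with at least one missing value.
complete = df.dropna()
print(len(complete))  # 3 of the 5 records survive; the rest are discarded
```

Records 1002 and 1004 disappear entirely, taking their valid Age and Origin values with them — the information loss the slide warns about.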
3.1 Data cleaning
Missing values – Fill with constant

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   0.0         Europe
1003         Married         42   1590        US
1004         Missing         12   0           France
1005         Divorced        44   1370        Japan

- Missing numeric values replaced with 0.0.
- Missing categorical values replaced with "Missing".
3.1 Data cleaning
Missing values – Fill with mean / mode

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   1082.5      Europe
1003         Married         42   1590        US
1004         Divorced        12   0           France
1005         Divorced        44   1370        Japan

- Missing numeric values replaced with the mean (1082.5).
- Missing categorical values replaced with the mode (Divorced).

We will return to this in Data Cleaning: Identify Misclassifications & Incoherencies.
3.1 Data cleaning
Missing values – Fill with median

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   1370        Europe
1003         Married         42   1590        US
1004         Missing         12   0           France
1005         Divorced        44   1370        Japan

- Mean: 1082.5 | Median: 1370
- The median is more robust to extreme values (such as the 0 income) than the mean.
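The constant, mean, median, and mode strategies above all map onto pandas `fillna`. A sketch using the same example table (reconstruction assumed):

```python
import pandas as pd

# Same example table as the slides; None marks missing values.
df = pd.DataFrame({
    "Marital_Status": ["Single", "Divorced", "Married", None, "Divorced"],
    "Income": [1370.0, None, 1590.0, 0.0, 1370.0],
})

# Fill with a constant.
const = df.fillna({"Income": 0.0, "Marital_Status": "Missing"})

# Fill numeric with mean / median, categorical with mode.
mean_filled = df["Income"].fillna(df["Income"].mean())      # mean = 1082.5
median_filled = df["Income"].fillna(df["Income"].median())  # median = 1370.0
mode_filled = df["Marital_Status"].fillna(df["Marital_Status"].mode()[0])

print(mean_filled.tolist())  # [1370.0, 1082.5, 1590.0, 0.0, 1370.0]
print(mode_filled.tolist())  # ['Single', 'Divorced', 'Married', 'Divorced', 'Divorced']
```

Note that the statistic is computed only over the observed values, exactly as on the slides.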
3.1 Data cleaning
Missing values – Fill with random value
- Missing values are replaced with a value drawn at random from the variable's observed distribution.
- E.g., record 1004's missing Marital Status filled with "Single".
[Histogram of the observed Income values (0, 1370, 1590) from which random replacements would be drawn.]
3.1 Data cleaning
Missing values – Apply a predictive model

KNN Imputer
- Match a point with its closest k neighbors in a multi-dimensional space.
- Suppose we use just Age to estimate Income:
  - The nearest neighbor of {36, Missing} is {42, 1590}.
  - We fill the missing value in Income with 1590.

Customer_ID  Marital Status  Age  Income (€)  Origin
1001         Single          27   1370        US
1002         Divorced        36   1590        Europe
1003         Married         42   1590        US
1004                         12   0           France
1005         Divorced        44   1370        Japan

Other predictive models can be used: Linear Regression, Decision Tree, Neural Network…
We can do the same to fill the missing value in Marital Status (after encoding the categorical data).
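The worked example above can be reproduced with scikit-learn's `KNNImputer`, which measures distance over the observed features only (here, effectively just Age):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Age and Income columns from the slide; np.nan marks the missing income.
X = np.array([
    [27, 1370.0],
    [36, np.nan],
    [42, 1590.0],
    [12, 0.0],
    [44, 1370.0],
])

# k=1: copy the value from the single nearest neighbour (Age 42 → 1590).
imputer = KNNImputer(n_neighbors=1)
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 1590.0
```

With k > 1, the imputed value would be the (optionally distance-weighted) average of the k nearest neighbours' incomes instead.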
3.1 Data cleaning
Misclassifications & incoherencies
Question: How much will a customer with an income of 2250 € spend in my store?
[Regression of MONETARY SPENT on INCOME; the fitted model predicts y = 140 €.]
3.1 Data cleaning
Outliers
Question: How much will a customer with an income of 2250 € spend in my store?
Why should we care?
[The same regression fitted with outliers present in the data now predicts y = 98 € — the outliers distort the fitted model.]
3.1 Data cleaning
Outliers
Question: How many distinct groups of customers do I have? What is the profile of each one?
Why should we care?
[Scatter plots of INCOME vs MONETARY SPENT. Ignoring the outliers, clustering suggests that customers with medium and high income spend around 150 €. Treating the outliers properly reveals three profiles: customers with medium income spend around 120 €, customers with high income spend around 170 €, and an outlying group of high-income customers spends around 80 €.]
3.2 Data transformation
Scaling

MinMax normalization
- Guarantees that all data points lie within a given range.
- If we want to change the range to vary between a and b:
    v' = a + ((v − min_x) / (max_x − min_x)) × (b − a)
- The most common range is [0, 1]:
    v' = (v − min_x) / (max_x − min_x)

Z-score
    v' = (v − μ_x) / σ_x

Why should we care?

ID  Age  Income
1   20   980
2   37   1230
3   42   1020
4   18   890
5   34   1290
6   55   1720
7   62   1650
8   21   1020

[Scatter plots of Age vs Income before and after MinMax scaling; the apparent groupings of the records change once both variables are on a comparable scale.]
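Both formulas are a few lines of NumPy on the slide's Age/Income table (scikit-learn's `MinMaxScaler` and `StandardScaler` do the same thing):

```python
import numpy as np

# Age and Income from the slide's table.
X = np.array([
    [20, 980], [37, 1230], [42, 1020], [18, 890],
    [34, 1290], [55, 1720], [62, 1650], [21, 1020],
], dtype=float)

# MinMax normalization to [0, 1]: v' = (v - min) / (max - min), per column.
mn, mx = X.min(axis=0), X.max(axis=0)
X_minmax = (X - mn) / (mx - mn)

# Z-score: v' = (v - mean) / std, per column.
X_z = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax[:, 0].min(), X_minmax[:, 0].max())  # 0.0 1.0
```

After MinMax scaling both Age and Income span [0, 1], so neither variable dominates distance-based methods such as K-Means.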
3.2 Data transformation
Power transforms
Power transforms (e.g., log, square root, Box-Cox) reduce skewness and make distributions more symmetric.
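As an illustration (the skewed income data below is synthetic, not from the slides), scikit-learn's `PowerTransformer` fits a Box-Cox or Yeo-Johnson transform and standardizes the result:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# A right-skewed variable, e.g. income (synthetic lognormal data).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=7, sigma=0.5, size=(500, 1))

# Box-Cox requires strictly positive values; the default Yeo-Johnson
# also handles zeros and negatives.
pt = PowerTransformer(method="box-cox")
x_t = pt.fit_transform(x)

# The transformed variable is roughly symmetric, zero mean, unit variance.
print(round(float(x_t.mean()), 6))  # ~0.0
```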
3.2 Data transformation
Dummy Variables / Encoding

Original data:
Animal   Target
Cat      1
Hamster  0
Cat      0
Cat      1
Dog      1
Hamster  1
Cat      0
Dog      1
Cat      0
Dog      0

Encoding each category with the probability of Target = 1:
Animal   Target 0  Target 1  P(Target = 1)
Cat      3         2         0.4
Dog      1         2         0.67
Hamster  1         1         0.5

Encoded data:
Encoded Animal  Target
0.4             1
0.5             0
0.4             0
0.4             1
0.67            1
0.5             1
0.4             0
0.67            1
0.4             0
0.67            0
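Both encodings are short in pandas: `get_dummies` produces one 0/1 column per category, and a groupby mean reproduces the probability (target) encoding from the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Cat", "Hamster", "Cat", "Cat", "Dog",
               "Hamster", "Cat", "Dog", "Cat", "Dog"],
    "Target": [1, 0, 0, 1, 1, 1, 0, 1, 0, 0],
})

# Dummy (one-hot) variables: one 0/1 column per category.
dummies = pd.get_dummies(df["Animal"])

# Target (probability) encoding: replace each category with P(Target = 1).
probs = df.groupby("Animal")["Target"].mean()
df["Encoded_Animal"] = df["Animal"].map(probs)

print(probs.round(2).to_dict())  # {'Cat': 0.4, 'Dog': 0.67, 'Hamster': 0.5}
```

In practice, target encoding should be fit on training data only, otherwise the target leaks into the features.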
3.2 Data transformation
Binning / Discretizing
Why? What can we do?
- Equal width: divides the numerical variable into k categories of equal width.
- Equal frequency: divides the numerical variable into k categories, each containing n/k records.
- K-Means clustering: uses a clustering algorithm, such as K-Means, to automatically calculate the "optimal" partitioning.
[Number lines from 0 to 42 illustrating the three partitioning strategies.]
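The first two strategies map onto pandas `cut` and `qcut`; the sample values below are illustrative, not from the slides:

```python
import pandas as pd

x = pd.Series([1, 3, 5, 8, 9, 12, 14, 20, 27, 41])  # illustrative values

# Equal width: 3 bins of equal width over the observed range.
equal_width = pd.cut(x, bins=3)

# Equal frequency: 3 bins with roughly n/k records each.
equal_freq = pd.qcut(x, q=3)

print(equal_width.value_counts().tolist())  # skewed counts per bin
print(equal_freq.value_counts().tolist())   # roughly balanced counts
```

For the third strategy, scikit-learn's `KBinsDiscretizer(n_bins=3, strategy="kmeans")` computes the partitioning with K-Means.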
3.2 Data transformation
Binning / Discretizing
Customers with fewer than four calls to customer service had a lower churn rate than customers who had four or more calls to customer service.
[Bar chart: churn vs no-churn percentage by number of customer service calls (0–9); churn rises sharply at four or more calls.]
Bin the customer service calls variable into two classes:
- Low (fewer than four)
- High (four or more)
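This two-class binning is one `pd.cut` call (the call counts below are illustrative):

```python
import pandas as pd

calls = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])  # customer service calls

# Two classes: Low (fewer than four calls) vs High (four or more).
service_level = pd.cut(calls, bins=[-1, 3, calls.max()], labels=["Low", "High"])

print(service_level.tolist())  # ['Low', 'Low', 'Low', 'Low', 'High', ...]
```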
3.2 Data transformation
Reclassify
Logistic Regression and Decision Trees (C4.5) perform suboptimally when confronted with categorical variables that have a large number of levels.
What can we do? Reclassify the variable into a smaller number of categories.
3.3 Data reduction
Remove records
- Duplicate records lead to overweighting of the duplicated data values.

Remove features
Multicollinearity
- A condition where some of the variables are correlated with each other.
- Leads to instability in the solution space, possibly resulting in incoherent results.
- Inclusion of highly correlated variables overemphasizes particular components of the model.

Too many variables
- Unnecessarily complicate the interpretation of the analysis.
- Violate the principle of parsimony: one should consider keeping the number of variables to a size that can be easily interpreted.
- In supervised problems, can lead to overfitting.
- Curse of dimensionality.
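Both reduction steps have simple pandas sketches: deduplicate rows, then flag highly correlated feature pairs (the toy data and the 0.95 threshold are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 2, 3],
    "b": [10, 20, 20, 30],  # perfectly correlated with "a"
    "c": [5, 1, 1, 7],
})

# Remove duplicate records so repeated rows are not overweighted.
df = df.drop_duplicates()

# Flag highly correlated feature pairs (possible multicollinearity).
corr = df.corr().abs()
high = [(i, j) for i in corr.columns for j in corr.columns
        if i < j and corr.loc[i, j] > 0.95]
print(high)  # [('a', 'b')]
```

One of each flagged pair would typically be dropped (or the pair combined, e.g. via PCA).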
3.3 Data reduction
Remove features
Curse of dimensionality
- High-dimensional data lead to sparse data: as the number of features increases, complexity increases and data analysis tasks become significantly harder.
Possible solutions:
- Principal component analysis (PCA) (out of scope)
[Cartoon: https://xkcd.com/721/]