Data Prep and Cleaning For Machine Learning

This document provides an overview of data preparation and cleaning for machine learning. It explains that data preparation is an important part of the machine learning process that must be done before building or training a model. Poorly prepared data can negatively impact even sophisticated algorithms. The document then presents an 8-step checklist for data preparation, covering issues like missing values, duplicate data, incorrect/irrelevant data, outliers, feature scaling, feature engineering, and validation splitting. Each step is then explained in more detail with examples and best practices.

Uploaded by

Shubham J


DATA PREP & CLEANING FOR MACHINE LEARNING

AN OVERVIEW

Why?
Data Preparation & Cleaning is an extremely important
part of the overall Machine Learning process, one that must
be considered before ever looking to build or train a model.

A common phrase in Machine Learning is...

"Garbage in...Garbage Out!"

If the data isn't clean or isn’t prepared in an appropriate way,
even the fanciest of algorithms or models will struggle to
learn.

On top of this, ensuring the data is clean can actually be one
of the biggest boosters of model performance and accuracy!

8-Step Checklist
This 8-Step Data Preparation & Cleaning checklist will ensure
you’re always giving your ML model the best chance to learn &
perform!

1. Missing Values
2. Duplicate & Low Variation Data
3. Incorrect & Irrelevant Data
4. Categorical Data
5. Outliers
6. Feature Scaling
7. Feature Engineering/Selection
8. Validation Split

Note: Which steps are applicable can depend on the data you’re
using, the problem you’re solving, and on the type of model you’re
applying! However, in the vast majority of cases you can’t go
wrong by at least considering these 8 steps!

1. Missing Values

In many cases, your chosen model or algorithm simply
won't know how to process missing values - and you will
be returned an error.

Even if the missing values don't lead to an error, you
always want to ensure that you are passing the model
the most useful information to learn from - and you
should consider whether missing values meet that criterion
or not.

The two most common approaches for dealing with
missing values:

Removal: Often you will simply remove any observations
(rows) where one or more missing values are present. You can
also remove entire columns if no information is present.

Imputation: This is where you input or "impute"
replacement values where they were originally missing.
This can be based upon the column mean, median, or
most common value, or on more advanced approaches that
take into account other present data points to give an
estimation of what the missing value might be!
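
The two approaches above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the rows, the column names (bedrooms, area_sqm), and the choice of the median as the fill value are all assumptions made for the example.

```python
from statistics import median

# Hypothetical house rows; None marks a missing value.
rows = [
    {"bedrooms": 3, "area_sqm": 120.0},
    {"bedrooms": None, "area_sqm": 95.0},
    {"bedrooms": 2, "area_sqm": None},
    {"bedrooms": 4, "area_sqm": 150.0},
]

# Removal: keep only the rows where every column has a value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]


def impute_median(data, column):
    """Imputation: fill missing entries in `column` with the median of the observed ones."""
    observed = [r[column] for r in data if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in data]


# Impute each column in turn; the original rows are left untouched.
imputed_rows = impute_median(impute_median(rows, "bedrooms"), "area_sqm")
```

Removal keeps the data honest but shrinks it (here from four rows to two), while imputation keeps every row at the cost of introducing estimated values.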
2. Duplicate & Low Variation Data

When looking through your data, don't just look for
missing values - keep an eye out for duplicate data, or
data that has low variation.

Duplicate data is most commonly rows of data that are
exactly the same across all columns. These duplicate rows
do not add anything to the learning process of the model
or algorithm, but do add storage & processing overhead.

In the vast majority of cases you can remove duplicate
rows prior to training your model.

Low variation data is where a column in your dataset
contains only one (or very few) unique value(s).

An example: There is a column in a house price dataset
called "property_type". Every row in this column has the
value "house". This column won’t add any value to the
learning process, so it can be removed.
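
As a rough sketch of both checks in plain Python: the rows below are hypothetical, with "property_type" taken from the example above and the other columns invented for illustration.

```python
# Hypothetical rows as tuples, in the column order given by `columns`.
columns = ("property_type", "bedrooms", "price")
rows = [
    ("house", 3, 250000),
    ("house", 2, 180000),
    ("house", 3, 250000),  # exact duplicate of the first row
    ("house", 4, 320000),
]

# Remove exact duplicates, keeping the first occurrence and the original order.
unique_rows = list(dict.fromkeys(rows))

# Count the distinct values in each column.
distinct_counts = {
    name: len({row[i] for row in unique_rows}) for i, name in enumerate(columns)
}

# A column with a single distinct value carries no information for the model.
constant_columns = [name for name, n in distinct_counts.items() if n == 1]

# Drop the constant columns from every row.
keep = [i for i, name in enumerate(columns) if name not in constant_columns]
reduced_rows = [tuple(row[i] for i in keep) for row in unique_rows]
```

Columns with "few" rather than one unique value need a judgement call, so this sketch only drops columns that are strictly constant.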
3. Incorrect & Irrelevant Data

Irrelevant data is anything that isn’t related specifically
to the problem you’re looking to solve. For example, if
you're predicting house prices, but your dataset contains
commercial properties as well - these would need to be
removed.

Incorrect data can be hard to spot! One approach is to
look for values that shouldn’t be possible, such as
negative house prices.

For categorical or text variables you should spend time
analysing the unique values within a column.

For example, say you were predicting car prices and had a
column in the dataset called "car_colour". Upon
inspection you find values of "Orange", "orange", "orang"
are all present. These need to be rectified prior to training,
otherwise the model will treat them as three different
colours and you will limit the learning that can take place!

Always explore your data thoroughly!
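
A minimal sketch of these checks, using the "car_colour" example. The records, the "Blue " entry with trailing whitespace, and the corrections mapping are assumptions for illustration; in practice the mapping would be built by hand after inspecting the unique values.

```python
# Hypothetical car records with inconsistent colour labels and an impossible price.
cars = [
    {"car_colour": "Orange", "price": 12000},
    {"car_colour": "orange", "price": 9500},
    {"car_colour": "orang", "price": 11000},
    {"car_colour": "Blue ", "price": -500},
]

# Assumed mapping from known misspellings to the intended label.
corrections = {"orang": "orange"}


def clean_colour(value):
    """Unify case and whitespace, then apply any known spelling corrections."""
    normalised = value.strip().lower()
    return corrections.get(normalised, normalised)


cleaned = [{**c, "car_colour": clean_colour(c["car_colour"])} for c in cars]

# Separate out values that should not be possible, such as negative prices.
impossible = [c for c in cleaned if c["price"] < 0]
valid = [c for c in cleaned if c["price"] >= 0]

# Re-inspect the unique values after cleaning.
unique_colours = sorted({c["car_colour"] for c in cleaned})
```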

4. Categorical Data

Generally speaking, ML models like being fed numerical
data. They are less fond of categorical data!

Categorical data is anything that is listed as groups,
classes, or text. A simple example would be a column for
gender which contains values of either "Male" or "Female".

Your model or algorithm won’t know how to assign any
numerical importance to these values, so you often want
to turn these groups or classes into numerical values.

A common approach is called One Hot Encoding, where
you create new columns, one for each unique class in your
categorical column. You fill these new columns with
values of 1 or 0 depending on which is true for each
observation.
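
One Hot Encoding can be sketched in a few lines of plain Python. The colour values and the "colour_" column prefix are assumptions chosen for the example.

```python
# Hypothetical categorical column.
colours = ["red", "blue", "green", "blue"]

# One new column per unique class, in a stable (sorted) order.
classes = sorted(set(colours))

# Each observation gets 1 in the column matching its class and 0 everywhere else.
one_hot = [
    {f"colour_{c}": int(value == c) for c in classes}
    for value in colours
]
```

Every encoded row contains exactly one 1, which is where the name "one hot" comes from.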

Other encoding techniques you can consider are: Label
Encoding, Binary Encoding, Target Encoding, Ordinal
Encoding, & Feature Hashing.
5. Outliers

There is no formal definition for an outlier. You can think
of them as any data point that is very different to the
majority.

How you deal with outliers is dependent on the problem
you are solving, and the model you are applying. For
example, if your data contained one value that was 1000x
any other, this could badly affect a Linear Regression
model which tries to generalise a rule across all
observations. A Decision Tree would be far less affected,
however, as its splits depend on the ordering of a
feature's values rather than on their magnitude.

In practice, outliers are commonly isolated using the
number of standard deviations from the mean, or a rule
based upon the interquartile range.

In cases where you want to mitigate the effects of outliers,
you may look to simply remove any observations (rows)
that contain outlier values in one or more of the columns,
or you may look to replace their values to reduce their
effect.
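
As an illustrative sketch of the interquartile range approach: the values are made up, and the 1.5 multiplier is a widely used convention rather than something the source specifies. Quartiles come from the standard library's statistics.quantiles with the "inclusive" method; other quartile definitions can give slightly different bounds.

```python
from statistics import quantiles

# Hypothetical column with one extreme value.
values = [10, 12, 11, 13, 12, 11, 14, 12, 1000]

# Quartiles and interquartile range (IQR).
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

# Option 1: remove the observations that contain outlier values.
removed = [v for v in values if lower <= v <= upper]

# Option 2: replace outlier values with the nearest bound to reduce their effect.
clipped = [min(max(v, lower), upper) for v in values]
```

Here the value 1000 is the only one flagged; removal drops it, while clipping pulls it back to the upper bound and keeps the row.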

Always remember - just because a value is very high, or
very low, that does not mean it is wrong to be included.
6. Feature Scaling

Feature Scaling is where you transform the values of each
numeric column in your data so that all columns exist on a
comparable scale. In certain scenarios it will help the
model assess the relationships between variables more
fairly, and more accurately.

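The two most common scaling techniques are:

Standardisation: rescales all values to have a mean of 0
and standard deviation of 1, by subtracting the column mean
and dividing by the column standard deviation. In other
words, the vast majority of your values end up between -4
and +4 (for roughly normally distributed data, almost all
fall between -3 and +3).

Normalisation: rescales data so that it exists in a range
between 0 and 1, by subtracting the column minimum and
dividing by the column range (maximum minus minimum).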
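
Both techniques can be sketched with the standard library. The values are hypothetical, and the population standard deviation (pstdev) is used here as an assumption; using the sample standard deviation instead changes the results only slightly.

```python
from statistics import mean, pstdev

# Hypothetical numeric column.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Standardisation: subtract the mean, divide by the standard deviation.
mu, sigma = mean(values), pstdev(values)
standardised = [(v - mu) / sigma for v in values]

# Normalisation: subtract the minimum, divide by the range.
lo, hi = min(values), max(values)
normalised = [(v - lo) / (hi - lo) for v in values]

# Note: a constant column has sigma == 0 and hi == lo, so both formulas
# would divide by zero; such columns should be handled (or dropped) first.
```

When a validation split is used, the mean, standard deviation, minimum and maximum would normally be computed on the training data only and then reused on the held-back data.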

Feature Scaling is essential for distance-based models
such as k-means or k-nearest-neighbours.

Feature Scaling is recommended for any algorithms that
are trained using Gradient Descent, such as Neural Networks,
and Linear Regression or Logistic Regression when they are
fitted this way.

Feature Scaling is not necessary for tree-based
algorithms such as Decision Trees & Random Forests.
7. Feature Engineering & Selection

Feature Engineering is the process of using further
knowledge to supplement or transform the original
feature set.

The key to good Feature Engineering is to create or refine
features that the algorithm or model can understand
better, or that it will find more useful than the raw features
for solving the particular problem at hand.
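
As a small, hypothetical illustration for a house price problem: the raw columns (year_built, year_sold, area_sqm, bedrooms) and the two derived features are assumptions chosen for the example, not features named by the source.

```python
# Hypothetical raw house records.
houses = [
    {"year_built": 1990, "year_sold": 2020, "area_sqm": 120.0, "bedrooms": 3},
    {"year_built": 2015, "year_sold": 2021, "area_sqm": 80.0, "bedrooms": 2},
]


def add_features(house):
    """Return a copy of the record with two derived features added."""
    return {
        **house,
        # Age at sale is often more directly meaningful than two raw years.
        "age_at_sale": house["year_sold"] - house["year_built"],
        # A ratio can capture spaciousness better than either column alone.
        "area_per_bedroom": house["area_sqm"] / house["bedrooms"],
    }


engineered = [add_features(h) for h in houses]
```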

Feature Selection is where you only keep a subset of the
most informative variables. This can be done using
human intuition, or dynamically based upon statistical
analysis.
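
One simple statistical approach, sketched below, is to keep only the features whose correlation with the target is strong enough. This is just one of many possible selection methods; the data, the feature names, and the 0.5 threshold are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical features and target (price in thousands).
features = {
    "area_sqm": [50, 80, 100, 120, 150],
    "door_number": [7, 3, 9, 1, 5],
}
price = [100, 160, 200, 240, 300]

# Assumed cut-off: keep features with |correlation| of at least 0.5.
threshold = 0.5
correlations = {name: pearson(col, price) for name, col in features.items()}
selected = [name for name, r in correlations.items() if abs(r) >= threshold]
```

Correlation only captures linear, one-feature-at-a-time relationships, so a feature dropped by this rule may still be useful in combination with others.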

A smaller feature set can lead to improved model
accuracy through reduced noise. It can mean a lower
computational cost, and improved processing speed. It
can also make your models easier to understand &
explain to stakeholders & customers!
8. Validation Split

The Validation Split is where you partition your data into a
training set and a validation set (and sometimes a
test set as well).

You train the model with the training set only. The
validation and/or test sets are held back from training and
are used to assess model performance. They provide a
realistic understanding of how accurate predictions are on
new or unseen data.
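
A minimal sketch of a random hold-out split in plain Python. The function name, the 20% validation fraction, and the fixed seed are assumptions for illustration.

```python
import random


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold back a fraction for validation."""
    shuffled = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split reproducible
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Hypothetical dataset of 10 observations.
data = list(range(10))
train, validation = train_validation_split(data)
```

Shuffling before splitting avoids accidentally holding back, say, only the most recent rows; for time-ordered data a chronological split may be the better choice.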

An approach called k-fold cross validation can provide
you with an even more robust understanding of model
performance.

Here, the entire dataset is again partitioned into training &
validation sets, and the model is trained and assessed like
before. However, this process is done multiple times, with
the training and validation sets being rotated to encompass
different sets of observations within the data. You do this
k times, so that each observation is used for validation
exactly once, with your final predictive accuracy assessment
being based upon the average across the k iterations.
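
The rotation can be sketched as follows. To keep the example self-contained, the "model" simply predicts the mean of the training targets and is scored with mean absolute error; both are stand-ins chosen for illustration, as are the data and k = 3. Real data would usually be shuffled before the folds are formed.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row is used for validation exactly once."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Hypothetical target values.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

fold_scores = []
for train_idx, val_idx in k_fold_indices(len(y), k=3):
    # Stand-in "model": predict the mean of the training targets.
    prediction = sum(y[i] for i in train_idx) / len(train_idx)
    # Stand-in metric: mean absolute error on the held-back fold.
    mae = sum(abs(y[i] - prediction) for i in val_idx) / len(val_idx)
    fold_scores.append(mae)

# Final assessment: the average across the k iterations.
average_score = sum(fold_scores) / len(fold_scores)
```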

Want to land an incredible
role in the exciting, future-
proof, and lucrative field of
Data Science?
LEARN THE
RIGHT SKILLS

A curriculum based on
input from hundreds of
leaders, hiring managers,
and recruiters

https://data-science-infinity.teachable.com
BUILD YOUR
PORTFOLIO

Create your professionally

made portfolio site that
includes 10 pre-built
projects

https://data-science-infinity.teachable.com
EARN THE
CERTIFICATION

Prove your skills with the

DSI Data Science
Professional Certification

https://data-science-infinity.teachable.com
LAND AN
AMAZING ROLE

Get guidance & support

based upon hundreds of
interviews at top tech
companies

https://data-science-infinity.teachable.com
Taught by former Amazon
& Sony PlayStation Data
Scientist Andrew Jones

What do DSI
students say?
"I had over 40 interviews without an offer.
After DSI I quickly got 7 offers including
one at KPMG and my amazing new role
at Deloitte!"
- Ritesh

"The best program I've been a part of,

hands down"
- Christian

"DSI is incredible - everything is taught in

such a clear and simple way, even the
more complex concepts!"
- Arianna

"I got it! Thank you so much for all your

advice & help with preparation - it truly
gave me the confidence to go in and
land the job!"
- Marta
"I've taken a number of Data Science
courses, and without doubt, DSI is the
best"
- William

"One of the best purchases towards

learning I have ever made"
- Scott

"I learned more than on any other

course, or reading entire books!"
- Erick

"I started a bootcamp last summer

through a well respected University, but I
didn't learn half as much from them!"
- GA
"100% worth it, it is amazing. I have
never seen such a good course and I
have done plenty of them!"
- Khatuna

"This is a world-class Data Science

experience. I would recommend this
course to every aspiring or professional
Data Scientist"
- David

"Andrew's guidance with my Resume &

throughout the interview process helped
me land my amazing new role (and at a
much higher salary than I expected!)"
- Barun

"DSI is a fantastic community & Andrew

is one of the best instructors!"

- Keith
"I'm now at University, and my Data
Science related subjects are a piece of
cake after completing this course!

I'm so glad I enrolled!" - Jose

"In addition to the great content,

Andrew's dedication to the growing DSI
community is amazing"
- Sophie

"The course has such high quality

content - you get your ROI even from the
first module"
- Donabel

"The Statistics 101 section was awesome!

I have now started to get confidence in
Statistics!"
- Shrikant
"I can't emphasise how good this
programme is...well worth the
investment!"
- Dejan

Come and join the

hundreds & hundreds of
other students getting the
results they want!

https://data-science-infinity.teachable.com
