
Data preprocessing & processing

Dr. Souand P. G. TAHI & Dr. Ariane C. HOUETOHOSSOU


Laboratoire de Biomathématiques et Estimations Forestières
University of Abomey-Calavi
souandtahi@gmail.com / harianecalmet@gmail.com

April 13, 2025

Course objective

• Understand the importance of data quality,

• Master data preprocessing and processing techniques,

• Apply these techniques using specialized tools.

Learning expectations

At the end of this course, participants should be able to:

• Acquire practical skills in data handling and preparation,

• Understand best practices in data preprocessing to enhance machine learning model performance,

• Be able to identify and address data quality issues in real-world contexts.

Outline

1. Concept clarification
2. Data profiling
3. Data cleansing
4. Data reduction
5. Data transformation
6. Data enrichment
7. Data validation
8. Application in Python

Concept clarification

Preprocessing: Definition and Importance

Definition

• Data preprocessing is the transformation of raw data into a format that is more suitable
and meaningful for analysis and model training.

• Data preprocessing plays a vital role in improving the quality and efficiency of ML models
by addressing issues such as missing values, noise, inconsistencies, and outliers in the
data.

Preprocessing importance

• Reduced processing power and time required to train a new ML or AI algorithm or to perform inference against it.

Key data preprocessing steps?

What is data profiling?

• Data profiling is the process of examining, analyzing, and creating useful summaries of
data.

• The process yields a high-level overview, which aids in the discovery of data quality
issues, risks, and overall trends.

• It produces critical insights into the data that companies can leverage to their advantage.
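
As a rough illustration, a first profiling pass in Python often starts with a few pandas calls, as in the sketch below; the file name customers.csv is a hypothetical placeholder.

```python
# Minimal profiling sketch with pandas; "customers.csv" is hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each variable
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
print(df.nunique())     # distinct values per column, hints at duplicates/categories
```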

How to do it

Benefits of data profiling

What is data cleansing?

• Data cleaning is the process of finding and rectifying quality issues, such as eliminating
bad data, filling in missing data, and otherwise ensuring that the raw data are suitable
for feature engineering.

• It is the process of detecting and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database. It involves identifying incomplete, incorrect,
inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the
dirty or coarse data.

Key steps to data cleaning.

• Safely store data: Before making any changes, make sure the original raw data is stored
safely with a good backup strategy in place.
• Tidy dataset:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
• Clean data
1. Accuracy: check how accurately a data value describes the object or event being
described. Syntactic accuracy is violated when a value does not belong to the correct
domain of the variable, e.g., a negative value for age. Semantic accuracy is violated when
the value is in the correct domain but is not accurate, e.g., the attribute gender is given
the value “female” for a person called John Smith.

Key steps to data cleaning.

2. Consistency: checks whether all the values of a variable follow the same definition,
e.g., distance is recorded in the same unit throughout the entire data set.
3. Completeness: checks how complete the dataset is with respect to variable values and/or
records.
- Variable values: are there values missing for certain variables?
- Records: check whether the dataset is complete for the analysis at hand, e.g., you set out
to survey 1,000 households but only have 900 completed.
4. Uniqueness: checks for the existence of any duplicates in the data.
5. Handling outliers: check for the existence of data points that differ significantly from other
observations. A pandas sketch of some of these checks follows below.
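
In the sketch below, the column names and the age domain [0, 120] are illustrative assumptions.

```python
# Minimal sketch of basic cleaning checks; the table, column names,
# and the age domain [0, 120] are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 51, 51],
    "distance_km": [1.2, 0.8, 3.5, 3.5],
    "gender": ["female", "male", "male", "male"],
})

# Accuracy: flag values outside the variable's valid domain.
print(df[~df["age"].between(0, 120)])

# Completeness: count missing values per variable.
print(df.isna().sum())

# Uniqueness: count, then drop, duplicate records.
print(df.duplicated().sum())
df = df.drop_duplicates()
```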

Handling Missing Values
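
By way of illustration, three common strategies, deletion, mean imputation, and forward fill, sketched in pandas with hypothetical column names:

```python
# Sketch of three common missing-value strategies; the table and
# column names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34.0, np.nan, 51.0],
                   "income": [2500.0, 3100.0, np.nan]})

dropped = df.dropna()                                # delete rows with any missing value
mean_filled = df.fillna(df.mean(numeric_only=True))  # impute with column means
forward_filled = df.ffill()                          # carry the last observed value forward

print(dropped, mean_filled, forward_filled, sep="\n\n")
```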

Ways of detecting outliers

• Sort quantitative variables: sort values from low to high and scan for extremely low or
extremely high values.
• Visualizations: plot the data using a box plot (box-and-whisker plot), so that you can see
the data distribution at a glance.
• Statistical outlier detection: convert extreme data points into z-scores that tell you how
many standard deviations they are from the mean. Values with a z-score greater than 3
or less than -3 are often determined to be outliers.
• Using the interquartile range: use the IQR to create “fences” around your data and
then define outliers as any values that fall outside those fences.
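
The last two rules are easy to sketch in pandas; the series below and the relaxed z-score cutoff are purely illustrative.

```python
# Sketch of z-score and IQR outlier detection on one numeric series.
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is the obvious outlier

# Z-score rule: the usual cutoff is |z| > 3, but with only six points the
# maximum attainable |z| is about 2, so a cutoff of 2 is used here purely
# for illustration.
z = (s - s.mean()) / s.std()
print(s[z.abs() > 2])

# IQR rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])
```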

What is data reduction?

• Data reduction is the process in which an organization sets out to limit the amount of
data it’s storing.

• Raw data sets often include redundant data that comes from characterizing phenomena
in different ways or data that isn’t relevant to a particular ML, AI or analytics task.

• Data reduction techniques, such as principal component analysis (PCA), transform raw
data into a simpler form suitable for specific use cases.
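
A minimal PCA sketch with scikit-learn, assuming a purely synthetic feature matrix:

```python
# Minimal PCA sketch; the data matrix is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 synthetic features

X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=2)                     # keep the 2 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```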

Types of data reduction

What is data transformation?

• Data transformation is the process of converting, cleansing, and structuring data into a
usable format that can be analyzed to support decision making processes, and to propel
the growth of an organization.

• This technique converts the raw data into the required format so that subsequent data
processing and modeling procedures can be performed efficiently.

• Data transformation is used when data needs to be converted to match that of the
destination system.
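
Two common transformations, min-max scaling and one-hot encoding, sketched below; the table and column names are hypothetical.

```python
# Sketch of two common data transformations; column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [2500.0, 3100.0, 4800.0],
                   "city": ["Cotonou", "Porto-Novo", "Cotonou"]})

# Rescale the numeric variable to [0, 1].
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# One-hot encode the categorical variable.
df = pd.get_dummies(df, columns=["city"])

print(df)
```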

Types of data transformation

What is data enrichment?

• Data enrichment is the process of enhancing the value of existing data with additional
information.

• Enrichment often uses reliable third-party data sources to add more information to
customer contact information or other data.

• The goal of data enrichment is to improve the value of the data by providing you with
more insight into customers.

• It is especially useful for business data such as address, property, and location data.

Steps to data enrichment

• Data Assessment: identify the types and sources of data your organization possesses.
• Identify Data Sources: identify internal or external data sources that can supplement
your current information.
• Data Cleansing: make sure the data is cleaned to a high standard of quality.
• Data Integration: incorporate the identified external data sources into your existing
datasets.
• Validation and quality assurance: apply post-integration checks and strict validation
processes, as sketched below.
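
The integration and validation steps might look like the following sketch, where the customer table, the external region lookup, and the join key city are all hypothetical.

```python
# Sketch of enrichment via a left join with an external lookup table;
# all tables and keys are hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "city": ["Cotonou", "Parakou", "Cotonou"]})
regions = pd.DataFrame({"city": ["Cotonou", "Parakou"],
                        "region": ["Littoral", "Borgou"]})

enriched = customers.merge(regions, on="city", how="left")  # keep all customers
print(enriched)

# Validation / quality assurance: every record should have found a region.
assert enriched["region"].notna().all()
```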

What is data validation?
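
A rule-based validation pass can be as simple as the sketch below; every rule and column name is an illustrative assumption.

```python
# Minimal rule-based validation sketch; rules and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [34, 51], "email": ["a@x.com", "b@y.org"]})

checks = {
    "age in [0, 120]": df["age"].between(0, 120).all(),
    "email contains '@'": df["email"].str.contains("@").all(),
    "no duplicate rows": not df.duplicated().any(),
}
for rule, passed in checks.items():
    print(f"{rule}: {'OK' if passed else 'FAILED'}")
```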

Application in Python
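
As a starting point, the sketch below chains several of the course's steps, cleansing, imputation, and transformation, on a synthetic table; every name and threshold is illustrative.

```python
# End-to-end mini pipeline tying several course steps together;
# the table and all thresholds are synthetic illustrations.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 51.0, -2.0],
    "income": [2500.0, 3100.0, 4800.0, 4800.0, 2900.0],
})

df = df.drop_duplicates()                                     # cleansing: uniqueness
df = df[df["age"].isna() | df["age"].between(0, 120)].copy()  # cleansing: accuracy
df["age"] = df["age"].fillna(df["age"].median())              # cleansing: completeness
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])  # transformation

print(df)
```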

