
Data preprocessing & processing

Dr. Souand P. G. TAHI & Dr. Ariane C. HOUETOHOSSOU


Laboratoire de Biomathématiques et Estimations Forestières
University of Abomey-Calavi
souandtahi@gmail.com / harianecalmet@gmail.com

April 13, 2025

Course objective

• Understand the importance of data quality,

• Master data preprocessing and processing techniques,

• Apply these techniques using specialized tools.

Learning expectations

At the end of this course, participants should be able to:

• Acquire practical skills in data handling and preparation,

• Understand best practices in data preprocessing to enhance machine learning model performance,

• Be able to identify and address data quality issues in real-world contexts.

Outline

1. Concept clarification
2. Data profiling
3. Data cleansing
4. Data reduction
5. Data transformation
6. Data enrichment
7. Data validation
8. Application in Python

Concept clarification

Preprocessing: Definition and Importance

Definition

• Data preprocessing is the transformation of raw data into a format that is more suitable
and meaningful for analysis and model training.

• Data preprocessing plays a vital role in improving the quality and efficiency of ML models
by addressing issues such as missing values, noise, inconsistencies, and outliers in the
data.

Preprocessing importance

• Reduced processing power and time required to train a new ML or AI algorithm or to perform inference against it.

Key data preprocessing steps?

What is data profiling?

• Data profiling is the process of examining, analyzing, and creating useful summaries of
data.

• The process yields a high-level overview, which aids in the discovery of data quality
issues, risks, and overall trends.

• It produces critical insights into the data that companies can leverage to their advantage.
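
As a rough illustration, a first profiling pass in Python often starts with a few pandas calls, as in the sketch below; the file name customers.csv is a hypothetical placeholder.

```python
# Minimal profiling sketch with pandas; "customers.csv" is hypothetical.
import pandas as pd

df = pd.read_csv("customers.csv")

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each variable
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
print(df.nunique())     # distinct values per column, hints at duplicates/categories
```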

How to do it

Benefits of data profiling

What is data cleansing?

• Data cleaning is the process of finding and rectifying quality issues, such as eliminating
bad data, filling in missing data, and otherwise ensuring that the raw data are suitable
for feature engineering.

• It is the process of detecting and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database. It involves identifying incomplete, incorrect,
inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the
dirty or coarse data.

Key steps to data cleaning.

• Safely store data: Before making any changes, make sure the original raw data is stored
safely with a good backup strategy in place.
• Tidy dataset:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
• Clean data
1. Accuracy: check how accurately a data value describes the object or event being
described. Syntactic accuracy is violated when a value does not belong to the correct
domain of the variable, e.g., a negative value for age. Semantic accuracy is violated when
the value is in the correct domain but is not accurate, e.g., the attribute gender is given
the value “female” for a person called John Smith.

Key steps to data cleaning.

2. Consistency: checks whether all the values of a variable follow the same definition,
e.g., distance is recorded in the same unit throughout the entire data set.
3. Completeness: checks how complete the dataset is with respect to variable values and/or
records.
- Variable values: are there values missing for certain variables?
- Records: check whether the dataset is complete for the analysis at hand, e.g., you set out
to survey 1,000 households but only have 900 completed.
4. Uniqueness: checks for the existence of any duplicates in the data.
5. Handling outliers: check for the existence of data points that differ significantly from other
observations. A pandas sketch of some of these checks follows below.
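
In the sketch below, the column names and the age domain [0, 120] are illustrative assumptions.

```python
# Minimal sketch of basic cleaning checks; the table, column names,
# and the age domain [0, 120] are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 51, 51],
    "distance_km": [1.2, 0.8, 3.5, 3.5],
    "gender": ["female", "male", "male", "male"],
})

# Accuracy: flag values outside the variable's valid domain.
print(df[~df["age"].between(0, 120)])

# Completeness: count missing values per variable.
print(df.isna().sum())

# Uniqueness: count, then drop, duplicate records.
print(df.duplicated().sum())
df = df.drop_duplicates()
```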

Handling Missing Values
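
By way of illustration, three common strategies, deletion, mean imputation, and forward fill, sketched in pandas with hypothetical column names:

```python
# Sketch of three common missing-value strategies; the table and
# column names are illustrative.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34.0, np.nan, 51.0],
                   "income": [2500.0, 3100.0, np.nan]})

dropped = df.dropna()                                # delete rows with any missing value
mean_filled = df.fillna(df.mean(numeric_only=True))  # impute with column means
forward_filled = df.ffill()                          # carry the last observed value forward

print(dropped, mean_filled, forward_filled, sep="\n\n")
```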

Ways of detecting outliers

• Sort quantitative variables: sort values from low to high and scan for extremely low or
extremely high values.
• Visualizations: plot the data using a box plot (box-and-whisker plot), so that you can see
the data distribution at a glance.
• Statistical outlier detection: convert extreme data points into z-scores that tell you how
many standard deviations they are from the mean. Values with a z-score greater than 3
or less than -3 are often determined to be outliers.
• Using the interquartile range: use the IQR to create “fences” around your data and
then define outliers as any values that fall outside those fences.
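
The last two rules are easy to sketch in pandas; the series below and the relaxed z-score cutoff are purely illustrative.

```python
# Sketch of z-score and IQR outlier detection on one numeric series.
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 95])  # 95 is the obvious outlier

# Z-score rule: the usual cutoff is |z| > 3, but with only six points the
# maximum attainable |z| is about 2, so a cutoff of 2 is used here purely
# for illustration.
z = (s - s.mean()) / s.std()
print(s[z.abs() > 2])

# IQR rule: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
print(s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)])
```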

What is data reduction?

• Data reduction is the process in which an organization sets out to limit the amount of
data it’s storing.

• Raw data sets often include redundant data that comes from characterizing phenomena
in different ways or data that isn’t relevant to a particular ML, AI or analytics task.

• Data reduction techniques, such as principal component analysis (PCA), transform raw
data into a simpler form suitable for specific use cases.
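
A minimal PCA sketch with scikit-learn, assuming a purely synthetic feature matrix:

```python
# Minimal PCA sketch; the data matrix is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 synthetic features

X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=2)                     # keep the 2 strongest components
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```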

Types of data reduction

What is data transformation?

• Data transformation is the process of converting, cleansing, and structuring data into a
usable format that can be analyzed to support decision making processes, and to propel
the growth of an organization.

• This technique converts the raw data into the required format so that subsequent data
processing and modeling procedures can be performed efficiently.

• Data transformation is used when data needs to be converted to match that of the
destination system.
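
Two common transformations, min-max scaling and one-hot encoding, sketched below; the table and column names are hypothetical.

```python
# Sketch of two common data transformations; column names are hypothetical.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"income": [2500.0, 3100.0, 4800.0],
                   "city": ["Cotonou", "Porto-Novo", "Cotonou"]})

# Rescale the numeric variable to [0, 1].
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# One-hot encode the categorical variable.
df = pd.get_dummies(df, columns=["city"])

print(df)
```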

Types of data transformation

What is data enrichment?

• Data enrichment is the process of enhancing the value of existing data with additional
information.

• Enrichment often uses reliable third-party data sources to add more information to
customer contact information or other data.

• The goal of data enrichment is to improve the value of the data by providing you with
more insight into customers.

• It is especially useful for business data such as address, property, and location data.

Steps to data enrichment

• Data Assessment: identify the types and sources of data your organization possesses.
• Identify Data Sources: identify internal or external data sources that can supplement
your current information.
• Data Cleansing: make sure the data is cleaned to a high standard of quality.
• Data Integration: incorporate the identified external data sources into your existing
datasets.
• Validation and quality assurance: apply post-integration checks and strict validation
processes, as sketched below.
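
The integration and validation steps might look like the following sketch, where the customer table, the external region lookup, and the join key city are all hypothetical.

```python
# Sketch of enrichment via a left join with an external lookup table;
# all tables and keys are hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "city": ["Cotonou", "Parakou", "Cotonou"]})
regions = pd.DataFrame({"city": ["Cotonou", "Parakou"],
                        "region": ["Littoral", "Borgou"]})

enriched = customers.merge(regions, on="city", how="left")  # keep all customers
print(enriched)

# Validation / quality assurance: every record should have found a region.
assert enriched["region"].notna().all()
```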

What is data validation?
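
A rule-based validation pass can be as simple as the sketch below; every rule and column name is an illustrative assumption.

```python
# Minimal rule-based validation sketch; rules and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({"age": [34, 51], "email": ["a@x.com", "b@y.org"]})

checks = {
    "age in [0, 120]": df["age"].between(0, 120).all(),
    "email contains '@'": df["email"].str.contains("@").all(),
    "no duplicate rows": not df.duplicated().any(),
}
for rule, passed in checks.items():
    print(f"{rule}: {'OK' if passed else 'FAILED'}")
```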

Application in Python
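
As a starting point, the sketch below chains several of the course's steps, cleansing, imputation, and transformation, on a synthetic table; every name and threshold is illustrative.

```python
# End-to-end mini pipeline tying several course steps together;
# the table and all thresholds are synthetic illustrations.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0, 51.0, -2.0],
    "income": [2500.0, 3100.0, 4800.0, 4800.0, 2900.0],
})

df = df.drop_duplicates()                                     # cleansing: uniqueness
df = df[df["age"].isna() | df["age"].between(0, 120)].copy()  # cleansing: accuracy
df["age"] = df["age"].fillna(df["age"].median())              # cleansing: completeness
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])  # transformation

print(df)
```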

