Data preprocessing & processing
Dr. Souand P. G. TAHI & Dr. Ariane C. HOUETOHOSSOU
Laboratoire de Biomathématiques et Estimations Forestières
University of Abomey-Calavi
souandtahi@gmail.com / harianecalmet@gmail.com
April 13, 2025
Course objective
• Understand the importance of data quality,
• Master data preprocessing and processing techniques,
• Apply these techniques using specialized tools.
Learning expectations
At the end of this course, participants should be able to:
• Acquire practical skills in data handling and preparation,
• Understand best practices in data preprocessing to enhance machine learning model
performance,
• Identify and address data quality issues in real-world contexts.
Outline
1. Concept clarification
2. Data profiling
3. Data cleansing
4. Data Reduction
5. Data transformation
6. Data enrichment
7. Data validation
8. Application in Python
Concept clarification
Preprocessing: Definition and Importance
Definition
• Data preprocessing is the transformation of raw data into a format that is more suitable
and meaningful for analysis and model training.
• Data preprocessing plays a vital role in improving the quality and efficiency of ML models
by addressing issues such as missing values, noise, inconsistencies, and outliers in the
data.
Preprocessing importance
• Reduces the processing power and time required to train a new ML or AI algorithm or to
perform inference against it.
Key data preprocessing steps?
What is data profiling?
• Data profiling is the process of examining, analyzing, and creating useful summaries of
data.
• The process yields a high-level overview, which aids in the discovery of data quality
issues, risks, and overall trends.
• Produces critical insights into data that companies can leverage to their advantage.
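A minimal sketch of such a summary with pandas (the file name and columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical dataset; file name and columns are assumed for illustration
df = pd.read_csv("households.csv")

# Structural overview: column names, dtypes, non-null counts
df.info()

# Summary statistics for numeric and categorical variables
print(df.describe(include="all"))

# Common quality indicators: missing values and duplicate rows
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())
```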
How to do it?
Benefits of data profiling
What is data cleansing?
• Data cleaning is the process of finding the most effective way to rectify quality issues, such
as eliminating bad data, filling in missing data, and otherwise ensuring that the raw data are
suitable for feature engineering.
• It is the process of detecting and correcting (or removing) corrupt or inaccurate records
from a record set, table, or database: identifying incomplete, incorrect, inaccurate, or
irrelevant parts of the data, then replacing, modifying, or deleting the dirty or coarse data
(see the sketch below).
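A minimal sketch of this detect-and-correct loop with pandas (the columns and the email rule are assumptions for illustration):

```python
import pandas as pd

# Hypothetical records; the rules below are assumptions for illustration
df = pd.DataFrame({
    "email": ["a@x.com", "not-an-email", "b@y.org", "a@x.com"],
    "amount": [10.5, 99.0, None, 10.5],
})

# Detect corrupt records: malformed email addresses
corrupt = ~df["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)

# Remove corrupt rows, fill in missing amounts, drop exact duplicates
df = df[~corrupt].copy()
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.drop_duplicates()
print(df)
```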
Key steps to data cleaning
• Safely store data: Before making any changes, make sure the original raw data is stored
safely with a good backup strategy in place.
• Tidy dataset:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table
• Clean data
1. Accuracy: check how accurately the data value describes the object or event being
described.
Example: syntactic accuracy is violated when a value does not belong to the correct domain
of the variable, e.g. a negative value for age. Semantic accuracy is violated when the value is
in the correct domain but is not accurate, e.g. the attribute gender is given the value
“female” for a person called John Smith (see the sketch below).
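A minimal sketch of both accuracy checks with pandas (the data and the reference table are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data illustrating both kinds of accuracy violation
df = pd.DataFrame({
    "name": ["John Smith", "Ada Lovelace", "Jean Dupont"],
    "age": [34, -5, 28],                      # -5 violates syntactic accuracy
    "gender": ["female", "female", "male"],   # row 0 may violate semantic accuracy
})

# Syntactic accuracy: flag values outside the variable's valid domain
print(df[df["age"] < 0])

# Semantic accuracy usually needs an external reference; here a
# hypothetical trusted lookup table is used to cross-check gender
reference = pd.DataFrame({"name": ["John Smith"], "gender": ["male"]})
merged = df.merge(reference, on="name", suffixes=("", "_ref"))
print(merged[merged["gender"] != merged["gender_ref"]])
```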
Key steps to data cleaning
2. Consistency: checks whether all the values of a variable follow the same definition,
e.g. distance is recorded in the same unit throughout the entire data set.
3. Completeness: checks how complete the dataset is with respect to variable values and/or
records.
- Variable values: are there values missing for certain variables?
- Records: check whether the dataset is complete for the analysis at hand, e.g. you set out to
survey 1,000 households but only have 900 completed.
4. Uniqueness: checks for the existence of any duplicates in the data.
5. Handling outliers: checks for the existence of data points that differ significantly from
other observations (see the sketch below).
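A minimal sketch of these checks with pandas (the file, columns, and the 1,000-household target are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical survey data

# Consistency: record distance in one unit throughout
# (assumes 'unit' holds 'km' or 'm' and 'distance' the measurement)
df.loc[df["unit"] == "m", "distance"] /= 1000
df["unit"] = "km"

# Completeness: missing values per variable, and records vs. the target
print(df.isna().sum())
print(f"records: {len(df)} / 1000 planned households")

# Uniqueness: detect duplicate records
print("duplicates:", df.duplicated().sum())
```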
Handling Missing Values
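A minimal sketch of three common strategies, assuming a small pandas DataFrame with gaps:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, None, 40, 31],
                   "income": [1200, 900, None, 1500]})

# Option 1: drop rows with any missing value (simple, but loses data)
dropped = df.dropna()

# Option 2: fill with a summary statistic; the median resists outliers
filled = df.fillna(df.median(numeric_only=True))

# Option 3: a scikit-learn imputer, reusable inside an ML pipeline
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```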
Ways of calculating outliers
• Sorting: sort quantitative variables from low to high and scan for extremely low or
extremely high values.
• Visualizations: plot the data using a box plot (box-and-whisker plot) so that you can see
the distribution at a glance.
• Statistical outlier detection: convert extreme data points into z-scores that tell you how
many standard deviations they are from the mean; values with a z-score greater than 3
or less than -3 are often considered outliers.
• Using the interquartile range: use the IQR to create “fences” around your data and
then define outliers as any values that fall outside those fences (both methods are sketched below).
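A minimal sketch of the z-score and IQR methods with NumPy (the data are synthetic, generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, size=200), 120.0)  # inject one extreme point

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("z-score outliers:", data[np.abs(z) > 3])

# IQR method: fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", data[(data < lower) | (data > upper)])
```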
What is data reduction?
• Data reduction is the process in which an organization sets out to limit the amount of
data it’s storing.
• Raw data sets often include redundant data that comes from characterizing phenomena
in different ways or data that isn’t relevant to a particular ML, AI or analytics task.
• Data reduction techniques, such as principal component analysis, transform raw data into
a simpler form suitable for specific use cases.
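A minimal sketch of one such technique, principal component analysis, with scikit-learn (the data are synthetic, generated for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # hypothetical: 100 samples, 10 features

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Keep as many components as needed to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```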
Types of data reduction
What is data transformation?
• Data transformation is the process of converting, cleansing, and structuring data into a
usable format that can be analyzed to support decision making processes, and to propel
the growth of an organization.
• This technique converts raw data into the format required to carry out subsequent data
processing and modeling procedures efficiently.
• Data transformation is used when data needs to be converted to match that of the
destination system.
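A minimal sketch of two common transformations, scaling a numeric variable and one-hot encoding a categorical one, with pandas and scikit-learn (the columns are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "income": [1200.0, 900.0, 2500.0, 1500.0],      # numeric variable
    "region": ["north", "south", "south", "east"],  # categorical variable
})

# Scale the numeric variable into the [0, 1] range
df["income_scaled"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()

# One-hot encode the categorical variable
df = pd.get_dummies(df, columns=["region"])
print(df)
```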
Types of data transformation
What is data enrichment?
• Data enrichment is the process of enhancing the value of existing data with additional
information.
• Enrichment often uses reliable third-party data sources to add more information to
customer contact information or other data.
• The goal of data enrichment is to improve the value of the data by providing you with
more insight into customers.
• It is particularly useful for business data, as well as address, property, and location data.
Steps to data enrichment?
• Data Assessment: identifying the types and sources of data your organization
possesses.
• Identify Data Sources: identify internal or external data sources that can supplement
your current information.
• Data Cleansing: make sure the data is cleaned to a high quality standard.
• Data Integration: incorporate the identified external data sources into your existing
datasets (see the sketch below).
• Validation and quality assurance: apply strict post-integration validation
processes.
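A minimal sketch of the integration step, joining existing customer records with a hypothetical external source using pandas (all values are illustrative placeholders):

```python
import pandas as pd

# Existing customer data (hypothetical)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["Cotonou", "Porto-Novo", "Parakou"],
})

# External source adding context (illustrative placeholder values)
external = pd.DataFrame({
    "city": ["Cotonou", "Porto-Novo"],
    "population": [680_000, 265_000],
})

# Enrich via a left join so that no existing customer records are lost
enriched = customers.merge(external, on="city", how="left")
print(enriched)
```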
What is data validation?
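A minimal sketch of rule-based validation with plain pandas assertions (the rules and columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical cleaned dataset; the rules are assumptions for illustration
df = pd.DataFrame({"age": [25, 34, 41],
                   "gender": ["male", "female", "female"]})

# Each assertion encodes an expectation the prepared data must satisfy
assert df["age"].between(0, 120).all(), "age out of valid range"
assert df["gender"].isin(["male", "female"]).all(), "unknown gender value"
assert not df.duplicated().any(), "duplicate records remain"
assert df.notna().all().all(), "missing values remain"
print("all validation checks passed")
```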
Application in Python
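A minimal end-to-end sketch tying the preceding steps together (the file name, columns, and thresholds are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 1. Load and profile (file name is a placeholder)
df = pd.read_csv("households.csv")
df.info()

# 2. Clean: drop duplicates, impute missing numeric values with the median
df = df.drop_duplicates()
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# 3. Reduce the impact of outliers using IQR fences
q1, q3 = df[num_cols].quantile(0.25), df[num_cols].quantile(0.75)
iqr = q3 - q1
inside = ~((df[num_cols] < q1 - 1.5 * iqr) |
           (df[num_cols] > q3 + 1.5 * iqr)).any(axis=1)
df = df[inside].copy()

# 4. Transform: standardize the numeric features
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 5. Validate the result
assert not df.duplicated().any() and df[num_cols].notna().all().all()
```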