
Data Cleaning and Preprocessing
Agenda
• Identifying missing values in datasets; techniques for handling missing values: deletion of missing values, mean/median imputation
• Identifying and removing duplicate records
• Outlier detection methods: Z-score method
• Data transformation and scaling: standardization, min-max scaling
• Handling categorical data: one-hot encoding
Materials:

1. Gopinath Rebala, Ajay Ravi, Sanjay Churiwala, An Introduction to Machine Learning, Springer, 2019
2. Miroslav Kubat, An Introduction to Machine Learning (2e), Springer, 2017
3. Ethem Alpaydin, Introduction to Machine Learning (2e), MIT Press, 2010
4. Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of
Machine Learning, MIT Press, 2012
Machine Learning Pipeline

A pipeline is essentially a series of connected stages or processes where the output of one stage becomes the input for the next.
ML Pipeline
• Data collection: Identify and collect the necessary data from various sources.
• Data Preparation: Clean, preprocess, and transform the data to make it suitable for modeling.
• Data Segregation: Split the data into training, validation, and testing sets to evaluate the model's performance accurately.
• Model Training: Train the machine learning model on the training data to learn patterns and relationships.
• Model Evaluation: Evaluate the performance of multiple candidate models on the validation set and select the best-performing one.
• Model Deployment: Deploy the chosen model into a production environment to make real-time predictions.
• Performance Monitoring: Continuously monitor model performance, retrain, and calibrate accordingly.
Where to collect Data?
• Kaggle - https://www.kaggle.com/
• UCI - https://archive.ics.uci.edu/
• Google Dataset Search - https://datasetsearch.research.google.com/
What is Data Cleaning and Preprocessing?

• Data cleaning involves correcting or removing incorrect, corrupted, improperly formatted, duplicate, or incomplete data.

• Data preprocessing transforms raw data into an understandable format. This can include normalization, encoding categorical variables, handling missing values, and more.
Why Data Cleaning and Preprocessing?
The Challenge of Raw Data
• Real-world data is often dirty, incomplete, inconsistent, and noisy.
• Examples of data issues: missing values, outliers, duplicates, incorrect data types, etc.

Consequences of Ignoring Data Cleaning
• Inaccurate models and predictions
• Biased results
• Wasted time and resources
• Loss of credibility
Steps in Data Cleaning and Preprocessing

• Handling Missing Values: Strategies include removal, mean/mode/median imputation, or using algorithms that support missing values.

• Removing Duplicates: Ensures each record is unique to avoid bias and redundancy.

• Correcting Errors: Fixes incorrect data entries, such as typos or outliers.

• Normalization and Standardization: Ensures that data follows a consistent scale.

• Encoding Categorical Variables: Converts categorical data into numerical format for algorithms that require numerical input.
Real Life Examples
1. Healthcare Data:
• Scenario: Predicting patient readmission rates using hospital records.
• Issue: Missing values in patient records for fields like age, diagnosis, or treatment details.
• Solution: Use mean or median imputation for numerical fields and mode imputation for categorical fields, ensuring no critical data is lost.
2. Retail Data:
• Scenario: Analyzing customer purchasing behavior from transaction data.
• Issue: Duplicate records due to multiple entries of the same purchase.
• Solution: Remove duplicates to ensure each transaction is unique, providing accurate insights into customer behavior.
3. Financial Data:
• Scenario: Fraud detection in credit card transactions.
• Issue: Inconsistent data entries with different formats for dates and amounts.
• Solution: Standardize the format of dates and amounts, ensuring consistency across all records for accurate analysis.
I. Identifying and Handling Missing Values

Missing data can occur due to various reasons like sensor failures, human error during data collection, etc.

Common techniques to handle missing values include:

• Deletion: If the number of missing values is small and the data is abundant, removing rows/columns with missing entries might be acceptable.

• Mean/Median Imputation: Replacing missing entries with the mean/median of the respective column.
1. Deletion of Missing Values

i) Listwise deletion:
• Remove entire rows that contain any missing values.
• Pros: Simple to implement.
• Cons: Can lead to significant data loss, especially if many rows have missing values.

ii) Pairwise deletion:
• Remove rows only when they are missing the value required for a specific analysis.
• Pros: More data is retained compared to listwise deletion.
• Cons: May lead to inconsistency in analyses since different subsets of data are used.
Listwise Deletion - Example
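A minimal pandas sketch of both deletion strategies; the column names and values below are illustrative assumptions, not taken from the slides.

```python
import pandas as pd
import numpy as np

# Illustrative patient records with missing values (column names are made up)
df = pd.DataFrame({
    "age":       [34, np.nan, 45, 29, np.nan],
    "diagnosis": ["A", "B", None, "A", "B"],
    "stay_days": [3, 5, 2, np.nan, 4],
})

# Listwise deletion: drop any row that contains at least one missing value
listwise = df.dropna()

# Pairwise deletion: drop rows only where the columns needed for a
# specific analysis (here age and stay_days) are missing
pairwise = df.dropna(subset=["age", "stay_days"])

print(listwise)
print(pairwise)
```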
2. Mean/Median Imputation

Replacing missing entries with the mean/median of the respective column.

The Problem of Missing Data
• Missing data is a common issue in real-world datasets.
• Can lead to biased results and reduced model performance.
• Imputation is a technique to fill in missing values.
1. Mean Imputation
The mean is simply the average of a set of numbers, calculated by adding all the values in a dataset and dividing by the total number of values.
• Replaces missing values with the mean of the column.
• Easy to implement.
• Sensitive to outliers.
• Example: Calculating the mean age of a group and replacing missing ages with that value.
Why Mean Imputation is More Common?

• Computational Efficiency: Calculating the mean is generally computationally faster than calculating the median, especially for large datasets.

• Compatibility with Other Statistical Methods

• Historical Preference: Mean imputation has been a standard technique for a long time, and it's often the default option in statistical software.
Mean Imputation - Example
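A small pandas sketch of mean imputation on made-up ages; scikit-learn's SimpleImputer(strategy="mean") is the equivalent library routine.

```python
import pandas as pd
import numpy as np

ages = pd.Series([25, 30, np.nan, 40, 35, np.nan])

# Replace missing ages with the mean of the observed values
mean_age = ages.mean()              # NaNs are ignored by default -> 32.5
ages_imputed = ages.fillna(mean_age)

print(ages_imputed.tolist())        # [25.0, 30.0, 32.5, 40.0, 35.0, 32.5]
```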
2. Median Imputation
The median is a measure of central tendency that represents the middle value in a dataset when the values are arranged in ascending or descending order.
• Replaces missing values with the median of the column.
• Less sensitive to outliers than mean imputation.
• Suitable for skewed data.
• Example: Calculating the median income of a population and replacing missing incomes with that value.
Median Imputation - Example
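A small pandas sketch of median imputation on made-up, skewed incomes, showing why the median is more robust here than the mean.

```python
import pandas as pd
import numpy as np

# A skewed income column: one very high value pulls the mean upward
incomes = pd.Series([30_000, 32_000, np.nan, 35_000, 1_000_000])

print(incomes.mean())    # 274250.0 -> distorted by the outlier
print(incomes.median())  # 33500.0  -> robust to the outlier

# Replace the missing income with the median
incomes_imputed = incomes.fillna(incomes.median())
print(incomes_imputed.tolist())
```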
When to Use Mean or Median Imputation?
• Mean imputation is suitable for normally distributed data with no outliers.
• Median imputation is better for skewed data or data with outliers.
• Consider the distribution of the data before choosing a method.
Limitations of Mean and Median Imputation

 Both methods can distort the original data distribution.


 Ignore the relationship between variables.
 Might not be suitable for all types of missing data.
II. Identifying and removing duplicate records

Duplicate records can significantly impact the quality and reliability of machine learning models. They can introduce bias, reduce model accuracy, and increase computational costs.
How does it occur?

• Data entry errors: Human mistakes during data input.
• Data integration: Combining data from multiple sources.
• Data sampling: Overlapping samples in different datasets.

Why is it a problem?

• Messy data: It makes your data look unorganized and confusing.
• Wrong results: If you use this data to train a machine learning model, it can give you incorrect answers.
• Slower processing: Having extra data can slow down your computer.
Challenges in Identifying Duplicates

• Noisy Data: Real-world data often contains inconsistencies and errors, making it difficult to determine exact duplicates.

• Data Volume: Large datasets can make the process computationally expensive.

• Defining Duplicates: Deciding what constitutes a duplicate can be subjective, especially for textual or categorical data.
How to find duplicates?

i) Simple Comparison:
• Compare records based on specific key attributes (e.g., ID, name, email).
• Suitable for exact duplicates.

ii) Record Linkage:
• Used for near duplicates.
• Involves creating a similarity score between records based on matching attributes.
• Techniques include string similarity measures, e.g. Jaccard similarity (see the sketch below).
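As a rough illustration of record linkage, here is a word-level Jaccard similarity in plain Python; the token-based definition is one common choice, and the example names are invented.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity between the sets of lowercase word tokens of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Near-duplicate names that exact matching would miss
print(jaccard_similarity("Vikram Singh", "Singh Vikram"))     # 1.0
print(jaccard_similarity("Vikram Singh", "Vikram K. Singh"))  # ~0.67
```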


How to remove duplicates?

• Deleting Duplicates: Remove all but one occurrence of a duplicate record.

• Combining Duplicates: Merge information from multiple duplicate records into a single record.
Example: Customer Data

Consider a customer dataset with attributes like customer ID, name, address, and phone number.

To identify duplicates, you might:

• Use exact matching for customer ID.
• Employ similarity-based matching for names and addresses to account for variations in spelling or formatting.
• Consider phone numbers as additional indicators of potential duplicates.

By effectively identifying and removing duplicate records, we can improve the quality of our machine learning models and obtain more reliable results.
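A minimal pandas sketch of the customer example above (IDs, names, and numbers are invented), flagging and dropping exact duplicates on the key attribute.

```python
import pandas as pd

# Illustrative customer records (values are made up)
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "name":        ["Asha Rao", "Vikram Singh", "Vikram Singh", "Meera Nair"],
    "phone":       ["9876", "5551", "5551", "7012"],
})

# Flag exact duplicates on the key attribute
print(customers.duplicated(subset=["customer_id"]))  # False, False, True, False

# Keep only the first occurrence of each customer_id
deduped = customers.drop_duplicates(subset=["customer_id"], keep="first")
print(deduped)
```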
III. Identifying and removing Outliers

An outlier is a data point that significantly differs from other observations in a dataset. These points can sometimes be errors or anomalies, but they can also be genuinely unusual data points that provide valuable information.

Why is Outlier Detection Important?
Outliers arise due to various reasons such as measurement errors, data processing errors, or true anomalies. Understanding them is critical because their presence can have substantial effects on our data analysis.

They can:

• Affect Mean and Standard Deviation: Outliers can significantly skew your mean and inflate the standard deviation, distorting the overall data distribution.

• Impact Model Accuracy: Many machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can mislead the training process, resulting in longer training times and less accurate models.
Outlier detection methods: Z-score method

Outlier detection can be performed using several methods.

Statistical Methods - Z-score:

• The Z-score method is a statistical technique used to identify outliers in a dataset.

• It is a measure of how many standard deviations an observation is from the mean. A common rule of thumb is that a data point with a Z-score greater than 3 or less than -3 is considered an outlier.
How does the Z-score Method Work?
1. Calculate the mean (average) of the dataset.
2. Calculate the standard deviation of the dataset.
3. Calculate the Z-score for each data point: Z-score = (Data point − Mean) / Standard Deviation
4. Set a threshold: Typically, data points with a Z-score greater than 3 or less than -3 are considered outliers. This means they are more than 3 standard deviations away from the mean.
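The steps above translate directly into a few lines of NumPy. The data below is invented so that the extreme value clearly exceeds the ±3 threshold.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()   # population standard deviation
    return values[np.abs(z) > threshold]

# 20 ordinary readings plus one extreme value (illustrative data)
data = list(range(10, 30)) + [150]
print(zscore_outliers(data))   # [150.] -> more than 3 standard deviations from the mean
```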
Limitations of Z-Score Method

• Assumes data follows a normal distribution.
• The mean and standard deviation used in the calculation are themselves affected by extreme values.
• Might not be suitable for small datasets.

Alternatives:
There are other methods for outlier detection, such as the IQR (interquartile range), box plots, and more advanced techniques like clustering and anomaly detection algorithms.
When to use which?

• Z-score: Suitable for normally distributed data and when you want to understand how far a data point is from the average in terms of standard deviations.

• IQR: Suitable for skewed data or when you want a more robust measure of spread that is less influenced by extreme values.

Example:
Imagine you have a dataset of salaries. If the dataset is normally distributed, you can use the Z-score to identify extremely high or low salaries. However, if the salary data is skewed (e.g., with a few very high salaries), the IQR might be a better choice as it's less affected by those extreme values.
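For comparison, a small NumPy sketch of the IQR rule (1.5 × IQR fences) on invented, skewed salary figures.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Skewed salary data with one very high value (illustrative numbers, in thousands)
salaries = [30, 32, 35, 36, 38, 40, 42, 45, 300]
print(iqr_outliers(salaries))   # [300.]
```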


Example – Z-Score Calculation
Suppose we have the following dataset of exam scores: 10, 12, 14, 15, 22

• Step 1: Calculate the mean (average) of the dataset.
Mean = (10 + 12 + 14 + 15 + 22) / 5 = 73 / 5 = 14.6

• Step 2: Calculate the standard deviation (σ) of the dataset.
First, we find the squared differences from the mean: 21.16, 6.76, 0.36, 0.16, 54.76.
Now, sum these squared differences and divide by the number of data points: variance = 83.2 / 5 = 16.64.
Finally, take the square root of the variance to get the standard deviation: σ ≈ 4.08.

• Step 3: Calculate the Z-scores for each data point.
Z-scores ≈ −1.13, −0.64, −0.15, 0.10, 1.81

These Z-scores indicate how many standard deviations each data point is from the mean. None of these Z-scores are greater than 3 or less than -3, so none of these data points would be considered outliers by the common rule of thumb.
Why do we need to transform and scale data?

Imagine you're comparing apples and oranges. They're different fruits, right? Similarly, in data, we often have features with different scales (like height in centimeters and weight in kilograms). To make them comparable, we need to transform them. This process is called scaling.
IV. Normalization and Standardization
• In machine learning, feature scaling is a crucial preprocessing step that involves transforming numerical features into a common scale.

• Two common techniques for feature scaling are normalization and standardization.

• They ensure that the data is in a format that allows the model to learn effectively and perform optimally.

• Normalization (Min-Max Scaling): Rescales data to a fixed range, typically [0, 1], ensuring that all features contribute equally.

• Standardization (Z-Score Scaling): Rescales data to have a mean of 0 and a standard deviation of 1, making features comparable and improving algorithm performance.

Choosing the right technique depends on your specific dataset and the requirements of the machine learning algorithm you are using.
Normalization
• Normalization involves scaling the values of a dataset to a range between 0 and 1. This is particularly useful when the data doesn't have a normal distribution or when you want all features to contribute equally to the result.

• It is also commonly referred to as Min-Max Scaling.


Normalization
• It is useful when:
• Data ranges vary significantly.
• Algorithms are sensitive to the scale of data.
• Features need to be compared on the same scale.

Example: If you are working on an ML project to classify flowers based on petal length and width, the petal length may range from 1 to 10 cm while the petal width ranges from 0.1 to 1 cm. We normalize the features to bring them to the same scale.
Normalization - Example
Imagine you have a dataset of heights of students in a class, measured in centimeters: 150 cm, 160 cm, 170 cm, 180 cm, 190 cm.
Applying min-max scaling, x' = (x − min) / (max − min), with min = 150 and max = 190 gives 0, 0.25, 0.5, 0.75, and 1.0.
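The same example in NumPy; scikit-learn's MinMaxScaler would give the same result.

```python
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)

# Min-max scaling: x' = (x - min) / (max - min)
scaled = (heights - heights.min()) / (heights.max() - heights.min())
print(scaled)   # [0.   0.25 0.5  0.75 1.  ]

# Equivalent with scikit-learn (expects a 2-D array):
# from sklearn.preprocessing import MinMaxScaler
# MinMaxScaler().fit_transform(heights.reshape(-1, 1))
```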
Standardization
• Standardization, on the other hand, transforms the data to have a mean (average) of 0 and a standard deviation of 1. This is useful when you want to compare data that follows a normal distribution. It is also known as Z-Score Scaling.

• It is useful when:
• Data needs to follow a Normal distribution.
• Comparisons are made across different scales and units.

Example: If you are working on a dataset with features like age, income, and years of education, which are on different scales and units, we standardize the features to have a mean of 0 and a standard deviation of 1.
Standardization - Example
Using the same heights dataset: 150 cm, 160 cm, 170 cm, 180 cm, 190 cm. The mean is 170 and the (population) standard deviation is √200 ≈ 14.14, so the standardized values are approximately −1.41, −0.71, 0, 0.71, and 1.41.
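And the corresponding Z-score scaling in NumPy; scikit-learn's StandardScaler is the equivalent library routine.

```python
import numpy as np

heights = np.array([150, 160, 170, 180, 190], dtype=float)

# Z-score scaling: x' = (x - mean) / standard deviation
standardized = (heights - heights.mean()) / heights.std()
print(heights.mean(), heights.std())   # 170.0  ~14.14
print(standardized)                    # approx [-1.41 -0.71  0.    0.71  1.41]

# Equivalent with scikit-learn (expects a 2-D array):
# from sklearn.preprocessing import StandardScaler
# StandardScaler().fit_transform(heights.reshape(-1, 1))
```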
When to Use Which?
• Normalization (min-max scaling) is preferred when you know the exact range of your data and want to preserve the original distribution. It's often used for algorithms that are sensitive to the scale of features.

• Standardization (Z-score scaling) is preferred when you don't know the exact range of your data or when you want to reduce the influence of outliers. It's often used for algorithms that assume normally distributed data.
V. Handling Categorical Data

When working with categorical data, machine learning algorithms require numerical input. One-hot encoding is a technique used to convert categorical data into a format that can be provided to ML algorithms to do a better job in prediction.

What is One-Hot Encoding?

One-hot encoding converts categorical variables into a set of binary (0 or 1) variables. Each category becomes a new column, and the values are 1 if the category is present and 0 otherwise.
One-Hot Encoding – Example

Example:
• Suppose we have a dataset with a categorical feature called "Color" with three possible values: "Red", "Blue", and "Green".
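A minimal pandas sketch of this example using get_dummies, one common way to one-hot encode; the rows are invented.

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})

# One new binary column per category; 1 marks the category present in that row
encoded = pd.get_dummies(df, columns=["Color"], dtype=int)
print(encoded)
#    Color_Blue  Color_Green  Color_Red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1
```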
Steps to Perform One-Hot Encoding

1. Identify the Categorical Feature: Determine which feature(s) in your dataset are categorical.

2. Create Binary Columns: For each unique category in the feature, create a new binary column.

3. Fill the Columns: Fill the new columns with 1s and 0s based on the presence of the category in each row.
Why Use One-Hot Encoding?

• Machine Readable: Many machine learning algorithms (like linear regression, logistic regression, etc.) cannot work with categorical data directly. They require numerical input.

• Avoid Ordinal Relationship: One-hot encoding avoids any ordinal relationship between the categories (unlike label encoding).

Challenges

• Too Many Columns: If a feature has many distinct categories, you'll end up with many columns. This can slow down the computer.

• Sparse Data: Most of the numbers in these new columns will be 0. This can also make the computer work harder.

• Overfitting
When to Use One-Hot encoding?
• When categories have no inherent order (like colors, car brands).

Alternatives:

• Label Encoding: Assigns a number to each category, but it assumes an order which might not be correct.

• Target Encoding: Replaces a category with the mean of the target variable, but can lead to overfitting.

Label Encoding
Example:

Consider a dataset of cars with features like color, brand, and price.

• Color: No inherent order (red, blue, green), so one-hot encoding is preferred.

• Brand: No inherent order (Toyota, Ford, Honda), so one-hot encoding is preferred.

• Price Range: Categories like "low", "medium", "high" have an order, so label encoding could be considered (but one-hot encoding is often safer).
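A small sketch combining both encodings on invented car data: one-hot encoding for brand, and an explicit ordered mapping (label encoding) for price range.

```python
import pandas as pd

cars = pd.DataFrame({
    "brand":       ["Toyota", "Ford", "Honda", "Toyota"],
    "price_range": ["low", "high", "medium", "low"],
})

# No inherent order in brand -> one-hot encode
brand_encoded = pd.get_dummies(cars["brand"], prefix="brand", dtype=int)

# price_range is ordinal -> map categories to ordered integers (label encoding)
order = {"low": 0, "medium": 1, "high": 2}
cars["price_range_encoded"] = cars["price_range"].map(order)

print(brand_encoded)
print(cars)
```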
Recap
• ML Pipeline
• Collection, Preparation, Segregation (train, validate, test)
• Model Training, Evaluation, Deployment, Performance Monitoring
• Where to collect data? (Kaggle, UCI, Google Dataset Search)
• Data cleaning involves correcting or removing incorrect, corrupted, improperly formatted, duplicate, or incomplete data.
• Data preprocessing transforms raw data into an understandable format (normalization, encoding, handling missing values, etc.)
• Handling missing values – Deletion (listwise & pairwise), Imputation (mean & median)
• Identifying and removing duplicate records (deleting & combining)
• Identifying and removing outliers (Z-score)
• Feature scaling (normalization and standardization)
• Handling categorical data (one-hot encoding)
