UNIT - 2
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality results
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
Why Data Preprocessing?
• A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
• Broad categories:
• intrinsic, contextual, representational, and accessibility.
Data Cleaning
• It is the process of identifying and correcting errors or inconsistencies in
the raw dataset.
• It involves handling missing values, removing duplicates, and correcting
incorrect, noisy data or outlier data to ensure the dataset is accurate and
reliable.
• Clean data is essential for effective analysis, as it improves the quality of
results and enhances the performance of data models.
Data Cleaning – Missing Values
• Missing Values
• These occur when data is absent from a dataset.
• You can either:
• ignore or delete the records with missing data, or
• fill the gaps manually, with the attribute mean, or with the most probable value (a pandas sketch of these options follows below).
• Handling missing values properly keeps the dataset accurate and complete for analysis.
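A minimal pandas sketch of these options; the column names and values are illustrative, not from the slides:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "city": ["Pune", "Delhi", np.nan]})

dropped = df.dropna()                                 # delete records with missing data
df["age"] = df["age"].fillna(df["age"].mean())        # fill with the attribute mean
df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill with the most frequent value
```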
Data Cleaning - Types of Missing Data
• Missing Completely at Random (MCAR):
• The missingness of data is completely random and unrelated to any other
variables in the dataset, both observed and unobserved.
• For example, if a height measurement is missing due to a malfunctioning
device, it's likely MCAR.
• Missing at Random (MAR):
• The missingness of data can be explained by other variables in the dataset,
but not by the missing value itself.
• For example, if older people are more likely to skip a survey question about
income, the missing income data is MAR if you have access to age data.
Data Cleaning - Types of Missing Data
• Missing Not at Random (MNAR):
• The missingness of data is related to the value of the missing data itself.
• For example, if people with very high incomes are less likely to report their
income, the missing income data is MNAR.
• Structurally Missing:
• This type of missing data is expected due to the structure of the data collection
or analysis.
• An example would be missing values in a column that represents the number
of children a person has, if the person has no children.
Data Cleaning
Missing Values: Deletion technique
• Listwise Deletion Technique
• Listwise deletion removes entire rows from the dataset if any value in those rows
is missing, regardless of how many variables are involved in the analysis.
• It ensures consistency by working with a complete dataset.
• How It Works?
• All rows containing at least one missing value are dropped from the dataset.
• A complete-case analysis is then performed on the remaining rows.
Data Cleaning
Missing Values Deletion Techniques (Listwise Deletion)
• Example: consider the following dataset:

  A    B    C
  1    2    NaN
  4    NaN  6
  7    8    9

• After applying listwise deletion, only row 3 remains, because it is the only row with no missing values:

  A    B    C
  7    8    9
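The same example in pandas, as a minimal sketch; `dropna()` performs listwise deletion by default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, np.nan, 8], "C": [np.nan, 6, 9]})

# Drop every row that contains at least one missing value
complete = df.dropna()
print(complete)   # only the row (7, 8, 9) survives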
Data Cleaning
Missing Values: Deletion technique
• Pairwise Deletion Technique
• Pairwise deletion is a technique that evaluates each pair of variables independently,
using only the data points that are not missing for that pair.
• It maximizes the use of available data for each specific analysis.
• How It Works?
• For each computation (e.g., correlation, covariance), the method excludes only the
rows with missing values for the variables involved.
• All other rows are retained for other variable pairs.
Data Cleaning
Missing Values Deletion Technique (Pairwise Deletion)
• Consider the following dataset:
A B C
1 2 NaN
4 NaN 6
7 8 9
• To calculate the correlation between columns A and B, only rows 1 and 3 will
be used.
• For A and C, rows 2 and 3 will be used.
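A minimal sketch in pandas; `DataFrame.corr()` already applies pairwise deletion, using for each pair of columns only the rows where both values are present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, np.nan, 8], "C": [np.nan, 6, 9]})

# A-B is computed from rows 1 and 3, A-C from rows 2 and 3
print(df.corr())
```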
Pairwise vs. Listwise Deletion: Key Comparisons
Feature           | Pairwise Deletion                          | Listwise Deletion
------------------|--------------------------------------------|--------------------------------------------------
Data Usage        | Uses available data for each variable pair | Removes rows with missing values, ensuring uniformity
Consistency       | Results in inconsistent sample sizes       | Ensures a consistent dataset for all analyses
Bias              | Risk of bias if data is not MCAR           | Risk of bias if data is not MCAR
Simplicity        | More complex to implement                  | Straightforward and easy to apply
Statistical Power | Retains more data, increasing power        | May reduce power due to smaller datasets
Suitability       | Ideal for exploratory analyses             | Suited to modeling and regression tasks requiring complete data
Data Cleaning
• Missing Value - Imputation: Replace missing values with estimated values.
• Mean/Median/Mode Imputation: Fill missing values with the mean (average),
median (middle value), or mode (most frequent value) of the column.
• Forward/Backward Fill: Propagate the previous (forward fill) or next (backward fill) known value into the missing cell.
• Interpolation: Estimate missing values using linear or non-linear methods, often
useful for time-series data.
• Advanced Techniques: Use predictive models like K-Nearest Neighbors (KNN) to
estimate missing values based on similar data points.
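A minimal sketch of these imputation options with pandas and scikit-learn; the values are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled  = s.fillna(s.mean())   # mean imputation
ffilled      = s.ffill()            # forward fill: propagate the previous known value
interpolated = s.interpolate()      # linear interpolation, common for time series

# KNN imputation: estimate each missing cell from the k most similar rows
X = pd.DataFrame({"x1": [1, 2, np.nan, 4], "x2": [2, 4, 6, 8]})
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```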
Data Cleaning – NOISY DATA
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• etc
• Other data problems that require data cleaning
• duplicate records, incomplete data, inconsistent data
Data Cleaning
• Noisy Data: It refers to irrelevant or incorrect data that is difficult for
machines to interpret, often caused by errors in data collection or entry. It
can be handled in several ways:
• Binning Method: The sorted data is partitioned into equal-frequency segments (bins), and each bin is smoothed by replacing its values with the bin mean or the bin boundary values (see the sketch after this list).
• Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to predict values.
• Clustering: This method groups similar data points together; values that fall outside every cluster can be treated as outliers.
• These techniques help remove noise and improve data quality.
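A minimal sketch of equal-frequency binning with smoothing by bin means; the price values are illustrative:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition the sorted values into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```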
Outlier Detection & Handling
• Data points inconsistent with the majority of data
• Types of outliers:
• Valid: a CEO's salary
• Noisy: a person's age = 200, or widely deviating points
• Detection: Z-Score, Boxplots, Clustering, Curve Fitting,
Hypothesis Testing
• Removal methods
• Clustering
• Curve-fitting
• Hypothesis-testing with a given model
Data Cleaning
• Removing Duplicates
• It involves identifying and eliminating repeated data entries to ensure
accuracy and consistency in the dataset.
• This process prevents errors and ensures reliable analysis by keeping only
unique records.
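A minimal pandas sketch; the data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["Asha", "Ravi", "Ravi", "Meena"]})

unique_rows = df.drop_duplicates()                           # drop fully repeated records
unique_ids  = df.drop_duplicates(subset="id", keep="first")  # deduplicate on a key column
```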
Data Cleaning – Outlier Detection Techniques
• Z-SCORE METHOD
• The Z-score tells how many standard deviations a value lies away from the mean: Z = (X − μ) / σ, where μ is the mean and σ is the standard deviation.
• Common threshold: |Z| > 3 indicates an outlier.
Data Cleaning – Outlier Detection Techniques
Z - SCORE
• The figure shows the area under the normal curve and how much of it each standard-deviation band covers:
• 68% of the data points lie within ±1 standard deviation of the mean.
• 95% lie within ±2 standard deviations.
• 99.7% lie within ±3 standard deviations.
• If a data point's |Z| exceeds 3 (beyond the band that covers 99.7% of the data), its value is quite different from the other values, so it is treated as an outlier (see the sketch below).
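A minimal NumPy sketch of Z-score outlier detection, using synthetic data with one injected outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 100), 120.0)   # 120 is an injected outlier

z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])   # only the injected value is flagged
```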
Data Cleaning – Outlier Detection Techniques
• Modified Z-Score
• Uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, making it more robust to outliers.
• Modified Z-score: M = 0.6745 (X − median) / MAD, where MAD = median(|X − median|).
• Common threshold: |M| > 3.5 indicates an outlier (a sketch follows below).
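A minimal sketch of the modified Z-score on a small sample, where a plain Z-score would struggle because the outlier inflates the mean and standard deviation:

```python
import numpy as np

data = np.array([12, 13, 12, 11, 14, 13, 200])

median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
print(data[np.abs(modified_z) > 3.5])   # [200]
```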
Data Cleaning – Outlier Detection Techniques
IQR Technique (or Boxplots)
• Compute IQR = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles.
• Points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers; on a boxplot they appear beyond the whiskers.
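A minimal NumPy sketch of the IQR rule; the data is illustrative:

```python
import numpy as np

data = np.array([12, 13, 12, 11, 14, 13, 200])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # [200]
```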
Data Cleaning - Outlier Detection : Clustering
• Use clustering (e.g., k-means); points that don’t fit well into any cluster or lie far
from cluster centroids are outliers.
• Ex: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN is a density-based clustering algorithm that groups points in high-density regions into clusters and labels points in low-density regions as outliers (noise).
• Unlike k-means, it does not require the number of clusters to be specified in advance.
• DBSCAN requires two parameters:
• epsilon: a distance parameter that defines the radius to search for nearby neighbors.
• minPts: the minimum number of points required to form a dense region (cluster).
• Using epsilon and minPts, every data point can be classified as a core, border, or noise point (see the sketch below).
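A minimal scikit-learn sketch on synthetic data; points DBSCAN labels -1 are noise, i.e., outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # dense cluster around (0, 0)
               rng.normal(5, 0.3, (50, 2)),   # dense cluster around (5, 5)
               [[10.0, 10.0]]])               # one isolated point

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(X[labels == -1])   # the isolated point is labelled as noise
```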
Data Cleaning – Outlier Detection
• VISUALIZING THE DATA
• Data visualization is useful for detecting outliers and unusual groups, and for identifying trends and clusters. The following plots help spot outliers:
• Box and whisker plot (box plot).
• Scatter plot.
• Histogram.
• Distribution Plot.
• QQ plot.
Data Integration
• It involves merging data from various sources into a single, unified
dataset.
• It can be challenging due to differences in data formats, structures,
and meanings.
• Techniques like record linkage and data fusion help in combining data
efficiently, ensuring consistency and accuracy.
Data Integration
• Record Linkage:
• The process of identifying and matching records from different datasets that
refer to the same entity, even if they are represented differently.
• It helps in combining data from various sources by finding corresponding
records based on common identifiers or attributes.
• Data Fusion:
• Involves combining data from multiple sources to create a more
comprehensive and accurate dataset.
• It integrates information that may be inconsistent or incomplete from
different sources, ensuring a unified and richer dataset for analysis.
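As a minimal sketch, a pandas merge on a shared key stands in for record linkage (real-world linkage also handles fuzzy, probabilistic matches); the table and column names are illustrative:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
orders    = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [250, 90, 410]})

# Link records from the two sources on the shared identifier
linked = customers.merge(orders, on="cust_id", how="left")
```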
Data Transformation
• It involves converting data into a format suitable for analysis.
• Common techniques include:
• Normalization, which scales data to a common range;
• Standardization, which adjusts data to have zero mean and unit variance; and
• Discretization, which converts continuous data into discrete categories.
• These techniques help prepare the data for more accurate analysis.
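A minimal scikit-learn sketch of the three techniques; the data is illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

normalized   = MinMaxScaler().fit_transform(X)     # scale to the range [0, 1]
standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance
discretized  = KBinsDiscretizer(n_bins=2, encode="ordinal",
                                strategy="uniform").fit_transform(X)
```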
Data Transformation
• Data Normalization: The process of scaling data to a common range
to ensure consistency across variables.
• Discretization: Converting continuous data into discrete categories for
easier analysis.
• Data Aggregation: Combining multiple data points into a summary
form, such as averages or totals, to simplify analysis.
• Concept Hierarchy Generation: Organizing data into a hierarchy of
concepts to provide a higher-level view for better understanding and
analysis.
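A minimal pandas sketch of data aggregation; the data is illustrative:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                      "amount": [100, 120, 80, 95]})

# Aggregate transaction-level rows into per-region totals and averages
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])
```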
Data Reduction
• It reduces the dataset's size while maintaining key information.
• This can be done through:
• Feature selection, which chooses the most relevant features, and
• Feature extraction, which transforms the data into a lower-
dimensional space while preserving important details.
• It uses various reduction techniques such as:
• Dimensionality Reduction
• Numerosity Reduction
• Data Compression
Data Reduction
• Dimensionality Reduction (e.g., Principal Component Analysis): A
technique that reduces the number of variables in a dataset while
retaining its essential information.
• Numerosity Reduction: Reducing the number of data points by
methods like sampling to simplify the dataset without losing critical
patterns.
• Data Compression: Reducing the size of data by encoding it in a more
compact form, making it easier to store and process.
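A minimal scikit-learn PCA sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features

pca = PCA(n_components=2)                # keep the 2 strongest components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)     # share of variance each component retains
```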