UNIT - 2
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality results
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
Why Data Preprocessing?
• A well-accepted multidimensional view:
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
• Broad categories:
• intrinsic, contextual, representational, and accessibility.
Data Cleaning
• It is the process of identifying and correcting errors or inconsistencies in
the raw dataset.
• It involves handling missing values, removing duplicates, and correcting
incorrect, noisy data or outlier data to ensure the dataset is accurate and
reliable.
• Clean data is essential for effective analysis, as it improves the quality of
results and enhances the performance of data models.
Data Cleaning – Missing Values
• Missing Values
• These occur when data is absent from a dataset.
• You can either:
• ignore or delete the records with missing data, or
• fill the gaps manually, with the attribute mean, or with the most probable value (a pandas sketch of these options follows below).
• Handling missing values properly keeps the dataset accurate and complete for analysis.
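A minimal pandas sketch of these options; the column names and values are illustrative, not from the slides:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 47], "city": ["Pune", "Delhi", np.nan]})

dropped = df.dropna()                                 # delete records with missing data
df["age"] = df["age"].fillna(df["age"].mean())        # fill with the attribute mean
df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill with the most frequent value
```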
Data Cleaning - Types of Missing Data
• Missing Completely at Random (MCAR):
• The missingness of data is completely random and unrelated to any other
variables in the dataset, both observed and unobserved.
• For example, if a height measurement is missing due to a malfunctioning
device, it's likely MCAR.
• Missing at Random (MAR):
• The missingness of data can be explained by other variables in the dataset,
but not by the missing value itself.
• For example, if older people are more likely to skip a survey question about
income, the missing income data is MAR if you have access to age data.
Data Cleaning - Types of Missing Data
• Missing Not at Random (MNAR):
• The missingness of data is related to the value of the missing data itself.
• For example, if people with very high incomes are less likely to report their
income, the missing income data is MNAR.
• Structurally Missing:
• This type of missing data is expected due to the structure of the data collection
or analysis.
• An example would be missing values in a column that represents the number
of children a person has, if the person has no children.
Data Cleaning
Missing Values: Deletion technique
• Listwise Deletion Technique
• Listwise deletion removes entire rows from the dataset if any value in those rows
is missing, regardless of how many variables are involved in the analysis.
• It ensures consistency by working with a complete dataset.
• How It Works?
• All rows containing at least one missing value are dropped from the dataset.
• A complete-case analysis is then performed on the remaining rows.
Data Cleaning
Missing Values Deletion Techniques (Listwise Deletion)
• Example: consider the following dataset:

  A    B    C
  1    2    NaN
  4    NaN  6
  7    8    9

• After applying listwise deletion, only row 3 remains, because it is the only row with no missing values:

  A    B    C
  7    8    9
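The same example in pandas, as a minimal sketch; `dropna()` performs listwise deletion by default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, np.nan, 8], "C": [np.nan, 6, 9]})

# Drop every row that contains at least one missing value
complete = df.dropna()
print(complete)   # only the row (7, 8, 9) survives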
Data Cleaning
Missing Values: Deletion technique
• Pairwise Deletion Technique
• Pairwise deletion is a technique that evaluates each pair of variables independently,
using only the data points that are not missing for that pair.
• It maximizes the use of available data for each specific analysis.
• How It Works?
• For each computation (e.g., correlation, covariance), the method excludes only the
rows with missing values for the variables involved.
• All other rows are retained for other variable pairs.
Data Cleaning
Missing Values Deletion Technique (Pairwise Deletion)
• Consider the following dataset:
A B C
1 2 NaN
4 NaN 6
7 8 9
• To calculate the correlation between columns A and B, only rows 1 and 3 will
be used.
• For A and C, rows 2 and 3 will be used.
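A minimal sketch in pandas; `DataFrame.corr()` already applies pairwise deletion, using for each pair of columns only the rows where both values are present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 4, 7], "B": [2, np.nan, 8], "C": [np.nan, 6, 9]})

# A-B is computed from rows 1 and 3, A-C from rows 2 and 3
print(df.corr())
```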
Pairwise vs. Listwise Deletion: Key Comparisons
Feature           | Pairwise Deletion                          | Listwise Deletion
------------------|--------------------------------------------|--------------------------------------------------
Data Usage        | Uses available data for each variable pair | Removes rows with missing values, ensuring uniformity
Consistency       | Results in inconsistent sample sizes       | Ensures a consistent dataset for all analyses
Bias              | Risk of bias if data is not MCAR           | Risk of bias if data is not MCAR
Simplicity        | More complex to implement                  | Straightforward and easy to apply
Statistical Power | Retains more data, increasing power        | May reduce power due to smaller datasets
Suitability       | Ideal for exploratory analyses             | Suited to modeling and regression tasks requiring complete data
Data Cleaning
• Missing Value - Imputation: Replace missing values with estimated values.
• Mean/Median/Mode Imputation: Fill missing values with the mean (average),
median (middle value), or mode (most frequent value) of the column.
• Forward/Backward Fill: Propagate the previous (forward fill) or next (backward fill) known value into the missing cell.
• Interpolation: Estimate missing values using linear or non-linear methods, often
useful for time-series data.
• Advanced Techniques: Use predictive models like K-Nearest Neighbors (KNN) to
estimate missing values based on similar data points.
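A minimal sketch of these imputation options with pandas and scikit-learn; the values are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled  = s.fillna(s.mean())   # mean imputation
ffilled      = s.ffill()            # forward fill: propagate the previous known value
interpolated = s.interpolate()      # linear interpolation, common for time series

# KNN imputation: estimate each missing cell from the k most similar rows
X = pd.DataFrame({"x1": [1, 2, np.nan, 4], "x2": [2, 4, 6, 8]})
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)
```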
Data Cleaning – NOISY DATA
• Noise: random error or variance in a measured variable.
• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• etc
• Other data problems that require data cleaning
• duplicate records, incomplete data, inconsistent data
Data Cleaning
• Noisy Data: It refers to irrelevant or incorrect data that is difficult for
machines to interpret, often caused by errors in data collection or entry. It
can be handled in several ways:
• Binning Method: The sorted data is partitioned into equal-frequency segments (bins), and each bin is smoothed by replacing its values with the bin mean or the bin boundary values (see the sketch after this list).
• Regression: Data can be smoothed by fitting it to a regression function, either linear or multiple, to predict values.
• Clustering: This method groups similar data points together; values that fall outside every cluster can be treated as outliers.
• These techniques help remove noise and improve data quality.
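A minimal sketch of equal-frequency binning with smoothing by bin means; the price values are illustrative:

```python
import pandas as pd

prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Partition the sorted values into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value with the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())   # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```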
Outlier Detection & Handling
• Data points inconsistent with the majority of data
• Types of outliers:
• Valid: a CEO's salary
• Noisy: a person's age = 200, or widely deviating points
• Detection: Z-Score, Boxplots, Clustering, Curve Fitting,
Hypothesis Testing
• Removal methods
• Clustering
• Curve-fitting
• Hypothesis-testing with a given model
Data Cleaning
• Removing Duplicates
• It involves identifying and eliminating repeated data entries to ensure
accuracy and consistency in the dataset.
• This process prevents errors and ensures reliable analysis by keeping only
unique records.
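A minimal pandas sketch; the data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["Asha", "Ravi", "Ravi", "Meena"]})

unique_rows = df.drop_duplicates()                           # drop fully repeated records
unique_ids  = df.drop_duplicates(subset="id", keep="first")  # deduplicate on a key column
```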
Data Cleaning – Outlier Detection Techniques
• Z-SCORE METHOD
• The Z-score tells how many standard deviations a value lies away from the mean: Z = (X − μ) / σ, where μ is the mean and σ is the standard deviation.
• Common threshold: |Z| > 3 indicates an outlier.
Data Cleaning – Outlier Detection Techniques
Z - SCORE
• The figure shows the area under the normal curve and how much of it each standard-deviation band covers:
• 68% of the data points lie within ±1 standard deviation of the mean.
• 95% lie within ±2 standard deviations.
• 99.7% lie within ±3 standard deviations.
• If a data point's |Z| exceeds 3 (beyond the band that covers 99.7% of the data), its value is quite different from the other values, so it is treated as an outlier (see the sketch below).
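A minimal NumPy sketch of Z-score outlier detection, using synthetic data with one injected outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 100), 120.0)   # 120 is an injected outlier

z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])   # only the injected value is flagged
```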
Data Cleaning – Outlier Detection Techniques
• Modified Z-Score
• Uses the median and the median absolute deviation (MAD) instead of the mean and standard deviation, making it more robust to outliers.
• Modified Z-score: M = 0.6745 (X − median) / MAD, where MAD = median(|X − median|).
• Common threshold: |M| > 3.5 indicates an outlier (a sketch follows below).
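A minimal sketch of the modified Z-score on a small sample, where a plain Z-score would struggle because the outlier inflates the mean and standard deviation:

```python
import numpy as np

data = np.array([12, 13, 12, 11, 14, 13, 200])

median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
print(data[np.abs(modified_z) > 3.5])   # [200]
```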
Data Cleaning – Outlier Detection Techniques
IQR Technique (or Boxplots)
• Compute IQR = Q3 − Q1, where Q1 and Q3 are the 25th and 75th percentiles.
• Points below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are flagged as outliers; on a boxplot they appear beyond the whiskers.
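A minimal NumPy sketch of the IQR rule; the data is illustrative:

```python
import numpy as np

data = np.array([12, 13, 12, 11, 14, 13, 200])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])   # [200]
```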
Data Cleaning - Outlier Detection : Clustering
• Use clustering (e.g., k-means); points that don’t fit well into any cluster or lie far
from cluster centroids are outliers.
• Ex: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN is a density-based clustering algorithm that groups points in high-density regions into clusters and labels points in low-density regions as outliers (noise).
• Unlike k-means, it does not require the number of clusters to be specified in advance.
• DBSCAN requires two parameters:
• epsilon: a distance parameter that defines the radius to search for nearby neighbors.
• minPts: the minimum number of points required to form a dense region (cluster).
• Using epsilon and minPts, every data point can be classified as a core, border, or noise point (see the sketch below).
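A minimal scikit-learn sketch on synthetic data; points DBSCAN labels -1 are noise, i.e., outliers:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # dense cluster around (0, 0)
               rng.normal(5, 0.3, (50, 2)),   # dense cluster around (5, 5)
               [[10.0, 10.0]]])               # one isolated point

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(X[labels == -1])   # the isolated point is labelled as noise
```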
Data Cleaning – Outlier Detection
• VISUALIZING THE DATA
• Data visualization is useful for detecting outliers and unusual groups, and for identifying trends and clusters. The following plots help spot outliers:
• Box and whisker plot (box plot).
• Scatter plot.
• Histogram.
• Distribution Plot.
• QQ plot.
Data Integration
• It involves merging data from various sources into a single, unified
dataset.
• It can be challenging due to differences in data formats, structures,
and meanings.
• Techniques like record linkage and data fusion help in combining data
efficiently, ensuring consistency and accuracy.
Data Integration
• Record Linkage:
• The process of identifying and matching records from different datasets that
refer to the same entity, even if they are represented differently.
• It helps in combining data from various sources by finding corresponding
records based on common identifiers or attributes.
• Data Fusion:
• Involves combining data from multiple sources to create a more
comprehensive and accurate dataset.
• It integrates information that may be inconsistent or incomplete from
different sources, ensuring a unified and richer dataset for analysis.
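As a minimal sketch, a pandas merge on a shared key stands in for record linkage (real-world linkage also handles fuzzy, probabilistic matches); the table and column names are illustrative:

```python
import pandas as pd

customers = pd.DataFrame({"cust_id": [1, 2, 3], "name": ["Asha", "Ravi", "Meena"]})
orders    = pd.DataFrame({"cust_id": [1, 1, 3], "amount": [250, 90, 410]})

# Link records from the two sources on the shared identifier
linked = customers.merge(orders, on="cust_id", how="left")
```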
Data Transformation
• It involves converting data into a format suitable for analysis.
• Common techniques include:
• Normalization, which scales data to a common range;
• Standardization, which adjusts data to have zero mean and unit variance; and
• Discretization, which converts continuous data into discrete categories.
• These techniques help prepare the data for more accurate analysis.
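A minimal scikit-learn sketch of the three techniques; the data is illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

normalized   = MinMaxScaler().fit_transform(X)     # scale to the range [0, 1]
standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance
discretized  = KBinsDiscretizer(n_bins=2, encode="ordinal",
                                strategy="uniform").fit_transform(X)
```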
Data Transformation
• Data Normalization: The process of scaling data to a common range
to ensure consistency across variables.
• Discretization: Converting continuous data into discrete categories for
easier analysis.
• Data Aggregation: Combining multiple data points into a summary
form, such as averages or totals, to simplify analysis.
• Concept Hierarchy Generation: Organizing data into a hierarchy of
concepts to provide a higher-level view for better understanding and
analysis.
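A minimal pandas sketch of data aggregation; the data is illustrative:

```python
import pandas as pd

sales = pd.DataFrame({"region": ["N", "N", "S", "S"],
                      "amount": [100, 120, 80, 95]})

# Aggregate transaction-level rows into per-region totals and averages
summary = sales.groupby("region")["amount"].agg(["sum", "mean"])
```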
Data Reduction
• It reduces the dataset's size while maintaining key information.
• This can be done through:
• Feature selection, which chooses the most relevant features, and
• Feature extraction, which transforms the data into a lower-
dimensional space while preserving important details.
• It uses various reduction techniques such as:
• Dimensionality Reduction
• Numerosity Reduction
• Data Compression
Data Reduction
• Dimensionality Reduction (e.g., Principal Component Analysis): A
technique that reduces the number of variables in a dataset while
retaining its essential information.
• Numerosity Reduction: Reducing the number of data points by
methods like sampling to simplify the dataset without losing critical
patterns.
• Data Compression: Reducing the size of data by encoding it in a more
compact form, making it easier to store and process.
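A minimal scikit-learn PCA sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 features

pca = PCA(n_components=2)                # keep the 2 strongest components
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)     # share of variance each component retains
```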