Data Preprocessing
Dec 2024
Outline
• Data quality problems
• Data preprocessing
• Data cleaning
• Data integration
• Data transformation
• Data reduction
Introduction
• Object and attributes
– Data are sets of objects (samples, vectors, instances, etc.),
placed on the rows of a table
Attribute Values
• Attribute values are numbers or symbols assigned to an
attribute
• Distinction between attributes and attribute values
– Same attribute can be mapped to different attribute values.
• Example: height can be measured in feet or meters
– Different attributes can be mapped to the same set of values
• Example: Attribute values for ID and age are integers
– Properties of attribute values can be different
• ID has no limit but age has a maximum and minimum value
Attributes with Examples
Data Quality
• Data have quality if they satisfy the requirements of the
intended use.
– Accuracy
– Completeness
– Consistency
– Timeliness
– Believability: how much the data are trusted by users.
– Interpretability: how easily the data are understood.
Data Quality Problems
• Noise refers to distortion (modification) of original values, due
to different kinds of interference, mainly occurring during
data collection
– Faulty data collection instruments
– Data entry errors
– Data transmission problems
– Technology limitations
– Inconsistency in naming convention
Data Quality Problems
• Outliers are often generated by measurement errors
• An outlier is a data object with characteristics that are
considerably different from most of the other data objects in
the data set.
– An outlier is an object (observation) that is, in a certain way,
distant from the rest of the data.
– It represents an 'alien' object in the data set
Data Quality Problems
• Missing values: attribute values that are not stored for an
attribute
• Reasons for missing values
– Information is not collected (e.g., people decline to give their
age and weight)
– Attributes may not be applicable to all cases (e.g., annual
income is not applicable to children)
Data Quality Problems
• Duplication: data set may include data objects that are
duplicates, or almost duplicates of one another
– Major issue when merging data from heterogeneous sources
• Example:
– Same person with multiple email addresses
Data Quality Problems
• Inconsistent: containing discrepancies
– e.g., Age="42", Birthday="03/07/1997"
– e.g., rating was "1, 2, 3", now rating is "A, B, C"
– e.g., discrepancy between duplicate records
• Impossible data combinations (e.g., Gender: Male, Pregnant:
Yes)
• A particular ML model may need its input in a particular format,
e.g., a random forest (RF) implementation may not accept null values
Why Is Data Preprocessing Important?
• Data preprocessing consumes most of the time and
implementation efforts and can be more critical than the machine-
learning algorithm itself.
• Less data (fewer attributes): machine learning methods can learn
faster
• Higher accuracy: machine learning methods can generalize better
• Simple results: they are easier to understand
Data Preprocessing Major Tasks
Data Cleaning
• Dirty data can cause confusion for the mining procedure,
resulting in unreliable output.
• Routines work to “clean” the data by:
– Filling in missing values
– Smoothing noisy data
– Identifying or removing outliers
– Resolving inconsistencies.
Data Cleaning
• Handling missing values:
– Ignore the tuple (data object) whenever a class label is missing.
• By ignoring the tuple, we do not make use of the remaining
attributes' values in the tuple that might be useful.
• Not very effective, unless the tuple contains several
attributes with missing values.
– Fill in the missing values manually: time-consuming and
not feasible for a large data set.
Data Cleaning
• Handling missing values:
– Estimate missing values
• Use a global constant
– Simple, but not foolproof.
• Use the attribute mean or median
– Optionally computed over all samples belonging to the same
class as the given tuple
• Use the most probable value
– Determined by regression, Bayesian formalism, or decision
tree
– Popular strategy since it uses the most information from the
present data to predict missing values
• The right technique depends on the problem domain.
• Missing value may not imply an error in the data!
Example: Missing Values Handling Method
Attribute name    Data type   Handling method
Sex               Nominal     Replace by the mode value.
Age               Numeric     Replace by the mean value.
Religion          Nominal     Replace by the mode value.
Height            Numeric     Replace by the mean value.
Marital status    Nominal     Replace by the mode value.
Job               Nominal     Replace by the mode value.
Weight            Numeric     Replace by the mean value.
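The mean/mode rule in the table can be sketched in a few lines of Python. This is only an illustration: the records, the column names, and the helper `impute` are made up for the example, and `None` is assumed to mark a missing value.

```python
from statistics import mean, mode

def impute(records, numeric, nominal):
    """Return copies of the records with None replaced by the
    column mean (numeric columns) or column mode (nominal columns)."""
    fill = {}
    for col in numeric:
        fill[col] = mean(r[col] for r in records if r[col] is not None)
    for col in nominal:
        fill[col] = mode(r[col] for r in records if r[col] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in records
    ]

# Hypothetical records; None marks a missing value.
people = [
    {"sex": "F", "age": 30, "height": 160.0},
    {"sex": "M", "age": None, "height": 180.0},
    {"sex": None, "age": 40, "height": None},
    {"sex": "F", "age": 50, "height": 170.0},
]
cleaned = impute(people, numeric=["age", "height"], nominal=["sex"])
```

The fill values are computed from the observed values only, and the original records are left unchanged so that the imputed copy can be compared against them.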
Noisy Data
• Noise is a random error or variance in a measured variable
• Noisy data are data with a large amount of additional
meaningless information called noise. They are corrupted and
distorted data.
• How to handle?
– Binning method
– Regression
– Clustering
• Noisy data can be smoothed:
– Binning methods smooth a sorted data value by consulting its
neighborhood, that is, the values around it.
– They perform local smoothing
– The sorted values are distributed into bins
Noisy Data
• Noisy data can be smoothed:
– Equal-width (distance) partitioning:
• It divides the range into N intervals of equal size: uniform
grid
• If A and B are the lowest and highest values of the attribute,
the width of intervals will be W = (B-A)/N
• Skewed data is not handled well
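Equal-width partitioning with W = (B-A)/N can be sketched as follows. The function name `equal_width_bins` and both value lists are illustrative assumptions, not part of the slides; the second list shows why skewed data is handled poorly.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    if a == b:
        raise ValueError("all values are equal, so the width would be zero")
    width = (b - a) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # The maximum value B would fall just past the last interval,
        # so it is clamped into the last bin.
        idx = min(int((v - a) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins

width, bins = equal_width_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3)

# A single large value stretches the range and leaves the middle bin empty.
_, skewed_bins = equal_width_bins([1, 2, 3, 4, 100], 3)
```

In the skewed case almost every value ends up in the first bin, so the partition says very little about how the data are distributed.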
Noisy Data
• Noisy data can be smoothed:
– Equal-depth (frequency) partitioning
• It divides the range into N intervals, each containing
approximately the same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
– Smoothing by bin means: each value in a bin is replaced by the
mean value of the bin.
– Smoothing by bin median: replaced by the bin median
– Smoothing by bin boundaries: minimum and maximum
values in a given bin are identified as the bin boundaries
Binning-Example
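A minimal sketch of equal-depth binning followed by the three smoothing rules. The price list and the function names are illustrative assumptions; the input is assumed to be sorted already and to contain at least as many values as bins.

```python
from statistics import median

def equal_depth_bins(sorted_values, n_bins):
    """Split sorted values into n_bins consecutive bins of (nearly) equal size."""
    size, extra = divmod(len(sorted_values), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(sorted_values[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Each value is replaced by the closer of the bin minimum and maximum
    # (ties go to the minimum). Bins are sorted, so b[0] and b[-1] are the bounds.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical sorted prices
bins = equal_depth_bins(prices, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With three bins of depth 3, the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34]; smoothing by means turns them into 9, 22, and 29 repeated three times each, while smoothing by boundaries gives [4, 4, 15], [21, 21, 24], and [25, 25, 34].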
Regression
• Regression smooths data by conforming data values to a function.
• Linear regression involves finding the "best" line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
• Multiple linear regression: more than two attributes are involved
and the data are fit to a multidimensional surface.
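As a rough sketch of smoothing by linear regression, the code below fits the least-squares line through two attributes and replaces each observed y by the value on the line. The data and the name `fit_line` are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]  # hypothetical noisy measurements, roughly y = 2x
slope, intercept = fit_line(xs, ys)

# Smoothed values: each y is replaced by the prediction of the fitted line.
smoothed = [slope * x + intercept for x in xs]
```

The fitted line here has a slope close to 2, and the smoothed values lie exactly on it, which removes the small fluctuations in the original ys.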
Clustering
• Outliers may be detected as values that fall outside of the sets
of clusters.
Data Cleaning as a Process
• The first step in data cleaning as a process is discrepancy
detection
• Discrepancies can be caused by several factors:
– Poorly designed data entry forms
– Human error in data entry
– Deliberate errors
– Data decay, e.g., outdated addresses
– Inconsistent data representations and inconsistent use of codes
– Inconsistencies due to data integration, e.g., different names for
the same attribute in different databases
Data Cleaning as a Process
• Commercial tools can aid in the discrepancy detection
step
– Data scrubbing tools: use simple domain knowledge to detect
errors and make corrections
• Rely on parsing and fuzzy matching techniques
– Data auditing tools: find discrepancies by analyzing the data to
discover rules and relationships
• Detecting data that violate such conditions
• Once we find discrepancies, we typically need to define and
apply (a series of) transformations to correct them.
Data Cleaning as a Process
• Commercial tools can also assist in the data transformation step:
– ETL (extraction/transformation/loading) tools: allow simple
transformations, e.g., replace "gender" by "sex"
Data Integration
• Blending data from multiple sources into a coherent data store.
“Data integration is the combination of technical and business processes
used to combine data from disparate sources into meaningful and
valuable information.”
• Careful integration can help reduce and avoid redundancies and
inconsistencies.
• The semantic heterogeneity and structure of data pose great
challenges in data integration.
Data Integration
• In addition to detecting redundancies between attributes,
duplication should also be detected at the tuple level.
• The use of denormalized tables is another source of data
redundancy.
Data Integration
• Entity identification problem:
– There are a number of issues to consider during data integration,
including schema integration and object matching.
• How can equivalent real-world entities from multiple data
sources be matched up? This is the entity identification problem.
• Example: customer-id in one database and cust-number in another
may refer to the same attribute.
– Solution: metadata (data about data)
Data Integration
• Redundancy and Correlation Analysis :
– Redundancy is another important issue in data integration.
– Inconsistencies in attribute or dimension naming can also cause
redundancies in data integration.
– Some redundancies can be detected by correlation analysis.
• Chi-square test: nominal attributes
• Correlation coefficient/covariance: numeric attributes
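Both checks can be sketched with the standard library. The function names and all data below are illustrative assumptions: `pearson` computes the correlation coefficient for two numeric attributes, and `chi_square` computes the Pearson chi-square statistic for two nominal attributes from their contingency table.

```python
from collections import Counter
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def chi_square(a_values, b_values):
    """Chi-square statistic: sum of (observed - expected)^2 / expected
    over every cell of the contingency table of two nominal attributes."""
    n = len(a_values)
    observed = Counter(zip(a_values, b_values))
    a_counts, b_counts = Counter(a_values), Counter(b_values)
    stat = 0.0
    for a, ca in a_counts.items():
        for b, cb in b_counts.items():
            expected = ca * cb / n
            stat += (observed[(a, b)] - expected) ** 2 / expected
    return stat

# Hypothetical numeric attributes: the second is exactly twice the first.
r = pearson([1, 2, 3], [2, 4, 6])

# Hypothetical survey of 1500 people: gender vs. preferred reading.
gender = ["M"] * 300 + ["F"] * 1200
reading = ["fiction"] * 250 + ["nonfiction"] * 50 \
        + ["fiction"] * 200 + ["nonfiction"] * 1000
stat = chi_square(gender, reading)
```

A correlation close to +1 or -1, or a chi-square statistic far above the critical value for the table's degrees of freedom, suggests that one attribute is largely implied by the other and is a candidate for removal.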
Data Integration Approaches
• Data Consolidation
– Brings data together from several separate systems
– The goal is to reduce the number of data storage locations.
• Data Propagation
– Data propagation is the use of applications to copy data from
one location to another.
– It is event-driven and can be done synchronously or
asynchronously
Data Integration Approaches
• Data Virtualization
– Uses an interface to provide a near real-time, unified view of
data from disparate sources with different data models.
• Data Federation
– A form of data virtualization
– Uses a virtual database and creates a common data model for
heterogeneous data from different systems
• Data Warehousing
– Data warehouses are storage repositories for data
– Data warehousing implies the cleansing, reformatting, and
storage of data
Data Reduction
• Most machine learning and data mining techniques may not
be effective for high-dimensional data
• Data reduction techniques can be applied to obtain a reduced
representation of the data set that is much smaller in volume,
yet closely maintains the integrity of the original data.
• Analytics on the reduced data set should be more efficient yet
produce the same (or almost the same) analytical results.
Data Reduction
• Data reduction strategies include:
– Dimensionality reduction: process of reducing the number of
random variables or attributes under consideration.
• Methods: Attribute subset selection, wavelet transform and
principal components analysis.
– Numerosity reduction: replace the original data volume by
alternative, smaller forms of data representation.
• Methods:
– Parametric: a model is used to estimate the data, so
that typically only the data parameters need to be
stored, instead of the actual data.
» Regression and log-linear models
Data Reduction
• Data reduction strategies include (continued):
– Numerosity reduction (continued):
• Nonparametric: methods for storing reduced representations
of the data include histograms, clustering, sampling, and
data cube aggregation
– Data compression: transformations are applied so as to
obtain a reduced or "compressed" representation of the
original data.
Data Reduction
– Lossless compression: the original data can be
reconstructed from the compressed data without any
information loss.
– Lossy compression: only an approximation of the
original data can be reconstructed.
Data Reduction- Wavelet Transform
• The discrete wavelet transform (DWT), when applied to a data vector X,
transforms it to a numerically different vector, X', of wavelet coefficients.
– The two vectors are of the same length.
• The wavelet transform is a mathematical tool used in signal
processing and data analysis that decomposes a signal into its
constituent parts at different frequency levels.
• "How can this technique be useful for data reduction if the wavelet-
transformed data are of the same length as the original data?"
• A compressed approximation of the data can be retained by
storing only a small fraction of the strongest wavelet
coefficients.
Data Reduction- Wavelet Transform
• Given a set of coefficients, an approximation of the original
data can be constructed by applying the inverse of the DWT
used.
• Wavelet transforms give good results on sparse or skewed data
and on data with ordered attributes.
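The idea can be sketched with the Haar wavelet, the simplest DWT. This is an illustration under stated assumptions: the slides do not name a particular wavelet, the input length is assumed to be a power of two, and the function names and data are made up for the example.

```python
from math import sqrt

def haar_dwt(data):
    """Full Haar decomposition; len(data) must be a power of two."""
    coeffs = list(data)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        # Pairwise scaled sums (smooth part) and differences (detail part).
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / sqrt(2) for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / sqrt(2) for i in range(half)]
        coeffs[:n] = avg + diff
        n = half
    return coeffs

def haar_idwt(coeffs):
    """Inverse of haar_dwt."""
    data = list(coeffs)
    n = 1
    while n < len(data):
        merged = []
        for a, d in zip(data[:n], data[n:2 * n]):
            merged += [(a + d) / sqrt(2), (a - d) / sqrt(2)]
        data[:2 * n] = merged
        n *= 2
    return data

def keep_strongest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

x = [5.0, 5.0, 5.0, 5.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical data vector
coeffs = haar_dwt(x)                 # same length as x
compressed = keep_strongest(coeffs, 2)
approx = haar_idwt(compressed)       # reconstruction from 2 of 8 coefficients
```

For this piecewise-constant vector only two coefficients are nonzero, so keeping those two reproduces the data; for less regular data the reconstruction is an approximation whose quality depends on how many coefficients are kept.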
Data Reduction: PCA
• Suppose that the data to be reduced consist of tuples or data
vectors described by n attributes or dimensions.
• PCA searches for k n-dimensional orthogonal vectors that can best
be used to represent the data, where k is less than or equal to
n.
• The goal is to find a projection that captures the largest amount
of variation in the data.
Data Reduction: PCA
• PCA is a technique for forming new variables which
are linear composites of the original variables.
– Reduce the dimensionality of a data set by finding a new set
of variables.
• PCs may be used as inputs to multiple regression and
cluster analysis.
• PCA tends to be better at handling sparse data,
whereas wavelet transforms are more suitable for
data of high dimensionality.
Data Reduction: PCA
• Retains most of the sample's information (the variation
present in the sample, given by the correlations between the
original variables)
– The new variables are called principal components and the
values of the new variables are called principal component
scores.
• PCA is a dimensionality reduction technique that transforms a
dataset into a set of orthogonal components, known as
principal components.
Data Reduction: PCs
• PCs are a series of linear least squares fits to a sample, each
orthogonal to all the previous.
– The 1st PC is a minimum distance fit to a line in X space
– The 2nd PC is a minimum distance fit to a line in the plane
perpendicular to the 1st PC
Data Reduction: PCA- How it works
• Find the eigenvectors of the covariance matrix
• The eigenvectors define the new space
• Covariance Matrix: PCA starts by computing the covariance matrix
of the data, which represents how different dimensions (features)
of the data vary together.
• Eigenvalues: indicate the amount of variance captured by each
principal component.
• Eigenvectors: represent the directions of the axes along which the
data varies the most.
• PCs may be used as inputs to multiple regression and cluster
analysis.
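The steps above can be sketched without external libraries. Note the substitution: instead of a full eigendecomposition, this sketch uses power iteration, which finds only the eigenvector with the largest eigenvalue (the first PC). The data and function names are illustrative assumptions.

```python
from math import sqrt

def covariance_matrix(rows):
    """Return (centered rows, sample covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return centered, cov

def first_pc(cov, iterations=100):
    """Power iteration: repeatedly multiply by cov and renormalize.
    Caveat: fails if the start vector is orthogonal to the first PC."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cv[i] for i in range(d))  # variance along v
    return eigenvalue, v

# Hypothetical data in which the second feature is exactly twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered, cov = covariance_matrix(rows)
eigenvalue, pc1 = first_pc(cov)

# Share of total variance captured by the first PC, and the PC scores.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Because the two features are perfectly correlated, the first PC points along (1, 2) and captures essentially all of the variance, so one score per object is enough to represent the data. In practice a library eigen-solver such as `numpy.linalg.eigh` would normally be used to obtain all components at once.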
Data Reduction: Feature Subset Selection
• Another way to reduce dimensionality of data
• Redundant features
– Duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of sales
tax paid
• Irrelevant features
– Contain no information that is useful for the data mining task at
hand
– Example: a student's ID is often irrelevant to the task of
predicting the student's GPA
Data Reduction: Feature Subset Selection
• Techniques
– Brute-force approach:
• Try all possible feature subsets as input to the machine learning
algorithm
– Embedded approaches:
• Feature selection occurs naturally as part of the machine learning
algorithm
– Filter approaches:
• Features are selected before the machine learning algorithm is run
– Wrapper approaches:
• Use the machine learning algorithm as a black box to find the best
subset of attributes
Data Reduction: Histograms
• Histograms use binning to approximate data distributions and
are a popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution
of A into disjoint subsets, referred to as buckets or bins.
• Example: a sorted list of prices: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12,
14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20,
20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
Figure: a histogram for price using singleton buckets; each bucket
represents one price value/frequency pair.
Data Reduction: Histograms
• How are the buckets determined and the attribute values
partitioned?
• Partitioning rules:
– Equal-width: In an equal-width histogram, the width of each
bucket range is uniform.
– Equal-frequency (or equal-depth): In an equal-frequency
histogram, the buckets are created so that, roughly, the
frequency of each bucket is constant.
• Histograms are highly effective at approximating both sparse
and dense data, as well as highly skewed and uniform data.
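Using the sorted price list from the histogram example, singleton buckets and equal-width buckets can be computed as follows. The bucket width of 10 and the helper name `equal_width_histogram` are assumptions chosen for illustration.

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12,
          14, 14, 14, 15, 15, 15, 15, 15, 15,
          18, 18, 18, 18, 18, 18, 18, 18,
          20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Singleton buckets: one value/frequency pair per distinct price.
singleton = Counter(prices)

def equal_width_histogram(values, width):
    """Count positive integer values per bucket; the first bucket covers
    1..width, the second width+1..2*width, and so on."""
    counts = Counter((v - 1) // width for v in values)
    return {(b * width + 1, (b + 1) * width): c for b, c in sorted(counts.items())}

buckets = equal_width_histogram(prices, 10)
```

The 52 prices reduce to three bucket counts: 13 values in 1 to 10, 25 in 11 to 20, and 14 in 21 to 30, which is the reduced representation that would be stored in place of the raw values.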
Application of DR
• Text mining
• Image retrieval
• Microarray data analysis
• Protein classification
• Face recognition
• Handwritten digit recognition
• Intrusion detection
Data Reduction-Example
Data Transformation and Discretization
• Data are transformed or consolidated into forms appropriate for
mining
• Involves the following:
– Smoothing: remove noise from the data – binning, regression,
and clustering
– Attribute construction: new attributes are constructed from the
given set of attributes
– Aggregation: summary or aggregation operations are applied to
the data
Data Transformation and Discretization
• Data are transformed or consolidated into forms appropriate
for mining
• Involves the following:
– Normalization: scale the attribute values to fall within a small
specified range, such as -1.0 to 1.0 or 0.0 to 1.0
– Discretization: raw values of a numeric attribute (e.g., age) are
replaced by interval labels
– Concept hierarchy generation for nominal data: e.g., street can
be generalized to higher-level concepts, like city or country
Data Transformation by Normalization
• To help avoid dependence on the choice of measurement
units, the data should be normalized or standardized.
• Normalizing the data attempts to give all attributes an equal
weight.
• Normalization is useful for classification (e.g., neural networks,
nearest-neighbor) and clustering.
• Methods for data normalization:
– Min-max normalization
– Z-score normalization
– Normalization by decimal scaling
Data Transformation by Normalization
• Min-max normalization performs a linear transformation on
the original data.
• Min-max normalization preserves the relationships among the
original data values.
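The three normalization methods listed earlier can be sketched as follows. Min-max maps v to (v - min) / (max - min) * (new_max - new_min) + new_min, z-score maps v to (v - mean) / standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names and sample values are illustrative assumptions, and the z-score here uses the population standard deviation.

```python
from math import sqrt

def min_max(values, new_min=0.0, new_max=1.0):
    """Linear rescaling of [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

incomes = [12000, 73600, 98000]          # hypothetical attribute values
scaled = min_max(incomes)                # 73600 maps to about 0.716
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
shifted = decimal_scaling([-986, 917])   # j = 3
```

Min-max needs the true minimum and maximum and is sensitive to outliers, which is one reason z-score normalization is often preferred when the range is unknown or extreme values are present.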
Discretization by Binning
• Binning is a top-down splitting technique based on a specified
number of bins.
• These methods are also used as discretization methods for
data reduction and concept hierarchy generation.
Case-Study
Online data preprocessing: a case study approach
(Mohammed et al., 2019)
• The authors applied preprocessing to social data about Flight MH370.
After preprocessing, they used the resulting data to examine the flight
community structure, discover types of social relationships, reveal
the truth behind some of the unusual events, and study people's
coping behavior (adaptation patterns) during the disaster.