Lecture 3 & 4: Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
August 21, 2025 1
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking
certain attributes of interest, or containing only
aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes
or names
No quality data, no quality mining results!
Quality decisions must be based on quality data
Data warehouse needs consistent integration of
quality data
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or
remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data transformation
Normalization and aggregation
Data reduction
Obtains a representation that is reduced in volume but produces
the same or similar analytical results
Forms of data preprocessing [figure]
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Missing Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
data that was inconsistent with other recorded data and was therefore deleted
data not entered due to misunderstanding
data not considered important at the time of entry
history or changes of the data not being recorded
Missing data may need to be inferred.
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing
(assuming the task is classification); not effective when the
percentage of missing values per attribute varies considerably.
Fill in the missing value manually: tedious and often infeasible
Use a global constant to fill in the missing value: e.g.,
“unknown”, which the mining algorithm may then treat as a new class
Use the attribute mean to fill in the missing value
Use the attribute mean for all samples belonging to the same
class to fill in the missing value: smarter
Use the most probable value to fill in the missing value:
inference-based such as Bayesian formula or decision tree
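The simpler fill-in strategies above can be sketched in Python. This is only a minimal sketch under assumptions: the `income` values and `cls` labels are hypothetical illustration data, and a missing value is represented as `None`.

```python
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing values with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace missing values with the mean of the observed values
    that share the same class label (assumes every class has at
    least one observed value)."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v
            for v, c in zip(values, labels)]

# Hypothetical customer incomes (in thousands) with two gaps.
income = [30, None, 50, 100, None, 120]
cls = ["low", "low", "low", "high", "high", "high"]

print(fill_with_constant(income))
print(fill_with_mean(income))
print(fill_with_class_mean(income, cls))
```

The class-conditional version fills each gap with a value typical of its own class, which is why the slide calls it the smarter choice when class labels are available.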
Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems that require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning method:
first sort data and partition into (equi-depth) bins
then one can smooth by bin means, smooth by
bin median, smooth by bin boundaries, etc.
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values automatically and have a human check them
Regression
smooth by fitting the data to regression
functions
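The binning method can be sketched as follows. This is an illustrative sketch, not a reference implementation: the `prices` list and the bin depth of 4 are assumed example values, and ties in boundary smoothing are resolved toward the lower boundary by choice.

```python
def equi_depth_bins(values, depth):
    """Sort the values and partition them into bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max
    (ties go to the lower boundary)."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are already sorted
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# Assumed example data: twelve prices, three bins of four values each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, depth=4)

print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by bin medians follows the same pattern, with `statistics.median` in place of the mean.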
Data Mining: Concepts and Techniques
Data Integration
Data integration:
combines data from multiple sources into a
coherent store
Schema integration
integrate metadata from different sources
Entity identification problem: identify real world
entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
Detecting and resolving data value conflicts
for the same real world entity, attribute values
from different sources are different
possible reasons: different representations,
different scales, e.g., metric vs. British units
Redundant Data in Data Integration
Redundant data often occur when integrating
multiple databases
The same attribute may have different names in
different databases
One attribute may be a “derived” attribute in
another table
Redundant data can often be detected by
correlation analysis
Careful integration of the data from multiple sources
may help reduce/avoid redundancies and
inconsistencies and improve mining speed and
quality
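Correlation analysis for spotting redundant attributes might look like the sketch below. It uses the Pearson correlation coefficient for numeric attributes; the table contents, the attribute names, and the 0.95 threshold are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(table, threshold=0.95):
    """Return attribute pairs whose |correlation| reaches the threshold."""
    names = list(table)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(table[a], table[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs

# Hypothetical integrated table: annual_salary is derived from
# monthly_salary (multiplied by 12), so one of the two is redundant.
table = {
    "monthly_salary": [2000, 2500, 3000, 4000, 5500],
    "annual_salary": [24000, 30000, 36000, 48000, 66000],
    "age": [45, 23, 36, 52, 29],
}

for a, b, r in redundant_pairs(table):
    print(f"{a} and {b} are strongly correlated (r = {r:.3f})")
```

A high correlation only flags a candidate; whether one attribute is truly derived from the other still has to be confirmed from the schema or by inspection.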
Data Transformation
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: values are scaled to fall within a small,
specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Attribute/feature construction
New attributes constructed from the given ones
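The three normalization methods listed above can be sketched as follows. The formulas are the standard ones: min-max maps v to (v - min) / (max - min) scaled into a target range, z-score maps v to (v - mean) / standard deviation, and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names and sample values are illustrative assumptions, and the population standard deviation is used here by choice.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values from [min, max] to [new_min, new_max].
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation.
    Assumes the values are not all identical."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that
    every scaled absolute value is below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([0, 5, 10]))
print(z_score([1, 2, 3]))
print(decimal_scaling([-986, 917]))
```

Min-max preserves the relative spacing of the original values but is sensitive to outliers at either end, whereas z-score is the usual choice when the true minimum and maximum are unknown.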
Data Reduction Strategies
A warehouse may store terabytes of data, so complex
data analysis/mining may take a very long time to
run on the complete data set
Data reduction
Obtains a reduced representation of the data set
that is much smaller in volume but yet produces
the same (or almost the same) analytical results
Data reduction strategies
Data cube aggregation
Dimensionality reduction
Numerosity reduction
Discretization and concept hierarchy generation
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
Select a minimum set of features such that the
probability distribution of different classes given the
values for those features is as close as possible to the
original distribution given the values of all features
reduces the number of attributes appearing in the discovered
patterns, making them easier to understand
Heuristic methods (due to exponential # of choices):
step-wise forward selection
step-wise backward elimination
combining forward selection and backward
elimination
decision-tree induction
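Step-wise forward selection can be sketched as a greedy loop. This sketch makes several assumptions: features are discrete, the conditional entropy of the class given the selected features stands in for the "closeness of class distributions" criterion, and the tiny data set and the `min_gain` stopping threshold are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(rows, labels, features):
    """H(class | selected features), estimated from counts."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[j] for j in features)][label] += 1
    total = len(rows)
    h = 0.0
    for counts in groups.values():
        size = sum(counts.values())
        for c in counts.values():
            p = c / size
            h -= (size / total) * p * math.log2(p)
    return h

def forward_selection(rows, labels, min_gain=1e-9):
    """Start from the empty set and repeatedly add the feature that
    lowers the conditional entropy most; stop when nothing helps."""
    remaining = list(range(len(rows[0])))
    selected = []
    current = conditional_entropy(rows, labels, selected)
    while remaining:
        scores = {j: conditional_entropy(rows, labels, selected + [j])
                  for j in remaining}
        best = min(scores, key=scores.get)
        if current - scores[best] < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = scores[best]
    return selected

# Hypothetical data: the class is 1 when feature 0 or feature 2 is 1;
# feature 1 carries no information about the class.
rows = [(0, 0, 0), (0, 1, 0), (0, 0, 1), (1, 0, 0), (1, 1, 0),
        (1, 0, 1), (1, 1, 1), (1, 0, 0), (1, 1, 0)]
labels = [0, 0, 1, 1, 1, 1, 1, 1, 1]

print(forward_selection(rows, labels))
```

Backward elimination is the mirror image: start from the full feature set and repeatedly drop the feature whose removal hurts the score least. Both are heuristics and can miss feature sets that are only informative in combination.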
Data Compression
String compression
There are extensive theories and well-tuned
algorithms
Typically lossless
But only limited manipulation is possible without
expansion
Audio/video compression
Typically lossy compression, with progressive
refinement
Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
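The difference between lossless and lossy compression can be illustrated with a short sketch. It uses Python's standard `zlib` module for the lossless case; the lossy case is only a toy stand-in (rounding samples to one decimal place), and the sample text and signal values are assumptions.

```python
import zlib

# Lossless: the decompressed bytes are identical to the original.
text = ("data preprocessing " * 50).encode("utf-8")
compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

print(len(text), "bytes ->", len(compressed), "bytes")
print("exact round trip:", restored == text)

# Lossy (toy example): quantizing a signal discards detail that
# cannot be recovered, leaving only an approximation.
signal = [0.12, 0.48, 0.51, 0.97]
approx = [round(v, 1) for v in signal]

print("original:    ", signal)
print("approximated:", approx)
```

Note that the compressed bytes cannot be searched or edited directly; they must be expanded first, which is the "limited manipulation without expansion" caveat on the slide.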
Data Compression
[Figure: Original Data is encoded as Compressed Data; lossless decoding restores the Original Data, lossy decoding yields only an approximation of it]
Discretization
Three types of attributes:
Nominal: values from an unordered set
Ordinal: values from an ordered set
Continuous: real numbers
Discretization and Concept Hierarchy
Discretization
reduce the number of values for a given
continuous attribute by dividing the range of
the attribute into intervals
Interval labels can then be used to replace
actual data values
Concept hierarchies
reduce the data by collecting and replacing
low level concepts (such as numeric values for
the attribute age) by higher level concepts
(such as young, middle-aged, or senior).
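Both ideas can be sketched with the age example. The cut points 30 and 60 are hypothetical and chosen only for illustration; the same function yields either interval labels (discretization) or higher-level concepts (one level of a concept hierarchy), depending on which label list is passed in.

```python
from bisect import bisect_right

def discretize(value, cut_points, labels):
    """Map a numeric value to the label of the interval it falls in.
    `cut_points` must be sorted; len(labels) == len(cut_points) + 1."""
    return labels[bisect_right(cut_points, value)]

AGE_CUTS = [30, 60]  # hypothetical interval boundaries
AGE_INTERVALS = ["[0, 30)", "[30, 60)", "[60, +inf)"]
AGE_CONCEPTS = ["young", "middle-aged", "senior"]

ages = [23, 30, 45, 59, 60, 71]

print([discretize(a, AGE_CUTS, AGE_INTERVALS) for a in ages])
print([discretize(a, AGE_CUTS, AGE_CONCEPTS) for a in ages])
```

After this step the attribute has three distinct values instead of one per age, which is the reduction the slide describes.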
Discretization and concept hierarchy generation for numeric data
Binning
Histogram analysis
Clustering analysis
Entropy-based discretization
Segmentation by natural partitioning
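Of the methods listed, entropy-based discretization is the least obvious, so here is a sketch of its core step: choosing the single cut point that maximizes information gain with respect to the class label. A full method would apply this recursively to each resulting interval until a stopping criterion is met; the sample ages and labels are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_split(values, labels):
    """Return (cut_point, information_gain) for the best binary split,
    or (None, 0.0) if no split improves on the unsplit entropy."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    base = entropy(ys)
    best_gain, best_cut = 0.0, None
    for i in range(1, len(pairs)):
        if xs[i] == xs[i - 1]:
            continue  # no boundary between equal values
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(ys)
        gain = base - weighted
        if gain > best_gain:
            # Place the cut midway between the two neighboring values.
            best_gain, best_cut = gain, (xs[i - 1] + xs[i]) / 2
    return best_cut, best_gain

# Hypothetical data: does a customer of a given age buy the product?
ages = [22, 25, 28, 41, 45, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]

cut, gain = best_split(ages, buys)
print("cut point:", cut, "information gain:", gain)
```

Unlike equal-width or equal-depth binning, this approach is supervised: it uses the class labels to decide where the interval boundaries should fall.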
Summary
Data preparation is a big issue for both
warehousing and mining
Data preparation includes
Data cleaning and data integration
Data reduction and feature selection
Discretization
Many methods have been developed, but this is still an
active area of research