CIS 517: Data Mining and Warehousing
Week 4-5: Data Preprocessing (Chapter 3)
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete:
• lacking attribute values
• lacking certain attributes of interest
• containing only aggregate data
– noisy:
• containing errors or outliers
– inconsistent:
• containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of quality data
Forms of data preprocessing
• 1. Data cleaning
• 2. Data integration
• 3. Data transformation
• 4. Data reduction
Major Tasks in Data Preprocessing
• I. Data cleaning
– fill in missing values
– smooth noisy data
– identify or remove outliers
– resolve inconsistencies
• II. Data integration
– Integration of multiple databases, data cubes, or files
• III. Data transformation
– Normalization and aggregation
Major Tasks in Data Preprocessing (cont)
• IV. Data reduction
– Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results
I. Data Cleaning : Data Preprocessing
• Data cleaning tasks
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”,
a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class
to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-
based such as Bayesian formula or decision tree
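A minimal sketch of the mean and class-conditional mean strategies in pandas (the column and class names are illustrative, not from the slides):

```python
import pandas as pd

# Toy data: 'income' has missing values; 'class' is the class label.
df = pd.DataFrame({
    "income": [30_000, None, 52_000, None, 41_000, 75_000],
    "class":  ["low", "low", "high", "high", "low", "high"],
})

# Fill with the overall attribute mean.
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples belonging to the same class.
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)
print(df)
```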
Noisy Data
• Other data problems that require data cleaning
1. duplicate records
2. incomplete data
3. inconsistent data
How to Handle Noisy Data?
• 1.1 Binning method:
– first sort the data and partition it into (equi-depth) bins; then one can
• smooth by bin means
• smooth by bin median
• smooth by bin boundaries
• 1.2 Clustering
– detect and remove outliers
• 1.3 Regression
– smooth by fitting the data into regression functions
1.1 Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– It divides the range into N intervals of equal size:
uniform grid
– if A and B are the lowest and highest values of the
attribute, the width of intervals will be:
W = (B-A)/N.
1.1 Simple Discretization Methods: Binning
• Equal-depth (frequency) partitioning:
– It divides the range into N intervals, each
containing approximately same number of
samples
Binning Methods for Data Smoothing
• Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins: (size = 4)
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 [(4 + 8 + 9 + 15)/4 = 9]
- Bin 2: 23, 23, 23, 23 [(21 + 21 + 24 + 25)/4 ≈ 23]
- Bin 3: 29, 29, 29, 29 [(26 + 28 + 29 + 34)/4 ≈ 29]
* Smoothing by bin boundaries: (select min or max value)
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
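A small Python sketch of smoothing by bin means and bin boundaries for the example above (plain lists, no external libraries):

```python
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
size = 4
bins = [data[i:i + size] for i in range(0, len(data), size)]

# Smoothing by bin means: replace each value with its bin's mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the closer
# of the bin's minimum and maximum.
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```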
Binning Methods for Data Smoothing
• Sorted data for price (in dollars):
4, 8, 9, 12, 15, 21, 21, 21, 24, 25, 26, 26, 28, 29, 34
Partition into (equi-depth) bins: (size = 5)
- Bin 1: 4, 8, 9, 12, 15
- Bin 2: 21, 21, 21, 24, 25
- Bin 3: 26, 26, 28, 29, 34
* Smoothing by bin median:
- Bin 1: 9, 9, 9, 9, 9
- Bin 2: 21, 21, 21, 21, 21
- Bin 3: 28, 28, 28, 28, 28
1.2 Cluster Analysis
[Figure: data points grouped into clusters; values that fall outside every cluster are treated as outliers]
1.2 Cluster Analysis
• Partition data set into clusters, and one can
store cluster representation only
• Can be very effective if data is clustered but not
if data is “smeared”
• Can have hierarchical clustering and be stored
in multi-dimensional index tree structures
1.3 Regression
[Figure: data fitted to the regression line y = x + 1; the observed value Y1 at X1 is smoothed to the fitted value Y1']
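A sketch of smoothing by regression with NumPy: fit a line by least squares, then replace each observed value with its fitted value (the data points are illustrative):

```python
import numpy as np

# Illustrative noisy observations roughly following y = x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.3, 2.7, 4.4, 4.8, 6.1])

# Least-squares fit of a degree-1 polynomial: slope and intercept.
slope, intercept = np.polyfit(x, y, 1)

# Smooth the data: each y is replaced by its value on the fitted line.
y_smoothed = slope * x + intercept
print(slope, intercept)   # close to 1 and 1 for this data
print(y_smoothed)
```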
II. Data Integration : Data Preprocessing
• Data integration:
– combines data from multiple sources into a coherent
store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real world entities from multiple data sources,
– e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real world entity, attribute values from
different sources are different
– possible reasons: different representations, different
scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
• Redundant data often occur when integrating multiple databases
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table,
• e.g., annual revenue
• Redundant data may be able to be detected by
correlational analysis
• Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
III. Data Transformation : Data Preprocessing
1. Smoothing: remove noise from data
2. Aggregation: summarization, data cube construction
3. Generalization: concept hierarchy
4. Normalization: scaled to fall within a small, specified
range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
5. Attribute/feature construction
– New attributes constructed from the given ones
6. Discretization
Data Transformation: Normalization
• Min-max normalization: to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
– Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μA) / σA
– Ex. Let μ = 54,000, σ = 16,000. Then v = 73,600 is mapped to (73,600 - 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
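The three normalizations as a NumPy sketch, using the income values from the examples above:

```python
import numpy as np

v = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0].
minmax = (v - v.min()) / (v.max() - v.min()) * (1.0 - 0.0) + 0.0

# Z-score normalization with the slide's parameters.
mu, sigma = 54_000.0, 16_000.0
zscore = (v - mu) / sigma

# Decimal scaling: one way to find the smallest j with max(|v'|) < 1.
j = int(np.ceil(np.log10(np.abs(v).max() + 1)))
decimal = v / 10**j

print(minmax)   # 73,600 -> ~0.716
print(zscore)   # 73,600 -> 1.225
print(decimal)  # j = 5, so 73,600 -> 0.736
```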
Discretization
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
– reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).
Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization
Data Discretization Methods
• Typical methods:
– Binning
– Histogram analysis
– Clustering analysis
– Decision-tree analysis
Binning
• Attribute values (for one attribute e.g., age):
– 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equal-width binning – for bin width of e.g., 10:
– Bin 1: 0, 4 [-,10) bin
– Bin 2: 12, 16, 16, 18 [10,20) bin
– Bin 3: 24, 26, 28 [20,+) bin
• Equal-frequency binning – for bin density of e.g., 3:
– Bin 1: 0, 4, 12
– Bin 2: 16, 16, 18
– Bin 3: 24, 26, 28
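A sketch of both binning styles in pandas (pd.cut for equal-width, pd.qcut for equal-frequency); the interval labels mirror the slide:

```python
import pandas as pd

age = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width binning: intervals of width ~10.
equal_width = pd.cut(age, bins=[-1, 10, 20, 30], right=False,
                     labels=["[-,10)", "[10,20)", "[20,+)"])

# Equal-frequency binning: three bins of ~3 values each.
equal_freq = pd.qcut(age, q=3)

print(equal_width.value_counts())
print(equal_freq.value_counts())
```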
Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple levels of granularity
• Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
• Concept hierarchies can be automatically formed for both numeric and nominal data; for numeric data, use the discretization methods shown above
Concept Hierarchy Generation for Nominal Data
• Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit
data grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data
set
– The attribute with the most distinct values is placed at the
lowest level of the hierarchy
– Exceptions, e.g., weekday, month, quarter, year
country              15 distinct values
province_or_state    365 distinct values
city                 3,567 distinct values
street               674,339 distinct values
IV. Data Reduction : Data Preprocessing
• Warehouse may store terabytes of data:
– Complex data analysis/mining may take a very long time
to run on the complete data set
• Data reduction
– Obtains a reduced representation of the data set that is
much smaller in volume but yet produces the same
(or almost the same) analytical results
IV. Data Reduction : Data Preprocessing
• Data reduction strategies
4.1 Dimensionality reduction
4.2 Data compression
4.3 Numerosity reduction
4.1 Dimensionality Reduction
• Dimensionality reduction is the process of
reducing the number of random variables or
attributes under consideration.
• Attribute subset selection is a method of
dimensionality reduction in which irrelevant,
weakly relevant, or redundant attributes or
dimensions are detected and removed .
– Select a minimum set of features such that the probability
distribution of different classes given the values for those features is
as close as possible to the original distribution given the values of
all features
4.1 Dimensionality Reduction
• Heuristic methods
(due to exponential # of choices):
1. step-wise forward selection
2. step-wise backward elimination
3. combining forward selection and backward
elimination
4. decision-tree induction
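A minimal sketch of step-wise forward selection: greedily add the attribute that most improves cross-validated accuracy and stop when nothing helps (the dataset, classifier, and scorer are placeholders, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0

# Greedily add the attribute giving the biggest accuracy gain.
while remaining:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:
        break  # no remaining attribute improves the score; stop
    selected.append(f)
    remaining.remove(f)
    best_score = score

print("selected attribute indices:", selected)
```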
4.2 Data Compression
• Data compression is applied so as to obtain a reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the
compressed data without any information loss, the data
reduction is called lossless.
• If, instead, we can reconstruct only an approximation of
the original data, then the data reduction is called lossy.
• Dimensionality and numerosity reduction may also be
considered as forms of data compression.
Data Compression
Original Data Compressed
Data
lossless
os sy
l
Original Data
Approximated
Data Warehouse and Data Mining 36
4.3 Numerosity Reduction
• Reduce data volume by choosing alternative, smaller
forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
– Ex.: Log-linear models: obtain the value at a point in m-D space as the product of appropriate marginal subspaces
• Non-parametric methods
– Major families: histograms, clustering, sampling, …
Non-parametric methods 1. Histograms
[Figures: histograms approximate the data distribution by partitioning attribute values into buckets and storing only an aggregate (e.g., the count) per bucket]
Non-parametric methods 2. Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only.
• Can be very effective if data is clustered but not if data is
“smeared”.
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures.
• There are many choices of clustering definitions and
clustering algorithms.
[Figure: data points partitioned into three clusters C1, C2, C3, each stored by its cluster representation only]
Non-parametric methods 3. Sampling
• Sampling can be used as a data reduction technique
because it allows a large data set to be represented by a
much smaller random data sample (or subset).
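A sketch of simple random sampling with and without replacement (SRSWR / SRSWOR) using Python's standard library:

```python
import random

random.seed(42)  # reproducible illustration
data = list(range(1, 101))  # raw data: 100 tuples

# SRSWOR: simple random sample without replacement.
srswor = random.sample(data, k=10)

# SRSWR: simple random sample with replacement (duplicates possible).
srswr = random.choices(data, k=10)

print(srswor)
print(srswr)
```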
Sampling
[Figure: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement) drawn from the raw data]
Data Cube Aggregation
• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.
• Multiple levels of aggregation in data cubes
– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation which is enough to solve
the task
• Queries regarding aggregated information should be answered using the data cube, when possible
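A small pandas sketch of rolling a lower cube level up to a coarser one (quarterly sales aggregated to annual totals; the figures are illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 420, 380, 600],
})

# Roll up: collapse the quarterly level into one row per year.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)  # 2022 -> 1568, 2023 -> 1700
```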
Correlation Analysis (Nominal Data)
• Χ2 (chi-square) test:
χ² = Σ [(Observed - Expected)² / Expected]
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car-theft in a city are correlated
– Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)         450
Not like science fiction   50 (210)   1000 (840)        1050
Sum (col.)                300         1200              1500
• Χ2 (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):
χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are
correlated in the group
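Verifying the calculation in Python, deriving the expected counts from the row and column totals exactly as in the table:

```python
observed = [[250, 200],
            [50, 1000]]

total = sum(map(sum, observed))
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row_sum * col_sum / total.
chi2 = sum(
    (observed[i][j] - row_sums[i] * col_sums[j] / total) ** 2
    / (row_sums[i] * col_sums[j] / total)
    for i in range(2)
    for j in range(2)
)
print(round(chi2, 2))  # 507.94 (the slide truncates to 507.93)
```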
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment coefficient):
rA,B = Σi (ai - Ā)(bi - B̄) / ((n-1) σA σB) = (Σi ai·bi - n·Ā·B̄) / ((n-1) σA σB)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ(ai·bi) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
• rA,B = 0: A and B are uncorrelated (no linear relationship); rA,B < 0: negatively correlated
Visually Evaluating Correlation
[Figure: scatter plots illustrating correlation values ranging from –1 to 1]
Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between
objects
• To compute correlation, we standardize data objects,
A and B, and then take their dot product
a'k = (ak - mean(A)) / std(A)
b'k = (bk - mean(B)) / std(B)
correlation(A, B) = A' · B'
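A NumPy sketch of this view, reusing the stock data from the covariance example below; note the dot product of the standardized vectors must be divided by n to land in [-1, 1]:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize both data objects (population std, ddof=0).
a_std = (a - a.mean()) / a.std()
b_std = (b - b.mean()) / b.std()

# Dot product of the standardized vectors, scaled by n, gives r.
r = (a_std @ b_std) / len(a)
print(round(r, 3))                        # ~0.941
print(round(np.corrcoef(a, b)[0, 1], 3))  # same value, for comparison
```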
Covariance (Numeric Data)
• Covariance is similar to correlation:
Cov(A, B) = E[(A - Ā)(B - B̄)] = Σi (ai - Ā)(bi - B̄) / n
Correlation coefficient: rA,B = Cov(A, B) / (σA σB)
where n is the number of tuples, Ā and B̄ are the respective mean or expected values of A and B, and σA and σB are the respective standard deviations of A and B.
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value,
B is likely to be smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but are not independent.
Only under some additional assumptions (e.g., the data follow multivariate normal
distributions) does a covariance of 0 imply independence
Covariance: An Example
• It can be simplified in computation as Cov(A, B) = E(A·B) - E(A)·E(B)
• Suppose two stocks A and B have the following values in one week: (2, 5), (3,
8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their
prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
– Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
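Checking the arithmetic with NumPy (np.cov defaults to the sample covariance with n - 1 in the denominator; pass bias=True to divide by n as the slide does):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])    # stock A
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B

# Simplified form: Cov(A, B) = E(A*B) - E(A)*E(B).
cov = (a * b).mean() - a.mean() * b.mean()
print(round(cov, 2))  # 4.0 -> positive: A and B tend to rise together

# np.cov divides by n - 1 by default; bias=True divides by n as above.
print(round(np.cov(a, b, bias=True)[0, 1], 2))  # 4.0
```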
Summary
• Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
– Entity identification problem; Remove redundancies;
Detect inconsistencies
• Data reduction
– Dimensionality reduction; Numerosity reduction; Data
compression
• Data transformation and data discretization
– Normalization; Concept hierarchy generation