Data Mining:
Concepts and Techniques
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Quality: Why Preprocess the Data?
◼ Measures for data quality: A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, dangling, …
◼ Timeliness: is the data updated in a timely manner?
◼ Believability: how much are the data trusted to be correct?
◼ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Cleaning
◼ Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the
time of entry
◼ no record kept of the history or changes of the data
◼ Missing data may need to be inferred
How to Handle Missing Data?
◼ Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
◼ a global constant : e.g., “unknown”, a new class?!
◼ the attribute mean
◼ the attribute mean for all samples belonging to the
same class
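A minimal sketch of these automatic filling strategies, assuming pandas; the DataFrame, the income attribute, and the class column below are illustrative, not from the slides:

```python
import pandas as pd
import numpy as np

# Toy data with missing income values (illustrative only)
df = pd.DataFrame({
    "income": [48_000, np.nan, 52_000, np.nan, 61_000],
    "class":  ["A", "A", "B", "B", "B"],
})

# Fill with a global constant (a sentinel value)
filled_constant = df["income"].fillna(-1)

# Fill with the overall attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples belonging to the same class
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
```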
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitation
◼ inconsistency in naming conventions
◼ Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ first sort data and partition into (equal-frequency) bins
◼ then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
◼ Regression
◼ smooth by fitting the data into regression functions
◼ Clustering
◼ detect and remove outliers
◼ Combined computer and human inspection
◼ detect suspicious values and check by human (e.g.,
deal with possible outliers)
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading
◼ Check uniqueness rule, consecutive rule and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
◼ Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and
clustering to find outliers)
◼ Data migration and integration
◼ Data migration tools: allow transformations to be specified
◼ ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
◼ Integration of the two processes
◼ Iterative and interactive
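A small pandas sketch of rule-based discrepancy detection; the column names and the Canadian postal-code pattern are assumptions for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "cust_id":     [101, 102, 102, 104],
    "postal_code": ["V5A 1S6", None, "99999", "V6B 4N6"],
})

# Uniqueness rule: cust_id must not repeat
duplicate_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: postal_code must be present
missing_postal = df[df["postal_code"].isna()]

# Domain rule: postal codes must match the expected format
pattern = r"^[A-Z]\d[A-Z] \d[A-Z]\d$"
bad_format = df[~df["postal_code"].fillna("").str.match(pattern)]
```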
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources
◼ Detecting and resolving data value conflicts
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales
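A minimal pandas sketch of these steps: renaming resolves the schema mismatch (cust-id vs. cust-#), and concatenation builds the coherent store. The table and attribute names are made up for illustration:

```python
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2], "revenue": [250.0, 180.0]})
source_b = pd.DataFrame({"cust_#": [2, 3], "rev_usd": [175.0, 320.0]})

# Schema integration: map B's attribute names onto A's (cust-id corresponds to cust-#)
source_b = source_b.rename(columns={"cust_#": "cust_id", "rev_usd": "revenue"})

# Combine into one coherent store; customer 2 now has two revenue values,
# a data value conflict that still needs to be resolved
combined = pd.concat([source_a, source_b], ignore_index=True)
```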
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
◼ Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes
◼ Wavelet transforms
◼ Principal Components Analysis (PCA)
◼ Feature subset selection, feature creation
◼ Numerosity reduction (some simply call it: Data Reduction)
◼ Regression and Log-Linear Models
◼ Histograms, clustering, sampling
◼ Data cube aggregation
◼ Data compression
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and the distance between points, which are critical to clustering and outlier
analysis, become less meaningful
◼ The possible combinations of subspaces will grow exponentially
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
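As one concrete example of dimensionality reduction, a short PCA sketch with scikit-learn on synthetic data; the array shape and the number of components kept are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 tuples, 10 attributes

pca = PCA(n_components=3)             # keep the 3 strongest components
X_reduced = pca.fit_transform(X)      # shape (100, 3)

print(pca.explained_variance_ratio_)  # fraction of variance retained by each component
```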
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller
forms of data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
◼ Non-parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling, …
Parametric Data Reduction
◼ Linear regression
◼ Data modeled to fit a straight line
◼ Often uses the least-square method to fit the line
◼ Multiple regression
◼ Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
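A sketch of parametric reduction by linear regression, assuming numpy and synthetic data: a least-squares line is fitted and only its two parameters are kept in place of the raw points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=1_000)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)   # noisy linear data

# Least-squares fit: y ~ w*x + b
w, b = np.polyfit(x, y, deg=1)

# Two parameters (w, b) now stand in for 1,000 (x, y) pairs
print(w, b)
```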
Histogram Analysis
◼ Divide data into buckets and store the average (or sum) for each
bucket
◼ Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth): equal number of values per bucket
[Figure: equal-width histogram over example values from 10,000 to 100,000]
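A numpy sketch of histogram-based reduction on synthetic prices; only the bucket edges and counts are stored in place of the raw values:

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.integers(10_000, 100_000, size=10_000)

# Equal-width partitioning into 9 buckets
counts, edges = np.histogram(prices, bins=9)

# 'edges' and 'counts' summarize 10,000 values with 9 buckets
print(edges, counts)
```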
Clustering
◼ Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
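A scikit-learn sketch on synthetic data: cluster the points and keep only a compact representation, here the centroid and the radius (the largest member-to-centroid distance) as a stand-in for the diameter:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(5_000, 2))

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

# Radius of each cluster: largest distance from a member to its centroid
radii = [
    np.max(np.linalg.norm(X[km.labels_ == k] - centroids[k], axis=1))
    for k in range(10)
]
```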
Sampling
◼ Sampling: obtaining a small sample s to represent the
whole data set N
◼ Allows a mining algorithm to run in complexity that is
potentially sub-linear in the size of the data
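A minimal sketch of simple random sampling without replacement, assuming pandas and synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
data = pd.DataFrame({"income": rng.normal(54_000, 16_000, size=100_000)})

# Draw a 1% simple random sample without replacement
sample = data.sample(n=1_000, random_state=0)

# A mining algorithm can now run on 'sample' instead of the full data set
```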
Data Reduction 3: Data Compression
◼ String compression
◼ There are extensive theories and well-tuned algorithms
◼ Typically lossless, but only limited manipulation is
possible without expansion
◼ Audio/video compression
◼ Typically lossy compression, with progressive refinement
◼ Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
◼ Time sequences are not audio
◼ Typically short and varying slowly with time
◼ Dimensionality and numerosity reduction may also be
considered as forms of data compression
Data Compression
[Figure: lossless compression recovers the original data exactly from the
compressed data; lossy compression recovers only an approximation of it]
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]
    v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
◼ Z-score normalization (μ_A: mean, σ_A: standard deviation of A):
    v' = (v − μ_A) / σ_A
◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
◼ Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
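A numpy sketch of the three normalizations using the income figures from the examples above; μ and σ are taken from the slide rather than computed from the toy array:

```python
import numpy as np

income = np.array([12_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0   # 73,600 -> 0.716

# Z-score normalization with the slide's mu = 54,000, sigma = 16,000
zscore = (income - 54_000) / 16_000                      # 73,600 -> 1.225

# Decimal scaling: divide by 10**j so the largest |v'| is below 1
j = int(np.ceil(np.log10(np.abs(income).max())))         # j = 5
decimal = income / 10 ** j                               # 98,000 -> 0.98
```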
Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or
bottom-up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up
merge)
Simple Discretization: Binning
◼ Equal-width (distance) partitioning
◼ Divides the range into N intervals of equal size: uniform grid
◼ if A and B are the lowest and highest values of the attribute, the
width of intervals will be W = (B − A)/N
◼ The most straightforward, but outliers may dominate presentation
◼ Skewed data is not handled well
◼ Equal-depth (frequency) partitioning
◼ Divides the range into N intervals, each containing approximately
the same number of samples
◼ Good data scaling
◼ Managing categorical attributes can be tricky
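The two partitioning rules can be sketched with pandas, using the price values from the binning example that follows:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals with ~4 values each
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```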
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
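A plain-Python sketch that reproduces the equal-frequency binning and the two smoothing variants above:

```python
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace each value by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace each value by the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```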