L6 Data Preprocessing
Prasanna S. Haddela
Senior Lecturer, Faculty of Computing, SLIIT

Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Knowledge Discovery in Databases (KDD)

Data Quality: Why Preprocess the Data?
Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Major Tasks in Data Preprocessing
◼ Data cleaning
  ◼ Fill in missing values, smooth noisy data, identify or remove
    outliers, and resolve inconsistencies
◼ Data integration
  ◼ Integration of multiple databases, data cubes, or files
◼ Data transformation and data discretization
  ◼ Normalization
  ◼ Concept hierarchy generation
◼ Data reduction
  ◼ Dimensionality reduction
  ◼ Numerosity reduction
  ◼ Data compression
Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data,
  e.g., instrument faulty, human or computer error, transmission error
  ◼ incomplete: lacking attribute values, lacking certain attributes of
    interest, or containing only aggregate data
    ◼ e.g., Occupation=“ ” (missing data)
  ◼ noisy: containing noise, errors, or outliers
    ◼ e.g., Salary=“−10” (an error)
  ◼ inconsistent: containing discrepancies in codes or names, e.g.,
    ◼ Age=“42”, Birthday=“03/07/2010”
    ◼ Was rating “1, 2, 3”, now rating “A, B, C”
    ◼ discrepancy between duplicate records
  ◼ intentional (e.g., disguised missing data)
    ◼ Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
◼ Data is not always available
  ◼ E.g., many tuples have no recorded value for several attributes,
    such as customer income in sales data
◼ Missing data may be due to
  ◼ equipment malfunction
  ◼ inconsistency with other recorded data, and thus deleted
  ◼ data not entered due to misunderstanding
  ◼ certain data not being considered important at the time of entry
  ◼ history or changes of the data not being registered
◼ Missing data may need to be inferred
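Common remedies for missing values include ignoring the tuple, filling in with a global constant, or filling in with the attribute mean. A minimal sketch in Python; the records and income figures are made-up examples:

```python
# Illustrative sketch: three common ways to handle a missing attribute
# value (here, "income"). All records and values are made up.
records = [
    {"name": "A", "income": 45000},
    {"name": "B", "income": None},   # missing, like Occupation=" "
    {"name": "C", "income": 55000},
]

# 1. Ignore the tuple (simple, but discards the rest of the record)
complete = [r for r in records if r["income"] is not None]

# 2. Fill in with a global constant such as "Unknown"
constant_filled = [dict(r, income=r["income"] if r["income"] is not None
                        else "Unknown")
                   for r in records]

# 3. Fill in with the attribute mean computed from the known values
known = [r["income"] for r in records if r["income"] is not None]
mean_income = sum(known) / len(known)
mean_filled = [dict(r, income=r["income"] if r["income"] is not None
                    else mean_income)
               for r in records]

print(mean_filled[1]["income"])  # 50000.0
```

Each option trades information loss for simplicity; mean-filling keeps the tuple but can bias the attribute's distribution.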
◼ Incorrect attribute values may be due to …
◼ After partitioning the sorted data into bins, one can smooth by bin
  means, smooth by bin medians, or smooth by bin boundaries
Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Reduction
◼ Data Integration
◼ Summary
15 16
Normalization Normalization
◼ Z-score normalization (μ: mean, σ: standard deviation): ◼ Normalization by decimal scaling
v − A v
v' = v' =
A 10 j
Where j is the smallest integer such that Max(|ν’|) < 1
◼ ◼
17 18
17 18
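The two formulas above can be sketched directly in Python; the sample values are made up:

```python
import math

# Minimal sketch of z-score normalization and decimal scaling.
values = [200, 300, 400, 600, 1000]

# Z-score normalization: v' = (v - mean) / std
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_scores = [(v - mean) / std for v in values]

# Decimal scaling: v' = v / 10^j, with j the smallest integer
# such that max(|v'|) < 1
j = 0
while max(abs(v) for v in values) / 10 ** j >= 1:
    j += 1
decimal_scaled = [v / 10 ** j for v in values]

print(j)               # 4, since the maximum |v| = 1000 needs 10^4
print(decimal_scaled)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```

Note that j = 3 would leave Max(|v'|) = 1.0, which violates the strict inequality, so scaling moves to 10^4.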
◼ Weka practical: Lab1.4 Normalization

Discretization
◼ Three types of attributes: nominal, ordinal, and numeric
Discretization Methods
◼ Typical methods (all of them can be applied recursively):
  ◼ Binning
    ◼ Top-down split, unsupervised
  ◼ Histogram analysis
    ◼ Top-down split, unsupervised
  ◼ Clustering analysis (unsupervised, top-down split or bottom-up merge)
  ◼ Decision-tree analysis (supervised, top-down split)
  ◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)

Simple Discretization: Binning
◼ Equal-width (distance) partitioning
  ◼ Divides the range into N intervals of equal size: a uniform grid
  ◼ If A and B are the lowest and highest values of the attribute, the
    width of the intervals will be W = (B − A)/N
  ◼ Bin boundaries fall at min + W, min + 2W, …, min + (N − 1)W
  ◼ The most straightforward method, but outliers may dominate the
    presentation
  ◼ Skewed data is not handled well
◼ Equal-depth (frequency) partitioning
  ◼ Divides the range into N intervals, each containing approximately
    the same number of samples
  ◼ Managing categorical attributes can be tricky
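Both partitioning schemes, plus smoothing by bin means, can be sketched in a few lines of Python; the data values are made-up examples:

```python
# Sketch of equal-width and equal-depth partitioning, and smoothing
# by bin means. The data values are illustrative only.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width: interval width W = (B - A) / N
A, B = min(data), max(data)
W = (B - A) / N  # (34 - 4) / 3 = 10.0
equal_width = [[v for v in data if A + i * W <= v < A + (i + 1) * W]
               for i in range(N - 1)]
equal_width.append([v for v in data if v >= A + (N - 1) * W])

# Equal-depth: each bin holds roughly the same number of samples
depth = len(data) // N
equal_depth = [data[i * depth:(i + 1) * depth] for i in range(N)]

# Smoothing by bin means: replace each value with its bin's mean
smoothed = [[sum(b) / len(b)] * len(b) for b in equal_depth]

print(equal_depth[0])  # [4, 8, 9, 15]
print(smoothed[0])     # [9.0, 9.0, 9.0, 9.0]
```

With skewed data the equal-width bins end up with very unequal counts, which is exactly the weakness noted above; equal-depth bins avoid it by construction.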
◼ Weka practical: Lab1.2 Binning

Discretization by Classification & Correlation Analysis
◼ Classification (e.g., decision tree analysis)
  ◼ Supervised: given class labels, e.g., cancerous vs. benign
Concept Hierarchy Generation
◼ A concept hierarchy organizes concepts (i.e., attribute values)
  hierarchically and is usually associated with each dimension in a
  data warehouse
◼ Concept hierarchies facilitate drilling and rolling in data
  warehouses to view data at multiple granularities
◼ Concept hierarchy formation: recursively reduce the data by
  collecting and replacing low-level concepts (such as numeric values
  for age) with higher-level concepts (such as youth, adult, or senior)
◼ Concept hierarchies can be explicitly specified by domain experts
  and/or data warehouse designers
◼ Concept hierarchies can be automatically formed for both numeric and
  nominal data; for numeric data, use the discretization methods shown

Automatic Concept Hierarchy Generation
◼ Some hierarchies can be automatically generated based on the
  analysis of the number of distinct values per attribute in the data set
  ◼ The attribute with the most distinct values is placed at the
    lowest level of the hierarchy

      country                 15 distinct values
      province_or_state      365 distinct values
      city                 3,567 distinct values
      street             674,339 distinct values
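The distinct-value heuristic above amounts to sorting attributes by their distinct-value counts. A small sketch, reusing the location counts from the example:

```python
# Sketch: order attributes into a concept hierarchy by distinct-value
# count (most distinct values -> lowest level of the hierarchy).
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Highest level first (fewest distinct values), lowest level last
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(hierarchy)  # ['country', 'province_or_state', 'city', 'street']
```

The heuristic is not foolproof (e.g., weekday has fewer distinct values than month but sits below it), so the generated ordering may need manual review.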
Data Preprocessing
◼ Data Transformation and Data Discretization
◼ Data Reduction

Data Reduction Strategies
◼ Wavelet transforms
◼ Feature subset selection, feature creation
◼ Numerosity reduction (some simply call it: data reduction)
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
  ◼ When dimensionality increases, data becomes increasingly sparse
  ◼ Density and distance between points, which are critical to
    clustering and outlier analysis, become less meaningful
  ◼ The possible combinations of subspaces grow exponentially
◼ Dimensionality reduction
  ◼ Avoids the curse of dimensionality
  ◼ Helps eliminate irrelevant features and reduce noise
  ◼ Reduces the time and space required in data mining
  ◼ Allows easier visualization
◼ Dimensionality reduction techniques
  ◼ Principal Component Analysis
  ◼ Supervised and nonlinear techniques (e.g., feature selection)
  ◼ Wavelet transforms

Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in
  the data
◼ The original data are projected onto a much smaller space, resulting
  in dimensionality reduction. We find the eigenvectors of the
  covariance matrix, and these eigenvectors define the new space

[Figure: data points in the (x1, x2) plane, with the principal
eigenvector e pointing along the direction of largest variation]
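The covariance-eigenvector procedure can be sketched without any libraries for the 2-D case, using the closed-form eigendecomposition of a symmetric 2×2 matrix; the data points are made-up examples spread mostly along the y = x direction:

```python
import math

# Minimal 2-D PCA sketch: center the data, form the covariance matrix,
# take its leading eigenvector as the new axis, and project onto it.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]
n = len(points)

# Center the data
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# 2x2 sample covariance matrix [[cxx, cxy], [cxy, cyy]]
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# Leading eigenvalue of a symmetric 2x2 matrix, then its eigenvector
lam = (cxx + cyy) / 2 + math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
ex, ey = cxy, lam - cxx          # unnormalized eigenvector direction
norm = math.hypot(ex, ey)
e = (ex / norm, ey / norm)       # first principal component, unit length

# Project each centered point onto e: a 1-D representation of the data
projected = [x * e[0] + y * e[1] for x, y in centered]
print(e)  # roughly (0.678, 0.735): close to the y = x diagonal
```

For real data one would use a linear-algebra library and keep the top k eigenvectors rather than just the first, but the steps are the same: center, compute covariance, eigendecompose, project.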
◼ Step-wise attribute elimination
◼ Mapping data to new space (see: data reduction)
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of data
  representation
◼ Parametric methods (e.g., regression)
  ◼ Assume the data fits some model, estimate the model parameters,
    store only the parameters, and discard the data (except possible
    outliers)
  ◼ Ex.: log-linear models: obtain the value at a point in
    m-dimensional space from appropriate marginal subspaces

Parametric Data Reduction: Regression and Log-Linear Models
◼ Linear regression
  ◼ Data modeled to fit a straight line
  ◼ Often uses the least-square method to fit the line
◼ Multiple regression
  ◼ Allows a response variable Y to be modeled as a linear function of
    a multidimensional feature vector
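Parametric reduction with linear regression can be sketched as follows: fit y = a·x + b by least squares, then keep only the two parameters instead of the points. The (x, y) pairs are made-up examples lying near a line:

```python
# Least-squares fit of y = a*x + b, illustrating parametric
# numerosity reduction. The data points are illustrative only.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
n = len(points)

sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

# Closed-form least-squares estimates of slope a and intercept b
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# The whole dataset is now represented by just (a, b); an approximate
# value at any x can be reconstructed from the model
approx = a * 3 + b
```

Five points collapse to two parameters here; for large relations the savings are what make regression attractive as a reduction technique, at the cost of only approximate reconstruction.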
[Figure: histogram with value axis ranging from 20,000 to 90,000]
Sampling: With or without Replacement

Sampling: Cluster or Stratified Sampling

[Figure: raw data partitioned for cluster/stratified sampling]
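The sampling variants named above can be sketched with the standard library's `random` module; the dataset and strata are made-up examples:

```python
import random

# Sketch of simple random sampling with and without replacement,
# plus a small stratified sample. All data and strata are made up.
random.seed(42)                 # fixed seed for reproducibility
data = list(range(1, 101))      # 100 illustrative records

# Without replacement: each record can appear at most once
srswor = random.sample(data, 10)

# With replacement: the same record may be drawn more than once
srswr = [random.choice(data) for _ in range(10)]

# Stratified: draw the same fraction (10%) from each stratum, so
# even small strata stay represented in a skewed dataset
strata = {"young": data[:30], "middle": data[30:80], "senior": data[80:]}
stratified = {name: random.sample(group, max(1, len(group) // 10))
              for name, group in strata.items()}

print(len(srswor), len(set(srswor)))  # 10 10  (no duplicates possible)
```

Sampling without replacement matches drawing records from the raw data and not returning them; with replacement, the same tuple may be picked repeatedly.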
[Figure: data compression — original data vs. approximated (lossy) data]

Data Preprocessing
◼ Data Quality
◼ Data Reduction
◼ Data Integration
◼ Summary
Data Integration

Handling Redundancy in Data Integration

Summary
◼ Data reduction
  ◼ Data compression
◼ Data integration from multiple sources:
  ◼ Entity identification problem
  ◼ Remove redundancies
  ◼ Detect inconsistencies