Data Preprocessing
January 20, 2015
Data Mining: Concepts and Techniques
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
e.g., Occupation = " " (a missing value)
noisy: containing errors or outliers
e.g., Salary = "-10" (an error)
inconsistent: containing discrepancies in codes or names
e.g., Age = "42", Birthday = "03/07/1997"
e.g., Was rating "1, 2, 3", now rating "A, B, C"
e.g., discrepancy between duplicate records
Why Is Data Dirty?
Incomplete data may come from
"Not applicable" data value when collected
Different considerations between the time when the data was collected and when it is analyzed
Human/hardware/software problems
Noisy data (incorrect values) may come from
Faulty data collection instruments
Human or computer error at data entry
Errors in data transmission
Inconsistent data may come from
Different data sources
Functional dependency violation (e.g., modify some linked data)
Duplicate records also need data cleaning
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
Quality decisions must be based on quality data
e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
Data warehouse needs consistent integration of
quality data
Data extraction, cleaning, and transformation
comprises the majority of the work of building a
data warehouse
Multi-Dimensional Measure of Data Quality
Measures for data quality: a multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, ...
Consistency: some modified but some not, dangling, ...
Timeliness: timely update?
Believability: how far can the data be trusted to be correct?
Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Forms of Data Preprocessing
[Figure: the forms of data preprocessing: cleaning, integration, transformation, and reduction]
Data Cleaning
Importance
"Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
"Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of entry
failure to register history or changes of the data
Missing data may need to be inferred
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill in it automatically with
a global constant: e.g., "unknown" (a new class?!)
the attribute mean
the attribute mean for all samples belonging to the same class: smarter
the most probable value: inference-based, such as a Bayesian formula or decision tree
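The automatic fill-in strategies above can be sketched in plain Python; the sample records and function names below are illustrative, not from the slides:

```python
from statistics import mean

# Hypothetical records: (class/segment, income); None marks a missing income.
rows = [("A", 50.0), ("A", None), ("B", 80.0), ("B", 90.0), ("B", None)]

def fill_with_mean(rows):
    """Fill each missing income with the overall attribute mean."""
    m = mean(inc for _, inc in rows if inc is not None)
    return [(seg, m if inc is None else inc) for seg, inc in rows]

def fill_with_class_mean(rows):
    """Fill each missing income with the mean of its own class (smarter)."""
    by_class = {}
    for seg, inc in rows:
        if inc is not None:
            by_class.setdefault(seg, []).append(inc)
    means = {seg: mean(v) for seg, v in by_class.items()}
    return [(seg, means[seg] if inc is None else inc) for seg, inc in rows]
```

The per-class variant assumes every class with a missing value also has at least one observed value.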
Noisy Data
Noise: random error or variance in a measured
variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
Simple Discretization Methods: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N
The most straightforward, but outliers may dominate
presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing
approximately same number of samples
Good data scaling
Managing categorical attributes can be tricky
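Both partitioning schemes can be sketched in a few lines; the function names below are invented for illustration:

```python
def equal_width_bins(values, n):
    """Assign each value to one of n equal-width intervals: W = (B - A) / n."""
    a, b = min(values), max(values)
    w = (b - a) / n
    # Clamp the maximum value into the last bin.
    return [min(int((v - a) / w), n - 1) for v in values]

def equal_depth_bins(sorted_values, n):
    """Split already-sorted values into n bins of equal frequency.

    Assumes len(sorted_values) is divisible by n, for simplicity."""
    size = len(sorted_values) // n
    return [sorted_values[i * size:(i + 1) * size] for i in range(n)]
```

With the price data from the next slide, equal-depth partitioning with n = 3 yields the bins {4, 8, 9, 15}, {21, 21, 24, 25}, {26, 28, 29, 34}.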
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
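The two smoothing steps in the worked example can be reproduced with a short sketch (helper names are illustrative; means are rounded to whole dollars, as on the slide):

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean (rounded)."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the nearer of the bin's min and max boundaries."""
    out = []
    for b in bins:
        lo, hi = min(b), max(b)
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out
```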
Regression
smooth by fitting the data into regression functions
[Figure: scatter plot with a fitted regression line y = x + 1 used to smooth the data]
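Regression smoothing can be sketched with a hand-rolled ordinary least-squares line fit (in practice a library routine would be used; the function names are illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

def smooth_by_regression(xs, ys):
    """Replace each noisy y with the value predicted by the fitted line."""
    a, b = fit_line(xs, ys)
    return [a * x + b for x in xs]
```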
Clustering
detect and remove outliers
Cluster Analysis
[Figure: data points grouped into clusters; values falling outside the clusters may be treated as outliers]
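A minimal sketch of clustering-based outlier detection, assuming cluster centers have already been computed by some clustering algorithm (the centers, threshold, and data below are illustrative):

```python
def detect_outliers(values, centers, threshold):
    """Flag values whose distance to the nearest cluster center exceeds threshold."""
    return [v for v in values
            if min(abs(v - c) for c in centers) > threshold]
```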
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with possible outliers)
Problems
3.3 Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i. Use smoothing by bin means and boundaries to smooth the data, using a bin depth of 3. Illustrate your steps.
ii. How might you determine the outliers?
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)
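The uniqueness and null rules mentioned under discrepancy detection can be sketched as simple record checks (the field names and rule set below are illustrative):

```python
def check_rules(records, key, required):
    """Discrepancy detection sketch: uniqueness rule on `key`,
    null rule on each field in `required`."""
    seen, violations = set(), []
    for r in records:
        if r[key] in seen:
            violations.append(("duplicate key", r))
        seen.add(r[key])
        for f in required:
            if r.get(f) in (None, ""):
                violations.append(("null " + f, r))
    return violations
```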