Data Mining:
Concepts and Techniques
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Quality: Why Preprocess the Data?
◼ Measures for data quality: A multidimensional view
◼ Accuracy: correct or wrong, accurate or not
◼ Completeness: not recorded, unavailable, …
◼ Consistency: some modified but some not, dangling, …
◼ Timeliness: is the data updated in a timely manner?
◼ Believability: how much are the data trusted to be correct?
◼ Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Cleaning
◼ Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
◼ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
◼ e.g., Occupation=“ ” (missing data)
◼ noisy: containing noise, errors, or outliers
◼ e.g., Salary=“−10” (an error)
◼ inconsistent: containing discrepancies in codes or names, e.g.,
◼ Age=“42”, Birthday=“03/07/2010”
◼ Was rating “1, 2, 3”, now rating “A, B, C”
◼ discrepancy between duplicate records
◼ Intentional (e.g., disguised missing data)
◼ Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
◼ Data is not always available
◼ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
◼ Missing data may be due to
◼ equipment malfunction
◼ inconsistent with other recorded data and thus deleted
◼ data not entered due to misunderstanding
◼ certain data may not be considered important at the
time of entry
◼ no record kept of the history or changes of the data
◼ Missing data may need to be inferred
How to Handle Missing Data?
◼ Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
◼ a global constant : e.g., “unknown”, a new class?!
◼ the attribute mean
◼ the attribute mean for all samples belonging to the
same class
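A minimal sketch of these automatic filling strategies, assuming pandas; the DataFrame, the income attribute, and the class column below are illustrative, not from the slides:

```python
import pandas as pd
import numpy as np

# Toy data with missing income values (illustrative only)
df = pd.DataFrame({
    "income": [48_000, np.nan, 52_000, np.nan, 61_000],
    "class":  ["A", "A", "B", "B", "B"],
})

# Fill with a global constant (a sentinel value)
filled_constant = df["income"].fillna(-1)

# Fill with the overall attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# Fill with the attribute mean of samples belonging to the same class
filled_class_mean = df.groupby("class")["income"].transform(lambda s: s.fillna(s.mean()))
```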
Noisy Data
◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
◼ faulty data collection instruments
◼ data entry problems
◼ data transmission problems
◼ technology limitation
◼ inconsistency in naming conventions
◼ Other data problems which require data cleaning
◼ duplicate records
◼ incomplete data
◼ inconsistent data
How to Handle Noisy Data?
◼ Binning
◼ first sort data and partition into (equal-frequency) bins
◼ then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
◼ Regression
◼ smooth by fitting the data into regression functions
◼ Clustering
◼ detect and remove outliers
◼ Combined computer and human inspection
◼ detect suspicious values and check by human (e.g.,
deal with possible outliers)
Data Cleaning as a Process
◼ Data discrepancy detection
◼ Use metadata (e.g., domain, range, dependency, distribution)
◼ Check field overloading
◼ Check uniqueness rule, consecutive rule and null rule
◼ Use commercial tools
◼ Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
◼ Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and
clustering to find outliers)
◼ Data migration and integration
◼ Data migration tools: allow transformations to be specified
◼ ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
◼ Integration of the two processes
◼ Iterative and interactive
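A small pandas sketch of rule-based discrepancy detection; the column names and the Canadian postal-code pattern are assumptions for illustration only:

```python
import pandas as pd

df = pd.DataFrame({
    "cust_id":     [101, 102, 102, 104],
    "postal_code": ["V5A 1S6", None, "99999", "V6B 4N6"],
})

# Uniqueness rule: cust_id must not repeat
duplicate_ids = df[df["cust_id"].duplicated(keep=False)]

# Null rule: postal_code must be present
missing_postal = df[df["postal_code"].isna()]

# Domain rule: postal codes must match the expected format
pattern = r"^[A-Z]\d[A-Z] \d[A-Z]\d$"
bad_format = df[~df["postal_code"].fillna("").str.match(pattern)]
```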
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Integration
◼ Data integration:
◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
◼ Integrate metadata from different sources
◼ Entity identification problem:
◼ Identify real world entities from multiple data sources
◼ Detecting and resolving data value conflicts
◼ For the same real world entity, attribute values from different
sources are different
◼ Possible reasons: different representations, different scales
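A minimal pandas sketch of these steps: renaming resolves the schema mismatch (cust-id vs. cust-#), and concatenation builds the coherent store. The table and attribute names are made up for illustration:

```python
import pandas as pd

source_a = pd.DataFrame({"cust_id": [1, 2], "revenue": [250.0, 180.0]})
source_b = pd.DataFrame({"cust_#": [2, 3], "rev_usd": [175.0, 320.0]})

# Schema integration: map B's attribute names onto A's (cust-id corresponds to cust-#)
source_b = source_b.rename(columns={"cust_#": "cust_id", "rev_usd": "revenue"})

# Combine into one coherent store; customer 2 now has two revenue values,
# a data value conflict that still needs to be resolved
combined = pd.concat([source_a, source_b], ignore_index=True)
```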
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Reduction Strategies
◼ Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume yet produces the same (or almost the
same) analytical results
◼ Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
◼ Data reduction strategies
◼ Dimensionality reduction, e.g., remove unimportant attributes
◼ Wavelet transforms
◼ Principal Components Analysis (PCA)
◼ Feature subset selection, feature creation
◼ Numerosity reduction (some simply call it: Data Reduction)
◼ Regression and Log-Linear Models
◼ Histograms, clustering, sampling
◼ Data cube aggregation
◼ Data compression
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
◼ When dimensionality increases, data becomes increasingly sparse
◼ Density and the distance between points, which are critical to clustering and outlier
analysis, become less meaningful
◼ The possible combinations of subspaces will grow exponentially
◼ Dimensionality reduction
◼ Avoid the curse of dimensionality
◼ Help eliminate irrelevant features and reduce noise
◼ Reduce time and space required in data mining
◼ Allow easier visualization
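As one concrete example of dimensionality reduction, a short PCA sketch with scikit-learn on synthetic data; the array shape and the number of components kept are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 tuples, 10 attributes

pca = PCA(n_components=3)             # keep the 3 strongest components
X_reduced = pca.fit_transform(X)      # shape (100, 3)

print(pca.explained_variance_ratio_)  # fraction of variance retained by each component
```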
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller
forms of data representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard
the data (except possible outliers)
◼ Non-parametric methods
◼ Do not assume models
◼ Major families: histograms, clustering, sampling, …
Parametric Data Reduction
◼ Linear regression
◼ Data modeled to fit a straight line
◼ Often uses the least-square method to fit the line
◼ Multiple regression
◼ Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
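A sketch of parametric reduction by linear regression, assuming numpy and synthetic data: a least-squares line is fitted and only its two parameters are kept in place of the raw points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=1_000)
y = 3.0 * x + 5.0 + rng.normal(scale=0.5, size=x.size)   # noisy linear data

# Least-squares fit: y ~ w*x + b
w, b = np.polyfit(x, y, deg=1)

# Two parameters (w, b) now stand in for 1,000 (x, y) pairs
print(w, b)
```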
Histogram Analysis
◼ Divide data into buckets and store the average (or sum) for each
bucket
◼ Partitioning rules:
◼ Equal-width: equal bucket range
◼ Equal-frequency (or equal-depth): equal number of values per bucket
[Figure: equal-width histogram over example values from 10,000 to 100,000]
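A numpy sketch of histogram-based reduction on synthetic prices; only the bucket edges and counts are stored in place of the raw values:

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.integers(10_000, 100_000, size=10_000)

# Equal-width partitioning into 9 buckets
counts, edges = np.histogram(prices, bins=9)

# 'edges' and 'counts' summarize 10,000 values with 9 buckets
print(edges, counts)
```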
Clustering
◼ Partition data set into clusters based on similarity, and
store cluster representation (e.g., centroid and diameter)
only
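A scikit-learn sketch on synthetic data: cluster the points and keep only a compact representation, here the centroid and the radius (the largest member-to-centroid distance) as a stand-in for the diameter:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(5_000, 2))

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_

# Radius of each cluster: largest distance from a member to its centroid
radii = [
    np.max(np.linalg.norm(X[km.labels_ == k] - centroids[k], axis=1))
    for k in range(10)
]
```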
Sampling
◼ Sampling: obtaining a small sample s to represent the
whole data set N
◼ Allows a mining algorithm to run in complexity that is
potentially sub-linear in the size of the data
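A minimal sketch of simple random sampling without replacement, assuming pandas and synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
data = pd.DataFrame({"income": rng.normal(54_000, 16_000, size=100_000)})

# Draw a 1% simple random sample without replacement
sample = data.sample(n=1_000, random_state=0)

# A mining algorithm can now run on 'sample' instead of the full data set
```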
Data Reduction 3: Data Compression
◼ String compression
◼ There are extensive theories and well-tuned algorithms
◼ Typically lossless, but only limited manipulation is
possible without expansion
◼ Audio/video compression
◼ Typically lossy compression, with progressive refinement
◼ Sometimes small fragments of signal can be
reconstructed without reconstructing the whole
◼ Time sequences are not audio
◼ Typically short and varying slowly with time
◼ Dimensionality and numerosity reduction may also be
considered as forms of data compression
Data Compression
[Figure: lossless compression recovers the original data exactly from the
compressed data; lossy compression recovers only an approximation of it]
Chapter 3: Data Preprocessing
◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]
    v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
    Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
◼ Z-score normalization (μ_A: mean, σ_A: standard deviation of A):
    v' = (v − μ_A) / σ_A
◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
◼ Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
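A numpy sketch of the three normalizations using the income figures from the examples above; μ and σ are taken from the slide rather than computed from the toy array:

```python
import numpy as np

income = np.array([12_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0.0, 1.0]
mn, mx = income.min(), income.max()
minmax = (income - mn) / (mx - mn) * (1.0 - 0.0) + 0.0   # 73,600 -> 0.716

# Z-score normalization with the slide's mu = 54,000, sigma = 16,000
zscore = (income - 54_000) / 16_000                      # 73,600 -> 1.225

# Decimal scaling: divide by 10**j so the largest |v'| is below 1
j = int(np.ceil(np.log10(np.abs(income).max())))         # j = 5
decimal = income / 10 ** j                               # 98,000 -> 0.98
```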
Data Discretization Methods
◼ Typical methods: All the methods can be applied recursively
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis (unsupervised, top-down split or
bottom-up merge)
◼ Decision-tree analysis (supervised, top-down split)
◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up
merge)
Simple Discretization: Binning
◼ Equal-width (distance) partitioning
◼ Divides the range into N intervals of equal size: uniform grid
◼ if A and B are the lowest and highest values of the attribute, the
width of intervals will be W = (B − A)/N
◼ The most straightforward, but outliers may dominate presentation
◼ Skewed data is not handled well
◼ Equal-depth (frequency) partitioning
◼ Divides the range into N intervals, each containing approximately
the same number of samples
◼ Good data scaling
◼ Managing categorical attributes can be tricky
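The two partitioning rules can be sketched with pandas, using the price values from the binning example that follows:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width: 3 intervals of width W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth: 3 intervals with ~4 values each
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```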
Binning Methods for Data Smoothing
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
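A plain-Python sketch that reproduces the equal-frequency binning and the two smoothing variants above:

```python
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: replace each value by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace each value by the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```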