
Fundamentals of Data Mining
Data Preprocessing

Prasanna S. Haddela
Senior Lecturer
Faculty of Computing, SLIIT

Data Preprocessing

◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Data Transformation and Data Discretization
◼ Summary

Knowledge Discovery in Databases (KDD)

[Figure: the KDD process]

Data Quality: Why Preprocess the Data?

◼ Measures for data quality: a multidimensional view
  ◼ Accuracy: correct or wrong, accurate or not
  ◼ Completeness: not recorded, unavailable, …
  ◼ Consistency: some modified but some not, dangling, …
  ◼ Timeliness: updated in a timely manner?
  ◼ Believability: how much can the data be trusted to be correct?
  ◼ Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

◼ Data cleaning
  ◼ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
◼ Data transformation and data discretization
  ◼ Normalization
  ◼ Concept hierarchy generation
◼ Data reduction
  ◼ Dimensionality reduction
  ◼ Numerosity reduction
  ◼ Data compression
◼ Data integration
  ◼ Integration of multiple databases, data cubes, or files

Data Preprocessing

◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Data Cleaning

◼ Data in the real world is dirty: lots of potentially incorrect data, e.g., instrument faults, human or computer error, transmission errors
  ◼ incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    ◼ e.g., Occupation=“ ” (missing data)
  ◼ noisy: containing noise, errors, or outliers
    ◼ e.g., Salary=“−10” (an error)
  ◼ inconsistent: containing discrepancies in codes or names, e.g.,
    ◼ Age=“42”, Birthday=“03/07/2010”
    ◼ Was rating “1, 2, 3”, now rating “A, B, C”
    ◼ discrepancy between duplicate records
  ◼ intentional (e.g., disguised missing data)
    ◼ Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data

◼ Data is not always available
  ◼ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
◼ Missing data may be due to
  ◼ equipment malfunction
  ◼ inconsistency with other recorded data, leading to deletion
  ◼ data not entered due to misunderstanding
  ◼ certain data not considered important at the time of entry
  ◼ history or changes of the data not being registered
◼ Missing data may need to be inferred

How to Handle Missing Data?

◼ Ignore the tuple: usually done when the class label is missing (when doing classification)—not effective when the % of missing values per attribute varies considerably
◼ Fill in the missing value manually: tedious + infeasible?
◼ Fill it in automatically with
  ◼ a global constant: e.g., “unknown”, a new class?!
  ◼ the attribute mean
  ◼ the attribute mean for all samples belonging to the same class: smarter
  ◼ the most probable value: inference-based, such as a Bayesian formula or decision tree

◼ Weka practical: Lab1.1 Missing Data
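A minimal imputation sketch in Python with pandas, illustrating the attribute-mean and class-conditional-mean strategies above (the DataFrame and column names are invented for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical sales data with missing income values
df = pd.DataFrame({
    "income": [30000, np.nan, 52000, 61000, np.nan, 75000],
    "class":  ["low",  "low",  "high", "high", "high", "high"],
})

# Strategy 1: fill with the overall attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Strategy 2 (smarter): fill with the mean of samples in the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(df)
```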

Noisy Data

◼ Noise: random error or variance in a measured variable
◼ Incorrect attribute values may be due to
  ◼ faulty data collection instruments
  ◼ data entry problems
  ◼ data transmission problems
  ◼ technology limitations
  ◼ inconsistency in naming conventions

How to Handle Noisy Data?

◼ Binning
  ◼ first sort the data and partition it into (equal-frequency) bins
  ◼ then smooth by bin means, bin medians, bin boundaries, etc.
◼ Regression
  ◼ smooth by fitting the data to regression functions
◼ Clustering
  ◼ detect and remove outliers
◼ Combined computer and human inspection
  ◼ detect suspicious values and check by a human (e.g., deal with possible outliers)

◼ Weka practical: Lab1.3 Remove Outlier

Data Preprocessing

◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Data Transformation

◼ A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
◼ Methods
  ◼ Normalization: scaled to fall within a smaller, specified range
    ◼ min-max normalization
    ◼ z-score normalization
    ◼ normalization by decimal scaling
  ◼ Smoothing: remove noise from data
  ◼ Attribute/feature construction
    ◼ New attributes constructed from the given ones
  ◼ Aggregation: summarization, data cube construction
  ◼ Discretization

Normalization

◼ Min-max normalization: to [new_min_A, new_max_A]

      v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

  ◼ Ex. Let income range from 12,000 to 98,000, normalized to [0.0, 1.0]. Then 73,600 is mapped to

      (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
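A short Python sketch of min-max normalization that reproduces the income example above (the numbers are the illustrative values from the slide, not a real data set):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Income range 12,000–98,000 normalized to [0.0, 1.0]: 73,600 -> ~0.716
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```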

Normalization

◼ Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

      v' = (v − μ_A) / σ_A

  ◼ Ex. Let μ = 54,000 and σ = 16,000. Then 73,600 is mapped to

      (73,600 − 54,000) / 16,000 = 1.225

Normalization

◼ Normalization by decimal scaling:

      v' = v / 10^j,  where j is the smallest integer such that max(|v'|) < 1

  ◼ Ex.

      Salary    Formula             Normalized by decimal scaling
      73,000    73,000 / 100,000    0.73
      80,000    80,000 / 100,000    0.80
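A corresponding sketch for z-score normalization and decimal scaling, using the same illustrative numbers:

```python
def z_score_normalize(v, mean_a, std_a):
    """Center on the attribute mean and scale by the standard deviation."""
    return (v - mean_a) / std_a

def decimal_scaling_normalize(values):
    """Divide by the smallest power of 10 that brings every |v'| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(z_score_normalize(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling_normalize([73_000, 80_000]))           # [0.73, 0.8]
```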

◼ Weka practical: Lab1.4 Normalization

Discretization

◼ Three types of attributes
  ◼ Nominal—values from an unordered set, e.g., color, profession
  ◼ Ordinal—values from an ordered set, e.g., military or academic rank
  ◼ Numeric—real numbers, e.g., integer or real values
◼ Discretization: divide the range of a continuous attribute into intervals
  ◼ Interval labels can then be used to replace actual data values
  ◼ Reduce data size by discretization

Data Discretization Methods

◼ Typical methods (all can be applied recursively):
  ◼ Binning
    ◼ Top-down split, unsupervised
  ◼ Histogram analysis
    ◼ Top-down split, unsupervised
  ◼ Clustering analysis (unsupervised, top-down split or bottom-up merge)
  ◼ Decision-tree analysis (supervised, top-down split)
  ◼ Correlation (e.g., χ²) analysis (supervised, bottom-up merge)

Simple Discretization: Binning

◼ Equal-width (distance) partitioning
  ◼ Divides the range into N intervals of equal size: uniform grid
  ◼ If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B − A) / N
  ◼ Bin boundaries are min + W, min + 2W, …, min + (N − 1)W
  ◼ The most straightforward, but outliers may dominate the presentation
  ◼ Skewed data is not handled well
◼ Equal-depth (frequency) partitioning
  ◼ Divides the range into N intervals, each containing approximately the same number of samples
  ◼ Managing categorical attributes can be tricky
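A small pandas sketch contrasting equal-width and equal-depth partitioning, using the price values from the worked example on the next slide:

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width partitioning: 3 intervals, each of width W = (34 - 4) / 3 = 10
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: 3 intervals with ~4 values each
equal_depth = pd.qcut(prices, q=3)

print(pd.DataFrame({"price": prices,
                    "equal_width_bin": equal_width,
                    "equal_depth_bin": equal_depth}))
```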

Binning Methods for Data Smoothing

❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
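A sketch of the smoothing step itself, reproducing the bin means and bin boundaries above (bin means are rounded to the nearest integer, as on the slide; ties between boundaries go to the lower boundary here):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equal-frequency bins of 4

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the nearer boundary
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(by_means)        # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)   # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```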

◼ Weka practical: Lab1.2 Binning

Discretization by Classification & Correlation Analysis

◼ Classification (e.g., decision tree analysis)
  ◼ Supervised: given class labels, e.g., cancerous vs. benign
  ◼ Uses entropy to determine the split point (discretization point)
  ◼ Top-down, recursive split
◼ Correlation analysis (e.g., ChiMerge: χ²-based discretization)
  ◼ Supervised: uses class information
  ◼ Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
  ◼ Merging is performed recursively, until a predefined stopping condition is met

Concept Hierarchy Generation

◼ A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
◼ Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities
◼ Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
◼ Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
◼ Concept hierarchies can be automatically formed for both numeric and nominal data; for numeric data, use the discretization methods shown earlier

Automatic Concept Hierarchy Generation

◼ Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  ◼ The attribute with the most distinct values is placed at the lowest level of the hierarchy

      country                  15 distinct values
      province_or_state       365 distinct values
      city                  3,567 distinct values
      street              674,339 distinct values
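A minimal sketch of this heuristic, assuming a pandas DataFrame containing the four location columns named as above:

```python
import pandas as pd

def auto_hierarchy(df: pd.DataFrame, columns: list[str]) -> list[str]:
    """Order attributes from fewest to most distinct values:
    fewest distinct values = highest level of the concept hierarchy."""
    return sorted(columns, key=lambda col: df[col].nunique())

# With location data, this would typically yield an ordering like:
# ['country', 'province_or_state', 'city', 'street']
```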

Data Preprocessing

◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Data Reduction Strategies

◼ Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
◼ Why data reduction? A database/data warehouse may store terabytes of data; complex data analysis may take a very long time to run on the complete data set
◼ Data reduction strategies
  ◼ Dimensionality reduction, e.g., remove unimportant attributes
    ◼ Principal Components Analysis (PCA)
    ◼ Feature subset selection, feature creation
    ◼ Wavelet transforms
  ◼ Numerosity reduction (some simply call it: data reduction)
    ◼ Regression and log-linear models
    ◼ Histograms, clustering, sampling
    ◼ Data cube aggregation
  ◼ Data compression

Data Reduction 1: Dimensionality Reduction

◼ Curse of dimensionality
  ◼ When dimensionality increases, data becomes increasingly sparse
  ◼ Density and distance between points, which are critical to clustering and outlier analysis, become less meaningful
  ◼ The number of possible combinations of subspaces grows exponentially
◼ Dimensionality reduction
  ◼ Avoids the curse of dimensionality
  ◼ Helps eliminate irrelevant features and reduce noise
  ◼ Reduces the time and space required in data mining
  ◼ Allows easier visualization
◼ Dimensionality reduction techniques
  ◼ Principal Component Analysis
  ◼ Supervised and nonlinear techniques (e.g., feature selection)
  ◼ Wavelet transforms

Principal Component Analysis (PCA)

◼ Find a projection that captures the largest amount of variation in the data
◼ The original data are projected onto a much smaller space, resulting in dimensionality reduction. We find the eigenvectors of the covariance matrix, and these eigenvectors define the new space

[Figure: data points in the x1–x2 plane with the first principal component direction e drawn along the direction of greatest variation]
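A compact NumPy sketch of this idea (eigenvectors of the covariance matrix define the new space); the data matrix here is random and purely illustrative:

```python
import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of X onto the k directions of greatest variance."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)              # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)               # symmetric matrix, ascending eigenvalues
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # eigenvectors of the k largest eigenvalues
    return X_centered @ top_k                             # reduced representation

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 tuples, 5 attributes (illustrative only)
print(pca(X, 2).shape)          # (100, 2)
```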

[Figure: an intuitive illustration of PCA; source: http://mengnote.blogspot.com/2013/05/an-intuitive-explanation-of-pca.html]

Attribute Subset Selection

◼ Another way to reduce the dimensionality of data
◼ Redundant attributes
  ◼ Duplicate much or all of the information contained in one or more other attributes
  ◼ E.g., purchase price of a product and the amount of sales tax paid
◼ Irrelevant attributes
  ◼ Contain no information that is useful for the data mining task at hand
  ◼ E.g., students' ID is often irrelevant to the task of predicting students' GPA

Heuristic Search in Attribute Selection

◼ Typical heuristic attribute selection methods:
  ◼ Best single attribute under the attribute independence assumption: choose by significance tests
  ◼ Best step-wise feature selection (see the sketch after the next slide):
    ◼ The best single attribute is picked first
    ◼ Then the next best attribute conditioned on the first, ...
  ◼ Step-wise attribute elimination:
    ◼ Repeatedly eliminate the worst attribute
  ◼ Best combined attribute selection and elimination
  ◼ Optimal branch and bound:
    ◼ Use attribute elimination and backtracking

Attribute Creation (Feature Generation)

◼ Create new attributes (features) that can capture the important information in a data set more effectively than the original ones
◼ Three general methodologies
  ◼ Attribute extraction
    ◼ Domain-specific
  ◼ Mapping data to a new space (see: data reduction)
    ◼ E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
  ◼ Attribute construction
    ◼ Combining features (see: discriminative frequent patterns)
    ◼ Data discretization
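A minimal sketch of best step-wise (forward) feature selection, assuming an abstract scoring function such as cross-validated accuracy (the scorer is left as a placeholder, since the slides do not fix one):

```python
from typing import Callable, Iterable

def stepwise_forward_selection(all_features: Iterable[str],
                               score: Callable[[list[str]], float],
                               k: int) -> list[str]:
    """Greedily add the attribute that most improves the score until k are chosen."""
    selected: list[str] = []
    remaining = list(all_features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy scorer (illustration only): prefers shorter attribute names
print(stepwise_forward_selection(["age", "income", "customer_id"],
                                 score=lambda feats: -sum(len(f) for f in feats),
                                 k=2))   # ['age', 'income']
```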

Data Reduction 2: Numerosity Reduction

◼ Reduce data volume by choosing alternative, smaller forms of data representation
◼ Parametric methods (e.g., regression)
  ◼ Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  ◼ Ex.: log-linear models—obtain the value at a point in m-dimensional space as the product on appropriate marginal subspaces
◼ Non-parametric methods
  ◼ Do not assume models
  ◼ Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models

◼ Linear regression
  ◼ Data modeled to fit a straight line
  ◼ Often uses the least-squares method to fit the line
◼ Multiple regression
  ◼ Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
◼ Log-linear model
  ◼ Approximates discrete multidimensional probability distributions
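A small sketch of the parametric idea: fit a least-squares line and keep only its two parameters instead of the raw points (the data is synthetic, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(100, dtype=float)
y = 3.0 * x + 5.0 + rng.normal(scale=2.0, size=100)   # noisy line (illustrative data)

# Least-squares fit: the 100 (x, y) pairs are reduced to two stored parameters
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y ≈ {slope:.2f} * x + {intercept:.2f}")        # close to y = 3x + 5
```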

Histogram Analysis

◼ Divide data into buckets and store the average (or sum) for each bucket only
◼ Partitioning rules:
  ◼ Equal-width: equal bucket range
  ◼ Equal-frequency (or equal-depth)
◼ (a small equal-width bucketing sketch follows after the next slide)

[Figure: equal-width histogram over values from 10,000 to 100,000, with bucket counts on the vertical axis]

Clustering

◼ Partition the data set into clusters based on similarity, and store the cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if the data is clustered, but not if the data is “smeared”
◼ Can use hierarchical clustering and store it in multi-dimensional index tree structures
◼ There are many choices of clustering definitions and clustering algorithms
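A brief sketch of histogram-based reduction: bucket the values with equal width and store only a per-bucket summary (the values are synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
prices = rng.integers(10_000, 100_000, size=1_000)    # illustrative data

# Equal-width buckets; keep only (bucket range, count, mean) per bucket
edges = np.linspace(10_000, 100_000, num=10)           # 9 equal-width buckets
bucket_ids = np.digitize(prices, edges[1:-1])
for b in range(len(edges) - 1):
    in_bucket = prices[bucket_ids == b]
    print(f"[{edges[b]:>8.0f}, {edges[b+1]:>8.0f}): "
          f"count={in_bucket.size}, mean={in_bucket.mean():.0f}")
```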

Sampling

◼ Sampling: obtaining a small sample s to represent the whole data set N
◼ Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
◼ Key principle: choose a representative subset of the data
  ◼ Simple random sampling may have very poor performance in the presence of skew
  ◼ Develop adaptive sampling methods, e.g., stratified sampling
◼ Note: sampling may not reduce database I/Os (page at a time)

Types of Sampling

◼ Simple random sampling
  ◼ There is an equal probability of selecting any particular item
◼ Sampling without replacement
  ◼ Once an object is selected, it is removed from the population
◼ Sampling with replacement
  ◼ A selected object is not removed from the population
◼ Stratified sampling
  ◼ Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  ◼ Used in conjunction with skewed data
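A short pandas sketch of these sampling types, assuming a DataFrame with a categorical column named "stratum" (both the data and the names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"value": range(100),
                   "stratum": ["A"] * 80 + ["B"] * 20})   # skewed, illustrative data

without_replacement = df.sample(n=10, replace=False, random_state=0)
with_replacement    = df.sample(n=10, replace=True,  random_state=0)

# Stratified sampling: draw ~10% from each stratum, preserving the A/B proportions
stratified = df.groupby("stratum").sample(frac=0.1, random_state=0)
print(stratified["stratum"].value_counts())   # A: 8, B: 2
```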

Sampling: With or without Replacement

[Figure: raw data and the samples drawn with and without replacement]

Sampling: Cluster or Stratified Sampling

[Figure: raw data (left) and the corresponding cluster/stratified sample (right)]

Data Cube Aggregation

◼ The lowest level of a data cube (base cuboid)
  ◼ The aggregated data for an individual entity of interest
  ◼ E.g., a customer in a phone calling data warehouse
◼ Multiple levels of aggregation in data cubes
  ◼ Further reduce the size of the data to deal with (see the sketch after the next slide)
◼ Reference appropriate levels
  ◼ Use the smallest representation which is enough to solve the task
◼ Queries regarding aggregated information should be answered using the data cube, when possible

Data Compression

◼ String compression
  ◼ There are extensive theories and well-tuned algorithms
  ◼ Typically lossless, but only limited manipulation is possible without expansion
◼ Audio/video compression
  ◼ Typically lossy compression, with progressive refinement
  ◼ Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
◼ Time sequence is not audio
  ◼ Typically short and varies slowly with time
◼ Dimensionality and numerosity reduction may also be considered as forms of data compression
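A tiny pandas sketch of cube-style aggregation: rolling detailed call records up to a coarser level and keeping only the aggregates (the table and column names are invented for illustration):

```python
import pandas as pd

calls = pd.DataFrame({
    "customer": ["c1", "c1", "c2", "c2", "c2"],
    "month":    ["Jan", "Jan", "Jan", "Feb", "Feb"],
    "minutes":  [12, 7, 30, 4, 9],
})

# Aggregate from individual calls up to (customer, month) totals
monthly = calls.groupby(["customer", "month"], as_index=False)["minutes"].sum()
print(monthly)
```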

Data Compression

[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]

Data Preprocessing

◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Data Integration

◼ Data integration:
  ◼ Combines data from multiple sources into a coherent store
◼ Schema integration: e.g., A.cust-id ≡ B.cust-#
  ◼ Integrate metadata from different sources
◼ Entity identification problem:
  ◼ Identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
◼ Detecting and resolving data value conflicts
  ◼ For the same real-world entity, attribute values from different sources differ
  ◼ Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

◼ Redundant data occur often when integrating multiple databases
  ◼ Object identification: the same attribute or object may have different names in different databases
  ◼ Derivable data: one attribute may be a “derived” attribute in another table, e.g., annual revenue
◼ Redundant attributes may be detected by correlation analysis and covariance analysis
◼ Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
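A brief sketch of the correlation check mentioned above, using Pearson correlation between two numeric attributes (the columns are illustrative; a correlation near ±1 suggests one attribute is largely redundant):

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_price": [100.0, 250.0, 80.0, 410.0],
    "sales_tax":      [8.0,   20.0,  6.4,  32.8],   # derived: 8% of the price
})

corr = df["purchase_price"].corr(df["sales_tax"])    # Pearson correlation coefficient
print(f"correlation = {corr:.3f}")                   # 1.000 -> strongly redundant
```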

Data Preprocessing

◼ Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Summary

◼ Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
◼ Data cleaning: e.g., missing/noisy values, outliers
◼ Data transformation and data discretization
  ◼ Normalization
  ◼ Concept hierarchy generation
◼ Data reduction
  ◼ Dimensionality reduction
  ◼ Numerosity reduction
  ◼ Data compression
◼ Data integration from multiple sources:
  ◼ Entity identification problem
  ◼ Remove redundancies
  ◼ Detect inconsistencies
