L6 Data Preprocessing
Prasanna S. Haddela
Senior Lecturer, Faculty of Computing, SLIIT

Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Summary
Knowledge Discovery in Databases (KDD)

Data Quality: Why Preprocess the Data?
Data Preprocessing: An Overview
◼ Data Quality
◼ Major Tasks in Data Preprocessing
◼ Data Cleaning
◼ Data Transformation and Data Discretization
◼ Data Reduction
◼ Data Integration
◼ Summary

Major Tasks in Data Preprocessing
◼ Data cleaning
  ◼ Fill in missing values, smooth noisy data, identify or remove
    outliers, and resolve inconsistencies
◼ Data integration
  ◼ Integration of multiple databases, data cubes, or files
◼ Data transformation and data discretization
  ◼ Normalization
  ◼ Concept hierarchy generation
◼ Data reduction
  ◼ Dimensionality reduction
  ◼ Numerosity reduction
  ◼ Data compression
Data Cleaning
◼ Data in the real world is dirty: lots of potentially incorrect data,
  e.g., instrument faulty, human or computer error, transmission error
  ◼ incomplete: lacking attribute values, lacking certain attributes of
    interest, or containing only aggregate data
    ◼ e.g., Occupation=“ ” (missing data)
  ◼ noisy: containing noise, errors, or outliers
    ◼ e.g., Salary=“−10” (an error)
  ◼ inconsistent: containing discrepancies in codes or names, e.g.,
    ◼ Age=“42”, Birthday=“03/07/2010”
    ◼ Was rating “1, 2, 3”, now rating “A, B, C”
    ◼ discrepancy between duplicate records
  ◼ intentional (e.g., disguised missing data)
    ◼ Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data
◼ Data is not always available
  ◼ E.g., many tuples have no recorded value for several attributes,
    such as customer income in sales data
◼ Missing data may be due to
  ◼ equipment malfunction
  ◼ inconsistency with other recorded data, and thus deleted
  ◼ data not entered due to misunderstanding
  ◼ certain data not being considered important at the time of entry
  ◼ history or changes of the data not being registered
◼ Missing data may need to be inferred
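Common remedies for missing values include ignoring the tuple, filling in with a global constant, or filling in with the attribute mean. A minimal sketch in Python; the records and income figures are made-up examples:

```python
# Illustrative sketch: three common ways to handle a missing attribute
# value (here, "income"). All records and values are made up.
records = [
    {"name": "A", "income": 45000},
    {"name": "B", "income": None},   # missing, like Occupation=" "
    {"name": "C", "income": 55000},
]

# 1. Ignore the tuple (simple, but discards the rest of the record)
complete = [r for r in records if r["income"] is not None]

# 2. Fill in with a global constant such as "Unknown"
constant_filled = [dict(r, income=r["income"] if r["income"] is not None
                        else "Unknown")
                   for r in records]

# 3. Fill in with the attribute mean computed from the known values
known = [r["income"] for r in records if r["income"] is not None]
mean_income = sum(known) / len(known)
mean_filled = [dict(r, income=r["income"] if r["income"] is not None
                    else mean_income)
               for r in records]

print(mean_filled[1]["income"])  # 50000.0
```

Each option trades information loss for simplicity; mean-filling keeps the tuple but can bias the attribute's distribution.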
◼ Incorrect attribute values may be due to …
◼ After partitioning the sorted data into bins, one can smooth by bin
  means, smooth by bin medians, or smooth by bin boundaries
Data Preprocessing
◼ Data Quality
◼ Data Cleaning
◼ Data Reduction
◼ Data Integration
◼ Summary
15 16
Normalization Normalization
◼ Z-score normalization (μ: mean, σ: standard deviation): ◼ Normalization by decimal scaling
v − A v
v' = v' =
A 10 j
Where j is the smallest integer such that Max(|ν’|) < 1
◼ ◼
17 18
17 18
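The two formulas above can be sketched directly in Python; the sample values are made up:

```python
import math

# Minimal sketch of z-score normalization and decimal scaling.
values = [200, 300, 400, 600, 1000]

# Z-score normalization: v' = (v - mean) / std
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_scores = [(v - mean) / std for v in values]

# Decimal scaling: v' = v / 10^j, with j the smallest integer
# such that max(|v'|) < 1
j = 0
while max(abs(v) for v in values) / 10 ** j >= 1:
    j += 1
decimal_scaled = [v / 10 ** j for v in values]

print(j)               # 4, since the maximum |v| = 1000 needs 10^4
print(decimal_scaled)  # [0.02, 0.03, 0.04, 0.06, 0.1]
```

Note that j = 3 would leave Max(|v'|) = 1.0, which violates the strict inequality, so scaling moves to 10^4.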
◼ Weka practical: Lab1.4 Normalization

Discretization
◼ Three types of attributes: nominal, ordinal, and numeric
Discretization Methods
◼ Typical methods (all of them can be applied recursively):
  ◼ Binning
    ◼ Top-down split, unsupervised
  ◼ Histogram analysis
    ◼ Top-down split, unsupervised
  ◼ Clustering analysis (unsupervised, top-down split or bottom-up merge)
  ◼ Decision-tree analysis (supervised, top-down split)
  ◼ Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)

Simple Discretization: Binning
◼ Equal-width (distance) partitioning
  ◼ Divides the range into N intervals of equal size: a uniform grid
  ◼ If A and B are the lowest and highest values of the attribute, the
    width of the intervals will be W = (B − A)/N
  ◼ Bin boundaries fall at min + W, min + 2W, …, min + (N − 1)W
  ◼ The most straightforward method, but outliers may dominate the
    presentation
  ◼ Skewed data is not handled well
◼ Equal-depth (frequency) partitioning
  ◼ Divides the range into N intervals, each containing approximately
    the same number of samples
  ◼ Managing categorical attributes can be tricky
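Both partitioning schemes, plus smoothing by bin means, can be sketched in a few lines of Python; the data values are made-up examples:

```python
# Sketch of equal-width and equal-depth partitioning, and smoothing
# by bin means. The data values are illustrative only.
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3  # number of bins

# Equal-width: interval width W = (B - A) / N
A, B = min(data), max(data)
W = (B - A) / N  # (34 - 4) / 3 = 10.0
equal_width = [[v for v in data if A + i * W <= v < A + (i + 1) * W]
               for i in range(N - 1)]
equal_width.append([v for v in data if v >= A + (N - 1) * W])

# Equal-depth: each bin holds roughly the same number of samples
depth = len(data) // N
equal_depth = [data[i * depth:(i + 1) * depth] for i in range(N)]

# Smoothing by bin means: replace each value with its bin's mean
smoothed = [[sum(b) / len(b)] * len(b) for b in equal_depth]

print(equal_depth[0])  # [4, 8, 9, 15]
print(smoothed[0])     # [9.0, 9.0, 9.0, 9.0]
```

With skewed data the equal-width bins end up with very unequal counts, which is exactly the weakness noted above; equal-depth bins avoid it by construction.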
◼ Weka practical: Lab1.2 Binning

Discretization by Classification & Correlation Analysis
◼ Classification (e.g., decision tree analysis)
  ◼ Supervised: given class labels, e.g., cancerous vs. benign
Concept Hierarchy Generation
◼ A concept hierarchy organizes concepts (i.e., attribute values)
  hierarchically and is usually associated with each dimension in a
  data warehouse
◼ Concept hierarchies facilitate drilling and rolling in data
  warehouses to view data at multiple granularities
◼ Concept hierarchy formation: recursively reduce the data by
  collecting and replacing low-level concepts (such as numeric values
  for age) with higher-level concepts (such as youth, adult, or senior)
◼ Concept hierarchies can be explicitly specified by domain experts
  and/or data warehouse designers
◼ Concept hierarchies can be automatically formed for both numeric and
  nominal data; for numeric data, use the discretization methods shown

Automatic Concept Hierarchy Generation
◼ Some hierarchies can be automatically generated based on the
  analysis of the number of distinct values per attribute in the data set
  ◼ The attribute with the most distinct values is placed at the
    lowest level of the hierarchy

      country                 15 distinct values
      province_or_state      365 distinct values
      city                 3,567 distinct values
      street             674,339 distinct values
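The distinct-value heuristic above amounts to sorting attributes by their distinct-value counts. A small sketch, reusing the location counts from the example:

```python
# Sketch: order attributes into a concept hierarchy by distinct-value
# count (most distinct values -> lowest level of the hierarchy).
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3567,
    "street": 674339,
}

# Highest level first (fewest distinct values), lowest level last
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(hierarchy)  # ['country', 'province_or_state', 'city', 'street']
```

The heuristic is not foolproof (e.g., weekday has fewer distinct values than month but sits below it), so the generated ordering may need manual review.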
Data Preprocessing
◼ Data Transformation and Data Discretization
◼ Data Reduction

Data Reduction Strategies
◼ Wavelet transforms
◼ Feature subset selection, feature creation
◼ Numerosity reduction (some simply call it: data reduction)
Data Reduction 1: Dimensionality Reduction
◼ Curse of dimensionality
  ◼ When dimensionality increases, data becomes increasingly sparse
  ◼ Density and distance between points, which are critical to
    clustering and outlier analysis, become less meaningful
  ◼ The possible combinations of subspaces grow exponentially
◼ Dimensionality reduction
  ◼ Avoids the curse of dimensionality
  ◼ Helps eliminate irrelevant features and reduce noise
  ◼ Reduces the time and space required in data mining
  ◼ Allows easier visualization
◼ Dimensionality reduction techniques
  ◼ Principal Component Analysis
  ◼ Supervised and nonlinear techniques (e.g., feature selection)
  ◼ Wavelet transforms

Principal Component Analysis (PCA)
◼ Find a projection that captures the largest amount of variation in
  the data
◼ The original data are projected onto a much smaller space, resulting
  in dimensionality reduction. We find the eigenvectors of the
  covariance matrix, and these eigenvectors define the new space

[Figure: data points in the (x1, x2) plane, with the principal
eigenvector e pointing along the direction of largest variation]
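The covariance-eigenvector procedure can be sketched without any libraries for the 2-D case, using the closed-form eigendecomposition of a symmetric 2×2 matrix; the data points are made-up examples spread mostly along the y = x direction:

```python
import math

# Minimal 2-D PCA sketch: center the data, form the covariance matrix,
# take its leading eigenvector as the new axis, and project onto it.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1),
          (1.5, 1.6), (1.1, 0.9)]
n = len(points)

# Center the data
mx = sum(x for x, _ in points) / n
my = sum(y for _, y in points) / n
centered = [(x - mx, y - my) for x, y in points]

# 2x2 sample covariance matrix [[cxx, cxy], [cxy, cyy]]
cxx = sum(x * x for x, _ in centered) / (n - 1)
cyy = sum(y * y for _, y in centered) / (n - 1)
cxy = sum(x * y for x, y in centered) / (n - 1)

# Leading eigenvalue of a symmetric 2x2 matrix, then its eigenvector
lam = (cxx + cyy) / 2 + math.sqrt(((cxx - cyy) / 2) ** 2 + cxy ** 2)
ex, ey = cxy, lam - cxx          # unnormalized eigenvector direction
norm = math.hypot(ex, ey)
e = (ex / norm, ey / norm)       # first principal component, unit length

# Project each centered point onto e: a 1-D representation of the data
projected = [x * e[0] + y * e[1] for x, y in centered]
print(e)  # roughly (0.678, 0.735): close to the y = x diagonal
```

For real data one would use a linear-algebra library and keep the top k eigenvectors rather than just the first, but the steps are the same: center, compute covariance, eigendecompose, project.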
◼ Step-wise attribute elimination
◼ Mapping data to new space (see: data reduction)
Data Reduction 2: Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of data
  representation
◼ Parametric methods (e.g., regression)
  ◼ Assume the data fits some model, estimate the model parameters,
    store only the parameters, and discard the data (except possible
    outliers)
  ◼ Ex.: log-linear models: obtain the value at a point in
    m-dimensional space from appropriate marginal subspaces

Parametric Data Reduction: Regression and Log-Linear Models
◼ Linear regression
  ◼ Data modeled to fit a straight line
  ◼ Often uses the least-square method to fit the line
◼ Multiple regression
  ◼ Allows a response variable Y to be modeled as a linear function of
    a multidimensional feature vector
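Parametric reduction with linear regression can be sketched as follows: fit y = a·x + b by least squares, then keep only the two parameters instead of the points. The (x, y) pairs are made-up examples lying near a line:

```python
# Least-squares fit of y = a*x + b, illustrating parametric
# numerosity reduction. The data points are illustrative only.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
n = len(points)

sx = sum(x for x, _ in points)
sy = sum(y for _, y in points)
sxx = sum(x * x for x, _ in points)
sxy = sum(x * y for x, y in points)

# Closed-form least-squares estimates of slope a and intercept b
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# The whole dataset is now represented by just (a, b); an approximate
# value at any x can be reconstructed from the model
approx = a * 3 + b
```

Five points collapse to two parameters here; for large relations the savings are what make regression attractive as a reduction technique, at the cost of only approximate reconstruction.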
[Figure: histogram with value axis ranging from 20,000 to 90,000]
Sampling: With or without Replacement

Sampling: Cluster or Stratified Sampling

[Figure: raw data partitioned for cluster/stratified sampling]
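The sampling variants named above can be sketched with the standard library's `random` module; the dataset and strata are made-up examples:

```python
import random

# Sketch of simple random sampling with and without replacement,
# plus a small stratified sample. All data and strata are made up.
random.seed(42)                 # fixed seed for reproducibility
data = list(range(1, 101))      # 100 illustrative records

# Without replacement: each record can appear at most once
srswor = random.sample(data, 10)

# With replacement: the same record may be drawn more than once
srswr = [random.choice(data) for _ in range(10)]

# Stratified: draw the same fraction (10%) from each stratum, so
# even small strata stay represented in a skewed dataset
strata = {"young": data[:30], "middle": data[30:80], "senior": data[80:]}
stratified = {name: random.sample(group, max(1, len(group) // 10))
              for name, group in strata.items()}

print(len(srswor), len(set(srswor)))  # 10 10  (no duplicates possible)
```

Sampling without replacement matches drawing records from the raw data and not returning them; with replacement, the same tuple may be picked repeatedly.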
[Figure: data compression — original data vs. approximated (lossy) data]

Data Preprocessing
◼ Data Quality
◼ Data Reduction
◼ Data Integration
◼ Summary
Data Integration

Handling Redundancy in Data Integration

Summary
◼ Data reduction
  ◼ Data compression
◼ Data integration from multiple sources:
  ◼ Entity identification problem
  ◼ Remove redundancies
  ◼ Detect inconsistencies