Data Preprocessing
Contents
What and Why to Preprocess the Data
Data Cleaning
Data Integration
Data Transformation
Data Reduction
Data Preprocessing
Data preprocessing is a data mining technique that transforms raw data into an understandable format.
Why Preprocess Data?
• Data in the real world is:
Incomplete: lacking values, certain attributes of interest, etc.
Noisy: containing errors or outliers
Inconsistent: lacking compatibility or similarity between two or more facts
Why Preprocess Data?
• No quality data, no quality mining:
Quality decisions must be based on quality data.
Measures of Data Quality
Accuracy
Completeness
Consistency
Timeliness
Value added
Interpretability
Accessibility, etc.
Data Preprocessing Techniques
Data Cleaning
Data Integration
Data Transformation
Data Reduction
What is Data?
A collection of data objects and their attributes. Each row is a data object; each column is an attribute; Cheat is the class attribute.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
There are different types of attributes:
Nominal
Examples: ID numbers, eye color, zip codes
Ordinal
Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
Interval
Examples: calendar dates, temperatures in Celsius or Fahrenheit
Ratio
Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
The type of an attribute depends on which of the following properties it possesses:
Distinctness: = and ≠
Order: < and >
Addition: + and -
Multiplication: * and /
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties
Data Types and Forms
Attribute-value data: a table of objects described by attributes A1, A2, …, An and a class C.
Data types:
numeric, categorical (see the hierarchy for their relationship)
static, dynamic (temporal)
Record Data
Data that consists of a collection of records, each of which consists of a fixed set of attributes. Variants include the data matrix, document data, and transaction data, described below.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Data Matrix
Data objects with a fixed set of numeric attributes can be considered as points in a multi-dimensional space, where each dimension represents a distinct attribute. Represent them by an m-by-n matrix, with m rows, one for each object, and n columns, one for each attribute.

Projection of x Load  Projection of y Load  Distance  Load  Thickness
10.23                 5.27                  15.22     2.7   1.2
12.65                 6.25                  16.22     2.2   1.1
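A minimal sketch (assuming Python with NumPy) of storing the table above as a 2-by-5 data matrix, one row per object and one column per attribute:

import numpy as np

# m = 2 objects (rows), n = 5 attributes (columns):
# projection of x load, projection of y load, distance, load, thickness
data_matrix = np.array([
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
])

print(data_matrix.shape)  # (2, 5): m rows, n columns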
Document Data
Each document becomes a 'term' vector:
each term is a component (attribute) of the vector,
the value of each component is the number of times the corresponding term occurs in the document.

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0
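A minimal sketch of building such term vectors in Python, using the vocabulary from the table above (the example document is illustrative, not from the slides):

from collections import Counter

vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(document):
    # Count how often each vocabulary term occurs in the document.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

print(term_vector("the team lost the game after a timeout"))
# -> [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]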
Transaction Data
A special type of record data, where each record (transaction) involves a set of items.
For example, grocery store transactions.
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Graph Data
Examples:
World Wide Web (generic graphs and HTML links)
Molecular structures (e.g., the benzene molecule, C6H6)
(Figures: a generic graph with numbered nodes; the benzene molecule C6H6.)
Ordered Data
Sequences of transactions
Spatial data
Temporal data (e.g., average monthly temperature of land and ocean)
Spatio-temporal data
Sequential data
Genetic/genomic sequence data, e.g.:
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
The Data Analysis Pipeline
Mining is not the only step in the analysis process:
Data → Preprocessing → Data Mining → Post-processing → Result
Preprocessing: real data is noisy, incomplete and inconsistent; data cleaning is required to make sense of the data
Techniques: sampling, dimensionality reduction, feature selection
Post-processing: make the data actionable and useful to the user
Statistical analysis of importance
Visualization
Data Preprocessing
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes    <- a mistake or a millionaire?
6    No      NULL            60K             No     <- missing value
7    Yes     Divorced        220K            NULL   <- missing value
8    No      Single          85K             Yes
9    No      Married         90K             No     <- inconsistent duplicate entries
9    No      Single          90K             No     <- inconsistent duplicate entries
Noise
Noise refers to modification of original values.
Example: distortion of a person’s voice when talking on a poor phone.
(Figure: two sine waves, and the same two sine waves with noise added.)
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.
They can help to:
detect new phenomena or discover unusual behavior in data
detect problems
Sample applications of outlier detection:
Fraud detection
Abnormal buying patterns can characterize credit card abuse
Medicine
Unusual symptoms or test results may indicate potential health problems of a patient
Public health
The occurrence of a particular disease
Sports statistics
Outstanding (in a positive as well as a negative sense) players may be identified as having abnormal parameter values
Detecting measurement errors
Data derived from sensors (e.g., in a given scientific experiment) may contain measurement errors
Abnormal values could provide an indication of a measurement error
“One person’s noise could be another person’s signal.”
Missing Values
Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children)
Handling missing values:
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values (weighted by their probabilities)
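A minimal sketch of the first three options in Python with pandas (the DataFrame and column names are illustrative, not from the slides):

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40],
                   "income": [50000, 60000, np.nan]})

dropped = df.dropna()          # eliminate data objects with missing values
filled = df.fillna(df.mean())  # estimate missing values, e.g., by the column mean
# Many computations simply ignore missing values during analysis:
print(df["income"].mean())     # mean over the non-missing values only -> 55000.0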
Duplicate Data
Data sets may include data objects that are duplicates, or almost duplicates, of one another.
This is a major issue when merging data from heterogeneous sources.
Examples:
The same person with multiple email addresses
Forms of Data Preprocessing
• Fill in missing values
• Smooth noisy data
• Remove outliers
• Resolve inconsistencies
• Normalization and aggregation
How to Handle Noisy Data?
Binning method:
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, bin medians, bin boundaries, etc.
Clustering:
detect and remove outliers
Combined computer and human inspection:
detect suspicious values and have a human check them
Regression:
smooth by fitting the data into regression functions
Simple Discretization Methods: Binning
Equal-width (distance) partitioning:
Divides the range into N intervals of equal size: a uniform grid
If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B-A)/N
The most straightforward, but outliers may dominate the presentation
Skewed data is not handled well
Simple Discretization Methods: Binning
Equal-depth (frequency) partitioning:
Divides the range into N intervals, each containing approximately the same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Example
Attribute values (for one attribute, e.g., age):
0, 4, 12, 16, 16, 18, 24, 26, 28

Equi-width binning, for a bin width of e.g. 10:
Bin 1: 0, 4              [-∞, 10) bin
Bin 2: 12, 16, 16, 18    [10, 20) bin
Bin 3: 24, 26, 28        [20, +∞) bin
(-∞ denotes negative infinity, +∞ positive infinity)

Equi-frequency binning, for a bin density of e.g. 3:
Bin 1: 0, 4, 12          [-∞, 14) bin
Bin 2: 16, 16, 18        [14, 21) bin
Bin 3: 24, 26, 28        [21, +∞) bin
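A minimal sketch of both partitionings in Python, using the age values above (the helper code is ours, not from the slides):

values = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# Equi-width binning with width 10: bin index = value // 10
width_bins = {}
for v in values:
    width_bins.setdefault(v // 10, []).append(v)
print(width_bins)  # {0: [0, 4], 1: [12, 16, 16, 18], 2: [24, 26, 28]}

# Equi-frequency binning with 3 values per bin
k = 3
sorted_values = sorted(values)
freq_bins = [sorted_values[i:i + k] for i in range(0, len(sorted_values), k)]
print(freq_bins)   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]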
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
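A minimal sketch reproducing this example in Python (the helper code and the rounding of bin means are ours):

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # equi-depth, 4 per bin

# Smoothing by bin means: replace each value by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: replace each value by the closer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]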
Data Integration
Data integration combines data from multiple sources into a coherent store.
Detecting and resolving data value conflicts:
for the same real-world entity, attribute values from different sources may differ
possible reasons: different representations, different scales (e.g., metric vs. British units)
Handling Redundant Data
Redundant data often occur when multiple databases are integrated:
The same attribute may have different names in different databases
Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies, and improve mining speed and quality
Data Transformation
Transform or consolidate data into forms appropriate for mining:
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Daily sales data aggregated to compute monthly or annual amounts
Normalization: scale to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation
Aggregation: summarization, data cube construction
Daily sales data aggregated to compute monthly or annual amounts
(Figure: a data cube for sales.)
Data Transformation: Normalization
Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0.
Min-max normalization performs a linear transformation on the original data:

v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A

Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively. Map income to the range [0.0, 1.0].
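For instance, an income value of $73,600 maps to ((73,600 - 12,000) / (98,000 - 12,000)) × (1.0 - 0.0) + 0.0 ≈ 0.716.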
Data Transformation: Normalization
z-score normalization (or zero-mean normalization):
The values of an attribute A are normalized based on the mean and standard deviation of A:

v' = (v - mean_A) / stand_dev_A

Example: Let the mean and standard deviation of the values for the attribute income be $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to (73,600 - 54,000) / 16,000 = 1.225.
Data Transformation: Normalization
Decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal points moved depends on the maximum absolute value of A:

v' = v / 10^j

where j is the smallest integer such that max(|v'|) < 1.
Example: Suppose that the recorded values of A range from -986 to 917.
The maximum absolute value of A is 986.
To normalize by decimal scaling, we therefore divide each value by
1,000 (i.e., j = 3)
-986 normalizes to -0.986 and 917 normalizes to 0.917.
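A minimal sketch of the three normalizations in Python, checked against the examples above (function names are ours):

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # linear transformation, as in the min-max formula above
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, j):
    # j is the smallest integer such that max(|v'|) < 1
    return v / 10 ** j

print(min_max(73600, 12000, 98000))  # ~0.716
print(z_score(73600, 54000, 16000))  # 1.225
print(decimal_scaling(-986, 3))      # -0.986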
Data Reduction
Warehouse may store terabytes of data: Complex data
analysis/mining may take a very long time to run on the
complete data set
Data reduction
Obtains a reduced representation of the data set that is much
smaller in volume
but produces the same (or almost the same) analytical results
Data Reduction Strategies
Dimensionality reduction
Data compression
use encoding schemes to reduce the data set size
Numerosity reduction
data is replaced or estimated by alternative, smaller data representations
Sampling
Histograms
Clustering
Discretization and concept hierarchy generation
replace raw attribute values by ranges or higher conceptual levels
Histograms
A popular data reduction technique:
Divide data into buckets and store the average (sum) for each bucket
Can be constructed optimally in one dimension using dynamic programming
(Figure: a histogram over attribute values ranging from 10,000 to 100,000.)
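A minimal sketch of bucketing data into an equal-width histogram with NumPy (the data here is synthetic, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
prices = rng.integers(10_000, 100_000, size=1_000)  # synthetic attribute values

# 9 equal-width buckets over [10,000, 100,000); store only counts per bucket
counts, edges = np.histogram(prices, bins=9, range=(10_000, 100_000))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{lo:.0f}, {hi:.0f}): {c}")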
Cluster Analysis
Partition data into clusters, and store only the cluster representation.
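A minimal sketch of this idea with scikit-learn's KMeans (an assumption; the slides do not name a clustering algorithm): keep only the cluster centroids instead of the full data set.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))  # 1,000 synthetic points in 2-D

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(data)
representation = kmeans.cluster_centers_  # 10 centroids instead of 1,000 points
print(representation.shape)               # (10, 2)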
Sampling
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
Example: What is the average height of a person in Pakistan? We cannot measure the height of everybody.
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming.
Example: We have 1M documents. What fraction has at least 100 words in common? Computing the number of common words for all pairs requires 10^12 comparisons.
Example: What fraction of tweets in a year contain the word “Lahore”? At 300M tweets per day, with 100 characters on average: 86.5TB to store all tweets.
Sampling …
The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set, if the sample is representative.
A sample is representative if it has approximately the same property (of interest) as the original set of data; otherwise we say that the sample introduces some bias.
Types of Sampling
Simple random sampling:
There is an equal probability of selecting any particular item
Sampling without replacement:
As each item is selected, it is removed from the population
Sampling with replacement:
Objects are not removed from the population as they are selected for the sample, so the same object can be picked more than once
This makes analytical computation of probabilities easier
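A minimal sketch of both variants with Python's standard library (the population is illustrative):

import random

population = list(range(100))
random.seed(0)

without_repl = random.sample(population, k=10)  # each item selected at most once
with_repl = random.choices(population, k=10)    # the same item may repeat
print(without_repl)
print(with_repl)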
Sampling
(Figure: raw data of 8,000 points, and random samples of 2,000 and 500 points.)
Discretization
Discretization divides the range of a continuous attribute into intervals:
Reduces data size
Interval labels can be used to replace actual data values
Discretization methods for numeric data:
Binning (sensitive to the user-specified number of bins and to outliers)
Histogram analysis
Clustering analysis
Segmentation by natural partitioning
Summary
Data preparation is a big issue for both warehousing and mining.
Data preparation includes:
Data cleaning and data integration
Data reduction and feature selection
Discretization
A lot of methods have been developed, but this is still an active area of research.