Data Preprocessing
What is Data?
Collection of data objects and their attributes. The columns of the table are the attributes; the rows are the objects.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Types of Attributes
➢ There are different types of attributes
➢ Nominal
➢ Examples: ID numbers, eye color, zip codes
➢ Ordinal
➢ Examples: rankings (e.g., taste of potato chips on a scale
from 1-10), grades, height in {tall, medium, short}
➢ Interval
➢ Examples: calendar dates, temperatures in Celsius or
Fahrenheit.
➢ Ratio
➢ Examples: temperature in Kelvin, length, time, counts
Properties of Attribute Values
➢ The type of an attribute depends on which of the
following properties it possesses:
➢ Distinctness: =, ≠
➢ Order: <, >
➢ Addition: +, −
➢ Multiplication: *, /
➢ Nominal attribute: distinctness
➢ Ordinal attribute: distinctness & order
➢ Interval attribute: distinctness, order & addition
➢ Ratio attribute: all 4 properties
Discrete, Continuous & Asymmetric Attributes
➢ Discrete Attribute
➢ Has only a finite or countably infinite set of values
➢ Ex: zip codes, counts, or the set of words in a collection of documents
➢ Often represented as integer variables (nominal, ordinal, and binary attributes)
➢ Continuous Attribute
➢ Has real numbers as attribute values
➢ Interval and ratio attributes
➢ Ex: temperature, height, or weight
➢ Asymmetric Attribute
➢ Only presence is regarded as important
➢ Ex: If students are compared on the basis of the courses they do not take,
then most students would seem very similar
Step 1: To describe the dataset
What do your records represent?
What does each attribute mean?
What type of attributes?
• Categorical
• Numerical
  • Discrete
  • Continuous
• Binary – Asymmetric
Step 2: To explore the dataset
➢ Preliminary investigation of the data to better
understand its specific characteristics
➢ It can help to answer some of the data mining questions
➢ To help in selecting pre-processing tools
➢ To help in selecting appropriate data mining algorithms
➢ Things to look at
➢ Class balance
➢ Dispersion of data attribute values
➢ Skewness, outliers, missing values
➢ Attributes that vary together
➢ Visualization tools (histograms, scatter plots) are important
Useful Statistics
➢ Discrete attributes
➢ Frequency of each value
➢ Mode = value with highest frequency
➢ Continuous attributes
➢ Range of values, i.e. min and max
➢ Mean (average)
➢ Sensitive to outliers
➢ Median
➢ Better indication of the "middle" of a set of values in a skewed distribution
➢ Skewed distribution
➢ mean and median are quite different (see the sketch below)
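As a quick illustration, here is a minimal sketch of these statistics using Python's standard library (the income values are invented for the example):

```python
from statistics import mean, median, mode

# Invented attribute values with one extreme outlier
income = [60, 70, 75, 85, 90, 90, 95, 100, 120, 125, 10000]

print(mode(income))    # 90     -> most frequent value (discrete attributes)
print(mean(income))    # ~991.8 -> the mean is pulled far up by the outlier
print(median(income))  # 90     -> robust "middle" of the skewed values
```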
Skewed Distributions of Attribute Values
Dispersion of Data
➢ How spread out are the values of an attribute?
➢ Variance
➢ Variance is sensitive to outliers
$$\mathrm{variance}(X) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
➢ What if the distribution of values is multimodal, i.e.
data has several bumps?
➢ Visualization tools are useful
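To make the outlier sensitivity concrete, here is a minimal NumPy sketch of the variance formula above (the values are invented):

```python
import numpy as np

x = np.array([4.0, 8.0, 9.0, 15.0, 21.0])

# Population variance: mean of the squared deviations from the mean
var = np.mean((x - x.mean()) ** 2)
print(var)        # 35.44
print(np.var(x))  # same value; np.var divides by n by default

# A single outlier inflates the variance dramatically
print(np.var(np.append(x, 200.0)))  # ~4970
```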
Attributes that Vary Together
➢ Correlation is a measure that describes how two attributes vary together
$$\mathrm{corr}(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
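A minimal NumPy sketch of this formula (the x and y values are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Pearson correlation, computed exactly as in the formula above
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
print(num / den)                 # ~0.85
print(np.corrcoef(x, y)[0, 1])   # same value from NumPy's built-in
```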
Data Quality
➢ Examples of data quality problems:
➢ Noise and outliers
➢ Missing values
➢ Duplicate data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        10000K          Yes   ← a mistake or a millionaire?
6    No      NULL            60K             No    ← missing value
7    Yes     Divorced        220K            NULL  ← missing value
8    No      Single          85K             Yes
9    No      Married         90K             No
9    No      Single          90K             No    ← inconsistent duplicate entries
Missing Values
➢ Reasons for missing values
➢ Information is not collected
(e.g., people decline to give their age and weight)
➢ Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
➢ Handling missing values (see the sketch below)
➢ Eliminate Data Objects
➢ Estimate Missing Values
➢ Ignore the Missing Value During Analysis
➢ Replace with all possible values (weighted by their probabilities)
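A minimal pandas sketch of the first three handling strategies (the toy values loosely echo the table above):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "marital_status": ["Single", None, "Divorced"],
    "income": [125_000, 60_000, np.nan],
})

# Eliminate data objects that have any missing value
print(df.dropna())

# Estimate the missing value, e.g. with the attribute mean
df["income"] = df["income"].fillna(df["income"].mean())

# Ignore during analysis: pandas aggregations skip NaN by default
print(df["marital_status"].value_counts())
```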
Outliers
➢ Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
➢ Can help to
➢ detect new phenomena
➢ discover unusual behavior in data
➢ detect problems
How to Handle Noisy Data?
➢ Binning method:
➢ first sort data and partition into (equi-depth) bins
➢ then smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
➢ Clustering
➢ detect and remove outliers
➢ Combined computer and human inspection
➢ detect suspicious values and check by human
➢ Regression
➢ smooth by fitting the data to regression functions
Discretization
Divide the range of a continuous attribute into intervals
➢ Interval labels can be used to replace actual data
values.
➢ Reduce data size by discretization
➢ Some data mining algorithms only work with discrete attributes
➢ e.g., Apriori for association rule mining (ARM)
Binning (Equal-width)
➢ Equal-width (distance) partitioning
➢ Divide the attribute values x into k equally sized bins
➢ If xmin ≤ x ≤ xmax then the bin width δ is given by
$$\delta = \frac{x_{max} - x_{min}}{k}$$
➢ Disadvantages:
➢ outliers may dominate the presentation
➢ skewed data is not handled well
Binning (Equal-frequency)
➢ Equal-depth (frequency) partitioning:
➢ Divides the range into N intervals, each containing approximately the same number of samples
➢ Good data scaling
➢ Disadvantage:
➢ Many occurrences of the same continuous value could
cause the values to be assigned into different bins
➢ Managing categorical attributes can be tricky.
Binning Example
Attribute values (for one attribute e.g., age):
• 0, 4, 12, 16, 16, 18, 24, 26, 28
Equi-width binning, with a bin width of e.g. 10:
• Bin 1: 0, 4             [−∞, 10) bin
• Bin 2: 12, 16, 16, 18   [10, 20) bin
• Bin 3: 24, 26, 28       [20, +∞) bin

Equi-frequency binning, with a bin density of e.g. 3:
• Bin 1: 0, 4, 12    [−∞, 14) bin
• Bin 2: 16, 16, 18  [14, 21) bin
• Bin 3: 24, 26, 28  [21, +∞) bin
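A minimal Python sketch reproducing both binnings on the age values above:

```python
ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

# Equi-width, width 10: bin index comes from integer division
for x in ages:
    print(x, "-> bin", x // 10 + 1)  # 0,4 -> bin 1; 12-18 -> bin 2; 24-28 -> bin 3

# Equi-frequency, 3 values per bin: slice the sorted values
k = 3
bins = [sorted(ages)[i:i + k] for i in range(0, len(ages), k)]
print(bins)  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```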
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equi-depth bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  Bin 1: 9, 9, 9, 9
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
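A minimal Python sketch of both smoothing methods, reproducing the numbers above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Equi-depth bins of four values each (the data is already sorted)
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: each value snaps to the nearer boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```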
Data Normalization
Attribute values are scaled to fall within a small, specified range, such as 0.0 to 1.0
➢ Min-Max normalization
➢ performs a linear transformation on the original data.
$$v' = \frac{v - min_A}{max_A - min_A}(new\_max_A - new\_min_A) + new\_min_A$$
➢ Example: Let the min and max values for the attribute income be $12,000 and $98,000, respectively.
➢ Map income to the range [0.0, 1.0]: a value of $73,600 is transformed to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 ≈ 0.716
Data Normalization
➢ z-score normalization (or zero-mean normalization)
➢ The values of an attribute A are normalized based on the mean and standard deviation of A
$$v' = \frac{v - mean_A}{stand\_dev_A}$$
➢ Example: Let mean = $54,000 and standard deviation = $16,000 for the attribute income
➢ With z-score normalization, a value of $73,600 for income is transformed to (73,600 − 54,000) / 16,000 = 1.225
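A minimal sketch of both normalizations, reproducing the two worked income examples above (the function names are ours):

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Linearly map v from [vmin, vmax] onto [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Number of standard deviations v lies from the mean."""
    return (v - mean) / std

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```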
Continuous and Categorical Attributes
How to apply the association analysis formulation to non-asymmetric binary variables?

Session Id  Country    Session Length (sec)  Number of Web Pages Viewed  Gender  Browser Type  Buy
1           USA        982                   8                           Male    IE            No
2           China      811                   10                          Female  Netscape      No
3           USA        2125                  45                          Female  Mozilla       Yes
4           Germany    596                   4                           Male    IE            Yes
5           Australia  123                   9                           Male    Mozilla       No
…           …          …                     …                           …       …             …

Example of an association rule:
{Number of Pages ∈ [5,10), Browser = Mozilla} → {Buy = No}
Handling Categorical Attributes
➢ Categorical Attributes:
➢finite number of possible values,
➢no ordering among values
➢ Transform categorical attribute into asymmetric
binary variables
➢ Introduce a new “item” for each distinct
attribute-value pair
➢ Example: replace Browser Type attribute with
➢ Browser Type = Internet Explorer
➢ Browser Type = Mozilla
➢ Browser Type = Chrome
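A minimal pandas sketch of this transformation (the column name is ours):

```python
import pandas as pd

df = pd.DataFrame({"browser_type": ["IE", "Mozilla", "IE", "Chrome"]})

# One new binary "item" per distinct attribute-value pair:
# browser_type_Chrome, browser_type_IE, browser_type_Mozilla
items = pd.get_dummies(df["browser_type"], prefix="browser_type")
print(items)
```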
Handling Categorical Attributes
➢ Potential Issues
➢ What if an attribute has many possible values?
➢ Example: the attribute country has more than 200 possible values
➢ Many of the attribute values may have very low support
➢ Potential solution: aggregate the low-support attribute values
➢ Replace less frequent attribute values with a single category called "others" (see the sketch below)
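A minimal pandas sketch of this aggregation (the threshold and values are invented):

```python
import pandas as pd

country = pd.Series(["USA", "USA", "China", "Fiji", "Malta", "USA", "China"])

# Keep values that reach a minimum support; lump the rest into "others"
min_support = 2
counts = country.value_counts()
frequent = counts[counts >= min_support].index
aggregated = country.where(country.isin(frequent), other="others")
print(aggregated.value_counts())  # USA 3, China 2, others 2
```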
Handling Categorical Attributes
➢ Potential Issue: What if the distribution of attribute values is highly skewed?
➢ Example: In an online survey, we collected information regarding the attributes gender, education, state, computer at home, chat online, shop online and privacy concern.
➢ 85% of the participants have a computer at home
➢ {Computer at home = yes, Shop online = yes} → {Privacy concerns = yes}
➢ Better: {Shop online = yes} → {Privacy concerns = yes}, since the near-universal item adds little information
➢ Potential solution: drop the highly frequent items
Handling Continuous Attributes
➢ Different kinds of rules:
➢Age ∈ [21,35) ∧ Salary ∈ [70k,120k) → Buy
➢Salary ∈ [70k,120k) ∧ Buy → Age: μ = 28, σ = 4
➢ Different methods:
➢Discretization-based
➢Equal-width binning
➢Equal-depth binning
➢Clustering
➢Statistics-based
Similarity and Dissimilarity
Similarity and Dissimilarity
➢ Similarity
• Numerical measure of how alike two data objects are
• Higher when objects are more alike
• Often falls in the range [0,1]
➢ Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
➢ Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
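The table this slide refers to typically lists the following standard definitions (after Tan, Steinbach & Kumar):

Attribute Type     Dissimilarity                          Similarity
Nominal            d = 0 if p = q, 1 if p ≠ q             s = 1 if p = q, 0 if p ≠ q
Ordinal            d = |p − q| / (n − 1), with the        s = 1 − d
                   n values mapped to integers 0 to n−1
Interval or Ratio  d = |p − q|                            s = −d, s = 1/(1 + d), or the
                                                          min-max-scaled 1 − (d − min_d)/(max_d − min_d)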
Euclidean Distance
➢ Euclidean Distance
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n}(p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and pk and qk are the kth attributes (components) of data objects p and q.
➢ Standardization is necessary, if scales differ.
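A minimal NumPy sketch of the Euclidean distance (the points are invented):

```python
import numpy as np

p = np.array([0.0, 2.0, 3.0])
q = np.array([4.0, 2.0, 0.0])

# Square root of the summed squared per-attribute differences
dist = np.sqrt(np.sum((p - q) ** 2))
print(dist)                   # 5.0
print(np.linalg.norm(p - q))  # same result via NumPy's norm
```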
General Approach for Combining Similarities
Sometimes attributes are of many different types, but an
overall similarity is needed.
Using Weights to Combine Similarities
➢ May not want to treat all attributes the same.
➢Use weights wk which are between 0 and 1 and
sum to 1.
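One common formulation (after Tan, Steinbach & Kumar), where s_k(x, y) is the similarity for attribute k and δ_k is 0 if attribute k is asymmetric with both values zero or a value missing, and 1 otherwise:

$$\mathrm{similarity}(x, y) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k(x, y)}{\sum_{k=1}^{n} \delta_k}$$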
Example
➢ One categorical variable, test-1
➢ d(i, j) evaluates to 0 if objects i and j have the same value for test-1, and 1 otherwise
Example
➢ Ordinal variable, test-2
➢ d(i, j) = |ri − rj| / (n − 1), where the n ordered values are mapped to ranks 0 to n − 1
Example
➢ Ratio-scaled variable, test-3
➢ Normalize with min-max normalization (max = 64, min = 22)
➢ Then apply a distance measure (Manhattan or Euclidean distance)
Example
➢ Variables of mixed types
➢ We combine the dissimilarity matrices computed for the three variables (see the sketch below)
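A minimal Python sketch of the combination; only min = 22 and max = 64 for test-3 appear on the slides, so the object values below are assumed for illustration:

```python
# Assumed example objects: nominal test-1, ordinal test-2 (ranks 0..n-1),
# ratio-scaled test-3 (min 22 and max 64 match the slide)
test1 = ["code A", "code B", "code C", "code A"]
test2 = [2, 0, 1, 2]              # excellent=2, good=1, fair=0
test3 = [45.0, 22.0, 64.0, 28.0]

n_ord = 3                          # number of ordinal states
t3_min, t3_max = min(test3), max(test3)

def d_mixed(i, j):
    d1 = 0.0 if test1[i] == test1[j] else 1.0            # nominal: match or not
    d2 = abs(test2[i] - test2[j]) / (n_ord - 1)          # ordinal: rank distance
    d3 = abs(test3[i] - test3[j]) / (t3_max - t3_min)    # min-max scaled ratio
    return (d1 + d2 + d3) / 3                            # equal-weight average

print(round(d_mixed(0, 1), 2))  # dissimilarity between objects 1 and 2: 0.85
```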
Assignment 4
➢ Preprocessing and Clustering using K-means in
Pyspark
➢Due on 24th May
Project ???
➢ Replaced with
➢ more assignments
➢ a mini project
➢ a review of a recent research paper on Spark