Chapter 2
Data Preprocessing
Data Quality
• Real-world databases are highly susceptible to noise, missing values and inconsistent data because of their typically huge size and their likely origin from multiple, heterogeneous sources.
• Low-quality data will lead to low-quality mining results.
• Data preprocessing is required to handle the issues mentioned above.
• The methods for data preprocessing are organized into
– Data Cleaning
– Data Integration
– Data Transformation
– Data Reduction
– Data Discretization
Data Cleaning
• Mainly concerned with
– Filling in missing values
– Identifying outliers and smoothing out noisy data
– Correcting inconsistent data
– Eliminating duplicate data
Missing Data
• Data is not always available; many tuples may have no recorded values for several attributes such as age or income.
• Missing data may be due to:
• Equipment malfunction
• Data that was inconsistent with other recorded data and was therefore deleted
• Data not entered due to misunderstanding
• Certain data not being considered important at the time of entry
How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing. Not effective when the percentage of missing values per attribute varies considerably.
• Fill in missing values manually: tedious and often infeasible.
• Use a global constant to fill in missing values.
• Use the attribute mean of tuples belonging to the same class to fill in missing values.
• Use the most probable value to fill in the missing value (a short sketch of these strategies follows below).
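A minimal sketch of these strategies in Python with pandas; the table, column names and fill choices below are illustrative assumptions, not part of the slides:

    import pandas as pd

    # Hypothetical table with a missing class label and missing ages
    df = pd.DataFrame({
        "class":  ["A", "A", "B", None, "B"],
        "age":    [23, None, 31, None, 45],
        "income": [50000, 62000, None, 58000, 61000],
    })

    # Ignore the tuple: drop rows whose class label is missing
    df_drop = df.dropna(subset=["class"])

    # Use a global constant to fill in missing values
    df_const = df.fillna({"age": -1, "income": -1})

    # Use the attribute mean of tuples belonging to the same class
    df_mean = df.copy()
    df_mean["age"] = df.groupby("class")["age"].transform(lambda s: s.fillna(s.mean()))

    print(df_mean)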
Noisy Data
• Noisy data contains errors caused by random variation in a measured variable.
• Incorrect attribute values may be due to:
• Faulty data collection instruments
• Data entry problems
• Data transmission problems
• Technology limitations
• Inconsistency in naming conventions
How to Handle Noisy Data
– Clustering: detect and remove outliers.
– Regression: smooth the data by fitting it to a regression function.
– Binning method: first sort the data and partition it into bins, then smooth each bin using its mean, median or boundary values (see the sketch below).
– Combined computer and human inspection: suspicious values are flagged automatically and then checked by a human.
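The binning step can be sketched as follows; the measurements and the number of bins are made-up values used only to show smoothing by bin means:

    import numpy as np

    # Hypothetical noisy measurements
    data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

    # Binning method: sort the data, partition it into equal-frequency bins,
    # then smooth every value by replacing it with its bin mean
    n_bins = 3
    bins = np.array_split(np.sort(data), n_bins)
    smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    print(smoothed)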
Outliers
• Outliers are data points that are considerably dissimilar to, or inconsistent with, the remaining data.
• In most cases they are the result of noise, while in some cases they may actually carry valuable information.
Outliers can occur because of:
– Transient malfunction of measurement equipment
– Errors in data transmission or transcription
– Changes in system behaviour
– Data contamination from outside the population examined
– A flaw in the assumed theory
How to Handle Outliers
• There are three fundamental approaches to the problem of outlier detection:
• Type 1:
– Determine the outliers with no prior knowledge of the data. This is analogous to unsupervised learning (a simple sketch follows below).
• Type 2:
– Model both normality and abnormality. Analogous to supervised learning.
• Type 3:
– Model normality only. A semi-supervised learning approach.
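A minimal Type 1 style sketch using a simple z-score rule; the readings and the cut-off of 3 are illustrative assumptions, not prescribed by the slides:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sensor readings: many ordinary values plus one gross error
    values = np.concatenate([rng.normal(loc=10.0, scale=0.5, size=30), [55.0]])

    # No labelled examples are used: flag points that lie far from the mean
    z_scores = (values - values.mean()) / values.std()
    outliers = values[np.abs(z_scores) > 3]
    print(outliers)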
Data Integration
• Combines data from multiple sources into a coherent store.
• Integrating metadata from different sources (schema integration) raises several problems:
– The entity identification problem
– Different sources recording different values for the same attribute
– Data redundancy
• These problems arise mainly because of different representations, different scales, and so on.
How to handle redundant data in data integration?
• Redundant data can often be detected by correlation analysis (see the sketch below).
• Step-wise and careful integration of data from multiple sources
may help to improve mining speed and quality.
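A rough sketch of correlation analysis with pandas; the attributes are invented, with one column that simply duplicates another on a different scale:

    import pandas as pd

    # Hypothetical integrated table: 'salary_usd' is just 'salary_eur' times a fixed rate
    df = pd.DataFrame({
        "age":        [23, 51, 45, 32, 29],
        "salary_eur": [30000, 42000, 61000, 70000, 39000],
        "salary_usd": [33000, 46200, 67100, 77000, 42900],
    })

    # Attribute pairs whose correlation is close to 1 are candidates for removal
    corr = df.corr()
    redundant = [(a, b) for a in corr.columns for b in corr.columns
                 if a < b and abs(corr.loc[a, b]) > 0.95]
    print(redundant)   # [('salary_eur', 'salary_usd')]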
Data Transformation
• Changing data from one form to another.
• Approaches:
– Smoothing: remove noise from the data.
– Aggregation: summarization of the data.
– Generalization: concept-hierarchy climbing.
– Normalization: scaling values to fall within a small, specified range.
Types of normalization:
– Min-max normalization:
• V' = ((V - min) / (max - min)) * (new_max - new_min) + new_min
– Z-score normalization:
• V' = (V - mean) / stand_dev
– Normalization by decimal scaling:
• V' = V / 10^j, where j is the smallest integer such that max(|V'|) < 1
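The three formulas can be checked with a short NumPy sketch; the values and the target range [0, 1] are arbitrary choices for illustration:

    import numpy as np

    v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

    # Min-max normalization to the new range [0, 1]
    new_min, new_max = 0.0, 1.0
    v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

    # Z-score normalization: subtract the mean, divide by the standard deviation
    v_zscore = (v - v.mean()) / v.std()

    # Decimal scaling: divide by 10^j, j being the smallest integer with max(|V'|) < 1
    j = int(np.floor(np.log10(np.abs(v).max()))) + 1
    v_decimal = v / 10 ** j

    print(v_minmax, v_zscore, v_decimal, sep="\n")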
Data Aggregation:
Combining two or more attributes (or objects)
into a single attribute (or object).
• Purpose
– Data reduction: Reduce the number of attributes
or objects
– Change of scale: Cities aggregated into regions,
states, countries, etc
– More “stable” data: Aggregated data tends to have
less variability
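As a small illustration, the change of scale from cities to regions can be sketched with a pandas group-by; the sales figures and place names are invented:

    import pandas as pd

    # Hypothetical daily sales recorded per city
    sales = pd.DataFrame({
        "region": ["East", "East", "West", "West", "West"],
        "city":   ["CityA", "CityB", "CityC", "CityD", "CityC"],
        "amount": [120, 80, 200, 150, 90],
    })

    # Cities aggregated into regions: fewer rows, and the totals vary less than daily values
    by_region = sales.groupby("region")["amount"].sum()
    print(by_region)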
Data Reduction:
• A warehouse may store terabytes of data, so complex data mining may take a very long time to run on the complete data set.
• Data reduction is the process of obtaining a reduced representation of the data set that is much smaller in volume yet produces the same, or almost the same, analytical results.
• Different methods such as data sampling, dimensionality reduction, data cube aggregation, discretization and concept hierarchies are used for data reduction.
• Data compression can also be used, mostly for media files or data.
Data Sampling:
• Sampling is one of the main methods for data selection.
• It is often used for both the preliminary investigation of the data and the final data analysis.
• Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming.
• Sampling is used in data mining for the same reason: processing the entire set of data of interest is too expensive or time consuming.
• A sample should be representative, i.e. it should have approximately the same properties as the original set of data.
Sampling types
• Simple random sampling: there is an equal probability of selecting any particular item.
• Sampling without replacement: as each item is selected, it is removed from the population.
• Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.
• Stratified sampling: split the data into several partitions, then draw random samples from each partition (see the sketch after this list).
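A short pandas sketch of the sampling types; the toy table and sample sizes are assumptions made for illustration:

    import pandas as pd

    # Hypothetical population with an imbalanced 'class' attribute
    df = pd.DataFrame({"id": range(1, 11), "class": ["A"] * 7 + ["B"] * 3})

    # Simple random sampling without replacement
    without_repl = df.sample(n=4, replace=False, random_state=1)

    # Sampling with replacement: the same object can be picked more than once
    with_repl = df.sample(n=4, replace=True, random_state=1)

    # Stratified sampling: partition by 'class', then sample the same fraction from each partition
    stratified = df.groupby("class", group_keys=False).sample(frac=0.5, random_state=1)

    print(stratified)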
Data Discretization
• Converts continuous data into discrete data.
• Partitions data into different classes.
Two approaches are:
• Equal width (distance) partitioning:
– It divides the range into N intervals of equal size.
– If A and B are the lowest and the highest values of the attribute, the width of each interval will be W = (B - A) / N.
– The most straightforward approach to data discretization.
• Equal depth (frequency) partitioning:
– It divides the range into N intervals, each containing approximately the same number of samples.
– Gives good data scaling.
– Managing categorical attributes can be tricky.
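Both partitioning schemes are easy to try with pandas; the age values below are only an example:

    import pandas as pd

    ages = pd.Series([13, 15, 16, 16, 19, 20, 21, 22, 25, 25,
                      30, 33, 35, 36, 40, 45, 46, 52, 70])

    # Equal width partitioning: 3 intervals of equal size W = (B - A) / N
    equal_width = pd.cut(ages, bins=3)

    # Equal depth (frequency) partitioning: 3 intervals with roughly equal counts
    equal_depth = pd.qcut(ages, q=3)

    print(equal_width.value_counts().sort_index())
    print(equal_depth.value_counts().sort_index())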
OLAP
• OLAP stands for On-Line Analytical Processing.
• An OLAP cube is a data structure that allows fast analysis of data.
• OLAP tools were developed for multi-dimensional data analysis; they store their data in a special multi-dimensional format (the data cube) with no updating facility.
• An OLAP tool does not learn; it creates no new knowledge and cannot reach new solutions.
• Information of a multi-dimensional nature cannot be analysed easily when the table has the standard 2-D representation.
• A table with n independent attributes can be seen as an n-dimensional space.
• It is often necessary to explore the relationships between several dimensions, and standard relational databases are not very good at this.
OLAP operations:
(Figures illustrating the individual OLAP cube operations.)
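As a rough illustration (assuming the figures showed the usual cube operations such as roll-up, slice and dice), the following pandas sketch mimics them on an invented fact table:

    import pandas as pd

    # Hypothetical fact table with dimensions (quarter, city, item) and the measure 'sales'
    facts = pd.DataFrame({
        "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
        "city":    ["CityA", "CityB", "CityA", "CityB", "CityA", "CityB"],
        "item":    ["TV", "TV", "TV", "Phone", "Phone", "Phone"],
        "sales":   [100, 80, 120, 90, 60, 70],
    })

    # A small "cube": sales summed over every (quarter, city, item) cell
    cube = facts.pivot_table(values="sales", index=["quarter", "city"],
                             columns="item", aggfunc="sum")

    # Roll-up: climb the location hierarchy by aggregating the city dimension away
    rollup = facts.groupby(["quarter", "item"])["sales"].sum()

    # Slice: fix one dimension to a single value (quarter = "Q1")
    slice_q1 = facts[facts["quarter"] == "Q1"]

    # Dice: select a sub-cube on two or more dimensions
    dice = facts[(facts["quarter"] == "Q1") & (facts["item"] == "TV")]

    print(cube)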
OLTP (Online Transaction Processing)
• Used to carry out day-to-day business functions such as ERP (Enterprise Resource Planning) and CRM (Customer Relationship Management).
• OLTP systems solved a critical business problem by automating daily business functions and running real-time reporting and analysis.
OLAP Vs OLTP
Facts               | OLTP                                        | OLAP
Source of data      | Operational data                            | Data warehouse (built from various databases)
Purpose of data     | Control and run fundamental business tasks  | Planning, problem solving and decision support
Queries             | Simple queries                              | Complex queries and algorithms
Processing speed    | Typically very fast                         | Depends on data size, techniques and algorithms
Space requirements  | Can be relatively small                     | Larger due to aggregated databases
Database design     | Highly normalized with many tables          | Typically denormalized with fewer tables; uses a star or snowflake schema
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
• Dissimilarity
– Numerical measure of how different two data objects are.
– Is lower when objects are more alike.
– Minimum dissimilarity is often 0
– Upper limit varies
Various Types of Similarity and Dissimilarity Measures
1. Jaccard Coefficient / Jaccard Distance
2. Cosine Similarity / Cosine Dissimilarity
3. Manhattan Distance
4. Euclidean Distance
5. Minkowski Distance
6. Hamming Similarity / Hamming Distance
JACCARD SIMILARITY
• Jaccard similarity measures the similarity
between two sets by comparing the size of
their intersection to the size of their union.
JACCARD DISTANCE
• Jaccard distance measures the dissimilarity between two sets and is defined as 1 minus the Jaccard similarity.
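Both quantities are straightforward to compute on Python sets; the two example sets below are made up:

    # Jaccard similarity and distance between two sets
    a = {"milk", "bread", "butter", "eggs"}
    b = {"milk", "bread", "jam"}

    jaccard_similarity = len(a & b) / len(a | b)   # |intersection| / |union|
    jaccard_distance = 1 - jaccard_similarity

    print(jaccard_similarity, jaccard_distance)    # 0.4 0.6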
COSINE SIMILARITY
• Cosine similarity measures the cosine of the angle between
two non-zero vectors in an inner product space. It tells us how
similar the directions of the vectors are, regardless of their
magnitude.
COSINE DISSIMILARITY
• Cosine dissimilarity is a measure of how different two vectors are in direction. It is the complement of cosine similarity:
• Cosine Dissimilarity = 1 - Cosine Similarity
• It ranges from 0 to 2:
• 0 indicates that the vectors point in the same direction.
• 1 indicates that the vectors are orthogonal.
• 2 indicates that the vectors are diametrically opposed.
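A small NumPy sketch of both measures; the two term-count vectors are invented:

    import numpy as np

    x = np.array([3.0, 2.0, 0.0, 5.0])
    y = np.array([1.0, 0.0, 0.0, 2.0])

    # Cosine of the angle between the vectors, ignoring their magnitudes
    cosine_similarity = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cosine_dissimilarity = 1 - cosine_similarity

    print(cosine_similarity, cosine_dissimilarity)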
Manhattan Distance
• Manhattan (city-block) distance is the sum of the absolute differences between the corresponding elements of two vectors.
Euclidean Distance
• Euclidean distance measures the straight-line
distance between two points in Euclidean
space. It's the most intuitive way to quantify
how far apart two points (or vectors) are.
Minkowski Distance
• Minkowski distance generalizes the Manhattan and Euclidean distances: it is the p-th root of the sum of the absolute differences raised to the power p. Setting p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and letting p grow without bound gives the supremum distance.
Supremum Distance
• The supremum distance between two vectors
is the maximum absolute difference between
their corresponding elements.
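The Manhattan, Euclidean, Minkowski and supremum distances can be compared on one pair of vectors; the vectors and the choice p = 3 are arbitrary examples:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([4.0, 0.0, 3.0])

    manhattan = np.sum(np.abs(x - y))                  # sum of absolute differences
    euclidean = np.sqrt(np.sum((x - y) ** 2))          # straight-line distance
    p = 3
    minkowski = np.sum(np.abs(x - y) ** p) ** (1 / p)  # generalizes the two above
    supremum = np.max(np.abs(x - y))                   # largest single difference

    print(manhattan, euclidean, minkowski, supremum)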
SIMPLE MATCHING COEFFICIENT
• The Simple Matching Coefficient (SMC) is a similarity measure used for comparing two binary vectors. It calculates the proportion of matching attributes (both 0s and 1s) between the two vectors.
Dissimilarity of Symmetric Binary Attributes
• For symmetric binary attributes, the dissimilarity between two objects is the number of attributes on which the two objects disagree divided by the total number of attributes, i.e. 1 minus the Simple Matching Coefficient.
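A minimal sketch for two made-up binary vectors:

    import numpy as np

    p = np.array([1, 0, 0, 1, 1, 0])
    q = np.array([1, 1, 0, 1, 0, 0])

    matches = np.sum(p == q)        # positions that agree (0-0 or 1-1)
    smc = matches / len(p)          # Simple Matching Coefficient
    dissimilarity = 1 - smc         # dissimilarity for symmetric binary attributes

    print(smc, dissimilarity)       # 0.666... 0.333...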
Hamming Similarity
• Hamming similarity is a measure of similarity
between two strings (or binary vectors) of equal
length. It’s closely related to the Hamming
distance, which counts the number of positions
where the two strings differ.
• Hamming distance between two equal-length
strings is the number of positions at which the
corresponding symbols are different.
• Hamming similarity is often defined as the
proportion of positions where the two strings are
the same.
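Both quantities reduce to counting positions; the two strings below are arbitrary equal-length examples:

    s1 = "karolin"
    s2 = "kathrin"

    # Hamming distance: number of positions where the strings differ
    hamming_distance = sum(c1 != c2 for c1, c2 in zip(s1, s2))

    # Hamming similarity: proportion of positions where the strings agree
    hamming_similarity = 1 - hamming_distance / len(s1)

    print(hamming_distance, hamming_similarity)   # 3 0.571...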