Data Warehousing & Data Mining
BSc. CSIT, 7th Semester
UNIT 3: DATA PREPROCESSING
Data Preprocessing
Data preprocessing is a data mining technique used to transform raw data into a usable and understandable format.
Data mining algorithms cannot work well with raw data, so data quality must be checked before the algorithms are applied.
Why is data preprocessing important?
Preprocessing is mainly used to check and improve data quality, which can be assessed along the following dimensions:
Accuracy: whether the data entered is correct.
Completeness: whether all required data has been recorded and is available.
Consistency: whether the same data stored in different places matches.
Timeliness: whether the data is kept up to date.
Believability: whether the data can be trusted.
Interpretability: how easily the data can be understood.
Major Tasks in Data Preprocessing
The four major tasks in data preprocessing are as follows:
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation
Data Cleaning
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets.
Data cleaning also replaces missing values.
There are several techniques for handling missing data and noisy data.
1. Missing data
This situation arises when some values are missing from the data. It can be handled in various ways:
A standard value such as "Not Available" or "NA" can be used to replace the missing values.
Missing values can also be filled in manually, but this is not recommended when the dataset is large.
The attribute's mean can be used to replace a missing value when the data is roughly normally distributed; for a non-normal distribution, the attribute's median is used instead.
Algorithms such as regression or decision trees can be used to replace a missing value with the most probable value (a small sketch follows this list).
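For illustration, here is a minimal Python sketch of mean and median imputation; the impute function and the ages data are hypothetical and not part of the course slides.

    # A minimal sketch of missing-value handling: replace None with the
    # attribute mean for roughly normal data, or the median when skewed.
    import statistics

    def impute(values, strategy="mean"):
        """Fill None entries with the mean or median of the observed values."""
        observed = [v for v in values if v is not None]
        fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
        return [fill if v is None else v for v in values]

    ages = [21, 25, None, 30, None, 27]          # hypothetical attribute with gaps
    print(impute(ages, "mean"))                  # [21, 25, 25.75, 30, 25.75, 27]
    print(impute(ages, "median"))                # [21, 25, 26.0, 30, 26.0, 27]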
2. Noisy data
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways:
a. Binning
This method is used to smooth noisy data.
First, the data is sorted, and then the sorted values are distributed into bins.
There are three methods for smoothing the data in a bin:
Smoothing by bin means: each value in the bin is replaced by the mean value of the bin.
Smoothing by bin medians: each value in the bin is replaced by the median value of the bin.
Smoothing by bin boundaries: the minimum and maximum values of the bin are taken as the bin boundaries, and each value is replaced by the closest boundary value.
Binning Example (figure): sorted price data partitioned into bins and smoothed by bin means and bin boundaries.
Binning Example Description:
In the example above, the price data is first sorted and then partitioned into equal-frequency bins of size 3, i.e. each bin contains 3 values.
In smoothing by bin means, each value in a bin is replaced by the mean value of the bin (e.g. in Bin 1, the mean of 4, 8 and 15 is (4 + 8 + 15) / 3 = 9).
In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value (e.g. in Bin 1 the minimum is 4 and the maximum is 15; since 8 is closer to 4 than to 15, it is replaced by 4).
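The same idea as a small Python sketch. Bin 1 (4, 8, 15) matches the description above; the remaining price values are illustrative and not taken from the slide's figure.

    # Equal-frequency binning with smoothing by bin means and by bin boundaries.
    prices = [15, 4, 24, 21, 8, 28, 34, 21, 25]

    sorted_prices = sorted(prices)               # [4, 8, 15, 21, 21, 24, 25, 28, 34]
    bins = [sorted_prices[i:i + 3] for i in range(0, len(sorted_prices), 3)]

    by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]
    by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

    print(bins)            # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
    print(by_means)        # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
    print(by_boundaries)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]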
b. Regression: Data smoothing can also be done by regression, a technique that conforms data values to a function.
Linear regression involves finding the best line to fit two attributes so that one attribute can be used to predict the other.
Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
c. Outlier analysis: Outliers may be detected by clustering, where similar values are organized into groups or clusters; values that fall outside the set of clusters may be considered outliers.
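A minimal sketch of regression-based smoothing, assuming NumPy is available; the x and y values are hypothetical.

    # Smoothing by linear regression: one attribute (x) predicts the other (y),
    # and the noisy y values are replaced by the values on the fitted line.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # hypothetical predictor attribute
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])       # noisy attribute to smooth

    slope, intercept = np.polyfit(x, y, deg=1)     # fit the best straight line
    y_smoothed = slope * x + intercept             # replace values with fitted ones

    print(np.round(y_smoothed, 2))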
Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management. Several problems must be considered during data integration:
Schema integration: integrating metadata (a set of data that describes other data) from different sources.
Entity identification problem: identifying the same real-world entity across multiple databases. For example, the system or the user should know that student_id in one database and student_name in another database belong to the same entity.
Detecting and resolving data value conflicts: values taken from different databases may differ when merged. For example, an attribute's values in one database may differ from those in another, such as a date stored as "MM/DD/YYYY" in one source and "DD/MM/YYYY" in another.
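A minimal sketch of resolving such a date-format conflict during integration, using Python's standard datetime module; the example dates and the to_iso helper are hypothetical.

    # Convert dates from two sources with different formats into one
    # canonical format before merging the records.
    from datetime import datetime

    def to_iso(date_string, source_format):
        """Parse a date in the source's format and return it as YYYY-MM-DD."""
        return datetime.strptime(date_string, source_format).strftime("%Y-%m-%d")

    print(to_iso("07/25/2023", "%m/%d/%Y"))   # source A uses MM/DD/YYYY -> 2023-07-25
    print(to_iso("25/07/2023", "%d/%m/%Y"))   # source B uses DD/MM/YYYY -> 2023-07-25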
Data Reduction
This process reduces the volume of the data, which makes analysis easier while producing the same, or almost the same, result.
The reduction also saves storage space.
Some of the techniques used in data reduction are dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction:
This process is necessary for real-world applications because the data size is large.
In this process, the number of random variables or attributes is reduced so that the dimensionality of the data set decreases.
Attributes of the data are combined and merged without losing their original characteristics.
This also reduces storage space and computation time.
When the data is highly dimensional, a problem called the "curse of dimensionality" occurs.
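As one common dimensionality reduction technique (not named on the slides), here is a minimal PCA sketch using NumPy's SVD on hypothetical data: four correlated attributes are projected onto two components.

    # PCA via SVD: centre the data, then keep the top principal components.
    import numpy as np

    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    X = np.hstack([base, base + 0.01 * rng.normal(size=(100, 2))])  # 4 correlated columns

    X_centered = X - X.mean(axis=0)                 # centre each attribute
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    X_reduced = X_centered @ Vt[:2].T               # keep the top 2 components

    print(X.shape, "->", X_reduced.shape)           # (100, 4) -> (100, 2)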
Numerosity reduction:
In this method, the representation of the data is made smaller by reducing its volume.
There is no loss of information in this reduction.
Data compression:
Representing the data in a compressed form is called data compression.
This compression can be lossless or lossy.
When no information is lost during compression, it is called lossless compression, whereas lossy compression reduces the information but removes only unnecessary information.
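A minimal sketch of lossless compression using Python's built-in zlib module on hypothetical repetitive data; the original bytes are fully recoverable after decompression.

    # Lossless data compression: the decompressed bytes equal the original.
    import zlib

    raw = ("price,quantity\n" + "19.99,3\n" * 1000).encode("utf-8")
    compressed = zlib.compress(raw)

    print(len(raw), "->", len(compressed))            # far fewer bytes stored
    print(zlib.decompress(compressed) == raw)         # True: no information lost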
Data Transformation
A change made to the format or the structure of the data is called data transformation. This step can be simple or complex depending on the requirements. Some data transformation methods are:
Smoothing: With the help of algorithms, we can remove noise from the dataset, which helps in identifying its important features. By smoothing we can detect even a small change that helps in prediction.
Aggregation: In this method, the data is stored and presented in summary form. Data from multiple sources is integrated into a description suitable for data analysis. This is an important step, since the accuracy of the results depends on the quantity and quality of the data; when both are good, the results are more relevant.
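A minimal sketch of aggregation: daily sales records (hypothetical) are summarised into monthly totals, so the data is stored and presented in summary form.

    # Aggregate daily amounts into monthly totals.
    from collections import defaultdict

    daily_sales = [("2023-01-05", 120), ("2023-01-20", 80),
                   ("2023-02-03", 150), ("2023-02-28", 60)]

    monthly_totals = defaultdict(int)
    for date, amount in daily_sales:
        month = date[:7]                      # "YYYY-MM"
        monthly_totals[month] += amount

    print(dict(monthly_totals))               # {'2023-01': 200, '2023-02': 210}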
Discretization: Continuous data is split into intervals. Discretization reduces the data size. For example, rather than specifying the exact class time, we can use an interval such as (3 pm-5 pm) or (6 pm-8 pm).
Normalization: The method of scaling the data so that it can be represented in a smaller range, for example from -1.0 to 1.0.
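A minimal sketch of min-max normalization to the range [-1.0, 1.0]; the income values and the min_max_normalize function are hypothetical.

    # Rescale an attribute so its minimum maps to -1.0 and its maximum to 1.0.
    def min_max_normalize(values, new_min=-1.0, new_max=1.0):
        old_min, old_max = min(values), max(values)
        scale = (new_max - new_min) / (old_max - old_min)
        return [new_min + (v - old_min) * scale for v in values]

    incomes = [12_000, 35_000, 58_000, 73_600, 98_000]
    print([round(v, 2) for v in min_max_normalize(incomes)])
    # [-1.0, -0.47, 0.07, 0.43, 1.0]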
Forms of Data Preprocessing
Data Discretization and Concept Hierarchies
Data Discretization
Dividing the range of a continuous attribute into intervals.
Interval labels can then be used to replace actual data values.
Reduces the number of values for a given continuous attribute.
Some classification algorithms only accept categorical attributes.
This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Discretization techniques can be categorized based on whether or not they use class information:
Supervised discretization: the discretization process uses class information.
Unsupervised discretization: the discretization process does not use class information.
Discretization techniques can also be categorized based on the direction in which they proceed:
Top-down discretization:
The process starts by finding one or a few points, called split points or cut points, to split the entire attribute range, and then repeats this recursively on the resulting intervals.
Bottom-up discretization:
Starts by considering all of the continuous values as potential split points, removes some by merging neighbouring values to form intervals, and then applies this process recursively to the resulting intervals.
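A minimal sketch of unsupervised, equal-width discretization on hypothetical age values; the equal_width_discretize function is illustrative, not a method named on the slides.

    # Split the attribute range into n equal-width intervals and replace each
    # value with its interval label.
    def equal_width_discretize(values, n_bins=3):
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins
        labels = []
        for v in values:
            index = min(int((v - lo) / width), n_bins - 1)   # clamp the maximum value
            labels.append(f"[{lo + index * width:.0f}-{lo + (index + 1) * width:.0f})")
        return labels

    ages = [18, 22, 25, 33, 41, 47, 52, 60]
    print(equal_width_discretize(ages))
    # ['[18-32)', '[18-32)', '[18-32)', '[32-46)', '[32-46)', '[46-60)', '[46-60)', '[46-60)']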
Concept Hierarchies
Discretization can be performed recursively on an attribute to provide a hierarchical partitioning of the attribute values, known as a concept hierarchy.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
In the multidimensional model, data are organized into multiple dimensions, and each dimension contains multiple levels of abstraction defined by concept hierarchies.
This organization gives users the flexibility to view data from different perspectives.
Data mining on a reduced data set requires fewer input/output operations and is more efficient than mining on a larger data set.
Because of these benefits, discretization techniques and concept hierarchies are typically applied before data mining rather than during mining.
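A minimal sketch of using a concept hierarchy to replace low-level values with higher-level concepts; the city-to-country mapping and the records are hypothetical.

    # Roll low-level values (city) up to a higher-level concept (country).
    city_to_country = {
        "Kathmandu": "Nepal",
        "Pokhara": "Nepal",
        "Delhi": "India",
        "Mumbai": "India",
    }

    records = ["Kathmandu", "Delhi", "Pokhara", "Mumbai", "Kathmandu"]
    rolled_up = [city_to_country[city] for city in records]

    print(rolled_up)   # ['Nepal', 'India', 'Nepal', 'India', 'Nepal']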
Data Mining Primitives OR Data Mining Task Primitives
END OF UNIT 3