
Machine Learning in Practice (OE)
Module 2 Data Pre-processing
• Introduction
• Types
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Introduction
• What is data pre-processing?
• Data pre-processing is the removal of noise, missing values, and duplicate values, together with transformation and reduction, carried out to maintain the quality of the data.

• What is the goal or purpose of this activity?
• To maintain the quality of the data.
• If the quality of the data is compromised, the results will be compromised.
Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
Data Cleaning
• Clean the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
• Naming inconsistencies may also occur for attribute values.
• E.g., Customer_id in one data store and cust_id in another.
• Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Data Cleaning
• a. Missing Values
• 1. Ignore the tuple
• 2. Fill in the missing value manually
• 3. Use a global constant to fill in the missing value
• 4. Use a measure of central tendency for the attribute (e.g., the mean
or median) to fill in the missing value
• 5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple
• 6. Use the most probable value to fill in the missing value
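Not part of the original slides: a minimal pandas/scikit-learn sketch of strategies 1, 3, and 4 above. The DataFrame and its column names are hypothetical.

```python
# Illustrative only: a few of the missing-value strategies listed above.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, 58_000, np.nan],
    "age":    [25, 31, np.nan, 45, 38],
})

# Strategy 1: ignore (drop) tuples that contain any missing value.
dropped = df.dropna()

# Strategy 3: fill missing values with a global constant.
constant_filled = df.fillna(-1)

# Strategy 4: fill with a measure of central tendency (here, the median).
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(median_filled)
```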
Data Cleaning
• b. Noisy data
• Noise is a random error or variance in a measured variable. Noise can
be removed using smoothing techniques.
• Data smoothing techniques
• 1. Binning:
• Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they
perform local smoothing.
Binning methods for data smoothing
• In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8, and
15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by
the value 9.
• Similarly, smoothing by bin medians can be employed, in which each
bin value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values
in a given bin are identified as the bin boundaries. Each bin value is
then replaced by the closest boundary value.
• In general, the larger the width, the greater the effect of the
smoothing.
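A small illustrative sketch (not from the slides) of equal-frequency binning with smoothing by bin means and by bin boundaries, using data that extends the Bin 1 example (4, 8, 15) above.

```python
# Illustrative only: equal-frequency binning followed by two smoothing schemes.
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
n_bins = 3
bins = np.array_split(data, n_bins)                    # equal-frequency bins

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max.
by_boundaries = [
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
]

print("means:     ", np.concatenate(by_means))       # Bin 1 becomes 9, 9, 9
print("boundaries:", np.concatenate(by_boundaries))
```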
Data Smoothing methods
• 2. Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
• Linear regression involves finding the “best” line to fit two attributes
(or variables) so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
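A brief sketch of smoothing by regression: a line is fitted between two attributes and the predicted values replace the noisy ones. The x/y data below are synthetic.

```python
# Illustrative only: smoothing one attribute by regressing it on another.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=10.0, size=50)   # noisy linear trend

model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)       # values conformed to the fitted line

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```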
Data Smoothing methods
• 3. Outlier analysis: Outliers may be detected by clustering, for
example, where similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be
considered outliers.
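As a rough illustration of clustering-based outlier detection, DBSCAN labels points that do not belong to any cluster as -1, and those points can be treated as outliers. The data below are synthetic.

```python
# Illustrative only: values far from every cluster are flagged as outliers.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
outliers  = np.array([[10.0, 10.0], [-8.0, 7.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print("outlier points:\n", X[labels == -1])
```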
Handling categorical features
• A categorical variable takes only a limited number of values.
• Consider a survey that asks how often you eat breakfast and provides
four options: "Never", "Rarely", "Most days", or "Every day".
• In this case, the data is categorical, because responses fall into a fixed
set of categories.
• If people responded to a survey about which brand of car they
owned, the responses would fall into categories like "Honda",
"Toyota", and "Ford".
• In this case, the data is also categorical.
Handling categorical features
• Three Approaches
• 1. Drop Categorical Variables:
• The easiest approach to dealing with categorical variables is to simply
remove them from the dataset.
• This approach will only work well if the columns did not contain useful
information.
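A minimal sketch of this approach with pandas; the DataFrame and column names are made up.

```python
# Illustrative only: drop the categorical (object/category) columns.
import pandas as pd

df = pd.DataFrame({
    "price":     [12000, 18000, 9500],
    "car_brand": ["Honda", "Toyota", "Ford"],
})

numeric_only = df.select_dtypes(exclude=["object", "category"])
print(numeric_only.columns.tolist())   # ['price']
```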
Handling categorical features
• Three Approaches
• 2. Label Encoding:
• Label encoding assigns each unique value to a different integer.
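A short sketch using scikit-learn's OrdinalEncoder on the breakfast-survey example; the explicit category order is an assumption made for illustration.

```python
# Illustrative only: label (ordinal) encoding of a survey column.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"]})

# Supplying an explicit order keeps the integers meaningful for ordinal data.
encoder = OrdinalEncoder(categories=[["Never", "Rarely", "Most days", "Every day"]])
df["breakfast_encoded"] = encoder.fit_transform(df[["breakfast"]]).ravel()
print(df)
```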
Handling categorical features
• Three Approaches
• 3. One-Hot Encoding:
• One-hot encoding creates new columns indicating the presence (or
absence) of each possible value in the original data.
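A minimal sketch with pandas get_dummies on the car-brand example.

```python
# Illustrative only: one new indicator column per brand value.
import pandas as pd

df = pd.DataFrame({"car_brand": ["Honda", "Toyota", "Ford", "Honda"]})
one_hot = pd.get_dummies(df, columns=["car_brand"])
print(one_hot)
```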
Feature selection
• The input variables that we give to our machine learning models are
called features.
• Each column in our dataset constitutes a feature.
• To train an optimal model, we need to make sure that we use only the
essential features.
• If we have too many features, the model can capture unimportant patterns and learn from noise.
• The process of choosing the important features of our data is called Feature Selection.
Feature selection
• 1. Filter Method:
• In this method, features are dropped based on their relationship to the output, i.e., how strongly they correlate with the output.
• We use correlation to check whether the features are positively or negatively correlated with the output labels and drop features accordingly.
• E.g., Information Gain, Chi-Square Test, Fisher’s Score, etc.
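An illustrative sketch of a filter method: each feature is scored against the target with the Chi-Square test and the top two are kept. The iris dataset is used purely as a stand-in.

```python
# Illustrative only: filter-style selection with a chi-square score per feature.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("chi-square scores:   ", selector.scores_)
print("kept feature indices:", selector.get_support(indices=True))
```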
Feature selection
• 2. Wrapper Method:
• We train a model on a subset of the features.
• Based on the model’s performance, we add or remove features and train the model again.
• The subsets are formed using a greedy approach, and the accuracy of the possible combinations of features is evaluated.
• E.g., Forward Selection, Backward Elimination, etc.
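A sketch of a wrapper method (forward selection) with scikit-learn; the estimator and dataset are stand-ins chosen for illustration.

```python
# Illustrative only: greedily add the feature that most improves CV accuracy.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))
```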
Feature selection
• 3. Intrinsic Method:
• This method combines the qualities of both the Filter and Wrapper methods to create the best subset.
• Feature selection happens as part of the iterative model-training process itself, which keeps the computational cost to a minimum.
• E.g., Lasso and Ridge Regression.
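A sketch of an intrinsic (embedded) method: Lasso shrinks the coefficients of unhelpful features to exactly zero while the model is trained. The synthetic data are constructed so that only two of the eight features actually matter.

```python
# Illustrative only: Lasso zeroes out the coefficients of irrelevant features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only features 0 and 3 influence the target; the rest are pure noise.
y = 5.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:             ", np.round(lasso.coef_, 2))
print("features kept (non-zero): ", np.flatnonzero(lasso.coef_))
```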
Feature selection
Input Variable   Output Variable   Feature Selection Model
Numerical        Numerical         • Pearson’s correlation coefficient
                                   • Spearman’s rank coefficient
Numerical        Categorical       • ANOVA correlation coefficient (linear)
                                   • Kendall’s rank coefficient (nonlinear)
Categorical      Numerical         • Kendall’s rank coefficient (linear)
                                   • ANOVA correlation coefficient (nonlinear)
Categorical      Categorical       • Chi-Squared test (contingency tables)
                                   • Mutual Information
Feature Reduction
Attribute Subset Selection
• It reduces the data set size by removing irrelevant or redundant
attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes.
• 1. Stepwise forward selection: The procedure starts with an empty set of
attributes as the reduced set. The best of the original attributes is determined
and added to the reduced set. At each subsequent iteration or step, the best
of the remaining original attributes is added to the set.
• 2. Stepwise backward elimination: The procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
• 3. Combination of forward selection and backward elimination: The stepwise
forward selection and backward elimination methods can be combined so
that, at each step, the procedure selects the best attribute and removes the
worst from among the remaining attributes.
• 4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes; attributes that do not appear in the tree are assumed to be irrelevant and can be removed (see the sketch after this list).
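A rough sketch of decision tree induction used for attribute subset selection: attributes with zero importance never appear in the tree and are candidates for removal. The dataset is a stand-in for illustration.

```python
# Illustrative only: rank attributes by how much the fitted tree uses them.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print("feature importances:       ", tree.feature_importances_)
print("attributes used by the tree:", np.flatnonzero(tree.feature_importances_))
```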
Histograms (non-parametric)
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–value/frequency pair,
the buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given
attribute.
Example
• The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
• “How are the buckets determined and the attribute values
partitioned?” There are several partitioning rules, including the
following:
• Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., a width of $10 for each bucket).
• Equal-frequency (or equal-depth): In an equal-frequency histogram,
the buckets are created so that, roughly, the frequency of each bucket
is constant (i.e., each bucket contains roughly the same number of
contiguous data samples).
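A small sketch of both partitioning rules applied to the price list above, using pandas; the $10-wide buckets follow the equal-width example.

```python
# Illustrative only: equal-width vs. equal-frequency bucketing of the prices.
import pandas as pd

prices = pd.Series([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                    15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                    20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                    28, 28, 30, 30, 30])

# Equal-width: every bucket spans the same value range (here $10 wide).
equal_width = pd.cut(prices, bins=[0, 10, 20, 30])
print(equal_width.value_counts().sort_index())

# Equal-frequency (equal-depth): buckets hold roughly the same number of values
# (ties in the data can make the counts somewhat uneven).
equal_frequency = pd.qcut(prices, q=3)
print(equal_frequency.value_counts().sort_index())
```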
Data Transformation by Normalization
• To avoid dependence on the choice of measurement units, the data
should be normalized or standardized.
• Normalizing the data attempts to give all attributes an equal weight.
• Min-max normalization performs a linear transformation on the
original data.
• Suppose that minA and maxA are the minimum and maximum values
of an attribute, A.
• Min-max normalization maps a value, vi, of A to vi’ in the range [new_minA, new_maxA] by computing
vi’ = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• In z-score normalization (or zero-mean normalization), the values for
an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A.
• A value, vi, of A is normalized to vi’ by computing
vi’ = (vi − Ā) / σA, where Ā is the mean and σA the standard deviation of A.
• Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A.
• A value, vi, of A is normalized to vi’ by computing
vi’ = vi / 10^j, where j is the smallest integer such that max(|vi’|) < 1.
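A compact sketch of the three normalization schemes applied to a small hypothetical attribute.

```python
# Illustrative only: min-max, z-score, and decimal-scaling normalization.
import numpy as np

A = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0, 1].
new_min, new_max = 0.0, 1.0
min_max = (A - A.min()) / (A.max() - A.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (A - A.mean()) / A.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# all scaled absolute values are below 1 (here j = 4, since max |A| is 1000).
j = int(np.floor(np.log10(np.abs(A).max()))) + 1
decimal_scaled = A / 10**j

print(min_max)
print(z_score)
print(decimal_scaled)
```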
