
Machine Learning in Practice (OE)
Module 2 Data Pre-processing
• Introduction
• Types
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation
Introduction
• What is data pre-processing?
• Data pre-processing is the removal of noise, missing values, and duplicate values, together with transformation and reduction, carried out to maintain the quality of the data.

• What is the goal or purpose of this activity?
• To maintain the quality of the data.
• If the quality of the data is compromised, the results will be compromised.
Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Interpretability
Data Cleaning
• Clean the data by filling in missing values, smoothing noisy data,
identifying or removing outliers, and resolving inconsistencies.
• Naming inconsistencies may also occur for attribute values.
• E.g., Customer_id in one data store and cust_id in another.
• Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
Data Cleaning
• a. Missing Values
• 1. Ignore the tuple
• 2. Fill in the missing value manually
• 3. Use a global constant to fill in the missing value
• 4. Use a measure of central tendency for the attribute (e.g., the mean
or median) to fill in the missing value
• 5. Use the attribute mean or median for all samples belonging to the
same class as the given tuple
• 6. Use the most probable value to fill in the missing value
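Not part of the original slides: a minimal pandas/scikit-learn sketch of strategies 1, 3, and 4 above. The DataFrame and its column names are hypothetical.

```python
# Illustrative only: a few of the missing-value strategies listed above.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "income": [50_000, np.nan, 62_000, 58_000, np.nan],
    "age":    [25, 31, np.nan, 45, 38],
})

# Strategy 1: ignore (drop) tuples that contain any missing value.
dropped = df.dropna()

# Strategy 3: fill missing values with a global constant.
constant_filled = df.fillna(-1)

# Strategy 4: fill with a measure of central tendency (here, the median).
imputer = SimpleImputer(strategy="median")
median_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(median_filled)
```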
Data Cleaning
• b. Noisy data
• Noise is a random error or variance in a measured variable. Noise can
be removed using smoothing techniques.
• Data smoothing techniques
• 1. Binning:
• Binning methods smooth a sorted data value by consulting its
“neighborhood,” that is, the values around it.
• The sorted values are distributed into a number of “buckets,” or bins.
Because binning methods consult the neighborhood of values, they
perform local smoothing.
Binning methods for data smoothing
• In smoothing by bin means, each value in a bin is replaced by the
mean value of the bin. For example, the mean of the values 4, 8, and
15 in Bin 1 is 9. Therefore, each original value in this bin is replaced by
the value 9.
• Similarly, smoothing by bin medians can be employed, in which each
bin value is replaced by the bin median.
• In smoothing by bin boundaries, the minimum and maximum values
in a given bin are identified as the bin boundaries. Each bin value is
then replaced by the closest boundary value.
• In general, the larger the width, the greater the effect of the
smoothing.
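A small illustrative sketch (not from the slides) of equal-frequency binning with smoothing by bin means and by bin boundaries, using data that extends the Bin 1 example (4, 8, 15) above.

```python
# Illustrative only: equal-frequency binning followed by two smoothing schemes.
import numpy as np

data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])   # already sorted
n_bins = 3
bins = np.array_split(data, n_bins)                    # equal-frequency bins

# Smoothing by bin means: every value in a bin becomes the bin mean.
by_means = [np.full(len(b), b.mean()) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of min/max.
by_boundaries = [
    np.where(b - b.min() <= b.max() - b, b.min(), b.max()) for b in bins
]

print("means:     ", np.concatenate(by_means))       # Bin 1 becomes 9, 9, 9
print("boundaries:", np.concatenate(by_boundaries))
```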
Data Smoothing methods
• 2. Regression: Data smoothing can also be done by regression, a
technique that conforms data values to a function.
• Linear regression involves finding the “best” line to fit two attributes
(or variables) so that one attribute can be used to predict the other.
• Multiple linear regression is an extension of linear regression, where
more than two attributes are involved and the data are fit to a
multidimensional surface.
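A brief sketch of smoothing by regression: a line is fitted between two attributes and the predicted values replace the noisy ones. The x/y data below are synthetic.

```python
# Illustrative only: smoothing one attribute by regressing it on another.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float).reshape(-1, 1)
y = 3.0 * x.ravel() + 5.0 + rng.normal(scale=10.0, size=50)   # noisy linear trend

model = LinearRegression().fit(x, y)
y_smoothed = model.predict(x)       # values conformed to the fitted line

print("slope:", model.coef_[0], "intercept:", model.intercept_)
```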
Data Smoothing methods
• 3. Outlier analysis: Outliers may be detected by clustering, for
example, where similar values are organized into groups, or “clusters.”
Intuitively, values that fall outside of the set of clusters may be
considered outliers.
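As a rough illustration of clustering-based outlier detection, DBSCAN labels points that do not belong to any cluster as -1, and those points can be treated as outliers. The data below are synthetic.

```python
# Illustrative only: values far from every cluster are flagged as outliers.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
cluster_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
outliers  = np.array([[10.0, 10.0], [-8.0, 7.0]])
X = np.vstack([cluster_a, cluster_b, outliers])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print("outlier points:\n", X[labels == -1])
```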
Handling categorical features
• A categorical variable takes only a limited number of values.
• Consider a survey that asks how often you eat breakfast and provides
four options: "Never", "Rarely", "Most days", or "Every day".
• In this case, the data is categorical, because responses fall into a fixed
set of categories.
• If people responded to a survey about which brand of car they
owned, the responses would fall into categories like "Honda",
"Toyota", and "Ford".
• In this case, the data is also categorical.
Handling categorical features
• Three Approaches
• 1. Drop Categorical Variables:
• The easiest approach to dealing with categorical variables is to simply
remove them from the dataset.
• This approach will only work well if the columns did not contain useful
information.
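A minimal sketch of this approach with pandas; the DataFrame and column names are made up.

```python
# Illustrative only: drop the categorical (object/category) columns.
import pandas as pd

df = pd.DataFrame({
    "price":     [12000, 18000, 9500],
    "car_brand": ["Honda", "Toyota", "Ford"],
})

numeric_only = df.select_dtypes(exclude=["object", "category"])
print(numeric_only.columns.tolist())   # ['price']
```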
Handling categorical features
• Three Approaches
• 2. Label Encoding:
• Label encoding assigns each unique value to a different integer.
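A short sketch using scikit-learn's OrdinalEncoder on the breakfast-survey example; the explicit category order is an assumption made for illustration.

```python
# Illustrative only: label (ordinal) encoding of a survey column.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"breakfast": ["Never", "Rarely", "Most days", "Every day", "Rarely"]})

# Supplying an explicit order keeps the integers meaningful for ordinal data.
encoder = OrdinalEncoder(categories=[["Never", "Rarely", "Most days", "Every day"]])
df["breakfast_encoded"] = encoder.fit_transform(df[["breakfast"]]).ravel()
print(df)
```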
Handling categorical features
• Three Approaches
• 3. One-Hot Encoding:
• One-hot encoding creates new columns indicating the presence (or
absence) of each possible value in the original data.
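A minimal sketch with pandas get_dummies on the car-brand example.

```python
# Illustrative only: one new indicator column per brand value.
import pandas as pd

df = pd.DataFrame({"car_brand": ["Honda", "Toyota", "Ford", "Honda"]})
one_hot = pd.get_dummies(df, columns=["car_brand"])
print(one_hot)
```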
Feature selection
• The input variables that we give to our machine learning models are
called features.
• Each column in our dataset constitutes a feature.
• To train an optimal model, we need to make sure that we use only the
essential features.
• If we have too many features, the model can capture unimportant patterns and learn from noise.
• The process of choosing the important features of our data is called Feature Selection.
Feature selection
• 1. Filter Method:
• In this method, features are dropped based on their relationship to the output, i.e., how strongly they correlate with the output.
• We use correlation to check whether the features are positively or negatively correlated with the output labels and drop features accordingly.
• E.g., Information Gain, Chi-Square Test, Fisher’s Score, etc.
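An illustrative sketch of a filter method: each feature is scored against the target with the Chi-Square test and the top two are kept. The iris dataset is used purely as a stand-in.

```python
# Illustrative only: filter-style selection with a chi-square score per feature.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)

print("chi-square scores:   ", selector.scores_)
print("kept feature indices:", selector.get_support(indices=True))
```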
Feature selection
• 2. Wrapper Method:
• We train a model on a subset of the features.
• Based on the model’s performance, we add or remove features and train the model again.
• The subsets are formed using a greedy approach, and the accuracy of the possible combinations of features is evaluated.
• E.g., Forward Selection, Backward Elimination, etc.
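A sketch of a wrapper method (forward selection) with scikit-learn; the estimator and dataset are stand-ins chosen for illustration.

```python
# Illustrative only: greedily add the feature that most improves CV accuracy.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))
```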
Feature selection
• 3. Intrinsic Method:
• This method combines the qualities of both the Filter and Wrapper methods to create the best subset.
• Feature selection happens as part of the iterative model-training process itself, which keeps the computational cost to a minimum.
• E.g., Lasso and Ridge Regression.
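A sketch of an intrinsic (embedded) method: Lasso shrinks the coefficients of unhelpful features to exactly zero while the model is trained. The synthetic data are constructed so that only two of the eight features actually matter.

```python
# Illustrative only: Lasso zeroes out the coefficients of irrelevant features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Only features 0 and 3 influence the target; the rest are pure noise.
y = 5.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:             ", np.round(lasso.coef_, 2))
print("features kept (non-zero): ", np.flatnonzero(lasso.coef_))
```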
Feature selection
Input Variable   Output Variable   Feature Selection Model
Numerical        Numerical         • Pearson’s correlation coefficient
                                   • Spearman’s rank coefficient
Numerical        Categorical       • ANOVA correlation coefficient (linear)
                                   • Kendall’s rank coefficient (nonlinear)
Categorical      Numerical         • Kendall’s rank coefficient (linear)
                                   • ANOVA correlation coefficient (nonlinear)
Categorical      Categorical       • Chi-Squared test (contingency tables)
                                   • Mutual Information
Feature Reduction
Attribute Subset Selection
• It reduces the data set size by removing irrelevant or redundant
attributes (or dimensions).
• The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data
classes is as close as possible to the original distribution obtained
using all attributes.
• 1. Stepwise forward selection: The procedure starts with an empty set of
attributes as the reduced set. The best of the original attributes is determined
and added to the reduced set. At each subsequent iteration or step, the best
of the remaining original attributes is added to the set.
• 2. Stepwise backward elimination: The procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
• 3. Combination of forward selection and backward elimination: The stepwise
forward selection and backward elimination methods can be combined so
that, at each step, the procedure selects the best attribute and removes the
worst from among the remaining attributes.
• 4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification. Decision tree induction constructs a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction. At each node, the algorithm chooses the “best” attribute to partition the data into individual classes; attributes that do not appear in the tree are assumed to be irrelevant and can be removed (see the sketch after this list).
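A rough sketch of decision tree induction used for attribute subset selection: attributes with zero importance never appear in the tree and are candidates for removal. The dataset is a stand-in for illustration.

```python
# Illustrative only: rank attributes by how much the fitted tree uses them.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print("feature importances:       ", tree.feature_importances_)
print("attributes used by the tree:", np.flatnonzero(tree.feature_importances_))
```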
Histograms (non-parametric)
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–value/frequency pair,
the buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given
attribute.
Example
• The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted:
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,
21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
• “How are the buckets determined and the attribute values
partitioned?” There are several partitioning rules, including the
following:
• Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., a width of $10 for each bucket).
• Equal-frequency (or equal-depth): In an equal-frequency histogram,
the buckets are created so that, roughly, the frequency of each bucket
is constant (i.e., each bucket contains roughly the same number of
contiguous data samples).
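A small sketch of both partitioning rules applied to the price list above, using pandas; the $10-wide buckets follow the equal-width example.

```python
# Illustrative only: equal-width vs. equal-frequency bucketing of the prices.
import pandas as pd

prices = pd.Series([1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
                    15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
                    20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
                    28, 28, 30, 30, 30])

# Equal-width: every bucket spans the same value range (here $10 wide).
equal_width = pd.cut(prices, bins=[0, 10, 20, 30])
print(equal_width.value_counts().sort_index())

# Equal-frequency (equal-depth): buckets hold roughly the same number of values
# (ties in the data can make the counts somewhat uneven).
equal_frequency = pd.qcut(prices, q=3)
print(equal_frequency.value_counts().sort_index())
```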
Data Transformation by Normalization
• To avoid dependence on the choice of measurement units, the data
should be normalized or standardized.
• Normalizing the data attempts to give all attributes an equal weight.
• Min-max normalization performs a linear transformation on the
original data.
• Suppose that minA and maxA are the minimum and maximum values
of an attribute, A.
• Min-max normalization maps a value, vi, of A to vi’ in the range [new_minA, new_maxA] by computing
vi’ = ((vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
• In z-score normalization (or zero-mean normalization), the values for
an attribute, A, are normalized based on the mean (i.e., average) and
standard deviation of A.
• A value, vi, of A is normalized to vi’ by computing
vi’ = (vi − Ā) / σA, where Ā is the mean and σA the standard deviation of A.
• Normalization by decimal scaling normalizes by moving the decimal point of values of attribute A. The number of decimal places moved depends on the maximum absolute value of A.
• A value, vi, of A is normalized to vi’ by computing
vi’ = vi / 10^j, where j is the smallest integer such that max(|vi’|) < 1.
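A compact sketch of the three normalization schemes applied to a small hypothetical attribute.

```python
# Illustrative only: min-max, z-score, and decimal-scaling normalization.
import numpy as np

A = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0, 1].
new_min, new_max = 0.0, 1.0
min_max = (A - A.min()) / (A.max() - A.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (A - A.mean()) / A.std()

# Decimal scaling: divide by 10^j, where j is the smallest integer such that
# all scaled absolute values are below 1 (here j = 4, since max |A| is 1000).
j = int(np.floor(np.log10(np.abs(A).max()))) + 1
decimal_scaled = A / 10**j

print(min_max)
print(z_score)
print(decimal_scaled)
```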
