Data Preprocessing
COMP3314
Machine Learning
Introduction
● Preprocessing a dataset is a crucial step
○ Garbage in, garbage out
○ Quality of data and amount of useful information it contains are
key factors
● Data-gathering methods are often loosely controlled, resulting in
out-of-range values (e.g., Income: −100), impossible data
combinations (e.g., Sex: Male, Pregnant: Yes), missing values, etc.
● Preprocessing is often the most important phase of a machine
learning project
Outline
● In this chapter you will learn how to …
○ Remove and impute missing values from the dataset
○ Get categorical data into shape
○ Select relevant features
● Specifically, we will be looking at the following topics
○ Dealing with missing data
○ Nominal and ordinal features
○ Partitioning a dataset into training and testing sets
○ Bringing features onto the same scale
○ Selecting meaningful features
○ Sequential feature selection algorithms
○ Random forests
Dealing with Missing Data
● Missing data is common in real-world applications
○ Samples might be missing one or more values
● Most ML models are unable to handle missing values
● Two ways to handle this
○ Remove entries
○ Impute missing values from other samples and features (repair)
Identifying Missing Values
● Consider the following simple example generated from CSV
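○ A minimal sketch of such an example (the values and column names here are illustrative, not the slide's original table); empty CSV cells are read by pandas as NaN:

    import pandas as pd
    from io import StringIO

    # Illustrative CSV text with two missing cells
    csv_data = '''A,B,C,D
    1.0,2.0,3.0,4.0
    5.0,6.0,,8.0
    10.0,11.0,12.0,'''
    df = pd.read_csv(StringIO(csv_data))
    print(df)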
Identifying Missing Values
● For larger data, it can be tedious to look for missing values
○ Use the isnull method to return a DataFrame with Boolean
values that indicate whether a cell
■ contains a numeric value (False), or if
■ data is missing (True)
● Use sum() to count the number of missing values per column
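○ For instance, continuing with the illustrative df above:

    # Boolean mask: True where a value is missing
    print(df.isnull())

    # Number of missing values per column
    print(df.isnull().sum())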
Remove Missing Data
● One option is to simply remove the corresponding features (columns) or
samples (rows)
● Rows with missing values can be dropped via the dropna method with
argument axis=0
● Columns with missing values can be dropped via the dropna method with
argument axis=1
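○ A minimal sketch on the illustrative df above:

    # Drop rows (samples) that contain at least one NaN
    print(df.dropna(axis=0))

    # Drop columns (features) that contain at least one NaN
    print(df.dropna(axis=1))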
Dropna
● The dropna method supports several additional parameters that can
come in handy
○ how='all' : only drop rows where all columns are NaN
○ thresh=4 : only drop rows that have fewer than 4 real (non-NaN) values
○ subset=['C'] : only drop rows where NaN appears in specific columns (here: 'C')
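○ These annotations correspond to the how, thresh, and subset parameters of dropna; a sketch on the illustrative df above:

    # Only drop rows where all columns are NaN
    print(df.dropna(how='all'))

    # Drop rows that have fewer than 4 real (non-NaN) values
    print(df.dropna(thresh=4))

    # Only drop rows where NaN appears in specific columns (here: 'C')
    print(df.dropna(subset=['C']))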
Remove Missing Data
● Convenient approach
● Disadvantage
○ May remove too many samples
■ Risk losing valuable information
■ Our classifier may need them to discriminate between
classes
● Could make a reliable analysis impossible
● Alternative approach: Interpolation
Interpolation
● Estimate missing values from the other training samples in our dataset
● Example: Mean imputation
○ Replace missing value with the mean value of the entire feature column
○ Other strategy values to try: 'median', 'most_frequent', or 'constant' with fill_value=42
○ Note: mean and median are for numerical data only; most_frequent and constant can be used for numerical data or strings
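○ A minimal sketch of mean imputation with scikit-learn's SimpleImputer, assuming the illustrative df from before:

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Learn the column means (fit), then fill in the NaNs (transform);
    # try strategy='median', 'most_frequent', or 'constant' with fill_value=42
    imr = SimpleImputer(missing_values=np.nan, strategy='mean')
    imr = imr.fit(df.values)
    imputed_data = imr.transform(df.values)
    print(imputed_data)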
Scikit-Learn Estimator API
● SimpleImputer is a Transformer class
○ Used for data transformation
○ Two essential methods
■ fit
■ transform
● Estimator class
○ Very similar to transformer class
○ Two essential methods
■ fit
■ predict
■ transform (optional)
Transformer - Fit and Transform
● fit method
○ Used to learn the
parameters from the
training data
● transform method
○ Uses those parameters
to transform the data
Note: the number of features must be identical for the data passed to fit and to transform
Estimator - Fit and Predict
● Use fit method to learn parameters
○ Additionally provide class labels
● Use predict method to make predictions
about unlabeled data
Handling Categorical Data
● So far we have been working exclusively with numerical data (values ranging from −infinity to infinity)
● How do we handle categorical data?
○ A categorical feature can take on one of a limited, and usually fixed, number of possible values
● Example of categorical data: t-shirt sizes such as XL, L, M, where there is a fixed number of distinct values
Categorical Data
● It is common that real-world datasets contain categorical features
○ How to deal with this type of data?
● Nominal features vs ordinal features
○ Ordinal features can be sorted / ordered
■ E.g., t-shirt size, because we can define an order XL>L>M
○ Nominal features don't imply any order
■ E.g., t-shirt color
Example Dataset
● The example dataset contains a nominal, an ordinal, and a numerical feature
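○ A sketch of such a dataset (the values and column names here are assumptions for illustration):

    import pandas as pd

    # color is nominal, size is ordinal, price is numerical
    df = pd.DataFrame([['green', 'M', 10.1, 'class2'],
                       ['red', 'L', 13.5, 'class1'],
                       ['blue', 'XL', 15.3, 'class2']],
                      columns=['color', 'size', 'price', 'classlabel'])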
Mapping Ordinal Features
● To ensure correct interpretation of ordinal features, convert string values
to integers
● Reverse-mapping to go back
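○ A minimal sketch, assuming the illustrative df above and the order M < L < XL:

    # Map the ordinal strings to integers that preserve the order
    size_mapping = {'XL': 3, 'L': 2, 'M': 1}
    df['size'] = df['size'].map(size_mapping)

    # Reverse mapping to recover the original string values
    inv_size_mapping = {v: k for k, v in size_mapping.items()}
    df['size'].map(inv_size_mapping)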
Encoding Class Labels
● Most models require integer encoding for class labels
○ Note: class labels are not ordinal, and it doesn't matter which integer number
we assign to a particular string label
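○ A sketch using a plain dictionary mapping (assuming the illustrative df above):

    import numpy as np

    # Assign an arbitrary but unique integer to each class label string
    class_mapping = {label: idx
                     for idx, label in enumerate(np.unique(df['classlabel']))}
    print(class_mapping)                    # e.g., {'class1': 0, 'class2': 1}
    df['classlabel'].map(class_mapping)     # encoded labels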
LabelEncoder
● Alternatively, there is a convenient LabelEncoder class directly
implemented in scikit-learn to achieve this
○ fit_transform is a shortcut for calling fit and transform separately
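○ A minimal sketch, assuming the illustrative df above:

    from sklearn.preprocessing import LabelEncoder

    class_le = LabelEncoder()
    # fit_transform is a shortcut for calling fit and transform separately
    y = class_le.fit_transform(df['classlabel'].values)
    print(y)
    # Reverse the mapping if needed
    print(class_le.inverse_transform(y))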
One-Hot Encoding
● We could use a similar approach to transform the nominal color column
of our dataset, as follows
○ Problem:
■ Model may assume that green > blue, and red > green
■ This could result in a suboptimal model
● Workaround: Use one-hot encoding
○ Create a dummy feature for each unique value of nominal features
■ E.g., a blue sample is encoded as blue = 1 , green = 0 , red = 0
One-Hot Encoding
● Use the OneHotEncoder available in scikit-learn’s preprocessing
module
○ reshape(-1, 1) turns the selected column into a 2D array; -1 means the size of that dimension is unknown and we want NumPy to figure it out
○ The encoder is applied to only a single column (the color column)
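○ A minimal sketch, assuming the illustrative df above:

    from sklearn.preprocessing import OneHotEncoder

    X = df[['color', 'size', 'price']].values
    color_ohe = OneHotEncoder()
    # Encode only the single color column; reshape(-1, 1) lets NumPy infer the rows
    print(color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray())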
One-Hot Encoding via ColumnTransformer
● To selectively transform columns in a multi-feature array, use
ColumnTransformer
○ Accepts a list of (name, transformer, column(s)) tuples
○ In the example, only the first column (the color feature) is turned into dummy features; the remaining columns are passed through unchanged
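○ A minimal sketch, assuming the illustrative df above:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder

    X = df[['color', 'size', 'price']].values
    c_transf = ColumnTransformer([
        ('onehot', OneHotEncoder(), [0]),      # one-hot encode only the first column
        ('nothing', 'passthrough', [1, 2])     # leave the remaining columns untouched
    ])
    print(c_transf.fit_transform(X).astype(float))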
One-Hot Encoding - Via Pandas
● An even more convenient way to create those dummy features via
one-hot encoding is to use the get_dummies method implemented
in pandas
○ get_dummies will only convert string columns
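○ A minimal sketch, assuming the illustrative df above:

    import pandas as pd

    # Only the string column 'color' is converted into dummy columns;
    # numeric columns such as 'price' and the already-encoded 'size' are left as-is
    print(pd.get_dummies(df[['price', 'color', 'size']]))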
One-Hot Encoding - Dropping First Feature
● Note that we do not lose any information by removing one dummy column
○ E.g., if we remove the column color_blue, the feature information is still
preserved since if we observe color_green=0 and color_red=0, it implies that
the observation must be blue
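○ With pandas, this can be done via the drop_first parameter (continuing the sketch above):

    # drop_first=True removes the first dummy column (e.g., color_blue);
    # that category is still implied when all remaining dummy columns are 0
    print(pd.get_dummies(df[['price', 'color', 'size']], drop_first=True))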
UCI Wine Dataset
● The UCI wine dataset consists of 178 wine samples with 13 features describing
their different chemical properties
UCI Wine Dataset: Training-Testing
● Let’s first divide the dataset into separate training and testing sets
○ 30% of the samples are held out for testing
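○ A minimal sketch (the URL is the standard UCI location of the Wine data; random_state=0 and the stratified split are assumptions):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the Wine dataset from the UCI repository (the file has no header row)
    df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                          'machine-learning-databases/wine/wine.data',
                          header=None)

    # First column is the class label, the remaining 13 columns are the features
    X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

    # 70% training, 30% testing; stratify keeps the class proportions in both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)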
UCI Wine Dataset: Training-Testing
● It is important to balance the trade-off between inaccurate estimation of
generalization error and withholding too much information from the
learning algorithm
● In practice, the most commonly used splits are 60:40, 70:30, or 80:20,
depending on the size of the initial dataset
○ For large datasets, 90:10 or 99:1 splits are also common and appropriate
■ E.g., if we need 50 test samples: with a dataset of 100, training:testing = 50:50, while with a dataset of 500 it is 90:10; the bigger the dataset, the smaller the testing ratio can be (the testing set should not exceed 50%)
● Instead of discarding the allocated test data after model training and evaluation, we can retrain a classifier on the entire dataset, as this could improve the predictive performance of the model
○ While this approach is generally recommended, it could lead to worse generalization performance
Feature Scaling
● The majority of ML algorithms require feature scaling
○ Decision trees and random forests are two of the few ML algorithms that don't require feature scaling
● Importance
○ Consider the squared error function in Adaline for two dimensional features
where one feature is measured on a scale from 1 to 10 and the second feature is
measured on a scale from 1 to 100,000
■ The second feature would contribute to the error with a much higher
significance
● Two common approaches to bring different features onto the same scale
○ Normalization
■ E.g., rescaling features to a range of [0, 1]
○ Standardization
■ E.g., center features at mean 0 with standard deviation 1
Feature Scaling - Normalization
● Most often, normalization refers to the rescaling of features to a range of [0, 1]
● To normalize our data, we can simply apply a min-max scaling to each feature column
○ A new value x(i)norm of a sample x(i) is calculated as follows
○ Here xmin is the smallest value in a feature column and xmax the largest
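○ The min-max formula is x(i)_norm = (x(i) − x_min) / (x_max − x_min); scikit-learn implements it in the MinMaxScaler class. A minimal sketch on the Wine split from above:

    from sklearn.preprocessing import MinMaxScaler

    mms = MinMaxScaler()
    # Learn x_min and x_max on the training data only, then apply to both sets
    X_train_norm = mms.fit_transform(X_train)
    X_test_norm = mms.transform(X_test)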
Feature Scaling - Standardization
● Standardization is more practical for various reasons including retaining useful
information about outliers
● A new value x(i)std of a sample x(i) is calculated as follows
● Here μx is the sample mean of feature column and σx the corresponding standard deviation
● Similar to the MinMaxScaler class, scikit-learn also implements a class for standardization
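○ The formula is x(i)_std = (x(i) − μ_x) / σ_x. A minimal sketch using scikit-learn's StandardScaler on the Wine split:

    from sklearn.preprocessing import StandardScaler

    stdsc = StandardScaler()
    # Learn the mean and standard deviation on the training data only
    X_train_std = stdsc.fit_transform(X_train)
    X_test_std = stdsc.transform(X_test)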
Normalization vs. Standardization
● The following example illustrates the difference between
standardization and normalization
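○ The slide's table is not reproduced here; a minimal sketch that shows the two transforms side by side on a toy array:

    import numpy as np

    ex = np.array([0, 1, 2, 3, 4, 5], dtype=float)
    print('standardized:', (ex - ex.mean()) / ex.std())
    print('normalized:  ', (ex - ex.min()) / (ex.max() - ex.min()))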
Robust Scaler
● More advanced methods for feature scaling are available in sklearn
● The RobustScaler is especially helpful and recommended if
working with small datasets that contain many outliers
Feature Selection
● Selects a subset of relevant features
○ Simplify model for easier interpretation
○ Shorten training time
○ Avoid curse of dimensionality
○ Reduce overfitting
● Feature selection ≠ feature extraction (covered in next chapter)
○ Selecting subset of the features ≠ creating new features
● We are going to look at two techniques for feature selection
○ L1 Regularization
○ Sequential Backward Selection (SBS)
L1 vs. L2 Regularization
● L2 regularization (penalty) used in chapter 3
● Another approach: L1 regularization (penalty)
● This will usually yield sparse feature weights
○ Most feature weights will be zero (a zero weight means the feature is not selected, i.e., discarded)
● Sparsity can be useful in practice if we have a high dimensional dataset with
many features that are irrelevant
● L1 regularization can thus be used as a technique for feature selection
Geometric Interpretation
● To better understand how L1 regularization encourages sparsity, let’s take a look
at a geometric interpretation of regularization
● Consider the sum of squared errors cost function used for Adaline
● Plot of the contours of a convex cost function for two coefficients w1 and w2; the cost increases as we move away from the minimum
Geometric Interpretation: L2 Regularization
● Regularization adds a penalty to the cost function to encourage smaller weights
○ By increasing the regularization strength λ we shrink the weights towards
zero and decrease the dependency of our model on the training data
○ We cannot simply minimize the cost alone, because the penalty would then become huge; we need to balance the cost against the penalty
○ All points at the same distance from the origin (i.e., on the same circle) have the same penalty value
Geometric Interpretation: L1 Regularization
● Since the L1 penalty is the sum of the absolute weight coefficients we can
represent it as a diamond-shape
● It is more likely that the optimum is located on the axes, which encourages
sparsity
○ The point closest to the minimum of the cost is very likely to be located at a sharp corner of the diamond
Mathematical details can be found in
Section 3.4 of
The Elements of Statistical Learning
Sparse Solution
● We can simply set the penalty parameter to ‘l1’ for models in scikit-learn that
support L1 regularization
● In scikit-learn, w0 corresponds to intercept_ and wj (for j > 0) corresponds to the
values in coef_
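○ A minimal sketch on the standardized Wine data from earlier (C=1.0 is an assumption; the liblinear solver is one of the solvers that supports the L1 penalty):

    from sklearn.linear_model import LogisticRegression

    # C is the inverse regularization strength
    lr = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
    lr.fit(X_train_std, y_train)
    print('Training accuracy:', lr.score(X_train_std, y_train))
    print('Test accuracy:', lr.score(X_test_std, y_test))
    print(lr.intercept_)   # w0 for each class
    print(lr.coef_)        # wj for j > 0; many entries are exactly zero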
Sparse Solution - Regularization Strength
● As the regularization strength increases (C decreases), the weight coefficients shrink and eventually all converge to zero
○ If C is too small, all weights become zero; with weak regularization (large C) the weights stay non-zero
○ We should find an appropriate C, neither too large nor too small
Sequential Backward Selection (SBS)
● Reduces an initial d-dimensional space to a k-dimensional subspace (k < d)
by automatically selecting features that are most relevant
● Idea:
○ Sequentially remove features until desired feature number is reached
○ Define a criterion function J to be maximized
■ E.g., performance of the classifier after removal
■ Use a validation subset of the training set for performance
evaluation
○ Eliminate the feature that causes the least performance loss
○ In each round: remove one feature, evaluate the classifier, put it back, remove another, and so on; then permanently drop the feature whose removal gives the maximum performance
SBS
Steps:
1. Initialize the algorithm with k = d, where d is the dimensionality of the full feature space X_d
2. Determine the feature x⁻ = argmax J(X_k − x), for x ∈ X_k, that maximizes the criterion function J
3. Remove the feature x⁻ from the feature set:
   X_(k−1) = X_k − x⁻ ;  k = k − 1
4. Terminate if k equals the number of desired features; otherwise, go to step 2
● In the following we will implement SBS in Python from scratch
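○ A from-scratch sketch following the steps above (the class name, the defaults such as test_size=0.25, and the KNN usage at the end are assumptions, not necessarily the slide's exact code):

    from itertools import combinations
    import numpy as np
    from sklearn.base import clone
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    class SBS:
        # Sequential Backward Selection (illustrative sketch)
        def __init__(self, estimator, k_features,
                     scoring=accuracy_score, test_size=0.25, random_state=1):
            self.scoring = scoring
            self.estimator = clone(estimator)
            self.k_features = k_features
            self.test_size = test_size
            self.random_state = random_state

        def fit(self, X, y):
            # Hold out a validation subset of the training data to evaluate J
            X_train, X_valid, y_train, y_valid = train_test_split(
                X, y, test_size=self.test_size, random_state=self.random_state)
            dim = X_train.shape[1]
            self.indices_ = tuple(range(dim))
            self.subsets_ = [self.indices_]
            self.scores_ = [self._calc_score(X_train, y_train,
                                             X_valid, y_valid, self.indices_)]
            while dim > self.k_features:
                scores, subsets = [], []
                # Try removing each remaining feature in turn
                for p in combinations(self.indices_, r=dim - 1):
                    scores.append(self._calc_score(X_train, y_train,
                                                   X_valid, y_valid, p))
                    subsets.append(p)
                best = np.argmax(scores)   # subset with the least performance loss
                self.indices_ = subsets[best]
                self.subsets_.append(self.indices_)
                self.scores_.append(scores[best])
                dim -= 1
            self.k_score_ = self.scores_[-1]
            return self

        def transform(self, X):
            return X[:, list(self.indices_)]

        def _calc_score(self, X_train, y_train, X_valid, y_valid, indices):
            self.estimator.fit(X_train[:, list(indices)], y_train)
            y_pred = self.estimator.predict(X_valid[:, list(indices)])
            return self.scoring(y_valid, y_pred)

○ Example usage with a KNN classifier on the standardized Wine training data from earlier:

    from sklearn.neighbors import KNeighborsClassifier

    knn = KNeighborsClassifier(n_neighbors=5)
    sbs = SBS(knn, k_features=1)
    sbs.fit(X_train_std, y_train)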
● Plotting the validation accuracy as SBS removes features one at a time (starting from all 13 features) shows that the best choice is a subset of 3 features
SBS - Analyzing the Result
● The smallest feature subset (k = 3) that yielded such a good performance on the
validation dataset has the following features
● The accuracy of the KNN classifier on the original test set is as follows
● The three-feature subset has the following accuracy
○ Note: the testing accuracy is slightly lower than with the full feature set
Feature Selection Algorithms in scikit-learn
● There are many more feature selection algorithms available via
scikit-learn
● A comprehensive discussion of the different feature selection
methods is beyond the scope of this lecture
○ A good summary with illustrative examples can be found here
Assessing Feature Importance
● We can determine relevant features using random forest
○ Measure the feature importance as the averaged information gain
● The random forest implementation in scikit-learn already collects the
feature importance values for us
○ Access them via the feature_importances_ attribute after fitting a
RandomForestClassifier
● In the following we will train a forest of 500 trees on the Wine dataset
and rank the 13 features by their respective importance measures
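○ A minimal sketch, assuming the Wine training split from earlier (n_estimators=500 per the slide; random_state=1 is an assumption):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Train a forest of 500 trees on the Wine training data
    forest = RandomForestClassifier(n_estimators=500, random_state=1)
    forest.fit(X_train, y_train)

    # Rank the 13 features by their importance (most important first)
    importances = forest.feature_importances_
    indices = np.argsort(importances)[::-1]
    for rank, idx in enumerate(indices, start=1):
        print(f'{rank:2d}) feature {idx:2d}  importance = {importances[idx]:.4f}')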
● Note: different ML algorithms may select different features (and thus give different results)
Conclusion
● Handle missing data correctly
● Encode categorical variables correctly
● Map ordinal and nominal feature values to integer representations
● L1 regularization can help us to avoid overfitting by reducing the
complexity of a model
● Use a sequential feature selection algorithm to select meaningful features from a dataset
References
● Most materials in this chapter are
based on
○ Book
○ Code
References
● Some materials in this chapter
are based on
○ Book
○ Code
References
● The Elements of Statistical Learning: Data Mining, Inference, and
Prediction, Second Edition
○ Trevor Hastie, Robert Tibshirani, Jerome Friedman
● https://web.stanford.edu/~hastie/ElemStatLearn/
● Pandas User Guide: Working with missing data