Feature Selection
Machine learning
Prepared by Abdelrahman Hassan
Agenda
1. Introduction
2. What is feature selection?
3. Feature selection models
4. How to choose a feature selection model?
Introduction
• The input variables that we give to our machine learning models are called features; each column in our dataset constitutes a feature. To train an optimal model, we need to make sure that we use only the essential features. If we have too many features, the model can capture unimportant patterns and learn from noise. The method of choosing the important features of our data is called Feature Selection.
Cont.
• To train a model, we collect enormous quantities of data to help the machine learn better. Usually, a good portion of the data collected is noise, and some of the columns of our dataset might not contribute significantly to the performance of our model. Further, having a lot of data can slow down the training process and make the resulting model slower. The model may also learn from this irrelevant data and be inaccurate.
Cont.
• Consider a table that contains information on old cars; the model decides which cars must be crushed for spare parts.
[Table: car model, year of manufacture, miles traveled, previous owner's name]
Cont.
• In the above table, we can see that the model of the car, the year of manufacture, and the miles it has traveled are important for deciding whether the car is old enough to be crushed. However, the name of the previous owner has no bearing on whether the car should be crushed; worse, it can confuse the algorithm into finding spurious patterns between names and the other features. Hence, we can drop the column, as the sketch below illustrates.
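As a minimal sketch of dropping that column with pandas (the column names and values here are hypothetical, chosen to match the example above):

import pandas as pd

# Hypothetical car data matching the example above.
cars = pd.DataFrame({
    "model": ["Civic", "Corolla", "Beetle"],
    "year": [1998, 2003, 1975],
    "miles": [180_000, 95_000, 220_000],
    "previous_owner": ["Alice", "Bob", "Carol"],  # carries no signal
})

# Drop the irrelevant column before training.
cars = cars.drop(columns=["previous_owner"])
print(cars.columns.tolist())  # ['model', 'year', 'miles']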
What is feature selection?
• Feature Selection is the method of reducing the number of input variables to your model by using only relevant data and getting rid of noise in the data.
Feature selection models
• Feature selection models are of two types (see the sketch after this list):
1. Supervised Models: Supervised feature selection refers to methods that use the output label class for feature selection. They use the target variable to identify the features that can increase the efficiency of the model.
2. Unsupervised Models: Unsupervised feature selection refers to methods that do not need the output label class. We use them for unlabeled data.
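As a minimal sketch of the two flavors (using scikit-learn, with hypothetical data): a supervised selector scores features against the target y, while an unsupervised one, such as a variance threshold, never looks at y.

import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))     # hypothetical feature matrix
y = (X[:, 0] > 0).astype(int)     # target depends only on feature 0
X[:, 4] = 0.0                     # make feature 4 constant (zero variance)

# Supervised: scores each feature against the target y.
supervised = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print(supervised.get_support())   # feature 0 should be among the selected

# Unsupervised: no y needed; drops near-constant features.
unsupervised = VarianceThreshold(threshold=0.0).fit(X)
print(unsupervised.get_support()) # feature 4 is dropped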
Cont.
• Filter Method: In this method, features are dropped based on their relationship to the output, that is, how strongly they correlate with it. We use correlation to check whether the features are positively or negatively correlated with the output labels and drop features accordingly, as in the sketch below. E.g.: Information Gain, Fisher's Score, etc.
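As a minimal sketch of a filter method, scikit-learn's mutual information scorer corresponds to the information-gain criterion named above; the data here is hypothetical.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))    # hypothetical features
y = (X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)  # driven by feature 1

# Score every feature against the labels; higher means more informative.
scores = mutual_info_classif(X, y, random_state=0)
print(scores)                    # feature 1 should get the highest score

# Keep only the features whose score clears a chosen threshold.
keep = scores > 0.05
print(X[:, keep].shape)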
Cont.
• Wrapper Method: We split our features into subsets and train a model on each subset. Based on the model's performance, we add or remove features and train the model again. The method forms the subsets using a greedy approach and evaluates the accuracy of the candidate feature combinations, as in the sketch below. E.g.: Forward Selection, Backward Elimination, etc.
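A minimal sketch of Forward Selection (the wrapper style named above), using scikit-learn's SequentialFeatureSelector around a logistic regression; the data is hypothetical.

import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 6))              # hypothetical features
y = ((X[:, 0] + X[:, 3]) > 0).astype(int)  # depends on features 0 and 3

# Greedily add the feature that most improves cross-validated accuracy,
# retraining the wrapped model at every step (forward selection).
selector = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=2, direction="forward", cv=3
)
selector.fit(X, y)
print(selector.get_support())              # features 0 and 3 should be selected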
Cont.
• Intrinsic Method: This method combines the qualities of both the Filter and Wrapper methods: the selection happens inside the model's own training, producing the best subset as a by-product, as in the sketch below.
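As a minimal sketch of an intrinsic method, L1-regularized (Lasso) regression is a standard example of selection happening during training itself; the data below is hypothetical.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))    # hypothetical features
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)

# The L1 penalty drives the coefficients of unhelpful features to
# exactly zero while the model trains, selecting features implicitly.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                          # near zero for features 1, 3, 4
print(np.flatnonzero(model.coef_ != 0))     # likely [0, 2]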
How to choose a feature selection model?
• How do we know which feature selection model will work for our problem? The process is relatively simple: the choice depends on the types of the input and output variables.
Variables are of two main types:
• Numerical Variables: integers and floating-point numbers.
• Categorical Variables: labels, strings, Boolean variables, etc.
Cont.
• Based on whether we have numerical or categorical variables as inputs and outputs, we can choose our feature selection model as follows:
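The original slide shows this as a chart. A commonly cited mapping for filter-style statistical measures (the exact chart on the slide may differ) is:

Input variable   Output variable   Commonly suggested measure
Numerical        Numerical         Pearson's correlation (linear), Spearman's rank (nonlinear)
Numerical        Categorical       ANOVA F-test (linear), Kendall's rank (nonlinear)
Categorical      Numerical         ANOVA F-test (reversed), Kendall's rank (reversed)
Categorical      Categorical       Chi-squared test, Mutual information

As a minimal sketch of the last row (categorical inputs, categorical output), scikit-learn's chi-squared scorer can rank the features; the data below is hypothetical.

import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(4)
# Hypothetical categorical inputs, encoded as non-negative integers
# (chi2 requires non-negative values), and a categorical output label.
X = rng.integers(0, 3, size=(100, 4))
y = (X[:, 2] > 0).astype(int)

# Chi-squared test: a common filter for categorical -> categorical.
selector = SelectKBest(score_func=chi2, k=1).fit(X, y)
print(selector.get_support())   # feature 2 should be the one selected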
Thank you