Feature Selection Techniques in
Machine Learning with Python
Raheel Shaikh
Oct 28, 2018 · 5 min read
With the new day comes new strength and new thoughts — Eleanor Roosevelt
We may all have faced the problem of identifying the related features in a set of data
and removing the irrelevant or less important features, which do not contribute much to
our target variable, in order to achieve better accuracy for our model.
Feature Selection is one of the core concepts in machine learning which hugely
impacts the performance of your model. The data features that you use to train your
machine learning models have a huge influence on the performance you can achieve.
Irrelevant or partially relevant features can negatively impact model performance.
Feature selection and data cleaning should be the first and most important steps when
designing your model.
In this post, you will discover feature selection techniques that you can use in Machine
Learning.
Feature selection is the process of automatically or manually selecting the features
which contribute most to the prediction variable or output you are interested in.
Having irrelevant features in your data can decrease the accuracy of your models and
make them learn based on those irrelevant features.
How do you select features, and what are the benefits of performing feature selection
before modeling your data?
· Reduces Overfitting: Less redundant data means less opportunity to make decisions
based on noise.
· Improves Accuracy: Less misleading data means modeling accuracy improves.
· Reduces Training Time: fewer features mean less data, so algorithm complexity is
reduced and algorithms train faster.
I want to share my personal experience with this.
I prepared a model by selecting all the features and got an accuracy of around 65%,
which is not great for a predictive model. After doing some feature selection and
feature engineering, without making any logical changes to my model code, my
accuracy jumped to 81%, which is quite impressive.
Now you know why I say feature selection should be the first and most important step of
your model design.
Feature Selection Methods:
I will share three feature selection techniques that are easy to use and also give good
results.
1. Univariate Selection
2. Feature Importance
3. Correlation Matrix with Heatmap
Let’s have a look at these techniques one by one, with an example.
You can download the dataset here:
https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv
Description of the variables in the above file:
battery_power: Total energy a battery can store at one time, measured in mAh
blue: Has Bluetooth or not
clock_speed: the speed at which the microprocessor executes instructions
dual_sim: Has dual sim support or not
fc: Front Camera megapixels
four_g: Has 4G or not
int_memory: Internal Memory in Gigabytes
m_dep: Mobile Depth in cm
mobile_wt: Weight of mobile phone
n_cores: Number of cores of the processor
pc: Primary Camera megapixels
px_height: Pixel Resolution Height
px_width: Pixel Resolution Width
ram: Random Access Memory in MegaBytes
sc_h: Screen Height of mobile in cm
sc_w: Screen Width of mobile in cm
talk_time: the longest time that a single battery charge will last when you are constantly talking on the phone
three_g: Has 3G or not
touch_screen: Has touch screen or not
wifi: Has wifi or not
price_range: This is the target variable with a value of 0(low cost), 1(medium cost),
2(high cost) and 3(very high cost).
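Before applying any of the techniques, it helps to load the file and confirm its shape. A minimal sketch, assuming train.csv has been downloaded into the working directory:
import pandas as pd
data = pd.read_csv("train.csv")  #path is an assumption; point it at wherever you saved the file
print(data.shape)   #20 feature columns plus the price_range target
print(data.head())  #quick look at the first few rows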
1. Univariate Selection
Statistical tests can be used to select those features that have the strongest relationship
with the output variable.
The scikit-learn library provides the SelectKBest class that can be used with a suite of
different statistical tests to select a specific number of features.
The example below uses the chi-squared (chi²) statistical test for non-negative features
to select the 10 best features from the Mobile Price Range Prediction dataset.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score')) #print 10 best features
Top 10 Best Features using SelectKBest class
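Printing the scores tells you which features were chosen, but the fitted selector can also reduce the dataset for you. A minimal sketch continuing from the code above; get_support() and transform() are standard SelectKBest methods:
#boolean mask of the 10 selected columns, used to keep the result as a DataFrame
selected_columns = X.columns[fit.get_support()]
X_selected = X[selected_columns]
print(X_selected.shape)  #(number of rows, 10)
#or get the reduced feature matrix directly as a NumPy array
X_new = fit.transform(X)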
2. Feature Importance
You can get the feature importance of each feature of your dataset by using the feature
importance property of the model.
Feature importance gives you a score for each feature of your data: the higher the score,
the more important or relevant the feature is to your output variable.
Feature importance is an inbuilt attribute of tree-based classifiers; we will use the
ExtraTreesClassifier to extract the top 10 features for the dataset.
import pandas as pd
import numpy as np
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20] #independent columns
y = data.iloc[:,-1] #target column i.e price range
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_)  #use the inbuilt feature_importances_ attribute of tree-based classifiers
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
top 10 most important features in data
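If you want to train on only those features, the same Series can be used to cut the dataset down. A minimal sketch continuing from the code above; keeping 10 features is just this example's cut-off, not a rule:
#names of the 10 features with the highest importance scores
top_features = feat_importances.nlargest(10).index
#reduced feature matrix containing only those columns
X_top = X[top_features]
print(X_top.columns.tolist())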
3. Correlation Matrix with Heatmap
Correlation states how the features are related to each other or to the target variable.
Correlation can be positive (an increase in one feature's value increases the value of the
target variable) or negative (an increase in one feature's value decreases the value of the
target variable).
A heatmap makes it easy to identify which features are most related to the target variable;
we will plot a heatmap of correlated features using the seaborn library.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.read_csv("D://Blogs//train.csv")
X = data.iloc[:,0:20]  #independent columns
y = data.iloc[:,-1]    #target column i.e price range
#get correlations of each feature in the dataset
corrmat = data.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(20,20))
#plot heat map
g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")
plt.show()
Have a look at the last row, i.e. price_range, and see how it is correlated with the other
features: ram is the most highly correlated with price_range, followed by battery_power,
px_height and px_width, while m_dep, clock_speed and n_cores seem to be the least
correlated with price_range.
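To read the same ranking without scanning the heatmap, you can sort the target column of the correlation matrix directly. A minimal sketch using the corrmat computed above; the 0.05 cut-off is an arbitrary value for illustration, not a recommended threshold:
#absolute correlation of every feature with the target, strongest first
target_corr = corrmat['price_range'].drop('price_range').abs()
target_corr = target_corr.sort_values(ascending=False)
print(target_corr)
#example: keep only the features whose correlation passes the cut-off
threshold = 0.05  #arbitrary value, for illustration only
selected = target_corr[target_corr > threshold].index.tolist()
print(selected)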