Feature Engg Pre Processing Python

Feature engineering involves selecting and extracting relevant features from data to improve machine learning models. There are several techniques for feature engineering including feature selection, transformation, and scaling. Feature selection methods like filter, wrapper, and embedded methods help reduce dimensionality and improve model performance. Feature scaling techniques like standardization and normalization standardize feature values to make algorithms work more effectively.

Uploaded by

Gaurav Rohilla
Feature Engineering

• The transformation stage in the data preparation process includes an important
step known as Feature Engineering.
• Feature Engineering refers to selecting and extracting the right features from the
data that are relevant to the task and model in consideration.
Aspects of Feature Engineering
Types of Data

Training Data
• It assists in learning and forming a predictive hypothesis for future data.
Test Data
• Data provided to test a hypothesis created via prior learning is known as test data.
• Typically 20% of labeled data is reserved for the test.
Validation data
• It is a held-out dataset used to recheck the hypothesis, guarding against the case where the algorithm
has been overfitted even to the test data through multiple attempts at testing.


Feature Selection
• Feature selection becomes even more important when the number of features is very large.
You need not use every feature at your disposal for creating an algorithm. You can
assist your algorithm by feeding in only those features that are really important.
Reasons to use feature selection are:
• It enables the machine learning algorithm to train faster.
• It reduces the complexity of a model and makes it easier to interpret.
• It improves the accuracy of a model if the right subset is chosen.
• It reduces overfitting.
Feature Selection Methods
• There are three types of feature selection methods
1. Filter Methods
2. Wrapper Methods
3. Embedded methods

We will discuss each one of them one by one.


Filter Methods
Filter methods are generally used as a preprocessing step. The selection of features is independent of any
machine learning algorithms. Instead, features are selected on the basis of their scores in various statistical
tests for their correlation with the outcome variable.

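The correlation measure is chosen based on the types of the feature data and the response data, as shown in the table
below:

Feature \ Response    Continuous               Categorical
Continuous            Pearson’s Correlation    LDA
Categorical           ANOVA                    Chi-Square

Brief explanation of correlation coefficients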
• Pearson’s Correlation: It is used as a measure for quantifying linear dependence between
two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is
given as:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$
• LDA: Linear discriminant analysis is used to find a linear combination of features that
characterizes or separates two or more classes (or levels) of a categorical variable.
• ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that
it is operated using one or more categorical independent features and one continuous
dependent feature. It provides a statistical test of whether the means of several groups are
equal or not.
• Chi-Square: It is a statistical test applied to groups of categorical features to
evaluate the likelihood of correlation or association between them using their frequency
distribution.
Correlation Statistics
• The scikit-learn library provides an implementation of most of the useful statistical measures.
• For example:
• Pearson’s Correlation Coefficient: f_regression()
• ANOVA: f_classif()
• Chi-Squared: chi2()
• Mutual Information: mutual_info_classif() and mutual_info_regression()
• Also, the SciPy library provides an implementation of many more statistics, such as Kendall’s tau
(kendalltau) and Spearman’s rank correlation (spearmanr).
Selection Method
• The scikit-learn library also provides many different filtering methods once statistics have been calculated
for each input variable with the target.
• Two of the more popular methods include:
• Select the top k variables: SelectKBest
• Select the top percentile variables: SelectPercentile
Transform Variables
Consider transforming the variables in order to access different statistical methods.
• For example, you can transform a categorical variable to ordinal, even if it is not,
and see if any interesting results come out.
• You can also make a numerical variable discrete (e.g. bins); try categorical-based
measures.
• Some statistical measures assume properties of the variables. Pearson’s, for example,
assumes a Gaussian probability distribution of the observations and a linear
relationship. You can transform the data to meet the expectations of the test, or try
the test regardless of the expectations, and compare the results.
Regression Feature Selection:
(Numerical Input, Numerical Output)

• This section demonstrates feature selection for a regression problem that has
numerical inputs and numerical outputs.
• A test regression problem is prepared using the make_regression() function.
• Feature selection is performed using Pearson’s Correlation Coefficient via the
f_regression() function, which converts each feature's correlation with the target
into an F-statistic.

• Output: Running the example first creates the regression dataset, then defines the
feature selection and applies the feature selection procedure to the dataset,
returning a subset of the selected input features.
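The steps described above can be sketched as follows. This is a minimal illustration, not the original slide code: the dataset sizes, the number of informative features, and k=10 are assumptions chosen for the example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem (sizes are assumed for illustration):
# 100 samples, 100 numerical inputs, 10 of which drive the numerical output.
X, y = make_regression(n_samples=100, n_features=100, n_informative=10,
                       random_state=1)

# Score every feature with f_regression and keep the 10 highest-scoring ones.
fs = SelectKBest(score_func=f_regression, k=10)
X_selected = fs.fit_transform(X, y)

print(X_selected.shape)  # (100, 10)
```

The fitted selector also exposes the per-feature scores in fs.scores_, which can be inspected before settling on a value of k.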

Classification Feature Selection:
(Numerical Input, Categorical Output)

• This section demonstrates feature selection for a classification problem that has
numerical inputs and categorical outputs.
• A test classification problem is prepared using the make_classification() function.
• Feature selection is performed using ANOVA F measure via the f_classif() function.

Running the example first creates the classification dataset, then defines the feature selection and applies the feature
selection procedure to the dataset, returning a subset of the selected input features.
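A minimal sketch of this procedure is shown below. The dataset sizes and k=2 are assumptions made for the example, not values taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification problem (sizes are assumed for illustration):
# 100 samples, 20 numerical inputs, 2 of them informative.
X, y = make_classification(n_samples=100, n_features=20, n_informative=2,
                           random_state=1)

# Score every feature with the ANOVA F measure and keep the top 2.
fs = SelectKBest(score_func=f_classif, k=2)
X_selected = fs.fit_transform(X, y)

print(X_selected.shape)  # (100, 2)
```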
Classification Feature Selection:
(Categorical Input, Categorical Output)

• The two most commonly used feature selection methods for categorical input data
when the target variable is also categorical (e.g. classification predictive modeling)
are the chi-squared statistic and the mutual information statistic.

• Depending on which statistic is chosen, the feature selector is defined as follows (note that chi2 requires non-negative feature values):
SelectKBest(score_func=chi2, k=4)
Or
SelectKBest(score_func=mutual_info_classif, k=4)
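The selector above can be used as in the following sketch. The categorical data here is randomly generated and purely hypothetical, and the OrdinalEncoder step is an assumption about how the string categories are turned into the non-negative numbers that chi2 expects.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical categorical dataset: 200 samples, 6 categorical inputs,
# and a categorical (yes/no) target.
rng = np.random.default_rng(0)
X_raw = rng.choice(["low", "medium", "high"], size=(200, 6))
y_raw = rng.choice(["yes", "no"], size=200)

# Encode string categories as non-negative integer codes.
X_enc = OrdinalEncoder().fit_transform(X_raw)
y_enc = LabelEncoder().fit_transform(y_raw)

# Keep the 4 inputs with the highest chi-squared scores.
fs = SelectKBest(score_func=chi2, k=4)
X_selected = fs.fit_transform(X_enc, y_enc)

print(X_selected.shape)  # (200, 4)
```

Swapping score_func=chi2 for mutual_info_classif gives the mutual information variant with the same interface.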
Wrapper Methods

• In wrapper methods, we train a model on a subset of features. Based on the
inferences that we draw from that model, we decide to add features to or remove
features from the subset. The problem is essentially reduced to a
search problem. These methods are usually computationally very expensive.
• Forward Selection: Forward selection is an iterative method in which we start with
having no feature in the model. In each iteration, we keep adding the feature which
best improves our model till an addition of a new variable does not improve the
performance of the model.
• Backward Elimination: In backward elimination, we start with all the features and
remove the least significant feature at each iteration, as long as this improves the
performance of the model. We repeat this until no improvement is observed on
removal of features.
• Recursive Feature elimination: It is a greedy optimization algorithm which aims
to find the best performing feature subset. It repeatedly creates models and sets
aside the best or the worst performing feature at each iteration. It constructs the
next model with the remaining features until all the features are exhausted. It then ranks
the features based on the order of their elimination.
Recursive Feature Elimination example code
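The original example code did not survive extraction. The following is a minimal sketch using scikit-learn's RFE class. The synthetic dataset, the choice of LogisticRegression as the underlying model, and the number of features to keep are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification data (sizes are assumed for illustration).
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

# RFE repeatedly fits the model and drops the weakest feature (step=1)
# until only n_features_to_select features remain.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=3, step=1)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger numbers were eliminated earlier
```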
Embedded Methods

• Embedded methods combine the qualities of filter and wrapper methods. They are
implemented by algorithms that have their own built-in feature selection methods.
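• Some of the most popular examples of these methods are LASSO and RIDGE
regression, which have inbuilt penalization functions to reduce overfitting.
• Lasso regression performs L1 regularization, which adds a penalty equivalent to the
absolute value of the magnitude of the coefficients. This can shrink some coefficients
exactly to zero, which removes those features from the model.
• Ridge regression performs L2 regularization, which adds a penalty equivalent to the
square of the magnitude of the coefficients. It shrinks coefficients toward zero but
does not usually set them exactly to zero.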
• Other examples of embedded methods are Regularized trees, Memetic algorithm,
Random multinomial logit.
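To illustrate how a penalty can act as built-in feature selection, the sketch below fits Lasso and Ridge on the same synthetic data and counts the non-zero coefficients. The dataset and alpha=1.0 are assumptions for the example; the exact counts depend on both.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumed): 20 features, only 5 of which affect the target.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

# Features with a zero coefficient are effectively dropped from the model.
n_lasso = int(np.sum(lasso.coef_ != 0))
n_ridge = int(np.sum(ridge.coef_ != 0))
print("non-zero Lasso coefficients:", n_lasso)
print("non-zero Ridge coefficients:", n_ridge)
```

Lasso typically reports fewer non-zero coefficients than Ridge here, which is why it is the one more often used for embedded feature selection.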
Filter Vs Wrapper methods
The main differences between the filter and wrapper methods for feature selection are:
• Filter methods measure the relevance of features by their correlation with the dependent
variable, while wrapper methods measure the usefulness of a subset of features by
actually training a model on it.
• Filter methods are much faster than wrapper methods as they do not involve
training the models. Wrapper methods, on the other hand, are computationally very
expensive.
• Filter methods use statistical methods for evaluation of a subset of features, while
wrapper methods use cross validation.
• Filter methods might fail to find the best subset of features on many occasions, while
wrapper methods, which evaluate subsets directly against the model, are more likely to
find a well-performing subset.
• Using the subset of features from the wrapper methods makes the model more prone to
overfitting compared to using a subset of features from the filter methods.
Feature Scaling

• Feature Scaling is a method used in Machine Learning to bring the
independent variables (features) of the data onto a common scale.
Importance of Feature Scaling
• Let’s consider a situation where input data has two features, one ranging from
value 1 to 100 and the other from 1 to 10000.
• This can skew machine learning algorithms that minimize an objective such as mean
squared error, because the optimizer mostly works on the larger errors coming from the second feature.
• The computed Euclidean distances between samples will be dominated by the
second feature axis in the K-nearest neighbors (KNN) algorithm.
• The solution lies in scaling all the features on a similar scale (0 to 1) or (1 to 10).
Techniques of Feature Scaling

There are 2 types of Feature Scaling.


• Standardization
• Normalization
Feature Scaling: Standardization

Let us understand the Standardization technique below.
• Standardization is a popular feature scaling method, which gives each feature the
mean and standard deviation of a standard normal distribution (a Gaussian
distribution with mean 0 and standard deviation 1).
• It shifts and rescales the values but does not change the shape of their distribution.
• The mean of each feature is centered at zero, and the feature column has a
standard deviation of one.
Standardization: Example

• To standardize the jth feature, you need to subtract the sample mean μj from every
training sample and divide it by its standard deviation σj as given below:
$$x_j' = \frac{x_j - \mu_j}{\sigma_j}$$
Here, xj is a vector consisting of the jth feature values of all n training samples.
• NumPy's mean and standard deviation functions can be used to standardize features
from a sample data set X (x0, x1...).
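The original NumPy snippet is not present in the extracted text. A minimal sketch of the idea follows; the sample values in X are hypothetical, and the population standard deviation (NumPy's default, ddof=0) is assumed.

```python
import numpy as np

# Hypothetical sample data set: rows are samples, columns are features x0, x1.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0],
              [5.0, 900.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation (ddof=0)
X_std = (X - mu) / sigma   # subtract the mean, divide by the std

print(X_std.mean(axis=0))  # approximately [0, 0]
print(X_std.std(axis=0))   # approximately [1, 1]
```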

The ML library scikit-learn implements a class for standardization called StandardScaler.
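A minimal sketch of StandardScaler usage is given below, on hypothetical sample values. It should produce the same result as subtracting the mean and dividing by the population standard deviation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical sample data set: rows are samples, columns are features.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0],
              [5.0, 900.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)   # learn mean/std, then scale

print(scaler.mean_)    # per-feature means learned from X
print(scaler.scale_)   # per-feature standard deviations learned from X
print(X_std)
```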
Feature Scaling: Normalization

• In most cases, normalization refers to the rescaling of data features between 0 and
1, which is a special case of Min-Max scaling.
Normalization: Example
To normalize a feature, subtract the min value for that feature from each feature
instance and divide by the spread between max and min:
$$x_{norm}^{(i)} = \frac{x^{(i)} - x_{min}}{x_{max} - x_{min}}$$
In effect, it measures the relative percentage of distance of each instance from the min value for that feature.
The ML library scikit-learn has a MinMaxScaler class for normalization.
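The sketch below applies min-max scaling by hand and with MinMaxScaler, to show that the two agree. The sample values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample data set: rows are samples, columns are features.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [5.0, 900.0]])

# Manual min-max scaling, column by column.
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same operation with scikit-learn (default feature_range is (0, 1)).
X_norm = MinMaxScaler().fit_transform(X)

print(X_norm)
```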
Difference between Standardization and Normalization

• Standardization and normalization give different results on the same data, for
example on a sample dataset with values from 1 to 5.
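The comparison table itself was lost in extraction. The sketch below recomputes such a table for the values 1 to 5, assuming the population standard deviation (ddof=0) for standardization; the original slide may have used a different convention.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

standardized = (x - x.mean()) / x.std()            # zero mean, unit std
normalized = (x - x.min()) / (x.max() - x.min())   # rescaled to [0, 1]

# Print the three columns side by side.
print(f"{'input':>6} {'standardized':>13} {'normalized':>11}")
for raw, s, n in zip(x, standardized, normalized):
    print(f"{raw:>6.1f} {s:>13.5f} {n:>11.2f}")
```

Standardized values are centered on zero and can be negative, while normalized values always stay within 0 and 1.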
Dimensionality Reduction
Let’s look at some aspects of Dimensionality Reduction below.
• Dimensionality reduction involves the transformation of data to new dimensions in a
way that facilitates discarding of some dimensions without losing any key
information.
• Large-scale problems bring about many dimensions, which can become very
difficult to visualize.
• Some of these dimensions can be easily dropped for a better visualization.
Example: Car attributes might contain maximum speed in both units, kilometers per
hour and miles per hour. One of these can be safely discarded in order to reduce the
dimensions and simplify the data.
Some aspects of Dimensionality Reduction are described below.
• Principal component analysis (PCA) is a technique for dimensionality reduction that
helps in arriving at better visualization models.
• Let’s consider the pilots who like to fly radio-controlled helicopters. Assume x1 =
the piloting skill of the pilot and x2 = passion to fly.
• RC helicopters are difficult to fly and only those students that truly enjoy flying can
become good pilots. So, the two factors x1 and x2 are correlated, and this
correlation may be represented by the piloting “karma” u1 and only a small amount
of noise lies off this axis (represented by u2 ).
• Most of the data lie along u1, making it the principal component.
• Hence, you can safely work with u1 alone and discard u2 dimension. So, the 2D
problem now becomes a 1D problem.
Principal Component Analysis (PCA)
• Before PCA is applied, you need to preprocess the data to
normalize its mean and variance. For m samples $x^{(1)}, \dots, x^{(m)}$:
1. Let $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$.
2. Replace each $x^{(i)}$ with $x^{(i)} - \mu$.
3. Let $\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x_j^{(i)}\big)^2$.
4. Replace each $x_j^{(i)}$ with $x_j^{(i)} / \sigma_j$.
• Steps 1 and 2 zero out the mean of the data, and steps 3 and 4 rescale each coordinate to have unit
variance. This ensures that different attributes are treated on the same scale.
• For instance, if x1 was maximum speed in mph (taking values in high tens or low hundreds) and x2 was the
number of seats (taking values 2-4), then this renormalization rescales the attributes to make them more
comparable to each other.
Principal Component Analysis (PCA)(Contd.)

• How do you find the axis of variation u on which most of the data lies?
• When you project this data to lie along the axis of the unit vector, you would like to
preserve most of it, such that its variance is maximized (which means most data is
covered).
• Intuitively, the data starts off with some amount of variance (information).
• The figure shows this normalized data.
• Let’s project data onto different u axes as shown in the charts given on the left.
• Dots represent the projection of data points on this line.
• In figure A, projected data has a large amount of variance, and the points are far
from zero.
• In figure B, projected data has a low amount of variance, and the points are closer
to zero. Hence, figure A is a better choice to project the data.
• The length of the projection of x on a unit vector u is given by $x^T u$. This also represents
the distance of the projection of x from the origin.
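• Hence, to maximize the variance of the projections, you choose a unit-length u that maximizes
the following, where m is the number of samples:
$$\frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)T}u\big)^2 = u^T\Big(\frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}\Big)u$$
• Maximizing this subject to $\|u\|_2 = 1$ gives the principal eigenvector of
$\Sigma = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}x^{(i)T}$, which is also known as the covariance matrix of the data
(assuming that it has zero mean).
• Generally, if you need to project data onto the k-dimensional
subspace (k < n), you choose u1, u2...uk to be the top k
eigenvectors of ∑.
• All the ui now form a new orthogonal basis for the data.
• Then, to represent x(i) in this new basis, you need to
compute the corresponding vector:
$$y^{(i)} = \begin{bmatrix} u_1^T x^{(i)} \\ \vdots \\ u_k^T x^{(i)} \end{bmatrix} \in \mathbb{R}^k$$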
• The vector y(i) is a lower k-dimensional approximation of x(i). This is known as the
dimensionality reduction.
• The vectors u1,u2...uk are called the first k principal components of the data.
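The procedure above can be sketched in NumPy as follows. The two correlated features are randomly generated and purely hypothetical (loosely standing in for x1 and x2 in the helicopter example), and k=1 is chosen to mirror the 2D-to-1D reduction.

```python
import numpy as np

# Hypothetical 2D data: x2 is strongly correlated with x1, plus a little noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2])

# Preprocessing: zero mean and unit variance for every feature.
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the zero-mean data: (1/m) * sum_i x(i) x(i)^T.
m = X.shape[0]
Sigma = X.T @ X / m

# eigh returns eigenvalues in ascending order, so sort them descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 1
U = eigvecs[:, :k]   # top-k principal components u1..uk (as columns)
Y = X @ U            # row i holds y(i) = [u1^T x(i), ..., uk^T x(i)]

print(Y.shape)                   # (200, 1)
print(eigvals / eigvals.sum())   # share of variance along each component
```

The variance of the projected data equals the largest eigenvalue, which is what makes the principal eigenvector the variance-maximizing direction.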
Applications of PCA
• Noise Reduction
• PCA can eliminate noise or noncritical aspects of the data set to reduce complexity.
Also, during image processing or comparison, image compression can be done
with PCA, eliminating the noise such as lighting variations in face images.
• Compression
• It is used to map high dimensional data to lower dimensions. For example, instead
of having to deal with multiple car types (dimensions), we can cluster them into
fewer types.
• Preprocess
• It reduces data dimensions before running a supervised learning program and
saves on computations as well as reduces overfitting.
PCA: 3D to 2D Conversion

• After PCA is applied to the 3D data, only two dimensions turn out to be
important: red and green, which carry most of the variance. The blue dimension has
limited variance, and hence it is eliminated.
Key Takeaways

• Data preparation allows simplification of data to make it ready for Machine
Learning and involves data selection, filtering, and transformation.
• Data must be sufficient, representative of real-world data, and of high quality.
• Feature Engineering helps in selecting the right features and extracting the most
relevant features.
• Feature scaling transforms features to bring them on a similar scale, in order to
make them comparable in ML routines.
• Dimensionality Reduction allows reducing dimensions in datasets to simplify ML
training.
Commonly used Steps
• Getting the dataset
• Importing libraries
• Importing datasets
• Finding Missing Data
• Encoding Categorical Data
• Splitting dataset into training and test set
• Feature scaling
Getting the Datasets
• To create a machine learning model, the first thing we require is a dataset, as a
machine learning model works entirely on data.
• The collected data for a particular problem, in a proper format, is known as the dataset.
• Datasets come in different formats for different purposes. For example, the dataset for a
business-oriented machine learning model will differ from the
dataset required for a liver patient study.
• So each dataset is different from another dataset. To use the dataset in our code, we
usually put it into a CSV file. However, sometimes, we may also need to use an HTML
or xlsx file.
• For real-world problems, we can download datasets online from various sources such
as https://www.kaggle.com/uciml/datasets, UCI repository,Govt. data sites like
india.gov.in, Stanford University repository, https://archive.ics.uci.edu/ml/index.php etc.
• We can also create our dataset by gathering data using various API with Python and
put that data into a .csv file.
Importing Libraries
In order to perform data preprocessing using Python, we need to import some
predefined Python libraries. There are three specific libraries that we will use for data
preprocessing, which are:
Numpy: The Numpy Python library is used for including any type of mathematical
operation in the code. It is the fundamental package for scientific calculation in
Python. It also supports large, multidimensional arrays and matrices.
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library,
and with this library, we need to import a sub-library pyplot. This library is used to
plot any type of chart in Python.
Pandas: The last library is the Pandas library, which is one of the most famous
Python libraries and is used for importing and managing the datasets. It is an
open-source data manipulation and analysis library.
Importing the Datasets

We need to import the datasets which we have collected for our machine learning
project. But before importing a dataset, we need to set the current directory as a
working directory. To set a working directory in Spyder IDE, we need to follow the
below steps:
• Save your Python file in the directory which contains dataset.
• Go to File explorer option in Spyder IDE, and select the required directory.
• Click on F5 button or run option to execute the file.
read_csv() function:
• Now to import the dataset, we will use the read_csv() function of the pandas library, which
reads a CSV file into a DataFrame on which various operations can be performed. Using this function,
we can read a CSV file locally as well as through a URL.
• The result is assigned to a variable such as data_set, which stores our dataset, and
the name of our dataset file is passed to the function.
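A minimal sketch of the read_csv() call is shown below. The CSV content is hypothetical and is read from an in-memory string so that the example is self-contained; in practice a file name or URL would be passed instead. Only the column names (Country, Age, Salary, Purchased) come from the text.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the dataset file on disk.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,Yes
Germany,40,,Yes
India,35,58000,Yes
"""

# read_csv accepts a path, a URL, or any file-like object.
data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.head())
print(data_set.isnull().sum())   # number of missing values per column
```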
Extracting Dependent and Independent variables
• In machine learning, it is important to distinguish the matrix of features
(independent variables) and dependent variables from dataset.
• In our dataset, there are three independent variables that are Country, Age, and
Salary, and one is a dependent variable which is Purchased.
Extracting independent variable:
• To extract an independent variable, we will use iloc[ ] method of Pandas library. It
is used to extract the required rows and columns from the dataset.

In an indexing expression of the form iloc[:, :-1], the first colon (:) takes all the rows, and the part after the comma
selects the columns. Here we use :-1 because we don't want to take the last column, as it contains the dependent variable.
By doing this, we get the matrix of features.
Output
• The resulting matrix of features contains only the three independent variables.
• Extracting dependent variable:
To extract the dependent variable, again, we will use the Pandas .iloc[] method.
Here we take all the rows with the last column only, which gives the array of the dependent variable.
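Both extraction steps can be sketched as follows. The four rows are hypothetical; the column names follow the dataset described in the text, with Purchased as the last column.

```python
import pandas as pd

# Hypothetical rows with the columns described in the text.
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany", "France"],
    "Age": [38, 43, 30, 48],
    "Salary": [68000, 45000, 54000, 65000],
    "Purchased": ["No", "Yes", "No", "Yes"],
})

# All rows, every column except the last: the matrix of features.
x = data_set.iloc[:, :-1].values

# All rows, last column only: the dependent variable.
y = data_set.iloc[:, -1].values

print(x.shape)  # (4, 3)
print(y)
```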
Handling Missing data:
The next step of data preprocessing is to handle missing data in the datasets. If our
dataset contains some missing data, then it may create a huge problem for our
machine learning model. Hence it is necessary to handle missing values present in
the dataset.
Ways to handle missing data:
• There are mainly two ways to handle missing data, which are:
• By deleting the particular row: The first way is commonly used to deal with null
values. Here, we simply delete the specific row or column which contains null
values. But this way is not so efficient, and removing data may lead to a loss of
information, which will not give an accurate output.
• By calculating the mean: Here, we calculate the mean of the column or
row which contains a missing value and put it in the place of the missing value.
This strategy is useful for features which have numeric data such as age,
salary, year, etc. Here, we will use this approach.
Python code
• To handle missing values, we will use the Scikit-learn library in our code, which
contains various modules for building machine learning models. Here we will use the
SimpleImputer class of the sklearn.impute module. (Older tutorials use the Imputer class of
sklearn.preprocessing, which was removed in scikit-learn 0.22.)
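A minimal sketch of mean imputation with SimpleImputer is given below. The age and salary values are hypothetical, with one missing entry in each column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with missing values as NaN.
x = np.array([[38.0, 68000.0],
              [43.0, 45000.0],
              [np.nan, 54000.0],
              [48.0, np.nan]])

# Replace every NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x_filled = imputer.fit_transform(x)

print(x_filled)
```

The missing age becomes 43.0, the mean of 38, 43 and 48, and the missing salary becomes the mean of the three known salaries.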
Encoding Categorical data:

• Categorical data is data which has some categories. In our dataset, there
are two categorical variables, Country and Purchased.
• A machine learning model works entirely on mathematics and numbers, so it is
necessary to encode these categorical variables into numbers.
For Country variable:
• Firstly, we will convert the country values into numeric codes. To do this, we
will use the LabelEncoder() class from the preprocessing library.
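A minimal sketch of this step follows. The country values are hypothetical, and the variable name label_encoder_x is chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical Country column.
country = np.array(["India", "France", "Germany", "France", "Germany", "India"])

label_encoder_x = LabelEncoder()
country_codes = label_encoder_x.fit_transform(country)

print(label_encoder_x.classes_)  # ['France' 'Germany' 'India']
print(country_codes)             # [2 0 1 0 1 2]
```

The codes follow the alphabetical order of the categories, so they carry no meaning beyond identifying the category.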
Explanation

• In the encoding code, we import the LabelEncoder class of the sklearn library.
This class encodes the variables into digits.
• In our case, there are three country values, and they are encoded into 0, 1, and 2.
• From these values, the machine learning model may assume that there is some
order or numeric relationship between the countries, which will produce the wrong output. So to
remove this issue, we will use dummy encoding.
Dummy Variables

• Dummy variables are variables which have the value 0 or 1. A value of 1 marks
the presence of a category in its column, and the remaining columns are set to 0.
With dummy encoding, we will have a number of columns equal to the number of
categories.
• In our dataset, we have 3 categories, so it will produce three columns having 0 and
1 values. For Dummy Encoding, we will use the OneHotEncoder class of the
preprocessing library.
After encoding, all the country values are represented by the numbers 0 and 1, divided into three columns.
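A minimal sketch of dummy encoding with OneHotEncoder is shown below, on a hypothetical Country column. The encoder expects a 2D input and returns a sparse matrix by default, so the result is converted with toarray() for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical Country column, shaped as a single-column 2D array.
country = np.array([["India"], ["France"], ["Germany"], ["France"]])

encoder = OneHotEncoder()
country_dummies = encoder.fit_transform(country).toarray()

print(encoder.categories_)   # column order: France, Germany, India
print(country_dummies)
```

Each row contains a single 1, in the column of its country, and 0 elsewhere.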
Resulting Datasets
• For Purchased Variable:
For the second categorical variable, we will only use a labelencoder object of the LabelEncoder class. Here we are not
using the OneHotEncoder class because the Purchased variable has only two categories, yes and no, which are
encoded directly into 0 and 1.
Splitting the Dataset into the Training set and Test set

• In machine learning data preprocessing, we divide our dataset into a training set
and a test set. This is one of the crucial steps of data preprocessing, as it lets us
check how well our machine learning model performs on data it has not seen.
• If we train our machine learning model on one dataset and then test it
on a completely different dataset, the model may struggle, because the
relationships it learned may not hold in the new data.
• If we train our model very well and its training accuracy is very high, but its
performance drops when we provide a new dataset to it, the model has overfitted. So we always try
to make a machine learning model which performs well with the training set and
also with the test dataset. Here, we can define these datasets as:
• Training Set: A subset of dataset to train the machine learning model, and we
already know the output.
• Test set: A subset of dataset to test the machine learning model, and by using the
test set, model predicts the output.
For splitting the dataset, we will use the train_test_split() function of scikit-learn.
It divides the x and y variables into 4 different variables
with corresponding values.
• The train_test_split() function is imported first. It is used for splitting arrays of the dataset into random train
and test subsets.
• The call returns four variables:
• x_train: features for the training data
• x_test: features for testing data
• y_train: Dependent variables for training data
• y_test: Dependent variables for testing data
• In train_test_split() function, we pass four parameters, of which the first two
are the arrays of data, and test_size specifies the size of the test set. The
test_size may be .5, .3, or .2, which gives the fraction of the data placed in the
test set.
• The last parameter random_state is used to set a seed for a random generator so
that you always get the same result. Any fixed integer works, and a commonly used value is 42.
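The split described above can be sketched as follows. The arrays x and y are hypothetical placeholders, and test_size=0.2 with random_state=0 are assumed values.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and a binary target.
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the samples for testing; fix the seed for repeatability.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

print(x_train.shape, x_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```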
Feature Scaling
• Feature scaling is the final step of data preprocessing in machine learning. It is a
technique to standardize the independent variables of the dataset in a specific
range. In feature scaling, we put our variables in the same range and on the same
scale so that no variable dominates the others.
• Consider a dataset with age and salary columns: the age and salary
column values are not on the same scale. Many machine learning
models are based on Euclidean distance, and if we do not scale
the variables, it will cause issues in our machine learning model.
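Euclidean Distance

The Euclidean distance between two points $A(x_1, y_1)$ and $B(x_2, y_2)$ is $\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
If we compute the distance between any two samples from age and
salary, then salary values will dominate the age
values, and it will produce an incorrect result. So
to remove this issue, we need to perform feature
scaling for machine learning.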
Feature Scaling Techniques
For feature scaling, we will use the StandardScaler class of the sklearn.preprocessing library.
• Now, we will create the object of StandardScaler class for independent variables
or features. And then we will fit and transform the training dataset.

For the test dataset, we will directly apply the transform() function instead of fit_transform(), so that the test data is
scaled with the mean and standard deviation already learned from the training set.
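A minimal sketch of this fit-on-train, transform-on-test pattern is given below. The age and salary values, the split parameters, and the variable name st_x are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, salary) features and a binary target.
x = np.array([[38, 68000], [43, 45000], [30, 54000], [48, 65000],
              [40, 61000], [35, 58000], [50, 83000], [37, 67000]], dtype=float)
y = np.array([0, 1, 0, 1, 1, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0)

st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)  # learn mean/std from training data only
x_test = st_x.transform(x_test)        # reuse the training mean/std

print(x_train.mean(axis=0))  # approximately [0, 0]
print(x_train.std(axis=0))   # approximately [1, 1]
print(x_test)
```

Fitting the scaler on the training set only keeps information from the test set out of the preprocessing step.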
Output

By executing the above lines of code, we will get the scaled values for x_train and
x_test. After scaling, each column of x_train has a mean of zero and a standard deviation
of one, so most values lie within a few units of zero. Standardized values are not
confined to a fixed range such as -1 to 1.
Thank You
