Machine Learning Notes
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable, so the
outcome must be a categorical or discrete value such as Yes or No, 0 or 1, True or
False. Instead of giving the exact value 0 or 1, however, it gives a probability
that lies between 0 and 1.
Linear regression is utilized for regression tasks, while logistic regression helps
accomplish classification tasks. Supervised machine learning is a widely used machine
learning technique that predicts future outcomes or events. It uses labeled datasets to learn and
generate accurate predictions.
The logistic function, also called the sigmoid function, was developed by statisticians
to describe properties of population growth in ecology, rising quickly and maxing out at the
carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued
number and map it into a value between 0 and 1, but never exactly at those limits.
1 / (1 + e^-value)
Where e is the base of the natural logarithms (Euler’s number or the EXP() function in
your spreadsheet) and value is the actual numerical value that you want to transform. Below is
a plot of the numbers between -5 and 5 transformed into the range 0 and 1 using the logistic
function.
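As a minimal sketch of the transformation described above, the snippet below applies the logistic function to the integers from -5 to 5. The function name `sigmoid` and the use of NumPy are choices made here for illustration.

```python
import numpy as np

def sigmoid(value):
    """Logistic (sigmoid) function: 1 / (1 + e^-value)."""
    return 1.0 / (1.0 + np.exp(-value))

# Transform the integers from -5 to 5 into the open interval (0, 1).
values = np.arange(-5, 6)
transformed = sigmoid(values)

for v, t in zip(values, transformed):
    print(f"{v:+d} -> {t:.4f}")
```

The outputs rise monotonically, pass through 0.5 at an input of 0, and approach but never reach 0 and 1.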
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "Low", "Medium", or "High".
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))
x = input value
y = predicted output
b0 = bias or intercept term
b1 = coefficient for input (x)
This equation is similar to linear regression, where the input values are combined
linearly to predict an output value using weights or coefficient values. However, unlike
linear regression, the output value modeled here is a binary value (0 or 1) rather than a
numeric value.
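A minimal sketch of how b0 and b1 combine with x to produce a prediction. The coefficient values and the 0.5 cut-off are hypothetical, chosen only for illustration; in practice the coefficients are learned from training data.

```python
import math

def predict_probability(x, b0, b1):
    """Return 1 / (1 + e^-(b0 + b1*x)), the modeled probability of class 1."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -4.0, 2.0

for x in (0.0, 2.0, 4.0):
    y = predict_probability(x, b0, b1)
    label = 1 if y >= 0.5 else 0  # 0.5 is a common, but not mandatory, cut-off
    print(f"x={x:.1f}  probability={y:.3f}  predicted class={label}")
```

The linear part b0 + b1*x can be any real number; the logistic function turns it into a probability, and the cut-off turns the probability into a class.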
7. What is a classification problem?
8. What are the problems in using the linear regression approach for classification?
There are two things that explain why Linear Regression is not suitable for
classification. The first one is that Linear Regression deals with continuous values whereas
classification problems mandate discrete values. The second is that the decision threshold
shifts when new data points are added.
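The threshold shift can be illustrated with a small sketch. The toy data and the helper `linear_threshold` are assumptions made for this example: a straight line is fitted to 0/1 labels, and the point where it crosses 0.5 is used as the decision threshold.

```python
import numpy as np

def linear_threshold(x, y):
    """Fit y = m*x + c by least squares and return the x where the line crosses 0.5."""
    m, c = np.polyfit(x, y, deg=1)
    return (0.5 - c) / m

# Toy 1-D data: class 0 for small x, class 1 for large x.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
before = linear_threshold(x, y)

# Add one correctly labelled but extreme point of class 1.
x_new = np.append(x, 30.0)
y_new = np.append(y, 1.0)
after = linear_threshold(x_new, y_new)

print(f"threshold before: {before:.2f}, after: {after:.2f}")
```

The extra point is consistent with the existing pattern, yet it drags the fitted line and moves the threshold past x = 4, so a point that was classified correctly is now misclassified.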
Overfitting occurs when our machine learning model tries to cover all the data points,
or more than the required data points, present in the given dataset. Because of this, the model
starts capturing noise and inaccurate values present in the dataset, and all these factors reduce
the efficiency and accuracy of the model. The overfitted model has low bias and high
variance.
13 MARKS
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:
o We start from the equation of a straight line:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so we work with the odds,
which are 0 for y = 0 and tend to infinity as y approaches 1:
y / (1 - y)
o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
odds, and the equation becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for Logistic Regression.
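A short numerical check of the relationship, as a sketch: taking the logarithm of the odds of a logistic output recovers the linear value it was computed from. The function names `logit` and `sigmoid` are chosen here for illustration.

```python
import math

def sigmoid(z):
    """Map a real number z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(y):
    """Log of the odds y / (1 - y); the inverse of the sigmoid."""
    return math.log(y / (1.0 - y))

for z in (-3.0, 0.0, 2.5):
    y = sigmoid(z)
    print(f"z={z:+.1f}  y={y:.4f}  log(y/(1-y))={logit(y):+.4f}")
```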
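On the basis of the categories, Logistic Regression can be classified into three
types: Binomial (two possible classes), Multinomial (3 or more unordered classes) and
Ordinal (3 or more ordered classes).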
Example: A dataset contains information about various users
obtained from social networking sites. A car-making company has
recently launched a new SUV and wants to check how many users in
the dataset want to purchase the car.
o For this problem, we will build a Machine Learning model using the Logistic
regression algorithm. The dataset is shown in the image below. In this problem, we
will predict the purchased variable (Dependent Variable) by using age and salary
(Independent variables).
2. EXPLAIN, WITH A CLEAR EXAMPLE, THE STEPS USED IN LOGISTIC
REGRESSION.
Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the categorical
dependent variable using a given set of independent variables.
Logistic Regression is very similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
Example: A dataset contains information about various users
obtained from social networking sites. A car-making company has recently
launched a new SUV and wants to check how many users in the dataset
want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic
regression algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we will
use the same steps as we have done in previous topics of Regression. Below are the steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we can
use it in our code efficiently. It will be the same as we have done in Data pre-processing topic.
The code for this is given below:
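A minimal sketch of this step. The source does not name the data file, so a small synthetic table stands in for it; the column names, the 400-row size, and the rule that generates the Purchased column are all assumptions. The aliases `nm` and `pd` follow the names used later in the text.

```python
import numpy as nm
import pandas as pd

# The source does not name the data file, so a synthetic table stands in for it.
# With a real file this step would typically be a single pd.read_csv(...) call.
rng = nm.random.default_rng(0)
n_users = 400
age = rng.integers(18, 61, size=n_users)
salary = rng.integers(15_000, 150_001, size=n_users)

# Hypothetical rule: older, better-paid users are more likely to purchase.
score = 0.1 * (age - 40) + (salary - 80_000) / 40_000 + rng.normal(0, 1, n_users)

data_set = pd.DataFrame({
    "User ID": nm.arange(1, n_users + 1),
    "Gender": rng.choice(["Male", "Female"], size=n_users),
    "Age": age,
    "EstimatedSalary": salary,
    "Purchased": (score > 0).astype(int),
})
print(data_set.head())
```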
By executing the above lines of code, we will get the dataset as the output. Consider the
given image:
Now, we will extract the dependent and independent variables from the given
dataset. Below is the code for it:
In the above code, we have taken [2, 3] for x because our independent variables are age
and salary, which are at index 2, 3. And we have taken 4 for y variable because our dependent
variable is at index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:
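The extraction and splitting steps can be sketched as follows. The synthetic table, the 25% test ratio, and the random seed are assumptions made for illustration; only the column positions (2 and 3 for x, 4 for y) come from the text.

```python
import numpy as nm
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the column layout described in the text:
# index 2 = age, index 3 = salary, index 4 = purchased.
rng = nm.random.default_rng(0)
n_users = 400
data_set = pd.DataFrame({
    "User ID": nm.arange(n_users),
    "Gender": rng.choice(["Male", "Female"], size=n_users),
    "Age": rng.integers(18, 61, size=n_users),
    "EstimatedSalary": rng.integers(15_000, 150_001, size=n_users),
    "Purchased": rng.integers(0, 2, size=n_users),
})

# Extract the independent variables (columns 2, 3) and the dependent variable (column 4).
x = data_set.iloc[:, [2, 3]].values
y = data_set.iloc[:, 4].values

# Hold out 25% of the rows as a test set; the ratio and seed are assumptions.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)
print(x_train.shape, x_test.shape)
```

With 400 rows and a 25% split, the test set has 100 rows, which matches the total of 100 predictions counted in the confusion matrix later in these notes.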
# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)  # fit on the training set only
x_test = st_x.transform(x_test)        # reuse the training-set statistics
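To train the model on the training set, we will import the LogisticRegression class of
the sklearn library. After importing the class, we will create a classifier object and use it
to fit the logistic regression model to the training set. Below is the code for it: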
Output: By executing the above code, we will get the below output:
Our model is well trained on the training set, so we will now predict the result by
using test set data. Below is the code for it:
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the
variable explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to
purchase or not purchase the car.
Now we will create the confusion matrix here to check the accuracy of the
classification. To create it, we need to import the confusion_matrix function of the
sklearn library. After importing the function, we will call it and store the result in a new
variable cm. The function takes two parameters, mainly y_true (the actual values) and y_pred
(the values predicted by the classifier). Below is the code for it:
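The training, prediction, and confusion-matrix steps can be sketched together as follows. The data is synthetic (an assumption standing in for the age/salary dataset), so the numbers it prints will differ from the 65, 24, 8 and 3 reported in these notes.

```python
import numpy as nm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the age/salary data (an assumption, not the source data).
rng = nm.random.default_rng(0)
n_users = 400
age = rng.integers(18, 61, size=n_users)
salary = rng.integers(15_000, 150_001, size=n_users)
score = 0.1 * (age - 40) + (salary - 80_000) / 40_000 + rng.normal(0, 1, n_users)
x = nm.column_stack([age, salary]).astype(float)
y = (score > 0).astype(int)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

# Fit the classifier on the training set.
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# Predict the test set and compare the predictions with the true labels.
y_pred = classifier.predict(x_test)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
accuracy = cm.trace() / cm.sum()       # correct predictions / all predictions
print(cm)
print(f"accuracy = {accuracy:.2f}")
```

The diagonal of `cm` holds the correct predictions and the off-diagonal entries hold the errors, so the accuracy is the diagonal sum divided by the total.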
Output:
By executing the above code, a new confusion matrix will be created. Consider the below
image:
We can find the accuracy of the predicted result by interpreting the confusion matrix.
From the above output, 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect,
which corresponds to an accuracy of 89/100 = 89%.
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
In the above code, we have imported the ListedColormap class of the Matplotlib library
to create the colormap for visualizing the result. We have created two new
variables x_set and y_set to replace x_train and y_train. After that, we have used
the nm.meshgrid command to create a rectangular grid, which extends from 1 below the
minimum to 1 above the maximum of each feature. The grid points are spaced at a resolution of 0.01.
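A minimal sketch of such a plot. The data here is synthetic and already scaled (an assumption standing in for x_train and y_train), the plotting details are illustrative choices, and the non-interactive Agg backend is selected only so the sketch runs without a display. The aliases `nm` and `mtp` follow the names used in the text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as mtp
import numpy as nm
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression

# Synthetic, already-scaled training data standing in for (x_train, y_train).
rng = nm.random.default_rng(0)
x_set = rng.normal(size=(300, 2))
y_set = (x_set[:, 0] + x_set[:, 1] + rng.normal(0, 0.5, 300) > 0).astype(int)
classifier = LogisticRegression().fit(x_set, y_set)

# Grid from (min - 1) to (max + 1) of each feature at 0.01 resolution.
x1, x2 = nm.meshgrid(
    nm.arange(x_set[:, 0].min() - 1, x_set[:, 0].max() + 1, 0.01),
    nm.arange(x_set[:, 1].min() - 1, x_set[:, 1].max() + 1, 0.01),
)

# Predict the class of every grid point and colour the two regions.
grid = nm.column_stack([x1.ravel(), x2.ravel()])
regions = classifier.predict(grid).reshape(x1.shape)
mtp.contourf(x1, x2, regions, alpha=0.75, cmap=ListedColormap(["purple", "green"]))

# Overlay the observations, coloured by their true class.
for label, colour in zip((0, 1), ("purple", "green")):
    mtp.scatter(x_set[y_set == label, 0], x_set[y_set == label, 1],
                color=colour, edgecolors="white", label=label)
mtp.title("Logistic Regression (Training set)")
mtp.xlabel("Age (scaled)")
mtp.ylabel("Estimated Salary (scaled)")
mtp.legend()
```

In an interactive session, removing the `matplotlib.use("Agg")` line and calling `mtp.show()` at the end displays the figure.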
Output: By executing the above code, we will get the below output:
o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the
result for purchased variables.
o This graph is made by using two independent variables, i.e., Age on the x-axis
and Estimated salary on the y-axis.
o The purple point observations are those for which purchased (dependent variable) is 0,
i.e., users who did not purchase the SUV car.
o The green point observations are those for which purchased (dependent variable) is 1,
i.e., users who purchased the SUV car.
o We can also estimate from the graph that users who are younger with a low salary did not
purchase the car, whereas older users with a high estimated salary purchased the car.
o But there are some purple points in the green region (buying the car) and some green points
in the purple region (not buying the car). So we can say that some younger users with a high
estimated salary purchased the car, whereas some older users with a low estimated salary did not
purchase the car.
We have successfully visualized the training set result for the logistic regression, and our
goal for this classification is to divide the users who purchased the SUV car and who did not
purchase the car. So from the output graph, we can clearly see the two regions (Purple and Green)
with the observation points. The Purple region is for those users who didn't buy the car, and
Green Region is for those users who purchased the car.
Linear Classifier:
As we can see from the graph, the decision boundary is a straight line, because
Logistic Regression is a linear model. In further topics, we will learn about non-linear
classifiers.
Our model is well trained using the training dataset. Now, we will visualize the result for
new observations (Test set). The code for the test set will remain same as above except that here
we will use x_test and y_test instead of x_train and y_train. Below is the code for it:
Output:
The above graph shows the test set result. As we can see, the graph is divided into
two regions (Purple and Green). Most Green observations are in the green region, and most Purple
observations are in the purple region, so we can say it is a good prediction and model. Some
of the green and purple data points are in the wrong region; these are the misclassifications
already counted in the confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this
classification problem.
3. What is cost function in logistic regression?
Logistic regression:
The cost function summarizes how well the model is behaving. In other
words, we use the cost function to measure how close the model’s predictions are to the actual
outputs.
In linear regression, we use mean squared error (MSE) as the cost function. But in logistic
regression, using the mean of the squared differences between actual and predicted outcomes
as the cost function can give a wavy, non-convex surface containing many local optima.
In that case, gradient descent is not guaranteed to find the optimal solution. Instead, we
use a logarithmic function to represent the cost of logistic regression. It is convex in the
model parameters, so it has no local minima other than the global one, which allows us to run
the gradient descent algorithm.
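When dealing with a binary classification problem, the logarithmic cost of error
depends on the value of y. We can define the cost for the two cases separately:
cost = -log(Ŷ) if y = 1
cost = -log(1 - Ŷ) if y = 0
where Ŷ is the predicted probability that y = 1.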
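A minimal sketch of the two-case logarithmic cost and of the equivalent single expression averaged over several examples. The function names and the sample probabilities are chosen here for illustration.

```python
import numpy as np

def log_loss_single(y, y_hat):
    """Cost for one example: -log(y_hat) if y == 1, -log(1 - y_hat) if y == 0."""
    return -np.log(y_hat) if y == 1 else -np.log(1.0 - y_hat)

def log_loss_mean(y_true, y_prob):
    """Average of the per-example costs, written as a single expression."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# A confident correct prediction is cheap; a confident wrong one is expensive.
print(log_loss_single(1, 0.9))
print(log_loss_single(1, 0.1))
print(log_loss_mean([1, 0, 1], [0.9, 0.2, 0.6]))
```

Because y is either 0 or 1, exactly one of the two terms in the combined expression is active for each example, so it reproduces the two cases.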
In linear regression, we use the mean squared error, the average squared difference between
y_predicted and y_actual, which is derived from the maximum likelihood estimator.
In logistic regression Ŷ is a non-linear function (Ŷ = 1 / (1 + e^(-z))). If we use this in the
MSE equation, it will give a non-convex graph with many local minima.
The problem here is that with such a cost function gradient descent can get stuck in a local
minimum, so we may miss the global minimum and end up with a larger error.
In order to solve this problem, we derive a different cost function for logistic
regression called log loss which is also derived from the maximum likelihood
estimation method.
In the next section, we’ll talk a little bit about the maximum likelihood estimator and
what it is used for. We’ll also try to see the math behind this log loss function.
4. DERIVATION OF COST FUNCTION WITH REQUIRED STEPS?
COST FUNCTION:
The cost function summarizes how well the model is behaving. In other
words, we use the cost function to measure how close the model’s predictions are to
the actual outputs.
Before we derive our cost function, we’ll first find the derivative of our sigmoid function.
Step-1: Use chain rule and break the partial derivative of log-likelihood.
Now since we have our derivative of the cost function, we can write our gradient descent
algorithm as:
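A sketch of the standard derivation, written with assumed notation: θ for the parameters, α for the learning rate, m for the number of training examples, and ŷ = σ(z) for the prediction with z = θᵀx. It shows the sigmoid derivative, the chain-rule step on the log-likelihood of one example, and the resulting gradient descent update.

```latex
\begin{align*}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \qquad
\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \\
\ell(\theta) &= y \log \hat{y} + (1 - y)\log(1 - \hat{y}), \qquad
\hat{y} = \sigma(z),\; z = \theta^{\top} x \\
\frac{\partial \ell}{\partial \theta_j}
  &= \frac{\partial \ell}{\partial \hat{y}}\,
     \frac{\partial \hat{y}}{\partial z}\,
     \frac{\partial z}{\partial \theta_j}
   = \left(\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right)
     \hat{y}\,(1 - \hat{y})\, x_j
   = (y - \hat{y})\, x_j \\
J(\theta) &= -\frac{1}{m}\sum_{i=1}^{m} \ell_i(\theta), \qquad
\frac{\partial J}{\partial \theta_j}
   = \frac{1}{m}\sum_{i=1}^{m} \bigl(\hat{y}^{(i)} - y^{(i)}\bigr)\, x_j^{(i)} \\
\theta_j &:= \theta_j - \alpha\,\frac{\partial J}{\partial \theta_j}
\end{align*}
```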
If the slope is negative (downward slope), gradient descent adds some value
to the parameter, directing it towards the minimum point of the convex curve.
Whereas if the slope is positive (upward slope), gradient descent subtracts some value from
the parameter, again directing it towards the minimum.
We have successfully calculated our Cost Function. But we need to minimize the
loss to make a good predicting algorithm. To do that, we have the Gradient Descent
Algorithm.
Here we have plotted a graph between J(θ) and θ. Our objective is to find the deepest
point (global minimum) of this function, which is where J(θ) is minimum.
To take a step, compute the derivative of J(θ) with respect to θ and multiply it by the
learning rate (α). Now, you need to subtract the result from θ to get the new θ.
Taking derivatives is simple; the basic calculus you did in high
school is enough. The major issue is with the learning rate (α). Choosing a good learning rate is
important and often difficult.
If you take a very small learning rate, each step will be too small, and hence you will
take a lot of time to reach the minimum.
If you take a very large learning rate, you may overshoot the minimum
and fail to converge. There is no specific rule for the perfect learning rate. You need to
tune it to get the best model.
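A minimal sketch of gradient descent for logistic regression. The toy data, the learning rate of 0.1, and the 500 iterations are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Log loss J(theta), averaged over all examples."""
    p = sigmoid(X @ theta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_descent(X, y, learning_rate=0.1, n_iters=500):
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        gradient = X.T @ (sigmoid(X @ theta) - y) / len(y)  # dJ/dtheta
        theta = theta - learning_rate * gradient             # step downhill
        history.append(cost(theta, X, y))
    return theta, history

# Toy data: a bias column of ones plus one feature (values are illustrative).
rng = np.random.default_rng(0)
feature = rng.normal(size=200)
X = np.column_stack([np.ones(200), feature])
y = (feature + rng.normal(0, 0.5, 200) > 0).astype(float)

theta, history = gradient_descent(X, y)
print(f"cost: {history[0]:.4f} -> {history[-1]:.4f}, theta = {theta}")
```

Re-running with a much smaller `learning_rate` shows the cost falling more slowly over the same number of iterations.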
So think about all those calculations! It’s massive, and hence there was a need for a
slightly modified Gradient Descent Algorithm, namely – Stochastic Gradient Descent
Algorithm (SGD).
The only difference between SGD and normal Gradient Descent is that, in SGD, we do not
process the entire training set at once. Instead, we compute the gradient of the
cost function for just a single random example at each iteration.
Now, doing so brings down the time taken for computations by a huge margin especially for
large datasets. The path taken by SGD is very haphazard and noisy (although a noisy path may
give us a chance to reach global minima).
But that is okay, since we do not have to worry about the path taken.
Mini-Batch Gradient Descent takes a smaller batch of the entire dataset at each step and
minimizes the loss on it.
This process is often more efficient in practice than either of the two Gradient Descent
variants above. The batch size can of course be anything you want.
Small batch sizes are commonly recommended, and 32 is a frequently suggested starting point;
Keras, for example, uses batch size = 32 as its default.
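The three variants differ only in how many examples feed each update, which the following sketch makes explicit through a `batch_size` parameter. The function name, toy data, learning rate, and epoch count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gradient_descent(X, y, batch_size=32, learning_rate=0.1,
                               n_epochs=50, seed=0):
    """batch_size=1 gives SGD; batch_size=len(y) gives full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(n)              # visit examples in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = sigmoid(X[idx] @ theta) - y[idx]
            theta -= learning_rate * X[idx].T @ error / len(idx)
    return theta

# Toy data (illustrative only): bias column plus one informative feature.
rng = np.random.default_rng(1)
feature = rng.normal(size=500)
X = np.column_stack([np.ones(500), feature])
y = (feature + rng.normal(0, 0.5, 500) > 0).astype(float)

for size in (1, 32, len(y)):
    theta = minibatch_gradient_descent(X, y, batch_size=size)
    accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1))
    print(f"batch_size={size:3d}  theta={theta}  accuracy={accuracy:.3f}")
```

Smaller batches make more updates per pass over the data, each based on a noisier estimate of the gradient.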
Let’s assume you want to predict the future price movements of a stock.
You then decide to gather the historic daily prices of the stock for the last 10 days and
plot the stock price on a scatter plot as shown below:
The chart above shows that the actual stock prices are somewhat random.
To capture the stock price movements, you assess and gather data for the following 16
features which you know the stock price is dependent on:
1. Company’s profits
3. Company’s dividends
9. Inflation
Once the data is gathered, cleaned, scaled and transformed, you split the data into
training and test data sets. Furthermore, you feed the training data into your machine learning
model to train it.
Once the model is trained, you decide to test the accuracy of your model by passing in
test data set.
The chart above shows that the actual stock prices are random. However, the predicted
stock price is a smooth curve. It has not fit itself too closely to the training set and therefore it is
capable of generalising to unseen data better.
However, let’s assume you plot actual vs predicted stock prices and you get
the following charts:
This means that the algorithm has a very strong pre-conception of the data. It implies that it has
high-bias. This is known as under-fitting. These models are not good for predicting new data.
This is the other extreme. It might look as if it’s doing a great job at predicting the
stock price. However, this is known as over-fitting. It is also known as high-variance because
the model has learned the training data so well that it cannot generalise to make predictions on
new and unseen data. These models are not good for predicting new data. If we feed the model
new data then its accuracy will end up being extremely poor. It can also indicate that we are
not training our model with enough data.
Overfitting is when your model has over-trained itself on the data that is fed to train
it. It could be because there are way too many features in the data or because we have not
supplied enough data. A typical symptom is that the difference between the actual and predicted
values on the training data is close to 0.
The models that have been over-fit on the training data do not generalize well to new
examples. They are not good at predicting unseen data.
This implies that they are extremely accurate during training and yield very poor
results during the prediction of unseen data.
If an error measure such as mean squared error is substantially lower during
training of the model and deteriorates on the test data set, then it implies that your
model is over-fitting.
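The gap between training and test error can be illustrated with a small sketch. The data (noisy samples of a sine curve) and the polynomial degrees are assumptions chosen for illustration; a polynomial fit stands in for any model whose flexibility can be increased.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth curve (a stand-in for any real data set)."""
    x = np.sort(rng.uniform(-1, 1, n))
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(12)   # deliberately small training set
x_test, y_test = make_data(200)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_error = mse(y_train, np.polyval(coeffs, x_train))
    test_error = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree={degree}  train MSE={train_error:.4f}  test MSE={test_error:.4f}")
```

A model whose training error keeps shrinking while its test error stays well above it shows the pattern described above.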
We can randomly remove the features and assess the accuracy of the algorithm
iteratively but it is a very tedious and slow process. There are essentially four common ways to
reduce over-fitting.
1. Reduce Features:
The most obvious option is to reduce the features. You can compute the correlation
matrix of the features and drop one of each pair of features that are highly correlated with each other:
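import matplotlib.pyplot as plt
# `dataframe` is assumed to be an existing pandas DataFrame of numeric features
plt.matshow(dataframe.corr())  # heat map of pairwise feature correlations
plt.colorbar()                 # legend mapping colours to correlation values
plt.show()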
You can also use model selection algorithms, which choose the features of
greater importance. The problem with these techniques is that we might end up losing valuable
information.
You should aim to feed enough data to your models so that the models are trained,
tested and validated thoroughly. Aim to give 60% of the data to train the model, 20% of the data
to test and 20% of the data to validate the model.
3. Regularization:
The aim of regularization is to keep all of the features but impose a constraint on the
magnitude of the coefficients.
It is preferred because you do not have to drop features; large coefficients are penalized instead.
When the constraints are applied to the parameters, the model is less prone to over-fitting as it
produces a smoother function.
Regularization parameters, known as penalty factors, are introduced to control
the parameters and ensure that the model is not over-training itself on the training data.
The penalty pushes the coefficients towards smaller values to reduce overfitting: when the
coefficients take large values, the regularization term increases the value of the optimization function.
There are two common regularization techniques:
1. LASSO
Lasso is a feature selection tool and it can completely eliminate non-important features. It adds a
penalty proportional to the sum of the absolute values of the coefficients. This ensures that the features
do not end up applying high weight on the prediction of the algorithm. As a result, some of the
weights will end up being exactly zero. This means that the data of some of the features will not be
used in the algorithm.
from sklearn import linear_model
model = linear_model.Lasso(alpha=0.1)
model.fit([[0,0], [1, 1], [2, 2]], [0, 1, 2])
2. RIDGE
Ridge adds a penalty proportional to the sum of the squares of the coefficients. As a result, some of the
weights will be very close to 0, but generally not exactly 0, which smooths the effect of the features.
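from sklearn.linear_model import Ridge
# X (feature matrix) and y (targets) are not defined in the notes; toy values are assumed here
X, y = [[0, 0], [1, 1], [2, 2]], [0, 1, 2]
model = Ridge(alpha=1.0)
model.fit(X, y)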
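The difference between the two penalties can be seen by fitting both models to the same data, as in the sketch below. The synthetic dataset (only 3 of 10 features carry signal) and alpha = 1.0 are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of the 10 features carry signal (an assumption
# made for illustration; alpha=1.0 is likewise an arbitrary choice).
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("exact zeros  lasso:", int(np.sum(lasso.coef_ == 0)),
      " ridge:", int(np.sum(ridge.coef_ == 0)))
```

Lasso sets the coefficients of uninformative features to exactly zero, while Ridge only shrinks them towards zero.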
We use many algorithms such as Naïve Bayes, Decision trees, SVM, Random forest
classifier, KNN, and logistic regression for classification. But we might learn about only a few of
them here because our motive is to understand multiclass classification. So, using a few algorithms
we will try to cover almost all the relevant concepts related to multiclass classification.
Naive Bayes
Naive Bayes is a parametric algorithm which means it requires a fixed set of parameters or
assumptions to simplify the machine’s learning process. In parametric algorithms, the number of
parameters used is independent of the size of training data.
It assumes that the features of a dataset are completely independent of each other. This is
generally not true, which is why we call it a ‘naïve’ algorithm.
It is a classification model based on conditional probability and uses Bayes theorem to predict the
class of unknown data points. This model is mostly used for large datasets as it is easy to build and is
fast for both training and making predictions. Moreover, it often gives reasonable results even
without hyperparameter tuning.
Naïve Bayes can also be an extremely good text classifier; it performs well, for example, on
spam/ham datasets.
Bayes theorem is written as: P(A|B) = P(B|A) P(A) / P(B)
By P(A|B), we are trying to find the probability of event A given that event B is true. It is
also known as posterior probability.
Event B is known as evidence.
P(A) is called the prior of A, which means it is the probability of the event before the evidence is seen.
P (B|A) is known as conditional probability or likelihood.
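A worked example of Bayes theorem with hypothetical numbers for a spam filter, where A is "the message is spam" and B is "the message contains a given word". All probabilities are invented for illustration.

```python
# Hypothetical numbers for a spam filter, chosen only to illustrate the formula.
p_spam = 0.2                 # P(A): prior probability that a message is spam
p_word_given_spam = 0.6      # P(B|A): likelihood of the word in spam
p_word_given_ham = 0.05      # P(B|not A): likelihood of the word in ham

# P(B): total probability of seeing the word (the evidence).
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): posterior probability that a message containing the word is spam.
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.2f}")
```

Seeing the word raises the probability of spam from the prior of 0.20 to a posterior of 0.75.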
Note: Naïve Bayes is often a linear classifier, which might not be suitable for classes that are not
linearly separable in a dataset. Let us look at the figure below:
As can be seen in Fig.2b, classifiers such as KNN can be used for non-linear classification
instead of the Naïve Bayes classifier.
KNN is a supervised machine learning algorithm that can be used to solve both
classification and regression problems. It is one of the simplest yet most powerful algorithms.
It does not learn a discriminative function from the training data but memorizes the training
data instead. For this reason, it is also known as a lazy algorithm.
How does it work?
The K-nearest neighbor algorithm takes a majority vote among the K most
similar instances, and it uses a distance metric between two data points to define how
similar they are. The most popular choice is the Euclidean distance, which is written as:
d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)
K in KNN is a hyperparameter that we choose to get the best possible
fit for the dataset. If we keep the smallest value for K, i.e. K=1, then the model will show low
bias but high variance, because it will be overfitted. Whereas a larger
value for K, let's suppose K=10, will smooth our decision boundary, which means
low variance but high bias. So we always go for a trade-off between the bias and variance,
known as the bias-variance trade-off.
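A minimal from-scratch sketch of the idea: compute the Euclidean distance from a query point to every training point, keep the K nearest, and take a majority vote. The function names and the toy points are assumptions made for illustration.

```python
import math
from collections import Counter

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    distances = sorted(
        (euclidean_distance(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: two small clusters (values are illustrative only).
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, query=(2.0, 2.0), k=3))  # A
print(knn_predict(points, labels, query=(6.0, 7.0), k=3))  # B
```

There is no training step: every prediction scans the stored data, which is why KNN is called lazy and why prediction becomes costly on large datasets.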
Let us understand more about it by looking at its advantages and disadvantages:
Advantages-
Disadvantages-
K value is difficult to find as it must work well with test data also, not only with the
training data
It is a lazy algorithm as it does not build any model
It is computationally expensive because it measures the distance to every data point
Decision Trees
As the name suggests, the decision tree is a tree-like structure of decisions made
based on some conditional statements. It is one of the most used supervised learning
methods in classification problems because of its high accuracy, stability, and easy
interpretation. Decision trees can map linear as well as non-linear relationships well.
Let us look at the figure below, Fig.3, where we have used adult census income dataset with
two independent variables and one dependent variable. Our target or dependent variable is
income, which has binary classes i.e, <=50K or >50K.
We can see that the algorithm works based on some conditions, such as Age <50 and
Hours>=40, to further split into two buckets for reaching towards homogeneity. Similarly, we
can move ahead for multiclass classification problem datasets, such as Iris data.
Now a question arises: how should we decide which column to take first, and
what threshold to split on? For splitting a node and deciding the threshold, we
use entropy or the Gini index as measures of the impurity of a node. We aim to maximize the purity
or homogeneity on each split, as we saw in the above diagram.
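A minimal sketch of the two impurity measures and of how a candidate split is scored. The function names and the toy label counts are assumptions made for illustration; the class names reuse the two income classes mentioned above.

```python
import math
from collections import Counter

def class_proportions(labels):
    counts = Counter(labels)
    return [count / len(labels) for count in counts.values()]

def gini(labels):
    """Gini index: 1 - sum(p_k^2); 0 for a pure node."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)); 0 for a pure node."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

def weighted_impurity(left, right, measure):
    """Impurity after a split, weighting each child by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)

# Toy node with the two income classes from the text (counts are illustrative).
parent = ["<=50K"] * 6 + [">50K"] * 4
left = ["<=50K"] * 5 + [">50K"]
right = ["<=50K"] + [">50K"] * 3

print(f"parent gini={gini(parent):.3f}  entropy={entropy(parent):.3f}")
print(f"split  gini={weighted_impurity(left, right, gini):.3f}  "
      f"entropy={weighted_impurity(left, right, entropy):.3f}")
```

A tree-building algorithm would evaluate many candidate columns and thresholds this way and keep the split with the lowest weighted impurity.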
training.
(Figure adapted from Srivastava, Nitish, et al., “Dropout: A Simple Way to Prevent Neural
Networks from Overfitting”.)
Dropout is used as a regularization technique: it prevents overfitting by
ensuring that no units are codependent (more on this later).
When we apply dropout to a neural network, we’re creating a “thinned” network with
unique combinations of the units in the hidden layers being dropped randomly at different
points in time during training. Each time the gradient of our model is updated, we generate a
new thinned neural network with different units dropped based on a probability
hyperparameter p. Training a network using dropout can thus be viewed as training loads of
different thinned neural networks and merging them into one network that picks up the key
properties of each thinned network.
This process allows dropout to reduce the overfitting of models on training data.
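A minimal NumPy sketch of applying dropout to one layer's activations. It uses the "inverted dropout" variant, which rescales the surviving units during training; this is a common implementation choice and not necessarily the exact formulation in the paper. The parameter is named `drop_prob` here to make explicit that it is the probability of dropping a unit, which is an assumption about the meaning of p.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob during training
    and rescale the survivors so the expected activation is unchanged."""
    if drop_prob == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # stand-in for one hidden layer's activations

# Each call samples a different mask, i.e. a different "thinned" network.
thinned_a = dropout(hidden, drop_prob=0.5, rng=rng)
thinned_b = dropout(hidden, drop_prob=0.5, rng=rng)
print(thinned_a)
print(thinned_b)
```

At prediction time no units are dropped, so the function would be called with `drop_prob=0.0` or skipped entirely.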
This graph, taken from the paper “Dropout: A Simple Way to Prevent Neural
Networks from Overfitting” by Srivastava et al., compares the change in classification error of
models without dropout to the same models with dropout (keeping all other hyperparameters
constant). All the models have been trained on the MNIST dataset.
It is observed that the models with dropout had a lower classification error than
the same models without dropout at any given point in time. A similar trend was observed
when the models were trained on other datasets in vision, as well as speech recognition and
text analysis.
The lower error is because dropout helps prevent overfitting on the training data
by reducing the reliance of each unit in the hidden layer on other units in the hidden layers.
These diagrams, taken from the same paper, show the features learned by an
autoencoder on MNIST with one layer of 256 units without dropout (a) and the features learned by
an identical autoencoder that used a dropout of p = 0.5 (b). It can be observed in figure a that
the units don’t seem to pick up on any meaningful feature, whereas in figure b, the units seem
to have picked up on distinct edges and spots in the data provided to them.
This indicates that dropout helps break co-adaptations among units, and each unit can
act more independently when dropout regularization is used. In other words, without dropout,
the network would never be able to catch a unit A compensating for another unit B’s flaws.
With dropout, at some point unit A would be ignored and the training accuracy would decrease
as a result, exposing the inaccuracy of unit B.
15 marks
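The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We start from the equation of a straight line:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In Logistic Regression y can be between 0 and 1 only, so we work with the odds,
which are 0 for y = 0 and tend to infinity as y approaches 1:
y / (1 - y)
o But we need a range between -[infinity] and +[infinity], so we take the logarithm of the
odds, and the equation becomes:
log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn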
On the basis of the categories, Logistic Regression can be classified into three types:
o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered
types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of
dependent variables, such as "Low", "Medium", or "High".
Logistic Regression is very similar to Linear Regression except in how they
are used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
Example: A dataset contains information about various users
obtained from social networking sites. A car-making company has recently
launched a new SUV and wants to check how many users in the dataset
want to purchase the car.
For this problem, we will build a Machine Learning model using the Logistic regression
algorithm. The dataset is shown in the below image. In this problem, we will predict
the purchased variable (Dependent Variable) by using age and salary (Independent
variables).
Steps in Logistic Regression: To implement the Logistic Regression using Python, we
will use the same steps as we have done in previous topics of Regression. Below are the
steps:
1. Data Pre-processing step: In this step, we will pre-process/prepare the data so that we
can use it in our code efficiently. It will be the same as we have done in the Data
pre-processing topic. The code for this is given below:
By executing the above lines of code, we will get the dataset as the output. Consider the
given image:
Now, we will extract the dependent and independent variables from the given dataset. Below is the
code for it:
In the above code, we have taken [2, 3] for x because our independent variables are age and salary,
which are at index 2, 3. And we have taken 4 for y variable because our dependent variable is at
index 4. The output will be:
Now we will split the dataset into a training set and test set. Below is the code for it:
# Feature scaling
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)  # fit on the training set only
x_test = st_x.transform(x_test)        # reuse the training-set statistics
We have prepared our dataset, and now we will train the model using the training set. To
fit the model to the training set, we will import
the LogisticRegression class of the sklearn library.
After importing the class, we will create a classifier object and use it to fit the logistic
regression model to the training set. Below is the code for it:
Output: By executing the above code, we will get the below output:
Our model is well trained on the training set, so we will now predict the result by using test set data.
Below is the code for it:
In the above code, we have created a y_pred vector to predict the test set result.
Output: By executing the above code, a new vector (y_pred) will be created under the variable
explorer option. It can be seen as:
The above output image shows the corresponding predicted users who want to purchase or not
purchase the car.
Now we will create the confusion matrix here to check the accuracy of the classification. To create it,
we need to import the confusion_matrix function of the sklearn library. After importing the function,
we will call it and store the result in a new variable cm. The function takes two parameters, mainly
y_true (the actual values) and y_pred (the values predicted by the classifier). Below is the code for it:
Output:
By executing the above code, a new confusion matrix will be created. Consider the below image:
We can find the accuracy of the predicted result by interpreting the confusion matrix. From the above
output, 65 + 24 = 89 predictions are correct and 8 + 3 = 11 are incorrect, which corresponds to an
accuracy of 89/100 = 89%.
Finally, we will visualize the training set result. To visualize the result, we will
use ListedColormap class of matplotlib library. Below is the code for it:
In the above code, we have imported the ListedColormap class of the Matplotlib library to create the
colormap for visualizing the result. We have created two new variables x_set and y_set to
replace x_train and y_train. After that, we have used the nm.meshgrid command to create a
rectangular grid, which extends from 1 below the minimum to 1 above the maximum of each feature.
The grid points are spaced at a resolution of 0.01.
To create a filled contour, we have used mtp.contourf command, it will create regions of provided
colors (purple and green). In this function, we have passed the classifier.predict to show the
predicted data points predicted by the classifier.
Output: By executing the above code, we will get the below output:
o In the above graph, we can see that there are some Green points within the green region
and Purple points within the purple region.
o All these data points are the observation points from the training set, which shows the result
for purchased variables.
o This graph is made by using two independent variables i.e., Age on the x-axis and Estimated
salary on the y-axis.
o The purple point observations are those for which purchased (dependent variable) is 0,
i.e., users who did not purchase the SUV car.
o The green point observations are those for which purchased (dependent variable) is 1,
i.e., users who purchased the SUV car.
o We can also estimate from the graph that users who are younger with a low salary did not
purchase the car, whereas older users with a high estimated salary purchased the car.
o But there are some purple points in the green region (buying the car) and some green points
in the purple region (not buying the car). So we can say that some younger users with a high
estimated salary purchased the car, whereas some older users with a low estimated salary did not
purchase the car.
We have successfully visualized the training set result for the logistic regression, and our goal for this
classification is to divide the users who purchased the SUV car and who did not purchase the car. So
from the output graph, we can clearly see the two regions (Purple and Green) with the observation
points. The Purple region is for those users who didn't buy the car, and Green Region is for those
users who purchased the car.
Linear Classifier:
As we can see from the graph, the decision boundary is a straight line, because Logistic Regression
is a linear model. In further topics, we will learn about non-linear classifiers.
Our model is well trained using the training dataset. Now, we will visualize the result for new
observations (Test set). The code for the test set will remain same as above except that here we will
use x_test and y_test instead of x_train and y_train. Below is the code for it:
Output:
The above graph shows the test set result. As we can see, the graph is divided into two regions
(Purple and Green). Most Green observations are in the green region, and most Purple observations are in
the purple region, so we can say it is a good prediction and model. Some of the green and purple data
points are in the wrong region; these are the misclassifications already counted in
the confusion matrix (11 incorrect outputs).
Hence our model is pretty good and ready to make new predictions for this classification problem.