Session-11 Machine Learning
Introduction
In linear regression, the target variable is quantitative, whereas classification models deal with qualitative (categorical) data. Algorithms for solving a classification problem first predict the probability of each category of the qualitative variable and use those probabilities as the basis for the classification. Since probabilities are continuous numbers, classification via probabilities also behaves like a regression method. Logistic regression is one such classification model, used to classify the dependent variable into two or more classes or categories.
Let’s suppose you took a survey and noted each person’s response as Satisfied, Neutral or Not Satisfied. Let’s map each category to a number:
Satisfied – 2
Neutral – 1
Not Satisfied – 0
But this doesn’t mean that the gap between Not Satisfied and Neutral is the same as the gap between Neutral and Satisfied; the mapping has no mathematical significance. We could equally well map the categories as:
Satisfied – 0
Neutral – 1
Not Satisfied – 2
It’s completely fine to choose the above mapping instead. But if we apply linear regression to the two mappings, we get two different sets of predictions. We can also get prediction values like 1.2, 0.8 or 2.3, which make no sense for categorical values. So there is no natural way to convert qualitative data into quantitative data for use in linear regression.
For binary classification, however, i.e. when there are only two categorical values, the least squares method can give decent results. Suppose we have two categories, Black and White, and we map them as follows:
Black – 0
White – 1
We can then classify based on the predicted value: if Ŷ > 0.5, the observation goes to class White, and vice versa. However, some predictions will be greater than 1 or less than 0, making them hard to assign to either class. So linear regression can work decently for binary classification but not well for multi-class classification. Hence, we use dedicated classification methods for such problems.
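To make this concrete, here is a minimal sketch (synthetic data; scikit-learn assumed available) showing how least squares on a 0/1-coded target produces predictions outside the [0, 1] range:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic one-feature binary problem: Black = 0, White = 1
X = np.array([[1], [2], [3], [8], [9], [10]])
y = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(X, y)
print(lin.predict([[0], [5.5], [12]]))
# Roughly [-0.245, 0.5, 1.381]: values below 0 and above 1
# appear, so they cannot be read as probabilities.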
Logistic Regression
Logistic regression is a regression-based algorithm that can be used for classification problems. It calculates the probability that a given value belongs to a specific class. If the probability is more than 50%, it assigns the value to that class; otherwise, the value is assigned to the other class. Therefore, logistic regression acts as a binary classifier.
But if we used the straight-line equation from linear regression (y = β0 + β1x) to calculate this probability, we would get values less than 0 as well as greater than 1, which make no sense as probabilities. So we need an equation that always gives values between 0 and 1, as we desire when calculating a probability.
Sigmoid function
Logistic regression achieves this by passing the output of the linear equation through the sigmoid function, which has three useful properties:
1. The sigmoid function’s range is bounded between 0 and 1, so its output can be read directly as a probability.
2. Its derivative is easier to calculate than those of many other functions, which is useful during gradient descent.
3. It is a simple way of introducing non-linearity into the model.
Although other functions can also be used, the sigmoid is the most common choice for logistic regression. We will talk about the other functions in the neural network section.
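For reference, the sigmoid is σ(z) = 1 / (1 + e^(−z)), and its derivative takes the convenient form σ′(z) = σ(z)(1 − σ(z)). A minimal NumPy sketch (function names are mine, for illustration):

import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid(np.array([-10, 0, 10])))  # ≈ [0.0000454, 0.5, 0.9999546]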
The performance of a classification model is evaluated using metrics such as:
Accuracy
Recall
Precision
F1 Score
Specificity
AUC (Area Under the Curve)
ROC (Receiver Operating Characteristic)
Classification Report
Confusion Matrix
These metrics are defined in terms of four basic counts:
True Positive (TP): a result that was predicted as positive by the classification model and is actually positive.
True Negative (TN): a result that was predicted as negative by the classification model and is actually negative.
False Positive (FP): a result that was predicted as positive by the classification model but is actually negative.
False Negative (FN): a result that was predicted as negative by the classification model but is actually positive.
The credibility of the model rests on how many of its predictions are correct.
Accuracy
The mathematical formula is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Or, it is the total number of correct classifications divided by the total number of classifications. Accuracy is not a reliable metric for imbalanced data: in binary classification a model can show a high accuracy simply because it is biased towards the majority class, since the errors it makes on the minority class barely affect the overall count.
Recall or Sensitivity
Recall = TP / (TP + FN)
Or, as the name suggests, recall measures how many of the total actual positives were correctly predicted by the model. It shows how relevant the model is in terms of positive results only.
Consider a classification model for cancer detection: the model gave 50 correct positive predictions (TP) but failed to identify 200 cancer patients (FN). Recall in that case will be:
Recall = 50 / (50 + 200) = 0.2 (the model was able to recall only 20% of the cancer patients)
Precision
Precision measures how many of all the positive predictions were actually positive. Mathematically,
Precision = TP / (TP + FP)
Let’s suppose that in the previous example the model identified 50 people as cancer patients (TP) but also raised a false alarm for 100 patients (FP). Hence,
Precision = 50 / (50 + 100) = 0.33 (the model only has a precision of 33%)
But we have a problem!
As evident from the previous example, a model can have a very high accuracy (when true negatives dominate) yet perform poorly in terms of precision and recall. So accuracy is not necessarily the right metric for evaluating the model in such a case.
Imagine a scenario where the requirement is that the model recall all the defaulters who did not pay back their loan. Suppose there were 10 such defaulters, and to recall those 10 the model flagged 20 people, of whom only 10 are actual defaulters. The recall of the model is then 100%, but its precision drops to 50%.
F1 Score
From the previous examples, it is clear that we need a metric that considers both Precision
and Recall for evaluating a model. One such metric is the F1 score.
The mathematical formula is:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
Specificity or True Negative Rate
This represents how specific the model is when predicting true negatives. Mathematically,
Specificity = TN / (TN + FP)
Or, it quantifies the number of negatives predicted by the model with respect to the total number of actual negative (non-favourable) outcomes.
Similarly, the False Positive Rate can be defined as 1 − Specificity, or FP / (TN + FP).
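Putting these definitions together, here is a small plain-Python sketch; the four counts are hypothetical, chosen so that true negatives dominate:

# Hypothetical counts taken from a confusion matrix (TN deliberately large)
TP, TN, FP, FN = 50, 1000, 100, 200

accuracy    = (TP + TN) / (TP + TN + FP + FN)                  # 0.778
recall      = TP / (TP + FN)                                   # 0.200 (sensitivity)
precision   = TP / (TP + FP)                                   # 0.333
f1_score    = 2 * precision * recall / (precision + recall)    # 0.250
specificity = TN / (TN + FP)                                   # 0.909
fpr         = 1 - specificity                                  # 0.091 (false positive rate)

print(accuracy, recall, precision, f1_score, specificity, fpr)

Note how the accuracy looks healthy (about 0.78) while recall and precision are poor; this is exactly the imbalance problem described under Accuracy.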
ROC (Receiver Operating Characteristic)
We know that classification algorithms work on the probability of occurrence of the possible outcomes. A probability value lies between 0 and 1: zero means there is no chance of occurrence and one means the occurrence is certain.
But when working with real-world data, we seldom get a perfect 0 or 1; instead we get decimal values lying between them. If we are not getting binary probability values, how do we actually determine the class in our classification problem?
This is where the concept of a threshold comes in. A threshold is set; any probability value below the threshold is classed as a negative outcome, and anything above it as a favourable or positive outcome. For example, if the threshold is 0.5, any probability below 0.5 means a negative or unfavourable outcome, and any value above 0.5 indicates a positive or favourable outcome.
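Sweeping the threshold from 0 to 1 and recording, at each setting, the true positive rate against the false positive rate traces out the ROC curve, and the area under it (AUC) summarises the classifier across all thresholds. A minimal sketch with scikit-learn (the labels and scores are made up for illustration):

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_proba = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_proba)
print(fpr, tpr, thresholds, sep='\n')
print('AUC:', roc_auc_score(y_true, y_proba))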
(The accompanying figure, not reproduced here, plots each person's predicted probability against horizontal threshold lines ranging from 0 to 1.)
Suppose our classification problem is to identify the obese people in the given data. In the figure, the green markers represent obese people and the red markers represent non-obese people. Our confusion matrix depends on the value of the threshold we choose. For example, if 0.25 is the threshold, then TP (actually obese) = 3 and TN (not obese) = 2, while FP (not obese but predicted obese) = 2 (the two red markers above the 0.25 line) and FN (obese but predicted not obese) = 1 (the green marker below the 0.25 line).
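What follows is the hands-on notebook for this session: binary logistic regression on a diabetes dataset (the columns match the well-known Pima Indians Diabetes data). The cell that loads the data was not captured; a minimal sketch of the assumed setup (the file name is a guess):

import pandas as pd

# Assumed input file; the columns shown below match the Pima Indians Diabetes dataset
data = pd.read_csv('diabetes.csv')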
In [6]: data
Out[6]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  ...
0              6      148             72             35        0  33.6  ...
1              1       85             66             29        0  26.6  ...
2              8      183             64              0        0  23.3  ...
3              1       89             66             23       94  28.1  ...
...
767            1       93             70             31        0  30.4  ...
(right-hand columns truncated in the original)
In [7]: data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
In [10]: data.describe()
Out[10]: (summary statistics table; output truncated in the original)
EDA
In [12]: # Univariate analysis
Done! Use 'show' commands to display/save. [100%] 00:01 -> (00:00 left)
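The body of this cell was not captured. The progress message shown is the one printed by the sweetviz auto-EDA library, so the cell presumably looked something like the following (a guess, not the notebook's confirmed code):

import sweetviz as sv

# Build an automated univariate EDA report for every column of the dataframe
report = sv.analyze(data)
report.show_html('eda_report.html')   # saves and opens the interactive report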
In [15]: data.head()
Out[15]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  ...
1            1       85             66             29        0  26.6  ...
3            1       89             66             23       94  28.1  ...
(remaining rows and right-hand columns truncated in the original)
In [19]: data.isnull().sum()
Out[19]: Pregnancies 0
Glucose 0
BloodPressure 0
SkinThickness 0
Insulin 0
BMI 0
DiabetesPedigreeFunction 0
Age 0
Outcome 0
dtype: int64
Although no column contains nulls, zero values in columns such as BMI and BloodPressure are physiologically impossible and act as hidden missing values:
In [20]: data.loc[data['BMI']==0]
Out[20]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin  BMI  ...
9              8      125             96              0        0  0.0  ...
49             7      105              0              0        0  0.0  ...
60             2       84              0              0        0  0.0  ...
81             2       74              0              0        0  0.0  ...
426            0       94              0              0        0  0.0  ...
494            3       80              0              0        0  0.0  ...
(right-hand columns truncated in the original)
In [21]: data['BMI'].mean()
Out[21]: 31.992578124999998
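The imputation step between In [21] and In [23] was not captured. A common treatment, and a plausible reconstruction, is to replace the impossible zeros with the column mean computed above:

# Replace the zero BMI values (hidden missing data) with the column mean
data['BMI'] = data['BMI'].replace(0, data['BMI'].mean())

# The other zero-inflated measurement columns are often treated the same way
for col in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin']:
    data[col] = data[col].replace(0, data[col].mean())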
In [23]: data.describe()
Out[23]: (summary statistics after imputation; output truncated in the original)
In [24]: data
Out[24]: (dataframe display; output truncated in the original)
Outliers
In [30]: data.columns
data1 = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
         'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
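The inspection cells under this heading were not captured; a typical way to look for outliers, sketched here with seaborn, is one boxplot per feature:

import matplotlib.pyplot as plt
import seaborn as sns

# Draw a boxplot for every feature column (skip the 'Outcome' label)
for col in data1[:-1]:
    sns.boxplot(x=data[col])
    plt.title(col)
    plt.show()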
Feature selection
In [41]: data
Out[41]: (dataframe display; output truncated in the original)
In [43]: X
Out[43]: (feature matrix without the Outcome column; output truncated in the original)
In [46]: X_scaled
Out[48]: LogisticRegression()
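The cells that separate the features from the target, scale them, split the data and fit the model were not captured. A minimal sketch of the assumed steps (the variable names follow the outputs shown; test_size and random_state are guesses, though 25% reproduces the 192-row test set seen below):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Separate the features and the binary target
X = data.drop(columns='Outcome')
y = data['Outcome']

# Standardise the features so that no column dominates the optimisation
X_scaled = StandardScaler().fit_transform(X)

# 576 training rows and 192 test rows out of 768
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, random_state=42)

LR = LogisticRegression()
LR.fit(X_train, y_train)

y_train_pred = LR.predict(X_train)   # predictions on the training set
y_pred = LR.predict(X_test)          # predictions on the held-out test set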
In [51]: y_train_pred
Out[51]: array([1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1,
0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1,
0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0,
1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1,
1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1,
0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0,
1, 0, 0, 0], dtype=int64)
In [53]: y_pred
Out[53]: array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0], dtype=int64)
In [55]: y_pred.shape
Out[55]: (192,)
In [56]: data
Out[56]: (dataframe display; output truncated in the original)
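The metric-computation cells were not captured; given the outputs that follow, they plausibly resembled the sketch below (the confusion table shown in Out[68] is a pandas crosstab):

from sklearn import metrics

accuracy  = metrics.accuracy_score(y_test, y_pred)     # Out[59]
precision = metrics.precision_score(y_test, y_pred)    # Out[61]
recall    = metrics.recall_score(y_test, y_pred)       # Out[64]
f1_score  = metrics.f1_score(y_test, y_pred)           # Out[67]
confusion = pd.crosstab(y_test, y_pred)                # Out[68]
auc       = metrics.roc_auc_score(y_test, y_pred)      # Out[71]
report    = metrics.classification_report(y_test, y_pred)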
Out[59]: 0.78125
In [61]: precision
Out[61]: 0.7708333333333334
In [64]: recall
Out[64]: 0.5441176470588235
In [67]: f1_score
Out[67]: 0.6379310344827587
Out[68]:
col_0 0 1
Outcome
0 113 11
1 31 37
In [71]: auc
Out[71]: 0.7277039848197343
In [75]: print(report)
(classification report output not captured in the original)
Multiclass classification
In [80]: import pandas as pd
import seaborn as sns
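The loading cell was not captured; the seaborn import and the column names that follow suggest the iris dataset bundled with seaborn:

# Assumed loading step: seaborn ships the iris dataset used below
data = sns.load_dataset('iris')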
In [82]: data
Out[82]: (150 × 5 dataframe with columns sepal_length, sepal_width, petal_length, petal_width, species; rows truncated in the original)
In [84]: data.species.unique()
Out[84]: array(['setosa', 'versicolor', 'virginica'], dtype=object)
In [85]: data.species.value_counts()
Out[85]: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
In [90]: X
Out[90]: (feature matrix with columns sepal_length, sepal_width, petal_length, petal_width; rows truncated in the original)
In [92]: y
Out[92]: 0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...
145 virginica
146 virginica
147 virginica
148 virginica
149 virginica
Name: species, Length: 150, dtype: object
In [94]: X_new
In [103]: y_train
Out[103]: 4 setosa
32 setosa
142 virginica
85 versicolor
86 versicolor
...
71 versicolor
106 virginica
14 setosa
92 versicolor
102 virginica
Name: species, Length: 112, dtype: object
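The split-and-fit cells were not captured; a plausible sketch (X_new is presumably the prepared feature matrix from In [94]; test_size and random_state are guesses, though 25% reproduces the 112/38 split seen in the outputs):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 112 training rows and 38 test rows out of 150
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y, test_size=0.25, random_state=42)

# One-vs-rest: fit one binary logistic model per species
LR = LogisticRegression(multi_class='ovr')
LR.fit(X_train, y_train)

y_hat = LR.predict(X_test)   # predicted species for the test rows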
Out[106]: LogisticRegression(multi_class='ovr')
In [108]: y_hat
In [112]: LR.predict(new)
In [113]: X_test.shape
Out[113]: (38, 4)
Out[115]:
col_0 setosa versicolor virginica
species
setosa 15 0 0
versicolor 0 10 1
virginica 0 0 12
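The final metric cells were not captured; given the numbers that follow, they plausibly were (average='weighted' applied to the crosstab above reproduces Out[120] exactly):

from sklearn import metrics

print(metrics.accuracy_score(y_test, y_hat))                       # Out[118]
print(metrics.precision_score(y_test, y_hat, average='weighted'))  # Out[120]
print(metrics.classification_report(y_test, y_hat))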
Out[118]: 0.9736842105263158
Out[120]: 0.9757085020242916
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        15
  versicolor       1.00      0.91      0.95        11
   virginica       0.92      1.00      0.96        12

    accuracy                           0.97        38
   macro avg       0.97      0.97      0.97        38
weighted avg       0.98      0.97      0.97        38