EEE436
Introduction to Machine Learning
Khairul Alam
Professor, EEE Department
East West University
Aftabnagar, Dhaka, Bangladesh
Logistic Regression
Logistic Regression is commonly used to estimate the probability that an instance belongs to a
particular class. Example: what is the probability that this email is spam?
If the estimated probability is greater than a threshold value, then the model predicts that the instance
belongs to that class, or else it predicts that it does not.
The model's raw output must be converted to a probability. For this, logistic
regression uses the sigmoid function, which maps any real number t to a value between 0 and 1:
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
If the threshold value is 0.5, classifying an instance is simple: σ(t) > 0.5 exactly when t > 0.
So if t is positive the instance belongs to the class; otherwise it does not.
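The threshold rule above can be sketched in a few lines of NumPy. This is only an illustration: the helper names `sigmoid` and `classify` and the sample scores are assumptions, not part of the course material.

```python
import numpy as np

def sigmoid(t):
    """Map a real-valued score t to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-t))

def classify(t, threshold=0.5):
    """Return 1 where the estimated probability exceeds the threshold, else 0."""
    return (sigmoid(t) > threshold).astype(int)

# Illustrative scores (assumed values, not course data)
scores = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
probs = sigmoid(scores)
labels = classify(scores)

print("probabilities:", probs)
print("labels:", labels)
```

With a threshold of 0.5, the label is 1 only for strictly positive scores, since a score of exactly 0 gives a probability of exactly 0.5.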
| SL | Linear Regression | Logistic Regression |
|---|---|---|
| 1 | Linear regression is used to predict a continuous dependent variable from a given set of independent variables. | Logistic regression is used to predict a categorical dependent variable from a given set of independent variables. |
| 2 | It is used for solving regression problems. | It is used for solving classification problems. |
| 3 | We predict the value of a continuous variable. | We predict the value of a categorical variable. |
| 4 | We find the best-fit line. | We find the S-curve. |
| 5 | The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc. |
For illustration let us look at the following table that shows the number of hours each
student spent studying, and whether they passed (1) or failed (0).
Let us define the odds of success as the probability that the event happens divided by the
probability that it does not happen:
$$\text{odds} = \frac{P(\text{event happens})}{1 - P(\text{event happens})} = \frac{p}{1 - p}$$
The logarithm of the odds of success transforms p from the interval between 0 and
1 to a continuous axis from -∞ to +∞:
$$\ln\left(\frac{p}{1 - p}\right) = z$$
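As a quick check of the relationship above, the sketch below computes the odds and log-odds for a few probabilities and confirms that the sigmoid inverts the log-odds. The function names `odds` and `logit` and the sample probabilities are illustrative assumptions.

```python
import numpy as np

def odds(p):
    """Odds of success: p / (1 - p)."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds: maps p in (0, 1) onto the whole real line."""
    return np.log(odds(p))

def sigmoid(z):
    """Inverse of the log-odds: maps z back to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.1, 0.5, 0.9])   # illustrative probabilities
z = logit(p)                     # negative, zero, positive
p_back = sigmoid(z)              # recovers the original probabilities

print("odds:", odds(p))
print("log-odds:", z)
print("recovered p:", p_back)
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, which is why the 0.5 threshold coincides with z = 0.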
Cost Function
The cost function (or loss function) in logistic regression is called log loss or cross-entropy, defined as follows:
$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$
$$z_i = \theta_0 + \theta_1 x_i, \qquad p_i = \frac{1}{1 + e^{-z_i}}$$
If we use mean squared error as the cost function, the loss is not convex in θ: it can have
local minima in addition to the global minimum, and optimization is problematic.
The cross-entropy function, however, is convex in θ, so gradient descent
does not get trapped in a local minimum.
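A minimal sketch of the log loss for the one-feature model z = θ0 + θ1 x follows. It uses the hours-studied data and the rounded coefficients from this course's pass/fail example; the function name `log_loss` is an assumption made for illustration.

```python
import numpy as np

def log_loss(theta0, theta1, x, y):
    """Cross-entropy loss of a one-feature logistic model."""
    z = theta0 + theta1 * x
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hours studied and pass (1) / fail (0) labels
x = np.array([10.0, 15.0, 20.0, 28.0, 32.0, 40.0])
y = np.array([0, 0, 0, 1, 1, 1])

loss_zero = log_loss(0.0, 0.0, x, y)        # every p_i = 0.5, so the loss is ln 2
loss_fit = log_loss(-15.42, 0.643, x, y)    # rounded fitted coefficients

print("loss with theta = 0:", loss_zero)
print("loss with fitted theta:", loss_fit)
```

An uninformative model (all probabilities 0.5) has a loss of ln 2, and a model that separates the two groups well has a much smaller loss.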
Gradient of Cost Function
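Gradient of the cost function of logistic regression:
$$L(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\sigma_i) + (1 - y_i) \log(1 - \sigma_i) \right]$$
$$z_i = \theta_0 + \theta_1 x_i, \qquad p_i = \sigma_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}}$$
Derivative with respect to θ0:
$$\frac{\partial \sigma_i}{\partial \theta_0} = \sigma(z_i)\left[1 - \sigma(z_i)\right] \frac{\partial z_i}{\partial \theta_0} = \sigma_i (1 - \sigma_i)$$
$$\frac{\partial L}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i (1 - \sigma_i) + (1 - y_i)\, \sigma_i \right]$$
Derivative with respect to θ1:
$$\frac{\partial \sigma_i}{\partial \theta_1} = \sigma(z_i)\left[1 - \sigma(z_i)\right] \frac{\partial z_i}{\partial \theta_1} = \sigma_i (1 - \sigma_i)\, x_i$$
$$\frac{\partial L}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i (1 - \sigma_i) + (1 - y_i)\, \sigma_i \right] x_i = \frac{1}{m} \sum_{i=1}^{m} \left[ -y_i + \sigma_i \right] x_i$$
Gradient in matrix form:
$$J(\theta) = \frac{1}{m} X^{T} \left( -Y + \sigma(X\theta) \right)$$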
Logistic Regression Algorithm
Gradient of the loss function in vector form:
$$J(\theta) = \frac{1}{m} X^{T} \left( -Y + \sigma(X\theta) \right)$$
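One way to gain confidence in the vector-form gradient is to compare it with a finite-difference approximation of the loss. The sketch below does this on a small made-up data set; the data, the starting θ, and the helper names are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, X, Y):
    """Cross-entropy loss L(theta)."""
    s = sigmoid(X @ theta)
    return float(-np.mean(Y * np.log(s) + (1 - Y) * np.log(1 - s)))

def gradient(theta, X, Y):
    """Vector-form gradient: (1/m) X^T (-Y + sigma(X theta))."""
    m = X.shape[0]
    return X.T @ (-Y + sigmoid(X @ theta)) / m

# Small illustrative data set (assumed values)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
X = np.c_[np.ones(len(x)), x]                     # column of ones for the intercept
Y = np.array([0, 0, 1, 0, 1, 1]).reshape(-1, 1)
theta = np.array([[0.1], [-0.2]])

# Central finite differences, one parameter at a time
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j, 0] = eps
    numeric[j, 0] = (loss(theta + step, X, Y) - loss(theta - step, X, Y)) / (2 * eps)

analytic = gradient(theta, X, Y)
print("analytic:", analytic.ravel())
print("numeric: ", numeric.ravel())
```

If the two vectors agree to several decimal places, the formula and its implementation are consistent with the loss.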
Logistic Regression Example
Consider the data set as follows:

| Hours studied | Pass (1) / Fail (0) |
|---|---|
| 10 | 0 |
| 15 | 0 |
| 20 | 0 |
| 28 | 1 |
| 32 | 1 |
| 40 | 1 |

import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([10, 15, 20, 28, 32, 40]).reshape(6, 1)
y = np.array([0, 0, 0, 1, 1, 1])

log_reg = LogisticRegression()
log_reg.fit(x, y)

log_reg.intercept_
Out[8]: array([-15.41760222])

log_reg.coef_
Out[9]: array([[0.64318441]])

The model can be written as follows:
$$\ln\left(\frac{p}{1 - p}\right) = z = -15.42 + 0.643\,x$$
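Question #1: Find the probability of passing for a student who studies 30 hours.
$$\ln\left(\frac{p}{1 - p}\right) = z = -15.42 + 0.643\,x$$
$$\ln\left(\frac{p}{1 - p}\right) = -15.42 + 0.643 \times 30 = 3.87$$
$$p = \frac{1}{1 + e^{-3.87}} \approx 0.98$$

Question #2: Find the minimum hours of study for a pass probability of 0.6.
$$z = \ln\left(\frac{p}{1 - p}\right) = \ln\left(\frac{0.6}{1 - 0.6}\right) = 0.405$$
$$z = -15.42 + 0.643\,x$$
$$x = \frac{0.405 + 15.42}{0.643} = 24.6 \text{ hours}$$

The corresponding scikit-learn calls are:
xnew = np.array([30]).reshape(1, 1)
log_reg.predict(xnew)         # predicted class (0 or 1)
log_reg.predict_proba(xnew)   # [P(fail), P(pass)]
log_reg.score(x, y)           # accuracy on the training data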
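The hand calculations for both questions can be reproduced with a few lines of NumPy. This sketch uses the rounded coefficients −15.42 and 0.643 from the model equation; the variable names are illustrative.

```python
import numpy as np

theta0, theta1 = -15.42, 0.643   # rounded coefficients of the fitted model

# Question 1: probability of passing after 30 hours of study
z30 = theta0 + theta1 * 30
p30 = 1.0 / (1.0 + np.exp(-z30))

# Question 2: hours needed for a pass probability of 0.6
z_target = np.log(0.6 / (1 - 0.6))        # log-odds of p = 0.6
hours = (z_target - theta0) / theta1      # solve z = theta0 + theta1 * x for x

print("z at 30 hours:", z30)
print("p at 30 hours:", p30)
print("hours for p = 0.6:", hours)
```

Because the coefficients are rounded, the probability may differ slightly in the third decimal place from what `predict_proba` reports for the fitted model.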
Logistic Regression Example
import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# x-data
x = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 5.0, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
x = np.array(x).reshape(len(x), 1)
# y-data
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
# develop the model
log_reg = LogisticRegression()
log_reg.fit(x, y)
# print model parameters
print("intercept value = ", log_reg.intercept_)
print("coefficient value = ", log_reg.coef_)
print("model accuracy score = ", log_reg.score(x, y))
# determine the S curve and plot it
xnew = np.linspace(0, 5, 50).reshape(50, 1)
p = log_reg.predict_proba(xnew)
plt.plot(xnew, p[:, 1])   # column 1 holds P(y = 1)
plt.show()
Logistic Regression Gradient Descent
import numpy as np
import matplotlib.pyplot as plt
# x-data
x = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 5.0, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
one = np.ones((len(x), 1))
x = np.c_[one, x]
# y-data
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1])
y = y.reshape(len(y), 1)
# develop the model using gradient descent
eta = 0.1         # learning rate
num_iter = 5000   # number of iterations
theta = np.random.randn(2, 1)
for _ in range(num_iter):
    arg = np.dot(x, theta)
    sg = 1/(1 + np.exp(-arg))         # sigmoid of X·theta
    gradient = np.dot(x.T, y - sg)    # negative of the loss gradient (1/m absorbed into eta)
    theta = theta + eta*gradient      # adding it therefore decreases the loss
# print model parameters
print("intercept value = ", theta[0])
print("coefficient value = ", theta[1])
# determine the S curve and plot it
xnew = np.linspace(0, 5, 50).reshape(50, 1)
z = theta[0] + theta[1]*xnew
p = 1/(1 + np.exp(-z))
plt.plot(xnew, p)
plt.show()
Confusion Matrix
A confusion matrix is a k × k matrix used for evaluating the performance of a
classification model, where k is the number of target classes. The matrix compares the
actual target values with those predicted by the machine learning model. By definition, a
confusion matrix C is such that C_ij is equal to the number of observations known to be in
group i and predicted to be in group j.
Accuracy measures how often the classifier makes the correct prediction. It is the ratio of
the number of correct predictions to the total number of predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision is the ratio of the number of correctly predicted positives to the
total number of predicted positives. In other words, out of all the instances we
predicted as positive, how many are actually positive?
Precision = TP / (TP + FP)
Recall is the ratio of the number of correctly predicted positives to the
total number of actual positives. In other words, out of all the actual positive
instances, how many did we predict correctly?
Recall = TP / (TP + FN)
The F1 score is a number between 0 and 1 and is the harmonic mean of precision and
recall. We use the harmonic mean because it is not sensitive to extremely large values, unlike
simple averages.
When to use Accuracy / Precision / Recall / F1-Score?
(a) Accuracy is used when the True Positives and True Negatives are more important.
Accuracy is a better metric for balanced data.
(b) Whenever False Positives are much more important, use Precision.
(c) Whenever False Negatives are much more important, use Recall.
(d) F1-Score is used when both False Negatives and False Positives are important.
F1-Score is a better metric for imbalanced data.
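The four metrics discussed above can be computed directly from the entries of a binary confusion matrix. In this sketch the counts (TP = 40, FP = 10, FN = 20, TN = 30) are made-up values and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)            # of the predicted positives, how many are right
    recall = tp / (tp + fn)               # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1

# Made-up counts for illustration
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=20, tn=30)

print("accuracy :", accuracy)
print("precision:", precision)
print("recall   :", recall)
print("F1 score :", f1)
```

The F1 score always lies between precision and recall and sits closer to the smaller of the two, which is the property that makes the harmonic mean useful here.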
Confusion Matrix (Practice Problem)
Softmax Regression
The Logistic Regression model can be generalized to support multiple classes directly,
without having to train and combine multiple binary classifiers. This is called Softmax
Regression, or Multinomial Logistic Regression.
In logistic regression there are only 2 classes, such as whether a student passed or
failed. In softmax regression, the output has more than 2 classes. For example, the
status of a student can be continuing, dropped out, or on probation, which gives 3
classes. Similarly, a student's major can be power, electronics,
communication, or computer.
Just as logistic regression uses the sigmoid function, softmax regression uses the
softmax function to determine the probability of each class:
$$s(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$$
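A minimal NumPy sketch of the softmax function is given below. Subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result; it is an addition here, not something stated in the slides, and the sample scores are made up.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis of z."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)   # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

p = softmax([2.0, 1.0, 0.1])          # illustrative scores
p_large = softmax([1000.0, 1000.0])   # would overflow without the shift

print("probabilities:", p)
print("sum:", p.sum())
print("large scores:", p_large)
```

The outputs are all positive and sum to 1, so they can be read as class probabilities, and the largest score always receives the largest probability.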
Softmax Regression – Developing the concept
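To develop the understanding of softmax regression, we will consider 3 features (n = 3),
6 samples of each feature (m = 6), and 3 classes (K = 3). The output y = 1,
2, 3 means RED, GREEN, and BLUE. The notation is x_{m,n} (sample index, then feature index).
The system in matrix form is as follows.
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{61} & x_{62} & x_{63} \end{bmatrix}, \qquad Y = \begin{pmatrix} 1 \\ 3 \\ 2 \\ 2 \\ 3 \\ 1 \end{pmatrix}$$
We decompose Y into 3 classes, red, green, and blue:
$$Y_{\text{red}} = \begin{pmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{pmatrix}, \qquad Y_{\text{green}} = \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 0 \\ 0 \end{pmatrix}, \qquad Y_{\text{blue}} = \begin{pmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}$$
Now use the logistic regression algorithm to find θ for each class:
$$\theta^{(1)} = \text{logreg.fit}(X, Y_{\text{red}}), \qquad \theta^{(2)} = \text{logreg.fit}(X, Y_{\text{green}}), \qquad \theta^{(3)} = \text{logreg.fit}(X, Y_{\text{blue}})$$
Each θ has 4 components (0, 1, 2, 3). Put them in each column to get an (n+1) × K matrix:
$$\theta = \begin{bmatrix} \theta_0^{(1)} & \theta_0^{(2)} & \theta_0^{(3)} \\ \theta_1^{(1)} & \theta_1^{(2)} & \theta_1^{(3)} \\ \theta_2^{(1)} & \theta_2^{(2)} & \theta_2^{(3)} \\ \theta_3^{(1)} & \theta_3^{(2)} & \theta_3^{(3)} \end{bmatrix}$$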
Our model is ready. Now for prediction, assume a new sample xnew = [x11, x12, x13]. The
steps to follow are: (1) compute z1, z2, z3; (2) apply the softmax function to z to find p1, p2, p3; (3)
pick the class with the maximum p, and that is the output y. For example, if p2 is the maximum, the
output class is GREEN.
Compute z1, z2, z3 (Z = xnew θ, with a leading 1 for the intercept):
$$\begin{pmatrix} z_1 & z_2 & z_3 \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & x_{13} \end{pmatrix} \begin{bmatrix} \theta_0^{(1)} & \theta_0^{(2)} & \theta_0^{(3)} \\ \theta_1^{(1)} & \theta_1^{(2)} & \theta_1^{(3)} \\ \theta_2^{(1)} & \theta_2^{(2)} & \theta_2^{(3)} \\ \theta_3^{(1)} & \theta_3^{(2)} & \theta_3^{(3)} \end{bmatrix}$$
Compute p1, p2, p3:
$$p_k = \frac{\exp(z_k)}{\exp(z_1) + \exp(z_2) + \exp(z_3)}, \qquad k = 1, 2, 3$$
Compute y:
$$y = \arg\max_{k}\,(p_1, p_2, p_3)$$
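The three prediction steps can be sketched as follows. The θ matrix and the new sample are hypothetical numbers chosen only to make the example runnable; they do not come from a fitted model.

```python
import numpy as np

classes = ["RED", "GREEN", "BLUE"]

# Hypothetical (n+1) x K parameter matrix: one column per class, first row = intercepts
theta = np.array([[ 0.2, -0.1,  0.0],
                  [ 0.5,  1.0, -0.5],
                  [-0.3,  0.8,  0.1],
                  [ 0.1,  0.4, -0.2]])

x_new = np.array([1.0, 2.0, 0.5])     # hypothetical sample with 3 features

# Step 1: compute z1, z2, z3 (prepend 1 for the intercept)
x_aug = np.r_[1.0, x_new]
z = x_aug @ theta

# Step 2: softmax turns the scores into probabilities
e = np.exp(z - z.max())
p = e / e.sum()

# Step 3: the predicted class is the one with the largest probability
y = classes[int(np.argmax(p))]

print("z =", z)
print("p =", p)
print("predicted class:", y)
```

Note that the last step uses the index of the maximum probability (argmax), not the maximum value itself.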
Softmax Regression – Gradient Descent
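We can easily extend the logistic regression method by doing some extra work as
follows: (1) keep the X matrix as it is; (2) convert the y vector to a Y matrix of size
m×K; (3) extend θ from a column vector to an (n+1)×K matrix; (4) use the same matrix-form
gradient formula as we did in logistic regression, with σ now denoting the softmax
applied to each row of Xθ.
Gradient Descent Algorithm
Step 1. Create the X matrix as usual:
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{61} & x_{62} & x_{63} \end{bmatrix}$$
Step 2. Convert the y-vector to the Y-matrix (one column per class):
$$y = \begin{pmatrix} 1 \\ 3 \\ 2 \\ 2 \\ 3 \\ 1 \end{pmatrix} \quad \Rightarrow \quad Y = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
Step 3. Initialize the weight matrix:
$$\theta = \begin{bmatrix} \theta_0^{(1)} & \theta_0^{(2)} & \theta_0^{(3)} \\ \theta_1^{(1)} & \theta_1^{(2)} & \theta_1^{(3)} \\ \theta_2^{(1)} & \theta_2^{(2)} & \theta_2^{(3)} \\ \theta_3^{(1)} & \theta_3^{(2)} & \theta_3^{(3)} \end{bmatrix}$$
Step 4. Compute the gradient:
$$J(\theta) = X^{T} \left( Y - \sigma(X\theta) \right)$$
Step 5. Update the weights and continue until convergence:
$$\theta_{\text{new}} = \theta_{\text{old}} + \eta\, J$$
Here J is the negative of the loss gradient (1/m) X^T(−Y + σ(Xθ)) used for logistic regression,
with the factor 1/m absorbed into η, which is why the update adds ηJ.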
Softmax Regression – Example problem
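Below is the height vs. cat (1), dog (2), and lion (3) data. We will use the gradient descent
algorithm to develop a softmax regression model.

| Height [inch] | 5 | 7 | 8 | 9 | 11 | 12 | 13 | 18 | 30 | 32 | 34 | 40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Animal | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 |

m = 12, n = 1, k = 3
$$X = \begin{bmatrix} 1 & 5 \\ 1 & 7 \\ 1 & 8 \\ 1 & 9 \\ \vdots & \vdots \\ 1 & 30 \\ 1 & 32 \\ 1 & 34 \\ 1 & 40 \end{bmatrix}, \qquad Y = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ \vdots & \vdots & \vdots \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}, \qquad \theta = \begin{bmatrix} \theta_{01} & \theta_{02} & \theta_{03} \\ \theta_{11} & \theta_{12} & \theta_{13} \end{bmatrix}$$
X = [m × (n+1)] = [12 × 2]
Y = [m × k] = [12 × 3]
θ = [(n+1) × k] = [2 × 3]
$$J(\theta) = X^{T} \left( Y - \sigma(X\theta) \right)$$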
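The gradient descent setup for this data can be sketched as below. The learning rate, iteration count, zero initialization, and the division of the gradient by m are assumptions chosen to keep the iteration stable on unscaled heights; they are not values from the slides, and the resulting θ need not match what a library solver returns.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of a score matrix."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(theta, X, Y):
    """Mean cross-entropy loss for one-hot targets Y."""
    P = softmax(X @ theta)
    return float(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))

height = np.array([5, 7, 8, 9, 11, 12, 13, 18, 30, 32, 34, 40], dtype=float)
animal = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])   # 1 = cat, 2 = dog, 3 = lion
m, k = len(height), 3

X = np.c_[np.ones(m), height]     # X is 12 x 2
Y = np.eye(k)[animal - 1]         # one-hot Y is 12 x 3
theta = np.zeros((2, k))          # theta is 2 x 3 (assumed zero start)

eta = 0.001        # assumed small learning rate
num_iter = 20000   # assumed iteration count

loss_start = cross_entropy(theta, X, Y)
for _ in range(num_iter):
    J = X.T @ (Y - softmax(X @ theta)) / m   # J = X^T (Y - sigma(X theta)), averaged over m
    theta = theta + eta * J                  # theta_new = theta_old + eta * J
loss_end = cross_entropy(theta, X, Y)

print("theta =\n", theta)
print("loss before:", loss_start, "loss after:", loss_end)
```

Scaling the height feature before training would allow a larger learning rate and faster convergence; the raw values are kept here to stay close to the example.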
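Now use scikit-learn to find the model parameters:
import numpy as np
from sklearn.linear_model import LogisticRegression

# dimensions
m = 12; n = 1; k = 3

# data
x1 = [5, 7, 8, 9, 11, 12, 13, 18, 30, 32, 34, 40]
X = np.reshape(x1, (m, n))
Y = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3])

# With the default solver, a multiclass target is fitted as a multinomial (softmax) model.
# The multi_class="multinomial" argument is deprecated in recent scikit-learn releases.
soft_reg = LogisticRegression()
soft_reg.fit(X, Y)
soft_reg.intercept_
soft_reg.coef_

Arranging the intercepts (first row) and coefficients (second row) in one matrix gives
$$\theta = \begin{bmatrix} 12.1 & 0.127 & -12.23 \\ -0.972 & 0.225 & 0.746 \end{bmatrix}$$
For a new height x_new = 10:
$$z = \begin{bmatrix} 1 & 10 \end{bmatrix} \begin{bmatrix} 12.1 & 0.127 & -12.23 \\ -0.972 & 0.225 & 0.746 \end{bmatrix} = \begin{bmatrix} 2.38 & 2.377 & -4.77 \end{bmatrix}$$
$$e^{z} = \begin{bmatrix} 10.81 & 10.77 & 0.0085 \end{bmatrix}$$
$$p_1 = \frac{10.81}{10.81 + 10.77 + 0.0085} \approx 0.501$$
$$p_2 = \frac{10.77}{10.81 + 10.77 + 0.0085} \approx 0.499$$
$$p_3 = \frac{0.0085}{10.81 + 10.77 + 0.0085} \approx 0.0004$$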
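The hand calculation for x_new = 10 can be checked numerically. This sketch uses the rounded θ values quoted above; the class names in `animals` follow the cat (1), dog (2), lion (3) coding of the example.

```python
import numpy as np

# Rounded parameters: first row = intercepts, second row = coefficients
theta = np.array([[12.1, 0.127, -12.23],
                  [-0.972, 0.225, 0.746]])
animals = ["cat", "dog", "lion"]

x_new = 10
z = np.array([1, x_new]) @ theta   # scores for the three classes
ez = np.exp(z)
p = ez / ez.sum()                  # softmax probabilities
pred = animals[int(np.argmax(p))]

print("z   =", z)
print("e^z =", ez)
print("p   =", p)
print("predicted animal:", pred)
```

At a height of 10 inches the cat and dog probabilities are nearly equal, which is consistent with 10 lying between the tallest cat (9) and the shortest dog (11) in the data.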
Softmax Regression – Example problem
Suppose you have built a device that can detect the 6 digits of the car’s license plate number
through multi-class classification. There are 10 possible classes (0-9) for each of the 6 digits.
The probability of these classes for the 6 digits of a car is given as follows. Determine the
license plate number of the car.
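| Class | Digit 1 | Digit 2 | Digit 3 | Digit 4 | Digit 5 | Digit 6 |
|---|---|---|---|---|---|---|
| P(0) | 0.4 | 0.17 | 0.15 | 0.34 | 0.46 | 0.44 |
| P(1) | 0.18 | 0.11 | 0.28 | 0.58 | 0.88 | 0.16 |
| P(2) | 0.22 | 0.37 | 0.44 | 0.68 | 0.32 | 0.48 |
| P(3) | 0.83 | 0.2 | 0.53 | 0.01 | 0.19 | 0.17 |
| P(4) | 0.08 | 0.49 | 0.46 | 0.6 | 0.47 | 0.16 |
| P(5) | 0.13 | 0.95 | 0.5 | 0.39 | 0.04 | 0.11 |
| P(6) | 0.17 | 0.6 | 0.52 | 0.92 | 0.18 | 0.31 |
| P(7) | 0.39 | 0.12 | 0.63 | 0 | 0.38 | 0.25 |
| P(8) | 0.35 | 0.05 | 0.64 | 0.46 | 0.47 | 0.59 |
| P(9) | 0.2 | 0.04 | 0.96 | 0.42 | 0.15 | 0.15 |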
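One way to work this problem is to pick, for each digit position, the class with the largest probability. The sketch below applies that rule to the table; the array name `P` and the variable names are illustrative.

```python
import numpy as np

# Rows: P(0) ... P(9); columns: Digit 1 ... Digit 6 (values from the table)
P = np.array([
    [0.40, 0.17, 0.15, 0.34, 0.46, 0.44],
    [0.18, 0.11, 0.28, 0.58, 0.88, 0.16],
    [0.22, 0.37, 0.44, 0.68, 0.32, 0.48],
    [0.83, 0.20, 0.53, 0.01, 0.19, 0.17],
    [0.08, 0.49, 0.46, 0.60, 0.47, 0.16],
    [0.13, 0.95, 0.50, 0.39, 0.04, 0.11],
    [0.17, 0.60, 0.52, 0.92, 0.18, 0.31],
    [0.39, 0.12, 0.63, 0.00, 0.38, 0.25],
    [0.35, 0.05, 0.64, 0.46, 0.47, 0.59],
    [0.20, 0.04, 0.96, 0.42, 0.15, 0.15],
])

digits = P.argmax(axis=0)                  # most probable class in each column
plate = "".join(str(d) for d in digits)    # concatenate the six digits

print("most probable digits:", digits)
print("license plate number:", plate)
```

The row index of each column maximum is the predicted digit, because row i of the table holds P(i).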
Confusion Matrix for Multi-class
In multi-class classification, the confusion matrix is k × k instead of 2 × 2.
The following is a confusion matrix with k = 4.
The confusion matrix can be converted into a one-vs-all matrix (a binary-class
confusion matrix) for calculating class-wise metrics like accuracy,
precision, recall, etc. The following is for class 1.
Similarly, for class-2, the converted one-vs-all confusion matrix will look like the following:
Using this concept, we can calculate
the class-wise accuracy, precision,
recall, and F1-score, and tabulate
the results.
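The one-vs-all conversion can be sketched as follows. The 4 × 4 matrix here is a hypothetical example (rows are actual classes, columns are predicted classes), not the matrix shown in the slides, and `one_vs_all` is an illustrative helper name.

```python
import numpy as np

# Hypothetical 4-class confusion matrix: rows = actual, columns = predicted
C = np.array([[50,  3,  2,  5],
              [ 4, 40,  6,  0],
              [ 1,  5, 45,  9],
              [ 2,  2,  3, 23]])

def one_vs_all(C, c):
    """Collapse a k x k confusion matrix into TP, FP, FN, TN for class index c."""
    tp = C[c, c]
    fn = C[c, :].sum() - tp      # actual class c, predicted as something else
    fp = C[:, c].sum() - tp      # predicted class c, actually something else
    tn = C.sum() - tp - fn - fp  # everything not involving class c
    return tp, fp, fn, tn

for c in range(C.shape[0]):
    tp, fp, fn, tn = one_vs_all(C, c)
    accuracy = (tp + tn) / C.sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"class {c + 1}: accuracy={accuracy:.3f} precision={precision:.3f} "
          f"recall={recall:.3f} f1={f1:.3f}")
```

For each class, the diagonal entry is TP, the rest of its row is FN, the rest of its column is FP, and all remaining entries form TN.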
Confusion Matrix – Scikit Learn
The sklearn library provides a confusion_matrix function, which you should import and use:
sklearn.metrics.confusion_matrix(y_true, y_pred, *, labels=None, sample_weight=None, normalize=None)

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
labels = ["ant", "bird", "cat"]

# rows = actual classes, columns = predicted classes, in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(classification_report(y_true, y_pred))

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()
Write the confusion matrix and determine its performance for the following data
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]