Logistic Regression
Sargur N. Srihari
University at Buffalo, State University of New York
USA
Topics in Linear Classification using
Probabilistic Discriminative Models
• Generative vs Discriminative
1. Fixed basis functions
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions
Topics in Logistic Regression
• Logistic Sigmoid and Logit Functions
• Parameters in discriminative approach
• Determining logistic regression parameters
– Error function
– Gradient of error function
– Simple sequential algorithm
– An example
• Generative vs Discriminative Training
– Naïve Bayes vs Logistic Regression
Logistic Sigmoid and Logit Functions
• In the two-class case, the posterior of class C1 can be written as a logistic sigmoid of the feature vector ϕ = [ϕ1,..,ϕM]T:
  p(C1|ϕ) = y(ϕ) = σ(wTϕ), with p(C2|ϕ) = 1 − p(C1|ϕ)
  where σ(·) is the logistic sigmoid function
• Known as logistic regression in statistics, although it is a model for classification rather than regression
• Logit function: the inverse of the sigmoid, a = ln(σ/(1−σ))
  – Also known as log odds, since it is the log of the odds ratio ln[p(C1|ϕ)/p(C2|ϕ)]
  – It links the probability to the predictor variables
• Properties of the logistic sigmoid:
  A. Symmetry: σ(−a) = 1 − σ(a)
  B. Inverse: a = ln(σ/(1−σ)), known as the logit
  C. Derivative: dσ/da = σ(1−σ)
[Figure: plot of the logistic sigmoid σ(a) versus a]
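These three properties can be checked numerically; a minimal NumPy sketch (not part of the original slides, helper names are illustrative):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def logit(s):
    return np.log(s / (1 - s))   # inverse of the sigmoid (log odds)

a = np.linspace(-5, 5, 1001)
s = sigmoid(a)

print(np.allclose(sigmoid(-a), 1 - s))                          # A. symmetry
print(np.allclose(logit(s), a))                                 # B. inverse (logit)
print(np.allclose(np.gradient(s, a), s * (1 - s), atol=1e-3))   # C. derivative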
Fewer Parameters in Linear Discriminative Model
• Discriminative approach (Logistic Regression)
  – For an M-dimensional feature space ϕ: M adjustable parameters
• Generative approach based on Gaussians (Bayes/Naïve Bayes)
  – 2M parameters for the means
  – M(M+1)/2 parameters for the shared covariance matrix
  – 1 parameter for the class prior p(C1), since p(C2) = 1 − p(C1)
  – Total of M(M+5)/2 + 1 parameters, which grows quadratically with M
• If features are assumed independent (naïve Bayes), it still needs M+3 parameters
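As a quick check of the quadratic count above, a small sketch (hypothetical helper name, not from the slides) tallies the generative-model parameters and compares them with M(M+5)/2 + 1:

def gaussian_generative_params(M):
    # Two-class Gaussian generative model with a shared covariance matrix
    means = 2 * M                  # one M-dimensional mean per class
    covariance = M * (M + 1) // 2  # symmetric shared covariance matrix
    prior = 1                      # p(C1); p(C2) = 1 - p(C1)
    return means + covariance + prior

for M in (2, 10, 100):
    print(M, gaussian_generative_params(M), M * (M + 5) // 2 + 1)  # the two counts agree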
Determining Logistic Regression parameters
• Maximum likelihood approach for two classes
• For a data set {(ϕn, tn)}, where tn ∈ {0,1} and ϕn = ϕ(xn), n = 1,..,N
• The likelihood function can be written as
  p(t|w) = ∏n=1..N yn^tn (1 − yn)^(1−tn)
  where t = (t1,..,tN)T and yn = p(C1|ϕn)
• yn is the probability that tn = 1
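As an illustration (synthetic values and variable names not from the slides), the Bernoulli likelihood above can be evaluated directly:

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))        # N = 5 feature vectors phi_n, M = 3
t = np.array([1, 0, 1, 1, 0])        # binary targets t_n
w = np.array([0.5, -0.2, 0.1])       # some weight vector

y = sigmoid(Phi @ w)                 # y_n = p(C1 | phi_n)
likelihood = np.prod(y**t * (1 - y)**(1 - t))
print(likelihood)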
Error Fn for Logistic Regression
• The likelihood function is
  p(t|w) = ∏n=1..N yn^tn (1 − yn)^(1−tn)
• Taking the negative logarithm gives the cross-entropy error function
  E(w) = − ln p(t|w) = − Σn=1..N { tn ln yn + (1 − tn) ln(1 − yn) }
  where yn = σ(an) and an = wTϕn
• We need to minimize E(w)
  – At its minimum, the derivative of E(w) is zero
  – So we need to solve for w in the equation ∇E(w) = 0
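A small standalone check (illustrative names and synthetic values, not from the slides) that the cross-entropy error equals the negative log of the likelihood:

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
t = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
w = np.array([0.5, -0.2, 0.1])

y = sigmoid(Phi @ w)
E = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))       # cross-entropy error
likelihood = np.prod(y**t * (1 - y)**(1 - t))
print(np.isclose(E, -np.log(likelihood)))                  # E(w) = -ln p(t|w)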
What is Cross-entropy?
• Entropy of p(x) is defined as H(p) = − Σx p(x) log p(x)
  – If p(x=1|t) = t and p(x=0|t) = 1 − t, then we can write p(x) = t^x (1−t)^(1−x)
  – Then the entropy of p(x) is H(p) = −[t log t + (1−t) log(1−t)]
• Cross-entropy of p(x) and q(x) is defined as H(p,q) = − Σx p(x) log q(x)
  – If q(x=1|y) = y, then H(p,q) = −[t log y + (1−t) log(1−y)]
• In general H(p,q) = H(p) + DKL(p‖q)
  – where DKL(p‖q) = Σx p(x) log [p(x)/q(x)]
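A quick numerical check of the decomposition H(p,q) = H(p) + DKL(p‖q), sketched with an arbitrary pair of Bernoulli distributions (values are illustrative):

import numpy as np

t, y = 0.7, 0.4                       # p = Bernoulli(t), q = Bernoulli(y)
p = np.array([1 - t, t])
q = np.array([1 - y, y])

H_p  = -np.sum(p * np.log(p))         # entropy H(p)
H_pq = -np.sum(p * np.log(q))         # cross-entropy H(p, q)
D_kl =  np.sum(p * np.log(p / q))     # KL divergence D_KL(p || q)
print(np.isclose(H_pq, H_p + D_kl))   # True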
Gradient of Error Function
• Error function:
  E(w) = − ln p(t|w) = − Σn=1..N { tn ln yn + (1 − tn) ln(1 − yn) }, where yn = σ(wTϕn)
• Using the derivative of the logistic sigmoid, dσ/da = σ(1−σ), the gradient of the error function is
  ∇E(w) = Σn=1..N (yn − tn) ϕn
  – Error × Feature Vector: the contribution to the gradient from data point n is the error between target tn and prediction yn = σ(wTϕn), times the basis vector ϕn
• Proof of the gradient expression (for a single data point, dropping the subscript n):
  – Let En = −(z1 + z2), where z1 = t ln σ(wTϕ) and z2 = (1 − t) ln[1 − σ(wTϕ)]
  – Using dσ/da = σ(1 − σ) and d(ln x)/dx = 1/x:
    dz1/dw = t σ(wTϕ)[1 − σ(wTϕ)] ϕ / σ(wTϕ) = t [1 − σ(wTϕ)] ϕ
    dz2/dw = (1 − t) σ(wTϕ)[1 − σ(wTϕ)] (−ϕ) / [1 − σ(wTϕ)] = −(1 − t) σ(wTϕ) ϕ
  – Therefore dEn/dw = −(dz1/dw + dz2/dw) = (σ(wTϕ) − t) ϕ = (y − t) ϕ
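A minimal sketch (illustrative names, synthetic data, not from the slides) comparing the analytic gradient Σn (yn − tn) ϕn with a finite-difference estimate of E(w):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def error(w, Phi, t):
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 4))                   # rows are phi_n
t = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=4)

grad = Phi.T @ (sigmoid(Phi @ w) - t)            # analytic: sum_n (y_n - t_n) phi_n

eps = 1e-6                                       # central finite differences
numeric = np.array([(error(w + eps * e, Phi, t) - error(w - eps * e, Phi, t)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(grad, numeric, atol=1e-5))     # True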
Simple Sequential Algorithm
• Given the gradient of the error function
  ∇E(w) = Σn=1..N (yn − tn) ϕn, where yn = σ(wTϕn)
• Solve using an iterative approach:
  w(τ+1) = w(τ) − η ∇En, where ∇En = (yn − tn) ϕn
  – This Error × Feature Vector form is precisely the same as the gradient of the sum-of-squares error for linear regression
• Samples are presented one at a time, and the weight vector is updated after each one (see the sketch below)
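A minimal sketch of this sequential update on synthetic data (all names and the data-generating weights are illustrative assumptions, not the slides' example):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

rng = np.random.default_rng(2)
N, M = 200, 3
Phi = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])  # phi_0 = 1 bias term
w_true = np.array([-0.5, 2.0, -1.0])                             # assumed data-generating weights
t = (rng.random(N) < sigmoid(Phi @ w_true)).astype(float)        # sampled binary targets

w, eta = np.zeros(M), 0.1
for epoch in range(50):
    for n in rng.permutation(N):               # present samples one at a time
        y_n = sigmoid(Phi[n] @ w)
        w = w - eta * (y_n - t[n]) * Phi[n]    # w <- w - eta * (y_n - t_n) * phi_n
print(w)                                       # should land in the vicinity of w_true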
Python Code for Logistic Regression
• Sigmoid function to produce a value between 0 and 1:
  def sigmoid(z):
      return (1 / (1 + np.exp(-z)))
• Prediction: p = sigmoid(z)
• Loss and cost function
  – The loss function is the loss for a single training example
  – The cost is the loss for the whole training set
• Updating weights and biases
  – p is our prediction and y is the correct value
• Finding db and dw
  – Derivative w.r.t. p → derivative w.r.t. z
Source: https://towardsdatascience.com/logistic-regression-from-very-scratch-ea914961f320
Logistic Regression Code in Python
Use scikit-learn to create a data set:

import sklearn.datasets
import matplotlib.pyplot as plt
import numpy as np

# Two-moons data: X has shape (2, 500), Y has shape (1, 500)
X, Y = sklearn.datasets.make_moons(n_samples=500, noise=.2)
X, Y = X.T, Y.reshape(1, Y.shape[0])

epochs = 1000
learningrate = 0.01

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

losstrack = []
m = X.shape[1]                            # number of training examples
w = np.random.randn(X.shape[0], 1) * 0.01 # small random initial weights
b = 0                                     # bias

for epoch in range(epochs):
    z = np.dot(w.T, X) + b                # linear activations
    p = sigmoid(z)                        # predicted probabilities
    # cross-entropy cost averaged over the training set
    cost = -np.sum(np.multiply(np.log(p), Y) + np.multiply((1 - Y), np.log(1 - p))) / m
    losstrack.append(np.squeeze(cost))
    dz = p - Y                            # error = prediction - target
    dw = (1 / m) * np.dot(X, dz.T)        # gradient w.r.t. weights
    db = (1 / m) * np.sum(dz)             # gradient w.r.t. bias
    w = w - learningrate * dw
    b = b - learningrate * db

plt.plot(losstrack)
Prediction: From the code above, you find p. It will be between 0 and 1; classify as class 1 when p > 0.5 and as class 0 otherwise.
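For example, a prediction step that could follow the training loop above, reusing its sigmoid, w, b, X and Y:

# Prediction, continuing with w, b, X, Y from the training loop above
p = sigmoid(np.dot(w.T, X) + b)           # probabilities in (0, 1)
predictions = (p > 0.5).astype(int)       # threshold at 0.5
print(np.mean(predictions == Y))          # training accuracy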
ML solution can over-fit
• Severe over-fitting occurs for linearly separable data
  [Figure: plot of σ(a) for a linearly separable data set]
  – Because the ML solution occurs at σ = 0.5, i.e., at a = wTϕ = 0, with σ > 0.5 and σ < 0.5 for the two classes
  – The logistic sigmoid becomes infinitely steep, a Heaviside step function, as ||w|| goes to ∞
  – Solution: penalizing the weights
• Recall in linear regression:
  ∇E(w) = − Σn=1..N { tn − wTϕ(xn) } ϕ(xn)T                   (without regularization)
  ∇E(w) = [ − Σn=1..N { tn − wTϕ(xn) } ϕ(xn)T ] + λw          (with regularization)
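By analogy, a sketch of the regularized gradient for logistic regression, assuming a quadratic penalty (λ/2)||w||² is added to E(w) so that the gradient becomes Σn (yn − tn) ϕn + λw (an assumption consistent with the regularized linear-regression form above):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def regularized_gradient(w, Phi, t, lam):
    # Gradient of the cross-entropy error plus a (lam/2) * ||w||^2 penalty
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t) + lam * w

# With lam > 0, the weights can no longer grow without bound on separable data.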
An Example of 2-class Logistic Regression
• Input Data
ϕ0(x)=1, dummy feature
Initial Weight Vector, Gradient and
Hessian (2-class)
• Weight vector
• Gradient
• Hessian
Final Weight Vector, Gradient and
Hessian (2-class)
• Weight Vector
• Gradient
• Hessian
Number of iterations : 10
Error (Initial and Final): 15.0642, 1.0000e-009
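The gradient and Hessian in this example correspond to the Newton-Raphson / IRLS update listed in the topics. A minimal sketch of that update on synthetic data (names and data are illustrative, not the example's actual values):

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def irls_step(w, Phi, t):
    # One Newton-Raphson (IRLS) update: w <- w - H^(-1) * grad E
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t)            # gradient of the cross-entropy error
    R = np.diag(y * (1 - y))          # weighting matrix R
    H = Phi.T @ R @ Phi               # Hessian
    return w - np.linalg.solve(H, grad)

rng = np.random.default_rng(3)
Phi = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # phi_0 = 1 dummy feature
t = (rng.random(200) < sigmoid(Phi @ np.array([0.5, 1.0, -1.0]))).astype(float)

w = np.zeros(3)
for _ in range(10):                   # typically converges in a few iterations
    w = irls_step(w, Phi, t)
print(w)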
Generative vs Discriminative Training
Variables x = {x1,..,xM} and classifier target y

1. Generative (Naïve Bayes): estimate the parameters of the variables independently
   – Determine the joint: p(y, x) = p(y) ∏i=1..M p(xi|y)
   – From the joint, get the required conditional p(y|x)
   – For classification: simple estimation; independently estimate M sets of parameters
   – But independence is usually false; we can instead estimate an M(M+1)/2 covariance matrix
   [Graphical model: y with a directed edge to each of x1, x2, .., xM]

2. Discriminative (Naïve Markov / Logistic Regression): jointly estimate the parameters wi
   – Potential functions (log-linear): ϕi(xi, y) = exp{wi xi I{y=1}} and ϕ0(y) = exp{w0 I{y=1}}, where I has value 1 when y = 1, else 0
   – Unnormalized: P(y=1|x) = exp{w0 + Σi=1..M wi xi}, P(y=0|x) = exp{0} = 1
   – Normalized: P(y=1|x) = sigmoid(w0 + Σi=1..M wi xi), where sigmoid(z) = e^z / (1 + e^z); this is Logistic Regression
   – Jointly optimize the M parameters: more complex estimation, but correlations are accounted for
   – Can use much richer features: edges, image patches sharing the same pixels
   [Graphical model: y with an undirected edge to each of x1, x2, .., xM]
   – Multiclass form: p(yi|ϕ) = yi(ϕ) = exp(ai) / Σj exp(aj), where aj = wjTϕ
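A small sketch of the multiclass softmax form quoted above, p(yi|ϕ) = exp(ai)/Σj exp(aj) with aj = wjTϕ (the weights and feature vector are illustrative):

import numpy as np

def softmax(a):
    a = a - np.max(a)                # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

W = np.array([[ 0.2, -1.0,  0.5],    # one weight vector w_j per class (rows)
              [ 1.0,  0.3, -0.4],
              [-0.7,  0.6,  0.1]])
phi = np.array([1.0, 2.0, -1.0])

a = W @ phi                          # a_j = w_j^T phi
p = softmax(a)                       # p(y_j | phi) = exp(a_j) / sum_k exp(a_k)
print(p, p.sum())                    # class probabilities summing to 1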
Logistic Regression is a special architecture of a neural network
[Figure: diagram illustrating logistic regression as a neural network]
Image source: https://storage.ning.com/topology/rest/1.0/file/get/2408482975?profile=original
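As a sketch of this correspondence (illustrative NumPy, not the figure from the slide): a single unit with a sigmoid activation computes σ(wTϕ + b), which is exactly a logistic regression prediction:

import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def neuron(phi, w, b):
    # One network unit: a linear combination of inputs followed by a sigmoid activation
    return sigmoid(np.dot(w, phi) + b)

w, b = np.array([0.8, -0.3]), 0.1
phi = np.array([1.5, 2.0])
print(neuron(phi, w, b))   # identical to the logistic regression output sigma(w^T phi + b)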