Artificial Intelligence II (CS4442 & CS9542)
Classification: Logistic Regression
Boyu Wang
Department of Computer Science
University of Western Ontario
Recall: tumor example
Slide credit: Doina Precup
Recall: supervised learning
- A training example i has the form (x_i, y_i), where x_i ∈ R^n and n is the number of features (the feature dimension).
- If y ∈ R, this is the regression problem.
- If y ∈ {0, 1}, this is the binary classification problem.
- If y ∈ {1, . . . , C} (i.e., y can take more than two values), this is the multi-class classification problem.
- Most binary classification algorithms can be extended to multi-class classification algorithms.
Linear model for classification
- As in linear regression, we consider a linear model h_w for classification:

      h_w(x) = w^\top x,

  where we have already augmented the data: x → [x; 1].
- Decision rule for binary classification: h_w(x) ≥ 0 ⇒ y = 1; h_w(x) < 0 ⇒ y = 0.
- How do we choose w?
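A minimal NumPy sketch of this augmented linear decision rule (the helper names are mine, not from the slides):

import numpy as np

def augment(X):
    """Append a constant 1 to each row of X, so the bias is absorbed into w."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def predict_linear(w, X):
    """Binary decision rule: y = 1 if w^T [x; 1] >= 0, else y = 0."""
    scores = augment(X) @ w          # h_w(x) = w^T [x; 1]
    return (scores >= 0).astype(int)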
Error (cost) function for classification
- Recall: for regression, we use the sum-of-squared-errors cost:

      J(w) = \frac{1}{2} \sum_{i=1}^{m} (h_w(x_i) - y_i)^2

- Can we use it for classification?
- We could, but it is not the best choice:
  - If y = 1, we want h_w(x) > 0 to be as large as possible, since a large value reflects our confidence in the classification; the squared error instead penalizes predictions far from 1, even when they are confidently correct.
  - Recall the connection between linear regression and maximum likelihood under a Gaussian assumption: that assumption does not match binary labels.
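A small numeric illustration of the point above, with hypothetical prediction values chosen only to make the comparison concrete:

def squared_loss(h, y):
    """Per-example squared error, as in the regression cost above."""
    return 0.5 * (h - y) ** 2

# Suppose the true label is y = 1.
print(squared_loss(1.2, 1))   # 0.02 -> barely confident, small penalty
print(squared_loss(5.0, 1))   # 8.0  -> very confident and correct, yet heavily penalized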
Probabilistic view for classification
- We would like a way of learning that is more suitable for the problem.
- Classification can be formulated as the following question: given a data point x, what is the probability p(y|x)?
- Intuition: we want to find a conditional probability p(y|x; w), parameterized by w, such that
  - h_w(x) = w^\top x → +∞ ⇒ p(y = 1|x) → 1
  - h_w(x) = w^\top x → −∞ ⇒ p(y = 1|x) → 0 (i.e., p(y = 0|x) → 1)
  - h_w(x) = w^\top x = 0 ⇒ p(y = 1|x) = 0.5
Sigmoid function
Consider the following function:

    \sigma(a) \triangleq \frac{1}{1 + e^{-a}} = \frac{e^{a}}{1 + e^{a}}

- a → +∞ ⇒ σ(a) → 1
- a → −∞ ⇒ σ(a) → 0
- a = 0 ⇒ σ(a) = 0.5

Plug h_w(x) into σ(·):

    p(y = 1 \mid x; w) \triangleq \sigma(h_w(x)) = \frac{1}{1 + e^{-w^\top x}},

which is exactly what we are looking for!
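A minimal NumPy sketch of the sigmoid and the resulting class-1 probability (function names are mine, not from the slides):

import numpy as np

def sigmoid(a):
    """sigma(a) = 1 / (1 + exp(-a)); maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def prob_y1(w, x_aug):
    """p(y = 1 | x; w) = sigma(w^T x) for an already augmented input [x; 1]."""
    return sigmoid(w @ x_aug)

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0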
1D sigmoid function
Figure: The sigmoid function and the predicted probabilities.
Figure credit: Kevin Murphy
2D sigmoid function
Figure: Plots of σ(w_1 x_1 + w_2 x_2).
Figure credit: Kevin Murphy
Error (cost) function for classification – revisited
- Maximum likelihood classification assumes that we will find the hypothesis that maximizes the (log-)likelihood of the training data:

      \arg\max_h \log p(\{x_i, y_i\}_{i=1}^{m} \mid h) = \arg\max_h \sum_{i=1}^{m} \log p(x_i, y_i \mid h)

  (using the i.i.d. assumption)
- If we ignore the marginal distribution p(x), we may instead maximize the conditional probability of the labels, given the inputs and the hypothesis h:

      \arg\max_h \sum_{i=1}^{m} \log p(y_i \mid x_i; h)
The cross-entropy loss function for logistic regression
- Recall that for any data point (x_i, y_i), the conditional probability can be represented by the sigmoid function:

      p(y_i = 1 \mid x_i; h) = \sigma(h_w(x_i))

- Then the log-likelihood of a hypothesis h_w is

      \log L(w) = \sum_{i=1}^{m} \log p(y_i \mid x_i; h_w)
                = \sum_{i=1}^{m} \begin{cases} \log \sigma(h_w(x_i)), & \text{if } y_i = 1 \\ \log(1 - \sigma(h_w(x_i))), & \text{if } y_i = 0 \end{cases}
                = \sum_{i=1}^{m} \big( y_i \log t_i + (1 - y_i) \log(1 - t_i) \big),

  where t_i = \sigma(h_w(x_i)).
- The cross-entropy loss function is the negative of \log L(w).
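A short sketch of the cross-entropy loss −log L(w) in NumPy, assuming labels y in {0, 1} and the augmented inputs from the earlier sketch; the small constant eps is my addition to avoid log(0):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_loss(w, X_aug, y, eps=1e-12):
    """J(w) = -sum_i [ y_i log t_i + (1 - y_i) log(1 - t_i) ], with t_i = sigma(w^T x_i)."""
    t = sigmoid(X_aug @ w)
    t = np.clip(t, eps, 1 - eps)   # numerical safety, not part of the slide's formula
    return -np.sum(y * np.log(t) + (1 - y) * np.log(1 - t))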
Linear regression vs. logistic regression
Both use the linear model h_w(x) = w^\top x.
- Conditional probability:
  - linear regression: p(y_i \mid x_i; w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2} \left( \frac{y_i - w^\top x_i}{\sigma} \right)^{2} \right)
  - logistic regression: p(y_i \mid x_i; w) = \frac{1}{1 + e^{-w^\top x_i}} if y_i = 1, and 1 - \frac{1}{1 + e^{-w^\top x_i}} if y_i = 0
- Log-likelihood function:
  - linear regression: \log L(w) = \sum_{i=1}^{m} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{1}{2} \left( \frac{y_i - w^\top x_i}{\sigma} \right)^{2} \right) \right)
  - logistic regression: \log L(w) = \sum_{i=1}^{m} \big( y_i \log t_i + (1 - y_i) \log(1 - t_i) \big), where t_i = \sigma(h_w(x_i))
- Solution:
  - linear regression: w = (X^\top X)^{-1} X^\top y (closed form)
  - logistic regression: no analytical solution
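A sketch of this contrast in code, with synthetic data: the linear-regression weights follow from the normal equation in one line, while the logistic-regression weights must be found iteratively (as in the gradient-descent and Newton sketches further below).

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])   # augmented inputs
y = rng.normal(size=100)                                         # real-valued targets

# Linear regression: closed-form solution w = (X^T X)^{-1} X^T y.
w_linear = np.linalg.solve(X.T @ X, X.T @ y)

# Logistic regression has no such closed form; w must be found by an
# iterative optimizer such as gradient descent or Newton's method.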
Optimization procedure: gradient descent
Objective: minimize a loss function J(w)

Gradient descent for function minimization
1: Input: number of iterations N, learning rate α
2: Initialize w^(0)
3: for n = 1 to N do
4:    Compute the gradient: g^(n) = ∇J(w^(n))
5:    w^(n+1) = w^(n) − α g^(n)
6:    if converged (e.g., ||w^(n+1) − w^(n)|| ≤ ε) then
7:       Stop
8:    end if
9: end for
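A minimal Python sketch of this generic loop, assuming the caller supplies a gradient function grad (everything here is illustrative, not from the slides):

import numpy as np

def gradient_descent(grad, w0, alpha=0.1, n_iters=100, tol=1e-6):
    """Minimize J by repeatedly stepping against its gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(w)                 # g^(n) = grad J(w^(n))
        w_next = w - alpha * g      # w^(n+1) = w^(n) - alpha * g^(n)
        if np.linalg.norm(w_next - w) <= tol:
            return w_next           # converged
        w = w_next
    return w

# Example: minimize J(w) = ||w||^2 / 2, whose gradient is w itself.
print(gradient_descent(lambda w: w, w0=[3.0, -2.0]))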
Effect of the learning rate
Figure: Gradient descent on a simple function, starting from (0, 0), for 20 steps using a fixed learning rate α. (a) α = 0.1. (b) α = 0.6.
Figure credit: Kevin Murphy
Optimization procedure: Newton’s method
Objective: minimize a loss function J(w)

Newton's method for function minimization
1: Input: number of iterations N
2: Initialize w^(0)
3: for n = 1 to N do
4:    Compute the gradient: g^(n) = ∇J(w^(n))
5:    Compute the Hessian: H^(n) = ∇²J(w^(n))
6:    w^(n+1) = w^(n) − (H^(n))^{-1} g^(n)
7:    if converged (e.g., ||w^(n+1) − w^(n)|| ≤ ε) then
8:       Stop
9:    end if
10: end for
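A corresponding sketch, assuming the caller supplies both a gradient and a Hessian function (names are mine):

import numpy as np

def newtons_method(grad, hess, w0, n_iters=20, tol=1e-8):
    """Minimize J with Newton steps: w <- w - H^{-1} g."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        g = grad(w)
        H = hess(w)
        w_next = w - np.linalg.solve(H, g)   # solve H d = g instead of inverting H
        if np.linalg.norm(w_next - w) <= tol:
            return w_next
        w = w_next
    return w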
Gradient descent for logistic regression
Objective: minimize the cross-entropy loss function (−\log L(w)):

    J(w) \triangleq -\log L(w) = -\sum_{i=1}^{m} \big( y_i \log t_i + (1 - y_i) \log(1 - t_i) \big)

    \nabla J(w) = \sum_{i=1}^{m} (t_i - y_i)\, x_i, \qquad t_i = \sigma(h_w(x_i)) = \frac{1}{1 + e^{-w^\top x_i}}

Gradient descent for logistic regression
1: Input: number of iterations N, learning rate α
2: Initialize w^(0)
3: for n = 1 to N do
4:    Compute the gradient: g^(n) = ∇J(w^(n)) = Σ_{i=1}^{m} (t_i − y_i) x_i
5:    w^(n+1) = w^(n) − α g^(n)
6:    if converged (e.g., ||w^(n+1) − w^(n)|| ≤ ε) then
7:       Stop
8:    end if
9: end for
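Putting the pieces together, a sketch of logistic-regression training by gradient descent in NumPy (the data augmentation, learning rate, and iteration count are illustrative choices):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, y, alpha=0.1, n_iters=1000, tol=1e-6):
    """Minimize the cross-entropy loss with gradient descent.
    X: (m, n) raw inputs, y: (m,) labels in {0, 1}. Returns augmented weights."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # x -> [x; 1]
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        t = sigmoid(X_aug @ w)          # t_i = sigma(w^T x_i)
        g = X_aug.T @ (t - y)           # grad J(w) = sum_i (t_i - y_i) x_i
        w_next = w - alpha * g
        if np.linalg.norm(w_next - w) <= tol:
            return w_next
        w = w_next
    return w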
Newton’s method for logistic regression
Objective: minimize the cross-entropy loss function (−\log L(w)):

    J(w) \triangleq -\log L(w) = -\sum_{i=1}^{m} \big( y_i \log t_i + (1 - y_i) \log(1 - t_i) \big)

    \nabla J(w) = \sum_{i=1}^{m} (t_i - y_i)\, x_i, \qquad H(w) = \nabla^2 J(w) = \sum_{i=1}^{m} t_i (1 - t_i)\, x_i x_i^\top = X^\top S X,

where S ∈ R^{m×m} is a diagonal matrix with elements S_ii = t_i(1 − t_i), and t_i = \sigma(h_w(x_i)) = \frac{1}{1 + e^{-w^\top x_i}}.

Then, the update rule becomes:

    w^{(n+1)} = w^{(n)} - H(w^{(n)})^{-1} \nabla J(w^{(n)})
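A sketch of the corresponding Newton update, with my own helper names and assuming the Hessian X^T S X is invertible:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_newton(X, y, n_iters=20, tol=1e-8):
    """Minimize the cross-entropy loss with Newton steps w <- w - H^{-1} grad."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_iters):
        t = sigmoid(X_aug @ w)
        g = X_aug.T @ (t - y)          # gradient
        S = np.diag(t * (1 - t))       # S_ii = t_i (1 - t_i)
        H = X_aug.T @ S @ X_aug        # Hessian X^T S X
        w_next = w - np.linalg.solve(H, g)
        if np.linalg.norm(w_next - w) <= tol:
            return w_next
        w = w_next
    return w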
Gradient descent vs. Newton’s method
- Newton's method usually requires significantly fewer iterations than gradient descent.
- However, computing the Hessian and its inverse is expensive.
- Approximation algorithms exist that compute the product of the inverse Hessian with the gradient without explicitly computing H.
Cross-entropy loss vs. squared loss
Figure: Decision boundaries obtained by minimizing the squared loss (magenta line) and the cross-entropy loss (green line).
Figure credit: Christopher Bishop
Regularized logistic regression
One can do regularization for logistic regression just like in the case of linear regression.
- ℓ2-regularized logistic regression:

      J(w) \triangleq -\log L(w) + \frac{\lambda}{2} \|w\|_2^2

- ℓ1-regularized logistic regression:

      J(w) \triangleq -\log L(w) + \lambda \|w\|_1
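For the ℓ2 case, the gradient simply gains a λw term; a hedged sketch that plugs into the gradient-descent loop above (lam is my name for λ):

import numpy as np

def grad_l2_logistic(w, X_aug, y, lam):
    """Gradient of -log L(w) + (lam/2)||w||^2: sum_i (t_i - y_i) x_i + lam * w."""
    t = 1.0 / (1.0 + np.exp(-(X_aug @ w)))
    # (In practice the bias entry is often excluded from the penalty;
    #  here all of w is penalized, matching the slide's formula.)
    return X_aug.T @ (t - y) + lam * w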
Multi-class logistic regression
- For 2 classes:

      p(y = 1 \mid x; w) = \frac{1}{1 + e^{-w^\top x}} = \frac{e^{w^\top x}}{1 + e^{w^\top x}}

- For C classes {1, . . . , C}:

      p(y = c \mid x; w_1, \ldots, w_C) = \frac{e^{w_c^\top x}}{\sum_{c'=1}^{C} e^{w_{c'}^\top x}}

  This is called the softmax function.
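A minimal NumPy sketch of the softmax probabilities, with a max-subtraction shift added for numerical stability (my addition, not part of the slide's formula):

import numpy as np

def softmax_probs(W, x_aug):
    """p(y = c | x) for every class c, where W has one row of weights w_c per class."""
    scores = W @ x_aug         # w_c^T x for each class c
    scores -= scores.max()     # stability shift; the probabilities are unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()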
Multi-class logistic regression
Gradient descent for multi-class logistic regression
1: Input: number of iterations N, learning rate α
2: Initialize C vectors: w_{1,(0)}, . . . , w_{C,(0)}
3: for n = 1 to N do
4:    for c = 1 to C do
5:       Compute the gradient with respect to w_{c,(n)}: g_{c,(n)}
6:       w_{c,(n+1)} = w_{c,(n)} − α g_{c,(n)}
7:    end for
8:    if converged (e.g., ||w_{c,(n+1)} − w_{c,(n)}|| ≤ ε for all classes) then
9:       Stop
10:   end if
11: end for
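A sketch of this loop in NumPy. The slide leaves the per-class gradient implicit; the standard softmax cross-entropy gradient g_c = Σ_i (p(y = c|x_i) − 1[y_i = c]) x_i is used here as an assumption, and classes are indexed from 0 for convenience:

import numpy as np

def fit_softmax_gd(X, y, C, alpha=0.1, n_iters=1000):
    """Multi-class logistic regression by gradient descent.
    X: (m, n) inputs, y: (m,) integer labels in {0, ..., C-1}."""
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])
    W = np.zeros((C, X_aug.shape[1]))             # one weight vector per class
    Y = np.eye(C)[np.asarray(y, dtype=int)]       # one-hot targets, shape (m, C)
    for _ in range(n_iters):
        scores = X_aug @ W.T                      # (m, C)
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)         # softmax probabilities
        G = (P - Y).T @ X_aug                     # gradients for all classes at once
        W -= alpha * G
    return W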
Multi-class logistic regression
- After training, the probability of y = c is given by

      p(y = c \mid x; w_1, \ldots, w_C) \triangleq \sigma_c(x) = \frac{e^{w_c^\top x}}{\sum_{c'=1}^{C} e^{w_{c'}^\top x}}

- Predict the class label as the most probable one:

      y = \arg\max_c \sigma_c(x)
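A one-line prediction sketch on top of the softmax weights above (again with 0-based class indices):

import numpy as np

def predict_class(W, x_aug):
    """Return the most probable class: argmax_c p(y = c | x)."""
    return int(np.argmax(W @ x_aug))   # argmax of the scores equals argmax of the softmax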
Summary
- Logistic regression for classification
- Cross-entropy loss for logistic regression
- No analytical solution that minimizes the cross-entropy loss / maximizes the log-likelihood
- Use gradient descent to find a solution
- Multi-class logistic regression