IAML: Logistic Regression
Nigel Goddard
School of Informatics
Semester 1
Outline
I Logistic function
I Logistic regression
I Learning logistic regression
I Optimization
I The power of non-linear basis functions
I Least-squares classification
I Generative and discriminative models
I Relationships to Generative Models
I Multiclass classification
Decision Boundaries
I In this class we will discuss linear classifiers.
I For each class, there is a region of feature space in which
the classifier selects that class over the other.
I The decision boundary is the boundary of this region. (i.e.,
where the two classes are “tied”)
I In linear classifiers the decision boundary is a line.
Example Data
[Figure: scatter plot of the example data in the (x1, x2) plane, with one class marked 'o' and the other marked 'x'.]
Linear Classifiers
I In a two-class linear classifier, we learn a function

    F(x, w) = w^T x + w0

that represents how aligned the instance is with y = 1.
I w are parameters of the classifier that we learn from data.
I To do classification of an input x:

    x ↦ (y = 1) if F(x, w) > 0

[Figure: the example 'o'/'x' scatter plot in the (x1, x2) plane.]
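A minimal sketch of this decision rule in NumPy (the weight values are made up for illustration, not from the slides):

    import numpy as np

    def linear_classify(x, w, w0):
        """Return 1 if F(x, w) = w^T x + w0 > 0, else 0."""
        return int(np.dot(w, x) + w0 > 0)

    # Illustrative parameters and a test point.
    w = np.array([1.0, -2.0])    # weight vector (normal to the decision boundary)
    w0 = 0.5                     # bias / offset
    print(linear_classify(np.array([3.0, 1.0]), w, w0))   # 1, since F = 1.5 > 0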
A Geometric View
[Figure: the example scatter plot with the weight vector w drawn as a normal to the linear decision boundary.]
Explanation of Geometric View
I The decision boundary in this case is {x | w^T x + w0 = 0}
I w is a normal vector to this surface
I (Remember how lines can be written in terms of their
normal vector.)
I Notice that in more than 2 dimensions, this boundary will
be a hyperplane.
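As a quick numeric illustration (parameters made up), the quantity (w^T x + w0)/||w|| is the signed distance of a point from this hyperplane, and its sign tells us which side of the boundary the point lies on:

    import numpy as np

    w, w0 = np.array([1.0, -2.0]), 0.5      # illustrative parameters
    x = np.array([3.0, 1.0])

    signed_dist = (w @ x + w0) / np.linalg.norm(w)
    print(signed_dist)   # positive, so x lies on the w side of the boundary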
Two Class Discrimination
I For now consider a two class case: y ∈ {0, 1}.
I From now on we’ll write x = (1, x1, x2, ..., xd) and w = (w0, w1, ..., wd).
I We will want a linear, probabilistic model. We could try P(y = 1|x) = w^T x, but this is a bad idea: w^T x can be any real number, not a probability.
I Instead what we will do is

    P(y = 1|x) = f(w^T x)

I f must be between 0 and 1. It will squash the real line into [0, 1].
I Furthermore, the fact that probabilities sum to one means

    P(y = 0|x) = 1 − f(w^T x)
The logistic function
I We need a function that returns probabilities (i.e. stays between 0 and 1).
I The logistic function provides this
I f(z) = σ(z) ≡ 1/(1 + exp(−z))
I As z goes from −∞ to ∞, so f goes from 0 to 1, a “squashing function”
I It has a “sigmoid” shape (i.e. S-like shape)
[Figure: plot of the logistic function for z from −6 to 6, rising in an S-shape from near 0 to near 1.]
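A small sketch of the logistic function in NumPy (the clipping is just a guard against overflow in exp for very negative z):

    import numpy as np

    def sigmoid(z):
        """The logistic function sigma(z) = 1 / (1 + exp(-z))."""
        z = np.clip(z, -500, 500)        # avoid overflow for extreme z
        return 1.0 / (1.0 + np.exp(-z))

    print(sigmoid(0.0))                     # 0.5
    print(sigmoid(np.array([-6.0, 6.0])))   # close to 0 and close to 1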
Linear weights
I Linear weights + logistic squashing function == logistic
regression.
I We model the class probabilities as

    p(y = 1|x) = σ( Σ_{j=0}^{D} wj xj ) = σ(w^T x)

I σ(z) = 0.5 when z = 0. Hence the decision boundary is given by w^T x = 0.
I The decision boundary is an (M − 1)-dimensional hyperplane for an M-dimensional problem.
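Putting the pieces together, a sketch of the model p(y = 1|x) = σ(w^T x), with a leading 1 prepended to x so that w0 acts as the bias (as on the earlier slide); the weights here are made up:

    import numpy as np

    def predict_proba(x, w):
        """p(y = 1 | x) = sigma(w^T [1, x]), with the bias folded into w[0]."""
        x_aug = np.concatenate(([1.0], x))      # x = (1, x1, ..., xd)
        return 1.0 / (1.0 + np.exp(-(w @ x_aug)))

    w = np.array([0.5, 1.0, -2.0])              # (w0, w1, w2), illustrative values
    x = np.array([3.0, 1.0])
    p = predict_proba(x, w)
    print(p, "-> class", int(p > 0.5))          # p > 0.5 exactly when w^T x > 0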
Logistic regression
I For this slide write w̃ = (w1, w2, ..., wd) (i.e., exclude the bias w0)
I The bias parameter w0 shifts the position of the
hyperplane, but does not alter the angle
I The direction of the vector w̃ affects the angle of the
hyperplane. The hyperplane is perpendicular to w̃
I The magnitude of the vector w̃ affects how certain the classifications are
I For small ||w̃||, most of the probabilities near the decision boundary will be close to 0.5.
I For large ||w̃||, probabilities in the same region will be close to 1 or 0.
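A tiny numeric illustration of the last two points (the numbers are made up): rescaling (w0, w̃) by a common factor leaves the decision boundary where it is, but pushes the probabilities towards 0.5 or towards 0/1:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    z = 0.4                                # value of w~^T x + w0 for a point near the boundary
    for scale in (0.1, 1.0, 10.0):         # shrink or grow the magnitude of the weights
        print(scale, sigmoid(scale * z))   # ~0.51, ~0.60, ~0.98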
Learning Logistic Regression
I Want to set the parameters w using training data.
I As before:
I Write out the model and hence the likelihood
I Find the derivatives of the log likelihood w.r.t. the parameters.
I Adjust the parameters to maximize the log likelihood.
I Assume data is independent and identically distributed.
I Call the data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}
I The likelihood is

    p(D|w) = Π_{i=1}^{n} p(y = yi | xi, w)
           = Π_{i=1}^{n} p(y = 1 | xi, w)^{yi} (1 − p(y = 1 | xi, w))^{1−yi}

I Hence the log likelihood L(w) = log p(D|w) is given by

    L(w) = Σ_{i=1}^{n} [ yi log σ(w^T xi) + (1 − yi) log(1 − σ(w^T xi)) ]
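A direct sketch of this log likelihood in NumPy (the small epsilon is only there to guard against log(0); X and y are toy data):

    import numpy as np

    def log_likelihood(w, X, y, eps=1e-12):
        """L(w) = sum_i [ y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i)) ]."""
        p = 1.0 / (1.0 + np.exp(-(X @ w)))    # p(y = 1 | x_i, w) for each row x_i
        return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    # Toy data: each row of X already includes the leading 1 for the bias.
    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([1, 0, 1])
    print(log_likelihood(np.array([0.0, 1.0]), X, y))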
I It turns out that the log likelihood has a unique optimum (given sufficient training examples): it is concave, so the negative log likelihood is convex.
I How to maximize? Take the gradient:

    ∂L/∂wj = Σ_{i=1}^{n} (yi − σ(w^T xi)) xij
I (Aside: something similar holds for linear regression,

    ∂E/∂wj = Σ_{i=1}^{n} (w^T φ(xi) − yi) φj(xi)

where E is the squared error.)
I Unfortunately, you cannot maximize L(w) in closed form as you can for linear regression. You need to use a numerical optimisation method (see later).
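A sketch of maximising L(w) by batch gradient ascent using this gradient (the learning rate and iteration count are arbitrary choices, not from the slides):

    import numpy as np

    def fit_logistic(X, y, lr=0.1, n_iters=1000):
        """Gradient ascent on L(w): w_j += lr * sum_i (y_i - sigma(w^T x_i)) x_ij."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted p(y = 1 | x_i)
            w += lr * (X.T @ (y - p))            # gradient of the log likelihood
        return w

    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]])
    y = np.array([1.0, 0.0, 1.0, 0.0])
    print(fit_logistic(X, y))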
Fitting this into the general structure for learning algorithms:
I Define the task: classification, discriminative
I Decide on the model structure: logistic regression model
I Decide on the score function: log likelihood
I Decide on optimization/search method to optimize the
score function: numerical optimization routine. Note we
have several choices here (stochastic gradient descent,
conjugate gradient, BFGS).
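For example, one common choice is to hand the negative log likelihood and its gradient to an off-the-shelf optimiser such as SciPy's BFGS; a sketch on toy data:

    import numpy as np
    from scipy.optimize import minimize

    def nll_and_grad(w, X, y):
        """Negative log likelihood and its gradient (what BFGS minimises)."""
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        nll = -np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        grad = -X.T @ (y - p)
        return nll, grad

    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0], [1.0, 1.5]])
    y = np.array([1.0, 0.0, 1.0, 0.0, 0.0])    # not linearly separable, so a finite optimum exists
    res = minimize(nll_and_grad, x0=np.zeros(X.shape[1]),
                   args=(X, y), jac=True, method='BFGS')
    print(res.x)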
XOR and Linear Separability
I A problem is linearly separable if we can find weights so that
I w̃^T x + w0 > 0 for all positive cases (where y = 1), and
I w̃^T x + w0 ≤ 0 for all negative cases (where y = 0)
I XOR is not linearly separable: a classic failure case for the perceptron
I XOR becomes linearly separable if we apply a non-linear transformation φ(x) of the input — can you find one? (One possibility is sketched below.)
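One possible transformation (my own choice, not given on the slide): add the product x1·x2 as an extra feature. A quick check that XOR is then linearly separable with hand-picked weights:

    import numpy as np

    # XOR truth table: not linearly separable in (x1, x2) alone.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([0, 1, 1, 0])

    # Non-linear transformation phi(x) = (x1, x2, x1 * x2).
    Phi = np.column_stack([X, X[:, 0] * X[:, 1]])

    # Hand-picked weights that separate the transformed data.
    w, w0 = np.array([1.0, 1.0, -2.0]), -0.5
    print((Phi @ w + w0 > 0).astype(int))    # [0 1 1 0], matching the XOR labels y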
The power of non-linear basis functions
[Figure: data that is not linearly separable in the original (x1, x2) space becomes linearly separable after mapping to (φ1(x), φ2(x)), using two Gaussian basis functions φ1(x) and φ2(x). Figure credit: Chris Bishop, PRML]
As for linear regression, we can transform the input space if we want: x → φ(x)
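A sketch of Gaussian basis functions of the kind used in the figure (the centres and width here are made-up choices):

    import numpy as np

    def gaussian_basis(X, centres, width=1.0):
        """phi_k(x) = exp(-||x - c_k||^2 / (2 * width^2)) for each centre c_k."""
        # X: (n, d), centres: (K, d)  ->  features: (n, K)
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * width ** 2))

    X = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
    centres = np.array([[-0.5, 0.0], [0.5, 0.5]])      # illustrative centres
    print(gaussian_basis(X, centres))                  # one (phi1, phi2) row per point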
Generative and Discriminative Models
I Notice that we have done something very different here
than with naive Bayes.
I Naive Bayes: Modelled how a class “generated” the feature vector, p(x|y). Then could classify using

    p(y|x) ∝ p(x|y) p(y)

This is called a generative approach.
I Logistic regression: Model p(y|x) directly. This is a
discriminative approach.
I Discriminative advantage: Why spend effort modelling
p(x)? Seems a waste, we’re always given it as input.
I Generative advantage: Can be good with missing data
(remember how naive Bayes handles missing data). Also
good for detecting outliers. Or, sometimes you really do
want to generate the input.
Generative Classifiers can be Linear Too
Two scenarios where naive Bayes gives you a linear classifier.
1. Gaussian data with equal covariance. If p(x|y = 1) ∼ N(µ1, Σ) and p(x|y = 0) ∼ N(µ2, Σ), then

    p(y = 1|x) = σ(w̃^T x + w0)

for some (w0, w̃) that depends on µ1, µ2, Σ and the class priors.
2. Binary data. Let each component xj be a Bernoulli variable, i.e. xj ∈ {0, 1}. Then a Naïve Bayes classifier has the form

    p(y = 1|x) = σ(w̃^T x + w0)

Exercise for keeners: prove these two results.
Multiclass classification
I Create a different weight vector wk for each class, to
classify into k and not-k.
I Then use the “softmax” function

    p(y = k|x) = exp(wk^T x) / Σ_{j=1}^{C} exp(wj^T x)

I Note that 0 ≤ p(y = k|x) ≤ 1 and Σ_{j=1}^{C} p(y = j|x) = 1
I This is the natural generalization of logistic regression to
more than 2 classes.
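A sketch of the softmax prediction (subtracting the maximum score before exponentiating is a standard numerical-stability trick; the weights are illustrative):

    import numpy as np

    def softmax_predict(x, W):
        """p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x), one row of W per class."""
        scores = W @ x
        scores = scores - scores.max()     # softmax is unchanged by a constant shift
        p = np.exp(scores)
        return p / p.sum()

    W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # 3 classes, 2 features
    x = np.array([2.0, 0.5])
    p = softmax_predict(x, W)
    print(p, p.sum())                      # the probabilities sum to 1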
Least-squares classification
I Logistic regression is more complicated algorithmically
than linear regression
I Why not just use linear regression with 0/1 targets?
[Figure: two 2-D datasets with class boundaries; extra outlying points in the right-hand plot pull the least-squares boundary away from the data, while the logistic regression boundary is largely unaffected.]
Green: logistic regression; magenta: least-squares regression
Figure credit: Chris Bishop, PRML