Machine Learning
Linear Models for Classification
Marcello Restelli
March 15, 2017
Outline
1 Linear Classification
2 Discriminant Functions
  Least Squares
  The Perceptron Algorithm
3 Probabilistic Discriminative Models
Linear Classification
Classification Problems
The goal of classification is to assign an input x to one of K discrete
classes Ck , where k = 1, . . . , K
Typically, each input is assigned only to one class
Example: The input vector x is the set of
pixel intensities, and the output variable t
will represent the presence of cancer
(class C1 ) or absence of cancer (class C2 )
Linear Classification
In linear classification, the input space is divided into decision regions
whose boundaries are called decision boundaries or decision surfaces
We will consider linear models for classification
In the linear regression case, the model is linear in parameters:
y(x, w) = w0 + ∑_{j=1}^{D−1} wj xj = xT w + w0
For classification, we need to predict discrete class labels, or posterior
probabilities that lie in the range (0, 1), so we use a nonlinear
function
y(x, w) = f (xT w + w0 )
Generalized Linear Models
y(x, w) = f (xT w + w0 )
The decision surfaces correspond to y(x, w) = const
It follows that xT w + w0 = const, and hence the decision surfaces are linear
functions of x, even if the activation function f is nonlinear
This class of models is called generalized linear models
They are no longer linear in parameters
More complex analytical and computational properties than regression
As we did for regression, we can consider fixed nonlinear basis
functions
Notation
In two-class problems, we have a binary target value t ∈ {0, 1}, such that
t = 1 is the positive class and t = 0 is the negative class
We can interpret the value of t as the probability of the positive class
The output of the model can be represented as the probability that the
model assigns to the positive class
If there are K classes, we can use a 1-of-K encoding scheme
t is a vector of length K and contains a single 1 for the correct class and 0
elsewhere
Example: if K = 5, then an input that belongs to class 2 corresponds to
target vector:
t = (0, 1, 0, 0, 0)T
We can interpret t as a vector of class probabilities
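As a small aside (not on the slide), the 1-of-K encoding is easy to build in NumPy; the labels and K below are made up, and classes are indexed from 0 rather than 1:

import numpy as np

def one_hot(labels, K):
    # Encode integer class labels (0-based) as 1-of-K target vectors
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1
    return T

# Example with K = 5: the second sample belongs to the class with index 1
print(one_hot([0, 1, 4], K=5))
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]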
Linear Classification
Three Approaches to Classification
Discriminant function: build a function that directly maps each input
to a specific class
Probabilistic approach: model the conditional probability distribution
p(Ck |x) and use it to make optimal decisions. Two alternatives:
Probabilistic discriminative approach: Model p(Ck |x) directly, for
instance using parametric models (e.g., logistic regression)
Probabilistic generative approach: Model class conditional densities
p(x|Ck ) together with prior probabilities p(Ck ) for the classes. Infer the
posterior using Bayes’ rule:
p(Ck|x) = p(x|Ck) p(Ck) / p(x)
For example, we could fit multivariate Gaussians to the input vector of
each class. Given a test vector, we see under which Gaussian the vector is
most probable
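As a rough sketch of this generative route (the data, priors, and use of scipy.stats are assumptions for illustration, not part of the slides):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Made-up 2-D training data for two classes
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))   # class C1
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))   # class C2

# Fit one multivariate Gaussian per class: the class-conditional densities p(x|Ck)
g1 = multivariate_normal(X1.mean(axis=0), np.cov(X1.T))
g2 = multivariate_normal(X2.mean(axis=0), np.cov(X2.T))
prior1 = prior2 = 0.5                      # assumed class priors p(Ck)

# Posterior via Bayes' rule: p(Ck|x) ∝ p(x|Ck) p(Ck)
x_test = np.array([1.0, 1.0])
unnorm = np.array([g1.pdf(x_test) * prior1, g2.pdf(x_test) * prior2])
print("p(C1|x), p(C2|x) =", unnorm / unnorm.sum())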
Discriminant Functions
Two classes
y(x) = xT w + w0
Assign x to class C1 if y(x) ≥ 0 and
to class C2 otherwise
Decision boundary: y(x) = 0
Given two points on the decision
surface:
y(xA ) = y(xB ) = 0
wT (xA − xB ) = 0
w is orthogonal to the decision
surface
if x is on the decision surface, then wT x / ‖w‖₂ = −w0 / ‖w‖₂
Hence w0 determines the location of the decision surface
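A quick numeric check of these geometric facts (w, w0 and the points are arbitrary values chosen for illustration):

import numpy as np

w = np.array([2.0, 1.0])    # arbitrary weight vector
w0 = -4.0                   # arbitrary bias

def y(x):
    return x @ w + w0

xA = np.array([2.0, 0.0])   # lies on the decision surface: y(xA) = 0
xB = np.array([0.0, 4.0])   # lies on the decision surface: y(xB) = 0
print(y(xA), y(xB))         # 0.0 0.0
print(w @ (xA - xB))        # 0.0: w is orthogonal to the decision surface

x = np.array([3.0, 3.0])
print(y(x) / np.linalg.norm(w))   # signed distance of x from the decision surface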
Multiple classes
Consider the extension to K > 2
classes
One-versus-the-rest: K − 1
classifiers, each of which solves a
two-class problem:
Separate points in class Ck
from points not in that class
There are regions in input space
that are ambiguously classified
One-versus-one: K(K − 1)/2
binary discriminant functions
Each function discriminates
between two particular classes
Similar problems of ambiguity
Multiple classes: Simple Solution
Use K linear discriminant functions of the form
yk (x) = xT wk + wk0 , where k = 1, . . . , K
Assign x to class Ck if yk(x) > yj(x) ∀j ≠ k
The resulting decision regions are singly connected and convex
For any two points xA and xB that lie inside the region Rk:
yk(xA) > yj(xA) and yk(xB) > yj(xB)
imply that, for any 0 ≤ α ≤ 1,
yk(αxA + (1 − α)xB) > yj(αxA + (1 − α)xB)
due to the linearity of the discriminant functions
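A minimal sketch of this decision rule (the weights below are random placeholders, not learned):

import numpy as np

K, D = 3, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(D, K))     # k-th column holds w_k (placeholder values)
w0 = rng.normal(size=K)         # biases w_k0

def classify(x):
    scores = x @ W + w0             # y_k(x) = x^T w_k + w_k0 for every k
    return int(np.argmax(scores))   # assign x to the class with the largest y_k(x)

print(classify(np.array([0.5, -1.0])))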
Discriminant Functions: Least Squares
Least Squares for Classification
Consider a general classification problem with K classes using 1-of-K
encoding scheme for the target vector t
Recall: Least squares approximates conditional expectation E[t|x]
Each class is described by its own linear model
yk (x) = wk T x + wk0 , where k = 1, . . . , K
Using vector notation:
y(x) = W̃T x̃
W̃ is a (D + 1) × K matrix whose k-th column is w̃k = (wk0, wkT)T
x̃ = (1, xT)T
Least Squares for Classification
Given a dataset D = {xi , ti }, where i = 1, . . . , N
We have already seen how to minimize least squares
W̃ = (X̃T X̃)−1 X̃T T
X̃ is an N × (D + 1) matrix whose i-th row is x̃Ti
T is an N × K matrix whose i-th row is ti T
A new input is assigned to a class for which tk = x̃T w̃k is largest
There are problems in using least squares for classification
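A compact sketch of this closed-form solution on made-up data (a pseudo-inverse replaces the explicit (X̃T X̃)−1 X̃T for numerical stability):

import numpy as np

rng = np.random.default_rng(2)
N, D, K = 90, 2, 3
# Made-up data: three Gaussian blobs, one per class
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(N // K, D))
               for c in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                         # N x K matrix of 1-of-K targets

X_tilde = np.hstack([np.ones((N, 1)), X])     # prepend the constant feature for the bias
W_tilde = np.linalg.pinv(X_tilde) @ T         # equals (X̃^T X̃)^{-1} X̃^T T when X̃ has full column rank

pred = np.argmax(X_tilde @ W_tilde, axis=1)   # assign each input to the class with the largest output
print("training accuracy:", np.mean(pred == labels))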
Problems using Least Squares: Outliers
Least squares is highly sensitive to outliers, unlike logistic regression
Problems using Least Squares: Non-Gaussian Distributions
The reason for this failure is the assumption of a Gaussian conditional
distribution, which is not satisfied by binary target vectors
Fixed Basis Functions
So far, we have considered classification models that work directly in the
input space
All considered algorithms are equally applicable if we first make a fixed
nonlinear transformation of the input space using a vector of basis
functions φ(x)
Decision boundaries will be linear in the feature space, but would
correspond to nonlinear boundaries in the original input space
Classes that are linearly separable in the feature space need not be
linearly separable in the original input space
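A sketch of such a fixed nonlinear transformation with two Gaussian basis functions; the centres and width below are hand-picked assumptions, not the ones used in the next slide's figure:

import numpy as np

centres = np.array([[0.0, 0.0], [1.0, 1.0]])   # assumed basis-function centres
s = 0.5                                        # assumed width

def phi(X):
    # Map each input to its two Gaussian basis-function activations
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * s ** 2))

X = np.array([[0.1, -0.2], [0.9, 1.1]])
print(phi(X))   # each row is the image of an input in the 2-D feature space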
Linear Basis Function Models
[Figure: data in the original input space (left) and in the feature space (right)]
We consider two Gaussian basis functions with centers shown by green
crosses and with contours shown by green circles
Linear decision boundary (right) is obtained using logistic regression
and corresponds to nonlinear decision boundary in the input space (left)
Discriminant Functions: The Perceptron Algorithm
Perceptron
The perceptron (Rosenblatt, 1958) is another example of a linear
discriminant model
It is an online linear classification algorithm
It corresponds to a two-class model:
y(x) = f(wT φ(x)),  where f(a) = +1 if a ≥ 0 and f(a) = −1 if a < 0
Target values are +1 for C1 and -1 for C2
The algorithm finds the separating hyperplane by minimizing the
distance of misclassified points to the decision boundary
Using the number of misclassified points as the loss function is not effective,
since it is a piecewise constant function of w
Perceptron Criterion
We are seeking a vector w such that wT φ(xn ) > 0 when xn ∈ C1 and
wT φ(xn ) < 0 otherwise
The perceptron criterion assigns:
zero error to correctly classified patterns
an error of −wT φ(xn)tn to each misclassified pattern xn (proportional to its distance to the decision boundary)
The loss function to be minimized is
LP(w) = − ∑_{n∈M} wT φ(xn)tn,  where M is the set of misclassified patterns
Minimization is performed using stochastic gradient descent:
w(k+1) = w(k) − α∇LP (w) = w(k) + αφ(xn )tn
Since the perceptron function does not change if w is multiplied by a
constant, the learning rate α can be set to 1
Algorithm
Input: data set xn ∈ RD, tn ∈ {−1, +1}, for n = 1, . . . , N
Initialize w0
k ← 0
repeat
  k ← k + 1
  n ← k mod N
  if t̂n ≠ tn then
    wk+1 ← wk + φ(xn)tn
  end if
until convergence
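The pseudocode above translates almost line by line into NumPy; this sketch (an illustration, not the reference implementation) uses φ(x) = (1, x) and made-up linearly separable data:

import numpy as np

def perceptron(X, t, max_epochs=100):
    # Online perceptron: cycle through the data and update only on mistakes
    Phi = np.hstack([np.ones((len(X), 1)), X])   # φ(x) = (1, x): bias absorbed into w
    w = np.zeros(Phi.shape[1])                   # initialize w_0
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if np.sign(phi_n @ w) != t_n:        # misclassified pattern (sign(0) counts as a mistake)
                w = w + phi_n * t_n              # w ← w + φ(x_n) t_n  (learning rate α = 1)
                errors += 1
        if errors == 0:                          # no misclassified points: converged
            break
    return w

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.5, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
t = np.hstack([-np.ones(50), np.ones(50)])
print("weights:", perceptron(X, t))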
Perceptron Convergence Theorem
The effect of a single update is to reduce the error due to the
misclassified pattern:
−w(k+1)T φ(xn)tn = −w(k)T φ(xn)tn − (φ(xn)tn)T φ(xn)tn < −w(k)T φ(xn)tn
This does not imply that the loss is reduced at each stage
Theorem (Perceptron Convergence Theorem)
If the training data set is linearly separable in the feature space Φ, then the
perceptron learning algorithm is guaranteed to find an exact solution in a
finite number of steps
The number of steps before convergence may be substantial
We are not able to distinguish between nonseparable problems and
slowly converging ones
If multiple solutions exist, the one found depends on the initialization
of the parameters and on the order of presentation of the data points
Probabilistic Discriminative Models
Logistic Regression
Consider the problem of two-class classification
The posterior probability of class C1 can be written as a logistic sigmoid
function:
p(C1|φ) = 1 / (1 + exp(−wT φ)) = σ(wT φ)
where p(C2 |φ) = 1 − p(C1 |φ) and the bias term is omitted for clarity
This model is known as logistic regression (even if it is for classification!)
Differently from generative models, here we model p(Ck|φ) directly
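A minimal sketch of this posterior computation, with arbitrary weights and features:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([1.5, -0.5])     # arbitrary weight vector
phi = np.array([2.0, 1.0])    # one arbitrary feature vector
p_C1 = sigmoid(w @ phi)       # p(C1|φ) = σ(w^T φ)
print(p_C1, 1.0 - p_C1)       # p(C1|φ) and p(C2|φ)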
Maximum Likelihood for Logistic Regression
Given a dataset D = {xn , tn }, n = 1, . . . , N; tn ∈ {0, 1}
Maximize the probability of getting the right label:
p(t|X, w) = ∏_{n=1}^{N} yn^{tn} (1 − yn)^{1−tn},  where yn = σ(wT φn)
Taking the negative log of the likelihood, we can define the cross-entropy
error function (to be minimized):
L(w) = − ln p(t|X, w) = − ∑_{n=1}^{N} (tn ln yn + (1 − tn) ln(1 − yn)) = ∑_{n=1}^{N} Ln
Differentiating and using the chain rule:
∂Ln/∂yn = (yn − tn)/(yn(1 − yn)),   ∂yn/∂w = yn(1 − yn)φn,   ∂Ln/∂w = (∂Ln/∂yn)(∂yn/∂w) = (yn − tn)φn
Maximum Likelihood for Logistic Regression
The gradient of the loss function is
∇L(w) = ∑_{n=1}^{N} (yn − tn)φn
It has the same form as the gradient of the sum-of-squares error
function for linear regression
There is no closed-form solution, due to the nonlinearity of the logistic
sigmoid function
The error function is convex and can be optimized by standard
gradient-based optimization techniques
Easy to adapt to the online learning setting
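A sketch of batch gradient descent on this loss, using the gradient above; the data, step size, and iteration count are made up:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
# Made-up two-class data; φ(x) = (1, x), so the bias is part of w
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)), rng.normal([3, 3], 1.0, (100, 2))])
t = np.hstack([np.zeros(100), np.ones(100)])
Phi = np.hstack([np.ones((len(X), 1)), X])

w = np.zeros(Phi.shape[1])
alpha = 0.1
for _ in range(500):
    y = sigmoid(Phi @ w)          # y_n = σ(w^T φ_n)
    grad = Phi.T @ (y - t)        # ∇L(w) = Σ_n (y_n − t_n) φ_n
    w -= alpha * grad / len(t)    # averaging the gradient keeps the step size stable
print("learned weights:", w)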
Multiclass Logistic Regression
For the multiclass case, we represent posterior probabilities by a softmax
transformation of linear functions of feature variables:
p(Ck|φ) = yk(φ) = exp(wkT φ) / ∑_j exp(wjT φ)
Differently from generative models, here we will use maximum
likelihood to determine the parameters of this discriminative model directly
p(T|Φ, w1, . . . , wK) = ∏_{n=1}^{N} ∏_{k=1}^{K} p(Ck|φn)^{tnk} = ∏_{n=1}^{N} ∏_{k=1}^{K} ynk^{tnk}
(for each n, only the term corresponding to the correct class differs from 1)
where ynk = p(Ck|φn) = exp(wkT φn) / ∑_j exp(wjT φn)
Multiclass Logistic Regression
Taking the negative logarithm gives the cross-entropy error function for the
multi-class classification problem
L(w1, . . . , wK) = − ln p(T|Φ, w1, . . . , wK) = − ∑_{n=1}^{N} ∑_{k=1}^{K} tnk ln ynk
Taking the gradient with respect to wj:
∇wj L(w1, . . . , wK) = ∑_{n=1}^{N} (ynj − tnj)φn
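A sketch of the softmax posteriors and of this gradient, applied for a single gradient-descent step on made-up data:

import numpy as np

def softmax(A):
    A = A - A.max(axis=1, keepdims=True)        # subtract the row maximum for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(5)
N, M, K = 6, 3, 4                               # made-up sizes: samples, features, classes
Phi = rng.normal(size=(N, M))                   # feature vectors φ_n
T = np.eye(K)[rng.integers(0, K, size=N)]       # random 1-of-K targets
W = np.zeros((M, K))                            # one weight vector w_k per column

Y = softmax(Phi @ W)                            # y_nk = p(C_k|φ_n)
grad = Phi.T @ (Y - T)                          # column j holds ∇_{w_j} L = Σ_n (y_nj − t_nj) φ_n
W -= 0.1 * grad                                 # one gradient-descent step
print(softmax(Phi @ W)[0])                      # updated posteriors for the first sample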
Connection Between Logistic Regression and the Perceptron Algorithm
If we replace the logistic function with a step function
y(x, w) = 1 / (1 + e^{−w·φ(x)})   (logistic sigmoid)
y(x, w) = 1 if w · φ(x) > 0, 0 otherwise   (step function)
Both algorithms use the same updating rule:
w ← w − α (y(xn, w) − tn) φn
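A sketch of this shared rule, where only the output nonlinearity differs; the sample, target (in {0, 1}), and learning rate are placeholders:

import numpy as np

def logistic_output(w, phi):
    return 1.0 / (1.0 + np.exp(-w @ phi))       # smooth output in (0, 1)

def step_output(w, phi):
    return 1.0 if w @ phi > 0 else 0.0          # hard threshold

def update(w, phi_n, t_n, output, alpha=0.1):
    # Shared rule: w ← w − α (y(x_n, w) − t_n) φ_n; only y(·) changes
    return w - alpha * (output(w, phi_n) - t_n) * phi_n

w = np.zeros(3)
phi_n, t_n = np.array([1.0, 2.0, -1.0]), 1.0    # placeholder sample with target 1
print(update(w, phi_n, t_n, logistic_output))
print(update(w, phi_n, t_n, step_output))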