Introduction to Machine Learning
Linear Models for Classification
林彥宇 教授
Yen-Yu Lin, Professor
國立陽明交通大學 資訊工程學系
Computer Science, National Yang Ming Chiao Tung University
Some slides are modified from Prof. Sheng-Jyh Wang
and Prof. Hwang-Tzong Chen
Classification problem
• The goal in classification is to take an input vector 𝐱 and to
assign it to one of 𝐾 discrete classes 𝐶𝑘 where 𝑘 = 1, 2, … , 𝐾.
• The input space is divided into decision regions whose
boundaries are called decision boundaries or decision surfaces
Photo: https://ww2.mathworks.cn/help/stats/create-and-visualize-discriminant-analysis-classifier.html
Classification problem
• Linear models for classification: the decision surfaces are linear
functions of the input vector 𝐱
Figure: linear decision surfaces in the 2D case (a line) and the 3D case (a plane)
Classification problem
• Linear models for classification: the decision surfaces are linear
functions of the input vector 𝐱
• The decision surfaces are defined by (𝐷 − 1)-dimensional
hyperplanes within the 𝐷-dimensional input space
• Data whose classes can be separated by linear decision
surfaces are said to be linearly separable
Figure: linearly separable data (left) vs. not linearly separable data (right)
Classification problem
• Given a training data set comprising 𝑁 observations $\{\mathbf{x}_n\}_{n=1}^{N}$ and the corresponding target labels $\{t_n\}_{n=1}^{N}$, the goal of classification is to predict the label 𝑡 of a new data sample 𝐱
➢ Categorical outputs, e.g., yes/no or dog/cat/other, are called labels
➢ A classifier assigns each input vector to one of these labels
• Binary classification: two possible labels
• Multi-class classification: multiple possible labels
• Label representation
➢ Two classes: 𝑡 ∈ {1, 0} or 𝑡 ∈ {+1, −1}
➢ Multiple classes: 1-of-K coding, e.g., class 2 of 𝐾 = 5 classes is encoded as $\mathbf{t} = (0, 1, 0, 0, 0)^T$ (see the sketch below)
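As a quick illustration of the 1-of-K (one-hot) label coding above, here is a minimal NumPy sketch; the class indices and the value of K are made-up example values, not from the slides.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class indices (0-based) as 1-of-K target vectors."""
    t = np.zeros((len(labels), num_classes))
    t[np.arange(len(labels)), labels] = 1.0
    return t

# Example: K = 5 classes, three samples belonging to classes 2, 0, and 4
print(one_hot([2, 0, 4], num_classes=5))
# [[0. 0. 1. 0. 0.]
#  [1. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1.]]
```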
Three representative linear classifiers
• Different linear models for classification
➢ Discriminant functions
➢ Generative approach using Bayes' theorem: model $p(\mathbf{x}|C_k)$ and $p(C_k)$, then compute $p(C_k|\mathbf{x}) = \frac{p(\mathbf{x}|C_k)\,p(C_k)}{p(\mathbf{x})}$
➢ Discriminative approach that directly models the posterior probability $p(C_k|\mathbf{x})$
Linear discriminant for two-class classification
• A linear discriminant is a linear function that takes an input
vector 𝐱 and assigns it to one of 𝐾 classes
• A linear discriminant function for two-class classification: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$,
where 𝐰 is the weight vector and 𝑤0 is the bias
• The decision boundary is $y(\mathbf{x}) = 0$; a point 𝐱 is assigned to class 𝐶1 if $y(\mathbf{x}) \geq 0$ and to class 𝐶2 otherwise
• The decision boundary $y(\mathbf{x}) = 0$ is a (𝐷 − 1)-dimensional hyperplane within the 𝐷-dimensional input space
Properties of a linear discriminant
• Weight vector 𝐰 is orthogonal to every vector lying within the
decision boundary
➢ Consider two points 𝐱𝐴 and 𝐱 𝐵 , which lie on the decision
boundary
➢ We have $y(\mathbf{x}_A) = y(\mathbf{x}_B) = 0$, leading to $\mathbf{w}^T(\mathbf{x}_A - \mathbf{x}_B) = 0$
Photo: K.-R. Muller
Properties of a linear discriminant
• The distance from the origin to the decision boundary is $\dfrac{\mathbf{w}^T\mathbf{x}}{\|\mathbf{w}\|} = -\dfrac{w_0}{\|\mathbf{w}\|}$
Photo: G. Shakhnarovich
Properties of a linear discriminant
• How to compute the distance from an arbitrary point 𝐱 to the decision boundary $y(\mathbf{x}) = 0$?
➢ The signed distance is $r = \dfrac{y(\mathbf{x})}{\|\mathbf{w}\|}$ (see the sketch below)
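A minimal NumPy sketch of these properties, using made-up values for the weight vector and bias: it evaluates $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$, the distance from the origin to the boundary, and the signed distance $r = y(\mathbf{x})/\|\mathbf{w}\|$ of an arbitrary point.

```python
import numpy as np

w = np.array([3.0, 4.0])   # example weight vector (assumed values)
w0 = -5.0                  # example bias (assumed value)

def y(x):
    """Linear discriminant y(x) = w^T x + w0."""
    return w @ x + w0

# Distance from the origin to the decision boundary y(x) = 0: -w0 / ||w||
dist_origin = -w0 / np.linalg.norm(w)
print(dist_origin)         # 1.0

# Signed distance from an arbitrary point x to the boundary: y(x) / ||w||
x = np.array([2.0, 3.0])
r = y(x) / np.linalg.norm(w)
print(r)                   # 2.6 (positive side of the boundary, i.e., class C1)
```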
Linear discriminant for multi-class classification
• Consider the extension of linear discriminants to 𝐾 > 2 classes
• Many classifiers cannot directly extend to multi-class
classification
➢ Build a 𝐾-class discriminant by combining a number of two-class
discriminant functions
➢ One-versus-the-rest strategy
➢ One-versus-one strategy
One-versus-the-rest
• One-vs-the-rest
➢ Learn 𝐾 − 1 two-class classifiers (linear discriminants)
➢ Classifier 1 is derived to separate data of class 1 from the rest
➢ Classifier 2 is derived to separate data of class 2 from the rest
➢…
➢ Classifier 𝐾 − 1 is derived to separate data of class 𝐾 − 1 from
the rest
Figure: 3-class classification using 2 one-vs-the-rest linear discriminants
One-vs-one
• One-vs-one
➢ Learn 𝐾(𝐾 − 1)/2 two-class classifiers, one for each class pair
➢ For classes i and j, a binary classifier is learned to separate data
of class i from those of class j
➢ Classification is done by majority vote
• The problem of ambiguous regions
Figure: 3-class classification using 3 one-vs-one linear discriminants (an ambiguous region remains)
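Below is a small sketch of the one-vs-one voting scheme described above, under the assumption that the K(K−1)/2 pairwise linear discriminants are already available; the `pairwise` dictionary and its hand-picked weights are hypothetical. A tie in the vote counts corresponds to one of the ambiguous regions.

```python
import numpy as np
from itertools import combinations

def ovo_predict(x, pairwise, num_classes):
    """One-vs-one prediction by majority vote.

    pairwise[(i, j)] is assumed to be a function returning y_ij(x);
    y_ij(x) >= 0 votes for class i, otherwise for class j.
    """
    votes = np.zeros(num_classes, dtype=int)
    for i, j in combinations(range(num_classes), 2):
        winner = i if pairwise[(i, j)](x) >= 0 else j
        votes[winner] += 1
    # Ties in the vote counts correspond to ambiguous regions
    return int(np.argmax(votes)), votes

# Toy 3-class example with hand-picked linear discriminants (assumed w, w0)
pairwise = {
    (0, 1): lambda x: np.array([1.0, 0.0]) @ x - 0.5,
    (0, 2): lambda x: np.array([0.0, 1.0]) @ x - 0.5,
    (1, 2): lambda x: np.array([-1.0, 1.0]) @ x,
}
print(ovo_predict(np.array([1.0, 1.0]), pairwise, num_classes=3))
```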
K-class discriminant
• A single K-class discriminant can avoid the problem of
ambiguous regions
➢ It is composed of K linear functions of the form $y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$ for 𝑘 = 1, 2, … , 𝐾
➢ It assigns a point 𝐱 to class k if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \neq k$
➢ The decision boundary between class k and class j is $y_k(\mathbf{x}) = y_j(\mathbf{x})$, i.e., $(\mathbf{w}_k - \mathbf{w}_j)^T\mathbf{x} + (w_{k0} - w_{j0}) = 0$
K-class discriminant
• An example: 3-class discriminant
• The decision regions of such a
discriminant are convex
• Consider two points 𝐱𝐴 and 𝐱𝐵 , which lie inside region $\mathcal{R}_k$
• Any point $\hat{\mathbf{x}}$ that lies on the line connecting 𝐱𝐴 and 𝐱𝐵 can be expressed as $\hat{\mathbf{x}} = \lambda\,\mathbf{x}_A + (1-\lambda)\,\mathbf{x}_B$ with $0 \leq \lambda \leq 1$
• It can be proved that $\hat{\mathbf{x}}$ also lies inside $\mathcal{R}_k$
➢ Linear function of class k: $y_k(\hat{\mathbf{x}}) = \lambda\, y_k(\mathbf{x}_A) + (1-\lambda)\, y_k(\mathbf{x}_B)$
➢ Proof: since 𝐱𝐴 and 𝐱𝐵 lie in $\mathcal{R}_k$, we have $y_k(\mathbf{x}_A) > y_j(\mathbf{x}_A)$ and $y_k(\mathbf{x}_B) > y_j(\mathbf{x}_B)$ for all $j \neq k$; by the linearity above, $y_k(\hat{\mathbf{x}}) > y_j(\hat{\mathbf{x}})$ for all $j \neq k$, so $\hat{\mathbf{x}} \in \mathcal{R}_k$
Linear discriminant learning
• Learning focuses on estimating a good decision boundary
• We need to optimize parameters 𝒘 and 𝑤0 of the boundary
• What does good mean here?
• Is this boundary good?
Photo: R. Zemel et al.
• We need a criterion to tell how to optimize these parameters
Least squares for classification
• Use the least squares technique to solve a K-class discriminant
• Each class k is described by its own linear model $y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$
• A point 𝐱 is assigned to class k if $k = \arg\max_j y_j(\mathbf{x})$
Least squares for classification
• Some notations
➢ A data point 𝐱 in augmented form: $\tilde{\mathbf{x}} = (1, \mathbf{x}^T)^T$
➢ 1-of-K binary coding for the label vector of 𝐱: $\mathbf{t} = [0,1,0,0,0]^T$
➢ The linear model for class k in augmented form: $\tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^T)^T$
➢ Apply the linear model for class k to a point 𝐱: $y_k(\mathbf{x}) = \tilde{\mathbf{w}}_k^T\tilde{\mathbf{x}}$
➢ All data points: the matrix $\tilde{\mathbf{X}}$ whose 𝑛th row is $\tilde{\mathbf{x}}_n^T$. All data label vectors: the matrix $\mathbf{T}$ whose 𝑛th row is $\mathbf{t}_n^T$
➢ All linear models: $\tilde{\mathbf{W}} = [\tilde{\mathbf{w}}_1, \dots, \tilde{\mathbf{w}}_K]$
➢ Apply all linear models to a point 𝐱: $\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}$
Least squares for classification
• The squared difference between $\mathbf{y}(\mathbf{x}_n) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}_n$ and $\mathbf{t}_n$
• Sum-of-squares error: $E_D(\tilde{\mathbf{W}}) = \frac{1}{2}\,\mathrm{Tr}\big\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\big\}$
• Proof sketch
➢ Tr(AB) = Tr(BA)
➢ Tr(BA) is the sum of the diagonal elements of square matrix BA
➢ The nth diagonal element is the squared error of point 𝐱 𝑛
• Setting the derivative w.r.t. $\tilde{\mathbf{W}}$ to 0, we obtain $\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{T} = \tilde{\mathbf{X}}^{\dagger}\mathbf{T}$, where $\tilde{\mathbf{X}}^{\dagger}$ is the pseudo-inverse of $\tilde{\mathbf{X}}$
Least squares for classification
• After getting $\tilde{\mathbf{W}}$, we classify a new data point 𝐱 via $\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}$ and assign it to the class with the largest output (see the sketch below)
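A minimal NumPy sketch of the least-squares solution above on randomly generated toy data (the data set, its dimensions, and the random seed are assumptions for illustration): it builds the augmented design matrix, solves for $\tilde{\mathbf{W}}$ with the pseudo-inverse, and classifies by the largest output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-class data: points in D = 2 dimensions, one Gaussian blob per class
K, N_per = 3, 50
means = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])
X = np.vstack([rng.normal(m, 0.7, size=(N_per, 2)) for m in means])
labels = np.repeat(np.arange(K), N_per)
T = np.eye(K)[labels]                       # 1-of-K target matrix, N x K

# Augmented inputs: x_tilde = (1, x^T)^T, stacked as rows of X_tilde
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution: W_tilde = pseudo-inverse(X_tilde) @ T
W_tilde = np.linalg.pinv(X_tilde) @ T       # shape (D + 1, K)

# Classify new points by the largest of the K linear outputs
def predict(X_new):
    X_new_tilde = np.hstack([np.ones((X_new.shape[0], 1)), X_new])
    return np.argmax(X_new_tilde @ W_tilde, axis=1)

print("training accuracy:", np.mean(predict(X) == labels))
```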
Least squares for classification
• The least-squares solutions are sensitive to outliers
Figure: decision boundaries obtained by least squares (magenta) and by logistic regression (green); the outliers strongly affect the least-squares boundary
Least squares for classification
• The least squares method sometimes gives poor results
Figure: left: least squares; right: logistic regression
Fisher’s linear discriminant: 2 classes
• Fisher’s linear discriminant (FLD): a non-probabilistic method
for dimensionality reduction
• Consider the case of two classes, and suppose we take a 𝐷-
dimensional input vector 𝐱 and project it onto one dimension
by $y = \mathbf{w}^T\mathbf{x}$
• If we place a threshold on 𝑦 and classify 𝐱 as class 𝐶1 if 𝑦 ≥ −𝑤0 , and otherwise as class 𝐶2 , we get the linear classifier discussed previously
• In general, dimensionality reduction leads to information loss, but we can select a projection that maximizes class separation
Linear discriminant with maximum separation
• A two-class problem where there are 𝑁1 points of class 𝐶1 and
𝑁2 points of class 𝐶2
• The mean vectors of the two classes are $\mathbf{m}_1 = \frac{1}{N_1}\sum_{n\in C_1}\mathbf{x}_n$ and $\mathbf{m}_2 = \frac{1}{N_2}\sum_{n\in C_2}\mathbf{x}_n$
• An intuitive choice is the 𝐰 that maximizes the distance between the projected mean vectors, i.e., maximizes $m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$, where $m_k = \mathbf{w}^T\mathbf{m}_k$ is the projected mean of class 𝐶𝑘
• However, the distance can be arbitrarily large by increasing the
magnitude of 𝐰
Linear discriminant with maximum separation
• We constrain 𝐰 to have unit length, i.e., $\mathbf{w}^T\mathbf{w} = 1$
• The constrained optimization problem: $\max_{\mathbf{w}}\ \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$ subject to $\mathbf{w}^T\mathbf{w} = 1$
• The optimal 𝐰 in the optimization problem above is $\mathbf{w} \propto (\mathbf{m}_2 - \mathbf{m}_1)$
• How to prove?
➢ Use a Lagrange multiplier to solve it: $L(\mathbf{w}, \lambda) = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1) + \lambda\,(1 - \mathbf{w}^T\mathbf{w})$
➢ By setting the gradient of the Lagrange function w.r.t. the optimization variables to 0, we get $\mathbf{w} = \frac{1}{2\lambda}(\mathbf{m}_2 - \mathbf{m}_1) \propto (\mathbf{m}_2 - \mathbf{m}_1)$
Linear discriminant with maximum separation
• Is the obtained 𝐰 good?
➢ Figure: data projected onto the direction joining the class means; the '+' marks denote 𝐦1 , 𝐦2 , and the threshold
➢ Histograms of the two projected classes overlap
Linear discriminant with maximum separation
• Is the obtained 𝐰 good?
➢ Figure: the '+' marks denote 𝐦1 , 𝐦2 , and the threshold
➢ Left plot: histograms of the two projected classes overlap
➢ Right plot: the projection learned by FLD gives much better separation
Fisher’s linear discriminant: 2 classes
• FLD seeks the projection 𝐰 that gives a large distance between
the projected data means while giving a small variance within
each class
• Maximize the between-class variance $(m_2 - m_1)^2$, where $m_k = \mathbf{w}^T\mathbf{m}_k$ is the mean of the projected data of class 𝐶𝑘
• Minimize the within-class variance $s_1^2 + s_2^2$, where $s_k^2 = \sum_{n\in C_k}(y_n - m_k)^2$ and $y_n = \mathbf{w}^T\mathbf{x}_n$
• The objective (Fisher criterion): $J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$
Fisher’s linear discriminant: 2 classes
• The objective (Fisher criterion): $J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$,
where $\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$ is the between-class covariance matrix
and $\mathbf{S}_W = \sum_{n\in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n\in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$ is the within-class covariance matrix
Fisher’s linear discriminant: 2 classes
• Differentiate Fisher criterion w.r.t. 𝐰
• We find that $(\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w} = (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w}$
• As $\mathbf{S}_B\mathbf{w} = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T\mathbf{w}$, $\mathbf{S}_B\mathbf{w}$ is always in the direction of $(\mathbf{m}_2 - \mathbf{m}_1)$
• We have $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$ (see the sketch below)
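A minimal NumPy sketch of two-class FLD under the formulas above, on made-up Gaussian data (the class distributions and the midpoint threshold are assumptions; the next slide lists other threshold choices): it computes the class means, $\mathbf{S}_W$, the direction $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$, and a threshold halfway between the projected means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-class data (assumed distributions, for illustration only)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=100)
X2 = rng.multivariate_normal([2, 2], [[1.0, 0.8], [0.8, 1.0]], size=100)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class covariance matrix S_W (sum of the two class scatter matrices)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher's linear discriminant direction: w proportional to S_W^{-1}(m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Project the data and place a threshold halfway between the projected means
y1, y2 = X1 @ w, X2 @ w
y0 = 0.5 * (y1.mean() + y2.mean())
pred_C2 = np.concatenate([y1, y2]) >= y0     # True -> predicted as class C2
labels_C2 = np.concatenate([np.zeros(100), np.ones(100)]).astype(bool)
print("training accuracy:", np.mean(pred_C2 == labels_C2))
```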
Fisher’s linear discriminant: 2 classes
• The optimized 𝐰 is called Fisher’s linear discriminant
• Project training data into a one-dimensional space via 𝐰
➢ Classification can be carried out by several methods
➢ Determine a threshold 𝑦0
◆Predict a point 𝐱 as 𝐶1 if 𝑦(𝐱) ≥ −𝑦0 , and otherwise class 𝐶2
➢ Use the nearest-neighbor rule
◆Project all training data into the one-dimensional space via 𝐰
◆Project a testing point 𝐱 to the same space
◆Retrieve the nearest training sample of 𝐱 in the projected space
◆Predict 𝐱 as the class that the retrieved sample belongs to
Fisher’s linear discriminant: Multiple classes
• Fisher’s linear discriminant (FLD) for 𝐾 > 2 classes
• Assume the dimension of the input space is 𝐷, which is greater
than 𝐾
• FLD introduces 𝐷′ ≥ 1 linear weight vectors: $y_k = \mathbf{w}_k^T\mathbf{x}$ for $k = 1, \dots, D'$
• Gathering the weight vectors together projects each data point 𝐱 to a 𝐷′-dimensional space: $\mathbf{y} = \mathbf{W}^T\mathbf{x}$,
where the weight vectors $\mathbf{w}_k$ are the columns of $\mathbf{W}$
Fisher’s linear discriminant: Multiple classes
• Generalize the within-class covariance matrix to 𝐾 classes
• Recall the within-class covariance matrix when 𝐾 = 2: $\mathbf{S}_W = \mathbf{S}_1 + \mathbf{S}_2$
• The within-class covariance matrix when 𝐾 ≥ 2: $\mathbf{S}_W = \sum_{k=1}^{K}\mathbf{S}_k$,
where $\mathbf{S}_k = \sum_{n\in C_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T$
and $\mathbf{m}_k = \frac{1}{N_k}\sum_{n\in C_k}\mathbf{x}_n$
Fisher’s linear discriminant: Multiple classes
• Recall the between-class covariance matrix when 𝐾 = 2: $\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$
• The extended between-class covariance matrix for 𝐾 > 2: $\mathbf{S}_B = \sum_{k=1}^{K} N_k(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T$,
where $\mathbf{m} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$ is the mean of all data
Fisher’s linear discriminant: Multiple classes
• Consider the case where FLD projects data to a one-dimensional
space, i.e., 𝐷′ = 1
• An equivalent objective: $\max_{\mathbf{w}}\ \mathbf{w}^T\mathbf{S}_B\mathbf{w}$ subject to $\mathbf{w}^T\mathbf{S}_W\mathbf{w} = 1$
• Lagrangian function: $L(\mathbf{w}, \lambda) = \mathbf{w}^T\mathbf{S}_B\mathbf{w} + \lambda\,(1 - \mathbf{w}^T\mathbf{S}_W\mathbf{w})$
• Setting the gradient to 0, we have $\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{S}_W\mathbf{w}$, i.e., $\mathbf{S}_W^{-1}\mathbf{S}_B\mathbf{w} = \lambda\,\mathbf{w}$
• The optimal 𝐰 is the eigenvector of $\mathbf{S}_W^{-1}\mathbf{S}_B$ that corresponds to the largest eigenvalue
Fisher’s linear discriminant: Multiple classes
• Consider the case where FLD projects data to a multi-
dimensional space, i.e., 𝐷′ > 1
• Can we directly extend the two-class objective $J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$ by replacing 𝐰 with 𝐖? No: $\mathbf{W}^T\mathbf{S}_B\mathbf{W}$ and $\mathbf{W}^T\mathbf{S}_W\mathbf{W}$ are matrices, so their ratio is not a scalar objective
• A choice for the objective is $J(\mathbf{W}) = \mathrm{Tr}\big\{(\mathbf{W}^T\mathbf{S}_W\mathbf{W})^{-1}(\mathbf{W}^T\mathbf{S}_B\mathbf{W})\big\}$
• The columns of the optimal 𝐖 are the eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ that correspond to the 𝐷′ largest eigenvalues (see the sketch below)
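A minimal NumPy sketch of the multi-class case under the criterion above, with toy data (the class means, dimensions, and random seed are assumptions): it forms $\mathbf{S}_W$ and $\mathbf{S}_B$ and takes the $D'$ leading eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ as the columns of $\mathbf{W}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: K = 3 classes in D = 4 dimensions (assumed, for illustration)
K, D, N_per = 3, 4, 60
class_means = rng.normal(scale=3.0, size=(K, D))
X = np.vstack([rng.normal(mu, 1.0, size=(N_per, D)) for mu in class_means])
labels = np.repeat(np.arange(K), N_per)

m = X.mean(axis=0)                                  # overall mean
S_W = np.zeros((D, D))
S_B = np.zeros((D, D))
for k in range(K):
    Xk = X[labels == k]
    mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)                  # within-class scatter
    S_B += len(Xk) * np.outer(mk - m, mk - m)       # between-class scatter

# Columns of W: eigenvectors of S_W^{-1} S_B with the D' largest eigenvalues
D_prime = K - 1                                     # rank(S_B) <= K - 1
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:D_prime]].real                # D x D' projection matrix

Y = X @ W                                           # projected data, N x D'
print("projected shape:", Y.shape)
```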
Fisher’s linear discriminant: Multiple classes
• About the value 𝐷′, the dimension of the projected space
• Note that the rank of the between-class covariance matrix is at
most 𝐾 − 1
• In other words, the dimension of the projected space by FLD is
at most 𝐾 − 1
Probabilistic generative models: Two-class case
• In a generative model, we model the class-conditional densities $p(\mathbf{x}|C_k)$ and class priors $p(C_k)$, and then use them to compute posterior probabilities $p(C_k|\mathbf{x})$ via Bayes' theorem
• Consider two-class cases. The posterior probability for class 𝐶1 is defined by
$p(C_1|\mathbf{x}) = \dfrac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_1)\,p(C_1) + p(\mathbf{x}|C_2)\,p(C_2)} = \sigma(a)$, where $a = \ln\dfrac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)}$
and $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$ is the logistic sigmoid function
Logistic sigmoid function
• The logistic sigmoid function maps the whole real axis into (0, 1)
➢ Symmetric property: $\sigma(-a) = 1 - \sigma(a)$
• The variable 𝑎 here represents the log of the ratio of probabilities: $a = \ln\dfrac{\sigma}{1-\sigma} = \ln\dfrac{p(C_1|\mathbf{x})}{p(C_2|\mathbf{x})}$, known as the logit
Figure: the logistic sigmoid function 𝜎(𝑎) (red curve) plotted against 𝑎
Probabilistic generative models: Multi-class case
• For the case of 𝐾 > 2 classes, the posterior probability for class 𝐶𝑘 is defined by
$p(C_k|\mathbf{x}) = \dfrac{p(\mathbf{x}|C_k)\,p(C_k)}{\sum_j p(\mathbf{x}|C_j)\,p(C_j)} = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$, where $a_k = \ln\big(p(\mathbf{x}|C_k)\,p(C_k)\big)$
• Multi-class generalization of logistic sigmoid function, or
softmax function
➢ If $a_k \gg a_j$ for all $j \neq k$, we have $p(C_k|\mathbf{x}) \simeq 1$ and $p(C_j|\mathbf{x}) \simeq 0$, so the softmax is a smoothed version of the max function (see the sketch below)
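A small numerical sketch of the logistic sigmoid and softmax discussed above (the max-subtraction trick for numerical stability is an implementation detail not on the slides): it checks the symmetry property and that the softmax over two activations reduces to a sigmoid of their difference.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over the last axis; subtracting the max keeps exp() stable."""
    e = np.exp(a - np.max(a, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

a = 1.3
print(np.isclose(sigmoid(-a), 1.0 - sigmoid(a)))        # symmetry: True

# For K = 2, softmax([a1, a2])[0] equals sigmoid(a1 - a2)
a1, a2 = 2.0, 0.5
print(np.isclose(softmax(np.array([a1, a2]))[0], sigmoid(a1 - a2)))  # True

# If a_k >> a_j for all j != k, the softmax output approaches a hard max
print(softmax(np.array([10.0, 0.0, -2.0])).round(5))    # ~[1, 0, 0]
```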
Continuous inputs: Two-class case
• Assume that the class-conditional densities are Gaussian and all classes share the same covariance matrix: $p(\mathbf{x}|C_k) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$
• Recall $p(C_1|\mathbf{x}) = \sigma(a)$, where $a = \ln\dfrac{p(\mathbf{x}|C_1)\,p(C_1)}{p(\mathbf{x}|C_2)\,p(C_2)}$
• We have $p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$,
where $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\dfrac{p(C_1)}{p(C_2)}$
• Note that the quadratic terms in 𝐱 are canceled
Class-conditional and posterior probabilities
Figure: the class-conditional densities of the two classes (left; red: $p(\mathbf{x}|C_1)$, blue: $p(\mathbf{x}|C_2)$) and the posterior probability $p(C_1|\mathbf{x})$ (right). Note that the posterior is a logistic sigmoid of a linear function of 𝐱
Continuous inputs: Multi-class case
• Assume that the class-conditional densities are Gaussian and all classes share the same covariance matrix: $p(\mathbf{x}|C_k) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$
• Recall $p(C_k|\mathbf{x}) = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$, where $a_k = \ln\big(p(\mathbf{x}|C_k)\,p(C_k)\big)$
• We have $a_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$,
where $\mathbf{w}_k = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k$ and $w_{k0} = -\frac{1}{2}\boldsymbol{\mu}_k^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \ln p(C_k)$
Continuous inputs: Multi-class case
• If each class-conditional density has its own covariance matrix $\boldsymbol{\Sigma}_k$, the quadratic terms no longer cancel, and it leads to a quadratic discriminant
Figure: three class-conditional densities (left) and the posterior probabilities (right).
The decision boundary between the red and green classes is linear, while those between the other pairs are quadratic
Determine parameter values via maximum likelihood
• We specify the functional form of the class-conditional density: $p(\mathbf{x}|C_k) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$
• We assume the prior class probabilities take the form $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$
• Suppose a set of 𝑁 data points {𝐱 𝑛 , 𝑡𝑛 } is provided, where
𝑡𝑛 = 1 denotes class 𝐶1 and 𝑡𝑛 = 0 denotes class 𝐶2
• Our goal is to determine the values of the parameters $\pi$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$ to complete classification
Determine parameter values via maximum likelihood
• For a data point 𝐱𝑛 from class 𝐶1 , we have $p(\mathbf{x}_n, C_1) = p(C_1)\,p(\mathbf{x}_n|C_1) = \pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$
• Similarly for class 𝐶2 , we have $p(\mathbf{x}_n, C_2) = (1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$
• Suppose data are i.i.d. The likelihood function is given by
$p(\mathbf{t}\,|\,\pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N}\big[\pi\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma})\big]^{t_n}\big[(1-\pi)\,\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_2, \boldsymbol{\Sigma})\big]^{1-t_n}$,
where $\mathbf{t} = (t_1, \dots, t_N)^T$
Maximum likelihood solution for 𝜋
• It is convenient to maximize the log of the likelihood function
• Consider first the maximization w.r.t. 𝜋
• The terms in the log likelihood function that depend on 𝜋 are $\sum_{n=1}^{N}\big\{t_n\ln\pi + (1-t_n)\ln(1-\pi)\big\}$
• Setting the derivative w.r.t. 𝜋 to zero, we obtain $\pi = \frac{1}{N}\sum_{n=1}^{N}t_n = \frac{N_1}{N} = \frac{N_1}{N_1 + N_2}$,
where 𝑁1 is the number of data in class 𝐶1 and 𝑁2 is the number of data in class 𝐶2
• The ML estimate for 𝜋 is simply the fraction of points in class 𝐶1
Maximum likelihood solution for $\boldsymbol{\mu}_1$
• Consider the maximization w.r.t. $\boldsymbol{\mu}_1$
• We pick those terms that depend on $\boldsymbol{\mu}_1$ in the log likelihood function:
$\sum_{n=1}^{N} t_n \ln\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = -\frac{1}{2}\sum_{n=1}^{N} t_n (\mathbf{x}_n - \boldsymbol{\mu}_1)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_1) + \text{const}$
• Setting the derivative w.r.t. $\boldsymbol{\mu}_1$ to zero leads to $\boldsymbol{\mu}_1 = \frac{1}{N_1}\sum_{n=1}^{N} t_n\,\mathbf{x}_n$
• It is simply the mean of all data points belonging to class 𝐶1
Maximum likelihood solution for $\boldsymbol{\mu}_2$
• Similarly, the corresponding result for $\boldsymbol{\mu}_2$ is $\boldsymbol{\mu}_2 = \frac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\,\mathbf{x}_n$
• It is simply the mean of all data points belonging to class 𝐶2
Maximum likelihood solution for $\boldsymbol{\Sigma}$
• Finally, consider the maximum likelihood solution for the shared covariance matrix $\boldsymbol{\Sigma}$
• Picking out the terms in the log likelihood function that depend on $\boldsymbol{\Sigma}$, we get
$-\frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{N}{2}\mathrm{Tr}\big\{\boldsymbol{\Sigma}^{-1}\mathbf{S}\big\}$
Maximum likelihood solution for $\boldsymbol{\Sigma}$
where $\mathbf{S} = \frac{N_1}{N}\mathbf{S}_1 + \frac{N_2}{N}\mathbf{S}_2$ and $\mathbf{S}_k = \frac{1}{N_k}\sum_{n\in C_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$
• Setting the derivative to zero, we have $\boldsymbol{\Sigma} = \mathbf{S}$
• The derivation uses $\frac{\partial}{\partial\boldsymbol{\Sigma}}\ln|\boldsymbol{\Sigma}| = \boldsymbol{\Sigma}^{-1}$ and $\frac{\partial}{\partial\boldsymbol{\Sigma}}\mathrm{Tr}\{\boldsymbol{\Sigma}^{-1}\mathbf{S}\} = -\boldsymbol{\Sigma}^{-1}\mathbf{S}\boldsymbol{\Sigma}^{-1}$; setting $-\frac{N}{2}\boldsymbol{\Sigma}^{-1} + \frac{N}{2}\boldsymbol{\Sigma}^{-1}\mathbf{S}\boldsymbol{\Sigma}^{-1} = \mathbf{0}$ gives $\boldsymbol{\Sigma} = \mathbf{S}$
Generative approach summary
• Class-conditional densities: $p(\mathbf{x}|C_k) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma})$
• Class prior probabilities: $p(C_1) = \pi$ and $p(C_2) = 1 - \pi$
• Determine the parameter values of $\pi$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$ by maximum likelihood
• Classification is carried out via $p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$,
where $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $w_0 = -\frac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\dfrac{\pi}{1-\pi}$ (see the sketch below)
• ML solution can be directly extended to multi-class cases
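A minimal NumPy sketch of the whole generative recipe for two classes with a shared covariance, on made-up Gaussian data (the class distributions and sample sizes are assumptions): it computes the ML estimates of $\pi$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, $\boldsymbol{\Sigma}$ and then the posterior $p(C_1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy two-class data (assumed parameters, for illustration)
X1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=150)   # class C1
X2 = rng.multivariate_normal([2.5, 1.5], np.eye(2), size=50)    # class C2
N1, N2 = len(X1), len(X2)

# Maximum likelihood estimates
pi = N1 / (N1 + N2)                       # prior p(C1)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1) / N1
S2 = (X2 - mu2).T @ (X2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)   # shared covariance matrix

# Posterior p(C1|x) = sigmoid(w^T x + w0)
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
      + 0.5 * mu2 @ Sigma_inv @ mu2
      + np.log(pi / (1.0 - pi)))

def posterior_C1(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

print(posterior_C1(np.array([0.0, 0.0])))   # close to 1 (near the C1 mean)
print(posterior_C1(np.array([2.5, 1.5])))   # close to 0 (near the C2 mean)
```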
Generative vs. Discriminative models
• Probabilistic generative model: Indirect approach that finds
the parameters of class-conditional densities and class priors,
and applies Bayes’ theorem to get posterior probabilities
• Probabilistic discriminative models: Direct approach that uses
the generalized linear model to represent posterior
probabilities, and determines its parameters directly.
• Advantages of discriminative models:
➢ Better performance in most cases, especially when the class-
conditional density assumption gives a poor approximation to
the true distribution
➢ Fewer parameters
Nonlinear basis functions for linear classification
• Nonlinear basis functions help when dealing with data that are
not linearly separable:
Figure: data points of classes 𝐶1 and 𝐶2 in the original input space and in the feature space after the nonlinear transform; '+' marks the mean of a Gaussian basis function
Logistic regression for two-class classification
• Recall the posterior probability for two-class classification: $p(C_1|\mathbf{x}) = \sigma(a)$
• 1. Ignore the class-conditional probabilities and class priors
• 2. Apply basis functions for a nonlinear transform: $\boldsymbol{\phi} = \boldsymbol{\phi}(\mathbf{x})$
• 3. Assume the posterior probability can be written as a logistic sigmoid acting on a linear function of the feature vector
• We get $p(C_1|\boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$ and $p(C_2|\boldsymbol{\phi}) = 1 - p(C_1|\boldsymbol{\phi})$,
where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
Logistic regression model
• This model is called logistic regression, though it is used for
classification
• Suppose the dimension of $\boldsymbol{\phi}$ is 𝑀. There are 𝑀 parameters to learn in logistic regression
• Cf. For the generative model, we use 2𝑀 parameters for the
means of two classes and 𝑀 (𝑀 + 1)/2 parameters for the
shared covariance matrix
Determine parameters of logistic regression
• The derivative of the logistic sigmoid function: $\dfrac{d\sigma}{da} = \sigma(1-\sigma)$
• Derivation: $\dfrac{d}{da}\dfrac{1}{1+e^{-a}} = \dfrac{e^{-a}}{(1+e^{-a})^2} = \dfrac{1}{1+e^{-a}}\cdot\dfrac{e^{-a}}{1+e^{-a}} = \sigma(a)\big(1 - \sigma(a)\big)$
• Given training data $\{\boldsymbol{\phi}_n, t_n\}$, where $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)$ and $t_n \in \{0, 1\}$, for $n = 1, \dots, N$, the likelihood function of logistic regression is
$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1-t_n}$,
where $\mathbf{t} = (t_1, \dots, t_N)^T$ and $y_n = p(C_1|\boldsymbol{\phi}_n) = \sigma(\mathbf{w}^T\boldsymbol{\phi}_n)$
Determine parameters of logistic regression
• The negative log likelihood, called the cross-entropy error, is
$E(\mathbf{w}) = -\ln p(\mathbf{t}|\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\big\}$,
where $y_n = \sigma(a_n)$ and $a_n = \mathbf{w}^T\boldsymbol{\phi}_n$
• The gradient of the error function w.r.t. 𝐰 is $\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\boldsymbol{\phi}_n$
• Derivation: by the chain rule and $\dfrac{dy_n}{da_n} = y_n(1-y_n)$,
$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\Big(-\dfrac{t_n}{y_n} + \dfrac{1-t_n}{1-y_n}\Big)\,y_n(1-y_n)\,\boldsymbol{\phi}_n = \sum_{n=1}^{N}(y_n - t_n)\,\boldsymbol{\phi}_n$ (a numerical check follows below)
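A quick numerical check of the gradient formula above, comparing $\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$ against finite differences on random data (the data sizes and random seed are assumptions made for the check).

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 20, 3
Phi = rng.normal(size=(N, M))             # feature vectors phi_n as rows
t = rng.integers(0, 2, size=N)            # binary targets t_n in {0, 1}
w = rng.normal(size=M)

def E(w):
    """Cross-entropy error of logistic regression."""
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Analytic gradient: sum_n (y_n - t_n) * phi_n  =  Phi^T (y - t)
y = 1.0 / (1.0 + np.exp(-(Phi @ w)))
grad_analytic = Phi.T @ (y - t)

# Central finite-difference gradient for comparison
eps = 1e-6
grad_numeric = np.array([
    (E(w + eps * np.eye(M)[i]) - E(w - eps * np.eye(M)[i])) / (2 * eps)
    for i in range(M)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```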
Determine parameters of logistic regression
• Optimize the parameters by stochastic gradient descent: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,(y_n - t_n)\,\boldsymbol{\phi}_n$
• Optimize the parameters by iterative reweighted least squares (IRLS), i.e., the Newton-Raphson iterative optimization scheme
$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$,
where $\mathbf{H}$ is the Hessian matrix whose elements comprise the second derivatives of $E(\mathbf{w})$ w.r.t. 𝐰
Newton-Raphson iterative optimization
• Negative log likelihood function: $E(\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\big\}$
• Gradient: $\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$
• Hessian: $\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1 - y_n)\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}$,
where 𝐑 is a diagonal matrix with $R_{nn} = y_n(1 - y_n)$
Newton-Raphson iterative optimization
• The Newton-Raphson update formula: $\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$ (see the sketch below)
• The Hessian matrix is positive definite
• Proof: for any $\mathbf{u} \neq \mathbf{0}$, $\mathbf{u}^T\mathbf{H}\mathbf{u} = \sum_{n=1}^{N} y_n(1 - y_n)\,(\mathbf{u}^T\boldsymbol{\phi}_n)^2 > 0$ since $0 < y_n < 1$ (assuming the feature vectors span the space); hence $E(\mathbf{w})$ is convex and has a unique minimum
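A minimal NumPy sketch of two-class logistic regression trained with the Newton-Raphson (IRLS) update above, on toy data (the data, iteration count, and the tiny ridge term added to H for numerical safety are assumptions, not part of the slides).

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy two-class data with a bias feature: phi = (1, x1, x2)
X1 = rng.normal([0, 0], 1.0, size=(100, 2))
X2 = rng.normal([2, 2], 1.0, size=(100, 2))
X = np.vstack([X1, X2])
t = np.concatenate([np.zeros(100), np.ones(100)])   # t_n = 1 for the 2nd blob
Phi = np.hstack([np.ones((len(X), 1)), X])

w = np.zeros(Phi.shape[1])
for _ in range(10):                                  # a few Newton steps
    y = 1.0 / (1.0 + np.exp(-(Phi @ w)))             # y_n = sigma(w^T phi_n)
    R = y * (1.0 - y)                                # diagonal entries of R
    grad = Phi.T @ (y - t)                           # Phi^T (y - t)
    H = (Phi * R[:, None]).T @ Phi                   # Phi^T R Phi
    H += 1e-8 * np.eye(len(w))                       # tiny ridge for stability
    w -= np.linalg.solve(H, grad)                    # Newton-Raphson update

pred = (Phi @ w) >= 0                                # sigma >= 0.5 on this side
print("training accuracy:", np.mean(pred == t.astype(bool)))
```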
Multiclass logistic regression
• Recall two-class logistic regression: $p(C_1|\boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$ and $p(C_2|\boldsymbol{\phi}) = 1 - p(C_1|\boldsymbol{\phi})$,
where $\sigma(\cdot)$ is the logistic sigmoid function
• For multiclass cases, we use $p(C_k|\boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$,
where the activation $a_k$ is given by $a_k = \mathbf{w}_k^T\boldsymbol{\phi}$
Likelihood function of multiclass logistic regression
• Two-class case: Given training data $\{\boldsymbol{\phi}_n, t_n\}$, where $t_n \in \{0, 1\}$, for $n = 1, \dots, N$, the likelihood function of two-class logistic regression is
$p(\mathbf{t}|\mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1-t_n}$, where $\mathbf{t} = (t_1, \dots, t_N)^T$ and $y_n = \sigma(\mathbf{w}^T\boldsymbol{\phi}_n)$
• For a multiclass case, we use 1-of-K coding to represent each target label vector $\mathbf{t}_n$ with elements $t_{nk}$. Letting $y_{nk} = y_k(\boldsymbol{\phi}_n)$, we have the likelihood function
$p(\mathbf{T}|\mathbf{w}_1, \dots, \mathbf{w}_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}$,
where $\mathbf{T}$ is the $N \times K$ matrix of target vectors
Newton-Raphson iterative optimization
• The negative log likelihood function is
$E(\mathbf{w}_1, \dots, \mathbf{w}_K) = -\ln p(\mathbf{T}|\mathbf{w}_1, \dots, \mathbf{w}_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$
• When using Newton-Raphson iterative optimization, we need to have the following gradient and Hessian:
$\nabla_{\mathbf{w}_j}E = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,\boldsymbol{\phi}_n$ and $\nabla_{\mathbf{w}_k}\nabla_{\mathbf{w}_j}E = \sum_{n=1}^{N} y_{nk}(I_{kj} - y_{nj})\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T$
Background
• Variable dependence: 𝐰 → 𝑎 → 𝑦 → 𝐸
• According to $E = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$,
we have $\dfrac{\partial E}{\partial y_{nk}} = -\dfrac{t_{nk}}{y_{nk}}$
Background
• According to $y_k = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$,
we have $\dfrac{\partial y_k}{\partial a_j} = y_k(I_{kj} - y_j)$,
where $I_{kj}$ are the elements of the identity matrix
• Proof: for $k = j$, $\dfrac{\partial y_k}{\partial a_k} = \dfrac{e^{a_k}\sum_j e^{a_j} - e^{a_k}e^{a_k}}{\big(\sum_j e^{a_j}\big)^2} = y_k(1 - y_k)$; for $k \neq j$, $\dfrac{\partial y_k}{\partial a_j} = \dfrac{-e^{a_k}e^{a_j}}{\big(\sum_i e^{a_i}\big)^2} = -y_k\,y_j$; the two cases combine into $y_k(I_{kj} - y_j)$
Background
• According to $a_k = \mathbf{w}_k^T\boldsymbol{\phi}$,
we have $\dfrac{\partial a_k}{\partial \mathbf{w}_k} = \boldsymbol{\phi}$
• Given $\dfrac{\partial E}{\partial y_{nk}}$, $\dfrac{\partial y_{nk}}{\partial a_{nj}}$, and $\dfrac{\partial a_{nj}}{\partial \mathbf{w}_j}$, we can compute $\nabla_{\mathbf{w}_j}E$ by the chain rule
Newton-Raphson iterative optimization
𝐰→𝑎→𝑦→𝐸
• Gradient: $\nabla_{\mathbf{w}_j}E = \sum_{n=1}^{N}(y_{nj} - t_{nj})\,\boldsymbol{\phi}_n$
• Hessian: $\nabla_{\mathbf{w}_k}\nabla_{\mathbf{w}_j}E = \sum_{n=1}^{N} y_{nk}(I_{kj} - y_{nj})\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T$
• With gradient and Hessian, Newton-Raphson method works
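A minimal NumPy sketch of multiclass logistic regression using the gradient above; for simplicity it runs plain gradient descent rather than the full Newton-Raphson step with the block Hessian, and the toy data, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 3-class data with a bias feature (all values are assumed)
K, N_per = 3, 60
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(c, 0.8, size=(N_per, 2)) for c in centers])
labels = np.repeat(np.arange(K), N_per)
T = np.eye(K)[labels]                                  # 1-of-K targets, N x K
Phi = np.hstack([np.ones((len(X), 1)), X])             # design matrix, N x M

W = np.zeros((Phi.shape[1], K))                        # column j is w_j
eta = 0.1
for _ in range(2000):
    A = Phi @ W                                        # activations a_nk
    A -= A.max(axis=1, keepdims=True)                  # stabilize exp()
    Y = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)   # softmax outputs
    grad = Phi.T @ (Y - T) / len(X)                    # column j: grad w.r.t. w_j
    W -= eta * grad                                    # plain gradient descent

pred = np.argmax(Phi @ W, axis=1)
print("training accuracy:", np.mean(pred == labels))
```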
Discriminative approach summary
• Posterior probability and logistic regression: $p(C_1|\boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$ and $p(C_2|\boldsymbol{\phi}) = 1 - p(C_1|\boldsymbol{\phi})$
• The negative log likelihood (cross-entropy error): $E(\mathbf{w}) = -\sum_{n=1}^{N}\big\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\big\}$
• Newton-Raphson method for iterative optimization: $\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$
• It can be extended to multiclass classification
Summary
• Linear discriminant
➢ Two-class discriminant
➢ K-class discriminant
➢ Fisher’s linear discriminant
• Probabilistic generative model
➢ Class-conditional probability and class prior probability
➢ ML solution
• Probabilistic discriminative model: Logistic regression
➢ Posterior probability
➢ Newton-Raphson iterative optimization
References
• Chapters 4.1, 4.2, and 4.3 in the PRML textbook
Thank You for Your Attention!
Yen-Yu Lin (林彥宇)
Email: lin@cs.nctu.edu.tw
URL: https://www.cs.nctu.edu.tw/members/detail/lin