Topic 3: INTRO TO CLASSIFICATION
STAT 37710/CAAM 37710/CMSC 35400 Machine Learning
Risi Kondor, The University of Chicago
[Hastie, Tibshirani & Friedman]
Classification
Two-class classification in R^n is the prototypical supervised ML problem.
• Input space: X (in the simplest case, X = R^n)
• Output space: Y = {−1, +1} (in the k-class case, Y = {1, . . . , k})
• Training set: S = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)}
• Goal: find a good hypothesis f : X → Y.
We will look at four simple classifiers:
1. Gaussian Discriminant Analysis
2. Logistic regression
3. k –nearest neighbors
4. The perceptron
1. Gaussian discriminant analysis
Model based classification
1. Assume that x and y come from a joint distribution p(x, y) (similar to
mixture models for clustering).
2. Define a marginal distribution for y (Bernoulli):
$$p_0(y = +1) = \pi, \qquad p_0(y = -1) = 1 - \pi.$$
3. Define the conditionals for x given y:
$$p(x \mid y = +1) = p_{+1}(x), \qquad p(x \mid y = -1) = p_{-1}(x).$$
4. Estimate the parameters of $p_{+1}$, $p_{-1}$, and $p_0$ from the training data.
5. Classify future examples according to
$$\hat{y} = \begin{cases} +1 & \text{if } p(y = +1 \mid x) \ge p(y = -1 \mid x) \\ -1 & \text{if } p(y = +1 \mid x) < p(y = -1 \mid x), \end{cases}$$
where the posterior is given by Bayes' rule: $p(y \mid x) \propto p_y(x)\, p_0(y)$.
This is called the Bayes optimal classifier.
Gaussian Discriminant Analysis
Assume that X = R^n and that $p_{+1}(x)$ and $p_{-1}(x)$ are both multivariate
normal (Gaussian) densities:
$$\mathcal{N}(x;\, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}}\; e^{-(x-\mu)^\top \Sigma^{-1} (x-\mu)/2},$$
where $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ is a symmetric positive definite matrix. Here
$$\mathbb{E}(x) = \mu, \qquad \mathrm{Cov}(x_i, x_j) = \Sigma_{i,j}.$$
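To make the formula concrete, here is a minimal NumPy sketch that evaluates this density (the function name and the use of solve() instead of an explicit matrix inverse are my choices, not from the slides):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate normal density N(x; mu, Sigma)."""
    n = mu.shape[0]
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu); solve() avoids forming Sigma^{-1}.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / 2) / norm_const
```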
Gaussian Discriminant Analysis
$$p(x, y) = \pi^{(1+y)/2}\, (1 - \pi)^{(1-y)/2}\; \frac{1}{(2\pi)^{n/2}\, |\Sigma_y|^{1/2}} \exp\!\left( -\frac{(x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y)}{2} \right)$$
How do we learn (estimate) the parameters {µ+1 , Σ+1 , µ−1 , Σ−1 , π} ?
Easier than clustering because y is not latent, so no need for EM.
Likelihood for GDA
$$L(\pi, \mu_{+1}, \mu_{-1}, \Sigma_{+1}, \Sigma_{-1}) = \prod_{i=1}^{m} \pi^{(1+y_i)/2}\, (1 - \pi)^{(1-y_i)/2}\; \frac{1}{(2\pi)^{n/2}\, |\Sigma_{y_i}|^{1/2}} \exp\!\left( -(x_i - \mu_{y_i})^\top \Sigma_{y_i}^{-1} (x_i - \mu_{y_i})/2 \right)$$
Log-likelihood for GDA
$$\ell(\pi, \mu_{+1}, \mu_{-1}, \Sigma_{+1}, \Sigma_{-1}) = \mathrm{cst} + \sum_{i=1}^{m} \left[ \frac{1+y_i}{2} \log \pi + \frac{1-y_i}{2} \log(1-\pi) \right]$$
$$+ \sum_{i\,:\, y_i = +1} \left[ -\frac{1}{2} \log |\Sigma_{+1}| - \frac{(x_i - \mu_{+1})^\top \Sigma_{+1}^{-1} (x_i - \mu_{+1})}{2} \right] + \sum_{i\,:\, y_i = -1} \left[ -\frac{1}{2} \log |\Sigma_{-1}| - \frac{(x_i - \mu_{-1})^\top \Sigma_{-1}^{-1} (x_i - \mu_{-1})}{2} \right]$$
Observation: the part of the likelihood relating to the y = +1 Gaussian completely
separates from the part relating to the y = −1 Gaussian, which, in turn, separates
from the Bernoulli part. So these three parts can be maximized separately.
MLE for two-Gaussians model
• MLE for π:
$$\hat{\pi} = \frac{n_{+1}}{m} = \frac{\text{number of training examples with } y_i = +1}{\text{total number of training examples}}$$
• MLE for $(\mu_{+1}, \Sigma_{+1})$ and $(\mu_{-1}, \Sigma_{-1})$:
$$\hat{\mu}_{+1} = \frac{1}{n_{+1}} \sum_{i\,:\, y_i = +1} x_i \qquad \hat{\Sigma}_{+1} = \frac{1}{n_{+1}} \sum_{i\,:\, y_i = +1} (x_i - \hat{\mu}_{+1})(x_i - \hat{\mu}_{+1})^\top$$
$$\hat{\mu}_{-1} = \frac{1}{n_{-1}} \sum_{i\,:\, y_i = -1} x_i \qquad \hat{\Sigma}_{-1} = \frac{1}{n_{-1}} \sum_{i\,:\, y_i = -1} (x_i - \hat{\mu}_{-1})(x_i - \hat{\mu}_{-1})^\top$$
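To see all the pieces together, here is a minimal NumPy sketch of GDA training and prediction (the function names and dictionary layout are mine; the normalizing constant $(2\pi)^{n/2}$ is dropped because it is common to both classes):

```python
import numpy as np

def fit_gda(X, y):
    """MLE for GDA: X is (m, n), y is (m,) with entries in {-1, +1}."""
    params = {"pi": np.mean(y == +1)}
    for c in (+1, -1):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        diff = Xc - mu
        params[c] = (mu, diff.T @ diff / len(Xc))   # (mu_c, Sigma_c)
    return params

def predict_gda(params, x):
    """Classify x by comparing the joint densities p(x, y = +1) and p(x, y = -1)."""
    def log_joint(c, prior):
        mu, Sigma = params[c]
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)
        _, logdet = np.linalg.slogdet(Sigma)        # numerically stable log |Sigma|
        return np.log(prior) - 0.5 * (logdet + quad)
    return +1 if log_joint(+1, params["pi"]) >= log_joint(-1, 1 - params["pi"]) else -1
```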
2. Logistic Regression
Logistic regression
• Assume that the conditional distribution of y is
$$P(y = +1 \mid x) = h(x), \qquad P(y = -1 \mid x) = 1 - h(x).$$
In other words, $y \mid x \sim \mathrm{Bernoulli}(h(x))$.
• Set h(x) to be a nice function $\mathbb{R}^n \to [0, 1]$, in particular the logistic
function (the sigmoid function of θ · x):
$$h(x) = \frac{1}{1 + e^{-\theta \cdot x}}.$$
This is similar to how in model based regression we took $y \mid x \sim \mathcal{N}(f(x), \sigma^2)$.
The sigmoid function
The sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$ has the nice property that
$$g'(z) = \frac{d}{dz}\, \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = g(z)\, \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = g(z)\,(1 - g(z)).$$
MLE for logistic regression
Defining for convenience $u_i = \frac{1 + y_i}{2}$, the likelihood is
$$L(\theta) = \prod_{i=1}^{m} h(x_i)^{u_i}\, (1 - h(x_i))^{1 - u_i}.$$
The log-likelihood is
$$\ell(\theta) = \sum_{i=1}^{m} \Big[\, u_i \log h(x_i) + (1 - u_i) \log(1 - h(x_i)) \,\Big].$$
Unlike in GDA, this cannot be maximized in closed form. → Try stochastic
gradient descent.
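The gradient needed for this is worth writing out once (a standard one-line derivation using the sigmoid identity g′ = g(1 − g) from the previous slide; it justifies the SGD update a few slides below):
$$\frac{\partial \ell}{\partial \theta} = \sum_{i=1}^{m} \left[ \frac{u_i}{h(x_i)} - \frac{1 - u_i}{1 - h(x_i)} \right] h(x_i)\,(1 - h(x_i))\, x_i = \sum_{i=1}^{m} \big( u_i - h(x_i) \big)\, x_i.$$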
Gradient descent
Given a loss function J, we can minimize it using gradient descent:
θ ← 0
until (convergence) {
    θ ← θ − α ∇J(θ)
}
α is a parameter called the learning rate.
Gradient descent is the "poor man's algorithm" of optimization: it works in
almost any setting, but sometimes not very well, and choosing α can be tricky.
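A minimal Python sketch of this loop (the quadratic test objective, tolerance, and step size are illustrative choices, not from the slides):

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, tol=1e-8, max_iters=10_000):
    """Generic gradient descent: repeat theta <- theta - alpha * grad J(theta)."""
    theta = theta0
    for _ in range(max_iters):
        step = alpha * grad_J(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # crude convergence test
            break
    return theta

# Example: minimize J(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3).
theta_star = gradient_descent(lambda th: 2 * (th - 3.0), np.zeros(2))
print(theta_star)   # approximately [3., 3.]
```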
Recap: the gradient
The gradient of a function $J : \mathbb{R}^p \to \mathbb{R}$ is the vector ∇J(u) with
$$[\nabla J(u)]_i = \frac{\partial}{\partial u_i} J(u).$$
The gradient always points in the steepest uphill direction (the slide's figure
shows the negative gradient, i.e., the descent direction). At a minimum/maximum
(or saddle point) ∇J = 0.
The learning rate α
Setting the learning rate can be quite tricky. Should we decrease the learning
rate as we get closer to the optimum?
Gradient descent
Gradient descent trajectories can be more complicated than you would
expect.
SGD for logistic regression
θ ← 0
until (convergence) {
    for (i = 1 to m) {
        θ ← θ − α (h(x_i) − u_i) x_i
    }
}
Stochastic gradient descent performs gradient descent with respect to the loss
(negative log-likelihood) of individual examples, or of small randomly selected
sets of examples called minibatches.
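A runnable sketch of this loop (the epoch count and random visiting order are my choices; labels are mapped to $u_i = (1 + y_i)/2$ as defined above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic(X, y, alpha=0.1, epochs=100, seed=0):
    """SGD on the negative log-likelihood; X is (m, n), y has entries in {-1, +1}."""
    rng = np.random.default_rng(seed)
    u = (1 + y) / 2                        # u_i in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # visit the examples in random order
            h = sigmoid(theta @ X[i])
            theta -= alpha * (h - u[i]) * X[i]
    return theta
```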
3. k –nearest neighbors
k -nearest neighbors
[Figure: decision boundaries of the 1-nearest neighbor vs. 15-nearest neighbor classifiers]
The hypothesis returned by the k-NN classifier is
$$\hat{f}(x) = \mathrm{sgn}\!\left( \sum_{i \in \mathrm{NN}_k(x)} y_i \right),$$
where $\mathrm{NN}_k(x) = \{i_1, \ldots, i_k\}$ are the indices of the k points in S
closest to x. (When k is even and $\sum_{i \in \mathrm{NN}_k(x)} y_i = 0$, break ties
arbitrarily.)
What effect does k have on the behavior of k -NN?
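A minimal sketch of the classifier (Euclidean distance and brute-force search are my choices here; practical implementations use spatial data structures such as k-d trees):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=15):
    """k-NN: sign of the label sum over the k training points nearest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    vote = y_train[nearest].sum()
    return +1 if vote >= 0 else -1               # ties broken arbitrarily (here: +1)
```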
4. The Perceptron
Linear classifiers
The Perceptron belongs to the wide class of linear classifiers.
Input space: X = R^n
Output space: Y = {−1, +1}
Hypothesis (affine hyperplane):
f (x) = sgn(w · x + b)
• w is the normal to the
separating hyperplane
• b is the bias
Empirical error:
$$E_{\mathrm{train}}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell_{0/1}(f(x_i), y_i),$$
where $\ell_{0/1}(\hat{y}, y) = 0$ if $\hat{y} = y$, and $\ell_{0/1}(\hat{y}, y) = 1$ otherwise.
Of all possible hyperplanes that separate the data which one is best?
Inspiration: Artificial Neural Networks
$$o_j = \varphi\!\left( \theta_j + \sum_i w_{i,j}\, x_i \right)$$
An old idea going back to [McCulloch & Pitts, 1943].
The perceptron (Rosenblatt, 1958)
Consider the following simplified scenario:
• ‖x_i‖ = 1 for all i.
• Assume that all datapoints are labeled by the ground truth concept
$$h_{\mathrm{true}}(x) = \begin{cases} 1 & \text{if } v \cdot x \ge 0 \\ -1 & \text{otherwise,} \end{cases}$$
where v is some fixed vector with, w.l.o.g., ‖v‖ = 1.
• Since the separator of h_true passes through the origin, set b = 0 for all h.
• We want to solve the problem in its online form.
The perceptron (Rosenblatt, 1958)
w ← 0 ;
t ← 1 ;
while (true) {
    if w · x_t ≥ 0 predict ŷ_t = 1 ; else predict ŷ_t = −1 ;
    if ((ŷ_t = −1) and (y_t = 1)) let w ← w + x_t ;
    if ((ŷ_t = 1) and (y_t = −1)) let w ← w − x_t ;
    t ← t + 1 ;
}
Note: In this form, the Perceptron is an online learning algorithm.
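A compact sketch of the same loop over a finite stream of examples (the array-based representation is mine; the unit-norm assumption from the setup is the caller's responsibility):

```python
import numpy as np

def perceptron(X, y):
    """Online perceptron with b = 0; X is (m, n), y has entries in {-1, +1}."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        y_hat = 1 if w @ x_t >= 0 else -1
        if y_hat != y_t:
            w += y_t * x_t    # one line covers both update rules: +x_t on a false
            mistakes += 1     # negative (y_t = 1), -x_t on a false positive (y_t = -1)
    return w, mistakes
```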
The margin
The projection of x onto the line orthogonal to the decision boundary
w · x + b = 0 is
$$Px = \frac{w \cdot x}{\|w\|}.$$
On the boundary,
$$w \cdot x = -b \quad \Longrightarrow \quad Px = \frac{-b}{\|w\|}.$$
The margins of a correctly classified positive resp. negative example are
$$\delta_{(x,+1)} = Px - \frac{-b}{\|w\|}, \qquad \delta_{(x,-1)} = \frac{-b}{\|w\|} - Px,$$
so the general formula is $\delta_{(x,y)} = y\, (w \cdot x + b) / \|w\|$.
The perceptron convergence thm
Suppose that the perceptron is run on a sequence of examples with
‖x_t‖ = 1 and y_t (v · x_t) ≥ 0 for some v with ‖v‖ = 1. In addition,
assume that the data obeys |v · x_t| ≥ δ for all x_t (a margin condition).
Claim 1: After M mistakes, w · v ≥ δM.
Initially, w · v = 0. After a false negative, w ← w + x_t, and
(w + x_t) · v ≥ (w · v) + δ, so w · v increases by at least δ. Similarly
for a false positive, where w ← w − x_t and −v · x_t ≥ δ.
The perceptron convergence thm
Claim 2: After M mistakes, ‖w‖² ≤ M.
Initially, ‖w‖ = 0. On a false negative, w ← w + x_t, and
$$\|w + x_t\|^2 = (w + x_t) \cdot (w + x_t) = \|w\|^2 + 2\, w \cdot x_t + \|x_t\|^2.$$
The third term on the RHS is 1, while the second term must be negative,
since we predicted ŷ_t = −1 precisely because w · x_t < 0. Therefore ‖w‖²
can increase by at most 1. Similarly for false positives.
The perceptron convergence thm
Theorem. If the perceptron is run on a sequence of examples with
|v · x_t| ≥ δ (plus our other assumptions), then it will make at most 1/δ²
mistakes.
Proof. After M mistakes, combining the two claims,
$$\delta M \;\le\; w \cdot v \;\le\; \|w\| \;\le\; \sqrt{M}$$
(the middle inequality is Cauchy–Schwarz with ‖v‖ = 1), so √M ≤ 1/δ, i.e.,
M ≤ 1/δ². ■
Affine hyperplanes
Question: What if we don't want the separator to go through the origin?
Add an extra feature with [x_t]_{n+1} = 1 for all t (see the sketch below):
$$x_1 w_1 + x_2 w_2 + \cdots + x_n w_n \ge \theta \;\iff\; x_1 w_1 + x_2 w_2 + \cdots + x_n w_n + x_{n+1} w_{n+1} \ge 0$$
with $w_{n+1} = -\theta$.
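In code the trick is a one-liner; a sketch with hypothetical array names:

```python
import numpy as np

X = np.random.randn(100, 5)                        # hypothetical data, shape (m, n)
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append the constant feature x_{n+1} = 1
# A weight vector of length n+1 learned on X_aug encodes an affine separator on X,
# with the last coordinate playing the role of -theta.
```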
Question: How can we make linear threshold functions more expressive?
Towards multiclass
Rewrite the perceptron as a competition between w+ and w− (labels are
written as {0, 1} here):
w+ ← (0, 0, . . . , 0) ;
w− ← (0, 0, . . . , 0) ;
t ← 1 ;
while (true) {
    if w+ · x_t ≥ w− · x_t predict ŷ_t = 1 ; else predict ŷ_t = 0 ;
    if ((ŷ_t = 0) and (y_t = 1)) let w+ ← w+ + x_t ;
    if ((ŷ_t = 1) and (y_t = 0)) let w− ← w− + x_t ;
    t ← t + 1 ;
}
Towards multiclass
If y_t ∈ {1, 2, . . . , κ}, maintain κ different weight vectors
w_1, w_2, . . . , w_κ and predict ŷ_t = argmax_c (w_c · x_t).
On erroneously classifying example x_t as class i instead of class j:
• w_i ← w_i − x_t/2
• w_j ← w_j + x_t/2
[Collins & Duffy, 2002]
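A sketch of this κ-class version (the argmax prediction rule is as above; 0-based labels and the data layout are my adaptation):

```python
import numpy as np

def multiclass_perceptron(X, y, kappa):
    """Multiclass perceptron; X is (m, n), y holds labels in {0, ..., kappa - 1}."""
    W = np.zeros((kappa, X.shape[1]))    # one weight vector per class
    for x_t, y_t in zip(X, y):
        y_hat = int(np.argmax(W @ x_t))  # predict the highest-scoring class
        if y_hat != y_t:
            W[y_hat] -= x_t / 2          # demote the wrongly predicted class
            W[y_t] += x_t / 2            # promote the true class
    return W
```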
General concepts of supervised learning
The general framework
• There is an unknown distribution p on X × Y.
• The training set S = {(x_1, y_1), (x_2, y_2), . . . , (x_m, y_m)} is an IID
sample from p.
• The hypothesis is a function $\hat{f} : X \to Y$.
• The learning algorithm is a map $A : S \mapsto \hat{f}$.
• The hypothesis space F is the space of functions accessible to A.
We want $\hat{f}$ to be good on future examples drawn from p, where "good" is
defined in terms of a loss function $\ell : Y \times Y \to \mathbb{R}_+$.
How can we ever be sure of finding a good $\hat{f}$, especially when F is
huge? This is what Statistical Learning Theory is about.
Three notions of error
• The training error or empirical error:
$$E_{\mathrm{emp}}[\hat{f}] = \frac{1}{m} \sum_{i=1}^{m} \ell(\hat{f}(x_i), y_i)$$
This is what we can actually measure, and what many learning algorithms
try to minimize. → Empirical Risk Minimization (ERM)
• The testing error, computed on a separate set of m′ examples $(x'_i, y'_i)$:
$$E_{\mathrm{test}}[\hat{f}] = \frac{1}{m'} \sum_{i=1}^{m'} \ell(\hat{f}(x'_i), y'_i)$$
This is what the algorithm is going to be evaluated on.
• The true error:
$$E_{\mathrm{true}}[\hat{f}] = \mathbb{E}_{(x,y)\sim p}\; \ell(\hat{f}(x), y).$$
This is what an ideal algorithm would minimize, but it cannot be measured,
since p is unknown.
Overfitting
In general, because $\hat{f}$ is explicitly chosen to minimize $E_{\mathrm{emp}}[\hat{f}]$, the empirical
error will tend to be an overoptimistic estimate of the true error, i.e., typically
$E_{\mathrm{true}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}]$ and $E_{\mathrm{test}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}]$. This is called overfitting.
Question: How can we avoid overfitting? Restrict the hypothesis space! But by
how much?
Overfitting and underfitting
This is the biggest thing to be paranoid about in all branches of ML.
The overfitting/underfitting tradeoff
The hypothesis space should be big enough, but not too big. Alternatively, we
need to add a regularizer (cf. inverse problems in applied math).
Regularized Risk Minimization (RRM)
The most general framework for finding the right balance between overfitting
and underfitting in supervised learning is RRM:
$$\hat{f} = \underset{f \in \mathcal{F}}{\arg\min}\; \underbrace{\frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)}_{\text{training error}} \;+\; \lambda\, \underbrace{\Omega(f)}_{\text{regularizer}}$$
Terminology:
1. $\mathcal{F}$ : hypothesis space
2. $\ell(\hat{y}, y)$ : loss function
3. $\Omega : \mathcal{F} \to \mathbb{R}_+$ : regularization functional
The right compromise is found by tuning the regularization parameter λ.
[Tikhonov regularization]
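As a concrete instance, a sketch of an RRM objective for logistic regression with $\Omega(f) = \|\theta\|^2$ (the standard L2/Tikhonov choice; the function name and the numerical guard are mine):

```python
import numpy as np

def regularized_risk(theta, X, y, lam):
    """RRM objective: mean logistic loss plus lam * ||theta||^2."""
    u = (1 + y) / 2                      # labels mapped to {0, 1}
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    eps = 1e-12                          # guard against log(0)
    loss = -(u * np.log(h + eps) + (1 - u) * np.log(1 - h + eps)).mean()
    return loss + lam * theta @ theta
```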
Statistical Learning Theory
Concentration of measure
The reason that we have any hope of learning a good $\hat{f}$ at all is that for any
fixed f ∈ F, over the choice of training/testing sets,
$$\mathbb{E}(E_{\mathrm{emp}}[f]) = \mathbb{E}(E_{\mathrm{test}}[f]) = E_{\mathrm{true}}[f].$$
Moreover, as the sizes of the training/testing sets increase, these quantities
become more and more concentrated around $E_{\mathrm{true}}[f]$.
However, a very unlucky choice of training set can always mess us up.
Generalization Bounds
Statistical Learning Theory proves bounds of the form
$$P\big[\, E_{\mathrm{true}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}] + \epsilon \,\big] < \delta,$$
where ε is a complicated function depending on δ, the size of the function
class F, and the nature of the regularization. → Probably
Approximately Correct (PAC) bounds [Valiant, 1984]
Example: if the Vapnik–Chervonenkis dimension of F is d, then
$$P\left[\, E_{\mathrm{true}}[\hat{f}] > E_{\mathrm{emp}}[\hat{f}] + \sqrt{\frac{d\left( \log \frac{2m}{d} + 1 \right) + \log(4/\delta)}{m}} \,\right] \le \delta.$$
In reality, even the best such bounds are ridiculously loose.
Cross validation
Holdout set
The generalization bounds from statistical learning theory are framed entirely
in terms of the training set, and therefore they reveal fundamental properties
of the learning algorithm. Unfortunately, they are almost always incredibly
loose.
The more practical way to estimate E_true is to split the training data into:
• The true training set, used to train the algorithm
• A holdout set {(x_1, y_1), (x_2, y_2), . . . , (x_p, y_p)}, which is only used to
compute the holdout error
$$E_{\mathrm{h.o.}}[\hat{f}] = \frac{1}{p} \sum_{i=1}^{p} \ell(\hat{f}(x_i), y_i).$$
Because the holdout examples play no role in training, the holdout error is an
unbiased estimator of $E_{\mathrm{true}}[\hat{f}]$, i.e., $\mathbb{E}[E_{\mathrm{h.o.}}[\hat{f}]] = E_{\mathrm{true}}[\hat{f}]$.
Cross Validation
k-fold cross validation uses the holdout idea to set internal parameters θ
of learning algorithms (e.g., k in k-NN):
• Set θ to some value.
• Split the training set into k roughly equal parts S_1, S_2, . . . , S_k.
• For each i, train the algorithm on S_1, . . . , S_{i−1}, S_{i+1}, . . . , S_k to get $\hat{f}_i$.
• Compute the holdout error of each $\hat{f}_i$ on the corresponding S_i.
• Average these errors to get an estimate of $E_{\mathrm{true}}(\hat{f}_\theta)$.
• Iterate over a range of θ values, and finally set θ to the value that
minimizes the cross-validation error.
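A sketch of the inner loop (the fold construction and the generic train/error callables are illustrative):

```python
import numpy as np

def cross_val_error(train_fn, error_fn, X, y, k=5, seed=0):
    """k-fold CV: average the holdout error of the k models, one per left-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        f_hat = train_fn(X[train_idx], y[train_idx])              # train on all folds but S_i
        errors.append(error_fn(f_hat, X[folds[i]], y[folds[i]]))  # holdout error on S_i
    return float(np.mean(errors))

# Usage: evaluate cross_val_error over a grid of theta values (e.g., k in k-NN)
# and keep the theta with the smallest cross-validation error.
```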
Review of algorithms so far
              | Clustering           | Classification
------------- | -------------------- | -------------------------------
Algorithmic   | k-means              | k-nearest neighbors, perceptron
Probabilistic | mixture of Gaussians | GDA, logistic regression