07 - Linear Models For Classification
— Steve Oudot
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Supervised learning (classification)
Input: n observations + responses (x1 , y1 ), · · · , (xn , yn ) ∈ X × Y
Example: X = {images}, Y = {labels}
Classification: Y is discrete
Statistical framework
Hyp: the observations are drawn from a random variable X, the responses from a random variable Y:
▶ xi ∼ X i.i.d., with values in X = Rd
▶ yi ∼ Y i.i.d., with values in Y = {1, · · · , κ} (classification)
→ X, Y are the marginals of the joint distribution Pr(X, Y); the joint distribution encodes the complexity of the problem (in fact the complexity is encoded in the conditional distribution of Y given X)
[Figure: Y (values 1, · · · , κ) plotted against X in two scenarios]
▶ X, Y perfectly dependent ⇒ ∃ perfect predictor
▶ X, Y imperfectly dependent ⇒ ∄ perfect predictor
Best prediction at x (Bayes classifier):

    argmin_{y∈{1,··· ,κ}} ( 1 − P(Y = y | X = x) )  =  argmax_{y∈{1,··· ,κ}} P(Y = y | X = x)

(the best prediction at x maximizes the posterior probability P(Y | X); the first expression, a minimization of the expected 0-1 loss, is where the choice of the 0-1 loss comes into play)
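To make the definition concrete, here is a minimal numerical sketch (not from the slides) of the Bayes classifier on a toy model where the joint distribution is known: two classes with assumed Gaussian class-conditional densities on R and equal priors.

```python
# Toy Bayes classifier: argmax_y P(Y = y | X = x), with an assumed generative model.
import numpy as np

priors = np.array([0.5, 0.5])                          # P(Y = y), y in {1, 2}
means, stds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_predict(x):
    """argmax_y P(Y = y | X = x), computed via Bayes' rule up to normalization."""
    posteriors = gauss_pdf(x, means, stds) * priors    # proportional to P(Y = y | X = x)
    return int(np.argmax(posteriors)) + 1              # labels are 1-based

print(bayes_predict(-0.3), bayes_predict(0.7))         # -> 1 2
```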
Advantages and drawbacks of k-NN in practice
• high flexibility
• easiness of implementation:
  ▶ only a few lines of code for NN-search via linear scan
  ▶ the same applies for more advanced sublinear methods, using the right libraries (e.g. ANN, LSH)
• algorithmic cost of prediction:
  ▶ linear scan in Θ(nd)
  ▶ sublinear methods become (close to) linear in high dimensions
• slow convergence in high dimensions (curse of dimensionality)
By contrast, the linear methods of this lecture offer easy and efficient prediction (a dot product), put the algorithmic cost on pre-training, and have no hyper-parameter.
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear methods for classification
Response variable Y is discrete
▶ consider the fibers of the predictor f: f⁻¹({1}), · · · , f⁻¹({κ})
▶ a linear classifier is one that produces linear decision boundaries (cf. the definition of linear methods in the lecture on linear regression)
[Figure: colored dots are the input observations with their responses; the colored areas are the fibers f⁻¹({1}), f⁻¹({2}) of the predictor — left: linear, right: nonlinear]

2 types of approaches:
▶ model the posterior probability (discriminant function δy) for each class y, then classify by taking argmax_{y∈{1,··· ,κ}} δy(x)
  (analogously to the Bayes classifier, one tries to model the posterior probability P(Y = y | X = x), with one discriminant function per class)
  e.g. linear / logistic regression, LDA
▶ explicitly construct the decision boundaries separating the classes
  e.g. SVM (later in this lecture)
[Figure: result of LDA on the Iris dataset — left: the dataset with the decision boundaries and the fibers f⁻¹({1}), f⁻¹({2}), f⁻¹({3}); right: the discriminant functions δ1, δ2, δ3]
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear regression for classification
Use a linear model for the discriminant functions (yields linear decision boundaries).

Fit the model by least squares:   B̂ := argmin_B  Σ_{i=1}^n ∥ Z(yi) − [ 1 xi^T ] B ∥₂²

▶ row indicator vector: Z(Y) := [ 1_{Y=1} ··· 1_{Y=κ} ] ∈ R^{1×κ}
  (the corresponding random variable in our statistical model; it is fully dependent on Y)
Equivalently, in matrix form:   B̂ := argmin_B ∥ Z − X B ∥_F²
  (Frobenius norm: ∥M∥_F² := Σ_{i,j} M_{ij}² — it acts as the ℓ2-norm on the entries of the matrix)
▶ indicator response matrix:   Z := [ Z(y1) ; ··· ; Z(yn) ] ∈ R^{n×κ}
  (our new input of responses: a binary matrix with exactly one nonzero entry per row)
Unique minimum (assuming X has full column rank):   B̂ = (X^T X)⁻¹ X^T Z
  (otherwise, regularize by some ℓp penalty (e.g. ℓ2) as in ordinary linear regression)
Fitted discriminant functions:   δ̂(x) := [ δ̂1(x) ··· δ̂κ(x) ] = [ 1 x^T ] B̂
  (each column of B̂ is a fitted parameter vector β̂y, hence the y-th entry of δ̂(x))

Running experiment: error rate ≈ 33%
  (surprising, as the classes are indeed linearly separable — compare with the Bayes error rate)
[Figure: decision regions, and a cross-section along the diagonal showing δ̂1, δ̂2, δ̂3 — the plot explains the phenomenon: the discriminant function of class 2 never dominates]
Linear regression for classification
What is happening:
▶ the fitted discriminant functions sum up to 1 (in the centered model, i.e. assuming the set of observations is centered in Rd, mean = 0)
▶ as the cross-section shows, δ̂2 is then dominated everywhere by δ̂1 or δ̂3, so class 2 is never predicted
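For concreteness, a minimal numpy sketch (function names are mine, not the slides') of the indicator-matrix least-squares fit and the argmax classification rule described above:

```python
# Linear regression on class indicators: B_hat = argmin_B ||Z - X1 B||_F^2,
# then classify by argmax of the fitted discriminant functions delta_hat(x) = [1 x^T] B_hat.
import numpy as np

def fit_indicator_regression(X, y, n_classes):
    """X: (n, d) observations, y: (n,) labels in {1, ..., kappa}."""
    n = X.shape[0]
    X1 = np.hstack([np.ones((n, 1)), X])             # prepend the intercept column
    Z = np.eye(n_classes)[y - 1]                      # (n, kappa) indicator matrix
    B_hat, *_ = np.linalg.lstsq(X1, Z, rcond=None)    # least-squares fit, column by column
    return B_hat                                      # shape (d+1, kappa)

def predict(B_hat, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    deltas = X1 @ B_hat                               # fitted discriminant functions
    return np.argmax(deltas, axis=1) + 1              # labels are 1-based
```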
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Logistic regression for binary classification
Generalized linear model for the discriminant functions:

    δ1(x) := σ ([ 1 x^T ] β1)   where σ(t) := exp(t) / (1 + exp(t)) = 1 / (1 + exp(−t))   (logistic sigmoid function)

▶ forces δ1(x) ∈ [0, 1]
▶ δ2(x) := 1 − δ1(x) = σ (− [ 1 x^T ] β1)

Properties of the regression and of the associated classifier (f̂(x) is defined, once again, as the argmax of the discriminant functions):
▶ σ⁻¹(u) = ln ( u / (1 − u) )   (logit function)
▶ σ(t) + σ(−t) = 1
  ("sigmoid" means "S-shaped" and bounded — the logistic sigmoid is but one example of sigmoid)
▶ P(Y = 1 | X = x) = δ1(x) = σ ([ 1 x^T ] β1)
▶ the model makes the posterior probability ratio log-linear:

    ln [ P(Y = 1 | X = x) / P(Y = 2 | X = x) ] = ln [ σ ([ 1 x^T ] β1) / σ (− [ 1 x^T ] β1) ] = ln [ (1 + exp ([ 1 x^T ] β1)) / (1 + exp (− [ 1 x^T ] β1)) ] = [ 1 x^T ] β1

  (factorize the numerator by exp ([ 1 x^T ] β1) to get the last equality; this means that, fundamentally, the discriminant functions are not fit independently of each other: fitting δ1 determines δ2)
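A quick numerical check of the two sigmoid properties listed above (a throwaway sketch, not part of the slides):

```python
# Check sigma(t) + sigma(-t) = 1 and that the logit ln(u / (1 - u)) inverts sigma.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(u):
    return np.log(u / (1.0 - u))

t = np.linspace(-5, 5, 11)
assert np.allclose(sigma(t) + sigma(-t), 1.0)   # sigma(t) + sigma(-t) = 1
assert np.allclose(logit(sigma(t)), t)          # logit is the inverse of sigma
```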
Logistic regression for binary classification
Model fitting by maximum likelihood (maximum likelihood estimation for parametrized models was already seen in a previous lecture):

    L ((yi)_{i=1}^n ; (xi)_{i=1}^n , β1) = Π_{i=1}^n P(Y = yi | X = xi ; β1)   (independence)

    ln L ((yi)_{i=1}^n ; (xi)_{i=1}^n , β1) = Σ_{i=1}^n ln P(Y = yi | X = xi ; β1)

Change of variable: Z := Y mod 2, with values in {0, 1};  ∀i, zi := yi mod 2 ∈ {0, 1}
  (this is just to have a response taking values in {0, 1} instead of {1, 2}, which is more convenient)

The estimator is the maximizer of the log-likelihood, renamed ℓ(β1) for simplicity:

    β̂1 := argmax_{β1} ℓ(β1)   where   ℓ(β1) := ln L ((yi)_{i=1}^n ; (xi)_{i=1}^n , β1) = Σ_{i=1}^n [ zi [ 1 xi^T ] β1 − ln (1 + exp ([ 1 xi^T ] β1)) ]

  (note that this functional has the same shape as the one based on minimizing the empirical risk)

Gradient and Hessian (writing xi¹ := [ 1 xi^T ]^T for the augmented observation):

    ∇ ℓ(β1) = Σ_{i=1}^n ( zi − σ ([ 1 xi^T ] β1) ) xi¹

    ∇² ℓ(β1) = − Σ_{i=1}^n σ ([ 1 xi^T ] β1) ( 1 − σ ([ 1 xi^T ] β1) ) xi¹ [ 1 xi^T ]

  (for the Hessian we use the fact that σ′ = σ (1 − σ))
▶ the Hessian is the negated sum of positive semi-definite matrices with non-negative coefficients ⇒ it is negative semi-definite ⇒ ℓ(β1) is a concave function
▶ setting ∇ ℓ(β1) = 0 gives d + 1 non-linear equations in β1, hence the need for an iterative solver
Logistic regression for binary classification
Newton–Raphson's method (basically gradient ascent, with the step prescribed by the Hessian matrix):

    repeat:   β̂1 ←− β̂1 − [ ∇² ℓ(β̂1) ]⁻¹ ∇ ℓ(β̂1)    // assuming non-singular Hessian

Thm: if a maximum β̄1 is s.t. ∇² ℓ(β̄1) is non-singular, then β̄1 is the unique maximum and, for an initial β̂1 close enough to β̄1, the convergence to β̄1 is quadratic.
  (a non-singular Hessian at a maximum is negative definite, hence ℓ is strictly concave there and β̄1 is the unique maximum; close enough to β̄1, the Hessian at β̂1 is non-singular as well, hence the algorithm proceeds)
Logistic regression for binary classification
Degenerate cases (singular Hessian): on degenerate examples (such as our running one), applying the vanilla logistic regression may run into trouble ⇒ regularize the log-likelihood by an ℓp penalty.
▶ case p = 2 (Tikhonov; called "ridge" in the lecture on regression because it led to ridge linear regression): subtract λ ∥β1∥² from ℓ(β1), which changes the gradient and Hessian into

    ∇ ℓ(β1) = Σ_{i=1}^n ( zi − σ ([ 1 xi^T ] β1) ) xi¹ − 2λ β1

    ∇² ℓ(β1) = − Σ_{i=1}^n σ ([ 1 xi^T ] β1) ( 1 − σ ([ 1 xi^T ] β1) ) xi¹ [ 1 xi^T ] − 2λ I_{d+1}
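The change to the previous Newton sketch is mechanical; here is a hedged one-function sketch of the regularized step (the parameter name lam and the function name are mine):

```python
# One ridge-regularized Newton step: same as before, plus the -2*lam*beta1 term in the
# gradient and the -2*lam*I term in the Hessian (lam > 0), which makes the Hessian
# negative definite and hence always non-singular.
import numpy as np

def newton_step_ridge(X1, z, beta1, lam):
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta1)))
    grad = X1.T @ (z - p) - 2.0 * lam * beta1
    hess = -(X1 * (p * (1 - p))[:, None]).T @ X1 - 2.0 * lam * np.eye(X1.shape[1])
    return beta1 - np.linalg.solve(hess, grad)
```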
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class logistic regression
Log-linear model for the posterior probability ratios (as in the binary case, we choose a reference class, say y = κ, and regress the other classes against it):

    ln [ P(Y = y | X = x) / P(Y = κ | X = x) ] = [ 1 x^T ] βy    for y = 1, · · · , κ − 1

▶ parameter matrix B := [ β1 ··· βκ−1 ] ∈ R^{(d+1)×(κ−1)}

Resulting discriminant functions (a generalized sigmoid, known as the softmax):

    δy(x) := P(Y = y | X = x) = exp ([ 1 x^T ] βy) / ( 1 + Σ_{z<κ} exp ([ 1 x^T ] βz) ) ∈ [0, 1]    for y = 1, · · · , κ − 1

    δκ(x) := P(Y = κ | X = x) = 1 / ( 1 + Σ_{z<κ} exp ([ 1 x^T ] βz) ) = 1 − Σ_{y<κ} δy(x)

  (this forces, again, the discriminant functions δy(x) to sum up to 1)

▶ estimate B by maximum likelihood and Newton–Raphson's algorithm
  (the expressions for the objective function and for the iteration steps are more complicated than in the binary case)
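A minimal sketch of these multi-class discriminant functions, with class κ as the reference class (fitting B by maximum likelihood is omitted; names are mine, not the slides'):

```python
# Softmax discriminant functions with reference class kappa:
# B is the (d+1) x (kappa-1) parameter matrix [beta_1 ... beta_{kappa-1}].
import numpy as np

def discriminants(B, x):
    """Return [delta_1(x), ..., delta_kappa(x)] for a single observation x."""
    x1 = np.concatenate([[1.0], x])                    # augmented observation [1 x^T]
    scores = np.exp(x1 @ B)                            # exp([1 x^T] beta_y), y < kappa
    denom = 1.0 + scores.sum()
    return np.concatenate([scores, [1.0]]) / denom     # sums to 1 by construction

def predict(B, x):
    return int(np.argmax(discriminants(B, x))) + 1     # labels are 1-based
```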
Multi-class logistic regression
Back to our running experiment: [Figure: decision regions obtained by multi-class logistic regression]
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Support Vector Machines (SVM)
Principle: explicitly construct the 'best' hyperplanes separating the various classes.
▶ the hyperplanes that maximize the margins (closest distances to data points)

Hyperplane equation: x^T β − β0 = 0
▶ parameters: β ∈ Rd \ {0}, β0 ∈ R
▶ β is normal to the hyperplane
▶ β0 / ∥β∥ is the shift from the origin along β (2-norm)
▶ convention: fix 1/∥β∥ to be the margin
  (there is one degree of freedom in the hyperplane equation, since its solution set is invariant under rescaling of (β, β0))
  ⇒ the slab boundaries have equations x^T β − β0 = ±1
  (this is another consequence of our convention that 1/∥β∥ is set to be the margin)
[Figure: the hyperplane x^T β − β0 = 0, its normal β, the offset β0/∥β∥ from the origin, and the slab boundaries x^T β − β0 = ±1 at distance 1/∥β∥]
Support Vector Machines (SVM)
Optimization problem (linearly separable classes, labels yi ∈ {−1, 1}):

    β̂, β̂0 := argmin_{β,β0} ∥β∥²   (maximize margin)
    subject to:   xi^T β − β0 ≥ 1   ∀i s.t. yi = 1
                  xi^T β − β0 ≤ −1  ∀i s.t. yi = −1
    (leave data points outside the slab, on the correct side)

  (the constraint yi (xi^T β − β0) ≥ 1 ∀i = 1, · · · , n is equivalent to the previous ones)

▶ quadratic programming problem (with a positive definite quadratic form)
  (such problems have quadratic objective functions and linear constraints — equalities or inequalities)
▶ classifier: f̂(x) = sign (x^T β̂ − β̂0)
  (the class/label of a query point x is determined by the side of the fitted hyperplane on which it lies)
[Figure: labeled data points (xi, 1) and (xi, −1), the maximum-margin hyperplane x^T β̂ − β̂0 = 0, and the slab boundaries x^T β̂ − β̂0 = ±1 at distance 1/∥β̂∥]
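A minimal sketch of this maximum-margin classifier using scikit-learn: SVC solves the soft-margin problem, and a very large C makes constraint violations so costly that, on linearly separable data, the solution essentially coincides with the hard-margin one (an approximation, not the exact hard-margin QP; the toy data are made up).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.5], [3.0, 3.0], [4.0, 3.5]])
y = np.array([-1, -1, 1, 1])                            # two linearly separable classes

clf = SVC(kernel="linear", C=1e10).fit(X, y)
beta, beta0 = clf.coef_.ravel(), -clf.intercept_[0]     # hyperplane x^T beta - beta0 = 0
print("margin =", 1.0 / np.linalg.norm(beta))           # 1/||beta|| per the slides' convention
print(clf.predict([[0.5, 0.2], [3.5, 3.2]]))            # sign(x^T beta - beta0)
```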
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
When classes are not linearly separable
In such cases, the previous problem with hard constraints has no solution ⇒ we must relax it.

Hinge loss:   max { 0, 1 − yi (xi^T β − β0) }
  (it is zero for observations lying on the correct side of the slab, i.e. satisfying the previous constraints)

Relaxed optimization (soft margin):

    β̂, β̂0 := argmin_{β,β0}  (1/n) Σ_{i=1}^n max { 0, 1 − yi (xi^T β − β0) }  +  λ ∥β∥²
                                       (minimize mean loss)                    (maximize margin)

▶ λ > 0 is a mixing (trade-off) parameter
  (minimizing the mean loss means trying to satisfy the constraints as best as possible; this loss term competes with the margin term)
▶ when the classes are linearly separable, the problem with hard constraints is recovered by taking λ > 0 small enough
  (indeed, for λ small enough, the second term in the functional becomes negligible)
[Figure: a slab that cannot separate the data; the dashed segments from points (xi, ±1) on the wrong side to the slab boundary have length ∝ loss]
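A minimal sketch (not from the slides) minimizing this relaxed objective by subgradient descent; in practice one would rather use a QP solver or scikit-learn's LinearSVC, but the sketch makes the two competing terms explicit.

```python
# Subgradient descent on (1/n) sum_i max{0, 1 - y_i (x_i^T beta - beta0)} + lam * ||beta||^2.
import numpy as np

def soft_margin_svm(X, y, lam=0.01, lr=0.01, n_iter=2000):
    """X: (n, d) observations, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta - beta0)
        active = margins < 1                              # points with nonzero hinge loss
        # subgradient of the mean hinge loss plus the regularization term
        g_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
        g_beta0 = y[active].sum() / n
        beta, beta0 = beta - lr * g_beta, beta0 - lr * g_beta0
    return beta, beta0
```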
When classes are not linearly separable
Conversion to a quadratic program:
▶ slack variables:   ξi := max { 0, 1 − yi (xi^T β − β0) }   (ξi measures the loss on the i-th constraint)
▶ substitution:

    β̂, β̂0, (ξ̂i)_{i=1}^n := argmin_{β, β0, (ξi)_{i=1}^n}  (1/n) Σ_{i=1}^n ξi + λ ∥β∥²
    subject to:   ∀i   ξi ≥ 0   and   yi (xi^T β − β0) ≥ 1 − ξi

  (this is a quadratic program in the unknowns β, β0, (ξi)_{i=1}^n, with a positive-definite quadratic term)
  (we only infringe the definition of ξi in order to remove the max from the constraints: we merely require ξi to be at least the hinge loss; at the optimum the inequality turns into an equality, ⇝ ξ̂i = max { 0, 1 − yi (xi^T β̂ − β̂0) } ∀i, because each ξi can be optimized independently of the others once β, β0 have been fixed)
When classes are not linearly separable
Interpretation of the slacks at the optimum:
▶ ξ̂i > 0: xi is on the wrong side of the slab boundary
  (indeed, in that case we have 1 − yi (xi^T β̂ − β̂0) = ξ̂i > 0)
▶ ξ̂i = 0:
  • yi (xi^T β̂ − β̂0) = 1: xi is on the slab boundary
  • yi (xi^T β̂ − β̂0) > 1: xi is on the correct side
  (the vectors that count to define the slab — the support vectors — are those achieving the equality yi (xi^T β̂ − β̂0) = 1)
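A short sketch (assumptions and names mine) recovering the slack values and the three cases above from a soft-margin linear SVM fitted with scikit-learn:

```python
# xi_i = max{0, 1 - y_i (x_i^T beta - beta0)}, computed from the fitted decision function.
import numpy as np
from sklearn.svm import SVC

def slack_report(X, y, C=1.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)          # y in {-1, +1}
    scores = clf.decision_function(X)                  # x_i^T beta - beta0
    xi = np.maximum(0.0, 1.0 - y * scores)
    on_boundary = np.isclose(y * scores, 1.0)          # xi = 0, on the slab boundary
    violators = xi > 1e-8                              # wrong side of the slab boundary
    return xi, on_boundary, violators
```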
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class SVM
Principle: convert the multi-class problem into multiple binary problems.
  (this is because SVM is essentially tied to binary classification)

▶ One-vs-all:
  • train 1 classifier (β̂^y, β̂0^y) for each class y = 1, · · · , κ, to discriminate class y (assigned label 1) from the rest of the data (assigned label −1)
  • assign each new observation x ∈ Rd to the class argmax_{y=1,··· ,κ} x^T β̂^y − β̂0^y
    (the class whose corresponding classifier gives the highest score wins the bet for x)

▶ One-vs-one:
  • train 1 classifier (β̂^{y,y′}, β̂0^{y,y′}) for each pair of classes y ≠ y′ ∈ {1, · · · , κ}, to discriminate y from y′
    (thus there are κ(κ−1)/2 binary classifiers, i.e. a quadratic quantity in the number of classes; here, discrimination is among the subpopulation spanned by the observations with labels y or y′)
Outline
• Reminder about supervised classification
• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Kernel SVM
[Figure: SVM run directly in the data space — the linear classifier in the data space performs poorly on this example]

Map the data to a feature space via Φ : x ↦ [ x x² ] and run the SVM there:
[Figure: the same data after applying Φ; the classes become linearly separable, and the SVM in feature space separates them]
Kernel SVM
Quadratic program (hard margin / no slack — the soft-margin case is similar):

    argmin_{β,β0} ∥β∥²   subject to   yi (Φ(xi)^T β − β0) ≥ 1   ∀i = 1, · · · , n

  (note that β now lives in the feature space, which may be infinite-dimensional)

Representer Thm  ⟹  β̂ = Σ_{i=1}^n αi yi Φ(xi) = Σ_{i=1}^n αi yi k(xi, ·)

▶ substituting this linear combination for β yields a new quadratic program in the unknowns α, β0:

    argmin_{α,β0}  Σ_{i=1}^n Σ_{j=1}^n αi yi k(xi, xj) αj yj    subj. to    yi ( Σ_{j=1}^n αj yj k(xi, xj) − β0 ) ≥ 1   ∀i
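A minimal sketch contrasting the explicit feature map Φ(x) = [ x x² ] from the figure with the kernel trick, on 1-D toy data made up for illustration (the linear SVM in feature space and the degree-2 polynomial-kernel SVM play the same role):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
x = rng.uniform(-1.5, 1.5, size=200)
y = np.where(np.abs(x) > 0.75, 1, -1)          # not linearly separable in x alone

# (a) explicit feature map: linear SVM in the feature space [x, x^2]
phi = np.column_stack([x, x**2])
svm_explicit = SVC(kernel="linear").fit(phi, y)

# (b) kernel trick: degree-2 polynomial kernel, no explicit mapping needed
svm_kernel = SVC(kernel="poly", degree=2).fit(x[:, None], y)

print(svm_explicit.score(phi, y), svm_kernel.score(x[:, None], y))
```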
Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2

▶ SVM with deg.-4 polynomial kernel: error rate ≈ 24.5%
▶ SVM with Gaussian kernel: error rate ≈ 21.8%
  (the dashed lines are the slab's boundaries, as before; the decision boundary of the kernel SVM is compared to that of the Bayes classifier; the performance of the Gaussian kernel is particularly good here)

▶ logistic regression with deg.-4 polynomial kernel: error rate ≈ 26.3%
▶ logistic regression with Gaussian kernel: error rate ≈ 22.1%
  (the logistic regression in feature space is regularized; again, the performance of the Gaussian kernel is particularly good)
What you should know
• Two types of linear approaches for classification
• Logistic regression:
- generalized linear model
- fitting by likelihood maximization & Newton-Raphson’s method,
convergence guarantees
- degenerate cases & regularization
- extension to multi-class