
CSC 43042 EP — Algorithms for Data Analysis in Python

Linear Models for Classification

— Steve Oudot
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Supervised learning (classification)
Input: n observations + responses (x1, y1), · · · , (xn, yn) ∈ X × Y

Goal: build a predictor f : X → Y from (x1, y1), · · · , (xn, yn)
whose mean prediction error on new query observations is minimal

(running example: X = {images}, Y = {labels})

classification: Y discrete
Statistical framework
Hyp: the observations are drawn i.i.d. from a random variable X with values in X = Rd,
the responses from a random variable Y with values in Y = {1, · · · , κ} (classification)

→ the joint distribution Pr(X, Y) encodes the complexity of the problem
(X and Y are the marginals of the joint distribution; in fact the complexity is encoded in the conditional distribution of Y given X)

  X, Y perfectly dependent ⇒ ∃ perfect predictor
  X, Y imperfectly dependent ⇒ ∄ perfect predictor
(figure: conditional distribution of Y over the classes 1, · · · , κ, in both cases)

Prediction error is measured by a loss function L : Y × Y → R

→ goal: minimize the risk (expected prediction error): E(X,Y) [ L(Y, f(X)) ]

→ in practice: minimize the empirical risk, i.e. the mean error on the input dataset:
  (1/n) Σ_{i=1}^{n} L(yi, f(xi))
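Illustration (not from the original slides): with the 0-1 loss, the empirical risk is simply the misclassification rate on the input dataset. A minimal NumPy sketch, assuming y_true holds the responses and y_pred the predictor's outputs:

import numpy as np

def empirical_risk_01(y_true, y_pred):
    # mean 0-1 loss (misclassification rate) over the input dataset
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# toy usage
print(empirical_risk_01([1, 2, 2, 1, 3], [1, 2, 1, 1, 3]))  # 0.2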
Classification with 0-1 loss
Hyp: xi ∼iid X with values in X = Rd, yi ∼iid Y with values in Y = {1, · · · , κ} (classification)

Risk: R(f) = E(X,Y) [ 1_{Y ≠ f(X)} ]
           = E_X [ E_{(Y|X)} [ 1_{Y ≠ f(X)} | X ] ]   (conditioning on X)

→ minimize the risk pointwise (i.e. independently for each value x of X):

  f(x) := argmin_{y ∈ {1,··· ,κ}} E_{(Y|X)} [ 1_{Y ≠ y} | X = x ]
  (here y is our guess for f(x): we take the y that minimizes the expected conditional loss)

        = argmin_{y ∈ {1,··· ,κ}} Σ_{r=1}^{κ} 1_{r ≠ y} P(Y = r | X = x)
  (Y is a categorical variable, so the sum over r is just the expression of the expectation)

        = argmin_{y ∈ {1,··· ,κ}} 1 − P(Y = y | X = x) = argmax_{y ∈ {1,··· ,κ}} P(Y = y | X = x)
  (this last step is where the choice of the 0-1 loss comes into play)

(best prediction at x maximizes the posterior probability P(Y | X): Bayes classifier)
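Illustration (ours, not from the slides): given estimated posterior probabilities at a set of query points, the 0-1-optimal plug-in rule simply picks the class with the largest posterior. A minimal NumPy sketch; the posteriors array is a hypothetical estimate, one row per query point, one column per class:

import numpy as np

def bayes_predict(posteriors):
    # class index (1-based, as in the slides) maximizing P(Y = y | X = x), row by row
    return np.argmax(np.asarray(posteriors), axis=1) + 1

posteriors = np.array([[0.2, 0.5, 0.3],    # hypothetical posteriors, kappa = 3 classes
                       [0.7, 0.1, 0.2]])
print(bayes_predict(posteriors))           # [2 1]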
Advantages and drawbacks of k-NN in practice
(annotations in brackets: the corresponding properties of linear methods, introduced next)

• high flexibility:
  ▶ little prior on the fitting model   [linear methods: prior = linear model]
  ▶ method based on distances / (dis-)similarities (no need for coordinates)
    [linear methods: require a linear space, hence coordinates]

• easiness of implementation:
  ▶ only a few lines of code for NN-search via linear scan
    (the same applies to more advanced sublinear methods, using the right libraries, e.g. ANN, LSH)
    [linear methods: easy and efficient prediction (dot-product)]

• algorithmic cost of prediction:   [linear methods: algorithmic cost put on pre-training]
  ▶ linear scan in Θ(nd)
  ▶ sublinear methods become (close to) linear in high dimensions

• slow convergence in high dimensions (curse of dimensionality):
  ▶ asymptotic regime often not attained in practice ⇝ need to select k
    [linear methods: no hyper-parameter]
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear methods for classification
Response variable Y is discrete
▶ consider the fibers of the predictor f : f −1({1}), · · · , f −1({κ})
▶ a linear classifier produces linear decision boundaries
  (cf. the course on linear regression for what it means for a regression method to be linear; here, "linear" refers to the decision boundaries)

(figure: colored dots are the input observations with responses, colored areas are the fibers of f; one example with linear decision boundaries, one with nonlinear ones)

2 types of approaches:
▶ model the posterior probability (discriminant function δy) for each class y,
  then classify by taking argmax_{y ∈ {1,··· ,κ}} δy(x)
  (following the Bayes classifier, one tries to model the posterior probability P(Y = y | X = x))
  e.g. linear / logistic regression, LDA
  (figure: result of LDA on the Iris dataset: the dataset with the decision boundaries, and the discriminant functions δ1, δ2, δ3 with the fibers f −1({1}), f −1({2}), f −1({3}))

▶ model the separating hyperplane(s) directly
  e.g. Support Vector Machines, perceptron
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Linear regression for classification
Use a linear model for the discriminant functions (yields linear decision boundaries):

▶ ∀y ∈ {1, · · · , κ}, δy(x) := [ 1 xT ] βy for some parameter vector βy ∈ Rd+1

▶ matrix of parameters: B := [ β1 ··· βκ ] ∈ R(d+1)×κ

Fit the model by least squares: B̂ := argmin_B Σ_{i=1}^{n} ∥Z(yi) − [ 1 xiT ] B∥²

where Z(yi) = [ 1_{yi=1} ··· 1_{yi=κ} ] is the row indicator vector of response yi
(this vector has exactly one nonzero entry, the one at index yi; the motivation is that we regress against the corresponding random row indicator vector Z(Y) := [ 1_{Y=1} ··· 1_{Y=κ} ] ∈ R1×κ, which is fully dependent on Y)

▶ stacking the indicator vectors gives the response matrix Z := [ Z(y1) ; ··· ; Z(yn) ] ∈ Rn×κ
  (a binary matrix with exactly one nonzero entry per row), so that in matrix form:

  B̂ := argmin_B ∥Z − X B∥²_F
  (Frobenius norm: ∥M∥²_F := Σ_{i,j} Mij²; it acts as the ℓ2-norm on the space of matrices)

Unique minimum (assuming X has full column rank): B̂ = (XT X)−1 XT Z
(otherwise one can regularize by some ℓp norm, e.g. ℓ2, as in ordinary linear regression, to get a unique minimum)

▶ row vector of estimated discriminant functions: δ̂(x) := [ δ̂1(x) ··· δ̂κ(x) ] = [ 1 xT ] B̂
  (each column of B̂ is the fitted parameter vector β̂y, hence the y-th entry of δ̂(x) is δ̂y(x) = [ 1 xT ] β̂y)

▶ classifier: f̂(x) := argmax_{y ∈ {1,··· ,κ}} δ̂y(x)
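Illustration (ours, not from the slides): a minimal NumPy sketch of this textbook approach, assuming y is an integer array with labels in {1, · · · , κ} and that the design matrix has full column rank once the intercept column is added:

import numpy as np

def fit_linear_classifier(X, y, kappa):
    # least-squares fit of the discriminant functions delta_y(x) = [1 x^T] beta_y
    Xt = np.hstack([np.ones((len(X), 1)), X])        # prepend the intercept column
    Z = np.eye(kappa)[np.asarray(y) - 1]             # n x kappa indicator matrix of responses
    B_hat, *_ = np.linalg.lstsq(Xt, Z, rcond=None)   # argmin_B || Z - Xt B ||_F^2
    return B_hat                                     # shape (d+1, kappa)

def predict_linear_classifier(B_hat, X):
    Xt = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(Xt @ B_hat, axis=1) + 1         # classify by argmax of the discriminants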
Linear regression for classification
Experimental results: n = 100 + 100 (mixture of 2 Gaussians), d = 2

  linear regression: error rate ≈ 34%        Bayes classifier: error rate ≈ 21%

(the result looks reasonable, as the two classes are not linearly separable)
Linear regression for classification
Experimental results: n = 100 + 100 + 100, d = 2, Bayes error rate ≈ 2.5%

  linear regression: error rate ≈ 33%

(surprising, as the classes are indeed linearly separable; for comparison, the Bayes error rate is ≈ 2.5%)

(figure, right: cross-section of the fitted discriminant functions δ̂1, δ̂2, δ̂3 along the diagonal; this plot explains the phenomenon: the discriminant function of class 2 never dominates the other two)
Linear regression for classification
What is happening:

▶ the δ̂y(x) supposedly model the posterior probabilities P(Y = y | X = x)
  (cf. Bayes classifier)

▶ they sum up to 1 (in the centered model, i.e. assuming that the set of observations is centered in Rd, mean = 0)

▶ they may fall outside [0, 1]

(figure: cross-section of δ̂1, δ̂2, δ̂3)
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Logistic regression for binary classification
Generalized linear model for discriminant functions:

▶ δ1(x) := σ ([ 1 xT ] β1) for some parameter β1 ∈ Rd+1,
  where σ(t) := exp(t) / (1 + exp(t)) = 1 / (1 + exp(−t))   (logistic sigmoid function)
  (forces δ1(x) ∈ [0, 1])

▶ δ2(x) := 1 − δ1(x) = σ (− [ 1 xT ] β1)
  (forces δ1(x) + δ2(x) = 1, hence δ2(x) ∈ [0, 1]; also makes the problem symmetric in the labels 1, 2)

Properties of the logistic sigmoid:
("sigmoid" means "S-shaped" (and bounded) → the logistic sigmoid is but one example)

▶ monotonic homeomorphism R → (0, 1)

▶ σ−1(u) = ln ( u / (1 − u) )   (logit function)

▶ σ(t) + σ(−t) = 1

▶ σ′(t) = σ(t) (1 − σ(t))
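Illustration (ours, not from the slides): these identities are easy to check numerically. A small sketch, with our own function names:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(u):                                   # inverse of the sigmoid
    return np.log(u / (1.0 - u))

t = np.linspace(-5, 5, 11)
assert np.allclose(sigmoid(t) + sigmoid(-t), 1.0)       # sigma(t) + sigma(-t) = 1
assert np.allclose(logit(sigmoid(t)), t)                # sigma^{-1} is the logit
num_deriv = (sigmoid(t + 1e-6) - sigmoid(t - 1e-6)) / 2e-6
assert np.allclose(num_deriv, sigmoid(t) * (1 - sigmoid(t)), atol=1e-6)   # sigma' = sigma (1 - sigma)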


Logistic regression for binary classification
Properties of the regression and associated classifier:
(once again, f̂(x) is defined as the argmax of the discriminant functions)

▶ the model for δy (y = 1, 2) assumes P(Y = y | X = x) follows precisely the logistic distribution:

  P(Y = 1 | X = x) = δ1(x) = σ ([ 1 xT ] β1)
  P(Y = 2 | X = x) = 1 − δ1(x) = δ2(x) = σ (− [ 1 xT ] β1)

▶ the model makes the posterior probability ratio log-linear:
  (this means that, fundamentally, we fit one of the discriminant functions, the other one being determined by it)

  ln [ P(Y = 1 | X = x) / P(Y = 2 | X = x) ] = ln [ σ ([ 1 xT ] β1) / σ (− [ 1 xT ] β1) ]
    = ln [ (1 + exp ([ 1 xT ] β1)) / (1 + exp (− [ 1 xT ] β1)) ] = [ 1 xT ] β1
  (for the last step, factorize the numerator by exp ([ 1 xT ] β1))

▶ yields a linear decision boundary: δ1(x) = δ2(x) ⇐⇒ δ1(x) = 1/2
                                                  ⇐⇒ [ 1 xT ] β1 = σ−1(1/2), i.e. [ 1 xT ] β1 = 0
Logistic regression for binary classification
Model fitting by maximum likelihood:
(maximum likelihood estimation for parametrized models was already seen in a previous lecture)

  ∀i, L(yi ; xi, β1) := P(Y = yi | X = xi ; β1)

  ⇒ L((yi)ni=1 ; (xi)ni=1, β1) = Π_{i=1}^{n} P(Y = yi | X = xi ; β1)   (independence)

  ⇒ ln L((yi)ni=1 ; (xi)ni=1, β1) = Σ_{i=1}^{n} ln P(Y = yi | X = xi ; β1)

Change of variable: Z := Y mod 2 ∈ {0, 1},  ∀i, zi := yi mod 2 ∈ {0, 1}
(this is just to have a response taking values in {0, 1} instead of {1, 2}, which is more convenient here)

  ⇒ ∀i, ln P(Y = yi | X = xi ; β1) = ln P(Z = zi | X = xi ; β1)
       = zi ln P(Z = 1 | X = xi ; β1) + (1 − zi) ln P(Z = 0 | X = xi ; β1)
       = zi ln σ ([ 1 xiT ] β1) + (1 − zi) ln σ (− [ 1 xiT ] β1)
       = zi [ 1 xiT ] β1 − ln (1 + exp ([ 1 xiT ] β1))
  (the last line follows from the one above after a few lines of calculation, developing the expression of σ with t := [ 1 xiT ] β1)
Logistic regression for binary classification
Model fitting by maximum likelihood:

  β̂1 := argmax_{β1} ℓ(β1) where ℓ(β1) := ln L((yi)ni=1 ; (xi)ni=1, β1)
  (the estimator of β1 maximizes the log-likelihood, renamed ℓ(β1) for simplicity)

  ℓ(β1) = Σ_{i=1}^{n} [ zi [ 1 xiT ] β1 − ln (1 + exp ([ 1 xiT ] β1)) ]
  (note that this functional has the same shape as one obtained by minimizing an empirical risk)

Gradient vector at β1:  ∇ℓ(β1) = Σ_{i=1}^{n} (zi − σ ([ 1 xiT ] β1)) [ 1 xiT ]T

Hessian matrix at β1:   ∇2ℓ(β1) = − Σ_{i=1}^{n} σ ([ 1 xiT ] β1) (1 − σ ([ 1 xiT ] β1)) [ 1 xiT ]T [ 1 xiT ]
(uses the fact that σ′ = σ(1 − σ))

→ the Hessian matrix is negative semi-definite (its negation is a sum of positive semi-definite matrices with non-negative coefficients) ⇒ ℓ(β1) is a concave function, so only the global maxima annihilate the gradient

▶ choose β̂1 arbitrarily in the solution set of ∇ℓ(β1) = 0
  (d + 1 non-linear equations in β1)
Logistic regression for binary classification
Model fitting by maximum likelihood:

  β̂1 := argmax_{β1} ℓ(β1) = argmax_{β1} Σ_{i=1}^{n} [ zi [ 1 xiT ] β1 − ln (1 + exp ([ 1 xiT ] β1)) ]

Newton-Raphson’s method:
(basically gradient descent, or rather ascent, with the step prescribed by the Hessian matrix)

  init: β̂1 ←− 0   // or some arbitrary vector (the null vector is a good first guess in general)
  repeat:
    β̂1 ←− β̂1 − (∇2ℓ(β̂1))−1 ∇ℓ(β̂1)   // assuming non-singular Hessian
  until convergence   // requires a convergence threshold

Thm: if there is a maximum β̄1 such that ∇2ℓ(β̄1) is non-singular, then β̄1 is the unique maximum and, for an initial β̂1 close enough to β̄1, the convergence to β̄1 is quadratic.
(a non-singular Hessian at the maximum means a negative definite Hessian, hence ℓ is strictly concave; moreover, close enough to β̄1 the Hessian at β̂1 is non-singular as well, hence the algorithm proceeds)
Logistic regression for binary classification
Degenerate cases (singular Hessian):
(in such cases, applying the vanilla logistic regression may lead to numerical problems)

▶ regularized logistic regression:

  β̂1 := argmax_{β1} Σ_{i=1}^{n} [ zi [ 1 xiT ] β1 − ln (1 + exp ([ 1 xiT ] β1)) ] − λ ∥β1∥p^p
  (note the minus sign in front of the regularization term: we are maximizing a quantity, as opposed to linear regression)

▶ case p = 2 (Tikhonov):
  (in the lecture on regression we called it "ridge" because it led to ridge linear regression)

  ∇ℓ(β1) = Σ_{i=1}^{n} (zi − σ ([ 1 xiT ] β1)) [ 1 xiT ]T − 2λ β1

  ∇2ℓ(β1) = − Σ_{i=1}^{n} σ ([ 1 xiT ] β1) (1 − σ ([ 1 xiT ] β1)) [ 1 xiT ]T [ 1 xiT ] − 2λ Id+1

  → negative definite ⇒ strictly concave functional

▶ apply Newton-Raphson, with guaranteed quadratic convergence
  (small values of λ may lead to numerical instabilities in practice though)
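Illustration (ours, not from the slides): a compact NumPy sketch of the fitting procedure just described, i.e. Newton-Raphson on the log-likelihood with an optional Tikhonov term λ∥β1∥²; setting lam = 0 gives the vanilla version:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_newton(X, z, lam=0.0, tol=1e-8, max_iter=100):
    # X: (n, d) observations; z: (n,) responses in {0, 1}; lam: Tikhonov coefficient
    n, d = X.shape
    Xt = np.hstack([np.ones((n, 1)), X])      # rows [1 x_i^T]
    beta = np.zeros(d + 1)                    # init: null vector
    for _ in range(max_iter):
        p = sigmoid(Xt @ beta)                # sigma([1 x_i^T] beta)
        grad = Xt.T @ (z - p) - 2 * lam * beta
        hess = -(Xt.T * (p * (1 - p))) @ Xt - 2 * lam * np.eye(d + 1)
        step = np.linalg.solve(hess, grad)    # assumes a non-singular Hessian
        beta -= step                          # beta <- beta - H^{-1} grad
        if np.linalg.norm(step) < tol:        # convergence threshold
            break
    return beta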
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class logistic regression
Log-linear model for posterior probability ratios:
(as in the binary case, we choose a reference class, say κ, and regress the other classes against it)

  ln [ P(Y = y | X = x) / P(Y = κ | X = x) ] = [ 1 xT ] βy   for y = 1, · · · , κ − 1

  parameter matrix B := [ β1 ··· βκ−1 ] ∈ R(d+1)×(κ−1)

▶ generalized linear model for discriminant functions:

  δy(x) := P(Y = y | X = x) = exp ([ 1 xT ] βy) / (1 + Σ_{z<κ} exp ([ 1 xT ] βz))   for y = 1, · · · , κ − 1
  (generalized sigmoid, a.k.a. softmax; forces δy(x) ∈ [0, 1])

  δκ(x) := P(Y = κ | X = x) = 1 / (1 + Σ_{z<κ} exp ([ 1 xT ] βz)) = 1 − Σ_{y<κ} δy(x)
  (this forces, again, the discriminant functions to sum up to 1, although not as directly as in the binary case)

▶ estimate B by maximum likelihood and Newton-Raphson’s algorithm
  (the expressions of the objective function and of the iteration steps are more complicated than in the binary case)
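Illustration (ours, not from the slides): a small sketch of the softmax discriminant functions with class κ as the reference; in practice one would also subtract the maximum score before exponentiating, for numerical stability:

import numpy as np

def softmax_discriminants(x, B):
    # B: (d+1, kappa-1) parameter matrix [beta_1 ... beta_{kappa-1}]
    # returns [delta_1(x), ..., delta_kappa(x)], which sums to 1
    xt = np.concatenate(([1.0], x))            # [1 x^T]
    scores = np.exp(xt @ B)                    # exp([1 x^T] beta_y) for y < kappa
    denom = 1.0 + scores.sum()                 # 1 + sum_{z < kappa} exp([1 x^T] beta_z)
    return np.concatenate((scores / denom, [1.0 / denom]))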
Multi-class logistic regression
Back to our running experiment: n = 100 + 100 + 100, d = 2, Bayes error rate ≈ 2.5%

  linear regression: error rate ≈ 33%        logistic regression: error rate ≈ 3%
Logistic regression with Scikit-Learn
Usage (as of version 1.6.1, the latest stable version of the library when this slide was created):

class sklearn.linear_model.LogisticRegression(
    fit_intercept=True,    # intercept forced at 0 if False
    penalty='l2',          # regularization type
    C=1.0,                 # inverse regularization coefficient
    random_state=None,     # fix random seed with int
    ...                    # other arguments (see documentation)
)

>>> from sklearn.linear_model import LogisticRegression

>>> clf = LogisticRegression()       # instantiate classifier

>>> clf.fit(X_train, y_train)        # train

>>> y = clf.predict(X_test)          # predict class labels

>>> P = clf.predict_proba(X_test)    # predict posterior probabilities, one row per data point
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Support Vector Machines (SVM)
Principle: explicitly construct the ‘best’ hyperplanes separating the various classes.

▶ the hyperplanes that maximize the margins (closest distances to data points)

Hyperplane equation: xT β − β0 = 0

▶ parameters: β ∈ Rd \ {0}, β0 ∈ R

▶ β is normal to the hyperplane

▶ β0 / ∥β∥ is the shift from the origin along β   (∥·∥ = 2-norm)

▶ fix 1/∥β∥ to be the margin
  (there is one degree of freedom in the hyperplane equation, since its solution set is invariant under rescaling of (β, β0))

  ⇒ maximizing the margin is equivalent to minimizing ∥β∥ or ∥β∥2

  ⇒ the slab boundaries have equations xT β − β0 = ±1
  (another consequence of our convention that 1/∥β∥ is the margin)

Binary classification case (Y = {−1, 1}):

  β̂, β̂0 := argmin_{β,β0} ∥β∥2   (maximize margin)
  subject to:
    xiT β − β0 ≥ 1    ∀i s.t. yi = 1
    xiT β − β0 ≤ −1   ∀i s.t. yi = −1
  (leave the data points outside the slab, on the correct side; the constraint yi (xiT β − β0) ≥ 1 ∀i = 1, · · · , n is equivalent to the previous ones)

▶ quadratic programming problem (with positive definite quadratic form)
  ⇝ solvers: ellipsoid, interior point, etc.
  (such problems have quadratic objective functions and linear constraints, equalities or inequalities)

▶ classifier: f̂(x) = sign (xT β̂ − β̂0)
  (the class/label of a query point x is determined by the side of the hyperplane it falls on)
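Illustration (ours, not from the slides): the hard-margin problem can be handed almost verbatim to a generic convex solver. A minimal sketch, assuming the cvxpy package is available (it is not part of the course material) and that the classes are indeed linearly separable:

import numpy as np
import cvxpy as cp

def fit_hard_margin_svm(X, y):
    # X: (n, d) observations; y: (n,) labels in {-1, +1}
    n, d = X.shape
    beta = cp.Variable(d)
    beta0 = cp.Variable()
    constraints = [cp.multiply(y, X @ beta - beta0) >= 1]   # y_i (x_i^T beta - beta_0) >= 1
    cp.Problem(cp.Minimize(cp.sum_squares(beta)), constraints).solve()
    return beta.value, beta0.value

def predict_svm(beta, beta0, X):
    return np.sign(X @ beta - beta0)                         # f_hat(x) = sign(x^T beta - beta_0)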
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
When classes are not linearly separable
(in such cases, the previous problem with hard constraints has no solution ⇒ we must relax it)

Hinge loss: max { 0, 1 − yi (xiT β − β0) }
(loss compared to a slab excluding observation xi; it is zero for observations lying on the correct side of the slab, i.e. satisfying the previous constraint)

Relaxed optimization (soft margin):

  β̂, β̂0 := argmin_{β,β0} (1/n) Σ_{i=1}^{n} max { 0, 1 − yi (xiT β − β0) } + λ ∥β∥2
                          (minimize mean loss)                              (maximize margin)

  λ > 0: mixing (trade-off) parameter, to be selected e.g. by cross-validation
  (the mean-loss term tries to satisfy the constraints as best as possible; it competes with the margin term)

▶ when the classes are linearly separable, we recover the problem with hard constraints by taking λ > 0 small enough
  (indeed, for λ small enough, the margin term in the functional becomes negligible compared to the loss term)

Conversion to a quadratic program:

▶ slack variables: ξi := max { 0, 1 − yi (xiT β − β0) }
  (ξi measures the loss on the i-th constraint)

▶ substitution:

  β̂, β̂0, (ξ̂i)ni=1 := argmin_{β, β0, (ξi)ni=1} (1/n) Σ_{i=1}^{n} ξi + λ ∥β∥2
  subject to: ∀i, ξi ≥ 0 and yi (xiT β − β0) ≥ 1 − ξi

  (this is again a quadratic program, now in the unknowns β, β0, (ξi)ni=1; the constraints only relax the definition of the ξi by removing the max; since each ξi can be optimized independently once β, β0 are fixed, one of its two inequalities becomes an equality at the optimum, so that ξ̂i = max { 0, 1 − yi (xiT β̂ − β̂0) } for all i)

Interpretation:

▶ ξ̂i > 0: xi is on the wrong side of the slab boundary
  (indeed, in that case 1 − yi (xiT β̂ − β̂0) = ξ̂i > 0)

▶ ξ̂i = 0:
  • yi (xiT β̂ − β̂0) = 1: xi is on the slab boundary → support vector
    (these are the vectors that count to define the slab: each equality yi (xiT β̂ − β̂0) = 1 constrains it)
  • yi (xiT β̂ − β̂0) > 1: xi is on the correct side, outside the slab
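Illustration (ours, not from the slides): instead of a quadratic-programming solver, the relaxed objective can also be minimized directly by subgradient descent; a simplified NumPy sketch:

import numpy as np

def fit_soft_margin_svm(X, y, lam=0.01, lr=0.1, n_iter=1000):
    # minimizes (1/n) sum_i max(0, 1 - y_i (x_i^T beta - beta_0)) + lam ||beta||^2
    # X: (n, d) observations; y: (n,) labels in {-1, +1}
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta - beta0)
        active = margins < 1                       # observations with nonzero hinge loss
        g_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
        g_beta0 = y[active].sum() / n
        beta -= lr * g_beta
        beta0 -= lr * g_beta0
    return beta, beta0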
Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Multi-class SVM
Principle: convert the multi-class problem into multiple binary problems.
(this is because SVM is essentially tied to binary classification)

▶ One-vs-all:
  • train 1 classifier (β̂ y, β̂0y) for each class y = 1, · · · , κ, to discriminate class y
    (assigned label 1) from the rest of the data (assigned label −1)
  • assign each new observation x ∈ Rd to the class argmax_{y=1,··· ,κ} xT β̂ y − β̂0y
    (the class whose classifier gives the highest score wins the bet for x)

▶ One-vs-one:
  • train 1 classifier (β̂ y,y′, β̂0y,y′) for each pair of classes y ̸= y′ ∈ {1, · · · , κ},
    to discriminate y from y′ in their joint subpopulation
    (that is κ(κ−1)/2 binary classifiers, a quadratic number in the number of classes; here, discrimination is within the subpopulation spanned by the observations with labels y or y′)
  • given a new observation x ∈ Rd, decide between y and y′ using sign (xT β̂ y,y′ − β̂0y,y′)
    for each pair of classes y ̸= y′, then assign x to the class with the highest number of positive answers.
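Illustration (ours, not from the slides): a sketch of the one-vs-all scheme, reusing any binary trainer that returns (β̂, β̂0), e.g. the soft-margin sketch above:

import numpy as np

def fit_one_vs_all(X, y, classes, fit_binary):
    # fit_binary(X, labels) must return (beta, beta0) for labels in {-1, +1}
    models = {}
    for c in classes:
        labels = np.where(y == c, 1.0, -1.0)     # class c vs. the rest
        models[c] = fit_binary(X, labels)
    return models

def predict_one_vs_all(models, X):
    classes = list(models.keys())
    scores = np.column_stack([X @ b - b0 for b, b0 in (models[c] for c in classes)])
    return np.asarray(classes)[np.argmax(scores, axis=1)]    # class with the highest score wins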
SVM with Scikit-Learn
Usage (as of version 1.6.1, the latest stable version of the library when this slide was created; there is also another class, sklearn.svm.LinearSVC, which does not use a kernel):

class sklearn.svm.SVC(
    kernel='rbf',          # kernel type ('linear' for no kernel)
    gamma='scale',         # inverse kernel bandwidth (if applicable)
    C=1.0,                 # inverse regularization coefficient
    random_state=None,     # fix random seed with int
    ...                    # other arguments (see documentation)
)

>>> from sklearn.svm import SVC

>>> clf = SVC()                      # instantiate classifier

>>> clf.fit(X_train, y_train)        # train

>>> clf.support_vectors_             # support vectors

>>> y = clf.predict(X_test)          # predict class labels
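(A follow-up note of ours, not from the slides: for multi-class problems, SVC implements the one-vs-one scheme internally; the shape of the returned decision values can be controlled as sketched below.)

>>> clf = SVC(kernel='linear', decision_function_shape='ovo')   # one-vs-one decision values
>>> clf.fit(X_train, y_train)
>>> clf.decision_function(X_test)    # one column per pair of classes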


Outline
• Reminder about supervised classification

• Principles of linear methods for classification

• Textbook case: linear regression for classification

• Logistic regression:
- binary
- multi-class
• Support Vector Machines (SVM):
- binary, linearly separable classes
- binary, non-linearly separable classes
- multi-class
• Non-linear classification using kernels
Kernel SVM
Idea: map the data to a feature space via a map Φ, run the SVM there, then pull the decision boundary back to the data space.

(figure: on a dataset that is not linearly separable, the linear SVM classifier in the data space performs poorly; after applying the feature map Φ : x 7→ [ x x2 ], the two classes become separable in the feature space, hence the support vectors and the linear decision boundary found there pull back to a nonlinear boundary in the data space)

Quadratic program (hard margin / no slack):
(we present the hard-margin case; the soft-margin case is similar)

  argmin_{β,β0} ∥β∥2 subject to yi (Φ(xi)T β − β0) ≥ 1 ∀i = 1, · · · , n

  (β is a vector in the feature space, which may be infinite-dimensional; Φ(xi)T β is an inner product in the feature space)

Representer Thm ⇒ β̂ = Σ_{i=1}^{n} αi yi Φ(xi) = Σ_{i=1}^{n} αi yi k(xi, ·)

▶ substituting this linear combination of the Φ(xi)’s for β yields a new quadratic program in (α, β0):

  argmin_{α,β0} Σ_{i=1}^{n} Σ_{j=1}^{n} αi yi k(xi, xj) αj yj
  subject to yi ( Σ_{j=1}^{n} αj yj k(xi, xj) − β0 ) ≥ 1 ∀i
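Illustration (ours, not from the slides): the resulting decision rule only needs kernel evaluations. A small sketch for a Gaussian (RBF) kernel, where alpha and beta0 are assumed to come from a solver for the quadratic program above:

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gram matrix k(a_i, b_j) = exp(-gamma ||a_i - b_j||^2)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_decision(X_train, y_train, alpha, beta0, X_query, gamma=1.0):
    # f_hat(x) = sign( sum_i alpha_i y_i k(x_i, x) - beta_0 )
    K = rbf_kernel(X_query, X_train, gamma)           # shape (n_query, n_train)
    return np.sign(K @ (alpha * y_train) - beta0)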
Experimental results
n = 100 + 100 (mixture of 2 Gaussians), d = 2
(the Bayes classifier and its error rate can be estimated accurately since we know the underlying distribution)

  Bayes classifier:                              error rate ≈ 21%
  linear regression:                             error rate ≈ 34%
  7-NN (k = 7 selected beforehand):              error rate ≈ 22.5%
  SVM with λ = 10−2:                             error rate ≈ 28.8%
  SVM with λ = 104:                              error rate ≈ 30%
  SVM with deg.-4 polynomial kernel:             error rate ≈ 24.5%
  SVM with Gaussian kernel:                      error rate ≈ 21.8%
  logistic reg. with deg.-4 polynomial kernel:   error rate ≈ 26.3%
  logistic reg. with Gaussian kernel:            error rate ≈ 22.1%

(small λ’s privilege the constraints, hence a small margin and a small number of support vectors, while large λ’s privilege the margin term; the dashed lines in the plots are the slab’s boundaries, as before; the logistic regressions in feature space are regularized, with λ and the kernel window size selected beforehand; in both cases the performance of the Gaussian kernel is particularly good here)
What you should know
• Two types of linear approaches for classification

• Logistic regression:
- generalized linear model
- fitting by likelihood maximization & Newton-Raphson’s method,
convergence guarantees
- degenerate cases & regularization
- extension to multi-class

• Support Vector Machines (SVM):


- margin maximization & quadratic programming problem
- hinge loss, soft margin,
relaxed quadratic programming problem, support vectors
- extension to multi-class
- kernel SVM
