Support Vector Machines
vs Logistic Regression
Kevin Swersky
University of Toronto CSC2515 Tutorial
Part of this tutorial is borrowed from Mark Schmidt’s excellent note on
structural SVMs: http://www.di.ens.fr/~mschmidt/Documents/ssvm.pdf
Logistic regression
• Assign probability to each outcome
P(y = 1|x) = σ(w^T x + b)
• Train to maximize likelihood
ℓ(w) = ∏_{n=1}^{N} σ(w^T x_n + b)^{y_n} (1 − σ(w^T x_n + b))^{1 − y_n}
• Linear decision boundary (with y being 0 or 1)
ŷ = I[w^T x + b ≥ 0]
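• A minimal NumPy sketch of this model and its likelihood (the arrays X, y and the parameters w, b are placeholders to be supplied):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(y = 1 | x) = sigma(w^T x + b) for each row of X
    return sigmoid(X @ w + b)

def neg_log_likelihood(X, y, w, b):
    # negative log of prod_n sigma(.)^{y_n} (1 - sigma(.))^{1 - y_n}
    p = predict_proba(X, w, b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def predict(X, w, b):
    # linear decision boundary: y_hat = I[w^T x + b >= 0]
    return (X @ w + b >= 0).astype(int)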
Support vector machines
[Figure: maximum-margin separating hyperplane with support vectors. Source: Wikipedia]
Support vector machines
• Enforce a margin of separation (here, y ∈ {0, 1})
(2y_n − 1) w^T x_n ≥ 1,  ∀ n = 1 . . . N
• Train to find the maximum margin
min_{w,b}  (1/2) ||w||²
s.t.  (2y_n − 1)(w^T x_n + b) ≥ 1,  n = 1 . . . N
• Linear decision boundary
ŷ = I[w^T x + b ≥ 0]
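• As a sketch, scikit-learn's SVC with a linear kernel and a large C approximates this hard-margin program and exposes the support vectors (the toy data below is made up):

import numpy as np
from sklearn.svm import SVC

# toy linearly separable data (hypothetical)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# a very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b
print(clf.support_vectors_)        # the points that define the margin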
Recap
• Logistic regression focuses on maximizing the
probability of the data. The farther the data lies from
the separating hyperplane (on the correct side), the
happier LR is.
• An SVM tries to find the separating hyperplane that
maximizes the distance of the closest points to the
margin (the support vectors). If a point is not a
support vector, it doesn’t really matter.
A different take
• Remember, in this example y ∈ {0, 1}
• Another take on the LR decision function uses the
probabilities instead:
ŷ = 1 if P(y = 1|x) ≥ P(y = 0|x), and ŷ = 0 otherwise
• where P(y = 1|x) ∝ exp(w^T x + b) and P(y = 0|x) ∝ 1.
A different take
• What if we don’t care about getting the right probability, but just want to make the right decision?
• We can express this as a constraint on the likelihood
ratio,
P(y = 1|x) / P(y = 0|x) ≥ c
• for some arbitrary constant c > 1.
A different take
• Taking the log of both sides,
log (P (y = 1|x)) − log (P (y = 0|x)) ≥ log(c)
• and plugging in the definition of P,
(w^T x + b) − 0 ≥ log(c)
⇒ w^T x + b ≥ log(c)
• c is arbitrary, so we pick it to satisfy log(c) = 1
w^T x + b ≥ 1
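• A quick numeric sanity check of this derivation (the values of w, b, and x below are made up):

import numpy as np

w, b = np.array([0.7, -0.2]), 0.3
x = np.array([2.0, 1.0])

a = w @ x + b                      # w^T x + b
p1 = np.exp(a) / (1 + np.exp(a))   # P(y = 1 | x)
p0 = 1 - p1                        # P(y = 0 | x)

# log P(y=1|x) - log P(y=0|x) equals w^T x + b
print(np.log(p1) - np.log(p0), a)  # both print 1.5

# so the constraint P(y=1|x)/P(y=0|x) >= c with log(c) = 1
# is exactly w^T x + b >= 1
print(p1 / p0 >= np.e, a >= 1)     # True, True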
A different take
• This gives a feasibility problem (specifically the perceptron
problem) which may not have a unique solution.
• Instead, put a quadratic penalty on the weights to make the
solution unique:
min_{w,b}  (1/2) ||w||²
s.t.  (2y_n − 1)(w^T x_n + b) ≥ 1,  n = 1 . . . N
• This gives us an SVM!
• We derived an SVM by asking LR to make the right decisions.
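• As a sketch, this quadratic program can be handed directly to a solver such as CVXPY (the toy data below is made up and assumed linearly separable):

import numpy as np
import cvxpy as cp

X = np.array([[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]])
y = np.array([0, 0, 1, 1])
s = 2 * y - 1                      # labels in {-1, +1}

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(s, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)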
The likelihood ratio
• The key to this derivation is the likelihood ratio,
r = P(y = 1|x) / P(y = 0|x)
  = exp(w^T x + b) / 1
  = exp(w^T x + b)
• We can think of a classifier as assigning some cost to r.
• Different costs = different classifiers.
LR cost
• Pick cost(r) = log(1 + 1/r)
              = log(1 + exp(−(w^T x + b)))
• This is the LR objective (for a positive example)!
SVM with slack variables
• If the data is not linearly separable, we can change
the program to:
min_{w,b,ξ}  (1/2) ||w||² + Σ_{n=1}^{N} ξ_n
s.t.  (2y_n − 1)(w^T x_n + b) ≥ 1 − ξ_n,  n = 1 . . . N
      ξ_n ≥ 0,  n = 1 . . . N
• Now if a point n violates the margin, we incur a cost of ξ_n, its distance to the margin.
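• At the optimum each slack variable equals a hinge penalty, max(0, 1 − (2y_n − 1)(w^T x_n + b)), so the slacks can be eliminated; a small NumPy sketch of this (X, y, w, b are hypothetical inputs):

import numpy as np

def slack(X, y, w, b):
    # optimal xi_n = max(0, 1 - (2 y_n - 1)(w^T x_n + b))
    s = 2 * y - 1
    return np.maximum(0.0, 1.0 - s * (X @ w + b))

def soft_margin_objective(X, y, w, b):
    # (1/2)||w||^2 + sum_n xi_n, with the slacks eliminated
    return 0.5 * w @ w + slack(X, y, w, b).sum()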
SVM with slack variables cost
• Pick cost(r) = max(0, 1 − log(r))
              = max(0, 1 − (w^T x + b))
LR cost vs SVM cost
• Plotted in terms of r: [plot of cost(r) for LR and for the SVM]
LR cost vs SVM cost
• Plotted in terms of w^T x + b: [plot of the logistic cost log(1 + exp(−(w^T x + b))) and the hinge cost max(0, 1 − (w^T x + b))]
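• A short matplotlib sketch that reproduces these curves (the plotting range is an arbitrary choice):

import numpy as np
import matplotlib.pyplot as plt

a = np.linspace(-3, 3, 200)          # a = w^T x + b
r = np.exp(a)                        # likelihood ratio

lr_cost = np.log(1 + 1 / r)          # = log(1 + exp(-a))
svm_cost = np.maximum(0, 1 - a)      # = max(0, 1 - log(r))

plt.plot(a, lr_cost, label="LR cost")
plt.plot(a, svm_cost, label="SVM (hinge) cost")
plt.xlabel("w^T x + b")
plt.ylabel("cost")
plt.legend()
plt.show()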
Exploiting this connection
• We can now use this connection to derive extensions
to each method.
• These might seem obvious (maybe not) and that’s
usually a good thing.
• The important point though is that they are
principled, rather than just hacks. We can trust that
they aren’t doing anything crazy.
Kernel trick for LR
• Recall that in its dual form, we can represent an SVM decision boundary as:
w^T φ(x) + b = Σ_{n=1}^{N} α_n K(x, x_n) = 0
• where φ(x) is an ∞-dimensional basis expansion of x.
• Plugging this into the LR cost:
log(1 + exp(− Σ_{n=1}^{N} α_n K(x, x_n)))
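• A sketch of this kernelized LR objective trained by gradient descent on the coefficients α_n, using an RBF kernel (the kernel width, learning rate, and step count are arbitrary choices):

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_kernel_lr(X, y, gamma=1.0, lr=0.01, steps=500):
    K = rbf_kernel(X, X, gamma)
    s = 2 * y - 1                          # labels in {-1, +1}
    alpha = np.zeros(len(y))
    for _ in range(steps):
        f = K @ alpha                      # sum_n alpha_n K(x, x_n) for each x
        # gradient of sum_n log(1 + exp(-s_n f_n)) with respect to alpha
        g = -K @ (s / (1 + np.exp(s * f)))
        alpha -= lr * g
    return alpha

def predict_kernel_lr(X_train, alpha, X_new, gamma=1.0):
    return (rbf_kernel(X_new, X_train, gamma) @ alpha >= 0).astype(int)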
Multi-class SVMs
• Recall for multi-class LR we have:
P(y = i|x) = exp(w_i^T x + b_i) / Σ_k exp(w_k^T x + b_k)
Multi-class SVMs
• Suppose instead we just want the decision rule to
satisfy:
P(y = i|x) / P(y = k|x) ≥ c,  ∀ k ≠ i
• Taking logs as before, this gives:
w_i^T x − w_k^T x ≥ 1,  ∀ k ≠ i
Multi-class SVMs
• This produces the following quadratic program:
min  (1/2) ||w||²
s.t.  (w_{y_n}^T x_n + b_{y_n}) − (w_k^T x_n + b_k) ≥ 1,  n = 1 . . . N,  k ≠ y_n
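• Relaxing these constraints with slack variables, as before, gives a multi-class hinge cost; a NumPy sketch of one common (Crammer-Singer style) version (W, b, X, y below are hypothetical inputs):

import numpy as np

def multiclass_hinge(W, b, X, y):
    # scores[n, k] = w_k^T x_n + b_k
    scores = X @ W.T + b
    correct = scores[np.arange(len(y)), y]
    # margin violation for every wrong class k != y_n
    margins = np.maximum(0.0, 1.0 + scores - correct[:, None])
    margins[np.arange(len(y)), y] = 0.0   # don't count the true class
    # penalize the worst violation per example
    return margins.max(axis=1).sum()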
Take-home message
• Logistic regression and support vector machines are
closely linked.
• Both can be viewed as taking a probabilistic model
and minimizing some cost associated with
misclassification based on the likelihood ratio.
• This lets us analyze these classifiers in a decision
theoretic framework.
• It also allows us to extend them in principled ways.
Which one to use?
• As always, depends on your problem.
• LR gives calibrated probabilities that can be interpreted as
confidence in a decision.
• LR gives us an unconstrained, smooth objective.
• LR can be (straightforwardly) used within Bayesian models.
• SVMs don’t penalize examples for which the correct decision is
made with sufficient confidence. This may be good for
generalization.
• SVMs have a nice dual form, giving sparse solutions when
using the kernel trick (better scalability).
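• A small scikit-learn illustration of these trade-offs (the toy data is randomly generated):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 2, rng.randn(50, 2) + 2])
y = np.array([0] * 50 + [1] * 50)

lr = LogisticRegression().fit(X, y)
print(lr.predict_proba(X[:3]))        # probabilities, usable as confidences

svm = SVC(kernel="rbf").fit(X, y)
print(svm.decision_function(X[:3]))   # margins, not probabilities
print(len(svm.support_))              # sparse set of support vectors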