
Support Vector Machines

vs Logistic Regression
Kevin Swersky

University of Toronto CSC2515 Tutorial

Part of this tutorial is borrowed from Mark Schmidt’s excellent note on structural SVMs: http://www.di.ens.fr/~mschmidt/Documents/ssvm.pdf
Logistic regression

•  Assign probability to each outcome


P (y = 1|x) = σ(wT x + b)

•  Train to maximize the likelihood, i.e. minimize the negative log-likelihood


l(w) = − Σ_{n=1}^N [ yn log σ(wT xn + b) + (1 − yn) log(1 − σ(wT xn + b)) ]

•  Linear decision boundary (with y ∈ {0, 1}); a minimal sketch of this model follows below


ŷ = I[wT x + b ≥ 0]
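To make the model concrete, here is a minimal numpy sketch of the pieces above: the sigmoid probability, the negative log-likelihood, a plain full-batch gradient step, and the thresholded prediction. The learning rate and the batch setup are illustrative choices of this note, not part of the tutorial.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(w, b, X, y):
    # l(w) = -sum_n [ y_n log sigma(w^T x_n + b) + (1 - y_n) log(1 - sigma(w^T x_n + b)) ]
    p = sigmoid(X @ w + b)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_step(w, b, X, y, lr=0.1):
    # Gradient of the NLL: X^T (p - y) for w and sum(p - y) for b
    p = sigmoid(X @ w + b)
    return w - lr * (X.T @ (p - y)), b - lr * np.sum(p - y)

def predict(w, b, X):
    # Linear decision boundary: y_hat = I[w^T x + b >= 0]
    return (X @ w + b >= 0).astype(int)
```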
Support vector machines

[Figure: maximum-margin separating hyperplane with support vectors. Source: Wikipedia]
Support vector machines

•  Enforce a margin of separation (here, y ∈ {0, 1})


(2yn − 1)(wT xn + b) ≥ 1, ∀n = 1...N

•  Train to find the maximum margin via the following program (a solver sketch appears after this slide)


min (1/2)||w||²
s.t. (2yn − 1)(wT xn + b) ≥ 1, ∀n = 1...N

•  Linear decision boundary


ŷ = I[wT x + b ≥ 0]
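The hard-margin program above can be handed to any generic QP solver. Below is a minimal sketch using cvxpy, which is an assumption of this note rather than anything the tutorial prescribes; the problem is only feasible when the data are linearly separable.

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    # X: (N, d) features, y: (N,) labels in {0, 1}
    N, d = X.shape
    s = 2 * y - 1                           # map {0, 1} labels to {-1, +1}
    w = cp.Variable(d)
    b = cp.Variable()
    # min (1/2)||w||^2  s.t.  (2y_n - 1)(w^T x_n + b) >= 1 for all n
    constraints = [cp.multiply(s, X @ w + b) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()
    return w.value, b.value
```

At the solution, the support vectors are exactly the points whose constraints are tight (margin equal to 1).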
Recap

•  Logistic regression focuses on maximizing the


probability of the data. The farther the data lies from
the separating hyperplane (on the correct side), the
happier LR is.

•  An SVM tries to find the separating hyperplane that


maximizes the distance of the closest points to the
margin (the support vectors). If a point is not a
support vector, it doesn’t really matter.
A different take

•  Remember, in this example y ∈ {0, 1}

•  Another take on the LR decision function uses the


probabilities instead:

ŷ = 1 if P(y = 1|x) ≥ P(y = 0|x), 0 otherwise

P(y = 1|x) / P(y = 0|x) = exp(wT x + b) / 1
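As a quick numerical check (the random data, weights, and seed are purely illustrative), comparing the two probabilities gives exactly the same decisions as thresholding wT x + b:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # a few random points
w, b = rng.normal(size=3), 0.5                 # arbitrary weights and bias

p1 = 1.0 / (1.0 + np.exp(-(X @ w + b)))        # P(y = 1|x)
by_probability = (p1 >= 1 - p1).astype(int)    # pick the more probable class
by_threshold = (X @ w + b >= 0).astype(int)    # y_hat = I[w^T x + b >= 0]
assert np.array_equal(by_probability, by_threshold)
```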
A different take

•  What if we don’t care about getting the right


probability, we just want to make the right decision?

•  We can express this as a constraint on the likelihood


ratio,

P(y = 1|x) / P(y = 0|x) ≥ c

•  for some arbitrary constant c > 1.
A different take

•  Taking the log of both sides,


log (P (y = 1|x)) − log (P (y = 0|x)) ≥ log(c)

•  and plugging in the definition of P,

(wT x + b) − 0 ≥ log(c)
⟹ wT x + b ≥ log(c)

•  c is arbitrary, so we pick it to satisfy log(c) = 1


wT x + b ≥ 1
A different take

•  This gives a feasibility problem (specifically the perceptron


problem) which may not have a unique solution.

•  Instead, put a quadratic penalty on the weights to make the


solution unique:
min (1/2)||w||²
s.t. (2yn − 1)(wT xn + b) ≥ 1, ∀n = 1...N

•  This gives us an SVM!

•  We derived an SVM by asking LR to make the right decisions.


The likelihood ratio

•  The key to this derivation is the likelihood ratio,

r = P(y = 1|x) / P(y = 0|x)
  = exp(wT x + b) / 1
  = exp(wT x + b)
•  We can think of a classifier as assigning some cost to r.

•  Different costs = different classifiers.


LR cost
•  Pick cost(r) = log(1 + 1/r)
               = log(1 + exp(−(wT x + b)))

•  This is the LR objective (for a positive example)!


SVM with slack variables

•  If the data is not linearly separable, we can change


the program to:
min (1/2)||w||² + Σ_{n=1}^N ξn
s.t. (2yn − 1)(wT xn + b) ≥ 1 − ξn, ∀n = 1...N
     ξn ≥ 0, ∀n = 1...N

•  Now if a point n is misclassified (or falls inside the margin), we incur a cost of ξn, its distance to the margin. An equivalent unconstrained hinge-loss form is sketched below.
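Eliminating the slack variables gives the equivalent unconstrained objective (1/2)||w||² + Σn max(0, 1 − (2yn − 1)(wT xn + b)), since at the optimum each ξn equals that hinge term. A minimal numpy sketch of subgradient descent on it, with an illustrative learning rate:

```python
import numpy as np

def soft_margin_objective(w, b, X, y):
    # (1/2)||w||^2 + sum_n xi_n  with  xi_n = max(0, 1 - (2y_n - 1)(w^T x_n + b))
    s = 2 * y - 1
    slacks = np.maximum(0.0, 1.0 - s * (X @ w + b))
    return 0.5 * np.dot(w, w) + np.sum(slacks)

def subgradient_step(w, b, X, y, lr=0.01):
    # Only points that violate the margin (s_n (w^T x_n + b) < 1) contribute
    s = 2 * y - 1
    active = s * (X @ w + b) < 1
    grad_w = w - X[active].T @ s[active]
    grad_b = -np.sum(s[active])
    return w - lr * grad_w, b - lr * grad_b
```

Points outside the margin contribute nothing to the subgradient, which is exactly the "if a point is not a support vector, it doesn’t really matter" behaviour from the recap slide.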
SVM with slack variables cost

•  Pick cost(r) = max(0, 1 − log(r))
               = max(0, 1 − (wT x + b))
LR cost vs SVM cost

•  Plotted in terms of r,
LR cost vs SVM cost

•  Plotted in terms of wT x + b; a short sketch to reproduce both plots follows below
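A short matplotlib sketch to reproduce the two comparisons (matplotlib, the plotting range, and combining both views into one figure are assumptions of this note):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-3, 3, 200)               # z = w^T x + b, so r = exp(z)
r = np.exp(z)

lr_cost = np.log(1 + 1 / r)               # LR cost:  log(1 + 1/r)
svm_cost = np.maximum(0, 1 - np.log(r))   # SVM cost: max(0, 1 - log(r))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(r, lr_cost, label="LR"); ax1.plot(r, svm_cost, label="SVM")
ax1.set_xlabel("r"); ax1.set_ylabel("cost"); ax1.legend()
ax2.plot(z, lr_cost, label="LR"); ax2.plot(z, svm_cost, label="SVM")
ax2.set_xlabel("w^T x + b"); ax2.legend()
plt.show()
```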
Exploiting this connection

•  We can now use this connection to derive extensions


to each method.

•  These might seem obvious (maybe not) and that’s


usually a good thing.

•  The important point though is that they are


principled, rather than just hacks. We can trust that
they aren’t doing anything crazy.
Kernel trick for LR

•  Recall that in its dual form, we can represent an SVM decision boundary as:
wT φ(x) + b = Σ_{n=1}^N αn K(x, xn) = 0

•  where φ(x) is a (possibly ∞-dimensional) basis expansion of x.

•  Plugging this into the LR cost (for a positive example; a sketch follows below):


log(1 + exp(− Σ_{n=1}^N αn K(x, xn)))
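A minimal numpy sketch of this kernelized cost over a training set, assuming an RBF kernel and labels y ∈ {0, 1}; the kernel choice, gamma, and function names are illustrative, and the bias is dropped as on the slide.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2)
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_lr_cost(alpha, K, y):
    # sum_m log(1 + exp(-(2y_m - 1) * sum_n alpha_n K(x_m, x_n)))
    s = 2 * y - 1
    f = K @ alpha
    return np.sum(np.log(1 + np.exp(-s * f)))

def kernel_lr_grad(alpha, K, y):
    # Gradient with respect to alpha (K is symmetric), usable for plain gradient descent
    s = 2 * y - 1
    f = K @ alpha
    return -K @ (s / (1 + np.exp(s * f)))
```

As with a kernel SVM, predicting on new points only needs kernel evaluations against the training points, e.g. rbf_kernel(X_new, X_train) @ alpha ≥ 0.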
Multi-class SVMs

•  Recall for multi-class LR we have:

P(y = i|x) = exp(wiT x + bi) / Σ_k exp(wkT x + bk)
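A small numpy sketch of this softmax, with the usual max-subtraction for numerical stability (W and b stack the per-class weights and biases; the names are illustrative):

```python
import numpy as np

def softmax_probs(W, b, x):
    # P(y = i|x) = exp(w_i^T x + b_i) / sum_k exp(w_k^T x + b_k)
    scores = W @ x + b
    scores = scores - scores.max()   # shifting all scores leaves the ratios unchanged
    e = np.exp(scores)
    return e / e.sum()
```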
Multi-class SVMs

•  Suppose instead we just want the decision rule to


satisfy:

P(y = i|x) / P(y = k|x) ≥ c, ∀k ≠ i

•  Taking logs as before, this gives:


(wiT x + bi) − (wkT x + bk) ≥ 1, ∀k ≠ i
Multi-class SVMs

•  This produces the following quadratic program (its slack-based analogue is sketched below):

min (1/2)||w||²
s.t. (wynT xn + byn) − (wkT xn + bk) ≥ 1, ∀n = 1...N, ∀k ≠ yn
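For reference, the slack-based analogue of this program is the multi-class hinge loss sketched below; the regularization weight reg and the vectorized layout are illustrative choices of this note.

```python
import numpy as np

def multiclass_hinge_loss(W, b, X, y, reg=1.0):
    # W: (K, d) per-class weights, b: (K,), X: (N, d), y: (N,) integer class labels
    scores = X @ W.T + b                          # scores[n, k] = w_k^T x_n + b_k
    correct = scores[np.arange(len(y)), y]        # w_{y_n}^T x_n + b_{y_n}
    # Each violated constraint (correct - score_k < 1, k != y_n) pays its violation
    margins = np.maximum(0.0, 1.0 - (correct[:, None] - scores))
    margins[np.arange(len(y)), y] = 0.0           # exclude k = y_n
    return 0.5 * reg * np.sum(W**2) + np.sum(margins)
```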
Take-home message

•  Logistic regression and support vector machines are


closely linked.

•  Both can be viewed as taking a probabilistic model


and minimizing some cost associated with
misclassification based on the likelihood ratio.

•  This lets us analyze these classifiers in a decision


theoretic framework.

•  It also allows us to extend them in principled ways.


Which one to use?

•  As always, it depends on your problem.


•  LR gives calibrated probabilities that can be interpreted as
confidence in a decision.
•  LR gives us an unconstrained, smooth objective.
•  LR can be (straightforwardly) used within Bayesian models.
•  SVMs don’t penalize examples for which the correct decision is
made with sufficient confidence. This may be good for
generalization.
•  SVMs have a nice dual form, giving sparse solutions when
using the kernel trick (better scalability).
