Support Vector Machines
(Vapnik, 1979)
• Assume a binary classification problem.
– Instances are represented by vectors x = (f1, f2, …, fn) ∈ ℜn.
– Training examples:
S = {(x1, y1), (x2, y2), ..., (xm, ym) | (xi, yi) ∈ ℜn × {+1, -1}}
– Hypothesis: A function h: ℜn→{+1, -1}.
h(x) = h(f1, f2, …, fn) ∈{+1, -1}
• Here, assume positive and negative instances are to be
separated by the hyperplane
w⋅ x + b = 0
Equation of the line (in the two-feature case):
w ⋅ x + b = w1 f1 + w2 f2 + b = 0
[Figure: positive (+) and negative (-) examples in the (f1, f2) plane, separated by the line w ⋅ x + b = 0.]
• Intuition: the best hyperplane (for future generalization)
will “maximally” separate the examples
Definition of Margin
• The margin of the positive examples, d+, with respect to
that hyperplane, is the shortest distance from a positive
example to the hyperplane:
[Figure: d+ shown as the distance from the closest positive example to the separating line in the (f1, f2) plane.]
• The margin of the negative examples, d-, with respect to
that hyperplane, is the shortest distance from a negative
example to the hyperplane:
[Figure: d- and d+ shown as the distances from the closest negative and positive examples to the separating line in the (f1, f2) plane.]
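As a concrete illustration of d+ and d- (the data points and the hyperplane below are made up, and the hyperplane is not necessarily the optimal one), distances can be computed with the standard point-to-hyperplane formula |w ⋅ x + b| / ||w||:

```python
# d+ and d- are the minima, over the positive and negative examples,
# of the distance |w.x + b| / ||w|| to the hyperplane w.x + b = 0.
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0            # some separating hyperplane
X_pos = np.array([[3.0, 2.0], [4.0, 4.0]])   # hypothetical positive examples
X_neg = np.array([[0.0, 0.0], [1.0, 1.0]])   # hypothetical negative examples

def distance_to_hyperplane(X):
    return np.abs(X @ w + b) / np.linalg.norm(w)

d_plus = distance_to_hyperplane(X_pos).min()    # margin of the positives
d_minus = distance_to_hyperplane(X_neg).min()   # margin of the negatives
print(d_plus, d_minus, d_plus + d_minus)        # last value: margin of S
```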
• The margin of the training set S with respect to the
hyperplane is d+ + d- .
[Figure: d+ and d- for the separating line in the (f1, f2) plane.]
• Vapnik showed that the hyperplane maximizing the margin of S
will have minimal VC dimension in the set of all consistent
hyperplanes, and will thus be optimal.
This is an optimization
problem!
• Note that the hyperplane is defined as
w ⋅ x + b = 0
• To make the math easier, we will rescale w and b such that
the hyperplane is halfway in between the closest positive
and negative examples, and such that w ⋅ x + b = +1 for the
closest positive examples and w ⋅ x + b = -1 for the closest
negative examples.
From M. A. Hearst et al. paper (on class web page)
• In this case, the margin is 2 / ||w||.
• So to maximize the margin, we need to minimize ||w||.
Minimizing ||w||
Find w and b by doing the following minimization:
minimize (1/2) ||w||²
subject to yi (w ⋅ xi + b) ≥ 1, for i = 1, …, m
This is a quadratic optimization problem. Use “standard
optimization tools” to solve it.
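For instance, this minimization can be handed to a generic constrained optimizer. The sketch below uses SciPy's SLSQP solver on a tiny made-up 2-D training set; both the data and the particular solver are illustrative choices, not the only "standard optimization tools" that would work.

```python
# Hard-margin SVM primal problem on a toy linearly separable data set:
# minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1 for all i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # made-up examples
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)                 # (1/2)||w||^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0              # must be >= 0 for every example

res = minimize(objective,
               x0=np.array([1.0, 1.0, -3.0]),  # a feasible starting guess
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
```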
• Dual formulation: It turns out that w can be expressed as
a linear combination of a small subset of the training
examples: those that lie exactly on the margin (minimum
distance to the hyperplane):
w = Σi αi yi xi
where the sum is over the training examples xi that lie exactly on the margin.
• These training examples are called “support vectors”.
They carry all relevant information about the classification
problem.
• The result of the SVM training algorithm (involving
solving a quadratic programming problem) is the αi’s and
the xi’s.
• For a new example x, we can now classify x using the
support vectors:
h(x) = sgn( Σi αi yi (xi ⋅ x) + b )
• This is the resulting SVM classifier.
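A minimal sketch of such a classifier, assuming the αi’s, the support vectors, their labels, and the threshold b have already come out of the quadratic-programming step (the numeric values below are hypothetical placeholders, not real training output):

```python
# Classify a new example x as sgn( sum_i alpha_i y_i (x_i . x) + b ),
# where the sum runs over the support vectors only.
import numpy as np

alphas = np.array([0.5, 0.5])               # hypothetical coefficients
support_vectors = np.array([[2.0, 2.0],     # hypothetical support vectors
                            [1.0, 0.0]])
sv_labels = np.array([+1.0, -1.0])
b = -1.5                                    # hypothetical threshold

def classify(x):
    return int(np.sign(np.sum(alphas * sv_labels * (support_vectors @ x)) + b))

print(classify(np.array([3.0, 2.0])))   # falls on the +1 side
print(classify(np.array([0.0, 1.0])))   # falls on the -1 side
```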
Non-linearly separable training examples
• What if the training examples are not linearly separable?
• Use an old trick: find a function that maps points to a higher
dimensional space (“feature space”) in which they are
linearly separable, and do the classification in that higher-
dimensional space.
Need to find a function Φ that will perform such a mapping:
Φ: ℜn→ F
Then can find hyperplane in higher dimensional feature space
F, and do classification using that hyperplane in higher
dimensional space.
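A toy illustration of the trick, with a made-up 1-D data set and the simple map Φ(x) = (x, x²): the points are not linearly separable on the line, but become separable in the plane:

```python
# One-dimensional points that cannot be split by a single threshold
# become linearly separable after mapping x -> (x, x^2).
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([+1, -1, -1, +1])          # +1 when |x| is large, -1 when small

phi = np.column_stack([x, x**2])        # Phi: R^1 -> R^2
# In the feature space, the horizontal line x^2 = 1 separates the classes:
print((phi[:, 1] > 1.0) == (y == +1))   # [ True  True  True  True ]
```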
• Problem:
– Recall that classification of instance x is expressed in
terms of dot products of x and support vectors.
– The quadratic programming problem of finding the
support vectors and coefficients also depends only on
dot products between training examples, rather than on
the training examples outside of dot products.
– So if each xi is replaced by Φ(xi) in these procedures,
we will have to calculate a lot of dot products,
Φ(xi)⋅ Φ(xj)
– But in general, if the feature space F is high
dimensional, Φ(xi) ⋅ Φ(xj) will be expensive to
compute.
– Also Φ(x) can be expensive to compute
• Second trick:
– Suppose that there were some magic function,
k(xi, xj) = Φ(xi) ⋅ Φ(xj)
such that k is cheap to compute even though Φ(xi) ⋅ Φ
(xj) is expensive to compute.
– Then we wouldn’t need to compute the dot product
directly; we’d just need to compute k during both the
training and testing phases.
– The good news is: such k functions exist! They are
called “kernel functions”, and come from the theory of
integral operators.
Example: Polynomial kernel:
Suppose x = (x1, x2) and y = (y1, y2). Then
k(x, y) = (x ⋅ y)² = (x1 y1 + x2 y2)²
= x1² y1² + 2 x1 x2 y1 y2 + x2² y2²
= Φ(x) ⋅ Φ(y),
where Φ(x) = (x1², √2 x1 x2, x2²).
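A quick numeric check of this identity, using arbitrary example vectors:

```python
# Verify that the degree-2 polynomial kernel k(x, y) = (x . y)^2 equals
# the dot product of the explicit feature maps Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def k(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(k(x, y), np.dot(phi(x), phi(y)))   # equal (up to floating-point error)
```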
Recap of SVM algorithm
Given training set
S = {(x1, y1), (x2, y2), ..., (xm, ym) | (xi, yi) ∈ ℜn × {+1, -1}}
1. Choose a map Φ: ℜn→ F, which maps xi to a higher
dimensional feature space. (Solves problem that X might
not be linearly separable in original space.)
2. Choose a cheap-to-compute kernel function
k(x,z)=Φ(x) ⋅ Φ(z)
(Solves problem that in high dimensional spaces, dot
products are very expensive to compute.)
3. Map all the xi’s to feature space F by computing Φ(xi).
4. Apply the quadratic programming procedure (using the kernel
function k) to find a hyperplane (w, w0),
w = Σi αi yi Φ(xi)
where the Φ(xi)’s are support vectors, the αi’s are
coefficients, and w0 is a threshold, such that (w, w0) is the
hyperplane maximizing the margin of S in F.
• Now, given a new instance x, find the classification of x
by computing
h(x) = sgn( w ⋅ Φ(x) + w0 ) = sgn( Σi αi yi k(xi, x) + w0 )
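The whole recipe can be sketched with scikit-learn’s SVC, which applies the kernel and solves the quadratic program internally; the tiny data set and the degree-2 polynomial kernel below are illustrative choices:

```python
# Train a kernel SVM on a toy data set that is not linearly separable
# in the original space, then classify new instances.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)   # large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)                    # the support vectors found by the QP
print(clf.predict([[2.0, 2.0], [0.5, 0.5]]))   # classify new instances
```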
Demo:
Spam classification using SVMs
Example:
Applying SVMs to text classification
(Dumais et al., 1998)
• Used Reuters collection of news stories.
• Question: Is a particular news story in the category
“grain” (i.e., about grain, grain prices, etc.)?
• Training examples: Vectors of features such as appearance
or frequency of key words. (Similar to our spam-
classification task.)
• Resulting SVM: weight vector defined in terms of support
vectors and coefficients, plus threshold.
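A rough sketch of this kind of setup using scikit-learn; the four “stories” and their labels below are invented placeholders, not the actual Reuters data:

```python
# Turn documents into term-weight feature vectors and train a linear SVM
# to decide whether a story is in the "grain" category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["wheat and corn prices rose sharply",
        "grain exports expected to increase this year",
        "the central bank raised interest rates",
        "new smartphone model released today"]
labels = [+1, +1, -1, -1]                  # +1 = "grain", -1 = not "grain"

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["corn and grain futures climbed"]))
```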
Results
Precision / Recall
• Confusion matrix for a classifier:

                        Classified Positive      Classified Negative
Positive Examples       True positives (TP)      False negatives (FN)
Negative Examples       False positives (FP)     True negatives (TN)
Some performance measures
• Accuracy: proportion of classifications, over all the N
examples, that were correct:
Accuracy = (TP + TN) / N
• Recall (or true positive rate, or “detection rate”):
proportion of positive examples that were classified
correctly:
Recall = TP / (TP + FN)
• Precision: proportion of correct positive classifications
over all positive classifications:
Precision = TP / (TP + FP)
Example
Test data Correct Classification Model’s Classification
x1 T T
x2 T F
x3 F T
x4 F F
x5 F T
x6 F F
x7 F F
x8 F T
Accuracy = (TP + TN) / N = (1 + 3) / 8 = 0.5
Recall = TP / (TP + FN) = 1 / 2 = 0.5
Precision = TP / (TP + FP) = 1 / 4 = 0.25
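The same numbers can be checked mechanically from the table (T mapped to 1, F to 0):

```python
# The eight (correct, predicted) pairs from the example above.
true      = [1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(true, predicted))
fn = sum(t == 1 and p == 0 for t, p in zip(true, predicted))
fp = sum(t == 0 and p == 1 for t, p in zip(true, predicted))
tn = sum(t == 0 and p == 0 for t, p in zip(true, predicted))

print("accuracy  =", (tp + tn) / len(true))   # 0.5
print("recall    =", tp / (tp + fn))          # 0.5
print("precision =", tp / (tp + fp))          # 0.25
```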
Interpretation of precision and recall
• Precision and recall are often plotted against one another,
especially in “detection” applications (such as spam
detection), when positive examples are sparse in the
observed data.
• Recall: How often did the system correctly identify
positive examples when it encountered them?
• Precision: How often did the system get positive
classifications correct?
• How do these two measures trade off against one another?
Precision / Recall Curves
SVMs also used in Watson for question classification
e.g., see Moschitti et al., Using syntactic and semantic
structural kernels for classifying definition questions in
Jeopardy!
SVM Demo