10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University
Support Vector Machines
+
Kernels
Matt Gormley
Lecture 27
Apr. 22, 2020
1
Reminders
Homework 8: Reinforcement Learning
Out: Fri, Apr 10
Due: Wed, Apr 22 at 11:59pm
Homework 9: Learning Paradigms
Out: Wed, Apr. 22
Due: Wed, Apr. 29 at 11:59pm
Can only be submitted up to 3 days late, so that we can return grades before the final exam
Today’s In-Class Poll
http://poll.mlcourse.org
2
CONSTRAINED OPTIMIZATION
7
Constrained Optimization
8
Quadratic Program
9
SVM: Optimization Background
Whiteboard
Constrained Optimization
Linear programming
Quadratic programming
Example: 2D quadratic function with linear constraints
10
Quadratic Program (slides 11-15)
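For reference, a quadratic program has a quadratic objective with linear constraints; a common standard form (generic notation, not necessarily the exact notation used on the slides) is:

```latex
% Generic quadratic program (QP): quadratic objective, linear constraints
\min_{\mathbf{u}} \;\; \frac{1}{2}\,\mathbf{u}^T Q\,\mathbf{u} + \mathbf{c}^T\mathbf{u}
\quad \text{subject to} \quad A\mathbf{u} \le \mathbf{b}
```

When Q is positive semidefinite the problem is convex, which is what lets us hand the SVM to a black-box QP solver later in the lecture.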
SUPPORT VECTOR MACHINE
(SVM)
16
Example: Building Walls
18
https://www.facebook.com/Mondobloxx/
SVM
Whiteboard
SVM Primal (Linearly Separable Case)
This section borrows ideas from Nina Balcan’s SVM lectures at CMU and Patrick Winston’s “widest street” SVM lecture at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o).
19
SVM QP (slides 20-25)
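To make "the SVM is a QP" concrete, here is a minimal sketch that hands the hard-margin primal to a black-box convex solver. This is an illustration of my own, assuming the cvxpy package is available; the toy dataset is made up.

```python
import numpy as np
import cvxpy as cp  # assumes the cvxpy package is installed

# Toy, linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [2.0, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape

w = cp.Variable(D)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
# Support vectors are the training points whose margin constraint is tight.
margins = y * (X @ w.value + b.value)
print("support vectors:\n", X[np.isclose(margins, 1.0, atol=1e-3)])
```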
Support Vector Machines (SVMs)
Hard-margin SVM (Primal) vs. Hard-margin SVM (Lagrangian Dual)
Instead of minimizing the primal, we can maximize the dual problem.
For the SVM, these two problems give the same answer (i.e., the minimum of one is the maximum of the other).
Definition: support vectors are those points x^(i) for which α^(i) > 0.
26
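For reference, with N training examples x^(i) ∈ R^D and labels y^(i) ∈ {-1, +1}, the two formulations compared on this slide are conventionally written as:

```latex
% Hard-margin SVM (Primal)
\min_{\mathbf{w},\, b} \;\; \frac{1}{2}\|\mathbf{w}\|_2^2
\quad \text{s.t.} \quad y^{(i)}\!\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) \ge 1,
\;\; i = 1,\dots,N

% Hard-margin SVM (Lagrangian Dual)
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{N}\alpha^{(i)}
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
    \alpha^{(i)}\alpha^{(j)} y^{(i)} y^{(j)}\,(\mathbf{x}^{(i)})^T\mathbf{x}^{(j)}
\quad \text{s.t.} \quad \alpha^{(i)} \ge 0, \;\; \sum_{i=1}^{N}\alpha^{(i)} y^{(i)} = 0
```

The primal weights are recovered from the dual as w = Σ_i α^(i) y^(i) x^(i), which is why only the support vectors (those with α^(i) > 0) determine the decision boundary.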
METHOD OF LAGRANGE MULTIPLIERS
27
Method of Lagrange Multipliers (slides 28-35; figures from http://tutorial.math.lamar.edu/Classes/CalcIII/LagrangeMultipliers.aspx)
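As a compact reference for the whiteboard material: for an equality constraint (the case shown in the linked figures), constrained optima occur where the gradient of the objective is parallel to the gradient of the constraint; for inequality constraints, the same idea leads to the Lagrangian and its dual, which the SVM derivation uses.

```latex
% Equality-constrained case: optimize f(x) subject to g(x) = c
\nabla f(\mathbf{x}) = \lambda\,\nabla g(\mathbf{x}), \qquad g(\mathbf{x}) = c

% Inequality-constrained case: minimize f(x) subject to g_i(x) <= 0
\mathcal{L}(\mathbf{x}, \boldsymbol{\alpha})
  = f(\mathbf{x}) + \sum_i \alpha_i\, g_i(\mathbf{x}), \qquad \alpha_i \ge 0
% primal:  min_x  max_{alpha >= 0}  L(x, alpha)
% dual:    max_{alpha >= 0}  min_x  L(x, alpha)
```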
SVM DUAL
36
Method of Lagrange Multipliers
Whiteboard
Lagrangian Duality
Example: SVM Dual
This section borrows ideas from Nina Balcan’s SVM lectures at CMU and Patrick Winston’s “widest street” SVM lecture at MIT (https://www.youtube.com/watch?v=_PwhiWxHK8o).
37
Support Vector Machines (SVMs)
Hard-margin SVM (Primal) vs. Hard-margin SVM (Lagrangian Dual)
Instead of minimizing the primal, we can maximize the dual problem.
For the SVM, these two problems give the same answer (i.e., the minimum of one is the maximum of the other).
Definition: support vectors are those points x^(i) for which α^(i) > 0.
39
SVM EXTENSIONS
42
Soft Margin SVM
Hard-margin SVM (Primal) vs. Soft-margin SVM (Primal)
Question: If the dataset is not linearly separable, can we still use an SVM?
Answer: Not the hard-margin version. It will never find a feasible solution.
In the soft-margin version, we add “slack variables” that allow some points to violate the large-margin constraints.
The constant C dictates how large we should allow the slack variables to be.
43
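In the conventional notation, the soft-margin primal adds one slack variable ξ^(i) ≥ 0 per training example and penalizes the total slack with the constant C:

```latex
% Soft-margin SVM (Primal)
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\;
  \frac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{i=1}^{N}\xi^{(i)}
\quad \text{s.t.} \quad
  y^{(i)}\!\left(\mathbf{w}^T\mathbf{x}^{(i)} + b\right) \ge 1 - \xi^{(i)},
  \quad \xi^{(i)} \ge 0
% Large C punishes slack heavily (closer to the hard margin);
% small C tolerates more margin violations.
```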
Soft Margin SVM
Hard-margin SVM (Primal)
Soft-margin SVM (Primal)
44
Soft Margin SVM
Hard-margin SVM (Primal) vs. Hard-margin SVM (Lagrangian Dual)
Soft-margin SVM (Primal) vs. Soft-margin SVM (Lagrangian Dual)
We can also work with the dual of the soft-margin SVM.
45
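For reference, the soft-margin dual is the standard hard-margin dual with each multiplier additionally bounded above by C:

```latex
% Soft-margin SVM (Lagrangian Dual)
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{N}\alpha^{(i)}
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
    \alpha^{(i)}\alpha^{(j)} y^{(i)} y^{(j)}\,(\mathbf{x}^{(i)})^T\mathbf{x}^{(j)}
\quad \text{s.t.} \quad 0 \le \alpha^{(i)} \le C,
\;\; \sum_{i=1}^{N}\alpha^{(i)} y^{(i)} = 0
```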
Multiclass SVMs
The SVM is inherently a binary classification method, but it can be extended to handle K-class classification in many ways.
1. one-vs-rest (a minimal code sketch follows this slide):
build K binary classifiers
train the kth classifier to predict whether an instance has label k or something else
predict the class with the largest score
2. one-vs-one:
build (K choose 2) binary classifiers
train one classifier to distinguish between each pair of labels
predict the class with the most “votes” across the classifiers
46
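A minimal sketch of the one-vs-rest scheme described above, assuming scikit-learn's LinearSVC as the binary learner (the helper function names are mine, for illustration only):

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumes scikit-learn is installed

def train_one_vs_rest(X, y, num_classes):
    """Train K binary SVMs; classifier k separates label k from all other labels."""
    classifiers = []
    for k in range(num_classes):
        clf = LinearSVC()                      # binary SVM: "class k" vs. "rest"
        clf.fit(X, (y == k).astype(int))
        classifiers.append(clf)
    return classifiers

def predict_one_vs_rest(classifiers, X):
    """Predict the class whose binary classifier assigns the largest score."""
    scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
    return np.argmax(scores, axis=1)
```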
Learning Objectives
Support Vector Machines
You should be able to…
1. Motivate the learning of a decision boundary with large margin
2. Compare the decision boundary learned by SVM with that of
Perceptron
3. Distinguish unconstrained and constrained optimization
4. Compare linear and quadratic mathematical programs
5. Derive the hard margin SVM primal formulation
6. Derive the Lagrangian dual for a hard margin SVM
7. Describe the mathematical properties of support vectors and provide
an intuitive explanation of their role
8. Draw a picture of the weight vector, bias, decision boundary, training
examples, support vectors, and margin of an SVM
9. Employ slack variables to obtain the soft margin SVM
10. Implement an SVM learner using a black box quadratic programming
(QP) solver
52
KERNELS
53
Kernels: Motivation
Most real-world problems exhibit data that is not linearly separable.
Example: pixel representation for Facial Recognition
Q: When your data is not linearly separable, how can you still use a linear classifier?
A: Preprocess the data to produce nonlinear features
54
Kernels: Motivation
Motivation #1: Inefficient Features
Non-linearly separable data requires a high-dimensional representation
Might be prohibitively expensive to compute or store
Motivation #2: Memory-based Methods
k-Nearest Neighbors (KNN) for facial recognition allows a distance metric between images; no need to worry about the linearity restriction at all
55
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products x^T z of feature vectors
2. Replace the dot products x^T z with a kernel function k(x, z)
The kernel k(x, z) can be any legal definition of a dot product:
k(x, z) = φ(x)^T φ(z) for any function φ mapping R^D to a (possibly higher-dimensional) feature space
So we only compute the dot product implicitly
This “kernel trick” can be applied to many algorithms (a kernelized perceptron sketch follows this slide):
classification: perceptron, SVM, …
regression: ridge regression, …
clustering: k-means, …
57
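To make the recipe above concrete, here is a small kernelized perceptron sketch (illustrative code of my own, not from the course materials); the RBF kernel and the XOR-style toy data are assumptions chosen so a linear kernel would fail:

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_perceptron(X, y, kernel, epochs=10):
    """Kernelized perceptron: keep one coefficient per training point instead of
    an explicit weight vector, so the algorithm only ever touches k(x, z) values."""
    N = len(X)
    alpha = np.zeros(N)  # mistake counts, one per training point
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])
    for _ in range(epochs):
        for i in range(N):
            # prediction uses only kernel evaluations, never phi(x) explicitly
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:    # mistake: add this point to the expansion
                alpha[i] += 1
    return alpha

# Toy XOR-style data: not linearly separable in the original space
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, y, rbf_kernel)

def predict(x_new):
    k_vals = np.array([rbf_kernel(x_i, x_new) for x_i in X])
    return int(np.sign(np.sum(alpha * y * k_vals)))

print([predict(x) for x in X])  # the RBF kernel lets the perceptron fit XOR
```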
SVM: Kernel Trick
Hard-margin SVM (Primal) vs. Hard-margin SVM (Lagrangian Dual)
Suppose we do some feature engineering.
Our feature function is φ.
We apply φ to each input vector x.
58
SVM: Kernel Trick
Hard-margin SVM (Lagrangian Dual)
We could replace the dot product of the two feature vectors in the transformed space with a function k(x, z)
59
SVM: Kernel Trick
Hard-margin SVM (Lagrangian Dual)
We could replace the dot product of the two feature vectors in the transformed space with a function k(x, z)
60
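Written out, the substitution turns the hard-margin dual into a function of kernel evaluations only (standard form, shown here for reference):

```latex
% Hard-margin dual after the kernel substitution x^T z  ->  k(x, z)
\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{N}\alpha^{(i)}
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}
    \alpha^{(i)}\alpha^{(j)} y^{(i)} y^{(j)}\, k\!\left(\mathbf{x}^{(i)}, \mathbf{x}^{(j)}\right)
\quad \text{s.t.} \quad \alpha^{(i)} \ge 0, \;\; \sum_{i=1}^{N}\alpha^{(i)} y^{(i)} = 0
% Predictions also need only kernel values:
%   h(x) = sign( sum_i alpha^(i) y^(i) k(x^(i), x) + b )
```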
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products x^T z of feature vectors
2. Replace the dot products x^T z with a kernel function k(x, z)
The kernel k(x, z) can be any legal definition of a dot product:
k(x, z) = φ(x)^T φ(z) for any function φ mapping R^D to a (possibly higher-dimensional) feature space
So we only compute the dot product implicitly
This “kernel trick” can be applied to many algorithms:
classification: perceptron, SVM, …
regression: ridge regression, …
clustering: k-means, …
61
Kernel Methods
Q: These are just nonlinear features, right?
A: Yes, but…
Q: Can’t we just compute the feature transformation explicitly?
A: That depends...
Q: So, why all the hype about the kernel trick?
A: Because the explicit features might either be prohibitively expensive to compute or be infinite-length vectors
62
Example: Polynomial Kernel
63
Slide from Nina Balcan
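A standard version of this example: for x, z ∈ R^2 and the degree-2 polynomial kernel, the kernel value is exactly a dot product of explicit quadratic features:

```latex
k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2
  = (x_1 z_1 + x_2 z_2)^2
  = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
  = \phi(\mathbf{x})^T\phi(\mathbf{z}),
\quad \text{where } \phi(\mathbf{x}) = \left(x_1^2,\; \sqrt{2}\,x_1 x_2,\; x_2^2\right)
% Evaluating k costs O(D); writing out phi explicitly requires roughly O(D^d)
% features for a degree-d polynomial kernel.
```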
Kernel Examples
Name | Kernel Function (implicit dot product) | Feature Space (explicit dot product)
Linear | k(x, z) = x^T z | Same as original input space
Polynomial (v1) | k(x, z) = (x^T z)^d | All polynomials of degree d
Polynomial (v2) | k(x, z) = (1 + x^T z)^d | All polynomials up to degree d
Gaussian | k(x, z) = exp(-||x - z||^2 / (2 sigma^2)) | Infinite-dimensional space
Hyperbolic Tangent (Sigmoid) Kernel | k(x, z) = tanh(a x^T z + c) | (With SVM, this is equivalent to a 2-layer neural network)
66
RBF Kernel Example (slides 67-78): a sequence of figures illustrating classification with the RBF kernel.
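For reference, the RBF (Gaussian) kernel behind these figures is conventionally written as follows (the slides may use the equivalent parameterization γ = 1/(2σ^2)):

```latex
k(\mathbf{x}, \mathbf{z})
  = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{z}\|_2^2}{2\sigma^2}\right)
% Small sigma: very local influence, wiggly decision boundaries (KNN-like).
% Large sigma: smoother, more nearly linear decision boundaries.
```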
RBF Kernel Example: KNN vs. SVM (slides 79-82): figures comparing KNN and RBF-kernel SVM decision boundaries.
Kernel Methods
Key idea:
1. Rewrite the algorithm so that we only work with dot products x^T z of feature vectors
2. Replace the dot products x^T z with a kernel function k(x, z)
The kernel k(x, z) can be any legal definition of a dot product:
k(x, z) = φ(x)^T φ(z) for any function φ mapping R^D to a (possibly higher-dimensional) feature space
So we only compute the dot product implicitly
This “kernel trick” can be applied to many algorithms:
classification: perceptron, SVM, …
regression: ridge regression, …
clustering: k-means, …
83
SVM + Kernels: Takeaways
Maximizing the margin of a linear separator is a good training criterion
Support Vector Machines (SVMs) learn a max-margin linear classifier
The SVM optimization problem can be solved with black-box Quadratic Programming (QP) solvers
The learned decision boundary is defined by its support vectors
Kernel methods allow us to work in a transformed feature space without explicitly representing that space
The kernel trick can be applied to SVMs, as well as many other algorithms
86
Learning Objectives
Kernels
You should be able to…
1. Employ the kernel trick in common learning
algorithms
2. Explain why the use of a kernel produces only
an implicit representation of the transformed
feature space
3. Use the "kernel trick" to obtain a
computational complexity advantage over
explicit feature transformation
4. Sketch the decision boundaries of a linear
classifier with an RBF kernel
87