Lecture: Classification with Support Vector Machines
CS 2XX: Mathematics for AI and ML
Chandresh Kumar Maurya
IIT Indore
https://chandreshiit.github.io
November 17, 2024
Slide credits: Yi, Yung
Warm-Up
Please watch this tutorial video by Luis Serrano on Support Vector Machines:
https://youtu.be/Lpr__X8zuE8
Roadmap
(1) Story and Separating Hyperplanes
(2) Primal SVM: Hard SVM
(3) Primal SVM: Soft SVM
(4) Dual SVM
(5) Kernels
(6) Numerical Solution
Storyline
• (Binary) classification vs. regression
• A classification predictor $f: \mathbb{R}^D \to \{+1, -1\}$, where $D$ is the dimension of the features.
• Supervised learning, as in regression, with a given dataset $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, where our task is to learn the model parameters that produce the smallest classification error.
• SVM
◦ A geometric way of thinking about supervised learning
◦ Relies on empirical risk minimization
◦ Binary classification = drawing a separating hyperplane
◦ Admits interpretations from several perspectives: the geometric view, the loss-function view, and the view from convex hulls of the data points
Hard SVM vs. Soft SVM
• Hard SVM: the data are linearly separable, and thus no classification error is allowed
• Soft SVM: the data are not linearly separable, and thus some classification error is allowed
Separating Hyperplane
• A hyperplane in $\mathbb{R}^D$ is a set $\{x \mid a^T x = b\}$, where $a \in \mathbb{R}^D$, $a \neq 0$, $b \in \mathbb{R}$ (L7(3)).
In other words, it is $\{x \mid a^T (x - x_0) = 0\}$, where $x_0$ is any point on the hyperplane, i.e., $a^T x_0 = b$.
• It divides $\mathbb{R}^D$ into two halfspaces:
$\{x \mid a^T x \leq b\}$ and $\{x \mid a^T x > b\}$
• In our problem, we consider the hyperplane $w^T x + b = 0$, where $w$ and $b$ are the parameters of the model.
• Classification logic
$$
\begin{cases}
w^T x_n + b \geq 0 & \text{when } y_n = +1 \\
w^T x_n + b < 0 & \text{when } y_n = -1
\end{cases}
\;\Longrightarrow\;
y_n (w^T x_n + b) \geq 0
$$
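As a quick illustration (a minimal sketch with made-up numbers, not taken from the slides), the combined condition $y_n (w^T x_n + b) \geq 0$ can be checked for an entire dataset in one vectorized expression:

```python
import numpy as np

# Minimal sketch: checking the classification rule y_n * (w^T x_n + b) >= 0
# for a hypothetical linear classifier (w, b) on two toy samples.
w = np.array([2.0, -1.0])      # hypothetical weight vector
b = 0.5                        # hypothetical bias
X = np.array([[1.0, 1.0],      # N x D matrix of feature vectors
              [-1.0, 2.0]])
y = np.array([+1, -1])         # labels in {+1, -1}

margins = y * (X @ w + b)      # y_n * (w^T x_n + b) for each sample
print(margins >= 0)            # True where the sample is correctly classified
```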
Distance between Two Hyperplanes
• Consider two hyperplanes $w^T x - b = 0$ and $w^T x - b = r$, where we assume $r > 0$.
• Question. What is the distance between the two hyperplanes? Answer: $\dfrac{r}{\|w\|}$
(Here, distance means the shortest distance between the two hyperplanes.)
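A one-line derivation sketch of this answer (assuming $w \neq 0$): start from a point $x_0$ on the first hyperplane and move along the normal direction $w$ until the second hyperplane is reached.

```latex
% Take x_0 with w^T x_0 = b and move along w: x = x_0 + t w.
% The second hyperplane requires w^T (x_0 + t w) = b + r, which fixes t,
% and the distance traveled is \|t w\|.
\[
  w^T (x_0 + t w) = b + r
  \;\Longrightarrow\;
  t = \frac{r}{\|w\|^2}
  \;\Longrightarrow\;
  \text{distance} = \|t\, w\| = \frac{r}{\|w\|}.
\]
```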
Hard Support Vector Machine
• Assume that the data points are linearly separable.
• Goal: Find the hyperplane that maximizes the margin between the positive and the
negative samples
• Given the training dataset $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ and a hyperplane $w^T x + b = 0$, what is the constraint that all data points are at least $\frac{r}{\|w\|}$ away from the hyperplane?
$$ y_n (w^T x_n + b) \geq \frac{r}{\|w\|} $$
• Note that $r$ and $\|w\|$ are scaled together, so if we fix $\|w\| = 1$, then
$$ y_n (w^T x_n + b) \geq r $$
Hard SVM: Formulation 1
• Maximize the margin, such that all the training data points are well-classified into
their classes (+ or −)
$$ \max_{w, b, r} \; r \quad \text{subject to} \quad y_n (w^T x_n + b) \geq r \;\; \text{for all } n = 1, \ldots, N, \quad \|w\| = 1, \quad r > 0 $$
Formulation 2 (1)
$$ \max_{w, b, r} \; r \quad \text{subject to} \quad y_n (w^T x_n + b) \geq r \;\; \text{for all } n = 1, \ldots, N, \quad \|w\| = 1, \quad r > 0 $$
• Since $\|w\| = 1$, reformulate $w$ in terms of $w'$ as: $y_n \left( \frac{w'^T}{\|w'\|} x_n + b \right) \geq r$
• Change the objective from $r$ to $r^2$.
• Define $w''$ and $b''$ by rescaling the constraint:
$$ y_n \left( \frac{w'^T}{\|w'\|} x_n + b \right) \geq r \;\Longleftrightarrow\; y_n \left( w''^T x_n + b'' \right) \geq 1, \quad \text{where } w'' = \frac{w'}{\|w'\|\, r} \text{ and } b'' = \frac{b}{r} $$
Formulation 2 (2)
• Note that $\|w''\| = \frac{1}{r}$
• Thus, we have the following reformulated problem:
$$ \max_{w'', b''} \; \frac{1}{\|w''\|^2} \quad \text{subject to} \quad y_n \left( w''^T x_n + b'' \right) \geq 1 \;\; \text{for all } n = 1, \ldots, N, $$
which is equivalent to (dropping the double primes):
$$ \min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_n (w^T x_n + b) \geq 1 \;\; \text{for all } n = 1, \ldots, N $$
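This final hard-margin problem is a convex quadratic program, so it can be handed to a generic solver. Below is a minimal sketch (made-up toy data; the cvxpy modeling library is an assumed tool, not prescribed by the slides):

```python
import numpy as np
import cvxpy as cp

# Minimal sketch of the hard-margin primal:
#   min_{w,b} (1/2)||w||^2   s.t.  y_n (w^T x_n + b) >= 1  for all n.
X = np.array([[2.0, 2.0], [3.0, 3.0],      # class +1 (hypothetical points)
              [-1.0, -1.0], [-2.0, 0.0]])  # class -1
y = np.array([1.0, 1.0, -1.0, -1.0])
N, D = X.shape

w = cp.Variable(D)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)            # maximum-margin hyperplane
print("margins:", y * (X @ w.value + b.value))   # all >= 1 (up to solver tolerance)
```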
Understanding Formulation 2 Intuitively
• Given the training dataset $\{(x_1, y_1), \ldots, (x_N, y_N)\}$ and a hyperplane $w^T x + b = 0$, what is the constraint that all data points are at least $\frac{r}{\|w\|}$ away from the hyperplane?
$$ y_n (w^T x_n + b) \geq \frac{r}{\|w\|} $$
• Formulation 1. Note that $r$ and $\|w\|$ are scaled together, so if we fix $\|w\| = 1$, then
$$ y_n (w^T x_n + b) \geq r. $$
And, maximize $r$.
• Formulation 2. If we fix $r = 1$, then
$$ y_n (w^T x_n + b) \geq 1. $$
And, minimize $\|w\|$.
Soft SVM: Geometric View
• Now we allow some classification errors, because the data are not linearly separable.
• Introduce slack variables that quantify how much error is allowed in the optimization problem:
• ξ = (ξn : n = 1, . . . , N)
• ξn : slack for the n-th sample (xn , yn )
$$ \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n \quad \text{subject to} \quad y_n (w^T x_n + b) \geq 1 - \xi_n, \;\; \xi_n \geq 0, \;\; \text{for all } n $$
• $C$: trade-off between the margin width and the total slack
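As with the hard-margin case, this is a convex QP. A minimal sketch with explicit slack variables (made-up, non-separable toy data; cvxpy is an assumed tool, not prescribed by the slides):

```python
import numpy as np
import cvxpy as cp

# Minimal sketch of the soft-margin primal with explicit slack variables xi_n.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0],   # class +1 (third point is an outlier)
              [-1.0, -1.0], [-2.0, 0.0]])             # class -1
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
N, D = X.shape
C = 1.0                                   # trade-off between margin width and total slack

w, b, xi = cp.Variable(D), cp.Variable(), cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("slacks:", np.round(xi.value, 3))   # nonzero entries mark margin violations
```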
Soft SVM: Loss Function View (1)
• From the perspective of empirical risk minimization
• Loss function design
◦ zero-one loss $\mathbb{1}(f(x_n) \neq y_n)$: # of mismatches between the prediction and the label
=⇒ combinatorial optimization (typically NP-hard)
◦ hinge loss: $\ell(t) = \max(0, 1 - t)$, where $t = y f(x) = y (w^T x + b)$
▶ If $x$ is well on the correct side, $t \geq 1$ → $\ell(t) = 0$
▶ If $x$ is on the correct side, but too close to the boundary, $0 < t < 1$ → $0 < \ell(t) = 1 - t < 1$
▶ If $x$ is on the wrong side, $t < 0$ → $1 < \ell(t) = 1 - t$
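A minimal sketch (not from the slides) evaluating the hinge loss on one value of $t$ for each of the three cases above:

```python
import numpy as np

# Minimal sketch: hinge loss max(0, 1 - t) with t = y * (w^T x + b).
def hinge_loss(t):
    """Elementwise hinge loss max(0, 1 - t)."""
    return np.maximum(0.0, 1.0 - t)

t = np.array([1.5, 0.4, -2.0])   # well-classified, inside the margin, misclassified
print(hinge_loss(t))             # -> [0.0, 0.6, 3.0]
```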
Soft SVM: Loss Function View (2)
$$ \min_{w, b} \; (\text{regularizer} + \text{loss}) = \min_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \max\{0, 1 - y_n (w^T x_n + b)\} $$
• $\frac{1}{2} \|w\|^2$: L2-regularizer (margin maximization = regularization)
• $C$: regularization parameter, which here multiplies the loss term rather than the regularization term
• Why is this loss-function view the same as the geometric view?
$$ \min_{t} \; \max(0, 1 - t) \;\Longleftrightarrow\; \min_{\xi, t} \; \xi, \quad \text{subject to} \quad \xi \geq 0, \;\; \xi \geq 1 - t $$
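The equivalence can be seen pointwise: for a fixed $t$, the smallest feasible slack is exactly the hinge loss (a one-line sketch filling in the step):

```latex
% For fixed t, \xi must satisfy \xi >= 0 and \xi >= 1 - t, so the smallest
% feasible \xi is the larger of the two lower bounds.
\[
  \min_{\xi} \{\, \xi \;:\; \xi \geq 0,\ \xi \geq 1 - t \,\} = \max(0,\, 1 - t).
\]
```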
Dual SVM: Idea
$$ \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n \quad \text{subject to} \quad y_n (w^T x_n + b) \geq 1 - \xi_n, \;\; \xi_n \geq 0, \;\; \text{for all } n $$
• The above primal problem is a convex optimization problem.
• Let's apply Lagrange multipliers, find another formulation, and see what other nice properties emerge (L7(2), L7(4))
• Convert the problem into "≤" constraints, so as to apply the min-min-max rule:
$$ \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n, \quad \text{s.t.} \quad -y_n (w^T x_n + b) \leq -1 + \xi_n, \;\; -\xi_n \leq 0, \;\; \text{for all } n $$
Applying Lagrange Multipliers (1)
$$ \min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n, \quad \text{s.t.} \quad -y_n (w^T x_n + b) \leq -1 + \xi_n, \;\; -\xi_n \leq 0, \;\; \text{for all } n $$
• Lagrangian with multipliers αn ≥ 0 and γn ≥ 0
$$ L(w, b, \xi, \alpha, \gamma) = \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n - \sum_{n=1}^{N} \alpha_n \left[ y_n (w^T x_n + b) - 1 + \xi_n \right] - \sum_{n=1}^{N} \gamma_n \xi_n $$
• Dual function: $D(\alpha, \gamma) = \inf_{w, b, \xi} L(w, b, \xi, \alpha, \gamma)$, for which the following conditions should be met:
$$ \text{(D1)}\;\; \frac{\partial L}{\partial w} = w^T - \sum_{n=1}^{N} \alpha_n y_n x_n^T = 0, \qquad \text{(D2)}\;\; \frac{\partial L}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0, \qquad \text{(D3)}\;\; \frac{\partial L}{\partial \xi_n} = C - \alpha_n - \gamma_n = 0 $$
Applying Lagrange Multipliers (2)
• The dual function $D(\alpha, \gamma) = \inf_{w, b, \xi} L(w, b, \xi, \alpha, \gamma)$ with (D1) is given by:
$$ D(\alpha, \gamma) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} y_i \alpha_i \left\langle \sum_{j=1}^{N} y_j \alpha_j x_j,\, x_i \right\rangle - b \sum_{i=1}^{N} y_i \alpha_i + \sum_{i=1}^{N} \alpha_i + \sum_{i=1}^{N} (C - \alpha_i - \gamma_i)\, \xi_i $$
• From (D2) and (D3), the above simplifies (the $b$-term vanishes by (D2), the $\xi$-term by (D3), and the two quadratic terms combine) into:
$$ D(\alpha, \gamma) = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle + \sum_{i=1}^{N} \alpha_i $$
• αi , γi ≥ 0 and C − αi − γi = 0 =⇒ 0 ≤ αi ≤ C
Dual SVM
• (Lagrangian) Dual Problem: maximize $D(\alpha, \gamma)$, or equivalently:
$$ \min_{\alpha} \; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle - \sum_{i=1}^{N} \alpha_i \quad \text{subject to} \quad \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \leq \alpha_i \leq C, \;\; \forall i = 1, \ldots, N $$
• Primal SVM: the number of parameters scales with the number of features ($D$)
• Dual SVM
◦ the number of parameters scales with the number of training data points ($N$)
◦ depends only on the inner products of training data points $\langle x_i, x_j \rangle$ → allows the application of kernels
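As an illustration (a minimal sketch with made-up data; cvxpy is an assumed tool, not prescribed by the slides), the dual can also be handed to a generic convex solver; the quadratic term is expressed via $\frac{1}{2}\sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i, x_j\rangle = \frac{1}{2}\big\|\sum_n \alpha_n y_n x_n\big\|^2$:

```python
import numpy as np
import cvxpy as cp

# Minimal sketch of the dual problem on a tiny toy set.
X = np.array([[2.0, 2.0], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = X.shape[0]
C = 10.0

alpha = cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)) - cp.sum(alpha))
constraints = [cp.sum(cp.multiply(y, alpha)) == 0, alpha >= 0, alpha <= C]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = X.T @ (a * y)                 # recover w via (D1): w = sum_n alpha_n y_n x_n
sv = np.where(a > 1e-5)[0]        # support vectors have alpha_n > 0
b = np.mean(y[sv] - X[sv] @ w)    # b from support vectors with 0 < alpha_n < C (assumed here)
print("alpha =", np.round(a, 3))
print("w =", w, "b =", b)
```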
Kernel
• Modularity: using a feature transformation $\phi(x)$, dual SVMs can be modularized:
$\langle x_i, x_j \rangle \Longrightarrow \langle \phi(x_i), \phi(x_j) \rangle$
• Similarity function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$
• Kernel matrix (Gram matrix): must be symmetric and positive semidefinite
• Examples: polynomial kernel, Gaussian
radial basis function, rational quadratic
kernel
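A minimal sketch (made-up points, not from the slides) of one of these examples, the Gaussian RBF kernel $k(x_i, x_j) = \exp\!\big(-\|x_i - x_j\|^2 / (2\sigma^2)\big)$, together with a numerical check that its Gram matrix is symmetric and positive semidefinite:

```python
import numpy as np

# Minimal sketch: Gaussian RBF kernel and its Gram matrix on three toy points.
def rbf_gram(A, B, sigma=1.0):
    """Gram matrix K with K[i, j] = k(A[i], B[j]) for the Gaussian RBF kernel."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(X, X)
print(K)                                        # ones on the diagonal
print(np.allclose(K, K.T))                      # symmetric
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))  # positive semidefinite (up to tolerance)
```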
Numerical Solution
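In practice, the dual QP is solved by specialized numerical routines. A minimal sketch (made-up toy data; scikit-learn's SVC, built on the LIBSVM SMO-type solver, is an assumed tool, not prescribed by the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch: solving the soft-margin dual numerically with an RBF kernel.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0],
              [-1.0, -1.0], [-2.0, 0.0], [1.0, 1.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)    # the x_n with alpha_n > 0
print("dual coefficients:", clf.dual_coef_)          # y_n * alpha_n for the support vectors
print("train predictions:", clf.predict(X))
```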
Questions?
Review Questions
1)