Support Vector Machine (SVM)
Consider binary classification with labels y_i ∈ {-1, +1}, and data that is linearly separable.
(Figure: two classes of points in the x^(1)-x^(2) plane, separated by the hyperplane w^T x + b = 0.)
h(x_i) = sign(w^T x_i + b)
If y_i (w^T x_i + b) > 0, then x_i is correctly classified (y_i ∈ {-1, +1}).
Many hyperplanes exist which satisfy this criterion.
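A minimal numpy sketch of this decision rule (the values of w, b and the two labelled points below are made up for illustration):

import numpy as np

w = np.array([2.0, -1.0])              # hypothetical hyperplane parameters
b = 0.5
X = np.array([[1.0, 1.0],              # hypothetical labelled points
              [-1.0, 2.0]])
y = np.array([+1, -1])

h = np.sign(X @ w + b)                 # h(x_i) = sign(w^T x_i + b)
correct = y * (X @ w + b) > 0          # x_i correctly classified iff y_i (w^T x_i + b) > 0
print(h, correct)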
Best hyperplane (SVM): the hyperplane that maximizes the distance to the closest data points from both classes.
(Figure: separating hyperplane with margin γ to the nearest point of each class, axes x^(1), x^(2).)
Maximize the margin (γ).
x, w, d, x_p are vectors here.
x_p = projection of x on the hyperplane H
d = vector of minimum length from H to x
x_p = x - d
d is parallel to w  ⟹  d = a·w for some scalar a
x_p lies on H:
    w^T x_p + b = 0
    w^T (x - a·w) + b = 0
    w^T x - a·w^T w + b = 0
    ⟹  a = (w^T x + b) / (w^T w)
Length of d:
    ||d||_2 = √(d^T d) = √(a² w^T w) = |a|·||w||_2 = |w^T x + b| / ||w||_2
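A quick numerical check of this projection/distance derivation (w, b, x are arbitrary made-up values):

import numpy as np

w = np.array([3.0, 4.0])                     # hypothetical normal vector
b = -2.0
x = np.array([1.0, 5.0])                     # hypothetical point

a = (w @ x + b) / (w @ w)                    # a = (w^T x + b) / (w^T w)
d = a * w                                    # d is parallel to w
xp = x - d                                   # projection of x onto the hyperplane

print(w @ xp + b)                            # ~0: x_p lies on the hyperplane
print(np.linalg.norm(d))                     # ||d||_2
print(abs(w @ x + b) / np.linalg.norm(w))    # same value: |w^T x + b| / ||w||_2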
Margin over the data D:
    γ(w, b) = min_{x ∈ D} |w^T x + b| / ||w||_2        ← distance to the closest data point
Aim is to maximize γ(w, b):
    max_{w, b} γ(w, b)
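A small sketch of computing γ(w, b) on a toy dataset D (all values hypothetical):

import numpy as np

w = np.array([1.0, 1.0])
b = -1.0
D = np.array([[2.0, 2.0],                    # hypothetical training inputs
              [0.0, 0.0],
              [3.0, 1.0]])

gamma = np.min(np.abs(D @ w + b)) / np.linalg.norm(w)   # min_x |w^T x + b| / ||w||_2
print(gamma)                                 # margin of this particular hyperplane w.r.t. D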
Additional constraints:
(i)  The hyperplane should lie in between the data points, i.e. separate them:
         w^T x_i + b > 0 if y_i = +1,  and  w^T x_i + b < 0 if y_i = -1
     (equivalently y_i (w^T x_i + b) ≥ 0 ∀ i, with w^T x + b = 0 only at the hyperplane itself).
(ii) The hyperplane is scale invariant: for any c ≠ 0, c(w^T x + b) = 0, i.e. c·w^T x + c·b = 0, describes the same hyperplane. So fix the scale of (w, b) such that the closest points on either side satisfy w^T x + b = +1 and w^T x + b = -1:
         min_{x_i ∈ D} |w^T x_i + b| = 1
     (Figure: hyperplane w^T x + b = 0 with the two margin lines w^T x + b = +1 and w^T x + b = -1.)
Combining both constraints:
         y_i (w^T x_i + b) ≥ 1   ∀ i        ... (iii)
----
Objective function (making use of the constraints):
    max γ(w, b)  =  max 1/||w||_2  ⟺  min ||w||_2  ⟺  min w^T w
Objective:
    min_{w, b} w^T w    s.t.    y_i (w^T x_i + b) ≥ 1  ∀ i
This is a quadratic optimization problem; a unique solution exists.
The objective is to find the smallest w^T w such that all inputs are at least one unit away from the hyperplane, on the correct side.
Support vectors: the points which are closest to the hyperplane and exactly satisfy the constraint y_i (w^T x_i + b) = 1.
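A hedged sketch of this quadratic program using cvxpy (a generic convex-optimization library, not something the notes prescribe) on made-up linearly separable data; the support vectors are the points whose constraint is tight, y_i (w^T x_i + b) = 1, up to solver tolerance:

import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0],        # made-up separable data
              [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# min w^T w   s.t.   y_i (w^T x_i + b) >= 1
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()

margins = y * (X @ w.value + b.value)
print(w.value, b.value)
print(np.isclose(margins, 1.0, atol=1e-4))   # True entries are the support vectors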
SVM with soft constraints:
Penalize mistakes, but still allow them.
If we want to relax y_i (w^T x_i + b) ≥ 1 for some points, introduce slack variables ξ_i:
    y_i (w^T x_i + b) ≥ 1 - ξ_i,    ξ_i ≥ 0
Let's say y_i (w^T x_i + b) = 0.5. Then ξ_i = 0.5 to satisfy y_i (w^T x_i + b) ≥ 1 - ξ_i. This point is inside the margin: 0 ≤ y_i (w^T x_i + b) < 1, so 0 < ξ_i ≤ 1.
If y_i (w^T x_i + b) = -3, then ξ_i = 4 to satisfy the constraint. This point is in the opposite region (misclassified), so ξ_i > 1.
    min_{w, b}  w^T w + C Σ_i ξ_i    s.t.    y_i (w^T x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0
ξ_i should be as small as possible.
C large: all points should try to be on the correct side of the hyperplane.
C small: some allowance is given for a few points to be on the incorrect side.
C is a hyper-parameter.
    ξ_i = 1 - y_i (w^T x_i + b)    if  y_i (w^T x_i + b) < 1
    ξ_i = 0                        if  y_i (w^T x_i + b) ≥ 1
Since the objective is to minimize the ξ_i, equality holds, i.e.
    ξ_i = max(1 - y_i (w^T x_i + b), 0)
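A minimal sketch of this slack/hinge computation, reproducing the worked values above (0.5 → ξ = 0.5, -3 → ξ = 4):

import numpy as np

scores = np.array([0.5, -3.0, 2.0])          # hypothetical values of y_i (w^T x_i + b)
xi = np.maximum(1 - scores, 0)               # xi_i = max(1 - y_i (w^T x_i + b), 0)
print(xi)                                    # [0.5, 4.0, 0.0]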
Unconstrained objective:
    min_{w, b}  w^T w  +  C Σ_i max(1 - y_i (w^T x_i + b), 0)
                (regularizer)      (hinge loss)
Gradient descent can be used.
(Plot: hinge-loss cost vs. y_i (w^T x_i + b), the signed distance from the boundary.)
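A hedged sketch of (sub-)gradient descent on this unconstrained objective; the toy data, step size and C below are arbitrary choices:

import numpy as np

rng = np.random.default_rng(0)
n = 40
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]     # two noisy clusters (toy data)

w, b = np.zeros(2), 0.0
C, lr = 1.0, 0.01
for _ in range(500):
    scores = y * (X @ w + b)
    active = scores < 1                            # points with non-zero hinge loss
    # sub-gradient of  w^T w + C * sum_i max(1 - y_i (w^T x_i + b), 0)
    grad_w = 2 * w - C * (y[active, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b, np.mean(np.sign(X @ w + b) == y))      # training accuracy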
Kernel SVM:
w can be represented as a linear combination of the training samples, i.e.
    w = Σ_i α_i x_i        (the labels y_i are either +1 or -1)
Proof (intuitively):
For convex functions, it doesn't matter where we start. Start with w^(0) = 0.
At the t-th iteration of gradient descent (step size η):
    w^(t+1) = w^(t) - η ∂L/∂w
For squared loss,
    L(w) = Σ_{i=1}^{n} (w^T x_i - y_i)²
    ∂L/∂w = Σ_{i=1}^{n} 2 (w^T x_i - y_i) x_i = Σ_i g_i x_i,    with g_i = 2 (w^T x_i - y_i)
By induction, assume w^(t) = Σ_i β_i x_i, i.e. the claim holds for the t-th iteration. Then
    w^(t+1) = w^(t) - η ∂L/∂w = Σ_i β_i x_i - η Σ_i g_i x_i = Σ_i (β_i - η g_i) x_i
which is still a linear combination of the x_i. Hence proved.
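A small numeric illustration of this induction argument (squared loss, toy data): after gradient descent the learned w can indeed be written as Σ_i β_i x_i, with β recovered here by least squares:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 5))        # 3 training points in 5-d, so span{x_i} is a strict subspace
y = rng.normal(size=3)

w = np.zeros(5)                    # start at w = 0 (a combination with all beta_i = 0)
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y)   # dL/dw = sum_i 2 (w^T x_i - y_i) x_i
    w -= 0.01 * grad

beta, *_ = np.linalg.lstsq(X.T, w, rcond=None)   # solve  w = sum_i beta_i x_i
print(np.allclose(X.T @ beta, w))                # True: w stays in the span of the x_i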
Let's now rewrite the SVM objective function using w = Σ_i β_i x_i:
    min_{w, b}  w^T w + C Σ_i ξ_i    s.t.    y_i (w^T x_i + b) ≥ 1 - ξ_i
    w^T w = (Σ_i β_i x_i)^T (Σ_j β_j x_j) = Σ_i Σ_j β_i β_j x_i^T x_j        ← inner products of vectors [x_i, x_j]
    y_i (w^T x_i + b) ≥ 1 - ξ_i   becomes   y_i (Σ_j β_j x_j^T x_i + b) ≥ 1 - ξ_i
In the objective function there are no w's: everything is in terms of inner products of training examples. So, instead of learning w (which can be high dimensional), we now have to learn a β_i for each training example.
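Assuming w = Σ_j β_j x_j, a minimal sketch of making predictions purely through inner products (β, b and the data are made-up values):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])   # training inputs (hypothetical)
beta = np.array([0.5, -0.2, 0.3])                      # learned coefficients (hypothetical)
b = 0.1

def predict(x_new):
    # w^T x_new + b = sum_j beta_j <x_j, x_new> + b    -- no explicit w needed
    return np.sign(beta @ (X @ x_new) + b)

print(predict(np.array([1.5, 1.5])))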
Inner products in a high-dimensional feature space: map x to φ(x) and compute φ(x)^T φ(z).
Example feature map, for x = (x^(1), ..., x^(d)): all products of subsets of the coordinates,
    φ(x) = (1, x^(1), x^(2), ..., x^(d), x^(1)x^(2), x^(1)x^(3), ..., x^(1)x^(2)···x^(d))        (2^d dimensions)
Then
    φ(x)^T φ(z) = 1 + x^(1)z^(1) + x^(2)z^(2) + ... + x^(1)x^(2)z^(1)z^(2) + ... + x^(1)···x^(d)·z^(1)···z^(d)
                = (1 + x^(1)z^(1)) (1 + x^(2)z^(2)) ··· (1 + x^(d)z^(d))
which can be computed in O(d) instead of O(2^d).
Kernel function, with a training set:
    K_ij = φ(x_i)^T φ(x_j)
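A quick numerical check of the O(d) shortcut for one entry φ(x)^T φ(z) (arbitrary 3-d vectors; phi is the subset-products feature map described above):

import numpy as np
from itertools import combinations

def phi(x):
    # all products of subsets of the coordinates: 2^d features (1, x1, ..., x1*x2, ...)
    d = len(x)
    feats = [1.0]
    for k in range(1, d + 1):
        for idx in combinations(range(d), k):
            feats.append(np.prod(x[list(idx)]))
    return np.array(feats)

x = np.array([2.0, 3.0, 0.5])          # hypothetical inputs
z = np.array([1.0, -1.0, 4.0])

print(phi(x) @ phi(z))                 # explicit 2^d-dimensional inner product
print(np.prod(1 + x * z))              # same value in O(d): prod_k (1 + x^(k) z^(k))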
SVM interactive demo:
https://greitemann.dev/svm-demo
Kernels:
We need to convert x ∈ ℝ^d to x' ∈ ℝ^D with D ≫ d, and computing in ℝ^D explicitly is hard (almost impossible for very large D).
Solution: find how each data point compares to another data point → dot product. Then we don't need to compute the D-dimensional representation explicitly.
Popular kernels:
1) Linear:  K(x, z) = x^T z        (same as linear SVM)
2) Polynomial:  K(x, z) = (x^T z + p)^d
3) Radial Basis Function (RBF):  K(x, z) = e^{-||x - z||² / (2σ²)}  or  e^{-γ ||x - z||²}
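A hedged numpy version of the three kernels listed above (the values of p, d and σ are arbitrary defaults, not prescribed by the notes):

import numpy as np

def k_linear(x, z):
    return x @ z

def k_poly(x, z, p=1.0, d=2):
    return (x @ z + p) ** d

def k_rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(k_linear(x, z), k_poly(x, z), k_rbf(x, z))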
Polynomial kernel:  k(x, z) = (x^T z + p)^d,  where p is typically 0 or 1.
Let us assume p = 1, d = 2, and take x and z to be 1-d:
    (xz + 1)² = (xz + 1)(xz + 1)
              = x²z² + 2xz + 1
              = (x², √2·x, 1) · (z², √2·z, 1)        ← dot product
This transformation converts the original 1-d input to 3-d: x → (x², √2·x, 1). (The third component is a constant, so not useful here.)
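A one-line numerical check of this 1-d expansion (x and z are arbitrary scalars):

import numpy as np

x, z = 3.0, -2.0
lhs = (x * z + 1) ** 2
rhs = np.array([x**2, np.sqrt(2) * x, 1.0]) @ np.array([z**2, np.sqrt(2) * z, 1.0])
print(lhs, rhs)                        # both 25.0: the kernel equals a dot product in 3-d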
Check yourself: for (x^T z + 1)² with n-dimensional x and z (d = 2), what will the dimensions of the modified features be?
Radial Basis Function (RBF) kernel, also called the Gaussian kernel:
    k(x, z) = e^{-||x - z||² / (2σ²)} = e^{-γ ||x - z||²},   where we refer to 1/(2σ²) as γ.
RBF behaves like a weighted nearest-neighbour model: the closest observations (i.e. nearest neighbours) have a lot of influence on the classification of a new observation.
In RBF, each training point acts as a feature. Treat the training points as landmarks l_1, l_2, ... and build kernel features:
(Figure: data points with landmarks l_1, l_2 marked.)
    f_1 = φ(x, l_1) = e^{-||x - l_1||² / (2σ²)}
    f_i = φ(x, l_i) = e^{-||x - l_i||² / (2σ²)}        ← Gaussian kernel
If x is close to l_1, then f_1 ≈ e^0 ≈ 1 (target).
If x is very far away from l_1, then f_1 ≈ e^{-(large number)/(2σ²)} ≈ 0.
So f_1 measures proximity to l_1.
RBF is a universal approximator.
Choose the landmarks to be exactly equal to the training data points: l_i = x_i.
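A minimal sketch of these RBF features with the landmarks chosen as the training points themselves (toy points, arbitrary σ):

import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])   # training points = landmarks (toy)
sigma = 1.0

def rbf_features(x, landmarks, sigma):
    # f_i = exp(-||x - l_i||^2 / (2 sigma^2))
    sq_dists = np.sum((landmarks - x) ** 2, axis=1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

print(rbf_features(np.array([0.1, 0.1]), X, sigma))  # near l_1 -> f_1 close to 1; far landmarks -> close to 0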
σ²: denotes the spread of influence of each point's Gaussian.
(Sketch: wide vs. narrow Gaussian around a point.)
Large σ² (say σ² = 5): high bias, low variance.
Small σ²: low bias, high variance.
Proof that the RBF kernel projects to an infinite-dimensional space.
Hint: Taylor series expansion.
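A sketch of the hint for the 1-d case (x, z scalars):
    e^{-(x-z)²/(2σ²)} = e^{-x²/(2σ²)} · e^{-z²/(2σ²)} · e^{xz/σ²}
and by the Taylor series e^{xz/σ²} = Σ_{k=0}^{∞} x^k z^k / (σ^{2k} k!), so
    k(x, z) = φ(x)^T φ(z)   with   φ(x) = e^{-x²/(2σ²)} · (1, x/(σ√1!), x²/(σ²√2!), x³/(σ³√3!), ...)
an infinite-dimensional feature vector.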