The document discusses Support Vector Machines (SVM) for binary classification, focusing on the concept of finding the best hyperplane that maximizes the margin between two classes. It covers the mathematical formulation of SVM, including constraints and the use of kernel functions to handle non-linearly separable data. Additionally, it highlights the importance of regularization and optimization techniques in SVM implementation.

Support Vector Machine (SVM)

Consider binary classification with labels y_i ∈ {-1, +1} and linearly separable data.

(Figure: points of the two classes, + and -, separated by a line in the (x^(1), x^(2)) plane.)

Separating hyperplane:  w^T x + b = 0

Classifier:  h(x_i) = sign(w^T x_i + b)

If y_i (w^T x_i + b) > 0, then x_i is correctly classified (y_i ∈ {-1, +1}).

Many hyperplanes exist which satisfy this criterion.

Best hyperplane (SVM): the hyperplane that maximizes the distance to the closest data points from both classes.
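As a quick illustration (not part of the notes), here is a minimal NumPy sketch of the decision rule h(x) = sign(w^T x + b) and of the correctness check y_i (w^T x_i + b) > 0; the toy data and the particular w, b are made-up values.

```python
import numpy as np

# Toy 2-d data with labels in {-1, +1}; w and b are assumed, made-up values
X = np.array([[2.0, 3.0], [3.0, 4.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])
b = -1.0

scores = X @ w + b          # w^T x_i + b for every point
h = np.sign(scores)         # predicted labels
correct = y * scores > 0    # y_i (w^T x_i + b) > 0  <=>  correctly classified

print(h, correct)
```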

(Figure: the margin γ in the (x^(1), x^(2)) plane.)

Maximize the margin (γ). To compute it, find the distance of a point x from the hyperplane H.

x, w, d, x_p are vectors here.
x_p = projection of x on the hyperplane H, and d = the vector of minimum length from x_p to x, so

  x_p = x - d

d is parallel to w, so d = a·w for some scalar a. Since x_p lies on H:

  w^T x_p + b = 0
  w^T (x - a·w) + b = 0
  w^T x - a·w^T w + b = 0
  a = (w^T x + b) / (w^T w)

Length of d:

  ||d|| = ||a·w|| = |a|·||w|| = |w^T x + b| / ||w||
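A small NumPy sketch of this distance formula (the point x and the parameters w, b below are illustrative values, not from the notes): it computes a and the projection x_p, and checks that ||d|| equals |w^T x + b| / ||w||.

```python
import numpy as np

w = np.array([2.0, 1.0])        # assumed hyperplane parameters
b = -3.0
x = np.array([4.0, 5.0])        # an arbitrary point

a = (w @ x + b) / (w @ w)       # a = (w^T x + b) / (w^T w)
d = a * w                       # d is parallel to w
x_p = x - d                     # projection of x onto the hyperplane

print(np.isclose(w @ x_p + b, 0.0))                            # x_p lies on H
print(np.linalg.norm(d), abs(w @ x + b) / np.linalg.norm(w))   # both equal the distance
```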

Margin γ(w, b) with respect to the data D:

  γ(w, b) = min_{x ∈ D} |w^T x + b| / ||w||_2        (distance to the closest data point)

The aim is to maximize γ(w, b):

  max_{w, b} γ(w, b)

with additional constraints s.t. the hyperplane should lie in between the data points.
Constraints:  y_i (w^T x_i + b) > 0 for all i,
i.e. w^T x_i + b > 0 for points with y_i = +1 and w^T x_i + b < 0 for points with y_i = -1
(and = 0 only at the hyperplane itself).

Scale invariance: the hyperplane w^T x + b = 0 does not change if (w, b) is rescaled by any c > 0, since c·(w^T x + b) = 0 describes the same hyperplane. So we can fix the scale by requiring

  min_i |w^T x_i + b| = 1,

which places the two margin hyperplanes at w^T x + b = +1 and w^T x + b = -1.

Combining both constraints:

  y_i (w^T x_i + b) ≥ 1    for all i

Objective function (making use of the constraints): with this scaling the margin becomes γ(w, b) = 1/||w||_2, so

  max_{w,b} 1/||w||_2  ≡  min_{w,b} ||w||  ≡  min_{w,b} w^T w

Objective:

  min_{w, b} w^T w      s.t.  y_i (w^T x_i + b) ≥ 1  for all i

This is a quadratic optimization problem, and a unique solution exists. The objective is to find the smallest w^T w such that all inputs are at least one unit away from the hyperplane, on the correct side.

Support vectors are the points which are closest to the hyperplane and exactly satisfy the constraint y_i (w^T x_i + b) = 1.
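A minimal sketch of this quadratic program using the cvxpy modelling library (the toy data is made up, and this assumes cvxpy with one of its default QP solvers is installed):

```python
import numpy as np
import cvxpy as cp

# Linearly separable toy data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# min w^T w  s.t.  y_i (w^T x_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print(w.value, b.value)
# Support vectors: points whose margin y_i (w^T x_i + b) is numerically equal to 1
margins = y * (X @ w.value + b.value)
print(np.isclose(margins, 1.0, atol=1e-4))
```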
SVM with soft constraints:

Penalize mistakes, but still allow them. If we want to relax y_i (w^T x_i + b) ≥ 1 for some points, introduce slack variables ξ_i:

  y_i (w^T x_i + b) ≥ 1 - ξ_i,     ξ_i ≥ 0

Let's say y_i (w^T x_i + b) = 0.5. Then ξ_i = 0.5 to satisfy the constraint.
In general, if 0 < y_i (w^T x_i + b) < 1, then 0 < ξ_i < 1 suffices: the point is inside the margin.
If y_i (w^T x_i + b) < 0, the point is in the opposite (wrong) region, and ξ_i > 1 is needed.

Soft-margin objective:

  min_{w, b}  w^T w + C Σ_i ξ_i
  s.t.  y_i (w^T x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0

Each ξ_i should be as small as possible.

C large: all points should try to be on the correct side of the hyperplane.
C small: allowance is given for a few points to be on the incorrect side.

C is a hyper-parameter.

Since the objective is to minimize the ξ_i, equality holds at the optimum:

  ξ_i = 1 - y_i (w^T x_i + b)    if  y_i (w^T x_i + b) < 1
  ξ_i = 0                        if  y_i (w^T x_i + b) ≥ 1

i.e.  ξ_i = max(1 - y_i (w^T x_i + b), 0).

Substituting this back gives an unconstrained objective:

  min_{w, b}  w^T w  +  C Σ_i max(1 - y_i (w^T x_i + b), 0)
            (regularizer)          (hinge loss)

Gradient descent can be used.

(Figure: hinge-loss cost vs. distance from the boundary.)
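A minimal sketch of gradient (strictly, subgradient) descent on this regularizer-plus-hinge-loss objective; the toy data, learning rate and iteration count are made up for illustration:

```python
import numpy as np

# Toy data, labels in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
C, lr, steps = 1.0, 0.01, 2000

w = np.zeros(X.shape[1])
b = 0.0
for _ in range(steps):
    margins = y * (X @ w + b)
    active = margins < 1                    # points with non-zero hinge loss
    # Subgradient of  w^T w + C * sum_i max(1 - y_i (w^T x_i + b), 0)
    grad_w = 2 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
    grad_b = -C * y[active].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b, y * (X @ w + b))   # most margins should end up >= 1 - xi_i
```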


Kernel SVM:

w can be represented as a linear combination of the training samples, i.e.

  w = Σ_i β_i x_i        (each y_i is either +1 or -1)

Proof (intuitively): for convex functions it doesn't matter where we start, so start with w = 0. At iteration t+1 of gradient descent:

  w^{t+1} = w^t - α ∂L/∂w

For squared loss, L(w) = (1/m) Σ_i (w^T x_i - y_i)^2, so

  ∂L/∂w = (2/m) Σ_i (w^T x_i - y_i) x_i = Σ_i γ_i x_i

By induction, assume w^t = Σ_i β_i x_i, i.e. the claim holds for the t-th iteration. Then

  w^{t+1} = Σ_i β_i x_i - α Σ_i γ_i x_i = Σ_i (β_i - α γ_i) x_i,

still a linear combination of the x_i. Hence proved.
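The induction above can also be checked numerically; below is a small sketch (made-up data, squared loss, plain gradient descent) that tracks the coefficients β alongside w and confirms that w stays equal to Σ_i β_i x_i at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))            # made-up training inputs
y = rng.choice([-1.0, 1.0], size=10)    # labels in {-1, +1}
alpha, steps = 0.01, 100

m = len(X)
w = np.zeros(X.shape[1])                # start with w = 0
beta = np.zeros(m)                      # coefficients of the linear combination

for _ in range(steps):
    gamma = (2.0 / m) * (X @ w - y)     # per-example factors gamma_i
    w = w - alpha * X.T @ gamma         # w^{t+1} = w^t - alpha * sum_i gamma_i x_i
    beta = beta - alpha * gamma         # beta_i^{t+1} = beta_i^t - alpha * gamma_i

print(np.allclose(w, X.T @ beta))       # w is still sum_i beta_i x_i -> True
```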

Let's now rewrite the SVM objective function using w = Σ_j β_j x_j:

  min_{β, b}  Σ_i Σ_j β_i β_j (x_i^T x_j)      s.t.  y_i ( Σ_j β_j (x_j^T x_i) + b ) ≥ 1

where x_i^T x_j is the inner product of the vectors x_i and x_j.

So in the objective function there are no w's, only inner products: everything is in terms of inner products of training examples. We now have to learn a β_i for each training example instead of w, which can be high dimensional. The remaining cost is the inner product, which may live in a high-dimensional feature space.
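A sketch of what "everything in terms of inner products" looks like in code: with a Gram matrix G of pairwise inner products, the margins y_i (Σ_j β_j x_j^T x_i + b) need only β and G, never w itself. The data and β values below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                 # made-up training inputs
y = rng.choice([-1.0, 1.0], size=6)
beta = rng.normal(size=6)                   # made-up coefficients, w = sum_j beta_j x_j
b = 0.1

G = X @ X.T                                 # Gram matrix: G[i, j] = x_i^T x_j
margins_kernel = y * (G @ beta + b)         # uses only inner products

w = X.T @ beta                              # the equivalent explicit w
margins_primal = y * (X @ w + b)
print(np.allclose(margins_kernel, margins_primal))   # True
```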

For example, map x = (x^(1), ..., x^(d)) to the feature vector of all products of subsets of its coordinates:

  φ(x) = ( 1, x^(1), x^(2), ..., x^(1) x^(2), ..., x^(1) x^(2) ··· x^(d) )

Then

  φ(x)^T φ(z) = 1 + x^(1) z^(1) + x^(2) z^(2) + ... + x^(1) x^(2) z^(1) z^(2) + ... + x^(1) ··· x^(d) z^(1) ··· z^(d)
              = (1 + x^(1) z^(1)) (1 + x^(2) z^(2)) ··· (1 + x^(d) z^(d)),

which can be computed in O(d) instead of O(2^d). Such a function is a kernel function: e.g., over a training set,

  K_ij = φ(x_i)^T φ(x_j).

SVM interactive demo:

https://greitemann.dev/svm-demo
For kernel SVM we would need to convert the original x (low dimensional) to φ(x) (high dimensional), which is hard (almost impossible) to do explicitly.

Solution: find how each point compares to another data point, via the dot product. Then we don't need to compute φ(x) at all.


Popular kernels:

1) Linear:              K(x, z) = x^T z                (same as linear SVM)
2) Polynomial:          K(x, z) = (x^T z + p)^d
3) Radial Basis Func.:  K(x, z) = e^{-||x - z||^2 / σ^2}   or   e^{-γ ||x - z||^2}
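A minimal sketch of these three kernels as NumPy functions (the parameter names p, d, gamma follow the notes; the test vectors are made up):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=1.0, d=2):
    return (x @ z + p) ** d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```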
Polynomial kernel:

  K(x, z) = (x^T z + p)^d,    where p is typically 0 or 1.

Let us assume p = 1 and d = 2, and take x and z to be 1-d:

  (xz + 1)^2 = (xz + 1)(xz + 1) = x^2 z^2 + 2xz + 1
             = (x^2, √2·x, 1) · (z^2, √2·z, 1)          (a dot product)

This transformation converts the original 1-d input to 3-d: (x^2, √2·x, 1). The 1 is a constant, so it is not a useful feature here.

Check yourself for (x^T z + 1)^2 with n = 2, d = 2: what will the dimensions of the modified feature space be?
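A quick numeric check of the 1-d expansion above: the kernel value (xz + 1)^2 matches the dot product of the explicit 3-d features (x^2, √2·x, 1). The scalars x and z are arbitrary.

```python
import numpy as np

x, z = 3.0, -2.0
kernel_value = (x * z + 1.0) ** 2

phi_x = np.array([x**2, np.sqrt(2) * x, 1.0])   # explicit 3-d feature map
phi_z = np.array([z**2, np.sqrt(2) * z, 1.0])

print(kernel_value, phi_x @ phi_z)              # both 25.0
```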

Radial Basis Function (RBF) kernel:

  K(x, z) = e^{-||x - z||^2 / (2σ^2)},

which we will also refer to via γ = 1/(2σ^2). Also called the Gaussian kernel.

RBF behaves like a weighted nearest-neighbour model: the closest observations (i.e. the nearest neighbours) have a lot of influence on the classification of a new observation.

In RBF, each training point becomes a feature. Treating the training points as landmarks l^(1), l^(2), ...:

  f_1 = φ(x, l^(1)) = e^{-||x - l^(1)||^2 / (2σ^2)}
  f_i = φ(x, l^(i)) = e^{-||x - l^(i)||^2 / (2σ^2)}        (Gaussian kernel)

If x is close to l^(1), then f_1 ≈ e^0 = 1.
If x is very far away from l^(1), then f_1 ≈ e^{-(large)/(2σ^2)} ≈ 0.

So each f_i measures proximity. RBF is a universal approximator.

Choose the landmarks exactly equal to the training data points x^(i).

σ^2 denotes the spread of influence of each point under the Gaussian:
  σ^2 large (say σ^2 = 5):  high bias, low variance.
  σ^2 small:                low bias, high variance.
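A short sketch using scikit-learn's SVC with the RBF kernel, which follows exactly this idea; its gamma parameter plays the role of 1/(2σ^2) in the notes. The toy data is made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (np.sum(X**2, axis=1) < 1.0).astype(int)    # a non-linearly-separable toy problem

sigma_sq = 5.0                                  # large sigma^2 -> smoother boundary (higher bias)
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma_sq), C=1.0)
clf.fit(X, y)

print(clf.support_vectors_.shape)               # the support vectors (closest / violating points)
print(clf.predict(np.array([[0.0, 0.0], [3.0, 3.0]])))
```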

Proof that the RBF kernel projects to an infinite-dimensional space. Hint: Taylor series expansion.

You might also like