SYSC4415
Introduction to Machine Learning
Lecture 5
Prof James Green
jrgreen@sce.carleton.ca
Systems and Computer Engineering, Carleton University
Learning Objectives for Lecture 5
• Understand how SVM and KNN models classify data.
• Understand how to train an SVM model from labelled data.
• Introduce concept of SVM kernel to achieve nonlinear decision boundary.
• Define a "support vector"
• Be able to classify data by hand using a KNN classifier.
Pre-lecture Assignment:
• Read Section 3.4-3.5 of the 100pMLB
In-class activities:
• Poll Everywhere review questions
• Review fundamental equations governing SVM
• Discuss impact of kernel function
• Implement a KNN classifier by hand using sample data
Key terms
• Hinge loss function, kernel trick, kernel functions (kernels), RBF
kernel, Euclidean distance, k-Nearest Neighbors, cosine similarity,
support vector.
Support Vector Machines - Review
• Review of support vector machines (SVM)
Note that 100pMLbook doesn’t
show dot product
f(x) = sign(wᵀx - b) = sign(w·x - b)
Learn w, b such that: y_i(w·x_i - b) ≥ 1 for every training example, while minimizing ||w||
How is this equivalent to that?
Why minimize ||w||?
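As a quick check of the review above, here is a minimal sketch (assuming NumPy and scikit-learn are available; the toy data points are made up for illustration) that fits a linear SVM and confirms its predictions match sign(w·x - b):
```python
# Minimal sketch: fit a linear SVM and verify f(x) = sign(w.x - b).
# Assumes scikit-learn and NumPy are installed; the toy data is made up.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters with labels -1 and +1
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [5.0, 5.0], [6.0, 5.5], [5.5, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]          # learned weight vector w
b = -clf.intercept_[0]    # scikit-learn's decision function is w.x + intercept, so b = -intercept

f = np.sign(X @ w - b)       # f(x) = sign(w.x - b)
print(f)                     # matches clf.predict(X): [-1 -1 -1  1  1  1]
print(clf.predict(X))
print(clf.support_vectors_)  # the support vectors are the points that define the margin
```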
Support Vector Machines – new concepts
• Data can be linearly inseparable due to 1) noise or 2) inherent structure
• Hinge loss function: max(0, 1 - y_i(w·x_i - b))
  • = 0 if the prediction is right (x_i lies on the correct side of the margin)
  • > 0 if the prediction is wrong
• Cost function (L2 regularization + hinge loss):
  C||w||^2 + (1/N) Σ_i max(0, 1 - y_i(w·x_i - b))
• Hyperparameter C determines the trade-off between increasing the size of the decision boundary and correct classification of the training data; it regulates empirical risk
  • Large C → width of margin matters most
  • Small C → classifying the training data matters more, generalization suffers => TRADE-OFF
• Soft-margin SVM: optimizes the hinge-loss cost function
• Hard-margin SVM: no hinge loss
• Inherent non-linearity: implicitly transform the original space into a higher dimension during the cost function optimization. This is called the kernel trick.
1) Noisy/mislabelled data
• Use hinge loss function to allow for incorrectly classified training data
• Hinge loss: max(0, 1 - y_i(w·x_i - b))
  • For correctly classified samples (beyond the margin), 1 - y_i(w·x_i - b) < 0, so the max returns 0: these points contribute nothing to the cost
  • For incorrectly classified samples, the penalty is positive and grows with distance from the margin, so misclassified points dominate the cost
• Minimize ||w|| to maximize the margin
• Minimize the hinge loss, i.e. for each misclassified sample, minimize its distance from the decision margin
• Large C → width of margin matters most; small C → classifying the training data matters more
• Hard vs. soft margin SVM
• C hyperparameter
• Either way, solve the 'optimization under constraints' problem using Lagrange multipliers and then quadratic programming
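To see how the hinge loss behaves, here is a minimal sketch (NumPy assumed; the weights, bias and sample points are made-up illustration values) that evaluates the soft-margin cost C||w||^2 + (1/N) Σ max(0, 1 - y_i(w·x_i - b)) for a fixed w and b:
```python
# Minimal sketch: evaluate the soft-margin SVM cost for fixed w, b.
# All numbers below are made up for illustration.
import numpy as np

X = np.array([[3.0, 3.0],     # correctly classified, beyond the margin
              [1.2, 1.2],     # correctly classified, but inside the margin
              [-1.0, -1.0]])  # misclassified
y = np.array([1, 1, 1])
w = np.array([1.0, 1.0])
b = 2.0
C = 0.1

margins = y * (X @ w - b)              # y_i * (w.x_i - b)
hinge = np.maximum(0.0, 1.0 - margins)
print(hinge)   # [0.  0.6 5. ]  -> the point beyond the margin contributes nothing

cost = C * np.dot(w, w) + hinge.mean()
print(cost)    # ~2.07 = 0.1*||w||^2 + average hinge loss
```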
2) Inherent structure of data
• Can introduce nonlinear mapping to map data to higher dimensional
space where data become linearly separable
• “Kernel trick” (computational trick)
• It turns out that we only need to compute dot products between pairs of
vectors (training/testing instances) in the new space, and not the actual
transformed vectors explicitly
• Therefore, define a “kernel function” to bypass mapping:
From: http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
For x, y ∈ R^N: K(x, y) = ⟨φ(x), φ(y)⟩,
where ⟨·, ·⟩ is an inner product in R^M, M > N,
and φ transforms R^N to R^M: φ: R^N → R^M
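A minimal sketch of this idea, using the degree-2 polynomial kernel K(x, y) = (x·y)^2 on R^2 and the standard explicit feature map φ(x) = (x1^2, √2·x1·x2, x2^2); this particular φ is assumed here for illustration rather than taken from the slide:
```python
# Minimal sketch of the kernel trick for K(x, y) = (x.y)^2 on R^2,
# whose explicit feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) in R^3 (M=3 > N=2).
import numpy as np

def phi(x):
    """Explicit mapping R^2 -> R^3 (never needed in practice)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def K(x, y):
    """Kernel: the same inner product, computed directly in R^2."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(np.dot(phi(x), phi(y)))  # 121.0 -> inner product in the transformed space
print(K(x, y))                 # 121.0 -> same value, without ever computing phi
```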
2) Inherent structure of data (example)
From http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
2) Inherent structure of data (example cont'd)
From http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html
2) Inherent structure of data
Common kernels:
1) Linear kernel: K(x, x') = x·x'
No mapping to higher dimensional space…
2) Polynomial kernel: K(x, x') = (x·x' + 1)^d (dth-order polynomial)
3) Radial basis function kernel: K(x, x') = exp(-||x - x'||^2 / (2σ^2))
• ||x - x'||^2 is the Euclidean distance (squared)
• Infinite dimensions, but chance of overfitting only depends on # of support vectors
• Varying σ controls smooth/curvy nature of decision boundary
• For the math lovers, https://www.youtube.com/watch?v=_PwhiWxHK8o shows a near-intuitive
derivation of SVMs (50 min MIT grad lecture)
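For illustration, a minimal sketch (assuming scikit-learn; the dataset parameters are made up) comparing a linear kernel and an RBF kernel on data that is not linearly separable in the original 2-D space:
```python
# Minimal sketch: linear vs. RBF kernel SVM on an inner cluster surrounded by a ring.
# Assumes scikit-learn is installed; dataset parameters are made up for illustration.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=2.0).fit(X, y)  # gamma plays the role of 1/(2*sigma^2)

print("linear kernel accuracy:", linear_svm.score(X, y))  # ~0.5, no linear boundary works
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # ~1.0, nonlinear boundary found
print("support vectors per class:", rbf_svm.n_support_)
```
Increasing gamma (i.e. shrinking σ) makes the decision boundary curvier; decreasing it makes the boundary smoother.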
K-Nearest Neighbour classifier
• Non-parametric
• Vote among classes of K nearest training points to test point
• Hyperparameters:
• K, # neighbours to examine
• Choice of distance metric
• City block distance, Euclidean distance, Mahalanobis distance, etc.
• Cosine similarity often used for comparing two documents (binary feature vectors):
cos(x, x') = (x·x') / (||x|| ||x'||)
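A minimal sketch of the cosine similarity computation for two made-up binary bag-of-words vectors (NumPy assumed):
```python
# Minimal sketch: cosine similarity between two binary document vectors.
# The example vectors are made up for illustration.
import numpy as np

def cosine_similarity(a, b):
    """cos(a, b) = (a.b) / (||a|| * ||b||)"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([1, 1, 0, 1, 0])  # which vocabulary words appear in document 1
doc2 = np.array([1, 0, 0, 1, 1])  # which vocabulary words appear in document 2

print(cosine_similarity(doc1, doc2))  # 2 / (sqrt(3)*sqrt(3)) = 0.666...
```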
K-Nearest Neighbour Example
• Consider the following table of training data:
Weight Length Fish Type
4 19 Salmon
6 17 Salmon
8 15 Salmon
8 12 Sea Bass
7 13 Sea Bass
9 15 Sea Bass
• For K=3, use city-block distance to classify the following test point:
• Weight = 8, length = 12
[Hand-drawn plot: weight vs. length scatter of the Salmon and Sea Bass training points, plus the unknown test point at weight = 8, length = 12]
• Among the K = 3 nearest neighbours: 2 neighbours = Sea Bass, 1 neighbour = Salmon
• Majority vote → classify the unknown fish as Sea Bass
• Note: if K equals the number of training points, you don't have to compute any distances at all; the vote always returns the majority class (e.g. if there are more apples than oranges in the training set, the prediction will always be apple)
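A minimal sketch (plain Python) that reproduces this example: compute city-block distances from the test point (weight = 8, length = 12), then take a majority vote among the K = 3 nearest training points:
```python
# Minimal sketch: K-NN by hand with city-block (L1) distance, using the table above.
from collections import Counter

train = [
    ((4, 19), "Salmon"),
    ((6, 17), "Salmon"),
    ((8, 15), "Salmon"),
    ((8, 12), "Sea Bass"),
    ((7, 13), "Sea Bass"),
    ((9, 15), "Sea Bass"),
]
test = (8, 12)
K = 3

def cityblock(a, b):
    """L1 (city-block) distance between two feature vectors."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

# Sort the training points by distance to the test point and vote among the K nearest
neighbours = sorted(train, key=lambda item: cityblock(item[0], test))[:K]
print(neighbours)  # distances 0, 2, 3 -> two Sea Bass, one Salmon
print(Counter(label for _, label in neighbours).most_common(1)[0][0])  # Sea Bass
```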