Support Vector Machines
(Vapnik, 1979)
• Assume a binary classification problem.
– Instances are represented by vectors x = (f1, f2, …, fn) ∈ ℜn.
– Training examples:
S = {(x1, y1), (x2, y2), ..., (xm, ym) | (xi, yi) ∈ ℜn × {+1, -1}}
– Hypothesis: A function h: ℜn→{+1, -1}.
h(x) = h(f1, f2, …, fn) ∈{+1, -1}
• Here, assume positive and negative instances are to be
separated by the hyperplane
w⋅ x + b = 0
Equation of the line (in the two-feature case):
w ⋅ x + b = w1 f1 + w2 f2 + b = 0
[Figure: positive (+) and negative (-) examples in the (f1, f2) plane, separated by the line w ⋅ x + b = 0.]
• Intuition: the best hyperplane (for future generalization)
will “maximally” separate the examples
Definition of Margin
• The margin of the positive examples, d+, with respect to
that hyperplane, is the shortest distance from a positive
example to the hyperplane:
[Figure: d+ shown as the distance from the closest positive example to the separating line in the (f1, f2) plane.]
• The margin of the negative examples, d-, with respect to
that hyperplane, is the shortest distance from a negative
example to the hyperplane:
[Figure: d- and d+ shown as the distances from the closest negative and positive examples to the separating line in the (f1, f2) plane.]
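As a concrete illustration of d+ and d- (the data points and the hyperplane below are made up, and the hyperplane is not necessarily the optimal one), distances can be computed with the standard point-to-hyperplane formula |w ⋅ x + b| / ||w||:

```python
# d+ and d- are the minima, over the positive and negative examples,
# of the distance |w.x + b| / ||w|| to the hyperplane w.x + b = 0.
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0            # some separating hyperplane
X_pos = np.array([[3.0, 2.0], [4.0, 4.0]])   # hypothetical positive examples
X_neg = np.array([[0.0, 0.0], [1.0, 1.0]])   # hypothetical negative examples

def distance_to_hyperplane(X):
    return np.abs(X @ w + b) / np.linalg.norm(w)

d_plus = distance_to_hyperplane(X_pos).min()    # margin of the positives
d_minus = distance_to_hyperplane(X_neg).min()   # margin of the negatives
print(d_plus, d_minus, d_plus + d_minus)        # last value: margin of S
```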
• The margin of the training set S with respect to the
hyperplane is d+ + d- .
[Figure: d+ and d- for the separating line in the (f1, f2) plane.]
• Vapnik showed that the hyperplane maximizing the margin of S
will have minimal VC dimension in the set of all consistent
hyperplanes, and will thus be optimal.
This is an optimization
problem!
• Note that the hyperplane is defined as
w ⋅ x + b = 0
• To make the math easier, we will rescale w and b such that
the hyperplane is halfway in between the closest positive
and negative examples, and such that w ⋅ x + b = +1 for the
closest positive examples and w ⋅ x + b = -1 for the closest
negative examples.
From M. A. Hearst et al. paper (on class web page)
• In this case, the margin is 2 / ||w||.
• So to maximize the margin, we need to minimize ||w||.
Minimizing ||w||
Find w and b by doing the following minimization:
minimize (1/2) ||w||²
subject to yi (w ⋅ xi + b) ≥ 1, for i = 1, …, m
This is a quadratic optimization problem. Use “standard
optimization tools” to solve it.
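For instance, this minimization can be handed to a generic constrained optimizer. The sketch below uses SciPy's SLSQP solver on a tiny made-up 2-D training set; both the data and the particular solver are illustrative choices, not the only "standard optimization tools" that would work.

```python
# Hard-margin SVM primal problem on a toy linearly separable data set:
# minimize (1/2)||w||^2 subject to y_i (w . x_i + b) >= 1 for all i.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])  # made-up examples
y = np.array([+1.0, +1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)                 # (1/2)||w||^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1.0              # must be >= 0 for every example

res = minimize(objective,
               x0=np.array([1.0, 1.0, -3.0]),  # a feasible starting guess
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])
w, b = res.x[:2], res.x[2]
print("w =", w, " b =", b)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(w))
```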
• Dual formulation: It turns out that w can be expressed as
a linear combination of a small subset of the training
examples: those that lie exactly on the margin (minimum
distance to the hyperplane):
w = Σi αi yi xi
where the sum is over the training examples xi that lie exactly on the margin.
• These training examples are called “support vectors”.
They carry all relevant information about the classification
problem.
• The result of the SVM training algorithm (involving
solving a quadratic programming problem) is the αi’s and
the xi’s.
• For a new example x, we can now classify x using the
support vectors:
h(x) = sgn( Σi αi yi (xi ⋅ x) + b )
• This is the resulting SVM classifier.
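A minimal sketch of such a classifier, assuming the αi’s, the support vectors, their labels, and the threshold b have already come out of the quadratic-programming step (the numeric values below are hypothetical placeholders, not real training output):

```python
# Classify a new example x as sgn( sum_i alpha_i y_i (x_i . x) + b ),
# where the sum runs over the support vectors only.
import numpy as np

alphas = np.array([0.5, 0.5])               # hypothetical coefficients
support_vectors = np.array([[2.0, 2.0],     # hypothetical support vectors
                            [1.0, 0.0]])
sv_labels = np.array([+1.0, -1.0])
b = -1.5                                    # hypothetical threshold

def classify(x):
    return int(np.sign(np.sum(alphas * sv_labels * (support_vectors @ x)) + b))

print(classify(np.array([3.0, 2.0])))   # falls on the +1 side
print(classify(np.array([0.0, 1.0])))   # falls on the -1 side
```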
Non-linearly separable training examples
• What if the training examples are not linearly separable?
• Use an old trick: find a function that maps points to a higher
dimensional space (“feature space”) in which they are
linearly separable, and do the classification in that higher-
dimensional space.
Need to find a function Φ that will perform such a mapping:
Φ: ℜn→ F
Then can find hyperplane in higher dimensional feature space
F, and do classification using that hyperplane in higher
dimensional space.
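A toy illustration of the trick, with a made-up 1-D data set and the simple map Φ(x) = (x, x²): the points are not linearly separable on the line, but become separable in the plane:

```python
# One-dimensional points that cannot be split by a single threshold
# become linearly separable after mapping x -> (x, x^2).
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([+1, -1, -1, +1])          # +1 when |x| is large, -1 when small

phi = np.column_stack([x, x**2])        # Phi: R^1 -> R^2
# In the feature space, the horizontal line x^2 = 1 separates the classes:
print((phi[:, 1] > 1.0) == (y == +1))   # [ True  True  True  True ]
```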
• Problem:
– Recall that classification of instance x is expressed in
terms of dot products of x and support vectors.
– The quadratic programming problem of finding the
support vectors and coefficients also depends only on
dot products between training examples, rather than on
the training examples outside of dot products.
– So if each xi is replaced by Φ(xi) in these procedures,
we will have to calculate a lot of dot products,
Φ(xi)⋅ Φ(xj)
– But in general, if the feature space F is high
dimensional, Φ(xi) ⋅ Φ(xj) will be expensive to
compute.
– Also Φ(x) can be expensive to compute
• Second trick:
– Suppose that there were some magic function,
k(xi, xj) = Φ(xi) ⋅ Φ(xj)
such that k is cheap to compute even though Φ(xi) ⋅ Φ
(xj) is expensive to compute.
– Then we wouldn’t need to compute the dot product
directly; we’d just need to compute k during both the
training and testing phases.
– The good news is: such k functions exist! They are
called “kernel functions”, and come from the theory of
integral operators.
Example: Polynomial kernel:
Suppose x = (x1, x2) and y = (y1, y2). Then
k(x, y) = (x ⋅ y)² = (x1 y1 + x2 y2)²
= x1² y1² + 2 x1 x2 y1 y2 + x2² y2²
= Φ(x) ⋅ Φ(y),
where Φ(x) = (x1², √2 x1 x2, x2²).
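A quick numeric check of this identity, using arbitrary example vectors:

```python
# Verify that the degree-2 polynomial kernel k(x, y) = (x . y)^2 equals
# the dot product of the explicit feature maps Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
import numpy as np

def k(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(k(x, y), np.dot(phi(x), phi(y)))   # equal (up to floating-point error)
```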
Recap of SVM algorithm
Given training set
S = {(x1, y1), (x2, y2), ..., (xm, ym) | (xi, yi) ∈ ℜn × {+1, -1}}
1. Choose a map Φ: ℜn→ F, which maps xi to a higher
dimensional feature space. (Solves problem that X might
not be linearly separable in original space.)
2. Choose a cheap-to-compute kernel function
k(x,z)=Φ(x) ⋅ Φ(z)
(Solves problem that in high dimensional spaces, dot
products are very expensive to compute.)
3. Map all the xi’s to feature space F by computing Φ(xi).
4. Apply the quadratic programming procedure (using the kernel
function k) to find a hyperplane (w, w0),
w = Σi αi yi Φ(xi)
where the Φ(xi)’s are support vectors, the αi’s are
coefficients, and w0 is a threshold, such that (w, w0) is the
hyperplane maximizing the margin of S in F.
• Now, given a new instance x, find the classification of x
by computing
h(x) = sgn( w ⋅ Φ(x) + w0 ) = sgn( Σi αi yi k(xi, x) + w0 )
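The whole recipe can be sketched with scikit-learn’s SVC, which applies the kernel and solves the quadratic program internally; the tiny data set and the degree-2 polynomial kernel below are illustrative choices:

```python
# Train a kernel SVM on a toy data set that is not linearly separable
# in the original space, then classify new instances.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([-1, -1, +1, +1])

clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6)   # large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)                    # the support vectors found by the QP
print(clf.predict([[2.0, 2.0], [0.5, 0.5]]))   # classify new instances
```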
Demo:
Spam classification using SVMs
Example:
Applying SVMs to text classification
(Dumais et al., 1998)
• Used Reuters collection of news stories.
• Question: Is a particular news story in the category
“grain” (i.e., about grain, grain prices, etc.)?
• Training examples: Vectors of features such as appearance
or frequency of key words. (Similar to our spam-
classification task.)
• Resulting SVM: weight vector defined in terms of support
vectors and coefficients, plus threshold.
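A rough sketch of this kind of setup using scikit-learn; the four “stories” and their labels below are invented placeholders, not the actual Reuters data:

```python
# Turn documents into term-weight feature vectors and train a linear SVM
# to decide whether a story is in the "grain" category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["wheat and corn prices rose sharply",
        "grain exports expected to increase this year",
        "the central bank raised interest rates",
        "new smartphone model released today"]
labels = [+1, +1, -1, -1]                  # +1 = "grain", -1 = not "grain"

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["corn and grain futures climbed"]))
```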
Results
Precision / Recall
• Confusion matrix for a classifier:

                        Classified Positive      Classified Negative
Positive Examples       True positives (TP)      False negatives (FN)
Negative Examples       False positives (FP)     True negatives (TN)
Some performance measures
• Accuracy: proportion of classifications, over all the N
examples, that were correct:
Accuracy = (TP + TN) / N
• Recall (or true positive rate, or “detection rate”):
proportion of positive examples that were classified
correctly:
Recall = TP / (TP + FN)
• Precision: proportion of correct positive classifications
over all positive classifications:
Precision = TP / (TP + FP)
Example
Test data Correct Classification Model’s Classification
x1 T T
x2 T F
x3 F T
x4 F F
x5 F T
x6 F F
x7 F F
x8 F T
Accuracy = (TP + TN) / N = (1 + 3) / 8 = 0.5
Recall = TP / (TP + FN) = 1 / 2 = 0.5
Precision = TP / (TP + FP) = 1 / 4 = 0.25
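The same numbers can be checked mechanically from the table (T mapped to 1, F to 0):

```python
# The eight (correct, predicted) pairs from the example above.
true      = [1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(true, predicted))
fn = sum(t == 1 and p == 0 for t, p in zip(true, predicted))
fp = sum(t == 0 and p == 1 for t, p in zip(true, predicted))
tn = sum(t == 0 and p == 0 for t, p in zip(true, predicted))

print("accuracy  =", (tp + tn) / len(true))   # 0.5
print("recall    =", tp / (tp + fn))          # 0.5
print("precision =", tp / (tp + fp))          # 0.25
```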
Interpretation of precision and recall
• Precision and recall are often plotted against one another,
especially in “detection” applications (such as spam
detection), when positive examples are sparse in the
observed data.
• Recall: How often did the system correctly identify
positive examples when it encountered them?
• Precision: How often did the system get positive
classifications correct?
• How do these two measures trade off against one another?
Precision / Recall Curves
SVMs also used in Watson for question classification
e.g., see Moschitti et al., Using syntactic and semantic
structural kernels for classifying definition questions in
Jeopardy!
SVM Demo