Support Vector Machine
Shreya Pandey(192112318),Priyanka Gupta(192112324)
October 15, 2019
1 / 50
Contents
1 Introduction
2 Topic Description
3 Literature Review
4 Outcomes of Literature Review
5 Discussion
6 Conclusion
7 References
2 / 50
What is Support Vector Machine?
Support Vector Machine (SVM) is a supervised machine learning
algorithm which can be used for both classification or regression
challenges. However, it is mostly used in classification problems.
In this algorithm, we plot each data item as a point in n-dimensional
space (where n is number of features you have) with the value of each
feature being the value of a particular coordinate. Then, we perform
classification by finding the hyper-plane that differentiate the two
classes very well.
The SVM training algorithm builds a model on the basis of functional
margin.We want to develop a classifier so that it maximizes the
separation between the data points and the decision surface.So we
define margin. The distance from the decision surface to the closest
data point determines the margin of the classifier.
3 / 50
CONTD.
SVMs are so popular because they are used to prevent overfittting
and we can work with relatively large number of features without too
much computation.
Figure: 1 hyperplane
There could be various decision surfaces with different margin
widths.We choose that decision surface for which margin width is
highest. 4 / 50
Functional Margin
Let equation of hyperplane be → −
w .→
−x +b =0
w1 x1 + w2 x2 + b = 0 where x1 and x2 are features.
let (xi , yi ) be a point with respect to (w,b).
margin=distance of (xi , yi ) from decision boundary (w,b)
where →−
w is the vector normal to decision surface.
distance between (xi , yi )to(w , b) = γ i = yi (w T xi + b)-this is normal
distance between (xi , yi ) to decision surface
5 / 50
CONTD..
Functional margin of point (xj , yj ) is more than functional margin of
(xi , yi ).
The points having larger functional margin means more confidence in
predicting class of that point.
The problem of defining a margin as this distance is as follows - w1
and w2 can be scaled for eg. let equation of hyperplane be
2x+3y+1=0 or
4x+6y+2=0 or
20x+30y+10=0
In this example equation of line is same but functional margin is
increasing. So functional margin depends on value of coefficients of
equation of line.
So we need to use normalised distance.
6 / 50
CONTD...
Defining Functional Margin for set of points
let set of training points be (x1 , y1 ), (x2 , y2 ), (x3 , y3 ).........(xm , ym )
and with respect to these points functional margin is defined to be
γ = min(γ i ), where1 ≤ i ≤ m.
Among all functional margins we choose smallest functional margin
and that will be functional margin of set of points.
7 / 50
Geomatrical Margin
Geomatrical margin removes the drawbacks of Functional Margin.
Geomatrical margin is invariant to scaling of equation.
let equation of line be ax+by+c=0 then w=(a,b)
let w be a vector normal to decision surface then unit vector normal
to decision surface is given by
w a b
= (√ ,√ )
||w || 2
a +b 2 a + b2
2
8 / 50
If we want to find distance between P and Q then this distance is in
γw
direction of normal vector. Therefore we can write P = Q + where γ
||w ||
is distance of P from decision surface.
γw
Coordinates of P=coordinates of Q+
||w ||
γw
(a1 , a2 ) = (b1 , b2 ) + from this equation we can find γ
||w ||
γw
w T ((a1 , a2 ) − )+b =0 since (b1, b2) lies on line wx+b=0
||w ||
w T (a1 , a2 ) + b
γ=
||w ||
w T ((a1 , a2 ) b
γ= +
||w || ||w ||
w T ((a1 , a2 ) b
γ = y( + )
||w || ||w ||
we will scale w so that w norm (||w ||) equal to 1 and then we will find
geometric margin γ=y(wT (a1 , a2 ) + b))
9 / 50
Maximise margin width
Assume linearly separable training examples. γ is geometric margin
and we want to maximize it. We represent optimisation problem
as follows-
Give set of training examples labelled as +ve and -ve if
γ
(w , b)characterises decision surface then is geometric margin and
||w ||
we want to learn values of(w , b) so that this geometric margin is
largest, subjected to constraint-
wxi +b≥γ for +ve points.....(1)
wx+b≤γ for -ve points......(2)
1
Our aim is to scale w so that geometric margin is ( γ = 1) so
||w ||
1
we have to maximize or minimize ||w || = w .w combining (1)
||w ||
and (2) yi (wxi + b) ≥ γ for i = 1, 2, 3....m
when γ is made 1 by normalisation then yi (wxi + b) ≥1 for all training
instances.
10 / 50
Large Margin Linear classifier
1
Minimize ||w ||2 such that yi (w T xi + b) ≥ 1
2
Figure: 2 hyperplane
margin has been scaled so the geometric margin has width 1.The
equation of decision surface is w T x + b = 0
+ve points on margin satisfy eq. w T x + b = 1
-ve points on margin satisfy eq. w T x + b = −1 11 / 50
1
Minimize ||w ||2 such that yi (w T xi + b) ≥ 1
2
Lagrangian function of our optimisation problem
1
minimise LP (w,b,αi )= ||w ||2 - m T
P
i,j=1 αi (yi (w xi + b) − 1)such that
2
αi ≥ 0, i = 1, 2, 3....k
We convert our primal problem into dual formulation.This is solved by
using Lagrange multiplier and we can get Lagrange duality to get dual of
this optimisation problem.
Dual Problem
1 Pm
maxα J(α)= m αi αj yi yj (xiT xj )
P
i=1 αi –
2 i,j=1
such that αi ≥ 0, i = 1, 2, 3....k
Pm
i=1 αi yi = 0
This is a quadratic programming problem. A global maximum of αi can
always be found by solving this Quadratic Problem.
12 / 50
Maximum Margin with Noise
Limitation of previous SVM formulation
Data points are noisy so data points are not linearly separable
There could be a non linear decision surface
Figure: 3 hyperplane
we need to redefine optimisation formulation
13 / 50
CONTD..
We want a decision surface which looks at two things-
One is maximising margin which corresponds to minimising w.w
We want to reduce number of miss classification means minimise
training error.
Objective Function= w.w +C.number of training error
It is no longer a Quadratic objective Function.So we cannot use a QP
solver to solve this optimisation problem.
Figure: 4 hyperplane
14 / 50
CONTD....
Corresponding to each of points which are not correctly classified
assuming a particular decision surface and corresponding margin for
each such points ,we can give a penality and that can be proportional
to how far it is from its correct position.
Objective Function= w.w +C.ξi
yK (w .xk + b) ≥ 1 − ξk k=1,2,3...m
ξk ≥ 0 k=1,2,3....m
This is also called noisy linear SVM or soft SVM.
15 / 50
Non linear SVM and Kernel functions
Non Linear SVM is used when dataset is truely non linearly separable.
In this case original feature space is transformed to a new feature
space (higher dimensional feature space).
And in some cases it is possible that training points are linearly
separable in transformed feature space.
Suppose x is input variable then using the methodology of nonlinear
SVM we can use a mapping to map x to φ(x) x→ φ(x)
Normally when we transform this feature space to higher dimensional
feature space,it will lead to higher computational cost,but by using
Kernel function we can achieve this transformation without much
higher cost.
16 / 50
CONTD....
17 / 50
CONTD....
There are certain type of feature transformations where after
transformation SVM can be solved effeciently.For this we have kernel
function.
Suppose we have two points xa and xb , after transformation it
becomes φ(xa ) and φ(xb )Corresponding to certain φs we have
K (xa , xb )=φ(xa ).φ(xb ).This K (xa , xb ) function can be computed
without expanding φ(xa )and φ(xb ) and it is easy to compute even for
extremely high dimensional feature space.These are called kernel
functions.
18 / 50
The Kernel Trick
So without expanding individual φs while finding for solution to
SVM,we can use kernel function to find this.This is called kernel trick.
g (x) = w T φ(x) + b= αi φ(xi )T φ(x) + b
P
where i ∈ SV
this is discriminant function.
We only use dot product of feature vectors in both training and test.
A kernel function is defined as a function that corresponds to a dot
product of two feature vectors in some expanded feature space.
K (xa , xb )=φ(xa ).φ(xb )
19 / 50
Literature Review
20 / 50
Outcomes of Literature Review
Paper 1 : Multi-class support vector machines for static security
assessment of power system
This paper focuses on predicting security status of power system in the
least time frame with the highest accuracy level. The secure state of
power system is defined as its continue operation within allowable
boundaries in normal conditions as well as after any kind of
disturbance. Power system security assessment is classified in three
broad categories: (a) normal steady state operation where its
assessment is carried out by solving set of algebraic equations (b)
transient performance (c) dynamic performance assessment.Here only
static security assessment is addressed.
Paper 2 : An Optimized Stacked Support Vector Machines
Based Expert System for the Effective Prediction of Heart
Failure
In this paper, we introduce an expert system that stacks two support
vector machine (SVM) models for the effective prediction of HF.The
experimental results confirm that the proposed method improves the
performance of a conventional SVM model by 33%.
21 / 50
Discussion
We will be discussing two papers which have used Support Vector Machine
to design required classifiers.
Paper 1:
Multi-class support vector machines for static security
assessment of power system
Paper 2:
An Optimized Stacked Support Vector Machines Based Expert
System for the Effective Prediction of Heart Failure
22 / 50
Static security assessment using composite security index
The outage of any line and generator is considered as credible
contingency and performance evaluation of system after such
contingency is a part of static security assessment (SSA).
SSA identifies any significant bus-bar voltage limit violation or
overload of any transmission line whenever the system is subjected to
such contingencies
Then, SSA screens and ranks the critical contingencies in descending
order of decreasing its adverse impacts on security.
This is done to compute scalar performance index (PI) using FDLF or
DC power flow method The detection of any adverse conditions of
limit violations following any outages is known as screening of
contingency and its arrangement as per the severity is known as
ranking of contingency and highly affected to masking problem.
An improved methodology to compute CSI is reported using the
methodology based on hyper-ellipse encompassed within a hyper-box.
23 / 50
Formulation of composite security index
The constraints given in (1)and(2) must be satisfied under normal
operation of power systems.
Where PGi = real power at generator bus i, PD = total real power
load,Ploss = transmission network real power losses, |Vk | =magnitude of
kth bus, Pkm = real power transfer between bus k-m, Ngen = total
generators in system and N B = total buses in system. The real power
generation and load balance relation is given in (1) whereas minimum and
maximum constraints on active power generations, limits on bus voltage
magnitudes and active power flows are given in (2). The satisfaction of
constraints in (1), (2) following any contingency will determine secure
state of power system. 24 / 50
Power system static security assessment using pattern
recognition
Classification of security stages correctly, has prime role in security
assessment of large power systems. In this work, three classes have
been defined for security assessment,namely secure, alarm and
insecure, based on which power system operator will come on correct
decision about system security.
The pattern recognition approach implemented in this work has
reduced the online computational burden by performing the major
simulation works off-line.
The off-line simulations are carried out to generate large sets of
operating scenarios based on which security classifier will be designed.
Pattern recognition approach used in this work is depicted in Fig. 5
for assessment of power system security.
25 / 50
Figure: 5 Steps involved in pattern recognition for assessment of power system
security
26 / 50
CONTD....
The training set should be chosen to sufficiently cover the entire
range of operating scenario of power system including the variations
in active and reactive power load changes as well as outage conditions
for generator and transmission lines which may endanger the security
of power system.
Newton Raphson load flow is used to generate each operating
condition. These generated data is considered as pattern.
For each operating scenario, load flow is carried out and pattern is
formed with the parameters listed in (13).
X={|Vi |,δi ,Pgi ,Qgi ,Pdi ,Qdi ,Pkm ,Qkm } (13)
Evaluation of composite security index given in (11) will lead to the
labeling of secure, alarm or insecure state of power system.
27 / 50
Feature selection
The patterns generated for training purpose is generally large in
numbers and each pattern contains relatively large number of
variables to formulate classifier.
Feature selection will be useful to reduce the dimensionality by
choosing a small set of parameters known as features . A set of
parameters such obtained is termed as feature vector
Y = {y1 , y2 , y3 ..ym } where m is states in feature vector.
In this work, a sequential search algorithm is implemented for feature
selection which adds or removes parameters sequentially to feature
vector Z. Sequential search algorithm may follow either Sequential
backward selection (SBS) or Sequential forward selection (SFS) to
form feature vector Z.
28 / 50
Classifier design using multiclass support vector machine
SVMs are recently developed learning algorithms for classification and
has been successful to achieve higher accuracy even in more complex
system.
. Let A={pi , qi } is a training set, where pi is the vector of input
valued having n-dimension and qi ∈ {+1, −1} represents a label for
class determination of data instance qi.
Fig. represents optimal hyperplane of SVM classifier for two class
problem where bias b and weight w are two orthogonal vectors.
Support Vectors (SVs) are identified in Fig. 6 by those points which
are nearest to such created hyperplane.
In Fig. 6, q is the highest margin which is to be specified to create
SVs.
SVM constructs this optimal hyperplane by minimizing error function
with the use of an iterative training algorithm.
29 / 50
Figure: 6 Optimal Hyperplane representation for SVM Classifier
30 / 50
CONTD....
In this work, for security assessment of power system, three classes for
classifications have been considered, namely, secure, alarm and
insecure. Hence, such a problem has been treated as multiclass
problem.
There are two types of approaches suggested for multi-class SVM in
literature-
One is considering all data in one optimization.
The other is decomposing multi-class into a series of binary SVMs,
such as ”One-Against-All” (OAA) and ”One-versus-One”.
31 / 50
One-Against-All (OAA) SVM
In this simplest extension of the SVM to a k-class problem, k binary
SVM models are constructed.
In jth binary-class SVM problem,an optimal hyperplane is created
that separates data points from the class Cj and the combined class
Ci (all the data points from classes other than Cj are combined to
form one class Ci) by using the standard SVM approach.
We denote the optimal separating hyperplane discriminating the class
Cj and the combined class Ci as- g j (x) = w j .Φ(x) + b j ,
j=1,2,3...k
To determine g j (x),we have to solve the following dual form-
1 Pm
Maximize:LD = m αi αj yi yj (xiT xj )
P
i=1 αi –
2 i,j=1
such that 0 ≤ αi ≤ C , i=1,2,3....k
Pm
i=1 αi yi = 0
32 / 50
CONTD...
All k binary SVM classifiers are combined together to make a final
multi-class classifier.
After finding all the optimal hyperplanes given by g j (x)forj = 1, ..., k,
we say x is in the class which has the largest value of the decision
function and is given by
Class label of x = argmaxj=1,...k g j (x)
33 / 50
CONTD...
Figure: 7 Class boundaries for OAA SVM foormulation for a three class problem
34 / 50
An Optimized Stacked Support Vector Machines Based
Expert System for the Effective Prediction of Heart Failure
In this paper, we introduce an expert system that stacks two support
vector machine models for the effective prediction of HF.
The first SVM model is linear and L1 regularized. The second SVM
model is L2 regularized.To optimize the two models, we propose a
hybrid grid search algorithm (HGSA) that is capable of optimizing the
two models simultaneously.
Heart failure (HF) is the failure of heart to pump sufficient amount of
blood to meet the needs of the body.
Due to lack of adequate diagnostic tools and medical experts,
effective diagnosis of heart failure is a challenge [3], [4].
The effectiveness of the proposed method is evaluated using different
evaluation metrics: accuracy, sensitivity, specificity, the Matthews
correlation coefficient (MCC),
35 / 50
Proposed Method
The proposed diagnostic system has two sequential stages
The first model has the capability to eliminate irrelevant features by
shrinking their coefficients to zero. For different values of its
hyperparameter C1, different features are selected. Hence, we need to
search the optimal value of C1 which gives us optimal subset of
features.
The optimal subset of features are applied to the second SVM model
which is used as a predictive model. The second model has its own
hyperparameters i.e. kernel, C2 and gamma denoted by G, which also
need to be optimized.
36 / 50
Experiment: L1 Regularized Linear SVM Stacked With L2
Reguralized Linear SVM
DATASET DESCRIPTION
this study considered 13 HF features.
37 / 50
CONTD...
Figure: Block Diagram of the newly proposed method. X P : Original set of
HF features, X K : Optimal subset of features,by: predicted label,C:
Hyperparameter of the first linear SVM model which acts are feature
selector, and λ: Contains the C, G, and kernel hyperparameters of the
38 / 50
L2 Support Vector Machine
The model tries to search an optimal hyperplane which will maximize
the distance from the nearest training data points of any class
Considering a dataset S with k instances:
S = {(xi , yi )|xi ∈ R P , yi ∈ {−1, 1}} where xi denotes i t h instance
and P denotes the dimension of each instance or feature
vector.Moreover, the class label is denoted by yi . The class label
maybe -1 or 1 for HF disease binary classification problem. The SVM
model learns hyper-plane given by f(x)=w T ∗ x + b,where b is the
bias and w is the weight vector.
The hyperplane of the SVM model maximizes the margin while
minimizes the classification error. The margin is computed as the sum
of the distances to one of the closest positive and one of the closest
negative instances. That is the hyperplane maximizes the margin
2
distance .
||w ||2 2
39 / 50
CONTD...
By introducing a set of slack variables ξ i, i= 1,2,3...k and a penalty
parameter i.e., C, the SVM model tries to balance the minimization of
||w ||2 2 and the minimization of the misclassification errors
where L2-norm is the regularizer term and ξ is slack variable which
measures the degree of misclassification.
40 / 50
L1 Support Vector Machine
The L1-norm SVM can be used for feature selection due to its
capability of suppressing irrelevant or noisy features automatically. It
shrinks components of the vector w that correspond to the features
that would be eliminated.
If we change value of the C hyperparameter, different fitted
coefficients will be made zero. As a result, different subsets of
features will be obtained . Thus, we need to search the optimal value
of the hyperparameter C that will yield optimal subset of features.
41 / 50
Formulating The Two Optimization Problems As One
Optimizationn Problem By Merging them
Merge the hyperparameters of the two models, as a result a hybrid
grid is produced.
The first coordinate of each point on the hybrid gird will be the
hyperparameter of the first model i.e., C1 while the second C1 and
third coordinates will be the hyperparameters of the second coordinate
of each point on the hybrid gird will be the model i.e. C2 and G.
Hence, each point on the hybrid grid can be denoted as (C 1, C 2, G ).
Optimal point on the hybrid grid corresponds to the optimal subset of
features and the optimized predictive model which will show good
performance on the optimal subset of features.
42 / 50
Evaluation Metrics
To evaluate the effectiveness of the newly proposed method, different
evaluation metrics including accuracy, sensitivity, specificity and Matthews
correlation coefficient (MCC) have been used.
Accuracy- is the percentage of correctly classified subjects.
Sensitivity is the percentage of correctly classified patients
specificity is the correctly classified healthy subjects.
43 / 50
Contd...
where TP denotes number of true positives,
TN denotes numberof true negatives,
FP denotes number of false positives,
FN denotes number of false negatives.
In machine learning and statistics, the quality of binary classification
is measured using MCC. Its value can be between -1 and 1. MCC
value of -1 indicates total disagreement between prediction and
observation, 1 indicates a prefect prediction and 0 means the
classification is no better than random prediction.
44 / 50
Contd...
45 / 50
Contd...
The best accuracy of 91.1%is obtained using only 11 and 12 features
i.e., for K D 11 and K D 12, respectively. The optimal subset of
features for K D 11 includes F2, F3, F4, F6, F7, F8,F9, F10, F11, F12
and F13 while the optimal subset obtained for K D 12 includes F1,
F2, F3, F4, F6, F7, F8, F9, F10, F11, F12 and F13.
COMPARATIVE STUDY
Experimental results of the proposed method are compared with other
machine learning models and previously proposed methods.
46 / 50
The hyperparameters of all these models are optimized using
exhaustive search strategy
For Adaboost model, the hyperparameter Ne denotes the maximum
number of estimators at which boosting is terminated.
For random forest model, the hyperparameter Ne denotes the number
of trees in the forest and
For extra tree ensemble model the hyperparameter Ne denotes the
number of trees used by the ensemble model. From the table, it is
evidently clear that the proposed model show better performance
than the ensemble machine learning models
47 / 50
Conclusion
Multiclass power system security assessment with pattern recognition
has been presented in this work. The composite security index (CSI)
based on bus voltage is successfully applied to avoid the masking
problem for accurately discriminating secure, alarm and insecure cases
which fall within the close region of constraints violations.It has also
turned the pattern recognition problem in multiclass problem and it is
successfully formulated using SVM classifier.
From the experimental results achieved on the heart failure dataset, it
is concluded that the proposed expert system can improve the
decision making process of the physicians during diagnosis of heart
failure.The proposed method is efficient in terms of time complexity
as well because it reduces the training time of the predictive model.
48 / 50
References
[1] S. Kalyani, K.S. Swarup Design of pattern recognition system for
static security assessment and classification Pattern Anal Appl, 15 (3)
(2011).
[2] P. Sekhar, S. Mohanty An online power system static security
assessment module using multi-layer perceptron and radial basis
function network Int J Electr Power Energy Syst, 76 (2016).
[3] I. Paschalidis, “How machine learning is helping us predict heart
disease and diabetes,” Harvard Bus. Rev., 2017.
[4] G. Manogaran, R. Varatharajan, and M. K. Priyan, “Hybrid
recommendation system for heart disease diagnosis based on multiple
kernel learning with adaptive neuro-fuzzy inference system,”
Multimedia Tools Appl., vol. 77, no. 4, 2018.
49 / 50
THANK YOU
50 / 50