Module-3
(Neural Networks (NN) and Support
Vector Machines (SVM))
Perceptron, Neural Network - Multilayer feed
forward network, Activation functions
(Sigmoid, ReLU, Tanh), Backpropagation
algorithm.
SVM - Introduction, Maximum Margin
Classification, Mathematics behind Maximum
Margin Classification, Maximum Margin
linear separators, soft margin SVM classifier,
non-linear SVM, Kernels for learning
non-linear functions, polynomial kernel, Radial
Basis Function (RBF).
Biological Neuron
Artificial Neuron
Perceptron
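The output of the perceptron can also be
expressed as a dot product:
$o(x) = \operatorname{sgn}(w \cdot x)$, where the constant input $x_0 = 1$ carries the bias weight $w_0$.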
Net input function
Activation function
https://cs231n.github.io/neural-networks-1/
https://towardsdatascience.com/perceptron-the-artificial-neuron-
Perceptron learning rule
One way to learn an acceptable weight
vector is to begin with random weights,
then iteratively apply the perceptron to
each training example, modifying the
perceptron weights whenever it
misclassifies an example.
This process is repeated, iterating through
the training examples as many times as
needed until the perceptron classifies all
training examples correctly.
Weights are modified at each step according
to the perceptron training rule, which
revises the weight $w_i$ associated with input
$x_i$ according to the rule
$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta (t - o) x_i$
Here, t is the target output for the
current training example, o is the output
generated by the perceptron, and $\eta$ is a
positive constant called the learning rate.
The role of the learning rate is to moderate
the degree to which weights are changed at
each step.
It is usually set to some small value (e.g., 0.1).
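The learning rule above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `predict` and `train_perceptron`, the AND dataset, the targets in {-1, +1}, and the zero initial weights (used instead of random weights so the run is reproducible) are all assumptions made for the example.

```python
def predict(weights, bias, x):
    """Thresholded perceptron output: +1 if w.x + bias > 0, otherwise -1."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Repeat w_i <- w_i + eta * (t - o) * x_i until every example is classified correctly."""
    n = len(examples[0][0])
    weights = [0.0] * n  # assumption: zeros instead of random starting weights
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(weights, bias, x)
            if o != t:
                mistakes += 1
                for i in range(n):
                    weights[i] += eta * (t - o) * x[i]
                bias += eta * (t - o)  # bias acts as a weight on a constant input of 1
        if mistakes == 0:
            break
    return weights, bias


# Logical AND, a linearly separable problem (illustrative data)
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
```

Because AND is linearly separable, the loop stops once an epoch passes with no misclassification; on data that is not linearly separable it would simply run until `max_epochs`.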
Gradient Descent and the
Delta Rule
Although the perceptron rule finds a
successful weight vector when the training
examples are linearly separable, it can fail
to converge if the examples are not linearly
separable.
The delta rule is designed to overcome this
difficulty.
If the training examples are not linearly
separable, the delta rule converges toward
a best-fit approximation to the target concept.
The key idea behind the delta rule is to use
gradient descent to search the
hypothesis space of possible weight
vectors to find the weights that best fit the
training examples.
The delta training rule is best understood
by considering the task of training an
unthresholded perceptron; that is, a
linear unit for which the output o is
given by
$o(x) = w \cdot x$
Training error
$E(w) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where D is the set of training examples, $t_d$
is the target output for training
example d, and $o_d$ is the output of the
linear unit for training example d.
Since the gradient specifies the direction of
steepest increase of E, the training rule
for gradient descent is
$w \leftarrow w + \Delta w$, where $\Delta w = -\eta \nabla E(w)$
Here $\eta$ is a positive constant called the
learning rate, which determines the step size in the
gradient descent search.
The negative sign is present because we
want to move the weight vector in the
direction that decreases E.
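This training rule can also be written in
its component form
$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
That is,
$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id})$
where $x_{id}$ is the i-th input component of training example d.
Therefore,
$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$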
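As a rough sketch of the delta rule for a linear unit, the following Python code accumulates the weight change over all training examples and then applies it, which is one step of batch gradient descent on the squared training error. The function names, the toy data (targets generated from t = 1 + 2 x_1, with a constant first input playing the role of the bias), the learning rate and the epoch count are assumptions chosen for illustration.

```python
def linear_output(weights, x):
    """Unthresholded linear unit: o = w . x"""
    return sum(w * xi for w, xi in zip(weights, x))


def training_error(weights, examples):
    """E(w) = 1/2 * sum over examples of (t - o)^2"""
    return 0.5 * sum((t - linear_output(weights, x)) ** 2 for x, t in examples)


def gradient_descent(examples, eta=0.05, epochs=500):
    """Batch delta rule: delta_w_i = eta * sum_d (t_d - o_d) * x_id"""
    n = len(examples[0][0])
    weights = [0.0] * n
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in examples:
            o = linear_output(weights, x)
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        weights = [w + d for w, d in zip(weights, delta)]
    return weights


# Each input is (1, x_1); the leading 1 is a constant input for the bias weight.
data = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0), ((1.0, 3.0), 7.0)]
w = gradient_descent(data)
final_error = training_error(w, data)
```

With this data the learned weights approach (1, 2). A learning rate that is too large for the data makes the updates diverge, which is why the step size is kept small here.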
Multilayer Feed Forward
Network
Feed forward neural network
Each layer is made up of units.
The inputs to the network correspond to the
attributes measured for each training tuple.
The inputs are fed simultaneously into the
units making up the input layer.
These inputs pass through the input layer
and are then weighted and fed
simultaneously to a second layer of
“neuronlike” units, known as a hidden
layer.
The outputs of the hidden layer units can be
input to another hidden layer, and so on.
The weighted outputs of the last hidden
layer are input to units making up
the output layer, which emits the network's
prediction for given tuples.
The units in the input layer are called input
units.
The units in the hidden layers and output
layer are sometimes referred to
as neurodes, due to their symbolic
biological basis, or as output units.
A network containing two hidden layers is
called a three-layer neural network (the input
layer is not counted, because it only passes the
inputs on), and so on.
It is a feed-forward network since none of
the weights cycles back to an input unit or
to a previous layer's output unit.
https://www.sciencedirect.com/topics/computer-science/backpropagation-
Each output unit takes, as input, a
weighted sum of the outputs from units in
the previous layer.
It applies a nonlinear (activation) function
to the weighted input.
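A feed-forward pass through such a network can be sketched as below. This is a minimal illustration that assumes sigmoid activations in every layer and uses all-zero placeholder weights; `layer_forward`, `network_forward` and the 3-4-2 layout are hypothetical names and sizes picked for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Each unit applies the activation to its weighted sum of inputs plus a bias.

    weights[j][i] is the weight from input i to unit j.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Feed the outputs of each layer into the next one; no connection cycles back."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# 3 inputs -> 4 hidden units -> 2 output units, with placeholder zero parameters
layers = [
    ([[0.0] * 3 for _ in range(4)], [0.0] * 4),
    ([[0.0] * 4 for _ in range(2)], [0.0] * 2),
]
prediction = network_forward([1.0, 2.0, 3.0], layers)
```

With all parameters at zero every weighted sum is 0, so each unit outputs sigmoid(0) = 0.5; real networks start from small random weights instead.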
Compute the number of parameters for the
given network.
The network has 4 + 2 = 6 neurons (not
counting the inputs), [3 x 4] + [4 x 2] = 20
weights and 4 + 2 = 6 biases, for a total of
26 learnable parameters.
Compute the number of parameters for the
given network.
The network has 4 + 4 + 1 = 9 neurons
(not counting inputs), [3 x 4] + [4 x 4] + [4
x 1] = 12 + 16 + 4 = 32 weights and 4 + 4
+ 1 = 9 biases, for a total of 41 learnable
parameters.
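The two counts above follow the same pattern: one weight per connection between consecutive layers plus one bias per non-input unit. A small helper, hypothetical and written only to check the arithmetic, makes that explicit for fully connected layers.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, starting with the inputs.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # input units have no bias
    return weights + biases


first_network = count_parameters([3, 4, 2])      # 20 weights + 6 biases
second_network = count_parameters([3, 4, 4, 1])  # 32 weights + 9 biases
```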
Sigmoid function
ReLU Function
Tanh Function
Sigmoid function
Sigmoid outputs are not zero centered.
If the activation function of the network is not
zero-centered, $y = f(w \cdot x)$ is always positive or
always negative.
Thus, the output of a layer is always shifted
toward either positive or negative values.
As a result, the gradients for all weights feeding a
unit in the next layer share the same sign, so the
weight vector needs more updates to be trained properly.
Tanh vs Sigmoid
The tanh function is a stretched and shifted
version of the sigmoid.
Both sigmoid and tanh are S-shaped
functions that squash the input value
into a bounded range.
This keeps the activations bounded, which
helps limit the exploding gradient problem,
where the values of the gradients become
very large.
https://www.baeldung.com/cs/sigmoid-vs-tanh-functions
At x = 0, the gradient of tanh is four times
greater than the gradient of the sigmoid
function (1 versus 0.25).
This means that using the tanh activation
function can give higher gradient values
during training and larger updates to the
weights of the network.
So, if we want strong gradients and
big learning steps, we should use the
tanh activation function.
Another difference is that the output
of tanh is symmetric around zero
leading to faster convergence.
The output of tanh ranges from -1 to 1 and
has equal mass on both sides of the
zero axis, so it is a zero-centered function.
So, tanh overcomes the non-zero-centered
issue of the logistic activation
function.
Hence optimization becomes comparatively
easier than with the logistic function, and
tanh is generally preferred over it.
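The relationship between the two functions can be checked numerically. The sketch below assumes the standard definitions sigmoid(x) = 1 / (1 + e^(-x)) and tanh from Python's `math` module; it verifies the "stretched and shifted" identity tanh(x) = 2 sigmoid(2x) - 1 and compares the two gradients. The sample points and variable names are illustrative choices.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2


# tanh is the sigmoid stretched by 2 horizontally and rescaled from (0, 1) to (-1, 1)
xs = [-2.0, -0.5, 0.0, 0.5, 2.0]
identity_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)

# Both gradients peak at x = 0: 1.0 for tanh and 0.25 for sigmoid
ratio_at_zero = tanh_grad(0.0) / sigmoid_grad(0.0)

# Far from zero tanh saturates faster, so the ratio drops below 1
ratio_at_three = tanh_grad(3.0) / sigmoid_grad(3.0)
```

The factor of four therefore describes the peak gradients at x = 0 rather than every input value.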
Comparison with ReLU
Sigmoid and tanh functions suffer from the vanishing
gradient problem.
It is encountered while training artificial neural
networks with gradient-based learning
methods and backpropagation.
In such methods, during each iteration of training
each of the neural network's weights receives an
update proportional to the partial derivative of the
error function with respect to the current weight.
The problem is that in some cases, the gradient
will be vanishingly small, effectively preventing
the weight from changing its value.
In the worst case, this may completely stop the
neural network from further training.
The ReLU activation function can mitigate the
vanishing gradient problem, because its
gradient is exactly 1 for every positive input.
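A simplified way to see the effect is to multiply the local derivative of the activation across several layers, as the chain rule does during backpropagation. The sketch below is deliberately crude: it assumes every layer sees the same pre-activation value and ignores the weights entirely, so it illustrates the trend only. The function names and the values 2.0 and 10 are illustrative.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, depth):
    """Product of `depth` identical local derivatives (weights assumed to be 1)."""
    return grad_fn(z) ** depth


sigmoid_chain = chained_gradient(sigmoid_grad, 2.0, 10)
tanh_chain = chained_gradient(tanh_grad, 2.0, 10)
relu_chain = chained_gradient(relu_grad, 2.0, 10)
```

For sigmoid and tanh the product shrinks toward zero as depth grows, while for ReLU it stays at 1 as long as the pre-activations are positive. For negative pre-activations the ReLU gradient is 0, so ReLU is not immune to dead units.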
Back propagation
A feedforward phase - where an input
vector is applied and the signal propagates
through the network layers, modified by
the current weights and biases and by the
nonlinear activation functions.
Corresponding output values then emerge,
and these can be compared with the target
outputs for the given input vector using a
loss function.
A feedback phase - the error signal is
then fed back (backpropagated) through
the network layers to modify the weights in
a way that reduces the error across the
entire training set, effectively descending
the error surface in weight space.
Backpropagation Algorithm
(Stochastic gradient descent version)
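The two phases described above can be sketched for a network with one hidden layer. This is an illustrative version under stated assumptions, not necessarily the exact algorithm shown on the slide: it assumes sigmoid units, a squared-error loss, one weight update per training example (stochastic gradient descent), and a toy OR dataset. All function names, the seed, the learning rate and the epoch count are hypothetical choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, n_out, seed=0):
    """Small random weights; each row holds a unit's input weights plus a trailing bias."""
    rng = random.Random(seed)

    def layer(n_units, n_inputs):
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)] for _ in range(n_units)]

    return layer(n_hidden, n_in), layer(n_out, n_hidden)


def forward(network, x):
    """Feedforward phase: propagate the input through the hidden and output layers."""
    hidden_w, output_w = network
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in hidden_w]
    o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in output_w]
    return h, o


def sgd_step(network, x, t, eta):
    """Feedback phase for one example: compute error terms, then update all weights."""
    hidden_w, output_w = network
    h, o = forward(network, x)
    # Error term of each output unit: o_k (1 - o_k) (t_k - o_k)
    delta_o = [ok * (1.0 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # Error term of each hidden unit: h_j (1 - h_j) * sum_k w_kj * delta_k
    delta_h = [
        hj * (1.0 - hj) * sum(output_w[k][j] * delta_o[k] for k in range(len(o)))
        for j, hj in enumerate(h)
    ]
    # Weight update: w <- w + eta * delta * input (the trailing 1.0 feeds the bias)
    for k, row in enumerate(output_w):
        for j, v in enumerate(h + [1.0]):
            row[j] += eta * delta_o[k] * v
    for j, row in enumerate(hidden_w):
        for i, v in enumerate(x + [1.0]):
            row[i] += eta * delta_h[j] * v


def squared_error(network, data):
    return 0.5 * sum(
        (tk - ok) ** 2 for x, t in data for ok, tk in zip(forward(network, x)[1], t)
    )


# Toy dataset: logical OR (illustrative)
or_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
network = init_network(n_in=2, n_hidden=2, n_out=1)
error_before = squared_error(network, or_data)
for _ in range(2000):
    for x, t in or_data:
        sgd_step(network, x, t, eta=0.5)
error_after = squared_error(network, or_data)
```

The hidden-unit error terms are computed before any weight is changed, so they use the same output weights that produced the forward pass.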
Determine the number of trainable
parameters of the following neural net:
Input layer: 4 units.
Hidden layer 1: 16 units.
Hidden layer 2: 8 units.
Hidden layer 3: 4 units.
Output layer: 2 units.
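Weights: [4 x 16] + [16 x 8] + [8 x 4] + [4 x 2] = 64 + 128 + 32 + 8 = 232.
Biases: 16 + 8 + 4 + 2 = 30.
Total: 232 + 30 = 262 trainable parameters.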
Support Vector Machine
(Developed at AT&T Bell Laboratories by Vladimir Vapnik with colleagues in
1995)
01/04/2025
Support Vector Machines: General Philosophy
[Figure: small-margin versus large-margin separating hyperplanes, with the support vectors marked]
http://image.diku.dk/imagecanon/material/cortes_vapnik95.pdf
A learned classifier (hyperplane) achieves
maximum separation between the classes.
The two planes parallel to the classifier and
which pass through one or more points in
the dataset are called bounding planes.
The distance between these bounding
planes is called the margin.
By SVM learning, we mean finding a
hyperplane which maximizes this margin.
https://towardsdatascience.com/support-vector-machines-dual-formulation-quadratic-programming-sequential-minimal-optimization-57f4387ce4dd
Linearly Separable SVM
The optimal hyperplane is given by
$w \cdot x + b = 0$
where $w = (w_1, w_2, \ldots, w_n)$ is a weight vector and b
a scalar (bias).
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
Maximum Margin
Distance between a point $P(x_0, y_0, z_0)$ and a
given plane $Ax + By + Cz + D = 0$ is given by
$\frac{|A x_0 + B y_0 + C z_0 + D|}{\sqrt{A^2 + B^2 + C^2}}$
Here we have two bounding planes,
$w \cdot x + b = 1$ and $w \cdot x + b = -1$
Distance of the bounding hyperplane $w \cdot x + b = 1$ from the origin
$= \frac{|1 - b|}{\|w\|}$
Distance of the bounding hyperplane $w \cdot x + b = -1$ from the origin
$= \frac{|-1 - b|}{\|w\|}$
Distance between the planes (which needs to be maximized)
$= \frac{|(1 - b) - (-1 - b)|}{\|w\|} = \frac{2}{\|w\|}$
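The margin width 2 / ||w|| can be checked numerically with the point-to-plane distance formula. In the sketch below the weight vector (3, 4), the bias -2 and the helper names are made-up values for illustration; the point (1, 0) is chosen because it lies on the bounding plane w.x + b = 1.

```python
import math


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def distance_to_plane(w, offset, point):
    """Distance from `point` to the hyperplane w.x + offset = 0."""
    return abs(sum(wi * pi for wi, pi in zip(w, point)) + offset) / norm(w)


def margin_width(w):
    """Distance between the bounding planes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / norm(w)


w, b = (3.0, 4.0), -2.0      # illustrative values
point_on_upper = (1.0, 0.0)  # satisfies w.x + b = 1

# The other bounding plane is w.x + b = -1, i.e. w.x + (b + 1) = 0
measured = distance_to_plane(w, b + 1.0, point_on_upper)
width = margin_width(w)
```

Both values come out as 2 / 5 = 0.4, and neither depends on b, which is why maximizing the margin amounts to minimizing ||w||.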
Mathematics behind SVM
For the training data to be linearly
separable:
$w \cdot x_i + b \ge +1$, if $y_i = +1$
$w \cdot x_i + b \le -1$, if $y_i = -1$
Or,
$y_i (w \cdot x_i + b) \ge 1$, $i = 1, 2, \ldots, n$
Vectors $x_i$ for which $y_i (w \cdot x_i + b) = 1$
(points which fall on the bounding planes)
are termed support vectors.
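Primal problem
$\min_{w, b} \; \frac{1}{2} \|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$, $i = 1, 2, \ldots, n$ (1)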
Linearly Separable SVM
The optimal hyperplane is given by
$w \cdot x + b = 0$
where $w = (w_1, w_2, \ldots, w_n)$ is a weight vector and b
a scalar (bias).
The linear decision function I(x) is then given by
$I(x) = \operatorname{sign}(w \cdot x + b)$
https://link.springer.com/content/pdf/10.1007/BF00994018.pdf
SVM – Soft Margin
Here, C is a hyperparameter that decides
the trade-off between maximizing the
margin and minimizing the mistakes.
When C is small, classification mistakes are
given less importance and focus is more on
maximizing the margin, whereas when C is
large, the focus is more on avoiding
misclassification at the expense of keeping
the margin small.
https://towardsdatascience.com/support-vector-machines-soft-margin-formulation-and-kernel-trick-4c9729dc8efe
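Mathematics behind Soft Margin
SVM
$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, 2, \ldots, n$ (1)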
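To see how C trades margin against mistakes, the sketch below evaluates the standard soft-margin objective, 1/2 ||w||^2 + C * (sum of slacks), for a fixed hyperplane. It assumes the usual slack definition max(0, 1 - y (w.x + b)); the data points, the hyperplane and the function names are made up for illustration, and no optimization is performed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, data):
    """Slack of each example: how far it falls short of the margin y (w.x + b) >= 1."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in data]


def soft_margin_objective(w, b, data, C):
    """1/2 ||w||^2 + C * (sum of slacks)"""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, data))


# Two well-separated points and one negative point on the wrong side (illustrative)
data = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), -1)]
w, b = (1.0, 0.0), 0.0

slack_values = slacks(w, b, data)
objective_small_c = soft_margin_objective(w, b, data, C=1.0)
objective_large_c = soft_margin_objective(w, b, data, C=10.0)
```

The misplaced point has slack 1.5, so raising C from 1 to 10 raises the objective from 2.0 to 15.5 for the same hyperplane; an optimizer facing the larger C would be pushed harder to reduce that slack, even at the cost of a narrower margin.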
Non-Linear SVM
XOR Problem
X Y X XOR Y
0 0 0
0 1 1
1 0 1
1 1 0
https://www.tech-quantum.com/solving-xor-problem-using-neural-network-c/
https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f
SVM—Linearly Inseparable
Transform the original input data into a higher
dimensional space.
Search for a linear separating hyperplane in the
new space.
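For the XOR problem, one possible transformation (an assumption for this example, not necessarily the one used on the slides) adds the product of the two inputs as a third feature. In the new three-dimensional space a single hyperplane separates the classes; the weights and bias below are hand-picked, not learned.

```python
def feature_map(x1, x2):
    """Map a 2-D input to 3-D by appending the product x1 * x2."""
    return (x1, x2, x1 * x2)


def classify(point, w=(1.0, 1.0, -2.0), b=-0.5):
    """Linear decision in the mapped space: class 1 if w.z + b > 0, else class 0."""
    score = sum(wi * zi for wi, zi in zip(w, point)) + b
    return 1 if score > 0 else 0


xor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
predictions = [classify(feature_map(x1, x2)) for x1, x2 in xor_inputs]
```

The scores are -0.5, 0.5, 0.5 and -0.5 for the four inputs, so the hyperplane reproduces the XOR column of the truth table even though no line in the original two-dimensional space can.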
Kernel Functions
Kernel functions are generalized functions
that take two vectors (of any dimension) as
input and output a score that denotes how
similar the input vectors are.
An example is the dot product function: if
the dot product is small, we conclude that
vectors are different and if the dot product
is large, we conclude that vectors are more
similar.
Kernel Trick
In place of the dot product, we can use any
kernel function that is capable of
measuring similarity in a higher-dimensional
space (where it could be more accurate),
without increasing the computational
cost much.
This is essentially known as the Kernel
Trick.
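A small numerical check makes the trick concrete. For two-dimensional inputs, the degree-2 kernel K(x, z) = (x . z)^2 equals an ordinary dot product after mapping each input to (x1^2, sqrt(2) x1 x2, x2^2). This particular kernel, the mapping `phi` and the sample vectors are choices made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in the original 2-D space."""
    return dot(x, z) ** 2


def phi(x):
    """Explicit 3-D feature map whose dot product reproduces poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, z)     # never builds the 3-D vectors
explicit_value = dot(phi(x), phi(z))  # builds them, then takes the dot product
```

Both routes give 121, but the kernel needs only one dot product in the original space. For higher degrees or more input dimensions the explicit feature space grows quickly, while the cost of evaluating the kernel stays about the same.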
Polynomial Kernel
Kernel Matrix
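As a sketch of how a kernel matrix is built, the code below evaluates a kernel on every pair of training points. The kernel formulas are the commonly used forms, stated here as assumptions because the slide formulas did not survive extraction: polynomial K(x, z) = (x . z + c)^d and RBF K(x, z) = exp(-gamma ||x - z||^2). The parameter names `degree`, `coef0` and `gamma`, their default values, and the sample points are illustrative.

```python
import math


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x . z + coef0) ** degree"""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * squared Euclidean distance between x and z)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(points, kernel):
    """Entry (i, j) is the kernel value between points[i] and points[j]."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K_poly = kernel_matrix(points, polynomial_kernel)
K_rbf = kernel_matrix(points, rbf_kernel)
```

Both matrices are symmetric. The RBF matrix has ones on its diagonal because every point is at distance zero from itself, and its off-diagonal entries decay toward zero as points move apart.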
Why is SVM Effective on High Dimensional Data?
The complexity of the trained classifier is
characterized by the number of support vectors rather
than the dimensionality of the data.
The support vectors are the essential or critical
training examples: they lie closest to the decision
boundary (Maximum Margin Hyperplane).
Thus, an SVM with a small number of support
vectors can have good generalization, even when
the dimensionality of the data is high.
References
Dunham M H, “Data Mining: Introductory and
Advanced Topics”, Pearson Education, New Delhi,
2003.
Jiawei Han, Micheline Kamber, “Data Mining:
Concepts and Techniques”, Elsevier, 2006.
K.P. Soman, Shyam Diwakar, V. Ajay, “Insight into
Data Mining: Theory and Practice”, PHI Pvt. Ltd.,
New Delhi, 2008.
https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm