CSEN3082 DEEP LEARNING
O. V. Ramana Murthy
Course Outcome 1
Understand the role of neural networks and
their various applications.
Contents
Artificial Neuron
Feed Forward Networks
Gradient descent
Back propagation
Regularization techniques
Norm penalties as constrained optimization
Reference
Charu C. Aggarwal, "Neural Networks and Deep Learning", Springer International Publishing AG, Chapters 1 and 2.
Video playlist: Deep Learning – Charu Aggarwal, https://www.youtube.com/playlist?list=PLLo1RD8Vbbb_6gCyqxG_qzCLOj9EKubw7
Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning", MIT Press, 2016.
Neuron
Perceptron
The simplest neural network is referred to as the perceptron. This neural
network contains a single input layer and an output node.
Artificial Neuron
[Figure: inputs x1, x2, …, xn with weights w1, w2, …, wn and a bias input 1 with weight b feed a summing node; the pre-activation y passes through f_act to produce the output ŷ.]
Numerical Example 1
Calculate the output assuming the binary step activation function.
From the diagram: inputs x1 = 0.3, x2 = 0.7; weights w1 = 0.2, w2 = 0.6; bias weight 0.45.
y = 0.45 + 0.3×0.2 + 0.7×0.6 = 0.93
ŷ = f_act(0.93) = 1 (binary step)
ŷ = f_activation(y) = 1 if y ≥ 0, 0 if y < 0 (binary step)
Artificial Neuron model
Sigmoidal function: f(y) = 1/(1 + e^(−λy))
λ = 1 gives the binary sigmoidal function.
[Figure: inputs x1, x2 with weights w1, w2 feeding the neuron output y.]
ACTIVATION FUNCTIONS
(A) Identity/linear
(B) Binary and bipolar step
(C) Binary and bipolar sigmoidal
Source [2]
Numerical Example 2
Calculate the output assuming the binary sigmoidal activation function.
From the diagram: inputs x1 = 0.3, x2 = 0.7; weights w1 = 0.2, w2 = 0.6; bias weight 0.45.
y = 0.45 + 0.3×0.2 + 0.7×0.6 = 0.93
ŷ = f_sig(0.93) = 1/(1 + e^(−0.93)) = 0.72
(λ = 1 is the binary sigmoidal function)
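Both examples can be cross-checked with a few lines of Python. This is a minimal sketch assuming the reading of the diagram used above (inputs x1 = 0.3, x2 = 0.7, weights w1 = 0.2, w2 = 0.6, bias weight 0.45); the variable names are illustrative only.

import numpy as np

x = np.array([0.3, 0.7])            # inputs from the diagram
w = np.array([0.2, 0.6])            # weights from the diagram
b = 0.45                            # bias weight

y = b + np.dot(w, x)                # pre-activation: 0.93
y_step = 1 if y >= 0 else 0         # binary step (Example 1): 1
y_sig = 1.0 / (1.0 + np.exp(-y))    # binary sigmoid (Example 2): ~0.72
print(y, y_step, round(y_sig, 2))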
Pre- and Post-Activation Values
[Figure: within a neuron, the pre-activation value is the weighted sum of the inputs; the post-activation value is obtained after applying the activation function.]
Gradient Descent
Gradient represents how STEEP a slope is.
Gradient Descent
Given a differentiable function f(x),
gradient descent finds a minimum
by updating the variable x in steps
proportional to the negative of the gradient
(derivative) f′(x) at the current point.
Gradient Descent
Gradient represents how STEEP a slope is.
Uphill is positive; downhill is negative.
In one dimension it is the derivative; in multiple dimensions, the gradient.
Drawback: it may settle into a local minimum.
Gradient Search
For unconstrained optimization, i.e., no constraints, and when the gradient of the function exists:
1. Initialization: choose an initial x0 and let k = 0.
2. Calculate the derivative f′(xk), which indicates the slope of f at xk.
3. Update: x(k+1) = xk − α·f′(xk), where α is the learning rate.
4. Repeat steps 2-3 till convergence, i.e., |f′(xk)| < ε, or the maximum number of iterations is reached.
A code sketch of this procedure follows.
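Written as code, the four steps look as follows. This is an illustrative sketch: the function f(x) = (x² − 1)² is an assumed stand-in, chosen only because, like the example on the next slides, it has local minima at x = −1 and x = +1; it is not the function used in the tables.

def f(x):
    return (x**2 - 1)**2            # assumed stand-in function, minima at x = -1 and x = +1

def f_prime(x):
    return 4.0 * x * (x**2 - 1)     # its derivative

def gradient_search(x0, alpha=0.05, eps=1e-6, max_iters=10000):
    x = x0                          # step 1: initialization
    for _ in range(max_iters):
        g = f_prime(x)              # step 2: slope at the current point
        if abs(g) < eps:            # step 4: convergence check
            break
        x = x - alpha * g           # step 3: update with learning rate alpha
    return x

print(gradient_search(-2.0))        # settles near the local minimum x = -1
print(gradient_search(2.0))         # settles near the local minimum x = +1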
Gradient function
[Figure: plot of the example function and its derivative, with local minima at x = −1 and x = +1.]
Case 1: Initial value x0 = −2 (α = 0.1)

Iteration (n) | x_n | f′(x_n) | x_(n+1) = x_n − α·f′(x_n) | f(x_n)
0 | −2 | −16 | −2 + 0.1×16 = −0.4 | 12
1 | −0.4 | 1.28 | −0.4 − 0.1×1.28 = −0.528 | 3.23
2 | −0.528 | 1.39 | −0.528 − 0.1×1.39 = −0.667 | 2.9
3 | −0.667 | 1.48 | −0.667 − 0.1×1.48 = −0.815 | 2.65
4 | −0.815 | 1.6 | −0.815 − 0.1×1.6 = −0.975 | 2.56

The trajectory is converging toward the local minimum at x = −1.
Case 2: Initial value x0 = 2 (α = 0.1)

Iteration (n) | x_n | f′(x_n) | x_(n+1) = x_n − α·f′(x_n) | f(x_n)
0 | 2 | 16 | 2 − 0.1×16 = 0.4 | 12
1 | 0.4 | −1.28 | 0.4 − 0.1×(−1.28) = 0.528 | 3.23
2 | 0.528 | −1.39 | 0.528 − 0.1×(−1.39) = 0.667 | 2.9
3 | 0.667 | −1.48 | 0.667 − 0.1×(−1.48) = 0.815 | 2.65
4 | 0.815 | −1.6 | 0.815 − 0.1×(−1.6) = 0.975 | 2.5

The trajectory is converging toward the local minimum at x = 1.
Effect of learning rate
Chain Rule
The chain rule is applied to calculate the gradients of the loss function with respect to the weights and biases (in the different layers) of the network. These gradients are then used to update the parameters during optimization (e.g., using gradient descent).
The Chain Rule: Formula
If you have a composite function y = f(g(x)), the derivative with respect to x is given by:
dy/dx = f′(g(x))·g′(x)
This principle is used extensively in updating the weights of the different layers using the loss function at the output layer.
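For a concrete picture, here is the rule pushed through the kind of network used later (output Y = f(y), hidden output Z1 = f(z1), hidden input z1 = v11·x1 + v21·x2 + v01); the squared-error loss L = ½(Y − t)² is assumed for illustration:

∂L/∂v11 = (∂L/∂Y)·(∂Y/∂y)·(∂y/∂Z1)·(∂Z1/∂z1)·(∂z1/∂v11)
        = (Y − t)·f′(y)·w1·f′(z1)·x1

Each factor is a local derivative of one stage, so the error signal is simply multiplied stage by stage as it travels backwards.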
Linear Separability
Linearly Separable – AND gate

x1 (input) | x2 (input) | Y (output)
0 | 0 | 0
0 | 1 | 0
1 | 0 | 0
1 | 1 | 1
Linearly Separable – AND gate
Two input sources => two input neurons.
One output => one output neuron.
Activation function is binary sigmoidal: f(y) = 1/(1 + e^(−y))
Derivative: f′(y) = f(y)·(1 − f(y))
Linearly Separable – AND gate
[Figure: inputs x1, x2 with weights w1, w2 and a bias input 1 with weight w0 feed the summing node y, which passes through f(·) to give the output Y.]
Back-propagation training/algorithm
Given: input vector (x1, x2) at the i-th instant, target t.
Initialize weights w0, w1, w2 and learning rate α with some random values in the range [0, 1].
1. Output: y = w0 + w1·x1 + w2·x2
2. Activation function (sigmoidal): ŷ = f(y) = 1/(1 + e^(−y))
3. Compute error: e = t − ŷ
4. Backpropagate the error across the activation function: δ = e·f′(y), where f′ is the derivative of the selected activation function; f′(y) = f(y)·(1 − f(y)) for the sigmoidal activation function.
Back-propagation training/algorithm
5. Compute change in weights and bias: Δw1 = α·δ·x1, Δw2 = α·δ·x2, Δw0 = α·δ
6. Update the weights and bias: wi(new) = wi(old) + Δwi
7. Keep repeating steps 1-6 for all input combinations (4 in total). This is one epoch.
8. Run multiple epochs till the error decreases and stabilizes. A code sketch of steps 1-8 follows.
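A minimal sketch of steps 1-8 in Python, assuming the AND-gate data above, random initial weights in [0, 1], and a learning rate of 0.5 (the slides leave these values open):

import numpy as np

def f(y):
    return 1.0 / (1.0 + np.exp(-y))              # binary sigmoid

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # AND-gate inputs
T = np.array([0, 0, 0, 1])                       # AND-gate targets

rng = np.random.default_rng(0)
w = rng.random(2)          # w1, w2 drawn from [0, 1)
w0 = rng.random()          # bias weight
alpha = 0.5                # assumed learning rate

for epoch in range(5000):                  # step 8: run multiple epochs
    for xi, t in zip(X, T):                # step 7: all 4 input combinations
        y = w0 + np.dot(w, xi)             # step 1: output
        y_hat = f(y)                       # step 2: activation
        e = t - y_hat                      # step 3: error
        delta = e * y_hat * (1 - y_hat)    # step 4: cross the activation
        w = w + alpha * delta * xi         # steps 5-6: update weights
        w0 = w0 + alpha * delta            # and bias

print(np.round(f(w0 + X @ w)))             # expected: [0. 0. 0. 1.]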
(4 Rules) Backpropagating Error
1. Output neuron: the error observed at the output Y is brought back across the activation f(·): e at stage y = (t − Y)·f′(y).
2. Across a link with weight wi: e at stage xi = (e at stage y)·wi.
3. Weights update: Δwi = α·(e at stage y)·xi.
[Figure: a neuron with input xi, weight wi, activation f(·), pre-activation y and output Y.]
(4 Rules) Backpropagating Error
4. Across links when a node feeds more than one downstream neuron (>1 hidden layer): the errors arriving from all downstream neurons y1, y2, …, yn are summed over the connecting weights:
e at stage xi = Σ_k (e at stage yk)·wk
[Figure: input xi fanning out through weights w1, w2, …, wn to neurons y1, y2, …, yn.]
The power of nonlinear activation
functions in transforming a data set to
linear separability
Linearly not Separable – XOR gate

x1 (input) | x2 (input) | Y (output)
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
Linearly not Separable – XOR gate
Two input sources => two input neurons.
One output => one output neuron.
One hidden layer => 2 neurons.
Activation function is binary sigmoidal: f(y) = 1/(1 + e^(−y))
Derivative: f′(y) = f(y)·(1 − f(y))
Linearly not Separable – XOR gate
[Figure: the four XOR points plotted in the (x1, x2) plane; no single straight line separates the two classes.]
Input layer, Hidden layer, Output layer
[Figure: inputs x1, x2 and a bias input 1 feed hidden neurons Z1 and Z2 through weights v11, v21, v12, v22 and biases v01, v02; Z1, Z2 and a bias input 1 feed the output neuron Y through weights w1, w2 and bias w0.]
Back-propagation Training
Given: inputs (x1, x2), target t.
Initialize weights and learning rate α with some random values.
Feed-forward Phase
1. Hidden unit input: zj = v0j + Σi xi·vij, for j = 1 to p hidden neurons
2. Hidden unit output: Zj = f_sig(zj), sigmoidal activation function
3. Output unit input: y = w0 + Σj Zj·wj
4. Output: Y = f_sig(y), sigmoidal activation function
Back-propagation Training
Back-propagation of error Phase
5. Compute error correction term: δ = (t − Y)·f′(y), where f′(y) = f(y)·(1 − f(y)) is the derivative of the sigmoid
6. Compute change in weights and bias: Δwj = α·δ·Zj, Δw0 = α·δ; send δ to the previous layer
7. Hidden unit: δ_in,j = δ·wj
8. Calculate error term: δj = δ_in,j·f′(zj)
9. Compute change in weights and bias: Δvij = α·δj·xi, Δv0j = α·δj
Back-propagation Training
Weights and Bias update phase
10. Each output unit (k = 1 to m): update weights and bias, wjk(new) = wjk(old) + Δwjk
11. Each hidden unit (j = 1 to p): update weights and bias, vij(new) = vij(old) + Δvij
12. Check for the stopping criterion, e.g., a certain number of epochs, or when the targets are equal/close to the network outputs. A vectorized code sketch of the full procedure follows.
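A compact, vectorized sketch of the twelve steps is given below. It follows the slide notation (v for input-to-hidden weights, w for hidden-to-output weights); the initialization range, the learning rate of 0.5, the epoch count, and the random seed are illustrative assumptions rather than values fixed by the slides.

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))              # binary sigmoid

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # XOR inputs
T = np.array([[0], [1], [1], [0]])               # XOR targets

rng = np.random.default_rng(1)
V = rng.uniform(-1, 1, (2, 2))    # input-to-hidden weights v_ij
v0 = rng.uniform(-1, 1, (1, 2))   # hidden biases v_0j
W = rng.uniform(-1, 1, (2, 1))    # hidden-to-output weights w_j
w0 = rng.uniform(-1, 1, (1, 1))   # output bias w_0
alpha = 0.5                       # assumed learning rate

for epoch in range(10000):
    # Feed-forward phase (steps 1-4)
    Z = f(X @ V + v0)                              # hidden outputs
    Y = f(Z @ W + w0)                              # network output
    # Back-propagation of error phase (steps 5-9)
    delta = (T - Y) * Y * (1 - Y)                  # step 5: output error term
    delta_h = (delta @ W.T) * Z * (1 - Z)          # steps 7-8: hidden error terms
    # Weights and bias update phase (steps 10-11)
    W += alpha * Z.T @ delta
    w0 += alpha * delta.sum(axis=0, keepdims=True)
    V += alpha * X.T @ delta_h
    v0 += alpha * delta_h.sum(axis=0, keepdims=True)

# Typically converges to [0, 1, 1, 0]; XOR training can stall in a
# local minimum, in which case re-run with a different seed.
print(np.round(f(f(X @ V + v0) @ W + w0)).ravel())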
Hidden neuron input computation
z1 = v11·x1 + v21·x2 + v01
z2 = v12·x1 + v22·x2 + v02
[Figure: the network with the hidden pre-activations z1, z2 marked.]
Hidden neuron output computation
Z1 = f_sig(z1)
Z2 = f_sig(z2)
[Figure: the network with the hidden outputs Z1, Z2 marked.]
Output neuron input computation
y = w1·Z1 + w2·Z2 + w0
[Figure: the network with the output pre-activation y marked.]
Output neuron output computation
Y = f_sig(y)
[Figure: the network with the final output Y marked.]
Output error correction computation
δ = (t − Y)·f′(y)
[Figure: the network with δ marked at the output neuron.]
Output neuron weight-change computation
Δw0 = α·δ·1
Δw1 = α·δ·Z1
Δw2 = α·δ·Z2
[Figure: the network with the output-layer weight changes marked.]
Hidden neuron error propagation computation
δ1 = δ·w1
δ2 = δ·w2
[Figure: δ flowing back from the output across w1 and w2 to the hidden neurons.]
Hidden neuron error correction computation
δ11 = δ1·f′(z1)
δ22 = δ2·f′(z2)
[Figure: the hidden-layer error terms δ11, δ22 marked.]
Hidden neuron weight-change computation
Δv01 = α·δ11, Δv11 = α·δ11·x1, Δv21 = α·δ11·x2
Δv02 = α·δ22, Δv12 = α·δ22·x1, Δv22 = α·δ22·x2
[Figure: the network with the hidden-layer weight changes marked.]
NN with Two Hidden Layers (HW)
[Figure: a network with two hidden layers; w(1)_{i,j} denotes a weight in the first layer.]
Regularization (to avoid Overfitting)
One of the primary causes of corruption of the generalization process is overfitting.
The objective is to determine a curve that defines the border between the two groups using the training data.
Overfitting
Some outliers penetrate the area of the other group and disturb the boundary. As machine learning considers all the data, even the noise, it ends up producing an improper model (a curve, in this case). This would be penny-wise and pound-foolish.
Remedy: Regularization
Regularization is a numerical method that attempts to construct a model structure that is as simple as possible. The simplified model can avoid the effects of overfitting at a small cost in performance.
Cost function: J = Σ (t − y)², the sum of squared errors.
Remedy: Regularization
For this reason, overfitting of the neural network can be reduced by adding the sum of the weights to the cost function: (new) cost function J = Σ (t − y)² + λ·Σ|w|.
In order to drop the value of the cost function, both the error and the weights should be kept as small as possible. However, if a weight becomes small enough, the associated nodes become practically disconnected. As a result, unnecessary connections are eliminated, and the neural network becomes simpler.
Add L1 Regularization to XOR Network
New loss function: L = MSE + λ·Σ|w|
The gradient of the regularized loss w.r.t. a weight w is:
∂L/∂w = ∂MSE/∂w + λ·sign(w)
Update rule for weight w:
w ← w − α·(∂MSE/∂w + λ·sign(w))
Add L2 Regularization to XOR Network
New loss function: L = MSE + (λ/2)·Σw²
The gradient of the regularized loss w.r.t. a weight w is:
∂L/∂w = ∂MSE/∂w + λ·w
Update rule for weight w:
w ← w − α·(∂MSE/∂w + λ·w)
XOR implementation with L1
# Apply L1 regularization to weights; reg_lambda is the L1 strength
# (the lambda in the update rule on the previous slide)
hidden_layer_weights += learning_rate * (np.dot(hidden_layer_output.T, output_layer_delta) - reg_lambda * np.sign(hidden_layer_weights))
input_layer_weights += learning_rate * (np.dot(inputs.T, hidden_layer_delta) - reg_lambda * np.sign(input_layer_weights))
# Update biases (no regularization applied to biases)
hidden_layer_bias += np.sum(output_layer_delta, axis=0, keepdims=True) * learning_rate
input_layer_bias += np.sum(hidden_layer_delta, axis=0, keepdims=True) * learning_rate
XOR implementation with L2
# Apply L2 regularization to weights; reg_lambda is the L2 strength
# (the lambda in the update rule on the previous slide)
hidden_layer_weights += learning_rate * (np.dot(hidden_layer_output.T, output_layer_delta) - reg_lambda * hidden_layer_weights)
input_layer_weights += learning_rate * (np.dot(inputs.T, hidden_layer_delta) - reg_lambda * input_layer_weights)
# Update biases (no regularization applied to biases)
hidden_layer_bias += np.sum(output_layer_delta, axis=0, keepdims=True) * learning_rate
input_layer_bias += np.sum(hidden_layer_delta, axis=0, keepdims=True) * learning_rate
Norm Penalties as Constrained Optimization
Denote the regularized objective function:
J̃(θ; X, y) = J(θ; X, y) + α·Ω(θ)
J is a loss function, e.g., MSE or cross-entropy.
Ω is a penalty function.
α is a hyperparameter. Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.
L2 norm parameter regularization (ridge regression or Tikhonov regularization):
Ω(θ) = (1/2)·‖w‖₂²
L1 norm regularization:
Ω(θ) = ‖w‖₁ = Σi |wi|
Norm Penalties as Constrained Optimization
Consider the objective function with L2 norm regularization. Transform the L2 regularization term into a constraint:
minimize J(θ) subject to ‖w‖₂² ≤ τ
where τ > 0 is a constraint hyperparameter controlling the regularization strength.
• Larger τ: weaker regularization (equivalent to smaller α).
• Smaller τ: stronger regularization (equivalent to larger α).
Norm Penalties as Constrained Optimization
The constrained optimization can be reformulated using the Lagrange multiplier λ:
L(θ, λ) = J(θ) + λ·(‖w‖₂² − τ)
Here, λ acts as a penalty parameter for violating the constraint.
This formulation helps in using techniques like dual optimization or projected gradient descent to enforce constraints during optimization.
Norm Penalties as Constrained Optimization – Backpropagation
Consider J = MSE + (λ/2)·‖W‖₂², containing the MSE and the constraint term.
1. Compute gradients for the MSE term. The gradients of the MSE term with respect to the weights and biases, ∂MSE/∂W and ∂MSE/∂b, propagate through the layers as in standard backpropagation.
2. Compute gradients for the L2 constraint. The L2 regularization gradient is λ·W. This term is added to the gradient from the MSE during weight updates.
Norm Penalties as Constrained Optimization – Backpropagation
Enforcing the constraint (‖W‖₂ ≤ τ): to ensure the constraint is satisfied, after each weight update:
1. Check the L2 norm: ‖W‖₂ = √(Σ w²)
2. Project back if necessary: if ‖W‖₂ > τ, rescale W as follows:
W ← W·(τ / ‖W‖₂)
A code sketch of this projection follows.
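A minimal sketch of this projection step in NumPy; the weight matrix W and the values below are illustrative assumptions, not from the slides:

import numpy as np

def project_l2(W, tau):
    # Step 1: check the L2 norm of the weights
    norm = np.linalg.norm(W)
    # Step 2: project back only if the constraint is violated
    if norm > tau:
        W = W * (tau / norm)
    return W

# Usage after each gradient update (illustrative values):
W = np.array([[0.8, -0.6], [0.5, 0.9]])
W = project_l2(W, tau=1.0)
print(np.linalg.norm(W))                   # now <= 1.0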
Numerical Example
Consider two inputs, one hidden layer with two neurons and one output. No bias for any neuron.
1. Let the initial weights be given.
2. Using back-propagation, say we obtained the gradients of the loss with respect to each weight.
3. For learning rate α, update each weight: w ← w − α·(∂J/∂w).
Numerical Example
4. This gives the updated, new weights W.
5. Compute the current L2 norm: ‖W‖₂ = √(Σ w²).
6. If τ = 1.5, the constraint is already satisfied, and no adjustment is required.
7. If τ = 1.0, the constraint is violated. Then make the following adjustment:
W ← W·(τ / ‖W‖₂)
Appendix: Example Implementation
Using a back-propagation network, find the new weights for the network shown aside. Input = [0 1] and the target output is 1. Use learning rate 0.25 and the binary sigmoidal activation function.
[Figure: a 2-2-1 network with the initial weights consolidated on the next slide.]
1. Consolidate the information
Given: inputs [0 1], target t = 1.
[v11 v21 v01] = [0.6 −0.1 0.3]
[v12 v22 v02] = [−0.3 0.4 0.5]
[w1 w2 w0] = [0.4 0.1 −0.2]
Learning rate α = 0.25.
Activation function is binary sigmoidal: f(x) = 1/(1 + e^(−x))
Derivative: f′(x) = f(x)·(1 − f(x))
2. Feed-forward Phase
1. Hidden unit inputs (j = 1, 2):
z1 = v01 + v11·x1 + v21·x2 = 0.3 + 0.6×0 + (−0.1)×1 = 0.2
z2 = v02 + v12·x1 + v22·x2 = 0.5 + (−0.3)×0 + 0.4×1 = 0.9
2. Hidden unit outputs, sigmoidal activation function:
Z1 = f(0.2) = 0.549, Z2 = f(0.9) = 0.711
3. Output unit input:
y = w0 + w1·Z1 + w2·Z2 = −0.2 + 0.4×0.549 + 0.1×0.711 = 0.091
4. Output, sigmoidal activation function:
Y = f(0.091) = 0.523
3. Back-propagation of error Phase
5. Compute error correction term:
δ = (t − Y)·f′(y) = (1 − 0.523) × 0.523 × (1 − 0.523) = 0.119
6. Compute change in weights and bias:
Δw1 = α·δ·Z1 = 0.25 × 0.119 × 0.549 = 0.0163
Δw2 = α·δ·Z2 = 0.25 × 0.119 × 0.711 = 0.0212
Δw0 = α·δ = 0.25 × 0.119 = 0.0298
7. Hidden unit: send the error back to the previous layer:
δ_in1 = δ·w1 = 0.119 × 0.4 = 0.0476
δ_in2 = δ·w2 = 0.119 × 0.1 = 0.0119
3. Back-propagation of error Phase
8. Calculate error terms:
δ1 = δ_in1·f′(z1) = 0.0476 × 0.549 × (1 − 0.549) = 0.0118
δ2 = δ_in2·f′(z2) = 0.0119 × 0.711 × (1 − 0.711) = 0.00245
9. Compute change in weights and bias:
Δv11 = α·δ1·x1 = 0.25 × 0.0118 × 0 = 0.0
Δv21 = α·δ1·x2 = 0.25 × 0.0118 × 1 = 0.003
Δv01 = α·δ1 = 0.003
Δv12 = α·δ2·x1 = 0.0
Δv22 = α·δ2·x2 = 0.25 × 0.00245 = 0.0006
Δv02 = α·δ2 = 0.0006
4. Weights and Bias update phase
10. Each output unit (k = 1 to m): update weights and bias
w1(new) = w1 + Δw1 = 0.4 + 0.0163 = 0.416
w2(new) = w2 + Δw2 = 0.1 + 0.0212 = 0.121
w0(new) = w0 + Δw0 = −0.2 + 0.0298 = −0.170
11. Each hidden unit (j = 1 to p): update weights and bias
v11(new) = 0.6 + 0.0 = 0.6, v21(new) = −0.1 + 0.003 = −0.097, v01(new) = 0.3 + 0.003 = 0.303
v12(new) = −0.3 + 0.0 = −0.3, v22(new) = 0.4 + 0.0006 = 0.401, v02(new) = 0.5 + 0.0006 = 0.501
Epoch | v11 | v21 | v01 | v12 | v22 | v02
0 | 0.6 | −0.1 | 0.3 | −0.3 | 0.4 | 0.5
1 | 0.6 | −0.097 | 0.303 | −0.3 | 0.401 | 0.501

Epoch | z1 | z2 | w1 | w2 | w0 | y
0 | 0.549 | 0.711 | 0.4 | 0.1 | −0.2 | 0.523
1 | 0.551 | 0.711 | 0.416 | 0.121 | −0.170 | 0.536
Write a program for this case and cross-verify your answers. After how many epochs will the output converge?
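As a starting point for that exercise, here is a minimal script that repeats the single-pattern update of the appendix (input [0 1], target 1, α = 0.25) and counts epochs. The convergence tolerance of 0.05 on the output error is an assumption, since the slides do not define "converge" precisely.

import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))     # binary sigmoid

x = np.array([0.0, 1.0]); t = 1.0; alpha = 0.25
V = np.array([[0.6, -0.3],              # rows = inputs i, columns = hidden units j:
              [-0.1, 0.4]])             # v11=0.6, v12=-0.3, v21=-0.1, v22=0.4
v0 = np.array([0.3, 0.5])               # hidden biases v01, v02
W = np.array([0.4, 0.1]); w0 = -0.2     # output weights w1, w2 and bias w0

for epoch in range(1, 100001):
    Z = f(x @ V + v0)                   # epoch 1: [0.549, 0.711]
    Y = f(Z @ W + w0)                   # epoch 1: 0.523
    delta = (t - Y) * Y * (1 - Y)       # output error correction term
    dh = delta * W * Z * (1 - Z)        # hidden error terms (delta11, delta22)
    W += alpha * delta * Z; w0 += alpha * delta
    V += alpha * np.outer(x, dh); v0 += alpha * dh
    if abs(t - Y) < 0.05:               # assumed convergence tolerance
        print("converged at epoch", epoch, "with output", round(float(Y), 3))
        break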