
CSEN3082 DEEP LEARNING

O. V. Ramana Murthy
Course Outcome 1
Understand the role of neural networks and their various applications.
Contents
Artificial Neuron
Feed Forward Networks
Gradient descent
Back propagation
Regularization techniques
Norm penalties as constrained optimization

References
Chapters 1 and 2: Charu C. Aggarwal, "Neural Networks and Deep Learning", Springer International Publishing AG.
Video lectures: Deep Learning - Charu Aggarwal, https://www.youtube.com/playlist?list=PLLo1RD8Vbbb_6gCyqxG_qzCLOj9EKubw7
Ian Goodfellow, Yoshua Bengio, Aaron Courville, "Deep Learning", MIT Press, 2016.
Neuron

[Figure: a biological neuron]
Perceptron

The simplest neural network is referred to as the perceptron. This network contains a single input layer and an output node.
Artificial Neuron

[Figure: an artificial neuron with inputs x1 ... xn, weights w1 ... wn, bias b, a summing junction producing y, and an activation function f_act producing the output ŷ]
Numerical Example 1
Calculate the output assuming a binary step activation function.

[Figure: neuron with bias input 1 and inputs 0.2 and 0.6; the weights shown are 0.3, 0.45, and 0.7, and the weighted sum is y = 0.93]

ŷ = f_activation(y) = 1 if y ≥ 0; 0 if y < 0

Since y = 0.93 ≥ 0, the output is ŷ = 1.
Artificial Neuron model
Sigmoidal function: f(y) = 1 / (1 + e^(−λy))
λ = 1 gives the binary sigmoidal function: f(y) = 1 / (1 + e^(−y))

[Figure: a neuron with inputs x1, x2, weights w1, w2, and output y]
ACTIVATION FUNCTIONS
(A) Identity/linear
(B) Bipolar binary step
(C) Bipolar sigmoidal

[Figure: plots of the three activation functions]
Source [2]
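For quick reference, here is a minimal NumPy sketch of activation functions of this kind. The slide's own plots were images, so these are the standard textbook forms rather than the exact definitions shown:

import numpy as np

def identity(y):
    return y                                     # (A) identity / linear

def bipolar_step(y):
    return np.where(y >= 0, 1.0, -1.0)           # (B) bipolar binary step

def binary_sigmoid(y, lam=1.0):
    return 1.0 / (1.0 + np.exp(-lam * y))        # binary sigmoidal

def bipolar_sigmoid(y, lam=1.0):
    return 2.0 / (1.0 + np.exp(-lam * y)) - 1.0  # (C) bipolar sigmoidal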
Numerical Example 2
Calculate the output assuming a binary sigmoidal activation function.

[Figure: the same neuron as in Example 1; the weighted sum is y = 0.93]

ŷ = f_sig(y) = 1 / (1 + e^(−y)), with λ = 1 (binary sigmoidal function)
ŷ = 1 / (1 + e^(−0.93)) ≈ 0.72
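A short sketch that reproduces both numerical examples. One reading of the slide figure consistent with the stated weighted sum y = 0.93 is a bias weight of 0.45 and weights 0.3 and 0.7 on inputs 0.2 and 0.6; the exact pairing is an assumption, but the sum and both outputs match the slides:

import numpy as np

# Assumed reading of the slide figure: bias weight 0.45,
# weights [0.3, 0.7] on inputs [0.2, 0.6].
w0, w, x = 0.45, np.array([0.3, 0.7]), np.array([0.2, 0.6])

y = w0 + w @ x                        # weighted sum: 0.93
print(1 if y >= 0 else 0)             # Example 1 (binary step): 1
print(1.0 / (1.0 + np.exp(-y)))       # Example 2 (binary sigmoid): ~0.72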
Pre- and Post-Activation Values

[Figure: the pre-activation value (the weighted sum y) and the post-activation value (the output after f_act)]
Gradient Descent
Gradient represents how STEEP a slope is.

Gradient Descent
Given a differentiable function f(x), gradient descent finds a minimum by updating the variable x in steps proportional to the negative of the derivative f'(x) at the current point:

x_(k+1) = x_k − α f'(x_k)
Gradient Descent
 Gradient represents how STEEP a slope is.
 Uphill is positive; downhill is negative.
 In one dimension it is called the derivative; in multiple dimensions, the gradient.
 Drawback: it can settle into a local minimum.
Gradient Search
For unconstrained optimization, i.e., no constraints, where the gradient of the function exists (a sketch follows this list):

1. Initialization: choose an initial x_k and let k = 0.
2. Calculate the derivative f'(x_k), which indicates the slope of f at x_k.
3. Update: x_(k+1) = x_k − α f'(x_k), where α is the learning rate.
4. Repeat steps 2-3 till convergence, i.e., until the derivative (or the change in x) is negligibly small, or the maximum number of iterations is reached.
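A minimal Python sketch of this procedure. The function used here is an illustrative assumption, chosen so that it has local minima at x = ±1 and therefore matches the behaviour of the two cases that follow:

# Gradient descent on f(x) = (x^2 - 1)^2 / 4, an assumed example
# function with local minima at x = +1 and x = -1.
def f_prime(x):
    return x**3 - x

def gradient_descent(x0, alpha=0.1, tol=1e-6, max_iter=10000):
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)            # step 2: slope at the current point
        if abs(g) < tol:          # step 4: convergence check
            break
        x = x - alpha * g         # step 3: update
    return x

print(gradient_descent(-2.0))     # converges to approximately -1
print(gradient_descent(2.0))      # converges to approximately +1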
Gradient function

[Figure: the example function f(x) used in the two cases below, with its gradient]
Case 1: Initial value x0 = −2 (learning rate α = 0.1)

Iteration (n) | x_n    | f'(x_n)           | x_(n+1) = x_n − α f'(x_n)  | f(x_n)
0             | −2     | f'(−2) = −16      | −2 + 0.1×16 = −0.4         | f(−2) = 12
1             | −0.4   | f'(−0.4) = 1.28   | −0.4 − 0.1×1.28 = −0.528   | f(−0.4) = 3.23
2             | −0.528 | f'(−0.528) = 1.39 | −0.528 − 0.1×1.39 = −0.667 | f(−0.528) = 2.9
3             | −0.667 | f'(−0.667) = 1.48 | −0.667 − 0.1×1.48 = −0.815 | f(−0.667) = 2.65
4             | −0.815 | f'(−0.815) = 1.6  | −0.815 − 0.1×1.6 = −0.975  | f(−0.815) = 2.5

The trajectory is converging toward the local minimum at x = −1.
Case 2: Initial value x0 = 2 (learning rate α = 0.1)

Iteration (n) | x_n   | f'(x_n)            | x_(n+1) = x_n − α f'(x_n)     | f(x_n)
0             | 2     | f'(2) = 16         | 2 − 0.1×16 = 0.4              | f(2) = 12
1             | 0.4   | f'(0.4) = −1.28    | 0.4 − 0.1×(−1.28) = 0.528     | f(0.4) = 3.23
2             | 0.528 | f'(0.528) = −1.39  | 0.528 − 0.1×(−1.39) = 0.667   | f(0.528) = 2.9
3             | 0.667 | f'(0.667) = −1.48  | 0.667 − 0.1×(−1.48) = 0.815   | f(0.667) = 2.65
4             | 0.815 | f'(0.815) = −1.6   | 0.815 − 0.1×(−1.6) = 0.975    | f(0.815) = 2.5

The trajectory is converging toward the local minimum at x = 1.
Effect of learning rate

[Figure: gradient descent trajectories for different learning rates]
Chain Rule
The chain rule is applied to calculate the
gradients of the loss function with respect to the
weights and biases (different layers) of the
network. These gradients are then used to
update the parameters during optimization
(e.g., using gradient descent).
The Chain Rule: Formula
If you have a composite function y = f(g(x)), the derivative with respect to x is given by:

dy/dx = f'(g(x)) · g'(x)

This principle is used extensively in updating the weights of different layers using the loss function at the output layer.
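A small sketch of the chain rule in action, differentiating the composite σ(wx + b) that appears throughout backpropagation (the particular numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# y = sigmoid(g(x)) with g(x) = w*x + b; by the chain rule,
# dy/dx = sigmoid'(g(x)) * g'(x), where sigmoid'(z) = sigmoid(z)(1 - sigmoid(z)).
w, b, x = 0.5, 0.1, 2.0
Y = sigmoid(w * x + b)
dy_dx = Y * (1.0 - Y) * w             # analytic chain-rule derivative

# Cross-check against a numerical derivative.
h = 1e-6
numeric = (sigmoid(w * (x + h) + b) - sigmoid(w * (x - h) + b)) / (2 * h)
print(dy_dx, numeric)                 # the two values agree closely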
Linear Separability

[Figure: a linearly separable data set vs. one that is not]
Linearly Separable – AND gate

x1 (input) | x2 (input) | Y (output)
0          | 0          | 0
0          | 1          | 0
1          | 0          | 0
1          | 1          | 1
Linearly Separable – AND gate
Two input sources => two input neurons
One output => one output neuron
 Activation function is binary sigmoidal: f(y) = 1 / (1 + e^(−y))
 Derivative: f'(y) = f(y) (1 − f(y))
Linearly Separable – AND gate

[Figure: a single neuron with bias weight w0 (on input 1) and weights w1, w2 on inputs x1, x2; the weighted sum y passes through f(·) to give the output Y]
Back-propagation training/algorithm
Given: input vector at the i-th instant, target t.
Initialize the weights w0, w1, w2 and learning rate α with some random values in the range [0 1].
1. Output: y = w0 + w1 x1 + w2 x2
2. Activation function: Y = f(y), the sigmoidal activation function
3. Compute error: e = t − Y
4. Backpropagate the error across the activation function: δ = e · f'(y),
where f'(y) is the derivative of the activation function selected;
f'(y) = f(y) (1 − f(y)) for the sigmoidal activation function.
Back-propagation training/algorithm
5. Compute the change in weights and bias:
Δw1 = α δ x1, Δw2 = α δ x2, Δw0 = α δ
6. Update the weights and bias:
w1 ← w1 + Δw1, w2 ← w2 + Δw2, w0 ← w0 + Δw0
7. Keep repeating steps 1-6 for all input combinations (4 of them). This is one epoch.
8. Run multiple epochs till the error decreases and stabilizes. (A training sketch for the AND gate follows.)
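A minimal sketch of this training loop for the AND gate. The seed, initial weights, learning rate, and epoch count are assumed values; the update rules are the ones in steps 5-6:

import numpy as np

# AND-gate training data: one row per input combination.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])

rng = np.random.default_rng(0)
w = rng.uniform(0, 1, size=2)         # w1, w2 in [0, 1]
w0 = rng.uniform(0, 1)                # bias weight
alpha = 0.5                           # learning rate (assumed)

def f(y):                             # binary sigmoidal activation
    return 1.0 / (1.0 + np.exp(-y))

for epoch in range(5000):             # step 8: run multiple epochs
    for x, t in zip(X, T):
        y = w0 + w @ x                # step 1: weighted sum
        Y = f(y)                      # step 2: activation
        e = t - Y                     # step 3: error
        delta = e * Y * (1 - Y)       # step 4: error across activation
        w = w + alpha * delta * x     # steps 5-6: update weights
        w0 = w0 + alpha * delta       # and bias

print([round(f(w0 + w @ x), 2) for x in X])   # approaches [0, 0, 0, 1]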
(4 Rules) Backpropagating Error
Rule 1 – At the output neuron: the error crosses the activation function, e at stage y = e · f'(y).
Rule 2 – Across a link: the error is scaled by the link weight, e at stage x_i = w_i · (e at stage y).
Rule 3 – Weights update: each weight change is the error at its output side times the signal at its input side.

[Figure: a single neuron illustrating rules 1-3]
(4 Rules) Backpropagating Error
Rule 4 – Across links when a neuron feeds more than one downstream neuron (>1 hidden layer): the errors arriving over all outgoing links are summed, e at stage x_i = Σ_k w_k · (e at stage y_k).

[Figure: a neuron x_i feeding n downstream neurons y_1 ... y_n through weights w_1 ... w_n]
The power of nonlinear activation functions in transforming a data set to linear separability

[Figure: a nonlinear transformation mapping a non-separable data set to a linearly separable one]
Linearly not Separable – XOR gate

x1 (input) | x2 (input) | Y (output)
0          | 0          | 0
0          | 1          | 1
1          | 0          | 1
1          | 1          | 0
Linearly not Separable – XOR gate
Two input sources => two input neurons
One output => one output neuron
One hidden layer => 2 neurons
 Activation function is binary sigmoidal: f(y) = 1 / (1 + e^(−y))
 Derivative: f'(y) = f(y) (1 − f(y))
Linearly not Separable – XOR gate

[Figure: the XOR points plotted in the plane; no single straight line separates the two classes]
Input layer – Hidden layer – Output layer

[Figure: a 2-2-1 network; a bias input 1 connects to hidden neurons Z1, Z2 with weights v01, v02; inputs x1, x2 connect with weights v11, v21 (to Z1) and v12, v22 (to Z2); a bias input 1 with weight w0 and the hidden outputs with weights w1, w2 feed the output neuron Y]
Back-propagation Training
Given: inputs x1, x2, target t.
Initialize the weights and learning rate α with some random values.
Feed-forward Phase
1. Hidden unit input: z_j = v_0j + Σ_i v_ij x_i, for j = 1 to p hidden neurons
2. Hidden unit output: Z_j = f_sig(z_j), the sigmoidal activation function
3. Output unit input: y = w_0 + Σ_j w_j Z_j
4. Output: Y = f_sig(y), the sigmoidal activation function
Back-propagation Training
Back-propagation of error Phase
5. Compute the error correction term at the output:
δ = (t − Y) f'(y), where f'(y) is the derivative of the activation function
6. Compute the change in output weights and bias:
Δw_j = α δ Z_j, Δw_0 = α δ
and send δ to the previous layer.
7. Hidden unit: each hidden unit receives δ_j = δ w_j.
8. Calculate the hidden error term: δ_jj = δ_j f'(z_j)
9. Compute the change in hidden weights and bias:
Δv_ij = α δ_jj x_i, Δv_0j = α δ_jj
Back-propagation Training
Weights and Bias update phase
10. Each output unit, k = 1 to m, updates its weights and bias:
w_k(new) = w_k(old) + Δw_k, w_0(new) = w_0(old) + Δw_0
11. Each hidden unit, j = 1 to p, updates its weights and bias:
v_ij(new) = v_ij(old) + Δv_ij, v_0j(new) = v_0j(old) + Δv_0j
12. Check for the stopping criterion, e.g., a certain number of epochs, or targets being equal/close to the network outputs. (A complete XOR training sketch follows.)
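A sketch of the full algorithm applied to the XOR gate. The seed, learning rate, and epoch count are assumed values; depending on the initialization, more epochs (or another seed) may be needed for convergence:

import numpy as np

# XOR training data.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 1, 1, 0])

rng = np.random.default_rng(1)
V = rng.uniform(-1, 1, size=(2, 2))   # V[i, j] = v_ij: input x_i -> hidden Z_j
v0 = rng.uniform(-1, 1, size=2)       # hidden biases v_01, v_02
w = rng.uniform(-1, 1, size=2)        # hidden-to-output weights w_1, w_2
w0 = rng.uniform(-1, 1)               # output bias w_0
alpha = 0.5

def f(a):                             # binary sigmoidal activation
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(20000):
    for x, t in zip(X, T):
        # Feed-forward phase (steps 1-4)
        z = v0 + x @ V                # hidden inputs z_j
        Z = f(z)                      # hidden outputs Z_j
        y = w0 + w @ Z                # output unit input
        Y = f(y)                      # network output
        # Back-propagation of error phase (steps 5-9)
        delta = (t - Y) * Y * (1 - Y)       # output error term
        delta_h = delta * w * Z * (1 - Z)   # hidden error terms
        # Weights and bias update phase (steps 10-11)
        w = w + alpha * delta * Z
        w0 = w0 + alpha * delta
        V = V + alpha * np.outer(x, delta_h)
        v0 = v0 + alpha * delta_h

print([round(float(f(w0 + w @ f(v0 + x @ V))), 2) for x in X])  # near [0, 1, 1, 0]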
Hidden neuron input computation

z1 = v11 x1 + v21 x2 + v01
z2 = v12 x1 + v22 x2 + v02

[Figure: the 2-2-1 network with the hidden input sums highlighted]
Hidden neuron output computation

Z1 = f_sig(z1)
Z2 = f_sig(z2)

[Figure: the network with the hidden activations highlighted]
Output neuron input computation

y = w1 Z1 + w2 Z2 + w0

[Figure: the network with the output sum highlighted]
Output neuron output computation

Y = f_sig(y)

[Figure: the network with the final output highlighted]
Output error correction computation

δ = (t − Y) f'(y)

[Figure: δ computed at the output neuron]
Output neuron weight-change computation

Δw0 = α δ
Δw1 = α δ Z1
Δw2 = α δ Z2

[Figure: the weight changes on the output side of the network]
Hidden neuron error propagation computation

δ1 = δ · w1
δ2 = δ · w2

[Figure: δ sent back across the links w1 and w2]
Hidden neuron error correction computation

δ11 = δ1 · f'(z1)
δ22 = δ2 · f'(z2)

[Figure: the error terms crossing the hidden activation functions]
Hidden neuron weight-change computation

Δv01 = α δ11, Δv11 = α δ11 x1, Δv21 = α δ11 x2
Δv02 = α δ22, Δv12 = α δ22 x1, Δv22 = α δ22 x2

[Figure: the weight changes on the hidden side of the network]
NN with Two Hidden Layers (HW)

[Figure: a network with two hidden layers; the weights are denoted w_i,j^(1), w_i,j^(2), ...]
Regularization (to avoid Overfitting)
One of the primary causes of corruption of the generalization process is overfitting. The objective is to determine a curve that defines the border of the two groups using the training data.

[Figure: two groups of training points and a boundary curve]
Overfitting

[Figure: an overfitted boundary that bends around individual noisy points]
Overfitting
Some outliers penetrate the area of the other group and disturb the boundary. As Machine Learning considers all the data, even the noise, it ends up producing an improper model (a curve in this case). This would be penny-wise and pound-foolish.
Remedy: Regularization
Regularization is a numerical method that attempts to construct a model structure that is as simple as possible. The simplified model can avoid the effects of overfitting at a small cost in performance.
Cost function: J = sum of squared errors = Σ (t − Y)²
Remedy: Regularization
For this reason, overfitting of the neural network can be improved by adding the sum of the weights to the cost function:
(new) Cost function: J = Σ (t − Y)² + λ Σ ‖w‖ (an L1 or L2 norm penalty, as detailed below)
In order to drop the value of the cost function, both the error and the weights should be controlled to be as small as possible. However, if a weight becomes small enough, the associated nodes become practically disconnected. As a result, unnecessary connections are eliminated, and the neural network becomes simpler.
Add L1 Regularization to XOR Network
New loss function: L_new = L + λ Σ |w|
The gradient of the regularized loss w.r.t. a weight w is:
∂L_new/∂w = ∂L/∂w + λ sign(w)
The update rule for a weight w is:
w ← w − α (∂L/∂w + λ sign(w))
Add L2 Regularization to XOR Network
New loss function: L_new = L + (λ/2) Σ w²
The gradient of the regularized loss w.r.t. a weight w is:
∂L_new/∂w = ∂L/∂w + λ w
The update rule for a weight w is:
w ← w − α (∂L/∂w + λ w)
XOR implementation with L1
# Apply L1 regularization to weights. l1_lambda is the L1 strength,
# an assumed hyperparameter (the original snippet folds it into the update).
hidden_layer_weights += learning_rate * (
    np.dot(hidden_layer_output.T, output_layer_delta)
    - l1_lambda * np.sign(hidden_layer_weights))

input_layer_weights += learning_rate * (
    np.dot(inputs.T, hidden_layer_delta)
    - l1_lambda * np.sign(input_layer_weights))

# Update biases (no regularization applied to biases).
hidden_layer_bias += np.sum(output_layer_delta, axis=0, keepdims=True) * learning_rate
input_layer_bias += np.sum(hidden_layer_delta, axis=0, keepdims=True) * learning_rate
XOR implementation with L2
# Apply L2 regularization to weights. l2_lambda is the L2 strength,
# an assumed hyperparameter (the original snippet folds it into the update).
hidden_layer_weights += learning_rate * (
    np.dot(hidden_layer_output.T, output_layer_delta)
    - l2_lambda * hidden_layer_weights)

input_layer_weights += learning_rate * (
    np.dot(inputs.T, hidden_layer_delta)
    - l2_lambda * input_layer_weights)

# Update biases (no regularization applied to biases).
hidden_layer_bias += np.sum(output_layer_delta, axis=0, keepdims=True) * learning_rate
input_layer_bias += np.sum(hidden_layer_delta, axis=0, keepdims=True) * learning_rate
Norm Penalties as Constrained Optimization
Denote the regularized objective function:
J̃(θ) = J(θ) + α Ω(θ)
J(θ) is a loss function, e.g., MSE or cross-entropy.
Ω(θ) is a penalty function.
α is a hyperparameter. Setting α to 0 results in no regularization. Larger values of α correspond to more regularization.
L2 norm parameter regularization (ridge regression or Tikhonov regularization):
Ω(θ) = (1/2) ‖w‖₂²
L1 norm regularization:
Ω(θ) = ‖w‖₁ = Σ |w_i|
Norm Penalties as Constrained Optimization
Consider the objective function with L2 norm regularization:
J̃(θ) = J(θ) + (α/2) ‖w‖₂²
Transform the L2 regularization term into a constraint:
minimize J(θ) subject to ‖w‖₂² ≤ τ
where τ > 0 is a constraint hyperparameter controlling the regularization strength.
• Larger τ: weaker regularization (equivalent to smaller α).
• Smaller τ: stronger regularization (equivalent to larger α).
Norm Penalties as Constrained Optimization
The constrained optimization can be reformulated using the Lagrangian multiplier λ:
L(θ, λ) = J(θ) + λ (‖w‖₂² − τ)
Here, λ acts as a penalty parameter for violating the constraint.
This formulation helps in using techniques like dual optimization or projected gradient descent to enforce constraints during optimization.
Norm Penalties as Constrained Optimization – Backpropagation
Consider a loss containing the MSE and the constraint term.
1. Compute gradients for the MSE term.
The gradients of the MSE term with respect to the weights and biases propagate through the layers as in standard backpropagation.
2. Compute gradients for the L2 constraint.
The L2 regularization gradient is:
∂/∂W (λ ‖W‖₂²) = 2 λ W
This term is added to the gradient from the MSE during weight updates.
Norm Penalties as Constrained Optimization – Backpropagation
Enforcing the constraint (‖W‖₂ ≤ τ)
To ensure the constraint is satisfied, after each weight update (a sketch follows):
1. Check the L2 norm: ‖W‖₂ = sqrt(Σ w_i²)
2. Project back if necessary: if ‖W‖₂ > τ, rescale W as follows:
W ← W · (τ / ‖W‖₂)
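A minimal sketch of one projection step under the constraint ‖W‖₂ ≤ τ; the weight values here are hypothetical:

import numpy as np

def project_to_l2_ball(W, tau):
    norm = np.linalg.norm(W)          # step 1: check the L2 norm
    if norm > tau:                    # step 2: project back if necessary
        W = W * (tau / norm)          # rescale so that ||W||_2 == tau
    return W

tau = 1.0
W = np.array([0.9, 0.8])              # hypothetical weights after an update
print(np.linalg.norm(W))              # ~1.204 > tau: constraint violated
W = project_to_l2_ball(W, tau)
print(W, np.linalg.norm(W))           # rescaled weights, norm exactly 1.0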
Numerical Example
Consider two inputs, one hidden layer with two neurons, and one output. No bias for any neuron.
1. Let the initial weights W be given.
2. Using back-propagation, say we obtained the gradients ∇J(W).
3. For learning rate α, update W ← W − α ∇J(W).
Numerical Example
4. Updated, new weights W.
5. Compute the current L2 norm: ‖W‖₂ = sqrt(Σ w_i²)
6. If τ = 1.5, the constraint is already satisfied, and no adjustment is required.
Numerical Example
7. If τ = 1.5, the constraint is already satisfied, and no adjustment is required.
8. If τ = 1.0, the constraint is violated. Then make the following adjustment:
W ← W · (τ / ‖W‖₂)
Appendix: Example Implementation
Using a back-propagation network, find the new weights for the network shown below. Input = [0 1] and the target output is 1. Use learning rate 0.25 and the binary sigmoidal activation function.

[Figure: the 2-2-1 network with initial weights v11 = 0.6, v21 = −0.1, v01 = 0.3, v12 = −0.3, v22 = 0.4, v02 = 0.5, w1 = 0.4, w2 = 0.1, w0 = −0.2]
1. Consolidate the information
Given: inputs [0 1], target 1.
[v11 v21 v01] = [0.6 −0.1 0.3], [v12 v22 v02] = [−0.3 0.4 0.5]
[w1 w2 w0] = [0.4 0.1 −0.2]
Learning rate α = 0.25
Activation function is binary sigmoidal: f(y) = 1 / (1 + e^(−y))
Derivative: f'(y) = f(y) (1 − f(y))
2. Feed-forward Phase
1. Hidden unit inputs, j = 1, 2:
z1 = v01 + v11 x1 + v21 x2 = 0.3 + 0.6×0 + (−0.1)×1 = 0.2
z2 = v02 + v12 x1 + v22 x2 = 0.5 + (−0.3)×0 + 0.4×1 = 0.9
2. Hidden unit outputs (sigmoidal activation function):
Z1 = f(0.2) = 0.5498, Z2 = f(0.9) = 0.7109
3. Output unit input:
y = w0 + w1 Z1 + w2 Z2 = −0.2 + 0.4×0.5498 + 0.1×0.7109 = 0.0910
4. Output (sigmoidal activation function):
Y = f(0.0910) = 0.5227
3. Back-propagation of error Phase
5. Compute the error correction term:
δ = (t − Y) f'(y) = (1 − 0.5227) × 0.5227 × (1 − 0.5227) = 0.1191
6. Compute the change in output weights and bias:
Δw1 = α δ Z1 = 0.25 × 0.1191 × 0.5498 = 0.0164
Δw2 = α δ Z2 = 0.25 × 0.1191 × 0.7109 = 0.0212
Δw0 = α δ = 0.25 × 0.1191 = 0.0298
7. Hidden unit error terms:
δ1 = δ w1 = 0.1191 × 0.4 = 0.0476
δ2 = δ w2 = 0.1191 × 0.1 = 0.0119
3. Back-propagation of error Phase
8. Calculate the hidden error correction terms:
δ11 = δ1 f'(z1) = 0.0476 × 0.5498 × (1 − 0.5498) = 0.0118
δ22 = δ2 f'(z2) = 0.0119 × 0.7109 × (1 − 0.7109) = 0.00245
9. Compute the change in hidden weights and bias:
Δv11 = α δ11 x1 = 0.25 × 0.0118 × 0 = 0
Δv21 = α δ11 x2 = 0.25 × 0.0118 × 1 = 0.00295
Δv01 = α δ11 = 0.25 × 0.0118 = 0.00295
Δv12 = α δ22 x1 = 0.25 × 0.00245 × 0 = 0
Δv22 = α δ22 x2 = 0.25 × 0.00245 × 1 = 0.00061
Δv02 = α δ22 = 0.25 × 0.00245 = 0.00061
4. Weights and Bias update phase
10. Each output unit, k = 1 to m, updates its weights and bias:
w1(new) = 0.4 + 0.0164 ≈ 0.416
w2(new) = 0.1 + 0.0212 ≈ 0.121
w0(new) = −0.2 + 0.0298 ≈ −0.170
11. Each hidden unit, j = 1 to p, updates its weights and bias:
v11(new) = 0.6 + 0 = 0.6
v21(new) = −0.1 + 0.00295 ≈ −0.097
v01(new) = 0.3 + 0.00295 ≈ 0.303
v12(new) = −0.3 + 0 = −0.3
v22(new) = 0.4 + 0.00061 ≈ 0.401
v02(new) = 0.5 + 0.00061 ≈ 0.501
Epoch | v11 | v21    | v01   | v12  | v22   | v02
0     | 0.6 | −0.1   | 0.3   | −0.3 | 0.4   | 0.5
1     | 0.6 | −0.097 | 0.303 | −0.3 | 0.401 | 0.501

Epoch | z1    | z2    | w1     | w2     | w0    | y
0     | 0.549 | 0.711 | 0.4    | 0.1    | −0.2  | 0.523
1     | 0.551 | 0.711 | 0.4163 | 0.1213 | −0.17 | 0.5363

Write a program for this case and cross-verify your answers (a sketch follows). After how many epochs will the output converge?
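A sketch that cross-verifies the worked example. The network, initial weights, input, target, and learning rate are exactly those given above; only the epoch count is a choice:

import numpy as np

x = np.array([0.0, 1.0])        # input [0 1]
t = 1.0                         # target output
alpha = 0.25                    # learning rate

V = np.array([[0.6, -0.3],      # V[i, j] = v_ij: input x_i -> hidden Z_j
              [-0.1, 0.4]])
v0 = np.array([0.3, 0.5])       # hidden biases v_01, v_02
w = np.array([0.4, 0.1])        # output weights w_1, w_2
w0 = -0.2                       # output bias w_0

def f(a):                       # binary sigmoidal activation
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(1000):
    # Feed-forward phase
    z = v0 + x @ V
    Z = f(z)
    y = w0 + w @ Z
    Y = f(y)
    if epoch == 0:
        print(Z, Y)             # [0.5498 0.7109] and 0.5227, as in the slides
    # Back-propagation of error phase
    delta = (t - Y) * Y * (1 - Y)
    delta_h = delta * w * Z * (1 - Z)
    # Weights and bias update phase
    w = w + alpha * delta * Z
    w0 = w0 + alpha * delta
    V = V + alpha * np.outer(x, delta_h)
    v0 = v0 + alpha * delta_h

print(float(f(w0 + w @ f(v0 + x @ V))))   # output after 1000 epochs, close to 1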
