Recurrent Neural Networks (RNNs)
and the Exploding & Vanishing Gradient Problems
Dang Huu Phat 22521065
Phan Nguyen Huu Phong 22521090
Chau The Vi 22521653
University of Information Technology
November 2, 2024
References
The contents of this document are taken mainly from the following sources:
Robin M. Schmidt, "Recurrent Neural Networks (RNNs): A Gentle Introduction and Overview", https://arxiv.org/pdf/1912.05911.pdf
Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, "On the difficulty of training recurrent neural networks", https://arxiv.org/pdf/1211.5063.pdf
Yoshua Bengio, Patrice Simard, Paolo Frasconi, "Learning long-term dependencies with gradient descent is difficult", https://www.researchgate.net/profile/Y-Bengio/publication/5583935_Learning_long-term_dependencies_with_gradient_descent_is_difficult/links/546b702d0cf2397f7831c03d/Learning-long-term-dependencies-with-gradient-descent-is-difficult.pdf
Table of Contents
1 Introduction & Notation
2 Training Recurrent Neural Networks
3 Exploding and Vanishing Gradient
Sequential Data
Sequential data is more commonly known as sequence data or sequences.
What makes it unique? Elements in a sequence appear in a certain order and are not independent of each other.
Example: Stock Price Prediction
In this example, the input data is the historical stock prices: $i_1$ is the stock price on January 1st, $i_2$ is the stock price on January 2nd, and so on. $(i_1, i_2, \dots, i_{30})$ is called a sequence. The RNN learns from this input and predicts the stock price in the future.
Figure: Example of Stock Price Prediction
Sequential data versus time series data
Time series data is a special type of sequential data where each
example is associated with a dimension for time.
In time series data, samples are taken at successive timestamps, and
therefore, the time dimension determines the order among the data
points. For example, stock prices and voice or speech records are time
series data.
In contrast, not all sequential data shares this temporal structure. Text
data or DNA sequences, for instance, exhibit ordered patterns, but lack
a distinct time dimension.
Sequential data versus time series data
An example is machine translation: the input is the sentence "I love CS115", and the order of the words significantly impacts the meaning of the sentence.
The input data, which consists of the sequence of words ['I', 'love', 'CS115'], is referred to as a sequence.
In natural language processing (NLP), it is not feasible to process entire
sentences as inputs. Similar to how videos are processed by breaking
them down into individual frames, in NLP, sentences are broken down
into individual words for input processing.
Representing sequences
The movie my friend has not seen is good
The movie my friend has seen is not good
Each input $x^{<t>}$ is paired with an output $y^{<t>}$, and the pairs are ordered along the time axis.
Figure: An example of time series data
What is Recurrent Neural Networks (RNNs)?
Recurrent Neural Networks (RNNs) are a type of neural network architecture used to detect patterns in sequences.
They can also be extended to images by breaking them down into a series of patches and treating them as a sequence.
What differentiates Recurrent Neural Networks from Feedforward Neural Networks, also known as Multi-Layer Perceptrons (MLPs), is how information gets passed through the network.
What is Recurrent Neural Networks (RNNs)?
Multilayer perceptrons (MLPs), and CNNs for image data, assume that the training examples are independent of each other and thus do not incorporate ordering information.
We can say that such models do not have a memory of previously seen
training examples.
RNNs, by contrast, are designed for modeling sequences and are capable
of remembering past information and processing new events accordingly,
which is a clear advantage when working with sequence data.
Pros and Cons
Benefits
Possibility of processing input of any length;
Model size not increasing with size of input;
Computation takes into account historical information;
Weights are shared across time
Costs
Computation being slow;
Difficulty of accessing information from a long time ago;
Cannot consider any future input for the current state;
Algorithm 2: Evaluate the learned policy on an unlabeled test graph
Input: the unlabeled test graph G, the learned policy π_θ, the query budget B
Result: the classification GNN f trained on the selected node sequence τ
τ = []; V_label^0 ← ∅;
Randomly initialize the classification GNN as f_{G, V_label^0};
for t = 1 to B do
    Generate the graph state S_G^t based on f_{G, V_label^(t-1)} and G;
    Compute the probability distribution over candidate nodes as p_G^t = π_θ(S_G^t);
    Select v_t = arg max_v p_G^t(v) from V_train \ V_label^(t-1) and query for its label;
    V_label^t ← V_label^(t-1) ∪ {v_t};
    Train the classification GNN for one more epoch to get f_{G, V_label^t};
    τ.append(v_t);
end
The different categories of sequence modeling
Many to One RNN
Many-to-One is used when a single output is required from
multiple input units or a sequence of them. It takes a sequence of
inputs to display a fixed output.
Sentiment Analysis is a common example of this type of Recurrent
Neural Network.
Figure: Many to One model
The different categories of sequence modeling
Many to Many RNN
Many-to-Many is used to generate a sequence of output data from
a sequence of input units.
This type of RNN is further divided into the following two
subcategories:
1 Equal Unit Size: In this case, the number of both the input and output units is the same. A common application can be found in Named-Entity Recognition.
2 Unequal Unit Size: In this case, inputs and outputs have different numbers of units. Its application can be found in Machine Translation.
Understanding the dataflow in RNNs
In a standard feedforward network, information flows from the input to
the hidden layer, and then from the hidden layer to the output layer.
On the other hand, in an RNN, the hidden layer receives its input from
both the input layer of the current time step and the hidden layer from
the previous time step.
Figure: The dataflow of a standard feedforward NN and an RNN
Understanding the dataflow in RNNs
Figure: Examples of an RNN with one and two hidden layers, unfolded over time. The single-layer RNN connects inputs $x^{<t>}$, hidden states $h^{<t>}$, and outputs $y^{<t>}$; the multilayer RNN stacks hidden layers $h_1^{<t>}$ and $h_2^{<t>}$ between the input and the output.
Table of Contents
1 Introduction & Notation
2 Training Recurrent Neural Networks
3 Exploding and Vanishing Gradient
Weight matrices in a single-hidden layer RNN
Figure: The single-hidden-layer RNN unfolded over time, with weight matrices $W_{hx}$ (input to hidden), $W_{hh}$ (hidden to hidden), and $W_{yh}$ (hidden to output). The hidden-layer weights can be concatenated as $W_h = [W_{hh}; W_{hx}]$.
Weight matrices in a single-hidden layer RNN
Using the same unfolded network and weight matrices as above:
Net input: $z_h^{(t)} = W_{hx}\, x^{(t)} + W_{hh}\, h^{(t-1)} + b_h$
Activation: $h^{(t)} = \phi_h\!\left(z_h^{(t)}\right)$
Weight matrices in a single-hidden layer RNN
Adding the output layer:
Net input: $z_h^{(t)} = W_{hx}\, x^{(t)} + W_{hh}\, h^{(t-1)} + b_h$
Activation: $h^{(t)} = \phi_h\!\left(z_h^{(t)}\right)$
Net input: $z_y^{(t)} = W_{yh}\, h^{(t)} + b_y$
Output: $y^{(t)} = \phi_y\!\left(z_y^{(t)}\right)$
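To make the forward pass concrete, the following is a minimal NumPy sketch of the equations above. The dimensions, the choice of tanh for $\phi_h$ and the identity for $\phi_y$, and all variable names are illustrative assumptions rather than anything fixed by the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not from the slides).
n_x, n_h, n_y, T = 3, 4, 2, 5

# Weight matrices and biases, following the slide notation.
W_hx = rng.normal(0, 0.5, (n_h, n_x))   # input  -> hidden
W_hh = rng.normal(0, 0.5, (n_h, n_h))   # hidden -> hidden
W_yh = rng.normal(0, 0.5, (n_y, n_h))   # hidden -> output
b_h = np.zeros(n_h)
b_y = np.zeros(n_y)

def rnn_forward(x_seq, h0=None):
    """Run the single-hidden-layer RNN over a sequence of shape (T, n_x).

    phi_h is tanh and phi_y is the identity (assumptions made for simplicity).
    """
    h = np.zeros(n_h) if h0 is None else h0
    hs, ys = [], []
    for x_t in x_seq:
        z_h = W_hx @ x_t + W_hh @ h + b_h   # net input of the hidden layer
        h = np.tanh(z_h)                    # activation h^(t) = phi_h(z_h^(t))
        z_y = W_yh @ h + b_y                # net input of the output layer
        y = z_y                             # output y^(t) = phi_y(z_y^(t)), identity here
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

x_seq = rng.normal(size=(T, n_x))
h_seq, y_seq = rnn_forward(x_seq)
print(h_seq.shape, y_seq.shape)   # (5, 4) (5, 2)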
Gradient-based Learning Methods
Gradient descent (GD) adjusts the weights of the model by finding the
error function derivatives with respect to each member of the weight
matrices.
Batch Gradient Descent
Computing the gradient for the whole dataset in each optimization iteration to perform a single update as
$$\theta_{t+1} = \theta_t - \frac{\lambda}{U} \sum_{k=1}^{U} \frac{\partial \ell_k}{\partial \theta}$$
However, computing error derivatives through time is difficult. This is mostly due to the relationship among the parameters and the dynamics of the RNN, which is highly unstable and makes GD ineffective.
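As a small illustration of the batch update above, the sketch below averages the gradients of U per-example losses before taking a single step. The quadratic per-example loss and the learning rate are arbitrary assumptions chosen only to make the update visible.

import numpy as np

rng = np.random.default_rng(1)

# Toy setup: U examples, each contributing a squared-error loss
# l_k(theta) = 0.5 * (theta - c_k)^2, so dl_k/dtheta = theta - c_k.
U = 8
c = rng.normal(size=U)
theta = 0.0
lam = 0.1   # learning rate (lambda in the slide)

for step in range(100):
    grads = theta - c                        # dl_k/dtheta for every example k
    theta = theta - lam / U * grads.sum()    # theta_{t+1} = theta_t - (lambda/U) * sum_k dl_k/dtheta

print(theta, c.mean())   # the parameter converges towards the mean of the c_k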
Back-propagation Through Time (BPTT)
Hidden variable
$$h^{(t)} = \phi_h\!\left(x^{(t)} \cdot W_{xh} + h^{(t-1)} \cdot W_{hh} + b_h\right)$$
Output variable
$$y^{(t)} = \phi_y\!\left(h^{(t)} \cdot W_{yh} + b_y\right)$$
We can define a loss function $L(y, o)$ to describe the difference between all outputs $y^{(t)}$ and the target values $o^{(t)}$ as shown in the equation below.
Loss Function
$$L(y, o) = \sum_{t=1}^{T} \ell^{(t)}\!\left(y^{(t)}, o^{(t)}\right)$$
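A minimal sketch of the summed loss, assuming a squared-error per-step loss $\ell^{(t)}$; the arrays simply stand in for RNN outputs $y^{(t)}$ and targets $o^{(t)}$.

import numpy as np

def sequence_loss(y_seq, o_seq):
    """L(y, o) = sum_t l^(t)(y^(t), o^(t)) with a squared-error l^(t) (an assumption)."""
    per_step = 0.5 * np.sum((y_seq - o_seq) ** 2, axis=1)   # l^(t) for each time step
    return per_step.sum()

y_seq = np.array([[0.2, 0.1], [0.4, -0.3], [0.0, 0.5]])   # outputs for T = 3 steps
o_seq = np.array([[0.0, 0.0], [0.5, -0.5], [0.1, 0.4]])   # targets
print(sequence_loss(y_seq, o_seq))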
Back-propagation Through Time (BPTT)
Since we have three weight matrices $W_{xh}$, $W_{hh}$, and $W_{yh}$, we need to compute the partial derivative of the loss with respect to each of them using the chain rule.
Partial derivative with respect to $W_{yh}$
$$\frac{\partial L}{\partial W_{yh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot \frac{\partial z_y^{(t)}}{\partial W_{yh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot h^{(t)}$$
Partial derivative with respect to $W_{hh}$
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot \frac{\partial z_y^{(t)}}{\partial h^{(t)}} \cdot \frac{\partial h^{(t)}}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot W_{yh} \cdot \frac{\partial h^{(t)}}{\partial W_{hh}}$$
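To sanity-check the formula for $\partial L / \partial W_{yh}$, the sketch below assumes a squared-error loss and an identity output activation (so the factor $\frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}}$ reduces to $y^{(t)} - o^{(t)}$), sums the resulting outer products with $h^{(t)}$, and compares one entry against a finite-difference estimate. All sizes and values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n_x, n_h, n_y, T = 3, 4, 2, 6
W_hx = rng.normal(0, 0.4, (n_h, n_x))
W_hh = rng.normal(0, 0.4, (n_h, n_h))
W_yh = rng.normal(0, 0.4, (n_y, n_h))
x_seq = rng.normal(size=(T, n_x))
o_seq = rng.normal(size=(T, n_y))

def forward(W_yh_):
    """Forward pass (tanh hidden units, identity outputs) returning hidden states, outputs, loss."""
    h = np.zeros(n_h)
    hs, ys = [], []
    for x_t in x_seq:
        h = np.tanh(W_hx @ x_t + W_hh @ h)
        hs.append(h)
        ys.append(W_yh_ @ h)
    ys = np.array(ys)
    loss = 0.5 * np.sum((ys - o_seq) ** 2)   # L = sum_t l^(t)
    return np.array(hs), ys, loss

# Analytic gradient: dL/dW_yh = sum_t (dl/dy * dy/dz_y) outer h^(t);
# with squared error and identity phi_y the bracket is simply y^(t) - o^(t).
hs, ys, _ = forward(W_yh)
grad = sum(np.outer(ys[t] - o_seq[t], hs[t]) for t in range(T))

# Finite-difference check of one entry.
eps = 1e-6
W_pert = W_yh.copy(); W_pert[0, 0] += eps
num = (forward(W_pert)[2] - forward(W_yh)[2]) / eps
print(grad[0, 0], num)   # the two numbers should agree closely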
Back-propagation through time (BPTT)
Figure: a sequence of slides expands the gradient on the unfolded computation graph (inputs $x^{(t)}$, hidden states $h^{(t)}$, outputs $y^{(t)}$, and the shared weights $W_{xh}$, $W_{hh}$, $W_{yh}$). Because $h^{(t)}$ depends on $h^{(t-1)}$, which in turn depends on $W_{hh}$, the derivative $\partial h^{(t)} / \partial W_{hh}$ is expanded step by step into a sum with one term per earlier time step, leading to the formulas on the next slide.
Back-propagation through time (BPTT)
Since each $h^{(t)}$ depends on the previous time step, we can substitute the last part of the above equation.
Partial derivative with respect to $W_{hh}$
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot W_{yh} \cdot \sum_{k=1}^{t} \frac{\partial h^{(t)}}{\partial h^{(k)}} \cdot \frac{\partial h^{(k)}}{\partial W_{hh}}$$
Similarly, we have the partial derivative with respect to $W_{xh}$ as below
Partial derivative with respect to $W_{xh}$
$$\frac{\partial L}{\partial W_{xh}} = \sum_{t=1}^{T} \frac{\partial \ell^{(t)}}{\partial y^{(t)}} \cdot \frac{\partial y^{(t)}}{\partial z_y^{(t)}} \cdot W_{yh} \cdot \sum_{k=1}^{t} \frac{\partial h^{(t)}}{\partial h^{(k)}} \cdot \frac{\partial h^{(k)}}{\partial W_{xh}}$$
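The double sum above can be evaluated directly. The sketch below keeps the same squared-error/identity-output assumptions as before, accumulates for every pair (t, k) the product of Jacobians between $h^{(k)}$ and $h^{(t)}$ together with the immediate derivative $\partial h^{(k)} / \partial W_{hh}$, and checks one entry against a finite difference. Because the code uses column vectors and $W_{hh} h$, each Jacobian factor appears as $\mathrm{diag}(1 - h^2)\, W_{hh}$, the column-vector counterpart of $W_{hh}^T \mathrm{diag}(\phi'_h)$ in the slides.

import numpy as np

rng = np.random.default_rng(3)
n_x, n_h, n_y, T = 3, 4, 2, 6
W_hx = rng.normal(0, 0.4, (n_h, n_x))
W_hh = rng.normal(0, 0.4, (n_h, n_h))
W_yh = rng.normal(0, 0.4, (n_y, n_h))
x_seq = rng.normal(size=(T, n_x))
o_seq = rng.normal(size=(T, n_y))

def forward(W_hh_):
    h = np.zeros(n_h)
    hs = [h]                                  # hs[k] = h^(k), with h^(0) = 0
    for x_t in x_seq:
        h = np.tanh(W_hx @ x_t + W_hh_ @ h)
        hs.append(h)
    ys = np.array([W_yh @ h for h in hs[1:]])
    loss = 0.5 * np.sum((ys - o_seq) ** 2)
    return hs, ys, loss

hs, ys, _ = forward(W_hh)
grad = np.zeros_like(W_hh)
for t in range(1, T + 1):
    delta_y = ys[t - 1] - o_seq[t - 1]        # dl^(t)/dy^(t) with identity phi_y
    r = delta_y @ W_yh                        # error propagated into h^(t)
    for k in range(t, 0, -1):
        # Here r = (dl^(t)/dy^(t)) W_yh * prod_{i=k+1}^{t} dh^(i)/dh^(i-1).
        grad += np.outer(r * (1.0 - hs[k] ** 2), hs[k - 1])   # immediate dh^(k)/dW_hh term
        r = (r * (1.0 - hs[k] ** 2)) @ W_hh                   # multiply in one more Jacobian

# Finite-difference check of one entry.
eps = 1e-6
W_pert = W_hh.copy(); W_pert[1, 2] += eps
num = (forward(W_pert)[2] - forward(W_hh)[2]) / eps
print(grad[1, 2], num)   # should agree closely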
Back-propagation through time (BPTT)
In order to transport the error through time from time step t back to time step k, we can write
$$\frac{\partial h^{(t)}}{\partial h^{(k)}} = \prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}}$$
We can consider the previous equation as a Jacobian matrix for the hidden state parameter as
$$\prod_{i=k+1}^{t} \frac{\partial h^{(i)}}{\partial h^{(i-1)}} = \prod_{i=k+1}^{t} W_{hh}^{T}\, \mathrm{diag}\!\left(\phi'_h\!\left(z_h^{(i)}\right)\right)$$
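The sketch below builds $\partial h^{(t)} / \partial h^{(k)}$ as the product of per-step Jacobians and compares it with a finite-difference Jacobian obtained by perturbing $h^{(k)}$ and rerunning the recurrence. The sizes, the choice t = 6 and k = 2, and the column-vector form $\mathrm{diag}(\phi'_h)\, W_{hh}$ are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
n_x, n_h, T, k = 3, 4, 6, 2     # compare dh^(T)/dh^(k) for k = 2 (illustrative choice)
W_hx = rng.normal(0, 0.4, (n_h, n_x))
W_hh = rng.normal(0, 0.4, (n_h, n_h))
x_seq = rng.normal(size=(T, n_x))

def roll_forward(h_k, start):
    """Run the recurrence from h^(start) = h_k up to h^(T), recording every state."""
    h, hs = h_k, {start: h_k}
    for i in range(start + 1, T + 1):
        h = np.tanh(W_hx @ x_seq[i - 1] + W_hh @ h)
        hs[i] = h
    return hs

# Forward pass from h^(0) = 0 to obtain the states along the trajectory.
hs = roll_forward(np.zeros(n_h), 0)

# Product of per-step Jacobians: dh^(i)/dh^(i-1) = diag(1 - h^(i)^2) @ W_hh
# (the column-vector counterpart of W_hh^T diag(phi') in the slide).
J = np.eye(n_h)
for i in range(k + 1, T + 1):
    J = np.diag(1.0 - hs[i] ** 2) @ W_hh @ J

# Finite-difference Jacobian: perturb each component of h^(k) and rerun the tail.
eps = 1e-6
J_num = np.zeros((n_h, n_h))
for j in range(n_h):
    h_pert = hs[k].copy(); h_pert[j] += eps
    J_num[:, j] = (roll_forward(h_pert, k)[T] - hs[T]) / eps

print(np.max(np.abs(J - J_num)))   # should be tiny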
Back-propagation through time (BPTT)
Assume the activation function for h is tanh.
$$\frac{\partial \tanh(x)}{\partial x} = 1 - \tanh(x)^2$$
Thus:
$$\frac{\partial h_j^{(k+1)}}{\partial h_i^{(k)}} = w_{i,j}\left(1 - \left[h_j^{(k+1)}\right]^2\right)$$
Back-propagation through time (BPTT)
$$\frac{\partial h^{(k+1)}}{\partial h^{(k)}} =
\begin{pmatrix}
w_{1,1}\left(1-[h_1^{(k+1)}]^2\right) & w_{2,1}\left(1-[h_1^{(k+1)}]^2\right) & \cdots \\
w_{1,2}\left(1-[h_2^{(k+1)}]^2\right) & w_{2,2}\left(1-[h_2^{(k+1)}]^2\right) & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}$$
This is just $W^T$ with the $j$-th row multiplied by $\left(1 - [h_j^{(k+1)}]^2\right)$, or:
$$\frac{\partial h^{(k+1)}}{\partial h^{(k)}} = \mathrm{diag}\!\left(1 - \left[h^{(k+1)}\right]^2\right) \cdot W^T
= \begin{pmatrix}
1-[h_1^{(k+1)}]^2 & 0 & \cdots & 0\\
0 & 1-[h_2^{(k+1)}]^2 & \cdots & 0\\
\vdots & \vdots & \ddots & \vdots
\end{pmatrix}
\cdot
\begin{pmatrix}
W_{11} & W_{21} & W_{31} & \cdots\\
W_{12} & W_{22} & \cdots & \cdots\\
\vdots & \vdots & \ddots & \ddots
\end{pmatrix}$$
Back-propagation through time (BPTT)
This is very problematic: Vanishing/Exploding gradient problem!
Table of Contents
1 Introduction & Notation
2 Training Recurrent Neural Networks
3 Exploding and Vanishing Gradient
Difficulty of training RNNs
$$\left\|\frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| = \left\|W_{hh}^{T}\,\mathrm{diag}\!\left(\phi'_h\!\left(z_h^{(i)}\right)\right)\right\| \le \left\|W_{hh}^{T}\right\|\,\left\|\mathrm{diag}\!\left(\phi'_h\!\left(z_h^{(i)}\right)\right)\right\| \le \gamma\,\lambda$$
where $\lambda$ is an upper bound on $\left\|W_{hh}^{T}\right\|$ (its largest singular value) and $\gamma$ bounds the derivative of the activation function:
$$\left|\phi'_h\right| \le \gamma = \tfrac{1}{4} \quad \text{if } \phi_h \text{ is the sigmoid}, \qquad \left|\phi'_h\right| \le \gamma = 1 \quad \text{if } \phi_h \text{ is tanh}.$$
Prove the derivative of the sigmoid function
$$f(x) = \frac{1}{1+e^{-x}}$$
$$f'(x) = \frac{d}{dx}\left(\frac{1}{1+e^{-x}}\right) = \frac{-1}{\left(1+e^{-x}\right)^2}\cdot\left(-e^{-x}\right) = \frac{e^{-x}}{\left(1+e^{-x}\right)^2} = \frac{1+e^{-x}-1}{\left(1+e^{-x}\right)^2}$$
$$= \frac{1+e^{-x}}{\left(1+e^{-x}\right)^2} - \frac{1}{\left(1+e^{-x}\right)^2} = \frac{1}{1+e^{-x}} - \left(\frac{1}{1+e^{-x}}\right)^2 = f(x) - f(x)^2 = f(x)\left(1-f(x)\right)$$
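A quick numerical check of the identity $f'(x) = f(x)(1 - f(x))$, using central differences at a few arbitrary points; it also confirms that the slope never exceeds 1/4.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)   # central difference
analytic = sigmoid(x) * (1.0 - sigmoid(x))                    # f(x)(1 - f(x))
print(np.max(np.abs(numeric - analytic)))   # tiny: the identity holds numerically
print(analytic.max())                       # maximum slope 0.25, attained at x = 0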
Gradient of Sigmoid Function
Figure: Sigmoid function
$$\sigma'(x) = \frac{d}{dx}\,\sigma(x) = \text{slope of } \sigma(x) \text{ at } x = \frac{1}{1+e^{-x}}\left(1 - \frac{1}{1+e^{-x}}\right)$$
Derivative of $\sigma(x)$ with respect to $x$
Gradient of Sigmoid Function
$x = 10 \;\Rightarrow\; \sigma(x) \approx 1$: $\quad \frac{d}{dx}\sigma(x) \approx 1\,(1-1) = 0$
$x = -10 \;\Rightarrow\; \sigma(x) \approx 0$: $\quad \frac{d}{dx}\sigma(x) \approx 0\,(1-0) = 0$
$x = 0 \;\Rightarrow\; \sigma(x) = \tfrac{1}{2}$: $\quad \frac{d}{dx}\sigma(x) = \tfrac{1}{2}\left(1-\tfrac{1}{2}\right) = \tfrac{1}{4}$
Difficulty of training RNNs
$$\left\|\frac{\partial h^{(t)}}{\partial h^{(k)}}\right\| = \left\|\prod_{i=k+1}^{t}\frac{\partial h^{(i)}}{\partial h^{(i-1)}}\right\| \le \prod_{i=k+1}^{t}\gamma\lambda = (\gamma\lambda)^{t-k}$$
1 If $\gamma\lambda > 1$, then the gradient will explode.
2 If $\gamma\lambda < 1$, then the gradient will vanish.
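The sketch below illustrates the bound. A random $W_{hh}$ is rescaled so that its largest singular value equals a chosen $\lambda$, $\gamma = 1$ is used for tanh, and the norm of the Jacobian product is tracked over $t - k = 30$ steps for $\lambda$ above and below 1. All numbers are illustrative; for $\gamma\lambda < 1$ the bound forces the product to vanish, while for $\gamma\lambda > 1$ explosion is merely possible.

import numpy as np

rng = np.random.default_rng(5)
n_h, span = 4, 30        # span = t - k
gamma = 1.0              # bound on |tanh'|; it would be 1/4 for the sigmoid

for lam in (1.5, 0.5):   # largest singular value of W_hh above / below 1/gamma
    W = rng.normal(size=(n_h, n_h))
    W *= lam / np.linalg.svd(W, compute_uv=False)[0]   # rescale so the spectral norm is lam
    h = np.zeros(n_h)
    J = np.eye(n_h)
    for i in range(span):
        h = np.tanh(W @ h + rng.normal(scale=0.1, size=n_h))
        J = np.diag(1.0 - h ** 2) @ W @ J              # one more factor dh^(i)/dh^(i-1)
    actual = np.linalg.norm(J, 2)
    bound = (gamma * lam) ** span
    # For gamma*lam < 1 the bound forces the product to vanish; for gamma*lam > 1 the
    # bound grows, so the gradient *can* explode (lam > 1/gamma is only a necessary condition).
    print(f"gamma*lam = {gamma * lam:.2f}: ||prod J|| = {actual:.3e}, bound = {bound:.3e}")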
Exploding Gradient
When training RNNs on long sequences, gradients may explode as the weights become larger and the norm of the gradient increases greatly during training.
As stated before, the necessary condition for this situation to happen is $\lambda > \frac{1}{\gamma}$.