Computational Tutorial:
An introduction to LSTMs in TensorFlow
Harini Suresh Nick Locascio
Part 1: Neural Networks Overview
Part 2: Sequence Modeling with LSTMs
Part 3: TensorFlow Fundamentals
Part 4: LSTMs + TensorFlow Tutorial
Part 1: Neural Networks Overview
Neural Network
[diagram: input layer x0…xn, hidden layers h0…hn, output layer o0…on]
The Perceptron
[diagram: inputs x0…xn, weights w0…wn, bias b, weighted sum Σ, non-linearity]
Perceptron Forward Pass
[diagram: inputs x0…xn are multiplied by weights w0…wn, summed together with a bias b, and passed through a non-linearity to produce the output]
output = g(w0·x0 + w1·x1 + … + wn·xn + b), where g is the activation function (the non-linearity)
Sigmoid Activation
A common choice for g is the sigmoid: g(z) = 1 / (1 + e^(−z))
Common Activation Functions: sigmoid, tanh, ReLU
Importance of Activation Functions
● Activation functions add non-linearity to our network’s function
● Most real-world problems + data are non-linear
Perceptron Forward Pass (example)
inputs: 2, 3, −1, 5    weights: 0.1, 0.5, 2.5, 0.2    bias: 3.0
sum = (2*0.1) + (3*0.5) + (−1*2.5) + (5*0.2) + (1*3.0)
    = 0.2 + 1.5 − 2.5 + 1.0 + 3.0
    = 3.2
output = g(3.2); with the sigmoid activation, output ≈ 0.96
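A minimal NumPy sketch of this forward pass (the function and variable names are illustrative, not taken from the lab code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b):
    # weighted sum of the inputs plus the bias, passed through a non-linearity
    return sigmoid(np.dot(w, x) + b)

x = np.array([2.0, 3.0, -1.0, 5.0])   # inputs
w = np.array([0.1, 0.5, 2.5, 0.2])    # weights
b = 3.0                               # bias
print(perceptron_forward(x, w, b))    # ≈ 0.96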
How do we build neural networks
with perceptrons?
Perceptron Diagram Simplified
[diagram: the weights, sum, and non-linearity are collapsed into a single node, so inputs x0…xn map directly to an output o0]
Multi-Output Perceptron
[diagram: input layer x0…xn fully connected to an output layer o0, o1]
Multi-Layer Perceptron (MLP)
[diagram: input layer x0…xn, one hidden layer h0…hn, output layer o0…on]
Deep Neural Network
[diagram: input layer, multiple hidden layers, output layer]
Training Neural Networks
Training Neural Networks: Loss function
The loss compares the network's predicted outputs with the actual labels, averaged over the training set:
J(θ) = (1/N) Σᵢ loss(predictedᵢ, actualᵢ),   N = # examples
Training Neural Networks: Objective
The loss is a function of the model's parameters θ; the goal of training is to find the parameters that minimize it.
How to minimize loss?
● Start at a random point in parameter space
● Compute the gradient of the loss, ∂J(θ)/∂θ
● Move in the direction opposite the gradient to a new point
● Repeat!
This is called Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD)
● Initialize θ randomly
● For N epochs
○ For each training example (x, y):
■ Compute the loss gradient: ∂J(θ)/∂θ
■ Update θ with the update rule: θ := θ − η · ∂J(θ)/∂θ, where η is the learning rate
● How do we compute the gradient?
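As a concrete illustration, here is a minimal NumPy sketch of this loop for a single linear neuron with a squared-error loss (the toy data, the learning rate, and all names are illustrative assumptions, not from the slides):

import numpy as np

np.random.seed(0)
X = np.random.randn(100, 2)                 # toy inputs
y = X @ np.array([1.5, -2.0]) + 0.3         # toy targets

theta = np.random.randn(2)                  # initialize parameters randomly
b = 0.0
lr = 0.02                                   # learning rate (eta)

for epoch in range(50):                     # for N epochs
    for x_i, y_i in zip(X, y):              # for each training example (x, y)
        error = (x_i @ theta + b) - y_i
        grad_theta = error * x_i            # dJ/dtheta for squared error
        grad_b = error                      # dJ/db
        theta -= lr * grad_theta            # update rule: theta := theta - lr * gradient
        b -= lr * grad_b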
Calculating the Gradient: Backpropagation
[diagram: a two-layer network x0 → h0 → o0 → J(θ), with weights W1 (input → hidden) and W2 (hidden → output)]
Apply the chain rule:
∂J/∂W2 = (∂J/∂o0) · (∂o0/∂W2)
∂J/∂W1 = (∂J/∂o0) · (∂o0/∂h0) · (∂h0/∂W1)
Working backwards from the loss, the chain rule gives the gradient with respect to every weight in the network.
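To make the chain rule concrete, here is a tiny NumPy sketch of backprop for this x0 → h0 → o0 network; the sigmoid activations and squared-error loss are assumptions for illustration only:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x0, y_true = 0.5, 1.0        # one training example
W1, W2 = 0.3, -0.2           # scalar weights: input -> hidden, hidden -> output

# forward pass
h0 = sigmoid(W1 * x0)
o0 = sigmoid(W2 * h0)
J = 0.5 * (o0 - y_true) ** 2

# backward pass: apply the chain rule
dJ_do0  = o0 - y_true
do0_dW2 = o0 * (1 - o0) * h0
do0_dh0 = o0 * (1 - o0) * W2
dh0_dW1 = h0 * (1 - h0) * x0

dJ_dW2 = dJ_do0 * do0_dW2                 # dJ/dW2
dJ_dW1 = dJ_do0 * do0_dh0 * dh0_dW1       # dJ/dW1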
Core Fundamentals Review
● Perceptron Classifier
● Stacking Perceptrons to form neural networks
● How to formulate problems with neural networks
● Train neural networks with backpropagation
Part 2: Sequence Modeling
with Neural Networks
Harini Suresh
What is a sequence?
● “I took the dog for a walk this morning.” — a sentence
● [waveform plot] — a speech waveform (a function over time)
Successes of deep models
● Machine translation: https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
● Question answering: https://rajpurkar.github.io/SQuAD-explorer/
how do we model sequences?
idea: represent a sequence as a bag of words
“I dislike rain.”  →  [01010001]  →  prediction
the fixed-length vector just records which vocabulary words appear in the sentence
problem: bag of words does not preserve order
“The food was good, not bad at all.”
vs
“The food was bad, not good at all.”
both sentences produce exactly the same bag-of-words vector, yet mean opposite things
idea: maintain an ordering within the feature vector
[0001000100100000100000001]  →  prediction
 On  Monday  it  was  snowing
a one-hot feature vector at each position indicates what each word is
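A small Python sketch of this positional one-hot encoding (the tiny vocabulary and function name are illustrative):

import numpy as np

vocab = ["on", "monday", "it", "was", "snowing"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    # one one-hot vector per position, concatenated in order
    vecs = []
    for word in sentence.lower().split():
        one_hot = np.zeros(len(vocab), dtype=int)
        one_hot[word_to_idx[word]] = 1
        vecs.append(one_hot)
    return np.concatenate(vecs)

print(encode("On Monday it was snowing"))
print(encode("It was snowing on Monday"))   # same words, entirely different vector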
problem: hard to deal with different word orders
“On Monday, it was snowing.”
vs
“It was snowing on Monday.”
[0001000100100000100000001]  (On Monday it was snowing)
vs
[1000001000000010001000100]  (It was snowing on Monday)
the same words in a different order land in completely different slots, so we would have to relearn the rules of language at each point in the sentence
idea: Markov models
Markov assumption: each state depends only on the last state.
problem: we can’t model long-term dependencies
“In France, I had a great time and I learnt some of the _____ language.”
We need information from the far past and future to accurately guess the correct word.
let’s turn to recurrent neural networks! (RNNs)
1. to maintain word order
2. to share parameters across the sequence
3. to keep track of long-term dependencies
example network:
[diagram: a feed-forward network with input, hidden, and output layers; let’s take a look at one hidden unit]
RNNs remember their previous state:
t = 0:  x0 = “it”   →   s1 = f(W·x0 + U·s0)
t = 1:  x1 = “was”  →   s2 = f(W·x1 + U·s1)
where f is a non-linearity (e.g. tanh), W are the input weights, and U are the recurrent weights
“unfolding” the RNN across time:
[diagram: the recurrence unrolled left to right; at each timestep, the input xt feeds the state through weights W and the previous state feeds it through weights U]
● notice that W and U stay the same at every timestep!
● sn can contain information from all past timesteps
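A minimal NumPy sketch of this unrolled forward pass, with tanh as the non-linearity (all sizes and names are illustrative):

import numpy as np

np.random.seed(0)
input_size, hidden_size, seq_len = 8, 16, 5

W = 0.1 * np.random.randn(hidden_size, input_size)    # input -> state weights
U = 0.1 * np.random.randn(hidden_size, hidden_size)   # state -> state weights

xs = [np.random.randn(input_size) for _ in range(seq_len)]   # x0, x1, ...
s = np.zeros(hidden_size)                                    # s0

for x_t in xs:
    # the same W and U are reused at every timestep
    s = np.tanh(W @ x_t + U @ s)
# the final state s can carry information from all past timesteps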
possible task: language model
all the works of Shakespeare  →  language model  →  generated text, e.g.:
KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.
possible task: language model
[diagram: unrolled RNN; inputs x0 = <start>, x1 = “alas”, x2 = “my”; outputs y0 = “alas”, y1 = “my”, y2 = “honor”]
each yi is actually a probability distribution over possible next words, aka a softmax
possible task: language model
King James Bible + Structure and Interpretation of Computer Programs  →  language model  →  generated text, e.g.:
“37:29 The righteous shall inherit the land, and leave it for an inheritance unto the children of Gad according to the number of steps that is linear in b.”
“hath it not been for the singular taste of old Unix, “new Unix” would not exist.”
http://kingjamesprogramming.tumblr.com/
possible task: classification (e.g. sentiment)
[diagram: the RNN reads an input sequence word by word (x0 = “don’t”, x1 = “fly”, …, xn = “luggage”); only the final state feeds the prediction y = negative]
y is a probability distribution over possible classes (like positive, negative, neutral), aka a softmax
possible task: machine translation
[diagram: an encoder RNN reads “the dog eats”; its final state s2 seeds a decoder RNN that emits “le chien mange <end>” one word at a time, feeding each emitted word back in as the next input]
how do we train an RNN?
backpropagation! (through time)
remember: backpropagation
1. take the derivative (gradient) of the loss with
respect to each parameter
2. shift parameters in the opposite direction in order
to minimize loss
we have a loss at each timestep:
(since we’re making a prediction at each timestep)
[diagram: the unrolled RNN with a loss Jt attached to each prediction yt]
we sum the losses across time:
loss at time t = Jt(θ), where θ = our parameters, like the weights
total loss:  J(θ) = Σt Jt(θ)
what are our gradients?
we sum gradients across time for each parameter P:
∂J/∂P = Σt ∂Jt/∂P
let’s try it out for W with the chain rule:
[diagram: the unrolled RNN with losses J0, J1, J2 at each timestep]
so let’s take a single timestep t, say t = 2:
∂J2/∂W = (∂J2/∂y2) · (∂y2/∂s2) · (∂s2/∂W)
but wait… s2 is computed from s1, and s1 also depends on W, so we can’t just treat it as a constant!
how does s2 depend on W?
[diagram: the unrolled RNN]
s2 depends on W directly (through W·x2), but also indirectly through s1 and s0, which were themselves computed using W.
backpropagation through time:
∂Jt/∂W = Σ(k=0..t) (∂Jt/∂yt) · (∂yt/∂st) · (∂st/∂sk) · (∂sk/∂W)
the sum over k collects the contributions of W in previous timesteps to the error at timestep t
why are RNNs hard to train?
problem: vanishing gradient
the term ∂st/∂sk is itself a product of one-step terms:
∂st/∂sk = Π(j=k+1..t) ∂sj/∂sj−1
at k = 0, this chain stretches all the way back to the start of the sequence
as the gap between timesteps gets bigger, this product gets longer and longer!
problem: vanishing gradient
what are each of these terms?
each one-step term ∂sj/∂sj−1 involves:
● W — sampled from a standard normal distribution, so mostly < 1
● f′ — the derivative of tanh or sigmoid, so f′ < 1
we’re multiplying a lot of small numbers together.
so what?
errors due to further-back timesteps have increasingly smaller gradients, so the parameters become biased to capture shorter-term dependencies.
“In France, I had a great time and I learnt some of the _____ language.”
our parameters are not trained to capture long-term dependencies, so the word we predict will mostly depend on the previous few words, not on much earlier ones
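A tiny numeric illustration of the effect (the 0.8 factor is made up purely for illustration):

# pretend each one-step factor |∂sj/∂sj−1| is around 0.8
factor = 0.8
for gap in [1, 5, 10, 25, 50]:
    print(gap, factor ** gap)
# 1   0.8
# 5   0.33
# 10  0.11
# 25  0.0038
# 50  0.000014   -> gradients from far-back timesteps effectively vanish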
solution #1: activation functions
using ReLU prevents f′ from shrinking the gradients: the ReLU derivative is 1 for positive inputs, whereas the tanh and sigmoid derivatives are always less than 1
[plots: derivatives of ReLU, tanh, and sigmoid]
solution #2: initialization
weights initialized to identity matrix
biases initialized to zeros
prevents W from shrinking the gradients
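A small NumPy sketch of that initialization for the recurrent weight matrix (names and sizes are illustrative only):

import numpy as np

hidden_size = 16
W = np.eye(hidden_size)        # recurrent weights start as the identity matrix
b = np.zeros(hidden_size)      # biases start at zero
# with W = I and b = 0 the previous state is initially passed through unchanged,
# so the recurrent weights do not shrink the gradients early in training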
solution #3: gated cells
rather than each node being just a simple RNN cell, make each node a more complex unit with gates controlling what information is passed through.
simple RNN cell   vs   gated cell (LSTM, GRU, etc.)
solution #3: more on LSTMs
[diagram: an LSTM cell carrying the state from sj to sj+1 through three gated steps]
● forget irrelevant parts of the previous state
● selectively update the cell state values
● output certain parts of the cell state
why do LSTMs help?
1. the forget gate allows information to pass through unchanged
→ when taking the derivative, f′ is 1 for what we want to keep!
2. sj depends on sj−1 through addition!
→ when taking the derivative, there are no long products of small W terms!
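A minimal sketch of swapping in an LSTM cell with the TensorFlow 1.x-era API this tutorial targets (the shapes, names, and the final dense layer are illustrative assumptions):

import tensorflow as tf

seq_len, input_dim, hidden_dim = 20, 50, 128

# inputs: [batch, time, features]
x = tf.placeholder(tf.float32, (None, seq_len, input_dim))

# a gated cell with forget / update / output gates
cell = tf.contrib.rnn.BasicLSTMCell(hidden_dim)

# unroll the cell across time; outputs: [batch, time, hidden_dim]
outputs, final_state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)

# e.g. use the last output for a sentiment-style classification
logits = tf.layers.dense(outputs[:, -1, :], 2)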
in practice: machine translation.
basic encoder-decoder model:
[diagram: an encoder RNN reads “the dog eats”; its final state s2 seeds a decoder RNN that emits “le chien mange <end>”, feeding each emitted word back in as the next input]
add LSTM cells:
[diagram: the same encoder-decoder model, with the simple RNN cells replaced by LSTM cells]
problem: a fixed-length encoding is limiting
[diagram: the encoder-decoder model]
all the decoder knows about the input sentence is in one fixed-length vector, s2
solution: attend over all encoder states
[diagram: at every decoding step, the decoder computes a weighted combination s* of all encoder states s0, s1, s2 and conditions on it, instead of on the single final state]
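A small NumPy sketch of one way to form the combined state s* at a decoding step (simple dot-product scoring; this particular scoring function is an assumption, not necessarily what the slides used):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

hidden = 4
encoder_states = np.random.randn(3, hidden)   # s0, s1, s2
decoder_state = np.random.randn(hidden)       # current decoder state

scores = encoder_states @ decoder_state       # one score per encoder state
weights = softmax(scores)                     # attention weights, sum to 1
s_star = weights @ encoder_states             # s*: weighted mix of all encoder states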
now we can model sequences!
● why recurrent neural networks?
● building models for language, classification, and machine translation
● training them with backpropagation through time
● solving the vanishing gradient problem with activation functions,
initialization, and gated cells (like LSTMs)
● using attention mechanisms
and there’s lots more to do!
● extending our models to timeseries + waveforms
● complex language models to generate long text or books
● language models to generate code
● controlling cars + robots
● predicting stock market trends
● summarizing books + articles
● handwriting generation
● multilingual translation models
● … many more!
Using TensorFlow
Deep Learning Frameworks
● GPU Acceleration
● Automatic Differentiation
● Code Reusability + Extensibility
● Speed up Idea -> Implementation
What's out there?
What is a Tensor?
● TensorFlow tensors are very similar to NumPy ndarrays
https://cs224d.stanford.edu/lectures/CS224d-Lecture7.pdf
TensorFlow Basics
● Create a session
● Define a computation graph
● Feed your data in, get results out
Sessions
● Encapsulates environment to run graph
● How to create the session
import tensorflow as tf
session = tf.InteractiveSession()
or
session = tf.Session()
What is a graph?
● Encapsulates the computation you want to perform
What are graphs made of?
● Placeholders (aka Graph Inputs)
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
What are graphs made of?
● Constants
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
k = tf.constant(1.0)
What are graphs made of?
● Operations
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
k = tf.constant(1.0)
c = tf.add(a, b)
d = tf.subtract(b, k)
e = tf.multiply(c, d)
How do we run the graph?
● Select nodes to evaluate
● Specify values for placeholders
session.run(e, feed_dict={a:2.0, b:0.5})
>>> -1.25
session.run(c, feed_dict={a:2.0, b:0.5})
>>> 2.5
session.run([e,c], feed_dict={a:2.0, b:0.5})
>>> [-1.25, 2.5]
Building a Neural Network Graph
● The previous graph performed a constant computation
● Network weights need to be mutable
● Enter: tf.Variable
tf.Variable: Initialization
● Can initialize to specific values
b1 = tf.Variable(tf.zeros((2,2)), name="bias")
● Can initialize to random values
w1 = tf.Variable(tf.random_normal((2,2)), name="w1")
Building a Neural Network Graph
n_input_nodes = 2
n_output_nodes = 1
x = tf.placeholder(tf.float32, (None, 2))
y = tf.placeholder(tf.float32, (None, 1))
W = tf.Variable(tf.random_normal((n_input_nodes, n_output_nodes)))
b = tf.Variable(tf.zeros(n_output_nodes))
z = tf.matmul(x, W) + b
out = tf.sigmoid(z)
Adding a loss function
n_input_nodes = 2
n_output_nodes = 1
x = tf.placeholder(tf.float32, (None, 2))
y = tf.placeholder(tf.float32, (None, 1))
W = tf.Variable(tf.random_normal((n_input_nodes, n_output_nodes)))
b = tf.Variable(tf.zeros(n_output_nodes))
z = tf.matmul(x, W) + b
out = tf.sigmoid(z)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=z, labels=y))
Add an optimizer: SGD
learning_rate = 0.02
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(logits=z, labels=y))
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate).minimize(loss)
sess.run(optimizer, feed_dict={x: inputs, y:labels})
Run the graph
● Feed in training data in batches
● Each run of the graph updates the variables
○ SGD applies an op to all variables
● Feed in dev/test data to evaluate
○ Do not fetch the train op
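A minimal sketch of such a loop, assuming the x, y, loss, and optimizer nodes defined above plus NumPy arrays train_x, train_y, dev_x, dev_y (the batching code and these array names are illustrative, not from the lab):

batch_size = 32
session.run(tf.global_variables_initializer())

# training: fetching the optimizer (train op) updates the variables
for epoch in range(10):
    for start in range(0, len(train_x), batch_size):
        batch_x = train_x[start:start + batch_size]
        batch_y = train_y[start:start + batch_size]
        session.run(optimizer, feed_dict={x: batch_x, y: batch_y})

# evaluation: fetch only the loss (not the train op), so no update happens
dev_loss = session.run(loss, feed_dict={x: dev_x, y: dev_y})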
Useful Features of TensorFlow
TensorBoard: Model Visualization
TensorBoard: Logging
How to use TensorBoard
● Write to TensorBoard using summary logs
Open your TensorBoard with the terminal command:
tensorboard --logdir=path/to/log-directory
Summary Logs
● Summaries are operations! So just part of the graph:
loss_summary = tf.summary.scalar('loss', loss)
● Summary writers save summaries to a log file
summary_writer = tf.summary.FileWriter('logs/', session.graph)
● Summaries are operations - so just run them!
pred, summary = sess.run([out, loss_summary], feed_dict={
    x: inputs, y: labels})
summary_writer.add_summary(summary, global_step)
Name Scoping
with tf.variable_scope("foo"):
    with tf.variable_scope("bar"):
        v = tf.Variable(tf.zeros([1]), name="v")
v.name
>>> "foo/bar/v:0"
Sharing weights: tf.get_variable()
with tf.variable_scope("foo"):
    with tf.variable_scope("bar"):
        v = tf.get_variable("v", [1])
v.name
>>> "foo/bar/v:0"
Why share weights?
● Imagine we want to learn a feature detector that we run over multiple inputs,
and aggregate features and produce a prediction, all in 1 graph
● Need to share the weights to ensure:
○ A shared, single representation is learned
○ Gradients get propagated for all inputs
Attempt 1
def cnn_feature_extractor(image):
    ...
    with tf.variable_scope("feature_extractor"):
        v = tf.Variable(tf.zeros([1]), name="v")
        ...
        features = tf.nn.relu(h4)
    return features

feat_1 = cnn_feature_extractor(image_1)
feat_2 = cnn_feature_extractor(image_2)
pred = predict(feat_1, feat_2)
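For contrast, a sketch of how sharing can actually be achieved with tf.get_variable and scope reuse under the TF 1.x variable-scope API (the conv layer and all names here are illustrative; the reuse=True on the second call is the key point):

def cnn_feature_extractor(image, reuse=False):
    with tf.variable_scope("feature_extractor", reuse=reuse):
        # tf.get_variable returns the *same* variable on every reused call
        w = tf.get_variable("w", shape=[5, 5, 3, 16],
                            initializer=tf.random_normal_initializer())
        h = tf.nn.conv2d(image, w, strides=[1, 1, 1, 1], padding="SAME")
        return tf.nn.relu(h)

feat_1 = cnn_feature_extractor(image_1)              # creates the variables
feat_2 = cnn_feature_extractor(image_2, reuse=True)  # reuses the same weights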
Name Scoping for cleaner code
● Networks often re-use similar structures, and it gets tedious to write each of them out by hand

def make_layer(input, input_size, output_size, scope_name):
    with tf.variable_scope(scope_name):
        W = tf.get_variable("w", initializer=tf.random_normal((input_size, output_size)))
        b = tf.get_variable("b", initializer=tf.zeros(output_size))
        z = tf.matmul(input, W) + b
        return z
Name Scoping for cleaner code
● Networks often re-use similar structures, and it gets tedious to write each of them out by hand
...
input = ...
h0 = make_layer(input, 10, 20, "h0")
h1 = make_layer(h0, 20, 20, "h1")
...
tf.get_variable("h0/w")
tf.get_variable("h1/b")
Name Scoping Makes for Clean Graph Visualizations
Checkpointing + Saving Models
# Create a saver.
saver = tf.train.Saver(...variables...)
# Launch the graph and train, saving the model every 1,000 steps.
sess = tf.Session()
for step in range(1000000):
sess.run(..training_op..)
if step % 1000 == 0:
# Append the step number to the checkpoint name:
saver.save(sess, 'my-model', global_step=step)
Loading Models
# Add ops to save and restore all the variables.
saver = tf.train.Saver()
# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
# Restore variables from disk.
saver.restore(sess, "/tmp/model.ckpt")
print("Model restored.")
# Do some work with the model
TensorFlow as core of other Frameworks
● Keras, TFLearn, TF-Slim, and others are all built on TensorFlow
● Research often means tinkering with inner workings, so it is worthwhile to understand the core of any framework you are using
TensorFlow Tutorial:
- Pair up (groups of 2)
- Go to https://github.com/yala/introdeeplearning
- Follow the install instructions
- Hint for Lab 2: fix map(lambda...) to list(map(lambda...))
- If you need help, hop on the HelpQ: http://deepqueue.herokuapp.com/
  - Click “Log in with GitHub”
  - (or just raise your hand / come down to the front)