Sequence Models For NLP
Sequences are everywhere!
• Language: "the clouds are in the ____" → Sky? Ground? Garage? Airport? Oven?
• Image/Video
Language Modeling
• "the clouds are in the Sky" (not Ground, Garage, Airport, or Oven)
• Sequential by nature!
As the sequence length n increases, each n-gram becomes more "unique", leading to more entries in the n-gram Counter table. If n is small, n-grams are more "repeated", so there are fewer unique entries.
Let’s code
N-gram SLM
https://colab.research.google.com/drive/1cq-FKtD8tJpaoEODzZM6dm6hWaLUEyGk?usp=sharing
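A minimal Python sketch of the n-gram Counter idea described above (not the notebook's code; the toy corpus and variable names are made up for illustration):

```python
from collections import Counter, defaultdict

corpus = "the clouds are in the sky and the sun is in the sky".split()
n = 2  # bigram model

# Count n-grams: map an (n-1)-word context -> Counter of next words
counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    context, nxt = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
    counts[context][nxt] += 1

# P(next | context) by simple relative frequency
context = ("the",)
total = sum(counts[context].values())
probs = {w: c / total for w, c in counts[context].items()}
print(probs)   # larger n -> more unique contexts -> a sparser Counter table
```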
Neural Language Models (NLM)
NLM with BoW vectors
How to represent words to a NN?
Bag-of-Words model
Word Embeddings
[Architecture diagram: word index 1 … word index N → one Embedding layer per position, all sharing the same weight matrix W → Dense (W2) → Softmax (Wcls)]
ALL Embedding layers → same W?
BoW vectors model
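A hedged Keras sketch of the BoW-vectors NLM in the diagram above: one shared embedding matrix W for every word position, the word vectors bagged together, then Dense (W2) and Softmax (Wcls). The sizes (vocab_size, context_len, 64, 128) are illustration values, not from the slides:

```python
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.models import Sequential

vocab_size, context_len = 10_000, 5    # illustration values

model = Sequential([
    Input(shape=(context_len,)),
    # One shared Embedding layer = the same weight matrix W applied to every word index
    Embedding(vocab_size, 64),
    # "Bag" the word vectors: order is thrown away here -> sequence info is lost
    GlobalAveragePooling1D(),
    Dense(128, activation="relu"),              # the Dense (W2) block
    Dense(vocab_size, activation="softmax"),    # Softmax (Wcls): next-word probabilities
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```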
What's wrong with BoW?
A soup of word vectors → sequence info is lost!
Recurrent Neural Networks
The natural selection for sequence models
Sequential process prediction depends on a history factor:
• Discrete state: emission and transition pdf's are "assumptions", but computationally effective so far; used in context modeling
• Kalman filter
• Discriminative: CRF, recurrent neural nets
Exploration vs. Exploitation
• All forms depend on two terms:
• Exploitation: use of known information = history = old state term = forget gate
• Exploration: use of new information = inclusion of the input = input gate
• It has to be learnt!
Recurrent Neural Nets
What makes RNN special vs. DNN/CNN?
• A glaring limitation of Vanilla Neural Networks (and also
Convolutional Networks) is that their API is too constrained: they
accept a fixed-sized vector as input (e.g. an image) and
produce a fixed-sized vector as output (e.g. probabilities of
different classes).
• Not only that: These models perform this mapping using a fixed
amount of computational steps (e.g. the number of layers in
the model).
• The core reason that recurrent nets are more exciting is that they
allow us to operate over sequences of vectors: Sequences in
the input, the output, or in the most general case both.
• One-to-One: vanilla mode of processing without an RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
• Prediction problems
• Planning problems
• RL = evaluate the utility of a state-action pair and take the action that maximizes the expected total future payoff (reward) from source to goal
Sentence Embedding with Sequence models
Supervised Learning Model Design Pattern
[Diagram: Features → Wfeatures → Decision → Wcls → Output]
For what?
- Classification
- Many-to-one: Seq2Class
- Sentiment analysis, Toxicity detection (JIGSAW), "Real or Not?" Disaster Tweets
- Dialogue:
- Many-to-many: Seq2Seq → unaligned case → more on that later
- MT, spelling correction, speech, OCR, etc.
NLP models meta-architectures
Just like in CV: Encoder-Decoder
- Seq2Class:
- Encoder = word-vectors aggregation (How?)
- Decoder = None (just a classifier = softmax)
- Analogy to CV: Encoder-Softmax (AlexNet, VGG, etc.)
- Seq2Seq:
- Encoder = word-vectors aggregation (How?)
- Decoder = multiple-words generation (How?)
- Analogy to CV: Encoder-Decoder in semantic segmentation. But in semantic segmentation we have aligned many-to-many, while in NLP we have unaligned sequences → challenges in annotation, model design, when to stop, position encoding, etc.
- Word vectors are the input to all the above meta-architectures:
- Unlike in CV, where pixels are already digitized
- Also, in NLP order matters! = Context
Characters generator
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Characters generator
Backprop
• Error at the softmax output: Error_Softmax = y_i − t_i (no f' of the activation, but still y_i − t_i; review the intro and workshop).
• t_i = 1 for target[t] and 0 otherwise (one-hot encoding: the target gets 1, the others 0's).
• y_i = ps[t], e.g. ps[t] = [0.1 0.02 … 0.7 …]; target[t] is an index into the vocabulary (e.g. 1..28).
• dy[target[t]] = error = ps[t][target[t]] − 1
• dy[otherwise] = error = ps[t][otherwise] − 0
• dWhy = alpha * error * input
• The error backpropagated to Wxh and Whh comes from the output + the next time step (dhnext).
• Notice: we want to push ps[t][target[t]] → 1 and the others → 0. Loss = negative log likelihood = −Σ t_i·log(y_i); since all t_i are 0's except target[t] = 1, the loss reduces to −log(ps[t][target[t]]).
• Notice: we start from the end of the sequence and backpropagate the error from that point back. At the final node there is no dhnext; at the first node, hs[t−1] is the initial hidden state.
• Notice: past the softmax layer, backprop of the errors shall include the non-linearity f'.
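A minimal numpy sketch of the softmax-gradient step described in these notes. Variable names (ps, h, Why, dh) follow the style of Karpathy's min-char-rnn, but this is only a one-step illustration, not the full training loop:

```python
import numpy as np

# Toy forward/backward pass for a single time step of a char-level RNN output layer.
vocab_size, hidden_size = 28, 16
rng = np.random.default_rng(0)
Why = rng.normal(0, 0.01, (vocab_size, hidden_size))
h = np.tanh(rng.normal(size=(hidden_size, 1)))       # hidden state hs[t]
target = 5                                           # target[t]: index into the vocab

y = Why @ h                                          # unnormalized scores
ps = np.exp(y) / np.sum(np.exp(y))                   # softmax probabilities ps[t]
loss = -np.log(ps[target, 0])                        # negative log likelihood

dy = np.copy(ps)
dy[target] -= 1                                      # error = y_i - t_i (no f' here)
dWhy = dy @ h.T                                      # gradient w.r.t. Why = error * input
dh = Why.T @ dy                                      # error flowing back into the hidden state
dhraw = (1 - h * h) * dh                             # past the tanh: now include the non-linearity f'
```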
Essays generation
tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e
plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng
100 epochs: at least it is starting to get an idea about words separated by spaces.
Except sometimes it inserts two spaces. It also doesn't know that a comma is almost always followed by a space.
we counter. He stutn co des. His stanted out one ofler that concossions and
was to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn
500 epochs: The model has now learned to spell the shortest and most common words such as "we", "He", "His", "Which", "and", etc.
Aftair fall unsuch that the hall for Prince Velzonski's that me of her hearly,
and behs to so arwage fiving were to it beloge, pavu say falling misfort how,
and Gogition is so overelical and ofter.
700 epochs: English-Like !!
Essays generation
"Kite vouch!" he repeated by her door. "But I would be done and quarts, feeling,
then, son is people...."
1200 epochs: we're now seeing use of quotations and question/exclamation marks. Longer words have now been learned as well
"Why do what that day," replied Natasha, and wishing to himself the fact the princess,
Princess Mary was easier, fed in had oftened him. Pierre aking his soul came to the
packs and drove up his father-in-law women.
2000 epochs: we start to get properly spelled words, quotations, names, and so on
The picture that emerges is that the model first discovers the general word-space structure and
then rapidly starts to learn the words; First starting with the short words and then eventually the
longer ones. Topics and themes that span multiple words (and in general longer-term
dependencies) start to emerge only much later.
Wikipedia
Markdown
Algebra Lemmas (LaTeX)
• Nice music!!
• https://soundcloud.com/optometrist-prime/recurrence-music-written-by-a-recurrent-neural-network
Images generator
• Input Monet drawings (50 images)
Arabic
Chinese
"I grew up in France… [LONG_DOC] I speak fluent ____." → French? English? German?
How to pay "attention" to words from very OLD context?
Long Short-Term Memory (LSTM)
• Sometimes, we only need to look at recent information to
perform the present task.
• What to forget/keep/propagate vs. what to add from the new input
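A minimal numpy sketch of one LSTM step, written out to make the forget-gate vs. input-gate split visible. The weight names (Wf, Wi, Wc, Wo) and sizes are illustrative, not a library API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    z = np.concatenate([h_prev, x])           # [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)                  # forget gate: what to keep/propagate
    i = sigmoid(Wi @ z + bi)                  # input gate: what to add from the new input
    c_tilde = np.tanh(Wc @ z + bc)            # candidate new cell content
    c = f * c_prev + i * c_tilde              # new cell (long-term) state
    o = sigmoid(Wo @ z + bo)                  # output gate
    h = o * np.tanh(c)                        # new hidden (short-term) state
    return h, c

# tiny usage example with random parameters
hidden, inp = 4, 3
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.normal(0, 0.1, (hidden, hidden + inp)), np.zeros(hidden)]
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), params)
```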
Bi-directional LSTM
Let’s code
LSTM
https://colab.research.google.com/drive/1dXeClcTIaFqG3UGmZrdDjPktte6TZ5si#scrollTo=Y9xddpbZJrjm
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
Conventions: Time-series tensors
3D tensors of shape (samples, timesteps, features); samples could be the batch (size = batch_size).
Whenever time matters in your data (or the notion of sequence order), it
makes sense to store it in a 3D tensor with an explicit time axis. Each
sample can be encoded as a sequence of vectors (a 2D tensor), and thus a
batch of data will be encoded as a 3D tensor
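A tiny numpy example of the convention (the sizes 32/100/64 are made up):

```python
import numpy as np

# A batch of 32 sentences, each padded/truncated to 100 timesteps,
# each timestep represented by a 64-dimensional feature vector (e.g. an embedding).
batch = np.zeros((32, 100, 64), dtype="float32")   # (samples, timesteps, features)
print(batch.shape)      # (32, 100, 64)
print(batch[0].shape)   # one sample = a sequence of vectors, shape (100, 64)
```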
LSTM return_sequences
• A MUST for stacked LSTMs! Otherwise, use TimeDistributed (NOT preferred).
- h = output
- c = internal state
- h = lstm1
- h = state_h (again!) → why?
- c = state_c
LSTM return_state and return_sequences
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
Verify: the last timestep of lstm1 = h (state_h)
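A minimal Keras sketch in the spirit of the machinelearningmastery example linked above, to verify that the last timestep of lstm1 equals state_h (the toy input values are made up):

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

inputs = Input(shape=(3, 1))                       # (timesteps=3, features=1)
lstm1, state_h, state_c = LSTM(2, return_sequences=True, return_state=True)(inputs)
model = Model(inputs=inputs, outputs=[lstm1, state_h, state_c])

data = np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
seq, h, c = model.predict(data)
print(seq.shape)       # (1, 3, 2): the h output at every timestep
print(seq[:, -1, :])   # last timestep of lstm1 ...
print(h)               # ... equals state_h (again!)
print(c)               # state_c = the internal cell state
```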
NLM with RNN
Let’s code
https://colab.research.google.com/drive/16XcUKh2eWIqIwC3vnnbKUdqkpHqY99LB?usp=sharing
Characters generator = Char level NLM
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Word level NLM
Let’s code
Word level NLM
https://colab.research.google.com/drive/1WRG86sVhXxI3bADagQOy6DdNVlog_COo?usp=sharing
https://colab.research.google.com/drive/1qQAqtDK6b9E27Dzu38hBB4cV6QVtuNg0?usp=sharing
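A hedged Keras sketch of a word-level NLM (the linked notebooks differ in details; vocab_size, seq_len and the layer sizes here are illustration values):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

vocab_size, seq_len = 10_000, 20      # illustration values

model = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 64),                  # word index -> word vector
    LSTM(128),                                  # summarize the context into one vector
    Dense(vocab_size, activation="softmax"),    # next-word distribution over the vocab
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(X, y, ...) with X: (samples, seq_len) word indices, y: next-word index
```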
RNN for text classification
Let’s code
LSTM-IMDB Keras dataset
https://colab.research.google.com/drive/1Qllb_h_kuwv8PaIHPZDr2eC4Zo5ljr5b?usp=sharing
https://colab.research.google.com/drive/1WOUuqsfPZXurupMxlExvScB3ivSiVaV9?usp=sharing
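A hedged Keras sketch of the LSTM-IMDB classifier (many-to-one Seq2Class); the hyperparameters are illustration values, not necessarily the notebook's:

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

max_features, max_len = 20_000, 200        # vocabulary cap and padded sequence length

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

model = Sequential([
    Embedding(max_features, 128),
    LSTM(64),                                  # many-to-one encoder over the review
    Dense(1, activation="sigmoid"),            # binary sentiment decision
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_test, y_test))
```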
Conv1D for sequence models
What’s wrong with RNN?
It’s sequential!
CNNs are good at summarizing an image into one feature vector → How?
● Extraction = matching
Following:
- In Conv2D: we concatenate, on the channel dim, the 2D output of each filter ⇒ 3D output
- In Conv1D: we concatenate, on the channel dim, the 1D output of each filter ⇒ 2D output
How to get one 1D vector per sentence?
Conv1D + GlobalMaxPool1D example
[Diagram: input shape = (sent_len = max_len after padding, inp_emb_sz); every conv kernel gives one output due to max pooling over time ⇒ over all input words; output embedding size = #out_kernels]
https://colab.research.google.com/drive/1zazDWpg6JeWUe8lxXp4dY61k5QzfuKMA?usp=sharing
https://colab.research.google.com/drive/17iOxqMW-MT36RCLm7N31smxylk9buAGK?usp=sharing
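A hedged Keras sketch of the Conv1D + GlobalMaxPool1D idea above; after max pooling over time, every kernel contributes one number, so the sentence vector size = #out_kernels. The sizes are illustration values:

```python
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Sequential

vocab_size, max_len, inp_emb_sz, out_kernels = 20_000, 100, 64, 128   # illustration values

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, inp_emb_sz),             # (max_len, inp_emb_sz)
    Conv1D(out_kernels, kernel_size=5, activation="relu"),  # slide each kernel over the words
    GlobalMaxPooling1D(),     # max over time: one value per kernel -> vector of size out_kernels
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```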
CNN-LSTM → spatio-temporal
ConvLSTM → built-in spatio-temporal
Let’s code
CNN-LSTM-IMDB Keras dataset
https://colab.research.google.com/drive/1zazDWpg6JeWUe8lxXp4dY61k5QzfuKMA?usp=sharing
https://colab.research.google.com/drive/17iOxqMW-MT36RCLm7N31smxylk9buAGK?usp=sharing
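A hedged Keras sketch of the CNN-LSTM idea for the IMDB task: Conv1D extracts local n-gram-like features, max pooling shortens the sequence, and the LSTM models the remaining temporal order. Hyperparameters are illustration values, not necessarily the notebook's:

```python
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, LSTM, Dense
from tensorflow.keras.models import Sequential

max_features, max_len = 20_000, 200     # illustration values

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(max_features, 128),
    Conv1D(64, kernel_size=5, activation="relu"),  # local n-gram-like feature extraction
    MaxPooling1D(pool_size=4),                     # shorten the sequence the LSTM has to read
    LSTM(70),                                      # model the (shorter) temporal sequence
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```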
Recursive Auto Encoders
(optional)
Compositionality modeling
• Spatial
• Image
• CNN
• Distributed representations
• No control
• No prior templates for context. Ex: parse trees
• The only template is the 2D-map spatial relation: every pixel is affected by its neighbors
http://www.slideshare.net/roelofp/2014-1021-sicsdlnlpg
Compositionality modeling
● Hierarchical
■ Recursive
■ RNN
■ RNTN
■ LSTM NN
○ Prior templates for context can be imposed
http://www.kdnuggets.com/2014/08/interview-pedro-domingos-master-algorithm-new-deep-learning.html
How to get the parsing order?
- Parser
- Stanford
- Expensive language resource (a problem for low-language-resource (LLR) languages, e.g. Arabic)
- RAE
- Unsupervised
- Sub-optimal
- Expensive computations
Recursive Auto Encoder
Recursive models: advantages vs. disadvantages