
Sequence models for NLP

Dr. Ahmad El Sallab


AI Senior Expert
Agenda
- Why sequence models?
- Statistical Language Models
- Recurrent NN
- Neural Language Models
- RNN for text classification
- What’s wrong with RNN?
- Other sequence models: CNN1D
- CNN-LSTM models
Why Sequence models?
Sequences are everywhere! Speech, text, image/video.

Example: "the clouds are in the ___" → candidate completions: Sky, Ground, Garage, Airport, Oven.

Context is all that matters!


Statistical Language Models (SLM)

LM to train embeddings: source task (LM) → embedding table → my task.

Language Modeling

"the clouds are in the ___" → Sky / Ground / Garage / Airport / Oven. Sequential by nature!

As the sequence length n increases, each n-gram becomes more "unique", leading to more entries in the n-gram counter table. If n is small, n-grams are more "repeated", so there are fewer unique entries.
Let’s code
N-gram SLM

https://colab.research.google.com/drive/1cq-FKtD8tJpaoEODzZM6dm6hWaLUEyGk?usp=sharing
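For reference, a minimal n-gram counter sketch of the idea above (a toy corpus is assumed; this is not the notebook's code):

from collections import Counter, defaultdict

def train_ngram_lm(tokens, n=3):
    counts = defaultdict(Counter)              # (n-1)-word context -> next-word counts
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def predict_next(counts, context):
    return counts[tuple(context)].most_common(1)   # most frequent continuation

tokens = "the clouds are in the sky and the sun is in the sky".split()
lm = train_ngram_lm(tokens, n=3)
print(predict_next(lm, ["in", "the"]))             # [('sky', 2)]

Raising n makes each context rarer, which is exactly the growth in unique table entries described above.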
Neural Language Models
(NLM)
NLM with BoW vectors

How to represent words to a NN?
- Bag-of-Words model
- Word Embeddings

BoW vectors model: word index 1 ... word index N → Embedding (W, shared: ALL embedding layers use the same W) → Dense (W2) → Softmax (Wcls)
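A possible Keras sketch of this BoW neural LM (vocab size, context size, and embedding dimension are placeholder values; a single shared Embedding layer plays the role of the repeated embedding blocks with the same W):

from tensorflow.keras import layers, models

vocab_size, context_size, emb_dim = 10000, 4, 64            # placeholder values

inp = layers.Input(shape=(context_size,))                   # context word indices
emb = layers.Embedding(vocab_size, emb_dim)(inp)            # ONE shared embedding table W
bow = layers.GlobalAveragePooling1D()(emb)                  # order-less "soup" of word vectors
hid = layers.Dense(128, activation='relu')(bow)             # Dense (W2)
out = layers.Dense(vocab_size, activation='softmax')(hid)   # Softmax (Wcls) over the next word

model = models.Model(inp, out)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model.summary()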
What’s wrong with BoW?
A soup of word vectors → sequence info is lost!
Recurrent Neural Networks
Natural Selection for sequence models
Sequential process prediction
History factor

What if we introduce time?


Do we need to worry about old times?
The Markov Assumption and Markov Decision
Process (MDP)
• In Chess, you only need to worry about the current state
of the game to take the next action
Bayes filters
• True state is hidden
• Only partial states are observable
• "Filter" the noise in the observed measurement to obtain the "de-noised" state
Robot Localization
Recursive State estimation
• State space equations in control
• Continuous states

• Probabilistic State Estimation


• Bayes
• Gaussian
• Kalman=Gaussian + Linear (State space equations + randomness
to be probabilistic)
Hidden Markov Model
• Well known in speech

• Discrete state
Kalman filter
Exploration vs. Exploitation
• All forms depend on two terms
• Exploitation: use of known information = history = old state term = forget gate
• Exploration: use of new information = inclusion of input = input gate
• All comes from the MDP formulation
Is Kalman/Bayes Filter Enough?
• How to set the motion model?
• How to set the measurement/sensor model?
• It has to be learnt! (Emission and transition pdf's are so far "assumptions")
• But computationally effective
• Bad at remembering history
• Due to the MDP assumption
Partially observable MDP (POMDP)
• In Chess, you only need to worry about the current state of the game to take the next action
• In card games (Estimation, Poker, ...etc), the current state of the game is not sufficient
• You need to be aware of which cards were previously played to take the correct actions (or you estimate ;)
Occlusion in Automotives
Is Kalman/Bayes Filter Enough?

- MDP is an assumption; dependency could be on longer history
- Emission and transition pdf's are "assumptions"
Sequence modeling in DL
• Generative
• HMM
• Recursive auto encoders

• Discriminative
• CRF
• Recurrent neural nets

• Used in context modeling
Recurrent Neural Nets
What makes RNN special vs. DNN/CNN?
• A glaring limitation of Vanilla Neural Networks (and also
Convolutional Networks) is that their API is too constrained: they
accept a fixed-sized vector as input (e.g. an image) and
produce a fixed-sized vector as output (e.g. probabilities of
different classes).

• Not only that: These models perform this mapping using a fixed
amount of computational steps (e.g. the number of layers in
the model).

• The core reason that recurrent nets are more exciting is that they
allow us to operate over sequences of vectors: Sequences in
the input, the output, or in the most general case both.
What makes RNN special vs. DNN/CNN?
• One-to-one: Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
• One-to-many: Sequence output (e.g. image captioning takes an image and outputs a sentence of words).
• Many-to-one: Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment).
• Many-to-many: Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French).
• Synced sequence input and output (e.g. video classification where we wish to label each frame of the video).

Notice that in every case there are no pre-specified constraints on the sequence lengths, because the recurrent transformation (green) is fixed and can be applied as many times as we like.
DNN = Function approximation
RNN = Program approximation

If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.

What is a program anyway?
- A sequence of operations
- Performed in order
- Result of one affecting the next
== Recurrence!
Problems the RNN solves = The Oracle
• Classification problems
• DNN = Associative mapping of input/output
• = Human recognition (develops at the youngest age) 🡪 Easiest
• In automotive = Environment awareness and object detection
• As hard as a human's understanding of the traffic system
• Prediction problems
• RNN = Information aggregation/integration over time to predict the next state
• = Human anticipation (develops at an older age) 🡪 Harder
• In automotive = Mapping, environment state estimation, and object tracking
• Planning problems
• RL = Evaluate the utility of a state-action pair and take the action that maximizes the expected total future payoff (reward) from source to goal
Sentence Embedding with Sequence models

Supervised Learning Model Design Pattern:
Input → Features (Wfeatures) → Decision (Wcls) → Output

- Features: Manual = Hand crafted = Engineered = ML, or Auto = NN = DL
Encoder-Decoder pattern
What is the best representation of language?
This question summarizes all NLP efforts!

For what?

- Classification
- Many-to-one: Seq2Class
- Sentiment analysis, Toxicity detection (JIGSAW), Real or not? Disaster tweets
- Dialogue:
- Many-to-many: Seq2Seq → Unaligned case → More on that later
- MT, Spelling correction, Speech, OCR,...etc
NLP models meta-architectures
Just like in CV: Encoder-Decoder

- Seq2Class:
- Encoder = words vectors aggregation (How?)
- Decoder = None (just classifier=softmax)
- Analogy to CV: Encoder-Softmax (AlexNet, VGG,...etc)
- Seq2Seq:
- Encoder = words vectors aggregation (How?)
- Decoder = multiple words generation (How?)
- Analogy to CV: Encoder-Decoder in semantic segmentation. But in SS, we have aligned
many2many, while in NLP, we have unaligned sequences→ challenge in annotation, model,
when to stop, position encoding...etc
- Word vectors are the input to all the above meta-architectures:
- Unlike in CV, where pixels are already digitized
- Also, in NLP order matters! = Context
Characters generator

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Characters generator
Backprop

- Error after Softmax: error_i = y_i - t_i (no f' of the activation, but still y_i - t_i; review the intro and workshop).
- t_i = 1 for target[t] and 0 otherwise (1-of-K = one-hot encoding, others 0's); y_i = ps[t].
- dy[target[t]] = error = ps[t][target[t]] - 1
- dy[otherwise] = error = ps[t][otherwise] - 0
- Example: ps[t] = [0.1 0.02 ... 0.7 ...], target[t] = 1..28
- dWhy = alpha * error * input
- The error backpropagated to Wxx and Wxh comes from the output plus the next time step (dhnext).
- We want to max ps[t][target[t]] = 1, others 0's.
- Loss = negative log likelihood = sigma(t_i * log(y_i)). Since all t_i are 0's except target[t] = 1, y_i = ps[t][target[t]].
- Notice that we start from the end of the sequence and backpropagate the error from this point back. At the final node there is no dhnext; at the first node, hs[t-1] is the initial hidden state.
- Notice: after the softmax layer, all backprop of errors shall include the non-linearity f'.
Essays generation
tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e
plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng
100 epochs: At least it is starting to get an idea about words separated by spaces.
Except sometimes it inserts two spaces. It also doesn't know that comma is almost always followed by a space.

"Tmont thithey" fomesscerliund Keushey.


300 epochs: Quotes and periods are learnt

we counter. He stutn co des. His stanted out one ofler that concossions and
was to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn
500 epochs: The model has now learned to spell the shortest and most common words such as "we", "He", "His", "Which", "and", etc.

Aftair fall unsuch that the hall for Prince Velzonski's that me of her hearly,
and behs to so arwage fiving were to it beloge, pavu say falling misfort how,
and Gogition is so overelical and ofter.
700 epochs: English-Like !!
Essays generation
"Kite vouch!" he repeated by her door. "But I would be done and quarts, feeling,
then, son is people...."
1200 epochs: we're now seeing use of quotations and question/exclamation marks. Longer words have now been learned as well

"Why do what that day," replied Natasha, and wishing to himself the fact the princess,
Princess Mary was easier, fed in had oftened him. Pierre aking his soul came to the
packs and drove up his father-in-law women.
2000 epochs: we start to get properly spelled words, quotations, names, and so on

The picture that emerges is that the model first discovers the general word-space structure and
then rapidly starts to learn the words; First starting with the short words and then eventually the
longer ones. Topics and themes that span multiple words (and in general longer-term
dependencies) start to emerge only much later.
Wikipedia
Markdown
Algebra Lemmas (Latex)

Valid Latex Syntax!!


Linux Source Code (Input = Linux Repo, 400MB)

Compiling!!

The RNN inserts comments every now and then!
Music generator
• Input is a GuitarPro Notes in ASCII

• Output is ASCII too 🡪 Play

• Nice music!!
• https://soundcloud.com/optometrist-prime/recurrence-music-written
-by-a-recurrent-neural-network
Images generator
• Input Monet drawings (50 images)

• Output: Monet style drawings!!


• Output evolution every 100 iterations
Issues with Vanilla RNN

Exploration vs. Exploitation
- When to act quickly based on the current input?
- When to consider long history?
- Can we have access over longer history and control what to use and what to drop?

• All forms depend on two terms
• Exploitation: use of known information = history = old state term = forget gate
• Exploration: use of new information = inclusion of input = input gate
• All comes from the MDP formulation
Long term dependency
Short term dependency
Partially observable MDP (POMDP)
• In Chess, you only need to worry about the current state of the game to take the next action
• In card games (Estimation, Poker, ...etc), the current state of the game is not sufficient
• You need to be aware of which cards were previously played to take the correct actions (or you estimate ;)

History dependency is still controlled by the "summarizing" last hidden state.
Occlusion in Automotives
Can we have access over longer history and
control what to use and what to drop?
Vanishing gradients in RNN

BPTT "dilutes" the gradient in a very "deep" architecture in time.
Skip connections to Very Deep Arch
• Very deep networks 🡪 vanishing gradients
• Remember that the back-propagation of a sum node will replicate the input gradient with no degradation.

So they view the map F(x) := H(x) - x as some residual map. They use a skip layer connection to cast this mapping into F(x) + x = H(x). So if the residual F(x) is "small", the map H(x) is roughly the identity.

Can we do the same in time? Skip connection through time!
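A minimal Keras sketch of that residual sum node (toy layer sizes; just to illustrate F(x) + x = H(x)):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, Add

x = Input(shape=(64,))
f = Dense(64, activation='relu')(x)     # residual branch F(x)
h = Add()([x, f])                       # H(x) = F(x) + x: the sum node copies gradients to both branches
Model(x, h).summary()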
Long-Short Term Memory (LSTM)

Short context: "the clouds are in the ___" → Sky / Ground / Garage / Airport / Oven.

Long context: "I grew up in France… [LONG_DOC] I speak fluent ____." → French / English / German / Arabic / Chinese.

How to pay "attention" to words from very OLD context?
Long-Short Term Memory (LSTM)
• Sometimes, we only need to look at recent information to
perform the present task.

• For example, consider a language model trying to predict


the next word based on the previous ones.

• If we are trying to predict the last word in “the clouds are


in the sky,” we don’t need any further context – it’s pretty
obvious the next word is going to be sky.

• In such cases, where the gap between the relevant


information and the place that it’s needed is small, RNNs
can learn to use the past information.
Long-Short Term Memory (LSTM)
• But there are also cases where we need more context.
Consider trying to predict the last word in the text “I grew up
in France… I speak fluent French.”

• Recent information suggests that the next word is probably


the name of a language, but if we want to narrow down which
language, we need the context of France, from further back.

• It’s entirely possible for the gap between the relevant


information and the point where it is needed to become very
large.

• Unfortunately, as that gap grows, RNNs become unable to


learn to connect the information.
Long-Short Term Memory (LSTM)
• In theory, RNNs are absolutely capable of handling such
“long-term dependencies.”

• A human could carefully pick parameters for them to


solve toy problems of this form. Sadly, in practice, RNNs
don’t seem to be able to learn them.

• The problem was explored in depth by Hochreiter (1991)


[German] and Bengio, et al. (1994), who found some
pretty fundamental reasons why it might be difficult.

• Thankfully, LSTMs don’t have this problem!


Long-Short Term Memory (LSTM)
• Long Short Term Memory networks – usually just called
“LSTMs” – are a special kind of RNN, capable of learning
long-term dependencies.

• They were introduced by Hochreiter & Schmidhuber


(1997), and were refined and popularized by many
people in following work.

• LSTMs are explicitly designed to avoid the long-term


dependency problem. Remembering information for long
periods of time is practically their default behavior, not
something they struggle to learn!
Long-Short Term Memory (LSTM)
LSTM Step By Step: Gates
• Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation.
• The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means "let nothing through," while a value of one means "let everything through!"
• An LSTM has three of these gates, to protect and control the cell state.
LSTM Step By Step: Cell State
• The key to LSTMs is the cell
state, the horizontal line running
through the top of the diagram.

• The cell state is kind of like a


conveyor belt. It runs straight
down the entire chain, with only
some minor linear interactions.
It’s very easy for information to
just flow along it unchanged.

• The LSTM does have the ability


to remove or add information to
the cell state, carefully regulated
by structures called gates.
LSTM Step By Step: Forget gate

Which information to forget/keep/propagate (1/0 * Ct-1) from cell state Ct-1 to the next cell state Ct.
LSTM Step By Step: Input gate

- The tanh combination acts as a "regularizer" on the input: it acts as P(xt | ht-1), so that we don't keep any info from the input independent of what the previous state was.
- What new information to ADD (+Ct-1) to the next cell state Ct.
- The tanh combines the previous hidden state and the new input.
- The sigmoid input gate blocks or lets pass the info to add from the previous hidden state + input.
LSTM Step By Step: Cell State Update

- Exploitation: from the previous cell state.
- Exploration: from the new input (regulated by the previous hidden state).
LSTM Step By Step: Output gate

What to propagate from input to output (a redundancy to be treated in Gated RNNs).
LSTM Matrix Equations
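For reference, the standard LSTM gate equations in the usual notation:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)
i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)

(f_t: forget gate, i_t: input gate, \tilde{C}_t: candidate cell, C_t: cell state, o_t: output gate, h_t: hidden/output; \odot is element-wise multiplication.)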
LSTM Gates = Skip connections

- Skip connections to the rescue
- LSTM gates are skip connections over time!

LSTM as skip connection (ResNet)

In BWD → gradients are just copied at the "+" → no vanishing = Reset!
Gated Recurrent Neural Networks

- What to forget/keep/propagate
- What to add from the new input
Bi-directional LSTM
Let’s code

LSTM
https://colab.research.google.com/drive/1dXeClcTIaFqG3UGmZrdDjPktte6TZ5si#scrollTo=Y9xddpbZJrjm

https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
Conventions: Time series Tensors

3D tensors of shape (samples, timesteps, features); samples could be the batch (size = batch_size).

Whenever time matters in your data (or the notion of sequence order), it makes sense to store it in a 3D tensor with an explicit time axis. Each sample can be encoded as a sequence of vectors (a 2D tensor), and thus a batch of data will be encoded as a 3D tensor.

LSTM return sequences

return_sequences=True is a MUST for stacked LSTMs! (Otherwise, use TimeDistributed, which is NOT preferred.)
(samples, timesteps, features)

LSTM/RNN always expects a 3D tensor (samples, timesteps, features). The layer's parameters include hidden_dim and return_sequences:
- return_sequences=True ⇒ 3D output (samples, timesteps, features=hidden_dim)
- return_sequences=False ⇒ 2D output (samples, features=hidden_dim)

https://colab.research.google.com/drive/1dXeClcTIaFqG3UGmZrdDjPktte6TZ5si#scrollTo=7LRuYaCqILOn
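A small Keras sketch of these shapes (toy dimensions, assumed):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM

x = Input(shape=(10, 8))                            # (timesteps=10, features=8); batch dim is implicit
seq = LSTM(32, return_sequences=True)(x)            # (None, 10, 32) -> needed to stack another LSTM on top
last = LSTM(16, return_sequences=False)(seq)        # (None, 16) -> e.g. feeds a classifier
Model(x, last).summary()                            # summary shows the (None, 10, 32) and (None, 16) shapes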
LSTM return_state
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/

LSTM has 2 states: hidden and cell (h, c):

- h = output
- c = internal state

In Keras, if you set return_state=True, then 3 things are returned:

- h = lstm1
- h = state_h (again!) → why?
- c = state_c
LSTM return_state and return_sequences
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/

In Keras, if you set return_state=True together with return_sequences=True, then 3 things are returned:

- lstm1 → ALL h's of all time steps!
- state_h → last h
- state_c → last c

Verify: the last step of lstm1 equals state_h.
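A sketch along the lines of the linked article (toy dimensions):

import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM

inp = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inp)
model = Model(inputs=inp, outputs=[lstm1, state_h, state_c])

out, h, c = model.predict(np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1)))
print(out.shape)                 # (1, 3, 1): ALL h's
print(h, c)                      # last h (equals out[:, -1, :]) and last c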
NLM with RNN
Let’s code

LSTM Text Generation

https://colab.research.google.com/drive/16XcUKh2eWIqIwC3vnnbKUdqkpHqY99LB?usp=sharing
Characters generator = Char level NLM

http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Word level NLM
Let’s code
Word level NLM

https://colab.research.google.com/drive/1WRG86sVhXxI3bADagQOy6DdNVlog_COo?usp=sharing

Char level NLM

https://colab.research.google.com/drive/1qQAqtDK6b9E27Dzu38hBB4cV6QVtuNg0?usp=sharing
RNN for text classification
Let’s code
LSTM-IMDB Keras dataset

https://colab.research.google.com/drive/1Qllb_h_kuwv8PaIHPZDr2eC4Zo5ljr5b?usp=sharing

LSTM-IMDB Raw dataset

https://colab.research.google.com/drive/1WOUuqsfPZXurupMxlExvScB3ivSiVaV9?usp=sharing
Conv1D for sequence
models
What’s wrong with RNN?
It’s sequential!

- We cannot make use of parallelism of modern computer architectures or


GPU!

What’s good about it?

- It captures sequence information and summarizes it in a state


- Use that state for (condition output on that state):
- Classification
- Sequence generation

State = Feature of the input


NLP = Context
Image → Spatial context → Conv2D

Text → Sequence context → RNN ⇒ CNN1D (this lecture)


How to make sequence parsing fast?
Employ parallelism!

Matrix operations → Transformers (Advanced: More on that later)

Convolution operations → CNNs


ConvNets in NLP

CNNs are good at summarizing an image into one feature vector → How?

They take advantage of the spatial structure of the image.

And they are very fast → parallel (conv formula).

Can we use them to summarize sequential structure?


ConvNets in NLP
Filters as template matching
1D
Filters as template matching
2D
Filters as template matching
Convolution for feature extraction
Template matching = feature extraction
● Feature = template

● Extraction = matching

● Example of features are edge detectors

● A Kernel is a feature detector – Local features


How to decide on the kernel weights?
● Kernel weights = features

● Method 1 = Hand crafted 🡪 like vertical or horizontal edges

● Method 2 = Learning! 🡪 ConvNets


N-grams
Filters as template matching
1D
Template=kernel=n-gram
Shared weights = Efficiency
Every n-gram = Channel → See later
Conv1D
Conv2D
- Conv2D → (N, OUT_CH (n_kernels), L, W, C) → 5D tensor
- Concatenation model:
- If we concatenate([t0,t1,..tT],axis=-1): t0=(N,L,W,3), t1=(N,L,W,3)...tT=(N,L,W,3) → x =
(N,L,W,3xT)
- Next layers will treat the concatenated frames as just channels
- Spatio-temporal:
- Spatial model:
- Spatio model = Conv2D(out_maps, kernel)(ti=(N,L,W,3)) → (N,L,W,out_maps)
- If we concatenate([t0,t1,..tT],axis=-1): t0=(N,L,W,out_maps),
t1=(N,L,W,out_maps)...tT=(N,L,W,out_maps) → x = (N,L,W,out_mapsxT)
- Temporal model:
- Conv2D(final_maps, kernel)(x => (N,L,W,out_mapsxT)) → (N,L,W,final_maps)
- Flatten
- Classification → Softmax
Conv1D
Works the same as Conv2D, except the input is (N, OUT_CH (n_kernels), T, C=Features) → 4D tensor

- T is the number of samples (scalars in time) = #words, for example
- C (channels) is the dimension of the feature vector at time t
- OUT_CH is the output dimension of the mapped vector (out_emb)
- Example: C=1 feature per time step, OUT_CH=1
  - Conv1D(1, kernel=3)(x => (1, F)) → y => (1, F-3+1)
- If we have more than 1 frame, concatenate in T, not in channels
Multi-channel convolution illustration

Conv1D: (kernel, out_channels, inp_channels, sent_len):
- inp_ch = inp_emb_sz
- out_ch = out_emb_sz
- Each output kernel operates on the input (sent_len x emb_sz) ⇒ 2D (N-M+1 x emb_sz)
- We treat inp_emb_sz as input channels and sum ⇒ learnable with W ⇒ 2D → 1D
- We concat on the channel dim for each 1D output filter ⇒ 2D

Crush the row dim (emb_sz) using a learnable sum.
Conv1D example

- Input seq max_len = 400 words, each with a 50-dim embedding
- Conv over 400 with kernel=3 → 398 output positions
- Each would have 50 dims, BUT we summarize them with w → so each is 1 scalar → 398x1
- We have 250 output kernels (each of size 3) → so we have 398x250 outputs

(A quick shape check follows below.)
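A quick Keras shape check of this example (the vocabulary size is an assumption):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Conv1D

inp = Input(shape=(400,))                           # word indices padded to max_len = 400
emb = Embedding(10000, 50)(inp)                     # (None, 400, 50); vocab size 10000 is assumed
conv = Conv1D(250, 3, activation='relu')(emb)       # (None, 398, 250): 400-3+1 positions, 250 kernels
Model(inp, conv).summary()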
Multi-channel convolution illustration: Conv2D vs Conv1D

- In Conv2D: we sum over input feature maps ⇒ 3D → 2D
- In Conv1D: we treat emb_sz as input channels and sum ⇒ learnable with W ⇒ 2D → 1D

Following:
- In Conv2D: we concat on the channel dim for each output 2D filter ⇒ 3D
- In Conv1D: we concat on the channel dim for each output 1D filter ⇒ 2D

How to get one 1D vector per sentence?
Conv1D + GlobalMaxPool1D example

- Input: sent_len = max_len after padding, each word a vector of inp_emb_sz
- Every kernel conv gives one output due to max pooling over time ⇒ over all input words
- out_emb = #out_kernels

For a filter operating on an input sequence of vectors each of emb_sz, we concat them and summarize with a weight vector for every kernel slide, so one local dot product produces 1 scalar.
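A standalone sketch of max-over-time pooling producing one vector per sentence (same toy dimensions as above):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

inp = Input(shape=(400,))
x = Embedding(10000, 50)(inp)                       # (None, 400, 50)
x = Conv1D(250, 3, activation='relu')(x)            # (None, 398, 250)
x = GlobalMaxPooling1D()(x)                         # (None, 250): one max per kernel, over all words
out = Dense(1, activation='sigmoid')(x)             # e.g. binary sentiment head
Model(inp, out).summary()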
Conv1D + GlobalMaxPool1D example

But what if we want different kernels ⇒ different n-grams (2, 3, ...etc)? → Multi-headed CNN

Another popular approach with CNNs is to have a multi-headed model, where each head of the model reads the input time steps using a different sized kernel.

For example, a three-headed model may have three different kernel sizes of 3, 5, 11, allowing the model to read and interpret the sequence data at three different resolutions. The interpretations from all three heads are then concatenated within the model and interpreted by a fully-connected layer before a prediction is made. (See the sketch below.)
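A sketch of such a three-headed Conv1D model in Keras (kernel sizes 3, 5, 11; the other dimensions are assumptions):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, concatenate

inp = Input(shape=(400,))
emb = Embedding(10000, 50)(inp)

heads = []
for k in (3, 5, 11):                                # one head per kernel size ("n-gram" width)
    h = Conv1D(100, k, activation='relu')(emb)
    heads.append(GlobalMaxPooling1D()(h))           # each head -> (None, 100)

merged = concatenate(heads)                         # (None, 300)
dense = Dense(100, activation='relu')(merged)       # fully-connected interpretation layer
out = Dense(1, activation='sigmoid')(dense)
Model(inp, out).summary()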
Disadv: Depth scales with seq length!
- A ConvNet captures sequence information and summarizes it in a state
- Use that state for (condition the output on that state):
  - Classification
  - Sequence generation
- Network depth scales with the max seq (sentence) length!
  - Longer sequences require deeper models, which makes it difficult to learn dependencies between distant words
- Inefficient for variable length! Needs padding
Let’s code
Conv1D-IMDB Keras dataset

https://colab.research.google.com/drive/1zazDWpg6JeWUe8lxXp4dY61k5QzfuKMA?usp=sharing

Conv1D-IMDB Raw dataset

https://colab.research.google.com/drive/17iOxqMW-MT36RCLm7N31smxylk9buAGK?usp=sharing
CNN-LSTM
Spatio-temporal

- CNN-LSTM: the internal hidden state is computed as a normal (flattened) vector dot product
- CNN-LSTM: just take the raw CNN output without flattening; the LSTM will operate on the time_dim
- CNN-LSTM with TimeDistributed (like a Video CNN): spatio-temporal

ConvLSTM

- Internal hidden state computed as a convolution
- ConvLSTM: no TimeDistributed needed (states are already temporal)
- Built-in spatio-temporal

(A sketch for text follows below.)
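A minimal CNN-LSTM sketch for text, assuming the same toy IMDB-like dimensions as earlier (the Conv1D output is fed to the LSTM without flattening, so the LSTM runs over the time dimension):

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

inp = Input(shape=(400,))
x = Embedding(10000, 50)(inp)                       # (None, 400, 50)
x = Conv1D(64, 3, activation='relu')(x)             # (None, 398, 64): local n-gram features
x = MaxPooling1D(pool_size=2)(x)                    # (None, 199, 64): shorter sequence for the LSTM
x = LSTM(100)(x)                                    # LSTM summarizes over the remaining time steps
out = Dense(1, activation='sigmoid')(x)             # e.g. sentiment classification
Model(inp, out).summary()

For video, the TimeDistributed variant wraps a per-frame Conv2D before the LSTM, while ConvLSTM2D replaces the dot products inside the LSTM cell with convolutions.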
Let’s code
CNN-LSTM-IMDB Keras dataset

https://colab.research.google.com/drive/1zazDWpg6JeWUe8lxXp4dY61k5QzfuKM
A?usp=sharing

CNN-LSTM-IMDB Raw dataset

https://colab.research.google.com/drive/17iOxqMW-MT36RCLm7N31smxylk9buA
GK?usp=sharing
Recursive Auto Encoders
(optional)
Compositionality modeling
• Spatial
• Image
• CNN
• Distributed representations
• No control
• No prior templates for
context. Ex: parse trees
• The only template is the 2D
map spatial relation: every
pixel is affected by its
neighbors

http://www.slideshare.net/roelofp/2014-1021-sicsdlnlpg
Compositionality modeling

● Hierarchical
■ Recursive
■ RNN
■ RNTN
■ LSTM NN
○ Prior templates for context can be imposed

http://www.kdnuggets.com/2014/08/interview-pedro-domingos-master-algorithm-new-deep-learning.html
How to get the parsing order?
- Parser
- Stanford
- Expensive Language resource (Low-Language-Resources (LLR) = Arabic)
- RAE
- Unsupervised
- Sub-optimal
- Expensive computations
Recursive Auto Encoder
Recursive models
Adv

- Parsing language structure

Disadv

- Require parse trees


- Expensive computations
