Sequence Models For NLP
Sequences are everywhere!
• Language: "the clouds are in the ____" → Sky? Ground? Garage? Airport? Oven?
• Image/Video
Language Modeling
• "the clouds are in the Sky" (not Ground, Garage, Airport, or Oven)
• Sequential by nature!
As the sequence length n increases, each n-gram becomes more "unique", leading to more entries in the n-gram Counter table. If n is small, n-grams are more "repeated", so there are fewer unique entries.
Let’s code
N-gram SLM
https://colab.research.google.com/drive/1cq-FKtD8tJpaoEODzZM6dm6hWaLUEyGk?usp=sharing
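A minimal Python sketch of the n-gram Counter idea described above (not the notebook's code; the toy corpus and variable names are made up for illustration):

```python
from collections import Counter, defaultdict

corpus = "the clouds are in the sky and the sun is in the sky".split()
n = 2  # bigram model

# Count n-grams: map an (n-1)-word context -> Counter of next words
counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    context, nxt = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
    counts[context][nxt] += 1

# P(next | context) by simple relative frequency
context = ("the",)
total = sum(counts[context].values())
probs = {w: c / total for w, c in counts[context].items()}
print(probs)   # larger n -> more unique contexts -> a sparser Counter table
```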
Neural Language Models (NLM)
NLM with BoW vectors
How to represent words to a NN?
Bag-of-Words model
Word Embeddings
[Architecture diagram: word index 1 … word index N → one Embedding layer per position, all sharing the same weight matrix W → Dense (W2) → Softmax (Wcls)]
ALL Embedding layers → same W?
BoW vectors model
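A hedged Keras sketch of the BoW-vectors NLM in the diagram above: one shared embedding matrix W for every word position, the word vectors bagged together, then Dense (W2) and Softmax (Wcls). The sizes (vocab_size, context_len, 64, 128) are illustration values, not from the slides:

```python
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.models import Sequential

vocab_size, context_len = 10_000, 5    # illustration values

model = Sequential([
    Input(shape=(context_len,)),
    # One shared Embedding layer = the same weight matrix W applied to every word index
    Embedding(vocab_size, 64),
    # "Bag" the word vectors: order is thrown away here -> sequence info is lost
    GlobalAveragePooling1D(),
    Dense(128, activation="relu"),              # the Dense (W2) block
    Dense(vocab_size, activation="softmax"),    # Softmax (Wcls): next-word probabilities
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
```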
What's wrong with BoW?
A soup of word vectors → sequence info is lost!
Recurrent Neural Networks
The natural selection for sequence models
Sequential process prediction depends on a history factor:
• Discrete state: emission and transition pdf's are "assumptions", but computationally effective so far; used in context modeling
• Kalman filter
• Discriminative: CRF, recurrent neural nets
Exploration vs. Exploitation
• All forms depend on two terms:
• Exploitation: use of known information = history = old state term = forget gate
• Exploration: use of new information = inclusion of the input = input gate
• It has to be learnt!
Recurrent Neural Nets
What makes RNN special vs. DNN/CNN?
• A glaring limitation of Vanilla Neural Networks (and also
Convolutional Networks) is that their API is too constrained: they
accept a fixed-sized vector as input (e.g. an image) and
produce a fixed-sized vector as output (e.g. probabilities of
different classes).
• Not only that: These models perform this mapping using a fixed
amount of computational steps (e.g. the number of layers in
the model).
• The core reason that recurrent nets are more exciting is that they
allow us to operate over sequences of vectors: Sequences in
the input, the output, or in the most general case both.
• One-to-One: vanilla mode of processing without an RNN, from fixed-sized input to fixed-sized output (e.g. image classification).
• Prediction problems
• Planning problems
• RL = evaluate the utility of a state-action pair and take the action that maximizes the expected total future payoff (reward) from source to goal
Sentence Embedding with Sequence models
Supervised Learning Model Design Pattern
[Diagram: Features → Wfeatures → Decision → Wcls → Output]
For what?
- Classification
- Many-to-one: Seq2Class
- Sentiment analysis, Toxicity detection (JIGSAW), "Real or Not?" Disaster Tweets
- Dialogue:
- Many-to-many: Seq2Seq → unaligned case → more on that later
- MT, spelling correction, speech, OCR, etc.
NLP models meta-architectures
Just like in CV: Encoder-Decoder
- Seq2Class:
- Encoder = word-vectors aggregation (How?)
- Decoder = None (just a classifier = softmax)
- Analogy to CV: Encoder-Softmax (AlexNet, VGG, etc.)
- Seq2Seq:
- Encoder = word-vectors aggregation (How?)
- Decoder = multiple-words generation (How?)
- Analogy to CV: Encoder-Decoder in semantic segmentation. But in semantic segmentation we have aligned many-to-many, while in NLP we have unaligned sequences → challenges in annotation, model design, when to stop, position encoding, etc.
- Word vectors are the input to all the above meta-architectures:
- Unlike in CV, where pixels are already digitized
- Also, in NLP order matters! = Context
Characters generator
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Characters generator
Backprop
• Error at the softmax output: Error_Softmax = y_i − t_i (no f' of the activation, but still y_i − t_i; review the intro and workshop).
• t_i = 1 for target[t] and 0 otherwise (one-hot encoding: the target gets 1, the others 0's).
• y_i = ps[t], e.g. ps[t] = [0.1 0.02 … 0.7 …]; target[t] is an index into the vocabulary (e.g. 1..28).
• dy[target[t]] = error = ps[t][target[t]] − 1
• dy[otherwise] = error = ps[t][otherwise] − 0
• dWhy = alpha * error * input
• The error backpropagated to Wxh and Whh comes from the output + the next time step (dhnext).
• Notice: we want to push ps[t][target[t]] → 1 and the others → 0. Loss = negative log likelihood = −Σ t_i·log(y_i); since all t_i are 0's except target[t] = 1, the loss reduces to −log(ps[t][target[t]]).
• Notice: we start from the end of the sequence and backpropagate the error from that point back. At the final node there is no dhnext; at the first node, hs[t−1] is the initial hidden state.
• Notice: past the softmax layer, backprop of the errors shall include the non-linearity f'.
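A minimal numpy sketch of the softmax-gradient step described in these notes. Variable names (ps, h, Why, dh) follow the style of Karpathy's min-char-rnn, but this is only a one-step illustration, not the full training loop:

```python
import numpy as np

# Toy forward/backward pass for a single time step of a char-level RNN output layer.
vocab_size, hidden_size = 28, 16
rng = np.random.default_rng(0)
Why = rng.normal(0, 0.01, (vocab_size, hidden_size))
h = np.tanh(rng.normal(size=(hidden_size, 1)))       # hidden state hs[t]
target = 5                                           # target[t]: index into the vocab

y = Why @ h                                          # unnormalized scores
ps = np.exp(y) / np.sum(np.exp(y))                   # softmax probabilities ps[t]
loss = -np.log(ps[target, 0])                        # negative log likelihood

dy = np.copy(ps)
dy[target] -= 1                                      # error = y_i - t_i (no f' here)
dWhy = dy @ h.T                                      # gradient w.r.t. Why = error * input
dh = Why.T @ dy                                      # error flowing back into the hidden state
dhraw = (1 - h * h) * dh                             # past the tanh: now include the non-linearity f'
```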
Essays generation
tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e
plia tklrgd t o idoe ns,smtt h ne etie h,hregtrs nigtike,aoaenns lng
100 epochs: at least it is starting to get an idea about words separated by spaces.
Except sometimes it inserts two spaces. It also doesn't know that a comma is almost always followed by a space.
we counter. He stutn co des. His stanted out one ofler that concossions and
was to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn
500 epochs: The model has now learned to spell the shortest and most common words such as "we", "He", "His", "Which", "and", etc.
Aftair fall unsuch that the hall for Prince Velzonski's that me of her hearly,
and behs to so arwage fiving were to it beloge, pavu say falling misfort how,
and Gogition is so overelical and ofter.
700 epochs: English-Like !!
Essays generation
"Kite vouch!" he repeated by her door. "But I would be done and quarts, feeling,
then, son is people...."
1200 epochs: we're now seeing use of quotations and question/exclamation marks. Longer words have now been learned as well
"Why do what that day," replied Natasha, and wishing to himself the fact the princess,
Princess Mary was easier, fed in had oftened him. Pierre aking his soul came to the
packs and drove up his father-in-law women.
2000 epochs: we start to get properly spelled words, quotations, names, and so on
The picture that emerges is that the model first discovers the general word-space structure and
then rapidly starts to learn the words; First starting with the short words and then eventually the
longer ones. Topics and themes that span multiple words (and in general longer-term
dependencies) start to emerge only much later.
Wikipedia
Markdown
Algebra Lemmas (LaTeX)
• Nice music!!
• https://soundcloud.com/optometrist-prime/recurrence-music-written-by-a-recurrent-neural-network
Images generator
• Input Monet drawings (50 images)
Arabic
Chinese
"I grew up in France… [LONG_DOC] I speak fluent ____." → French? English? German?
How to pay "attention" to words from very OLD context?
Long Short-Term Memory (LSTM)
• Sometimes, we only need to look at recent information to
perform the present task.
• What to forget/keep/propagate vs. what to add from the new input
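A minimal numpy sketch of one LSTM step, written out to make the forget-gate vs. input-gate split visible. The weight names (Wf, Wi, Wc, Wo) and sizes are illustrative, not a library API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    Wf, bf, Wi, bi, Wc, bc, Wo, bo = params
    z = np.concatenate([h_prev, x])           # [h_{t-1}; x_t]
    f = sigmoid(Wf @ z + bf)                  # forget gate: what to keep/propagate
    i = sigmoid(Wi @ z + bi)                  # input gate: what to add from the new input
    c_tilde = np.tanh(Wc @ z + bc)            # candidate new cell content
    c = f * c_prev + i * c_tilde              # new cell (long-term) state
    o = sigmoid(Wo @ z + bo)                  # output gate
    h = o * np.tanh(c)                        # new hidden (short-term) state
    return h, c

# tiny usage example with random parameters
hidden, inp = 4, 3
rng = np.random.default_rng(0)
params = []
for _ in range(4):
    params += [rng.normal(0, 0.1, (hidden, hidden + inp)), np.zeros(hidden)]
h, c = lstm_step(rng.normal(size=inp), np.zeros(hidden), np.zeros(hidden), params)
```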
Bi-directional LSTM
Let’s code
LSTM
https://colab.research.google.com/drive/1dXeClcTIaFqG3UGmZrdDjPktte6TZ5si#scrollTo=Y9xddpbZJrjm
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
Conventions: Time-series tensors
3D tensors of shape (samples, timesteps, features); samples could be the batch (size = batch_size).
Whenever time matters in your data (or the notion of sequence order), it
makes sense to store it in a 3D tensor with an explicit time axis. Each
sample can be encoded as a sequence of vectors (a 2D tensor), and thus a
batch of data will be encoded as a 3D tensor
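A tiny numpy example of the convention (the sizes 32/100/64 are made up):

```python
import numpy as np

# A batch of 32 sentences, each padded/truncated to 100 timesteps,
# each timestep represented by a 64-dimensional feature vector (e.g. an embedding).
batch = np.zeros((32, 100, 64), dtype="float32")   # (samples, timesteps, features)
print(batch.shape)      # (32, 100, 64)
print(batch[0].shape)   # one sample = a sequence of vectors, shape (100, 64)
```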
LSTM return_sequences
• A MUST for stacked LSTMs! Otherwise, use TimeDistributed (NOT preferred).
- h = output
- c = internal state
- h = lstm1
- h = state_h (again!) → why?
- c = state_c
LSTM return_state and return_sequences
https://machinelearningmastery.com/return-sequences-and-return-states-for-lstms-in-keras/
Verify: the last timestep of lstm1 = h (state_h)
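A minimal Keras sketch in the spirit of the machinelearningmastery example linked above, to verify that the last timestep of lstm1 equals state_h (the toy input values are made up):

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

inputs = Input(shape=(3, 1))                       # (timesteps=3, features=1)
lstm1, state_h, state_c = LSTM(2, return_sequences=True, return_state=True)(inputs)
model = Model(inputs=inputs, outputs=[lstm1, state_h, state_c])

data = np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
seq, h, c = model.predict(data)
print(seq.shape)       # (1, 3, 2): the h output at every timestep
print(seq[:, -1, :])   # last timestep of lstm1 ...
print(h)               # ... equals state_h (again!)
print(c)               # state_c = the internal cell state
```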
NLM with RNN
Let’s code
https://colab.research.google.com/drive/16XcUKh2eWIqIwC3vnnbKUdqkpHqY99LB?usp=sharing
Characters generator = Char level NLM
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Word level NLM
Let’s code
Word level NLM
https://colab.research.google.com/drive/1WRG86sVhXxI3bADagQOy6DdNVlog_COo?usp=sharing
https://colab.research.google.com/drive/1qQAqtDK6b9E27Dzu38hBB4cV6QVtuNg0?usp=sharing
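A hedged Keras sketch of a word-level NLM (the linked notebooks differ in details; vocab_size, seq_len and the layer sizes here are illustration values):

```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

vocab_size, seq_len = 10_000, 20      # illustration values

model = Sequential([
    Input(shape=(seq_len,)),
    Embedding(vocab_size, 64),                  # word index -> word vector
    LSTM(128),                                  # summarize the context into one vector
    Dense(vocab_size, activation="softmax"),    # next-word distribution over the vocab
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(X, y, ...) with X: (samples, seq_len) word indices, y: next-word index
```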
RNN for text classification
Let’s code
LSTM-IMDB Keras dataset
https://colab.research.google.com/drive/1Qllb_h_kuwv8PaIHPZDr2eC4Zo5ljr5b?usp=sharing
https://colab.research.google.com/drive/1WOUuqsfPZXurupMxlExvScB3ivSiVaV9?usp=sharing
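A hedged Keras sketch of the LSTM-IMDB classifier (many-to-one Seq2Class); the hyperparameters are illustration values, not necessarily the notebook's:

```python
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

max_features, max_len = 20_000, 200        # vocabulary cap and padded sequence length

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

model = Sequential([
    Embedding(max_features, 128),
    LSTM(64),                                  # many-to-one encoder over the review
    Dense(1, activation="sigmoid"),            # binary sentiment decision
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_test, y_test))
```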
Conv1D for sequence models
What’s wrong with RNN?
It’s sequential!
CNNs are good at summarizing an image into one feature vector → How?
● Extraction = matching
Following:
- In Conv2D: we concatenate, on the channel dim, the 2D output of each filter ⇒ 3D output
- In Conv1D: we concatenate, on the channel dim, the 1D output of each filter ⇒ 2D output
How to get one 1D vector per sentence?
Conv1D + GlobalMaxPool1D example
[Diagram: input shape = (sent_len = max_len after padding, inp_emb_sz); every conv kernel gives one output due to max pooling over time ⇒ over all input words; output embedding size = #out_kernels]
https://colab.research.google.com/drive/1zazDWpg6JeWUe8lxXp4dY61k5QzfuKMA?usp=sharing
https://colab.research.google.com/drive/17iOxqMW-MT36RCLm7N31smxylk9buAGK?usp=sharing
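A hedged Keras sketch of the Conv1D + GlobalMaxPool1D idea above; after max pooling over time, every kernel contributes one number, so the sentence vector size = #out_kernels. The sizes are illustration values:

```python
from tensorflow.keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.models import Sequential

vocab_size, max_len, inp_emb_sz, out_kernels = 20_000, 100, 64, 128   # illustration values

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(vocab_size, inp_emb_sz),             # (max_len, inp_emb_sz)
    Conv1D(out_kernels, kernel_size=5, activation="relu"),  # slide each kernel over the words
    GlobalMaxPooling1D(),     # max over time: one value per kernel -> vector of size out_kernels
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
```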
CNN-LSTM → spatio-temporal
ConvLSTM → built-in spatio-temporal
Let’s code
CNN-LSTM-IMDB Keras dataset
https://colab.research.google.com/drive/1zazDWpg6JeWUe8lxXp4dY61k5QzfuKMA?usp=sharing
https://colab.research.google.com/drive/17iOxqMW-MT36RCLm7N31smxylk9buAGK?usp=sharing
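A hedged Keras sketch of the CNN-LSTM idea for the IMDB task: Conv1D extracts local n-gram-like features, max pooling shortens the sequence, and the LSTM models the remaining temporal order. Hyperparameters are illustration values, not necessarily the notebook's:

```python
from tensorflow.keras.layers import Input, Embedding, Conv1D, MaxPooling1D, LSTM, Dense
from tensorflow.keras.models import Sequential

max_features, max_len = 20_000, 200     # illustration values

model = Sequential([
    Input(shape=(max_len,)),
    Embedding(max_features, 128),
    Conv1D(64, kernel_size=5, activation="relu"),  # local n-gram-like feature extraction
    MaxPooling1D(pool_size=4),                     # shorten the sequence the LSTM has to read
    LSTM(70),                                      # model the (shorter) temporal sequence
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```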
Recursive Auto Encoders
(optional)
Compositionality modeling
• Spatial
• Image
• CNN
• Distributed representations
• No control
• No prior templates for context. Ex: parse trees
• The only template is the 2D-map spatial relation: every pixel is affected by its neighbors
http://www.slideshare.net/roelofp/2014-1021-sicsdlnlpg
Compositionality modeling
● Hierarchical
■ Recursive
■ RNN
■ RNTN
■ LSTM NN
○ Prior templates for context can be imposed
http://www.kdnuggets.com/2014/08/interview-pedro-domingos-master-algorithm-new-deep-learning.html
How to get the parsing order?
- Parser
- Stanford
- Expensive language resource (a problem for low-language-resource (LLR) languages, e.g. Arabic)
- RAE
- Unsupervised
- Sub-optimal
- Expensive computations
Recursive Auto Encoder
Recursive models: advantages vs. disadvantages