NLP using Transformer models
How translation works
Questions to ponder
What is deep learning for NLP?
How does neural machine translation (NMT) work?
How did Google Translate improve so drastically after 2017?
Why the Transformer (which gave rise to BERT, GPT, and XLNet)?
Deep learning for NLP
Neural Machine Translation
A major improvement over the earlier statistical MT (SMT)
The Transformer model, introduced in 2017, revolutionized NMT and became the state of the art (SOTA)
Models
1. Neural Networks
2. Recurrent Neural Networks
3. Encoder-Decoder
4. Attention Models
5. Transformer
Models with Applications
1. Neural Networks - Word representation (word2vec)
2. Recurrent Neural Networks - Language Generation
3. Encoder-Decoder - Translation
4. Attention Models - Translation, Image recognition
5. Transformer - Translation
Model 1
Neural Networks
Neural Network
A neural network acts as a ‘black box’ that takes inputs and predicts an output.
Neural Network Training
Neural Network
A ‘black box’ that takes inputs and predicts an output.
Trained on known (input, output) pairs, it approximates the underlying function and maps new inputs to outputs.
It learns this mapping by adjusting its internal parameters (weights); a minimal sketch follows.
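As a rough illustration of "adjusting weights to fit known (input, output) pairs", here is a minimal sketch in PyTorch; the toy dataset, layer sizes, and training settings are illustrative assumptions, not part of the slides.

import torch
import torch.nn as nn

# Toy data: learn y = 2*x1 + 3*x2 from known (input, output) pairs.
X = torch.rand(100, 2)
y = X @ torch.tensor([[2.0], [3.0]])

# A tiny feed-forward network: the "black box".
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # how far predictions are from the known outputs
    loss.backward()               # compute gradients
    optimizer.step()              # adjust the weights

# The trained network now maps new, unseen inputs to outputs.
print(model(torch.tensor([[1.0, 1.0]])))  # roughly 5.0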
Word2vec using neural networks
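A minimal sketch of training word embeddings with the gensim library (assuming gensim 4.x); the toy corpus and hyperparameters are illustrative, not from the slides.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# Train a small skip-gram word2vec model (vector_size, window, etc. are illustrative).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)          # a fixed-length vector for each word
print(model.wv.most_similar("king"))   # nearby words in embedding space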
From words to sentences
Word embeddings capture the meaning of words.
How about sentences?
Can we encode the meaning of sentences?
"You can't cram the meaning of the entire $%!!* sentence into a !!$!$ vector" (Ray Mooney)
Sentences are variable length,
but we need a fixed-length representation (a naive approach is sketched below).
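One naive way to get a fixed-length sentence representation from word vectors is to average them. This sketch uses a hypothetical embedding dictionary (not real word2vec output) to show the idea and why it loses information.

import numpy as np

# Hypothetical word embeddings (in practice these would come from word2vec).
wv = {"dog": np.array([1.0, 0.0]),
      "bites": np.array([0.0, 1.0]),
      "man": np.array([1.0, 1.0])}

def sentence_vector(tokens):
    """Average the word vectors: fixed length, but word order is lost."""
    return np.mean([wv[t] for t in tokens], axis=0)

print(np.allclose(sentence_vector(["dog", "bites", "man"]),
                  sentence_vector(["man", "bites", "dog"])))  # True: order is lost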
Model 2
RNN
RNNs are used when the input carries state information across a sequence.
Examples include time series and word sequences.
An RNN captures the essence of the sequence in its hidden state (the context).
https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9
RNN - unrolled in time
The Unreasonable Effectiveness of Recurrent Neural Networks (Andrej Karpathy)
RNNs can learn sentence representations
RNNs can predict the next word
An RNN can learn a probability function p(word | previous words); a sketch follows
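A minimal sketch of an RNN language model in PyTorch, predicting p(word | previous words); the vocabulary size and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # illustrative sizes

class RNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)          # (batch, seq_len, embed_dim)
        outputs, hidden = self.rnn(x)   # the hidden state carries the context
        return self.out(outputs)        # logits over the vocabulary at each step

model = RNNLanguageModel()
tokens = torch.randint(0, vocab_size, (1, 5))   # a dummy 5-word sequence
logits = model(tokens)
probs = torch.softmax(logits[0, -1], dim=-1)    # p(next word | previous words)
print(probs.shape)                              # torch.Size([1000])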
RNN Training
RNN - Summary
RNN can encode sentence meaning
Can predict next word given a sequence of words.
Can be trained to learn a language model
RNN - problems
● Hard to parallelize efficiently
● Backpropagation through variable-length sequences
● Transmitting information through one bottleneck (the final hidden state)
● Longer sentences lose context
Example: "I was born in France ... I speak fluent French"
(There are improvements to the RNN structure, such as LSTM and GRU, that retain context better)
Translation - First Attempt
Now we can have two RNNs jointly trained.
The first RNN creates the context.
The second RNN uses the final context of the first RNN.
Training uses parallel corpora.
Model 3
Encoder-Decoder
http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Encoder-Decoder
Models the conditional probability p(y | x) of
translating a source sentence x into a target sentence y (a sketch follows)
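A minimal encoder-decoder sketch in PyTorch with two jointly usable RNNs (GRUs here), where the decoder is conditioned on the encoder's final hidden state; the sizes and the choice of GRU are illustrative assumptions.

import torch
import torch.nn as nn

src_vocab, tgt_vocab, embed_dim, hidden_dim = 1000, 1200, 64, 128  # illustrative

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder reads the source and compresses it into one final hidden state.
        _, context = self.encoder(self.src_embed(src))
        # The decoder is conditioned on that single context vector.
        dec_out, _ = self.decoder(self.tgt_embed(tgt), context)
        return self.out(dec_out)   # logits over the target vocabulary: models p(y | x)

model = EncoderDecoder()
src = torch.randint(0, src_vocab, (1, 7))   # dummy source sentence
tgt = torch.randint(0, tgt_vocab, (1, 5))   # dummy target prefix (teacher forcing)
print(model(src, tgt).shape)                # torch.Size([1, 5, 1200])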
Encoder-Decoder in Translation
Encoder-Decoder summary
In 2016, Google replaced their statistical model with NMT (GNMT)
Due to its flexibility, it is the go-to framework for NLG, with different models taking the
roles of encoder and decoder
The decoder can be conditioned not only on a sequence but on an arbitrary
representation, enabling many use cases (like generating a caption from an image)
Translation using Encoder-Decoder
Issues
● Meaning is crammed into a single vector
● Long-term dependencies
● Word alignment
● Word interdependencies
Translation issues (long-term dependencies)
The model cannot remember enough:
only the final encoder state is used,
yet each encoder state holds valuable information.
It is difficult to retain context beyond 30 words.
"Born in France ... went to EPFL, Switzerland ... I speak fluent ..."
Translation issues (word alignment)
The European Economic Area → la zone économique européenne
Word-by-word alignment (note the reversed order):
The → la
European → européenne
Economic → économique
Area → zone
Translation issues (interdependencies)
Translation issues
Interdependencies
Crammed meaning (everything squeezed into one context vector)
Word alignment
Words may or may not compose literally ("green apple" vs. "hot dog")
Word ordering matters ("I hit him on the eye yesterday")
How to retain more meaning (Context)
European (1) Economic (2) Area (3) → zone (3) économique (2) européenne (1)
Decoding is done using only the final context.
The final context doesn't capture all the information.
Can we use the intermediate contexts?
Model 4
Attention
Need for Attention
Attention removes the bottleneck of the encoder-decoder model
● Pay selective attention to the relevant parts of the input
● Utilize all intermediate encoder states while decoding
(don't encode the whole input into one vector, losing information)
● Perform alignment while translating
Pay selective attention
First introduced in Image Recognition
Without attention:
decoding uses only the final encoder state
Attention
Attention captures alignment
Align while translating
Attention summary
Pay selective attention to the words we need to translate
Compute a weighted sum of all encoder hidden states (the weights are the attention weights)
Use the attention-weighted context while decoding
Attention also takes care of alignment (a sketch of the weight computation follows)
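A minimal NumPy sketch of attention over encoder hidden states; the shapes, the random states, and the dot-product scoring function are illustrative assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical encoder hidden states, one per source word (4 words, dim 3).
encoder_states = np.random.rand(4, 3)
# The current decoder hidden state.
decoder_state = np.random.rand(3)

# Score each encoder state against the decoder state (dot-product scoring).
scores = encoder_states @ decoder_state        # shape (4,)
attention_weights = softmax(scores)            # one weight per source word, sums to 1

# Weighted sum of all encoder states = the context used at this decoding step.
context = attention_weights @ encoder_states   # shape (3,)
print(attention_weights, context)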
Model 5
Transformers
Can we get rid of RNNs completely?
RNNs are too sequential: parallelization is not possible, since the intermediate
contexts are generated one step at a time.
Transformer
There are three kinds of attention:
1. Within the encoder (self-attention)
2. Encoder-decoder attention
3. Within the decoder (masked self-attention)
Transformers (Self-attention)
A new representation Z
For each word, calculate its
dependencies on the other words:
'it' = 0.7 × street + 0.2 × animal + ...
Transformers (Self-attention)
Remember that the encoder creates the input representation.
Transformers create a rich representation that captures interdependencies
(it captures the relationships between words, compared to
simple word embeddings).
The mechanism that builds this rich representation is called self-attention.
Transformers (Self-attention)
The self-attention mechanism directly models relationships between all words in a
sentence, regardless of their positions.
Self-attention allows connections between any words within a sentence.
E.g. "I arrived at the bank after crossing the river" (a sketch of the computation follows)
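A minimal NumPy sketch of scaled dot-product self-attention (single head); the random projection matrices and sizes are illustrative assumptions, not the trained weights of any real model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8         # 5 words, illustrative dimensions
X = np.random.rand(seq_len, d_model)     # input word embeddings

# Learned projection matrices (random here, for illustration only).
W_q, W_k, W_v = (np.random.rand(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Each word attends to every other word, regardless of position.
scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) dependency scores
weights = softmax(scores, axis=-1)       # e.g. how much 'it' attends to 'street', 'animal', ...
Z = weights @ V                          # the new, context-aware representation Z
print(Z.shape)                           # (5, 8)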
Transformers
Use self-attention instead of RNNs or CNNs
Computation can be parallelized
Able to learn long-term dependencies
Transformer model
Two types of attention
1. Source-target (encoder-decoder) attention
2. Self-attention
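As a practical note, pretrained Transformer translation models can be used off the shelf. This is a hedged sketch using the Hugging Face transformers library; the checkpoint name "Helsinki-NLP/opus-mt-en-fr" is one publicly available English-French model, and any similar checkpoint would work.

from transformers import pipeline

# Load a pretrained encoder-decoder Transformer for English -> French translation.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

print(translator("The European Economic Area"))
# e.g. [{'translation_text': 'La zone économique européenne'}]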
Thank you for your attention
Attention is all you need
Summary
NN can encode words
RNN can encode sentences
Long sentences need changes to RNN architecture
Two RNNs can act as encoder and decoder (of any representation)
Encoding everything into a single context loses information
Selectively pay attention to the inputs we need
Get rid of RNNs and use only the attention mechanism, making it parallelizable
Richer representation of inputs using Self-attention.
Use Encoder-Decoder attention as usual for translation
