NLP using Transformer models
How translation works
Questions to ponder
What is deep learning for NLP?
How does neural machine translation (NMT) work?
How did Google Translate improve so drastically after 2017?
Why the Transformer (which gave rise to BERT, GPT, and XLNet)?
Deep learning for NLP
Neural Machine Translation
A major improvement over the earlier statistical MT (SMT)
The Transformer model, introduced in 2017, revolutionized NMT and became the state of the art (SOTA)
Models
1. Neural Networks
2. Recurrent Neural Networks
3. Encoder-Decoder
4. Attention Models
5. Transformer
Models with Applications
1. Neural Networks - Word representation (word2vec)
2. Recurrent Neural Networks - Language Generation
3. Encoder-Decoder - Translation
4. Attention Models - Translation, Image recognition
5. Transformer - Translation
Model 1
Neural Networks
Neural Network
A neural network acts as a ‘black box’ that takes inputs and predicts an output.
Neural Network Training
Neural Network
A ‘black box’ that takes inputs and predicts an output.
Trained on known (input, output) pairs, it approximates the underlying function and maps new inputs to outputs.
It learns this mapping by adjusting its internal parameters (weights); a minimal sketch follows.
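As a rough illustration of "adjusting weights to fit known (input, output) pairs", here is a minimal sketch in PyTorch; the toy dataset, layer sizes, and training settings are illustrative assumptions, not part of the slides.

import torch
import torch.nn as nn

# Toy data: learn y = 2*x1 + 3*x2 from known (input, output) pairs.
X = torch.rand(100, 2)
y = X @ torch.tensor([[2.0], [3.0]])

# A tiny feed-forward network: the "black box".
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # how far predictions are from the known outputs
    loss.backward()               # compute gradients
    optimizer.step()              # adjust the weights

# The trained network now maps new, unseen inputs to outputs.
print(model(torch.tensor([[1.0, 1.0]])))  # roughly 5.0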
Word2vec using neural networks
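A minimal sketch of training word embeddings with the gensim library (assuming gensim 4.x); the toy corpus and hyperparameters are illustrative, not from the slides.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# Train a small skip-gram word2vec model (vector_size, window, etc. are illustrative).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)          # a fixed-length vector for each word
print(model.wv.most_similar("king"))   # nearby words in embedding space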
From words to sentences
Word embeddings capture the meaning of words.
How about sentences?
Can we encode the meaning of sentences?
"You can't cram the meaning of the entire $%!!* sentence into a !!$!$ vector" (Ray Mooney)
Sentences are variable length,
but we need a fixed-length representation (a naive approach is sketched below).
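One naive way to get a fixed-length sentence representation from word vectors is to average them. This sketch uses a hypothetical embedding dictionary (not real word2vec output) to show the idea and why it loses information.

import numpy as np

# Hypothetical word embeddings (in practice these would come from word2vec).
wv = {"dog": np.array([1.0, 0.0]),
      "bites": np.array([0.0, 1.0]),
      "man": np.array([1.0, 1.0])}

def sentence_vector(tokens):
    """Average the word vectors: fixed length, but word order is lost."""
    return np.mean([wv[t] for t in tokens], axis=0)

print(np.allclose(sentence_vector(["dog", "bites", "man"]),
                  sentence_vector(["man", "bites", "dog"])))  # True: order is lost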
Model 2
RNN
RNNs are used when the input carries state information across a sequence.
Examples include time series and word sequences.
An RNN captures the essence of the sequence in its hidden state (the context).
https://towardsdatascience.com/illustrated-guide-to-recurrent-neural-networks-79e5eb8049c9
RNN - unrolled in time
The Unreasonable Effectiveness of Recurrent Neural Networks (Andrej Karpathy)
RNNs can learn sentence representations
RNNs can predict the next word
An RNN can learn a probability function p(word | previous words); a sketch follows
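A minimal sketch of an RNN language model in PyTorch, predicting p(word | previous words); the vocabulary size and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # illustrative sizes

class RNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)          # (batch, seq_len, embed_dim)
        outputs, hidden = self.rnn(x)   # the hidden state carries the context
        return self.out(outputs)        # logits over the vocabulary at each step

model = RNNLanguageModel()
tokens = torch.randint(0, vocab_size, (1, 5))   # a dummy 5-word sequence
logits = model(tokens)
probs = torch.softmax(logits[0, -1], dim=-1)    # p(next word | previous words)
print(probs.shape)                              # torch.Size([1000])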
RNN Training
RNN - Summary
RNN can encode sentence meaning
Can predict next word given a sequence of words.
Can be trained to learn a language model
RNN - problems
● Hard to parallelize efficiently
● Backpropagation through variable-length sequences
● Transmitting information through one bottleneck (the final hidden state)
● Longer sentences lose context
Example: "I was born in France ... I speak fluent French"
(There are improvements to the RNN structure, such as LSTM and GRU, that retain context better)
Translation - First Attempt
Now we can have two RNNs jointly trained.
The first RNN creates the context.
The second RNN uses the final context of the first RNN.
Training uses parallel corpora.
Model 3
Encoder-Decoder
http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Encoder-Decoder
Models the conditional probability p(y | x) of
translating a source sentence x into a target sentence y (a sketch follows)
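A minimal encoder-decoder sketch in PyTorch with two jointly usable RNNs (GRUs here), where the decoder is conditioned on the encoder's final hidden state; the sizes and the choice of GRU are illustrative assumptions.

import torch
import torch.nn as nn

src_vocab, tgt_vocab, embed_dim, hidden_dim = 1000, 1200, 64, 128  # illustrative

class EncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, embed_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, embed_dim)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder reads the source and compresses it into one final hidden state.
        _, context = self.encoder(self.src_embed(src))
        # The decoder is conditioned on that single context vector.
        dec_out, _ = self.decoder(self.tgt_embed(tgt), context)
        return self.out(dec_out)   # logits over the target vocabulary: models p(y | x)

model = EncoderDecoder()
src = torch.randint(0, src_vocab, (1, 7))   # dummy source sentence
tgt = torch.randint(0, tgt_vocab, (1, 5))   # dummy target prefix (teacher forcing)
print(model(src, tgt).shape)                # torch.Size([1, 5, 1200])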
Encoder-Decoder in Translation
Encoder-Decoder summary
In 2016, Google replaced their statistical model with NMT (GNMT)
Due to its flexibility, it is the go-to framework for NLG, with different models taking the
roles of encoder and decoder
The decoder can be conditioned not only on a sequence but on an arbitrary
representation, enabling many use cases (like generating a caption from an image)
Translation using Encoder-Decoder
Issues
● Meaning is crammed into a single vector
● Long-term dependencies
● Word alignment
● Word interdependencies
Translation issues (long-term dependencies)
The model cannot remember enough:
only the final encoder state is used,
yet each encoder state holds valuable information.
It is difficult to retain context beyond 30 words.
"Born in France ... went to EPFL, Switzerland ... I speak fluent ..."
Translation issues (word alignment)
The European Economic Area → la zone économique européenne
Word-by-word alignment (note the reversed order):
The → la
European → européenne
Economic → économique
Area → zone
Translation issues (interdependencies)
Translation issues
Interdependencies
Crammed meaning (everything squeezed into one context vector)
Word alignment
Words may or may not compose literally ("green apple" vs. "hot dog")
Word ordering matters ("I hit him on the eye yesterday")
How to retain more meaning (Context)
European (1) Economic (2) Area (3) → zone (3) économique (2) européenne (1)
Decoding is done using only the final context.
The final context doesn't capture all the information.
Can we use the intermediate contexts?
Model 4
Attention
Need for Attention
Attention removes the bottleneck of the encoder-decoder model
● Pay selective attention to the relevant parts of the input
● Utilize all intermediate encoder states while decoding
(don't encode the whole input into one vector, losing information)
● Perform alignment while translating
Pay selective attention
First introduced in Image Recognition
Without attention:
decoding uses only the final encoder state
Attention
Attention captures alignment
Align while translating
Attention summary
Pay selective attention to the words we need to translate
Compute a weighted sum of all encoder hidden states (the weights are the attention weights)
Use the attention-weighted context while decoding
Attention also takes care of alignment (a sketch of the weight computation follows)
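A minimal NumPy sketch of attention over encoder hidden states; the shapes, the random states, and the dot-product scoring function are illustrative assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical encoder hidden states, one per source word (4 words, dim 3).
encoder_states = np.random.rand(4, 3)
# The current decoder hidden state.
decoder_state = np.random.rand(3)

# Score each encoder state against the decoder state (dot-product scoring).
scores = encoder_states @ decoder_state        # shape (4,)
attention_weights = softmax(scores)            # one weight per source word, sums to 1

# Weighted sum of all encoder states = the context used at this decoding step.
context = attention_weights @ encoder_states   # shape (3,)
print(attention_weights, context)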
Model 5
Transformers
Can we get rid of RNNs completely?
RNNs are too sequential: parallelization is not possible, since the intermediate
contexts are generated one step at a time.
Transformer
There are three kinds of attention:
1. Within the encoder (self-attention)
2. Encoder-decoder attention
3. Within the decoder (masked self-attention)
Transformers (Self-attention)
A new representation Z
For each word, calculate its
dependencies on the other words:
'it' = 0.7 × street + 0.2 × animal + ...
Transformers (Self-attention)
Remember that the encoder creates the input representation.
Transformers create a rich representation that captures interdependencies
(it captures the relationships between words, compared to
simple word embeddings).
The mechanism that builds this rich representation is called self-attention.
Transformers (Self-attention)
The self-attention mechanism directly models relationships between all words in a
sentence, regardless of their positions.
Self-attention allows connections between any words within a sentence.
E.g. "I arrived at the bank after crossing the river" (a sketch of the computation follows)
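A minimal NumPy sketch of scaled dot-product self-attention (single head); the random projection matrices and sizes are illustrative assumptions, not the trained weights of any real model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 5, 16, 8         # 5 words, illustrative dimensions
X = np.random.rand(seq_len, d_model)     # input word embeddings

# Learned projection matrices (random here, for illustration only).
W_q, W_k, W_v = (np.random.rand(d_model, d_k) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Each word attends to every other word, regardless of position.
scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) dependency scores
weights = softmax(scores, axis=-1)       # e.g. how much 'it' attends to 'street', 'animal', ...
Z = weights @ V                          # the new, context-aware representation Z
print(Z.shape)                           # (5, 8)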
Transformers
Use self-attention instead of RNNs or CNNs
Computation can be parallelized
Able to learn long-term dependencies
Transformer model
Two types of attention
1. Source-target (encoder-decoder) attention
2. Self-attention
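As a practical note, pretrained Transformer translation models can be used off the shelf. This is a hedged sketch using the Hugging Face transformers library; the checkpoint name "Helsinki-NLP/opus-mt-en-fr" is one publicly available English-French model, and any similar checkpoint would work.

from transformers import pipeline

# Load a pretrained encoder-decoder Transformer for English -> French translation.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

print(translator("The European Economic Area"))
# e.g. [{'translation_text': 'La zone économique européenne'}]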
Thank you for your attention
Attention is all you need
Summary
NN can encode words
RNN can encode sentences
Long sentences need changes to RNN architecture
Two RNNs can act as encoder and decoder (of any representation)
Encoding everything into a single context loses information
Selectively pay attention to the inputs we need
Get rid of RNNs and use only the attention mechanism, making it parallelizable
Richer representation of inputs using Self-attention.
Use Encoder-Decoder attention as usual for translation
