CS414 – MACHINE LEARNING
Lesson 10
TRANSFORMER & APPLICATIONS
Dr. Nguyễn Vinh Tiệp
Recap
Attention output
• Hidden states in the encoder (values): s_1, s_2, …, s_N ∈ ℝ^{d_1}
• Query vector: h ∈ ℝ^{d_2}
• Attention performs the following steps:
  (1) Calculate the attention scores r
  (2) Calculate the attention distribution: α = softmax(r)
  (3) Calculate the attention output c, which aggregates the information: c = Σ_{i=1}^{N} α_i s_i
[Figure: decoder query h_{t=1} (input <Start>) attending over the encoder states s_1 … s_N for the input sentence "I am not sure"]
Recurrent models for NLP tasks
• Encode the input sequence using a bidirectional LSTM
• Represent the output sequence and use an LSTM to generate the results
• Use Attention to flexibly access memory
Content
1. TRANSFORMER ARCHITECTURE
2. TRANSFORMER WEAKNESSES
3. APPLICATIONS OF TRANSFORMER
Transformer Motivation
• Minimize the path length between any pair of words
• Maximize the number of parallel operations
Minimize the path length between any pair of words
• It takes O(seq_len) steps for distant word pairs to interact
semantically
▪ Hard to learn long-distance dependencies (vanishing gradient)
In France, I had _____ language
Information about “France” has gone through O(seq_len) layers
Maximize Parallelizability
• Forward & backward require O(seq_len) unparallelizable operations
▪ GPU: performs many independent computations at once
▪ RNN: future hidden states can’t be computed before the past hidden states have been computed
→ Inhibits training on large-scale datasets
[Figure: computing the RNN hidden states s_1 … s_T requires 0, 1, 2, …, T−1 previous sequential operations respectively, so s_T requires T sequential operations]
Self-attention
• Attention: a query word in the decoder retrieves and synthesizes information from a set of values in the encoder
• Self-attention: encoder-encoder (or decoder-decoder) attention, where each word “attends” to every other word within the input (or output)
[Figure: with self-attention, only 2 previous operations are needed for every state s_1 … s_T, independent of the sequence length]
Transformer: Idea
Transformer advantages:
• The number of parallel operations is independent of the sequence length
• Each word interacts with every other word in at most O(1) steps
Transformer Architecture
• Consists of two main components: an encoder and a decoder
[Figure: full Transformer architecture — input embeddings (plus positional encodings) feed an encoder stack (multi-head attention, add & norm, feed forward; repeated 6x, # layers = 6); output embeddings (shifted right) feed a decoder stack (masked multi-head attention, cross-attention and feed forward, each followed by add & norm; repeated 6x); a final linear layer and softmax produce the output probabilities]
Encoder: Self-Attention
• Self-Attention is the core of the Transformer
[Figure: simplified architecture — the encoder block is built around a self-attention layer over the input embeddings]
Attention mechanism
• Attention ≈ an approximate hash table
  ▪ To look up a value, we compare a query with the keys in a table
[Figure: left — exact lookup: each query (hash) maps to exactly one key-value pair; right — attention: each query matches every key to a varying degree and returns a sum of the values weighted by the query-key similarity]
Recipe for Self-Attention in Encoder
• Step 1: For each word x_i, calculate its query, key & value:
      q_i = Q x_i,   k_i = K x_i,   v_i = V x_i
• Step 2: Calculate the attention score between the query and each key:
      r_ij = q_i · k_j
• Step 3: Use softmax to calculate the attention distribution:
      α_ij = softmax(r_ij) = e^{r_ij} / Σ_k e^{r_ik}
• Step 4: Take a weighted sum of the values (see the sketch below):
      output_i = Σ_j α_ij v_j
[Figure: query q compared against keys k_0 … k_6 to weight the values v_0 … v_6]
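A minimal NumPy sketch of the four steps above for a single word x_i; the matrices Q, K, V, the dimension d and the toy inputs are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # embedding dimension (assumed)
X = rng.normal(size=(5, d))              # five word vectors x_1 .. x_5 (toy data)
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention_for_word(i):
    q_i = Q @ X[i]                       # Step 1: query, keys and values
    keys = X @ K.T                       #         row j is k_j = K x_j
    values = X @ V.T                     #         row j is v_j = V x_j
    r = keys @ q_i                       # Step 2: scores r_ij = q_i . k_j
    alpha = np.exp(r) / np.exp(r).sum()  # Step 3: softmax distribution
    return alpha @ values                # Step 4: weighted sum of values

out_0 = self_attention_for_word(0)       # attention output for the first word
```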
Recipe for vectorized Self-Attention
• Step 1: Stack the word vectors x_i as the rows of X and calculate the queries, keys & values:
      X_q = XQ,   X_k = XK,   X_v = XV
• Step 2: Calculate the attention scores between queries and keys:
      R = X_q X_k^T = (XQ)(XK)^T = X Q K^T X^T
• Step 3: Use softmax to calculate the attention distribution:
      A = softmax(R)
• Step 4: Take a weighted sum of the values (see the vectorized sketch below):
      output = A X_v = softmax(X Q K^T X^T) XV
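The same computation in vectorized form, as a rough NumPy sketch; the slides use the row-vector convention X_q = XQ, and the shapes here are assumptions.

```python
import numpy as np

def self_attention(X, Q, K, V):
    """X: (T, d) stacked word vectors; Q, K, V: (d, d) projection matrices."""
    Xq, Xk, Xv = X @ Q, X @ K, X @ V                      # Step 1
    R = Xq @ Xk.T                                         # Step 2: (T, T) scores
    A = np.exp(R) / np.exp(R).sum(axis=1, keepdims=True)  # Step 3: row-wise softmax
    return A @ Xv                                         # Step 4: (T, d) outputs

rng = np.random.default_rng(0)
T, d = 5, 8
out = self_attention(rng.normal(size=(T, d)),
                     *(rng.normal(size=(d, d)) for _ in range(3)))
```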
Attention is all you need?
• Problem: Attention without a non-linear transformation is just a weighted sum of the value vectors
• Solution: Apply a feed-forward layer with a non-linear activation to the output of self-attention (a sketch follows below):
      m_i = MLP(output_i) = W_2 ReLU(W_1 · output_i + b_1) + b_2
[Figure: simplified architecture — a feed-forward layer follows the self-attention layer in both the encoder and the decoder]
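A minimal NumPy sketch of this position-wise feed-forward layer; the hidden width d_ff and the random weights are assumptions (the original Transformer uses d_ff = 4·d_model, but any width illustrates the idea).

```python
import numpy as np

def feed_forward(outputs, W1, b1, W2, b2):
    """outputs: (T, d) self-attention outputs; returns m_i for every position."""
    hidden = np.maximum(0.0, outputs @ W1 + b1)   # ReLU(W1 * output_i + b1)
    return hidden @ W2 + b2                       # W2 * hidden + b2

rng = np.random.default_rng(0)
T, d, d_ff = 5, 8, 32
m = feed_forward(rng.normal(size=(T, d)),
                 rng.normal(size=(d, d_ff)), np.zeros(d_ff),
                 rng.normal(size=(d_ff, d)), np.zeros(d))
```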
Inheritance of Deep Learning Achievements
• Trick 1: Multi-layer stack
• Trick 2: Residual connections
• Trick 3: Layer normalization
• Trick 4: Scaled dot-product attention
[Figure: simplified architecture — encoder (self-attention + feed forward) and decoder, with input and output embeddings]
Inheritance of Deep Learning Achievements
• Trick 1: Multi-layer stack — the encoder and decoder blocks are each repeated 6 times (# layers = 6)
[Figure: architecture with the encoder and decoder stacks annotated “Repeat 6x”]
Inheritance of Deep Learning Achievements
• Trick 2: Residual connections — add each sub-layer’s input to its output (a sketch follows below):
      x_l = x_{l−1} + F(x_{l−1})
[Figure: architecture with an “Add” step after the self-attention and feed-forward layers in each stack (repeated 6x)]
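A tiny NumPy sketch of the residual connection; `sublayer` is a placeholder standing in for self-attention or the feed-forward layer.

```python
import numpy as np

def residual(x, sublayer):
    """x_l = x_{l-1} + F(x_{l-1})"""
    return x + sublayer(x)

x = np.ones((5, 8))
y = residual(x, lambda h: 0.1 * h)   # toy sub-layer F for illustration
```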
Inheritance of Deep Learning Achievements
• Trick 3: Layer normalization (a sketch follows below)
  ▪ Problem: It is difficult to train the parameters of a given layer because its input from the layer beneath keeps shifting
  ▪ Solution: Reduce this variation by normalizing the values within each layer towards a standard normal distribution 𝒩(0, 1):
      x'_l = (x_l − μ_l) / (σ_l + ε)
[Figure: architecture with “Add & Norm” after the self-attention and feed-forward layers in each stack (repeated 6x)]
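A minimal NumPy sketch of the layer-norm formula above, applied per token; the learnable gain and bias used in practice are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token's features) towards N(0, 1)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(5, 8))
y = layer_norm(x)
print(y.mean(axis=-1).round(3), y.std(axis=-1).round(3))   # ~0 and ~1 per row
```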
Inheritance of Deep Learning Achievements
• Trick 4: Scaled dot-product attention (a sketch follows below)
  ▪ Problem: Even after normalization, the dot products still tend to be large, because their variance scales with the dimensionality d_k
  ▪ Solution: After multiplying, divide by √d_k:
      output = softmax(QK^T / √d_k) V
[Figure: architecture with the attention layers replaced by scaled attention]
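A minimal NumPy sketch of scaled dot-product attention with the √d_k division; the query/key/value shapes are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # divide by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(5, 64)),
                                   rng.normal(size=(7, 64)),
                                   rng.normal(size=(7, 32)))
```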
A big problem: word order is not considered
• Example: “do you understand” vs “you do understand”
Solution: Position representation vectors
• Self-attention does not take position information into account
• It is necessary to encode the order into the queries, keys and values
• Assume the ordinal index i is encoded as a position vector:
      p_i ∈ ℝ^d for i ∈ {1, 2, …, T}
• The value, key and query vectors then become (where ṽ_i, k̃_i, q̃_i are the old value, key and query vectors without positional information):
      v_i = ṽ_i + p_i
      k_i = k̃_i + p_i
      q_i = q̃_i + p_i
• We can also concatenate instead of adding
Solution: Position representation vectors
• Position embedding: encode each position using a representation vector
• Each position gets its own distinct vector
[Figure: the position embeddings are added to the input and output embeddings before the encoder and decoder stacks]
Position representation vectors through sinusoids
      p_i = [ sin(i / 10000^{2·1/d}), cos(i / 10000^{2·1/d}), …, sin(i / 10000^{2·(d/2)/d}), cos(i / 10000^{2·(d/2)/d}) ]^T
Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Can represent very long sequences
Cons:
• Not learnable from data; the representation vectors are fixed
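A rough NumPy sketch of these sinusoidal position vectors; it follows the indexing written on this slide (frequency index k = 1 … d/2) and assumes an even dimension d.

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Return a (T, d) matrix whose i-th row is the position vector p_i."""
    i = np.arange(1, T + 1)[:, None]        # positions i = 1 .. T
    k = np.arange(1, d // 2 + 1)[None, :]   # frequency index k = 1 .. d/2
    angles = i / 10000 ** (2 * k / d)
    P = np.empty((T, d))
    P[:, 0::2] = np.sin(angles)             # even slots: sin
    P[:, 1::2] = np.cos(angles)             # odd slots:  cos
    return P

P = sinusoidal_positions(T=50, d=16)        # one distinct vector per position
```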
Multi-head Self-Attention
• A word can have multiple relationships in a sentence: perform self-attention multiple times and combine the results
• Example: “Tôi có hẹn với Bảo, nhưng anh ấy nhắn đến muộn” (“I have an appointment with Bảo, but he texted that he’ll be late”) — “anh ấy” (“he”) refers back to “Bảo”
[Figure: single-head attention (one scaled dot-product attention over linear projections of V, K, Q) vs. multi-head attention (h scaled dot-product attentions over separate linear projections, run in parallel, concatenated, and passed through a final linear layer)]
Multi-head Self-Attention – Recipe
• Let Q_l, K_l, V_l ∈ ℝ^{d×(d/h)}, where h is the number of attention heads and l = 1…h:
      output_l = softmax(X Q_l K_l^T X^T / √d_k) · X V_l ∈ ℝ^{d/h}
• Combine the outputs of all attention heads (see the sketch below):
      output = Y [output_1; output_2; …; output_h], where Y ∈ ℝ^{d×d}
• Each attention head is a different “look” at the language
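A minimal NumPy sketch of the multi-head recipe above; the head count h, the dimension d and the output matrix Y are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Qs, Ks, Vs, Y):
    """X: (T, d); Qs, Ks, Vs: lists of h matrices of shape (d, d/h); Y: (d, d)."""
    d_k = Qs[0].shape[1]
    heads = []
    for Q, K, V in zip(Qs, Ks, Vs):
        scores = (X @ Q) @ (X @ K).T / np.sqrt(d_k)   # per-head scaled scores
        heads.append(softmax(scores) @ (X @ V))       # per-head (T, d/h) output
    return np.concatenate(heads, axis=-1) @ Y         # concat heads, mix with Y

rng = np.random.default_rng(0)
T, d, h = 5, 16, 4
X = rng.normal(size=(T, d))
Qs, Ks, Vs = ([rng.normal(size=(d, d // h)) for _ in range(h)] for _ in range(3))
out = multi_head_self_attention(X, Qs, Ks, Vs, rng.normal(size=(d, d)))
```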
Encoder architecture is complete, but what about the Decoder?
• The Encoder is essentially complete
• The Decoder performs queries over the features produced by the Encoder
[Figure: architecture — the decoder’s multi-head attention queries the encoder outputs]
Decoder: Masked Multi-head Self-Attention
• Problem: The Decoder performs sequential decoding in the form of a language model, so it must not “see” the answer (future words)
• Solution: At each step of the Decoder’s computation, we gradually expand the set of keys and values
[Figure: at the current decoding step, the following (future) states must be masked]
Decoder: Masked Multi-head Self-Attention
• Problem: Gradually expanding the set of keys and values makes parallelization impossible
• Solution: Use Masked Multi-head Self-Attention to mask out attention to future words by setting their attention scores to −∞ (see the sketch below):
      r_ij = q_i · k_j   if j < i
      r_ij = −∞          if j ≥ i
[Figure: masked attention matrix over “<Start> Do you understand” — each word may only attend to earlier positions]
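A small NumPy sketch of the masking step; this common implementation keeps the diagonal (each position may attend to itself, i.e. positions j ≤ i), a slight variation of the j < i condition written above.

```python
import numpy as np

def causal_mask_scores(scores):
    """scores: (T, T) raw attention scores r_ij; set future positions to -inf."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True where j > i
    masked = scores.copy()
    masked[future] = -np.inf
    return masked

r = np.arange(16, dtype=float).reshape(4, 4)
print(causal_mask_scores(r))   # upper triangle (future words) becomes -inf
```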
Decoder: Masked Multi-head Self-Attention
• The Masked Multi-head Attention layer is followed by an “Add & Norm” layer
[Figure: architecture — in the decoder, masked multi-head attention is followed by Add & Norm before the rest of the block]
Encoder-Decoder Attention (Cross-Attention)
• As in the Attention Mechanism lesson, the keys and values come from the encoder and the queries come from the decoder; the Transformer does the same:
  ▪ s_1, s_2, …, s_T ∈ ℝ^d are the encoder outputs
  ▪ h_1, h_2, …, h_T ∈ ℝ^d are the decoder inputs
• The keys, values and queries are (see the sketch below):
      k_i = K s_i
      v_i = V s_i
      q_i = Q h_i
[Figure: architecture — the decoder’s cross-attention takes its keys and values from the encoder outputs s_i and its queries from the decoder states h_i]
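A minimal NumPy sketch of cross-attention following the definitions above: keys and values are projections of the encoder outputs s_i, queries of the decoder states h_i; the dimensions are assumptions.

```python
import numpy as np

def cross_attention(S, H, K, V, Q):
    """S: (T_enc, d) encoder outputs; H: (T_dec, d) decoder states."""
    keys, values, queries = S @ K, S @ V, H @ Q
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ values                       # one output per decoder state

rng = np.random.default_rng(0)
d = 8
out = cross_attention(rng.normal(size=(6, d)), rng.normal(size=(4, d)),
                      *(rng.normal(size=(d, d)) for _ in range(3)))
```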
Decoder: Final steps
• Add a “Feed forward” layer and an “Add & Norm” layer
• Add a “Linear” layer to project from the feature space to the dictionary (vocabulary) space
• Add a “Softmax” layer to calculate the probability of the next word (see the sketch below)
[Figure: complete Transformer architecture — encoder and decoder stacks (each repeated 6x), followed by a linear layer and softmax that produce the output probabilities]
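A minimal NumPy sketch of the final projection and softmax; the vocabulary size and the projection weights are illustrative assumptions.

```python
import numpy as np

def next_word_probabilities(decoder_output, W_vocab, b_vocab):
    """decoder_output: (d,) final decoder state; returns a distribution over words."""
    logits = decoder_output @ W_vocab + b_vocab   # project to the vocabulary space
    logits -= logits.max()                        # numerical stability
    return np.exp(logits) / np.exp(logits).sum()  # softmax

rng = np.random.default_rng(0)
d, vocab_size = 8, 1000
p = next_word_probabilities(rng.normal(size=d),
                            rng.normal(size=(d, vocab_size)), np.zeros(vocab_size))
print(p.argmax(), p.sum())   # most likely next-word index; probabilities sum to 1
```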
Content
1. TRANSFORMER ARCHITECTURE
2. TRANSFORMER WEAKNESSES
3. APPLICATIONS OF TRANSFORMER
Drawbacks
• Computation grows quadratically with the sequence length!
▪ Computing all pairs of interactions
• Position representations:
▪ Relative linear position attention [Shaw et al., 2018].
▪ Dependency syntax-based position [Wang et al., 2019]
▪ Rotary Embeddings [Su et al., 2021]
Improving on quadratic self-attention cost
• Linformer [Wang et al., 2020]: map from the T-dimensional space (the sequence length) to a lower-dimensional space
Improving on quadratic self-attention cost
• BigBird [Zaheer et al., 2021]: replace all-pairs interactions with a
family of other interactions: random, window and global
Transformer Stability
• Transformer variants do not
meaningfully improve the
performance
Content
1. TRANSFORMER ARCHITECTURE
2. TRANSFORMER WEAKNESSES
3. APPLICATIONS OF TRANSFORMER
BERT and GPT Models
• BERT: Bidirectional Encoder Representations from Transformers
• GPT: Generative Pre-trained Transformer
• Common points:
  ▪ Both use self-supervised learning
  ▪ Both are trained on unlabeled data
  ▪ Both are used for representation learning
  ▪ Both are built on the Transformer
  ▪ Both can be used for downstream tasks
Source: https://scholar.harvard.edu/sites/scholar.harvard.edu/files/binxuw/files/mlfs_tutorial_nlp_transformer_ssl_updated.pdf
BERT and GPT Models
• Differences:
  ▪ Model: BERT is a masked language model; GPT is an autoregressive language model
  ▪ Architecture: BERT uses the encoder; GPT uses the decoder
  ▪ Pre-training task: BERT predicts the masked word; GPT predicts the next word
  ▪ Downstream tasks: BERT — text classification, question answering, text summarization, entity recognition; GPT — machine translation, content generation
Using the Pre-trained Language Model
Fine-tuning
• Use gradient descent to optimize the weights for a task (changes the model parameters)
• Ways to fine-tune:
  ▪ The full model
  ▪ Only the readout heads
  ▪ An adapter
Prompting
• Design a special prompt to hint/constrain the network into solving the task (changes how the model is used)
• No parameter update is required
Fine-tuning with Readout Heads
• Add the appropriate output and activation function
• Fine-tune the entire model with the new data
[Figures: fine-tuning with readout heads for word-type labeling, question answering, and text classification]
Fine-tuning with Adapter
• Add a small module to the language model
• Only fine-tune the parameters of the newly added module
→ This saves storage and computational costs (a sketch of a bottleneck adapter follows below)
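A minimal NumPy sketch of one common adapter design (a bottleneck with a residual connection), as an illustration of the idea above; the bottleneck size r and the weights are assumptions, and real adapters are trained rather than random.

```python
import numpy as np

def adapter(h, W_down, W_up):
    """h: (T, d) hidden states from a frozen layer; only W_down/W_up are fine-tuned."""
    bottleneck = np.maximum(0.0, h @ W_down)   # down-project + ReLU
    return h + bottleneck @ W_up               # up-project + residual connection

rng = np.random.default_rng(0)
T, d, r = 5, 16, 4                             # r << d keeps the adapter small
out = adapter(rng.normal(size=(T, d)),
              0.1 * rng.normal(size=(d, r)), 0.1 * rng.normal(size=(r, d)))
```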
Prompting: letting the model learn from context
• Given some examples of the task to be solved, the model finds a solution on its own
Fine-tune + Prompting: Instruction tuning
• Fine-tune the model to answer questions / follow instructions
• The model can then generalize to solve other, previously unseen tasks
Pre-trained language model → instruction-tune on tasks B, C, D, … → inference on a new task A
Transformer for sequence data
• Transformers work not only on text data, but can also be extended to other sequence-type data. Example: audio
  ▪ Transformer on audio data → OpenAI’s Whisper model
Transformer for sequence data
• Transformers work not only on text data, but can also be extended to other sequence-type data. Example: images
  ▪ Transformer on image data → the Vision Transformer (ViT) model
Transformer for sequence data
• Transformers work not only on text data, but can also be extended to other sequence-type data. Example: image + text
  ▪ Transformer on multimodal data → the Stable Diffusion model
QUIZ & QUESTIONS