
CS414 – MACHINE LEARNING

Lesson 10

TRANSFORMER & APPLICATIONS

Dr. Nguyễn Vinh Tiệp

1
Recap
Attention output
• Hidden states in the encoder (values): $s_1, s_2, \ldots, s_N \in \mathbb{R}^{d_1}$
• Query vector $h \in \mathbb{R}^{d_2}$
• Attention performs the following steps:
(1) Calculate attention scores $r$
(2) Calculate the attention distribution $\alpha = \mathrm{softmax}(r)$
(3) Calculate the attention output $c$ to aggregate information: $c = \sum_{i=1}^{N} \alpha_i s_i$

[Figure: encoder states $s_1, s_2, s_3, \ldots, s_N$ attended by the decoder state $h_{t=1}$ over the words ‘I’ ‘am’ ‘not’ ‘sure’ ‘<Start>’]

2
Recurrent models for NLP tasks
• Encode the input sequence using a Bidirectional LSTM
• Represent the output sequence and use an LSTM to generate results
• Utilize Attention to flexibly access memory

3
Content

1. TRANSFORMER ARCHITECTURE

2. TRANSFORMER WEAKNESSES

3. APPLICATIONS OF TRANSFORMER

4
Transformer Motivation

• Minimize the path length between any pair of words
• Maximize the number of parallel operations

5
Minimize the path length between any pair of words

• It takes O(seq_len) steps for distant word pairs to interact semantically
▪ Hard to learn long-distance dependencies (vanishing gradient)

Example: “In France, I had _____ language”

Info of “France” has gone through O(seq_len) many layers
6
Maximize Parallelizability

• Forward & backward passes require O(seq_len) unparallelizable operations
▪ GPU: performs many independent computations at once
▪ RNN: future hidden states can’t be computed in full before past hidden states
→ Inhibits training on large-scale datasets

[Figure: computing hidden states $s_1, s_2, s_3, \ldots, s_T$; state $s_T$ requires $T$ previous operations]

7
Self-attention

• Attention: a query word in the decoder retrieves and synthesizes information from a set of values in the encoder
• Self-attention: encoder-encoder (or decoder-decoder) attention, where each word “attends” to every other word within the input (or output)

[Figure: with self-attention, every state $s_1, s_2, s_3, \ldots, s_T$ is reachable in a constant number of steps; only 2 previous operations are needed]
8
Transformer: Idea

Transformer Advantages:
• Number of parallel operations is independent of sequence length
• Each word interacts with every other word in at most O(1) steps

9
Transformer Architecture

• Includes two main components: encoder & decoder, each repeated 6x (# layers = 6)

[Diagram: Encoder: Inputs → Input embedding (+ positional encoding) → Multi-head Attention → Add & Norm → Feed forward → Add & Norm, repeated 6x.
Decoder: Outputs (shifted right) → Output embedding (+ positional encoding) → Masked Multi-head Attention → Add & Norm → Multi-head Cross-Attention → Add & Norm → Feed forward → Add & Norm, repeated 6x → Linear → Softmax → Output probabilities]

10
Encoder: Self-Attention

• Self-Attention is the core of the Transformer

[Diagram: simplified architecture with Self-Attention inside the Encoder, next to the Decoder]

11
Attention mechanism

• Attention ~ an approximate hashtable
▪ To look up a value, we compare a query with the keys in a table

[Figure: left: exact lookup, each query (hash) maps to one key-value pair; right: attention, each query matches every key to varying degrees and returns a sum of values weighted by query-key similarity]

12
Recipe for Self-Attention in the Encoder
• Step 1: For each word $x_i$, calculate its query, key & value:
$q_i = Q x_i$, $k_i = K x_i$, $v_i = V x_i$
• Step 2: Calculate the attention score between the query and each key:
$r_{ij} = q_i \cdot k_j$
• Step 3: Use softmax to calculate the attention distribution:
$\alpha_{ij} = \mathrm{softmax}(r_{ij}) = \dfrac{e^{r_{ij}}}{\sum_k e^{r_{ik}}}$
• Step 4: Take a weighted sum of the values:
$\mathrm{output}_i = \sum_j \alpha_{ij} v_j$

[Figure: query q matched against keys $k_0 \ldots k_6$ with values $v_0 \ldots v_6$]

13
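
To make the four steps concrete, here is a minimal NumPy sketch (an illustration only; the sizes d and T and the random projection matrices are assumptions, and a row-vector convention is used, so q_i = x_i Q rather than the slide's column form):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d, T = 8, 5                          # embedding size and sequence length (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))          # word vectors x_1..x_T stacked row-wise
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

keys = X @ K                         # Step 1: keys and values for every word
values = X @ V
outputs = []
for i in range(T):
    q_i = X[i] @ Q                   # Step 1: query for word i
    r_i = keys @ q_i                 # Step 2: scores r_ij = q_i . k_j
    alpha_i = softmax(r_i)           # Step 3: attention distribution over all words
    outputs.append(alpha_i @ values) # Step 4: weighted sum of values
outputs = np.stack(outputs)          # one output vector per word, shape (T, d)
```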
Recipe for vectorized Self-Attention
• Step 1: The words $x_i$ are stacked into a matrix $X$; calculate the queries, keys & values:
$X_q = XQ$, $X_k = XK$, $X_v = XV$
• Step 2: Calculate the attention scores between queries and keys:
$R = X_q X_k^{\mathsf T} = XQ\,(XK)^{\mathsf T} = XQK^{\mathsf T}X^{\mathsf T}$
• Step 3: Use softmax to calculate the attention distribution:
$A = \mathrm{softmax}(R)$
• Step 4: Take a weighted sum of the values:
$\mathrm{output} = A X_v = \mathrm{softmax}(XQK^{\mathsf T}X^{\mathsf T})\, XV$
14
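
The same computation in vectorized form (again a hedged NumPy illustration with assumed sizes; the softmax is applied row-wise, one distribution per query):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

d, T = 8, 5                                  # assumed sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))                  # words stacked row-wise
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

Xq, Xk, Xv = X @ Q, X @ K, X @ V             # Step 1: all queries, keys, values at once
R = Xq @ Xk.T                                # Step 2: T x T score matrix R = XQ(XK)^T
A = softmax_rows(R)                          # Step 3: row-wise attention distributions
output = A @ Xv                              # Step 4: output = A X V, shape (T, d)
```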
Attention is all you need?

• Problem: Attention without a non-linear transformation is just a weighted sum of value vectors

• Solution: Apply a feedforward layer with a non-linear activation to the output of self-attention:
$m_i = \mathrm{MLP}(\mathrm{output}_i) = W_2\, \mathrm{ReLU}(W_1 \times \mathrm{output}_i + b_1) + b_2$

[Diagram: Feed forward layer added after Self-Attention in both the Encoder and the Decoder]

15
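
A brief sketch of this position-wise feed-forward layer applied to each attention output (NumPy; the hidden width d_ff and the random weights are assumptions):

```python
import numpy as np

d, d_ff, T = 8, 32, 5                         # model width, hidden width, length (assumed)
rng = np.random.default_rng(0)
attn_out = rng.normal(size=(T, d))            # self-attention outputs, one row per word
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

relu = lambda z: np.maximum(z, 0.0)
m = relu(attn_out @ W1 + b1) @ W2 + b2        # m_i = W2 ReLU(W1 output_i + b1) + b2
```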
Inheritance of Deep Learning Achievements

• Trick 1: Multi-layer stack
• Trick 2: Residual connections
• Trick 3: Layer norm
• Trick 4: Scaled dot-product attention

[Diagram: simplified Encoder/Decoder with Self-Attention and Feed forward]

16
Inheritance of Deep Learning Achievements

• Trick 1: Multi-layer stack (the Encoder and Decoder blocks are each repeated 6x, # layers = 6)

[Diagram: Encoder and Decoder stacks, each repeated 6x]

17
Inheritance of Deep Learning Achievements

• Trick 2: Residual connections: each sub-layer adds its input back to its output,
$x_l = x_{l-1} + F(x_{l-1})$

[Diagram: “Add” blocks after Self-Attention and Feed forward in the Encoder and Decoder, repeated 6x]

18
Inheritance of Deep Learning Achievements

• Trick 3: Layer norm

Problem: It is difficult to train the parameters of a given layer because its input from the layer beneath keeps shifting

Solution: Reduce variation by normalizing to a standard normal distribution $\mathcal{N}(0,1)$ within each layer:
$x'^{\,l} = \dfrac{x^l - \mu^l}{\sigma^l + \varepsilon}$

[Diagram: “Add & Norm” blocks after Self-Attention and Feed forward, repeated 6x (# layers = 6)]

19
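
A minimal sketch of the normalization formula above (NumPy; the epsilon value and per-vector normalization over the feature dimension are assumptions consistent with common practice):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row to roughly zero mean and unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(5, 8))
x_norm = layer_norm(x)            # each row now has ~0 mean and ~1 std
```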
Inheritance of Deep Learning Achievements

• Trick 4: Scaled dot-product attention

Problem: After Norm, the vector elements follow a normal distribution, but the dot product still tends to be large, as its variance scales with the dimensionality $d_k$

Solution: After multiplying, divide by $\sqrt{d_k}$:
$\mathrm{output} = \mathrm{softmax}\!\left(QK^{\mathsf T} / \sqrt{d_k}\right) V$

[Diagram: Scaled-Attention replacing plain Self-Attention in the Encoder and Decoder stacks]

20
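
A tiny NumPy illustration of why the scaling helps: for random $d_k$-dimensional vectors, the raw dot product has variance of roughly $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly unit scale (the sizes below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n = 64, 10000
Q = rng.normal(size=(n, d_k))        # many random query vectors
K = rng.normal(size=(n, d_k))        # many random key vectors

raw = (Q * K).sum(axis=1)            # raw dot products
scaled = raw / np.sqrt(d_k)          # scaled dot products
print(raw.var(), scaled.var())       # ~d_k vs. ~1
```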
Problem (big): word order is not considered

• Example: “do you understand” vs “you do understand”

21
Solution: Position representation vectors

• Self-attention does not care about position information
• It’s necessary to encode the order into the query, key and value
• Assume the ordinal index $i$ is encoded as a position vector $p_i \in \mathbb{R}^d$, with $i \in \{1, 2, \ldots, T\}$
• Then the value, key and query vectors become:
$v_i = \tilde{v}_i + p_i$
$k_i = \tilde{k}_i + p_i$
$q_i = \tilde{q}_i + p_i$
where $\tilde{v}_i, \tilde{k}_i, \tilde{q}_i$ are the old value, key and query without positional information

• We can use concatenation instead of addition


22
Solution: Position representation vectors

• Position embedding: encode each position using a representation vector
• Each position is a distinct vector

[Diagram: position vectors added (+) to the input and output embeddings before the Scaled Attention layers]

23
Position representation vectors through sinusoids

$$p_i = \begin{pmatrix} \sin\!\left(i/10000^{2\cdot 1/d}\right) \\ \cos\!\left(i/10000^{2\cdot 1/d}\right) \\ \vdots \\ \sin\!\left(i/10000^{2\cdot (d/2)/d}\right) \\ \cos\!\left(i/10000^{2\cdot (d/2)/d}\right) \end{pmatrix}$$

Pros:
• Periodicity indicates that maybe “absolute position” isn’t as important
• Can represent very long sequences
Cons:
• Not learnable from data, representation vectors are fixed
24
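
A sketch that generates the sinusoidal position vectors above and adds them to the embeddings (NumPy; assumes d is even and follows the sin/cos pairing shown on the slide; the embedding matrix is a random stand-in):

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Return a (T, d) matrix whose i-th row is the position vector p_i."""
    P = np.zeros((T, d))
    positions = np.arange(1, T + 1)[:, None]     # i = 1..T
    dims = np.arange(1, d // 2 + 1)[None, :]     # pair index 1..d/2
    angle = positions / (10000 ** (2 * dims / d))
    P[:, 0::2] = np.sin(angle)                   # even slots: sin
    P[:, 1::2] = np.cos(angle)                   # odd slots: cos
    return P

P = sinusoidal_positions(T=50, d=8)
E = np.random.default_rng(0).normal(size=(50, 8))   # stand-in word embeddings
E_with_pos = E + P                                   # positions added to the embeddings
```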
Multi-head Self-Attention
• A word can have multiple relationships in a sentence: perform self-attention multiple times and combine the results
• Example: “Tôi có hẹn với Bảo, nhưng anh ấy nhắn đến muộn” (“I have an appointment with Bảo, but he texted that he will be late”)

[Diagram: single-head attention (Linear projections of V, K, Q → Scaled Dot-product Attention → Linear) vs. multi-head attention (several Scaled Dot-product Attention heads, each with its own Linear projections of V, K, Q, then Concat → Linear)]

25
Multi-head Self-Attention – Recipe

• Let $Q_l, K_l, V_l \in \mathbb{R}^{d \times d/h}$, where $h$ is the number of attention heads and $l = 1..h$:
$\mathrm{output}_l = \mathrm{softmax}\!\left(XQ_l K_l^{\mathsf T} X^{\mathsf T} / \sqrt{d_k}\right) X V_l \in \mathbb{R}^{d/h}$

• Output of all attention heads:
$\mathrm{output} = Y\,[\mathrm{output}_1; \mathrm{output}_2; \ldots; \mathrm{output}_h]$, where $Y \in \mathbb{R}^{d \times d}$

• Each attention head is a different “look” at the language

26
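
A hedged NumPy sketch of the recipe: h heads, each projecting to d/h dimensions, concatenated and mixed by an output matrix Y (sizes, random weights, and the row-vector convention are assumptions):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

d, h, T = 8, 2, 5                                        # model dim, heads, length (assumed)
d_head = d // h
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))
Qs = [rng.normal(size=(d, d_head)) for _ in range(h)]    # Q_l per head
Ks = [rng.normal(size=(d, d_head)) for _ in range(h)]    # K_l per head
Vs = [rng.normal(size=(d, d_head)) for _ in range(h)]    # V_l per head
Y = rng.normal(size=(d, d))                              # output projection

heads = []
for Ql, Kl, Vl in zip(Qs, Ks, Vs):
    R = (X @ Ql) @ (X @ Kl).T / np.sqrt(d_head)          # scaled scores for head l
    heads.append(softmax_rows(R) @ (X @ Vl))             # (T, d/h) output of head l
output = np.concatenate(heads, axis=-1) @ Y              # concat heads, then mix: (T, d)
```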
Encoder architecture is complete, but what about the Decoder?

• The Encoder is now basically complete

• The Decoder conducts a query on the features from the Encoder

[Diagram: Encoder stack feeding the Decoder; Decoder with Multi-head Attention, Add & Norm and Feed forward, repeated 6x (# layers = 6)]

27
Decoder: Masked Multi-head Self-Attention

• Problem: The Decoder performs a sequential decoding process in the form of a language model; we can’t “see” the answer

• Solution: At each step of the Decoder’s calculation, we gradually expand the set of keys and values.

[Figure: here, it is necessary to mask the following (future) states]

28
Decoder: Masked Multi-head Self-Attention

• Problem: gradually expanding the set of keys and values ➔ parallelization is not possible

• Solution: Use Masked Multi-head Self-Attention to mask out attention to future words by setting their attention scores to $-\infty$:
$r_{ij} = \begin{cases} q_i \cdot k_j, & j < i \\ -\infty, & j \ge i \end{cases}$

[Figure: attention mask over the tokens <Start>, Do, you, understand: each row may only attend to earlier positions]

30
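
A minimal sketch of the masking step (NumPy; sizes and weights are assumptions). Note that this sketch uses the common convention that a position may also attend to itself, i.e. only strictly future positions j > i are masked, whereas the slide's formula masks j ≥ i:

```python
import numpy as np

def masked_attention_scores(Xq, Xk):
    T = Xq.shape[0]
    R = Xq @ Xk.T / np.sqrt(Xk.shape[-1])             # scaled scores
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal (future)
    R[mask] = -np.inf                                 # block attention to future words
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                           # e.g. <Start>, Do, you, understand
Q, K = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
R = masked_attention_scores(X @ Q, X @ K)
# a row-wise softmax over R now assigns zero weight to future positions
```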
Decoder: Masked Multi-head Self-Attention

• The layer following Masked Multi-head Attention is an “Add & Norm” layer

[Diagram: Decoder stack with Masked Multi-head Attention followed by Add & Norm, alongside the Encoder stack]

31
Encoder-Decoder Attention (Cross-Attention)

• In the Attention Mechanism lesson, the query comes from the decoder and the values come from the encoder; the Transformer does the same:
• $s_1, s_2, \ldots, s_T \in \mathbb{R}^d$ are encoder outputs
• $h_1, h_2, \ldots, h_T \in \mathbb{R}^d$ are decoder inputs
• The keys, values and queries are:
$k_i = K s_i$
$v_i = V s_i$
$q_i = Q h_i$

[Diagram: Multi-head Cross-Attention in the Decoder, taking keys and values from the Encoder outputs $s_i$ and queries from the Decoder states $h_i$]

32
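
A hedged sketch of cross-attention (NumPy): keys and values are projected from the encoder outputs s_i, queries from the decoder states h_i, as in the formulas above (sizes and random weights are assumptions):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

d, T_enc, T_dec = 8, 6, 4                        # assumed sizes
rng = np.random.default_rng(0)
S = rng.normal(size=(T_enc, d))                  # encoder outputs s_1..s_T
H = rng.normal(size=(T_dec, d))                  # decoder states h_1..h_T
Q, K, V = (rng.normal(size=(d, d)) for _ in range(3))

queries = H @ Q                                  # q_i = Q h_i (decoder side)
keys, values = S @ K, S @ V                      # k_i = K s_i, v_i = V s_i (encoder side)
R = queries @ keys.T / np.sqrt(d)                # each decoder position vs. every encoder position
cross_out = softmax_rows(R) @ values             # shape (T_dec, d)
```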


Decoder: Final steps

• Add “Feed forward” and “Add & Norm”
• Add “Linear” to project from the feature space to the dictionary (vocabulary) space
• Add “Softmax” to calculate the probability of the next word

[Diagram: full Transformer: Encoder and Decoder stacks (repeated 6x), with the Decoder ending in Linear → Softmax → Output probabilities]

33


Content

1. TRANSFORMER ARCHITECTURE

2. TRANSFORMER WEAKNESSES

3. APPLICATIONS OF TRANSFORMER

34
Drawbacks

• Computation grows quadratically with the sequence length!


▪ Computing all pairs of interactions

• Position representations: are fixed absolute indices the best choice? Alternatives include:
▪ Relative linear position attention [Shaw et al., 2018].
▪ Dependency syntax-based position [Wang et al., 2019]
▪ Rotary Embeddings [Su et al., 2021]

35
Improving on quadratic self-attention cost

• Linformer [Wang et al., 2020]: map from the $T$-dimensional (sequence-length) space to a lower-dimensional space

36
Improving on quadratic self-attention cost

• BigBird [Zaheer et al., 2021]: replace all-pairs interactions with a family of other interactions: random, window and global

37
Transformer Stability

• Transformer variants do not meaningfully improve the performance

38
Content

1. TRANSFORMER ARCHITECTURE

2. TRANSFORMER WEAKNESSES

3. APPLICATIONS OF TRANSFORMER

39
BERT and GPT Models

• BERT: Bidirectional Encoder Representations from Transformers
• GPT: Generative Pre-trained Transformer
• Common points:
▪ are trained with self-supervised learning
▪ use unlabeled data
▪ are used for representation learning
▪ use the Transformer architecture
▪ can be used for downstream tasks

Source: https://scholar.harvard.edu/sites/scholar.harvard.edu/files/binxuw/files/mlfs_tutorial_nlp_transformer_ssl_updated.pdf 40
BERT and GPT Models

• Differences:

                    BERT                           GPT
  Model             Masked language model          Autoregressive language model
  Construction      Encoder                        Decoder
  Task              Predict the masked word        Predict the next word
  Downstream tasks  Text classification,           Machine translation,
                    question answering,            content generation
                    text summarization,
                    entity identification

41
Using the Pre-trained Language Model

Fine-tuning:
• Use gradient descent to optimize the weights for one task
• Ways to fine-tune:
▪ Full model
▪ At readout heads
▪ Adapter
→ Changes the model parameters

Prompting:
• Design a special prompt to hint/constrain the network to solve the task
• No parameter update required
→ Changes how the model is used
42
Fine-tuning with Readout Heads

• Add the appropriate output and activation function
• Fine-tune the entire model with the new data

[Figure panels: fine-tune for word type labeling, question answering, and text classification]

43
Fine-tuning with Adapter

• Add a small module (adapter) to the language model
• Only fine-tune the parameters of the newly added module
➔ Easy to save models and reduces computational cost

44
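
A hedged PyTorch sketch of the adapter idea: a small bottleneck module is inserted after a frozen layer, and only its parameters are trained. The module names, sizes, and the stand-in base layer are illustrative assumptions, not a specific library API:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck module added after a frozen sub-layer."""
    def __init__(self, d_model=768, d_bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual around the adapter

base_layer = nn.Linear(768, 768)          # stand-in for a pre-trained Transformer sub-layer
for p in base_layer.parameters():
    p.requires_grad = False               # freeze the pre-trained weights

adapter = Adapter()
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)   # train only the adapter

x = torch.randn(4, 768)
out = adapter(base_layer(x))              # frozen layer, then trainable adapter
```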
Prompting: for models to learn from context

• Given some examples of the task that needs to be solved, the model will find a solution on its own

45
Fine-tune + Prompting: Instruction tuning

• Refine the model to answer questions
• The model can then generalize to solve other, previously unseen tasks

[Figure: Pre-trained Language model → Instruction-tune on tasks B, C, D… → Inference on new task A]

46
Transformer for sequence data

• Transformers not only work on text data, but can also be extended to other sequence-type data. Example: audio

Transformer on audio data ➔ OpenAI’s Whisper model

47
Transformer for sequence data

• Transformers not only work on text data, but can also be extended to other sequence-type data. Example: images

Transformer on image data ➔ Vision Transformer model

48
Transformer for sequence data

• Transformers not only work on text data, but can also be extended to other sequence-type data. Example: image + text

Transformer on multimodal data ➔ Stable Diffusion model

49
QUIZ & QUESTIONS

50
