Self-attention and transformers
Herman Kamper
2024-02, CC BY-SA 4.0
Issues with RNNs
Attention recap
Self-attention
Positional encodings
Multi-head attention
Masking the future in self-attention
Cross-attention
Transformer
Issues with RNNs
[Figure: an RNN processing the sentence fragment "Mister Dursley of number . . . was . . ." one word at a time.]
Architectural
Even with changes to deal with long-range dependencies (e.g. LSTM),
more recent observations inevitably have a bigger influence on the
current hidden state than those that are far away.
Computational
• Future RNN states can’t be computed before past hidden states
have been computed.
• Computations over time steps are therefore not parallelisable.
• We just can’t get away from the “for loop” over time in the forward pass of an RNN.
• So we can’t take advantage of the full power of batching on GPUs, which works best when several independent computations can be performed at once.
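As a rough illustration (a sketch, not from the original note): the RNN forward pass needs a loop over time, while an attention-style comparison between all positions is a single matrix product that can be parallelised.

```python
import numpy as np

T, D = 5, 4                       # sequence length, feature dimensionality
X = np.random.randn(T, D)         # one input sequence (one vector per time step)
W, U = np.random.randn(D, D), np.random.randn(D, D)

# RNN: each hidden state depends on the previous one, so we must loop over time.
h = np.zeros(D)
for t in range(T):                # inherently sequential: cannot be parallelised
    h = np.tanh(W.T @ X[t] + U.T @ h)

# Attention-style computation: all positions are compared to all other positions
# in a single matrix product, which a GPU can compute in parallel.
scores = X @ X.T                  # (T, T) scores between every pair of positions
```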
Attention doesn’t have these problems
[Figure: an encoder-decoder model with attention translating "hy het my gegooi" to "he threw me".]
Idea: Remove recurrence and rely solely on attention
Intuition from the Google AI blog post:
Attention recap
One way to think of attention intuitively is as a soft lookup table:
[Figure: attention as a soft lookup table. A query is compared to the keys, and the output is a weighted combination of the corresponding values.]
Computational graph:
[Figure: computational graph of attention. The query q is scored against each key k1, . . . , kN, the scores pass through a softmax to give the weights, and the weighted sum of the values v1, . . . , vN gives the context vector c.]
Mathematically:
• Output of attention: Context vector.

$$\mathbf{c} = \sum_{n=1}^{N} \alpha(\mathbf{q}, \mathbf{k}_n)\,\mathbf{v}_n \in \mathbb{R}^D$$

• Attention weight:

$$\alpha(\mathbf{q}, \mathbf{k}_n) = \mathrm{softmax}_n\big(a(\mathbf{q}, \mathbf{k}_n)\big) = \frac{\exp\left\{a(\mathbf{q}, \mathbf{k}_n)\right\}}{\sum_{j=1}^{N} \exp\left\{a(\mathbf{q}, \mathbf{k}_j)\right\}} \in [0, 1]$$

• Attention score:

$$a(\mathbf{q}, \mathbf{k}_n) \in \mathbb{R}$$
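A minimal NumPy sketch of these three equations, using a plain dot product as the score function a(q, kn) (the note keeps the score function generic; the scaled dot product appears later):

```python
import numpy as np

def attention_lookup(q, K, V):
    """Context vector c: values weighted by how well their keys match the query."""
    a = K @ q                                # scores a(q, k_n), here a dot product
    alpha = np.exp(a) / np.exp(a).sum()      # attention weights (softmax over n)
    return alpha @ V                         # c = sum_n alpha(q, k_n) v_n

# Example with N = 3 key-value pairs and D = 4 dimensional vectors
q = np.random.randn(4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
c = attention_lookup(q, K, V)                # c has shape (4,)
```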
Self-attention
[Figure: self-attention over the input "i went to school to learn". The query q6 for position 6 is compared to the keys k1, . . . , k6, and the values v1, . . . , v6 are combined to produce the output y6.]
[Figure: the full self-attention computation for position 6 of "i went to school to learn".]

$$\mathbf{y}_i = \sum_{t=1}^{T} \alpha_{i,t}\,\mathbf{v}_t$$

$$\alpha_{i,t} = \frac{e^{a_{i,t}}}{\sum_{j=1}^{T} e^{a_{i,j}}}$$

$$a_{i,t} = \frac{\mathbf{q}_i^\top \mathbf{k}_t}{\sqrt{D_k}}$$

$$\mathbf{q}_t = \mathbf{W}_q^\top \mathbf{x}_t, \qquad \mathbf{k}_t = \mathbf{W}_k^\top \mathbf{x}_t, \qquad \mathbf{v}_t = \mathbf{W}_v^\top \mathbf{x}_t$$

Layer input: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T$
Layer output: $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_T$
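As a sanity check, a NumPy sketch that computes a single output yi exactly as in the equations above (the matrix form in the next section does all positions at once):

```python
import numpy as np

def self_attention_output(i, X, Wq, Wk, Wv):
    """Compute y_i for a single position i, directly following the equations above."""
    Dk = Wk.shape[1]
    q_i = Wq.T @ X[i]                        # q_i = Wq^T x_i
    K = X @ Wk                               # row t is k_t = Wk^T x_t
    V = X @ Wv                               # row t is v_t = Wv^T x_t
    a = K @ q_i / np.sqrt(Dk)                # scores a_{i,t}
    alpha = np.exp(a) / np.exp(a).sum()      # weights alpha_{i,t}
    return alpha @ V                         # y_i = sum_t alpha_{i,t} v_t

# Example: T = 6 inputs of dimensionality D = 8, with Dk = Dv = 8
X = np.random.randn(6, 8)
Wq, Wk, Wv = (np.random.randn(8, 8) for _ in range(3))
y6 = self_attention_output(5, X, Wq, Wk, Wv)  # output for the last position
```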
In matrix form
Each of the T queries needs to be compared to each of the T keys.
We can express this in a compact matrix form.
Stack all the queries, keys and values as rows in matrices:
$$\mathbf{Q} \in \mathbb{R}^{T \times D_k}, \qquad \mathbf{K} \in \mathbb{R}^{T \times D_k}, \qquad \mathbf{V} \in \mathbb{R}^{T \times D_v}$$
We can then write all the dot products and weighting in a short
condensed form:
$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{D_k}}\right)\mathbf{V}$$
If we denote the output as $\mathbf{Y} = \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$, then we end up with a result $\mathbf{Y} \in \mathbb{R}^{T \times D_v}$.
The above holds in general for attention. For self-attention specifically,
we would have
$$\mathbf{Q} = \mathbf{X}\mathbf{W}_q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}_k, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}_v$$

where the design matrix is $\mathbf{X} \in \mathbb{R}^{T \times D}$, with $D$ the dimensionality of the input.

You can figure out the shapes for the $\mathbf{W}$'s, e.g. $\mathbf{W}_k \in \mathbb{R}^{D \times D_k}$.
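A direct NumPy sketch of this matrix form (the softmax is applied row-wise, i.e. over the keys for each query):

```python
import numpy as np

def softmax(A):
    """Row-wise softmax (subtracting the row maximum for numerical stability)."""
    expA = np.exp(A - A.max(axis=-1, keepdims=True))
    return expA / expA.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(Dk)) V."""
    Dk = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(Dk)) @ V

def self_attention(X, Wq, Wk, Wv):
    """Self-attention: queries, keys and values all come from the same input X."""
    return attention(X @ Wq, X @ Wk, X @ Wv)   # Y with shape (T, Dv)
```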
Self-attention: A new computational block
A new block or layer, like an RNN or a CNN.
Can use this in both encoder and decoder modules. E.g. for machine
translation:
Figure from (Vaswani et al., 2017).
Sometimes “transformer” is used to refer to the self-attention layers
themselves, but other times it is used to refer to this specific encoder-
decoder model (which we will unpack in the rest of this note).
Positional encodings intuition
Positional encodings
In contrast to RNNs, there isn't any order information in the inputs to self-attention.
We can add positional encodings to the inputs:
$$\mathbf{p}_t \in \mathbb{R}^D$$
There is a unique pt for every input position. E.g. p10 will always be
the same for all input sequences.
How do we incorporate them? The positional encodings can be
concatenated to the inputs:
$$\tilde{\mathbf{x}}_t = \left[\mathbf{x}_t\,;\,\mathbf{p}_t\right]$$
But it is more common to just add them:¹

$$\tilde{\mathbf{x}}_t = \mathbf{x}_t + \mathbf{p}_t$$
Where do the positional encodings pt come from?
Learned positional encodings
We can let the $\mathbf{p}_t$'s be learnable parameters. This means we are adding a learnable matrix $\mathbf{P} \in \mathbb{R}^{D \times T}$ to all input sequences.
Problem: What if we have inputs that are longer than T?

(But this approach is still often used in practice.)
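A minimal sketch of this option, stored with one row of a learnable matrix per position (the maximum length T_max and the initialisation are illustrative assumptions; in practice P is trained along with the rest of the model):

```python
import numpy as np

T_max, D = 512, 64                       # illustrative maximum length and dimensionality
P = 0.01 * np.random.randn(T_max, D)     # learnable positional encodings, one row per
                                         # position (updated by backprop like any weight)

def add_positions(X):
    """Add a learned positional encoding to each input vector (X has shape (T, D))."""
    T = X.shape[0]
    assert T <= T_max, "no encoding exists for positions beyond T_max"
    return X + P[:T]
```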
¹I like the idea of concatenation more than adding. But Benjamin van Niekerk pointed out to me that if you pass $\tilde{\mathbf{x}}_t$ through a single linear layer, then concatenation and addition are very similar: in both cases you end up with a new representation that is a weighted sum of the original input and the positional encoding (there are just additional weights specifically for the positional encoding when you concatenate).
Represent position using sinusoids
Let’s use a single sinusoid as our pt :
[Figure: a single sinusoid (encoding dimension d = 6) plotted over positions 0 to 60.]
In this case, we would have a unique positional feature value for inputs with lengths up to roughly T = 36, after which the feature value would repeat. This could be useful if relative position at this scale is more important than absolute position.
Let’s add a cosine to obtain pt :
[Figure: a sine (d = 6) and a cosine (d = 7) at the same frequency plotted over positions 0 to 60.]
Now we would have unique positional encodings for a longer range.
But the model could also just decide that relative position matters
more.
We used sinusoids at a single frequency, so we are limited in the types of relative relationships we can model. So let us add more sine and cosine functions at different frequencies:
[Figure: sines and cosines at two different frequencies (dimensions d = 6 to d = 9) plotted over positions 0 to 60.]
Formally (Vaswani et al., 2017):
$$\mathbf{p}_t = \begin{bmatrix} \sin\left(\frac{t}{\lambda_1}\right) \\ \cos\left(\frac{t}{\lambda_1}\right) \\ \sin\left(\frac{t}{\lambda_2}\right) \\ \cos\left(\frac{t}{\lambda_2}\right) \\ \vdots \\ \sin\left(\frac{t}{\lambda_{D/2}}\right) \\ \cos\left(\frac{t}{\lambda_{D/2}}\right) \end{bmatrix}$$

where

$$\lambda_m = 10\,000^{2m/D}$$
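A NumPy sketch that builds these encodings from the formula above, stacked with one row per position (this reads the notation as m = 1, . . . , D/2 and positions t = 1, 2, . . .; the exact indexing convention is an assumption):

```python
import numpy as np

def sinusoidal_encodings(T, D):
    """Return a (T, D) matrix whose t-th row is the positional encoding p_t."""
    t = np.arange(1, T + 1)[:, None]            # positions t = 1, ..., T
    m = np.arange(1, D // 2 + 1)[None, :]       # frequency index m = 1, ..., D/2
    wavelength = 10_000.0 ** (2 * m / D)        # lambda_m
    P = np.zeros((T, D))
    P[:, 0::2] = np.sin(t / wavelength)         # dimensions 0, 2, 4, ...: sines
    P[:, 1::2] = np.cos(t / wavelength)         # dimensions 1, 3, 5, ...: cosines
    return P
```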
If we stack all these into $\mathbf{P} \in \mathbb{R}^{D \times T}$:

[Figure: the full positional encoding matrix, with encoding dimension on the vertical axis and position on the horizontal axis.]
There are formal reasons why this encodes relative position (Denk, 2019).² But intuitively you should be able to see that the periodicity indicates that absolute position isn't necessarily important.

In practice, however, this approach does not enable extrapolation to sequences that are much longer than those seen during training (Hewitt, 2023).
(But it is still often used in practice. The original transformer paper
did this – look at the transformer diagram above.)
²For a fixed offset between two positional encodings, there is a linear transformation to take you from the one to the other. E.g. you can go from $\mathbf{p}_{10}$ to $\mathbf{p}_{15}$ using some linear transformation, and this will be the same transformation needed to go from $\mathbf{p}_{30}$ to $\mathbf{p}_{35}$.
The clock analogy for positional encodings
Think of each pair of dimensions of $\mathbf{p}_t$ as a clock rotating at a different frequency. The position of the clock hand is uniquely determined by the sine and cosine values for that frequency.³
We have D/2 clocks. For each position t, we will have a specific configuration of the clocks. This tells us where in the input sequence we are. This works even if we never saw an input sequence of this length during training (the clocks just keep moving).
But there is also periodicity in how the clocks change with different t. To move from the configuration p10 to p15, we need to change the clock faces in some way (this can be done through a linear transformation). But the way in which we change the clock faces would be the same as the transformation from p30 to p35.
So in short, the sinusoidal positional encodings can tell us where we
are in the input, even if that position was never seen during training.
But it also allows for relative position information to be captured.
³Analogy from Benjamin van Niekerk.
Multi-head attention
Hypothetical example:
[Figure: two attention heads over "i went to school to learn": head [1] attends to semantically related words, head [2] to syntactically related words.]
Analogy: Each head is like a different kernel in a CNN
From Lena Voita’s blog:
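A sketch of multi-head self-attention: each head has its own projection matrices (like separate kernels in a CNN), and the head outputs are concatenated and mixed with an output projection, as in Vaswani et al. (2017). The function and variable names here are just for illustration.

```python
import numpy as np

def softmax(A):
    expA = np.exp(A - A.max(axis=-1, keepdims=True))
    return expA / expA.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, Wo):
    """heads is a list of (Wq, Wk, Wv) triples, one per attention head."""
    outputs = []
    for Wq, Wk, Wv in heads:                            # each head has its own projections
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        outputs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    return np.concatenate(outputs, axis=-1) @ Wo        # concatenate heads and project
```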
Masking the future in self-attention
If we have a network or decoder that needs to be causal, then we
should ensure that it can only attend to the past when making the
current prediction.
E.g. if we are doing language modelling:
[Figure: masked self-attention for position 4 of "i went to school to learn": the query q4 only attends to keys k1, . . . , k4 and values v1, . . . , v4 when producing y4.]
Mathematically:

$$a_{i,t} = \begin{cases} \dfrac{\mathbf{q}_i^\top \mathbf{k}_t}{\sqrt{D_k}} & \text{if } t \leq i \\[6pt] -\infty & \text{if } t > i \end{cases}$$
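A sketch of how the mask is applied in practice: the scores for t > i are set to negative infinity before the softmax, so those positions get zero weight.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention: position i can only attend to positions t <= i."""
    T = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = Q @ K.T / np.sqrt(K.shape[-1])                # scores a_{i,t}
    A[np.triu_indices(T, k=1)] = -np.inf              # a_{i,t} = -inf for t > i
    expA = np.exp(A - A.max(axis=-1, keepdims=True))
    alpha = expA / expA.sum(axis=-1, keepdims=True)   # future positions get weight 0
    return alpha @ V
```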
Have a careful look at what happens in the Google transformer diagram
for machine translation:
Cross-attention
[Figure: cross-attention in the encoder-decoder model translating "hy het my gegooi" to "he threw me".]

• Keys and values: from the encoder
• Queries: from the decoder
Have a look at the Google transformer diagram again.
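A sketch of this layer, assuming the decoder states X_dec provide the queries and the encoder outputs H_enc provide the keys and values (the names are illustrative):

```python
import numpy as np

def softmax(A):
    expA = np.exp(A - A.max(axis=-1, keepdims=True))
    return expA / expA.sum(axis=-1, keepdims=True)

def cross_attention(X_dec, H_enc, Wq, Wk, Wv):
    """Queries from the decoder; keys and values from the encoder output."""
    Q = X_dec @ Wq                                        # (T_dec, Dk)
    K = H_enc @ Wk                                        # (T_enc, Dk)
    V = H_enc @ Wv                                        # (T_enc, Dv)
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V    # (T_dec, Dv)
```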
Transformer
Figure from (Vaswani et al., 2017).
We haven’t spoken about the add & norm block:
• Residual connections
• Layer normalisation
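A rough sketch of what these two parts compute together (post-norm ordering, as in the original transformer); gamma and beta are the learnable layer-norm gain and bias:

```python
import numpy as np

def layer_norm(X, gamma, beta, eps=1e-5):
    """Normalise each position's vector to zero mean and unit variance, then scale and shift."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return gamma * (X - mean) / (std + eps) + beta

def add_and_norm(X, sublayer_out, gamma, beta):
    """Residual connection around a sublayer (e.g. self-attention), followed by layer norm."""
    return layer_norm(X + sublayer_out, gamma, beta)
```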
Videos covered in this note
• Intuition behind self-attention (12 min)
• Attention recap (6 min)
• Self-attention details (13 min)
• Self-attention in matrix form (5 min)
• Positional encodings in transformers (19 min)
• The clock analogy for positional encodings (5 min)
• Multi-head attention (5 min)
• Masking the future in self-attention (5 min)
• Cross-attention (7 min)
• Transformer (4 min)
Acknowledgments
Christiaan Jacobs and Benjamin van Niekerk were instrumental in
helping me to start to understand self-attention and transformers.
This note relied heavily on content from:
• Chris Manning's CS224N course at Stanford University, particularly the transformer lecture by John Hewitt
• Lena Voita’s NLP course for you
Further reading
A. Goldie, “CS224N: Pretraining,” Stanford University, 2022.
A. Huang, S. Subramanian, J. Sum, K. Almubarak, and S. Biderman,
“The annotated transformer,” Harvard University, 2022.
References
T. Denk, “Linear relationships in the transformer’s positional encoding,”
2019.
J. Hewitt, “CS224N: Self-attention and transformers,” Stanford University, 2023.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in
NeurIPS, 2017.
L. Voita, “Sequence to sequence (seq2seq) and attention,” 2023.
A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep
Learning, 2021.