Solving Transformer by Hand: A Step-by-Step Math Example
Fareed Khan · Published in Level Up Coding · Dec 18, 2023
I have already written a detailed blog on how transformers work using a very
small sample of the dataset. It is perhaps my best blog ever, because it
elevated my profile and gave me the motivation to write more. However,
that blog is incomplete, as it only covers 20% of the transformer architecture
and contains numerous calculation errors, as pointed out by readers. After a
considerable amount of time has passed since that blog, I am revisiting
the topic in this new blog.
My previous blog on transformer architecture (covers only 20%):
Understanding Transformers: A Step-by-Step Math Example — Part 1
I understand that the transformer architecture may seem scary, and
you might have encountered various explanations on...
medium.com
I plan to explain the transformer again in the same manner as I did in my
previous blog (for both coders and non-coders), providing a complete guide
with a step-by-step approach to understanding how they work.
Table of Contents
+ Defining our Dataset
+ Finding Vocab Size
+ Encoding
+ Calculating Embedding
+ Calculating Positional Embedding
+ Concatenating Positional and Word Embeddings
+ Multi Head Attention
+ Adding and Normalizing
+ Feed Forward Network
+ Adding and Normalizing Again
+ Decoder Part
+ Understanding Mask Multi Head Attention
+ Calculating the Predicted Word
+ Important Points
+ Conclusion
Step 1 — Defining our Dataset
The dataset used for creating ChatGPT is 570 GB. On the other hand, for our
purposes, we will be using a very small dataset to perform numerical
calculations visually.
Dataset (corpus)
I drink and I know things.
When you play the game of thrones, you win or you die.
The true enemy won't wait out the storm, He brings the storm.
Our entire dataset containing only three sentences
Our entire dataset contains only three sentences, all of which are dialogues
taken from a TV show. Although our dataset is cleaned, in real-world
scenarios like ChatGPT creation, cleaning a 570 GB dataset requires a
significant amount of effort.
Step 2 — Finding Vocab Size
The vocabulary size determines the total number of unique words in our
dataset. It can be calculated using the below formula, where N is the total
number of words in our dataset.
vocab size = count(set(N))
vocab_size formula where N is total number of words
In order to find N, we need to break our dataset into individual words.
Dataset (Corpus)
I drink and I know things.
When you play the game of thrones, you win or you die.
The true enemy won't wait out the storm, He brings the storm.

N = I, drink, and, I, know, things, When, you, play, the, game, of, thrones, you, win, or, you, die, The, true, enemy, won't, wait, out, the, storm, He, brings, the, storm

calculating variable N
After obtaining N, we perform a set operation to remove duplicates, and
then we can count the unique words to determine the vocabulary size.
vocab_size = count(set(N))

set(N) = I, drink, and, know, things, When, you, play, the, game, of, thrones, win, or, die, true, enemy, won't, wait, out, storm, He, brings

count(set(N)) = 23

finding vocab size
Therefore, the vocabulary size is 23, as there are 23 unique words in our
dataset.
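For coders, the same calculation takes a few lines of Python. How punctuation and letter case are handled is my assumption here, so the exact count may differ by one or two from the walkthrough:

```python
corpus = [
    "I drink and I know things.",
    "When you play the game of thrones, you win or you die.",
    "The true enemy won't wait out the storm, He brings the storm.",
]

# N = every word in the dataset; strip the trailing . and , so that
# "things." and "things" count as the same word.
N = [w.strip(".,") for sentence in corpus for w in sentence.split()]

# vocab_size = count(set(N))
vocab_size = len(set(N))
print(vocab_size)  # roughly 23, depending on how case ("The" vs "the") is treated
```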
Step 3 — Encoding
Now, we need to assign a unique number to each unique word.
 1    2      3       4     5     6      7     8    9     10     11      12
 I    drink  things  know  When  won't  play  out  true  storm  brings  game

13    14   15  16     17   18    19       20   21  22   23
the   win  of  enemy  you  wait  thrones  and  or  die  He

encoding our unique words
While we have treated each whole word as a single token and assigned a
number to it, ChatGPT treats a portion of a word as a single token,
following the rough rule: 1 token ≈ 0.75 words.
After encoding our entire dataset, it’s time to select our input and start
working with the transformer architecture.
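In code, the encoding step is just a dictionary lookup; the specific IDs below are illustrative, not the ones in the table above:

```python
words = "When you play the game of thrones you win or you die".split()

# Assign an arbitrary unique integer to every unique word; the exact
# numbering is irrelevant, it only has to stay consistent.
word_to_id = {w: i for i, w in enumerate(dict.fromkeys(words), start=1)}
encoded = [word_to_id[w] for w in words]

print(word_to_id)
print(encoded)   # the sentence as a list of token IDs
```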
Step 4 — Calculating Embedding
Let's select a sentence from our corpus that will be processed in our
transformer architecture.
[Figure: the encoding table with the selected input sentence highlighted]

When   you   play   game   of   thrones
  5    17     7     12     15     19

Input sentence for transformer
We have selected our input, and we need to find an embedding vector for it.
The original paper uses a 512-dimensional embedding vector for each input
word.
[Figure: each input word mapped to a 512-dimensional embedding vector, as in the original "Attention Is All You Need" paper]

Original Paper uses 512 dimension vector
In our case, we will work with a smaller embedding dimension so that the
calculations are easy to visualize: we will use an embedding dimension of 6.
When    you    play   game   of     thrones
 5      17      7      12    15      19
e1      e2     e3     e4     e5      e6
0.79    0.38   0.01   0.12   0.88   0.6
0.6     0.12   0.51   0.6    0.41   0.33
0.96    0.06   0.27   0.65   0.79   0.75
0.64    0.79   0.31   0.22   0.62   0.48
0.97    0.9    0.56   0.07   0.5    0.94
0.2     0.74   0.59   0.37   0.7    0.21

Embedding vectors of our input
These values of the embedding vector are between 0 and 1 and are filled
randomly in the beginning. They will later be updated as our transformer
starts understanding the meanings among the words.
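A small sketch of how such an embedding lookup could be initialized, with random values in [0, 1) as in the table above (real implementations use a learned embedding layer; the values here are stand-ins, not the numbers from the table):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 23, 6                           # 23 unique words, embedding dimension 6
embedding_table = rng.random((vocab_size, d_model))   # random values in [0, 1) to start with

# IDs of "when you play game of thrones" from the encoding table (1-based -> 0-based)
token_ids = np.array([5, 17, 7, 12, 15, 19]) - 1
word_embeddings = embedding_table[token_ids]          # shape (6, 6): one row per input word
print(word_embeddings.shape)
```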
Step 5 — Calculating Positional Embedding
Now we need to find positional embeddings for our input. There are two
formulas for positional embedding, chosen according to whether the index i of
the embedding dimension is even or odd.
For the even positions (2i) of the embedding vector of any word:
PE(pos, 2i) = sin(pos / 10000^(2i/d))

For the odd positions (2i+1):
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Positional Embedding formula
As you know, our input sentence is "when you play game of thrones", and the
starting word is "when", with a starting position (pos) of 0 and a
dimension (d) of 6. For i from 0 to 5, we calculate the positional
embedding for the first word of our input sentence.
When (pos = 0, d = 6)

i    e1     position   formula                      PE
0    0.79   even       sin(0 / 10000^(0/6))    =    0
1    0.6    odd        cos(0 / 10000^(0/6))    =    1
2    0.96   even       sin(0 / 10000^(2/6))    =    0
3    0.64   odd        cos(0 / 10000^(2/6))    =    1
4    0.97   even       sin(0 / 10000^(4/6))    =    0
5    0.2    odd        cos(0 / 10000^(4/6))    =    1

Positional Embedding for word: When
Similarly, we can calculate positional embedding for all the words in our
input sentence.
        When     you      play     game     of       thrones
pos      0        1        2        3        4        5
i = 0    0        0.8415   0.9093   0.1411  -0.7568  -0.9589
i = 1    1        0.0464   0.9957   0.1388   0.1846   0.9732
i = 2    0        0.0022   0.0043   0.0065   0.0086   0.0108
i = 3    1        0.0001   1        0.0003   0.0004   1
i = 4    0        0        0        0        0        0
i = 5    1        0        1        0        0        1

Calculating Positional Embeddings of our input (the calculated values are rounded)
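The same formula in a short Python function (the rounded values may differ slightly from the table above depending on the exact exponent convention used):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model):
            angle = pos / 10000 ** ((2 * (i // 2)) / d_model)
            pe[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

print(positional_encoding(seq_len=6, d_model=6).round(4))
```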
Step 6 — Concatenating Positional and Word Embeddings
After calculating positional embedding, we need to add the word embeddings
and the positional embeddings.
[Figure: the positional embedding matrix is added element-wise to the word embedding matrix]

Word Embedding + Positional Embedding (6 x 6):

When      0.79   1.6    0.96   1.64   0.97   1.2
you       1.22   0.17   0.06   0.72   0.9    0.74
play      0.92   1.51   0.27   1.31   0.86   1.59
game      0.26   0.74   0.66   0.22   0.07   0.37
of        0.12   0.59   0.8    0.62   0.5    0.7
thrones  -0.36   1.3    0.76   1.48   0.94   1.21

concatenation step
This resultant matrix from combining both matrices (Word embedding
matrix and positional embedding matrix) will be considered as an input to
the encoder part.
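Despite the step's name, the operation is an element-wise addition of two matrices of the same shape. A minimal sketch, reusing the positional_encoding function from the earlier snippet and random stand-in word embeddings:

```python
import numpy as np

seq_len, d_model = 6, 6
word_emb = np.random.default_rng(0).random((seq_len, d_model))  # stand-in word embeddings
pos_emb = positional_encoding(seq_len, d_model)                 # function from the sketch above

# The "concatenation" step here is an element-wise addition of two (6, 6) matrices.
encoder_input = word_emb + pos_emb
print(encoder_input.shape)   # (6, 6), fed into the encoder
```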
Step 7 — Multi Head Attention
Multi-head attention is made up of many single-head attentions, and it is up
to us how many single heads we combine. For example, Meta's LLaMA uses 32
attention heads. Below is an illustration of what single-head attention
looks like.
Single Head Attention
Single Head attention in Transformer
There are three inputs: query, key, and value. Each of these matrices is
obtained by multiplying the matrix we computed earlier, the sum of the word
embedding and positional embedding matrices, by a different set of weights.

For computing the query matrix, the weights matrix must have as many rows as
the input matrix has columns (6 in our case), while the number of columns can
be anything; we suppose 4 columns in our weights matrix. The values in the
weights matrix are initialized randomly between 0 and 1, and they will later
be updated when our transformer starts learning the meaning of these words.
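As a rough NumPy sketch of this step (the weight values here are random stand-ins, not the numbers used in the figures below):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_head = 6, 6, 4

X = rng.random((seq_len, d_model))   # stand-in for (word embedding + positional embedding)

# Three independent, randomly initialised weight matrices of shape (6, 4).
W_q, W_k, W_v = (rng.random((d_model, d_head)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # each has shape (6, 4)
print(Q.shape, K.shape, V.shape)
```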
[Figure: the 6 x 6 (word embedding + positional embedding) matrix multiplied by a random 6 x 4 weights matrix for the query]

calculating Query matrix
Similarly, we can compute the key and value matrices using the same
procedure, but the values in the weights matrix must be different for both.
[Figure: the same 6 x 6 input matrix multiplied by two further random 6 x 4 weights matrices to obtain the key and value matrices]

Calculating Key and Value Matrices
So, after multiplying matrices, the resultant query, key, and values are
obtained:
[Figure: the resulting Query, Key, and Value matrices, each of shape 6 x 4]

Query, Key, Value matrices
Now that we have all three matrices, let’s start calculating single-head
attention step by step.
[Figure: the Query matrix (6 x 4) multiplied by the transpose of the Key matrix (4 x 6) gives a 6 x 6 score matrix]

matrix multiplication between Query and Key
For scaling the resultant matrix, we have to reuse the dimension of our
embedding vector, which is 6.
[Figure: each entry of the 6 x 6 score matrix is divided by √d, where d (dimension) is 6]

scaling the resultant matrix with dimension 6
The next step of masking is optional, and we won't be calculating it. Masking
is like telling the model to focus only on what’s happened before a certain
point and not peek into the future while figuring out the importance of
different words in a sentence. It helps the model understand things in a step-
by-step manner, without cheating by looking ahead.
So now we will be applying the softmax operation on our scaled resultant
matrix.
SoftMax: s(x_i) = e^(x_i) / Σ_j e^(x_j)

[Figure: applying softmax row-wise to the scaled 6 x 6 matrix, so that each row sums to 1]

Applying softmax on resultant matrix
Doing the final multiplication step to obtain the resultant matrix from
single-head attention.
[Figure: the 6 x 6 softmax matrix multiplied by the 6 x 4 Value matrix gives the final 6 x 4 output of single-head attention]

calculating the final matrix of single head attention
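Here is a compact sketch of the whole single-head attention chain, continuing from the Q, K, V sketch above. Note that the walkthrough scales by the embedding dimension d = 6, while the original paper scales by the key dimension d_k (4 in this example):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q, K, V, scale, mask=None):
    scores = Q @ K.T / np.sqrt(scale)          # (seq_len, seq_len) score matrix
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # optional masking step
    return softmax(scores, axis=-1) @ V        # (seq_len, d_head)

# Q, K, V: the (6, 4) matrices from the earlier sketch; scale = 6 follows the blog.
attention_output = single_head_attention(Q, K, V, scale=6)
print(attention_output.shape)   # (6, 4)
```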
We have now calculated single-head attention. As I stated earlier, multi-head
attention comprises many single-head attentions. Below is a visual of how it
looks:
Multi Head Attention
Multi Head attention in Transformer
Each single-head attention has three inputs: query, key, and value, and each
head has its own set of weights. Once all single-head attentions output
their resultant matrices, they are all concatenated, and the final
concatenated matrix is once again transformed linearly by multiplying it
with a weights matrix initialized with random values, which will later
get updated when the transformer starts training.
In our case we are using only a single head, but this is how it looks if we
work with multi-head attention.
[Figure: our single-head attention output (6 x 4) compared with the real-world case of N heads, whose outputs are concatenated]

Single Head attention vs Multi Head attention
In either case, whether it is single-head or multi-head attention, the
resultant matrix needs to be transformed linearly once again by multiplying
it with a weights matrix.
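Putting the pieces together, a hedged sketch of multi-head attention, reusing the single_head_attention helper and the input X from the earlier snippets; the number of heads, head size, and weight values are arbitrary stand-ins:

```python
import numpy as np

def multi_head_attention(X, num_heads, d_head, rng):
    heads = []
    for _ in range(num_heads):
        # Each head gets its own randomly initialised Wq, Wk, Wv.
        W_q, W_k, W_v = (rng.random((X.shape[1], d_head)) for _ in range(3))
        heads.append(single_head_attention(X @ W_q, X @ W_k, X @ W_v, scale=X.shape[1]))
    concat = np.concatenate(heads, axis=-1)           # (seq_len, num_heads * d_head)
    W_o = rng.random((concat.shape[1], X.shape[1]))   # final linear projection back to d_model
    return concat @ W_o                               # (seq_len, d_model)

out = multi_head_attention(X, num_heads=2, d_head=4, rng=np.random.default_rng(2))
print(out.shape)   # (6, 6)
```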
[Figure: the 6 x 4 attention output multiplied by a random 4 x 6 weights matrix, whose column count matches the (word embedding + positional embedding) matrix]

linear transformation of the single head attention matrix
Make sure that the number of columns of this linear weights matrix equals the
number of columns of the matrix we computed earlier (word embedding +
positional embedding), because in the next step we will be adding the
resultant matrix to that (word embedding + positional embedding) matrix.
Output of Multi Head attention (6 x 6)

10.84   9.45   7.33   7.8    6.09   7.66
10.65   9.28   7.22   7.67   5.99   7.51
10.83   9.43   7.33   7.79   6.08   7.65
10.08   8.77   6.85   7.25   5.67   7.09
10.48   9.12   7.41   7.54   5.89   7.38
10.77   9.38   7.29   7.75   6.05   7.6

Output matrix of multi head attention
Now that we have computed the resultant matrix for multi-head attention, we
will work on the add and norm step.
Step 8 — Adding and Normalizing
Once we obtain the resultant matrix from multi-head attention, we have to
add it to our original matrix. Let’s do it first.
[Figure: the (word embedding + positional embedding) matrix and the multi-head attention output matrix, both 6 x 6, are added element-wise]
11.63   11.05   8.29   9.44   7.06   8.86
11.87    9.45   7.28   8.46   6.89   8.25
11.75   10.94   7.6    9.1    6.64   9.24
10.34    9.51   7.51   7.47   5.74   7.46
10.6     9.71   7.91   8.16   6.39   8.08
10.41   10.68   8.05   9.23   6.99   8.81

Adding matrices to perform add and norm step

To normalize the above matrix, we need to compute the mean and standard
deviation row-wise for each row.
Row-wise Implementation

Row                                            Mean    Standard Deviation
11.63  11.05  8.29  9.44  7.06  8.86   ———>    9.26    1.57
11.87   9.45  7.28  8.46  6.89  8.25   ———>    8.56    1.64
11.75  10.94  7.6   9.1   6.64  9.24   ———>    9.04    1.76
10.34   9.51  7.51  7.47  5.74  7.46   ———>    7.86    1.51
10.6    9.71  7.91  8.16  6.39  8.08   ———>    8.37    1.35
10.41  10.68  8.05  9.23  6.99  8.81   ———>    8.93    1.28

calculating mean and std
We subtract the corresponding row mean from each value of the matrix and
divide by the corresponding standard deviation.
(value − mean) / (std + error), e.g. (11.63 − 9.26) / (1.57 + 0.0001) ≈ 1.51

1.51   1.14  -0.62   0.11  -1.4   -0.25
2.02   0.54  -0.78  -0.06  -1.02  -0.19
1.54   1.08  -0.82   0.03  -1.36   0.11
1.64   1.09  -0.23  -0.26  -1.4   -0.26
1.65   0.99  -0.34  -0.16  -1.47  -0.21
1.16   1.37  -0.69   0.23  -1.52  -0.09

normalizing the resultant matrix
hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 22183cs, 81084 Salvng Tantomery Hand A Stepy-Stp Mah Exaile by Fareed Kran | Level Up Cong
Adding a small value of error prevents the denominator from being zero and
avoids making the entire term infinity.
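In code, the add & norm step might look like the following sketch. Note that a full LayerNorm also has learnable scale and shift parameters, which this walkthrough (and this sketch) omits:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-4):
    """Residual connection followed by row-wise (per-token) normalisation."""
    summed = x + sublayer_out
    mean = summed.mean(axis=-1, keepdims=True)
    std = summed.std(axis=-1, keepdims=True)
    return (summed - mean) / (std + eps)   # eps keeps the denominator away from zero
```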
Step 9 — Feed Forward Network
After normalizing the matrix, it will be processed through a feed-forward
network. We will use a very basic network that contains only one linear
layer and one ReLU activation layer. This is how it looks visually:
[Figure: our case uses a single linear layer (X·W + b) followed by ReLU(x) = max(0, x); a real-world feed-forward network stacks multiple layers]

Feed Forward network comparison
First, we calculate the linear layer by multiplying our last calculated
matrix with a randomly initialized weights matrix (which will be updated when
the transformer starts learning) and adding a bias matrix that also contains
random values.
[Figure: the 6 x 6 matrix from the add and norm step is multiplied by a random 6 x 6 weights matrix W, and a random bias is added]

Result of the linear layer (6 x 6):

0.91   1.25   1.09    0.56   0.57   1.15
0.66   1.44   1.36    0.54   0.81   1.42
0.95   1.36  -0.57    0.81   0.68   1.04
0.95   1.15   1.23    0.57   0.51   0.97
0.98   1.29  -0.62    0.53   0.55   1.09
1.04   1.2    0.86    0.68   0.49   0.97

Calculating Linear Layer
After calculating the linear layer, we need to pass it through the ReLU layer
and use its formula.
ReLU(x) = max(0, x)

Applying ReLU element-wise, e.g. max(0, 0.91) = 0.91 and max(0, −0.57) = 0:

0.91   1.25   1.09   0.56   0.57   1.15
0.66   1.44   1.36   0.54   0.81   1.42
0.95   1.36   0      0.81   0.68   1.04
0.95   1.15   1.23   0.57   0.51   0.97
0.98   1.29   0      0.53   0.55   1.09
1.04   1.2    0.86   0.68   0.49   0.97

Calculating ReLU Layer
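A minimal sketch of this simplified feed-forward network. The original paper's FFN uses two linear layers with a larger hidden dimension; here we mirror the single linear layer plus ReLU used above, with random stand-in values:

```python
import numpy as np

def feed_forward(x, W, b):
    """The blog's simplified network: one linear layer followed by ReLU."""
    return np.maximum(0, x @ W + b)   # ReLU(x·W + b)

rng = np.random.default_rng(3)
d_model = 6
W = rng.random((d_model, d_model))    # random weights, updated during training
b = rng.random(d_model)               # random bias
ffn_out = feed_forward(rng.random((6, d_model)), W, b)   # stand-in input
print(ffn_out.shape)   # (6, 6)
```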
Step 10 — Adding and Normalizing Again
Once we obtain the resultant matrix from the feed-forward network, we have to
add it to the matrix obtained from the previous add and norm step, and then
normalize it using the row-wise mean and standard deviation.
[Figure: the feed-forward output matrix is added to the matrix from the previous add and norm step, then each row is normalized using its own mean and standard deviation]

Add and Norm after Feed Forward Network
The output matrix of this add and norm step will serve as the query and key
matrices in one of the multi-head attention mechanisms of the decoder part,
which you can easily see by tracing the arrow from this add and norm block to
the decoder section.
Step 11 — Decoder Part
The good news is that up to this point we have calculated the entire encoder
part, and all the steps we performed, from encoding our dataset to passing
our matrix through the feed-forward network, were new calculations. From now
on, the remaining architecture of the transformer (the decoder part) involves
similar kinds of matrix multiplications.
Take a look at our transformer architecture. What we have covered so far
and what we have to cover yet:
[Figure: the transformer architecture, with what we have covered so far (the encoder stack, left) and what we still have to cover (the decoder stack, right)]

Upcoming steps illustration
We won't be calculating the entire decoder because most of its portion
contains similar calculations to what we have already done in the encoder.
Calculating the decoder in detail would only make the blog lengthy due to
repetitive steps. Instead, we only need to focus on the calculations of the
input and output of the decoder.
When training, there are two inputs to the decoder. One is from the encoder,
where the output matrix of the last add and norm layer serves as the query
and key for the second multi-head attention layer in the decoder part. Below
is a visualization of it (from Batool Haider):
[Figure: the encoder output feeding the decoder's second multi-head attention block]

Visualization is from Batool Haider
While the value matrix comes from the decoder after the first add and norm
step.
The second input to the decoder is the predicted text. If you remember, our
input to the encoder is "when you play game of thrones", so the input to the
decoder is the predicted text, which in our case is "you win or you die".

But the predicted input text needs to follow a standard wrapping of tokens
that makes the transformer aware of where to start and where to end.

Encoder Input ———> When you play game of thrones
Decoder Input ———> <start> you win or you die <end>

input comparison of encoder and decoder

Here <start> and <end> are two new tokens being introduced. Moreover,
the decoder takes one token as input at a time. It means that <start> will
serve as an input, and "you" must be the predicted text for it.
[Figure: the <start> token is given a random 6-dimensional embedding vector (0.31, 0.21, 0.12, 0.64, 0.98, 0.2) plus its positional encoding]

Decoder input word
As we already know, these embeddings are filled with random values, which
will later be updated during the training process. The rest of the blocks are
computed in the same way as we computed them earlier in the encoder part.
[Figure: the decoder stack; the calculation inside each block is the same as in the encoder]

Calculating Decoder
Before diving into any further details, we need to understand what masked
multi-head attention is, using a simple mathematical example.
Step 12 — Understanding Mask Multi Head Attention
In a Transformer, masked multi-head attention is like a spotlight that the
model uses to focus on different parts of a sentence. It is special because
it does not let the model cheat by looking at words that come later in the
sentence. This helps the model understand and generate sentences step by
step, which is important in tasks like talking or translating words into
another language.
Suppose we have the following input matrix, where each row represents a
position in the sequence and each column represents a feature:

Input Matrix =
1  2  3
4  5  6
7  8  9

input matrix for masked multi head attention
Now, let’s understand the masked multi-head attention components having
two heads:
1. Linear Projections (Query, Key, Value): Assume the linear projections for each head: Head 1: Wq1, Wk1, Wv1 and Head 2: Wq2, Wk2, Wv2.
2. Calculate Attention Scores: For each head, calculate attention scores using the dot product of Query and Key, and apply the mask to prevent attending to future positions.
3. Apply Softmax: Apply the softmax function to obtain attention weights.
4. Weighted Summation (Value): Multiply the attention weights by the Value to get the weighted sum for each head.
5. Concatenate and Linear Transformation: Concatenate the outputs from both heads and apply a linear transformation.
Let’s do a simplified calculation:
Assume Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, the identity matrix.

Head 1: Q1 = K1 = V1 = Input Matrix
Head 2: Q2 = K2 = V2 = Input Matrix

For each head we compute A = Q · K^T and mask the upper-triangular entries, so
that every position can only attend to itself and to earlier positions.
Applying softmax to the masked scores gives the attention weights, and
O1 = softmax(A1) · V1 (and likewise O2 for the second head).

Finally, Concatenate([O1, O2]) and apply a learnable linear transformation.

Mask Multi Head Attention (Two Heads)
The concatenation step combines the outputs from the two attention heads
into a single set of information. Imagine you have two friends who each give
you advice on a problem. Concatenating their advice means putting both
pieces of advice together so that you have a more complete view of what they
suggest. In the context of the transformer model, this step helps capture
different aspects of the input data from multiple perspectives, contributing
to a richer representation that the model can use for further processing.
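Here is a small NumPy sketch of the masking idea on the toy 3 x 3 input, assuming, as above, that the projection matrices are the identity:

```python
import numpy as np

X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])   # the toy input matrix from above

# With Wq = Wk = Wv = I, we simply have Q = K = V = X.
scores = X @ X.T

# Causal mask: position i may only attend to positions <= i.
causal = np.tril(np.ones_like(scores, dtype=bool))
scores = np.where(causal, scores, -np.inf)

# Row-wise softmax; the -inf entries become exactly 0.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ X   # each position is a weighted sum of itself and earlier positions
print(output.round(3))
```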
Step 13 — Calculating the Predicted Word
The output matrix of the last add and norm block of the decoder must
contain the same number of rows as the input matrix, while the number of
columns can be any. Here, we work with 6.
Rows must be 6 while columns can be any length:

1.2    1.33   2.11   2.62   3.06   3.54
0.56   1.45   2.24   2.78   3.78   4.55
1.34   2.04   2.26   2.66   3.56   3.86
0.89   1.79   2.46   2.49   3.32   3.64
0.91   1.04   1.63   1.92   1.95   2.1
0.12   0.99   1.2    1.56   1.57   1.7

Add and Norm output of decoder
The last add and norm block resultant matrix of the decoder must be
flattened in order to match it with a linear layer to find the predicted
probability of each unique word in our dataset (corpus).
[Figure: the 6 x 6 matrix is flattened into a single 1 x 36 row vector]

flattened the last add and norm block matrix

This flattened layer will be passed through a linear layer to compute the
logits (scores) of each unique word in our dataset.
Linear Layer = X · W (no bias)

[Figure: the 1 x n flattened row vector (n = 36 in our case) is multiplied by an n x m weights matrix of random values, where m = 23 is the vocab size, producing one logit per unique word]

Calculating Logits
Once we obtain the logits, we can use the softmax function to normalize
them and find the word that contains the highest probability.
Applying softmax: s(x_i) = e^(x_i) / Σ_j e^(x_j)

                 I      drink   things   ...   you    ...   He
                 1        2       3             17           23
Probabilities = 0.21    0.05    0.001    ...   0.56   ...   0.12
                                                 ↑
                                        highest probability

Finding the Predicted word
So based on our calculations, the predicted word from the decoder is you.
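A sketch of this final step, with random stand-in values rather than the numbers from the figures:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size = 23

decoder_out = rng.random((6, 6))               # stand-in for the decoder's last add & norm output
flat = decoder_out.reshape(1, -1)              # flatten to a 1 x 36 row vector

W = rng.random((flat.shape[1], vocab_size))    # linear layer without a bias term
logits = flat @ W                              # one score per word in the vocabulary

probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the 23 words
predicted_id = int(probs.argmax())             # index of the most probable word
print(predicted_id, probs.max())
```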
[Figure: "you" (ID 17) is predicted with probability 0.56; it will now act as an input to our decoder, and so on]

Final output of decoder
This predicted word, "you", will be treated as the next input word for the
decoder, and this process continues until the <end> token is predicted.
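Conceptually, the generation loop looks like this sketch; run_decoder is a hypothetical stand-in for the full decoder pass described above, not a function defined anywhere in this post:

```python
# Conceptual sketch only: run_decoder() stands in for the full decoder stack,
# returning the most probable next word given the encoder output and the
# tokens generated so far.
def generate(encoder_output, run_decoder, max_len=20):
    tokens = ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        tokens.append(run_decoder(encoder_output, tokens))
    return tokens[1:]   # the generated words (plus "<end>" if it was reached)
```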
Important Points
1. The above example is very simple, as it does not involve epochs or any
other important parameters that can only be visualized using a
programming language like Python.
2. It has shown the process only until training, while evaluation or testing
cannot be visually seen using this matrix approach.
3. Masked multi-head attention is used to prevent the transformer from
looking at future tokens, so the model cannot peek at the answer while learning to predict the next word.
Conclusion
hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 38143cs, 81084 Salvng Tantomery Hand A Stepy-Stp Mah Exaile by Fareed Kran | Level Up Cong
In this blog, I have shown you a very basic way of how transformers
mathematically work using matrix approaches. We have applied positional
encoding, softmax, feedforward network, and most importantly, multi-head
attention.
In the future, I will be posting more blogs on transformers and LLM as my
core focus is on NLP. More importantly, if you want to build your own
million-parameter LLM from scratch using Python, I have written a blog on
it which has received a lot of appreciation on Medium. You can read it here:
Building a Million-Parameter LLM from Scratch Using Python
A Step-by-Step Guide to Replicating LLaMA Architecture
levelup.gitconnected.com
Have a great time reading!