Solving Transformer by Hand: A Step-by-Step Math Example
Fareed Khan · Published in Level Up Coding · Dec 18, 2023
I have already written a detailed blog on how transformers work using a very
small sample of the dataset. It is perhaps my best blog ever, because it
elevated my profile and gave me the motivation to write more. However,
that blog is incomplete, as it only covers 20% of the transformer architecture
and contains numerous calculation errors, as pointed out by readers. After a
considerable amount of time has passed since that blog, I am revisiting
the topic in this new blog.
My previous blog on transformer architecture (covers only 20%):
Understanding Transformers: A Step-by-Step Math Example — Part 1
I understand that the transformer architecture may seem scary, and
you might have encountered various explanations on...
medium.com
I plan to explain the transformer again in the same manner as I did in my
previous blog (for both coders and non-coders), providing a complete guide
with a step-by-step approach to understanding how they work.
Table of Contents
+ Defining our Dataset
+ Finding Vocab Size
+ Encoding
+ Calculating Embedding
+ Calculating Positional Embedding
+ Concatenating Positional and Word Embeddings
+ Multi Head Attention
+ Adding and Normalizing
+ Feed Forward Network
+ Adding and Normalizing Again
+ Decoder Part
+ Understanding Mask Multi Head Attention
+ Calculating the Predicted Word
+ Important Points
+ Conclusion
Step 1 — Defining our Dataset
The dataset used for creating ChatGPT is 570 GB. On the other hand, for our
purposes, we will be using a very small dataset to perform numerical
calculations visually.
Dataset (corpus)
I drink and I know things.
When you play the game of thrones, you win or you die.
The true enemy won't wait out the storm, He brings the storm.
Our entire dataset containing only three sentences
Our entire dataset contains only three sentences, all of which are dialogues
taken from a TV show. Although our dataset is cleaned, in real-world
scenarios like ChatGPT creation, cleaning a 570 GB dataset requires a
significant amount of effort.
Step 2 — Finding Vocab Size
The vocabulary size determines the total number of unique words in our
dataset. It can be calculated using the below formula, where N is the total
number of words in our dataset.
vocab size = count(set(N))
vocab_size formula where N is total number of words
In order to find N, we need to break our dataset into individual words.
Dataset (Corpus)
I drink and I know things.
When you play the game of thrones, you win or you die.
The true enemy won't wait out the storm, He brings the storm.

N = I, drink, and, I, know, things, When, you, play, the, game, of, thrones, you, win, or, you, die, The, true, enemy, won't, wait, out, the, storm, He, brings, the, storm

calculating variable N
After obtaining N, we perform a set operation to remove duplicates, and
then we can count the unique words to determine the vocabulary size.
vocab_size = count(set(N))

set(N) = I, drink, and, know, things, When, you, play, the, game, of, thrones, win, or, die, true, enemy, won't, wait, out, storm, He, brings

count(set(N)) = 23

finding vocab size
Therefore, the vocabulary size is 23, as there are 23 unique words in our
dataset.
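For coders, the same calculation takes a few lines of Python. How punctuation and letter case are handled is my assumption here, so the exact count may differ by one or two from the walkthrough:

```python
corpus = [
    "I drink and I know things.",
    "When you play the game of thrones, you win or you die.",
    "The true enemy won't wait out the storm, He brings the storm.",
]

# N = every word in the dataset; strip the trailing . and , so that
# "things." and "things" count as the same word.
N = [w.strip(".,") for sentence in corpus for w in sentence.split()]

# vocab_size = count(set(N))
vocab_size = len(set(N))
print(vocab_size)  # roughly 23, depending on how case ("The" vs "the") is treated
```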
Step 3 — Encoding
Now, we need to assign a unique number to each unique word.
 1    2      3       4     5     6      7     8    9     10     11      12
 I    drink  things  know  When  won't  play  out  true  storm  brings  game

13    14   15  16     17   18    19       20   21  22   23
the   win  of  enemy  you  wait  thrones  and  or  die  He

encoding our unique words
While we have treated each whole word as a single token and assigned a
number to it, ChatGPT treats a portion of a word as a single token,
following the rough rule: 1 token ≈ 0.75 words.
After encoding our entire dataset, it’s time to select our input and start
working with the transformer architecture.
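In code, the encoding step is just a dictionary lookup; the specific IDs below are illustrative, not the ones in the table above:

```python
words = "When you play the game of thrones you win or you die".split()

# Assign an arbitrary unique integer to every unique word; the exact
# numbering is irrelevant, it only has to stay consistent.
word_to_id = {w: i for i, w in enumerate(dict.fromkeys(words), start=1)}
encoded = [word_to_id[w] for w in words]

print(word_to_id)
print(encoded)   # the sentence as a list of token IDs
```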
Step 4 — Calculating Embedding
Let's select a sentence from our corpus that will be processed in our
transformer architecture.
[Figure: the encoding table with the selected input sentence highlighted]

When   you   play   game   of   thrones
  5    17     7     12     15     19

Input sentence for transformer
We have selected our input, and we need to find an embedding vector for it.
The original paper uses a 512-dimensional embedding vector for each input
word.
[Figure: each input word mapped to a 512-dimensional embedding vector, as in the original "Attention Is All You Need" paper]

Original Paper uses 512 dimension vector
In our case, we will work with a smaller embedding dimension so that the
calculations are easy to visualize: we will use an embedding dimension of 6.
When    you    play   game   of     thrones
 5      17      7      12    15      19
e1      e2     e3     e4     e5      e6
0.79    0.38   0.01   0.12   0.88   0.6
0.6     0.12   0.51   0.6    0.41   0.33
0.96    0.06   0.27   0.65   0.79   0.75
0.64    0.79   0.31   0.22   0.62   0.48
0.97    0.9    0.56   0.07   0.5    0.94
0.2     0.74   0.59   0.37   0.7    0.21

Embedding vectors of our input
These values of the embedding vector are between 0 and 1 and are filled
randomly in the beginning. They will later be updated as our transformer
starts understanding the meanings among the words.
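A small sketch of how such an embedding lookup could be initialized, with random values in [0, 1) as in the table above (real implementations use a learned embedding layer; the values here are stand-ins, not the numbers from the table):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 23, 6                           # 23 unique words, embedding dimension 6
embedding_table = rng.random((vocab_size, d_model))   # random values in [0, 1) to start with

# IDs of "when you play game of thrones" from the encoding table (1-based -> 0-based)
token_ids = np.array([5, 17, 7, 12, 15, 19]) - 1
word_embeddings = embedding_table[token_ids]          # shape (6, 6): one row per input word
print(word_embeddings.shape)
```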
Step 5 — Calculating Positional Embedding
Now we need to find positional embeddings for our input. There are two
formulas for positional embedding, chosen according to whether the index i of
the embedding dimension is even or odd.
For the even positions (2i) of the embedding vector of any word:
PE(pos, 2i) = sin(pos / 10000^(2i/d))

For the odd positions (2i+1):
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Positional Embedding formula
As you know, our input sentence is "when you play game of thrones", and the
starting word is "when", with a starting position (pos) of 0 and a
dimension (d) of 6. For i from 0 to 5, we calculate the positional
embedding for the first word of our input sentence.
When (pos = 0, d = 6)

i    e1     position   formula                      PE
0    0.79   even       sin(0 / 10000^(0/6))    =    0
1    0.6    odd        cos(0 / 10000^(0/6))    =    1
2    0.96   even       sin(0 / 10000^(2/6))    =    0
3    0.64   odd        cos(0 / 10000^(2/6))    =    1
4    0.97   even       sin(0 / 10000^(4/6))    =    0
5    0.2    odd        cos(0 / 10000^(4/6))    =    1

Positional Embedding for word: When
Similarly, we can calculate positional embedding for all the words in our
input sentence.
        When     you      play     game     of       thrones
pos      0        1        2        3        4        5
i = 0    0        0.8415   0.9093   0.1411  -0.7568  -0.9589
i = 1    1        0.0464   0.9957   0.1388   0.1846   0.9732
i = 2    0        0.0022   0.0043   0.0065   0.0086   0.0108
i = 3    1        0.0001   1        0.0003   0.0004   1
i = 4    0        0        0        0        0        0
i = 5    1        0        1        0        0        1

Calculating Positional Embeddings of our input (the calculated values are rounded)
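The same formula in a short Python function (the rounded values may differ slightly from the table above depending on the exact exponent convention used):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d))."""
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(d_model):
            angle = pos / 10000 ** ((2 * (i // 2)) / d_model)
            pe[pos, i] = np.sin(angle) if i % 2 == 0 else np.cos(angle)
    return pe

print(positional_encoding(seq_len=6, d_model=6).round(4))
```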
Step 6 — Concatenating Positional and Word Embeddings
After calculating positional embedding, we need to add the word embeddings
and the positional embeddings.
[Figure: the positional embedding matrix is added element-wise to the word embedding matrix]

Word Embedding + Positional Embedding (6 x 6):

When      0.79   1.6    0.96   1.64   0.97   1.2
you       1.22   0.17   0.06   0.72   0.9    0.74
play      0.92   1.51   0.27   1.31   0.86   1.59
game      0.26   0.74   0.66   0.22   0.07   0.37
of        0.12   0.59   0.8    0.62   0.5    0.7
thrones  -0.36   1.3    0.76   1.48   0.94   1.21

concatenation step
This resultant matrix from combining both matrices (Word embedding
matrix and positional embedding matrix) will be considered as an input to
the encoder part.
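Despite the step's name, the operation is an element-wise addition of two matrices of the same shape. A minimal sketch, reusing the positional_encoding function from the earlier snippet and random stand-in word embeddings:

```python
import numpy as np

seq_len, d_model = 6, 6
word_emb = np.random.default_rng(0).random((seq_len, d_model))  # stand-in word embeddings
pos_emb = positional_encoding(seq_len, d_model)                 # function from the sketch above

# The "concatenation" step here is an element-wise addition of two (6, 6) matrices.
encoder_input = word_emb + pos_emb
print(encoder_input.shape)   # (6, 6), fed into the encoder
```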
Step 7 — Multi Head Attention
Multi-head attention is made up of many single-head attentions, and it is up
to us how many single heads we combine. For example, Meta's LLaMA uses 32
attention heads. Below is an illustration of what single-head attention
looks like.
Single Head Attention
Single Head attention in Transformer
There are three inputs: query, key, and value. Each of these matrices is
obtained by multiplying the matrix we computed earlier, the sum of the word
embedding and positional embedding matrices, by a different set of weights.

For computing the query matrix, the weights matrix must have as many rows as
the input matrix has columns (6 in our case), while the number of columns can
be anything; we suppose 4 columns in our weights matrix. The values in the
weights matrix are initialized randomly between 0 and 1, and they will later
be updated when our transformer starts learning the meaning of these words.
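As a rough NumPy sketch of this step (the weight values here are random stand-ins, not the numbers used in the figures below):

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_head = 6, 6, 4

X = rng.random((seq_len, d_model))   # stand-in for (word embedding + positional embedding)

# Three independent, randomly initialised weight matrices of shape (6, 4).
W_q, W_k, W_v = (rng.random((d_model, d_head)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v  # each has shape (6, 4)
print(Q.shape, K.shape, V.shape)
```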
[Figure: the 6 x 6 (word embedding + positional embedding) matrix multiplied by a random 6 x 4 weights matrix for the query]

calculating Query matrix
Similarly, we can compute the key and value matrices using the same
procedure, but the values in the weights matrix must be different for both.
[Figure: the same 6 x 6 input matrix multiplied by two further random 6 x 4 weights matrices to obtain the key and value matrices]

Calculating Key and Value Matrices
So, after multiplying matrices, the resultant query, key, and values are
obtained:
[Figure: the resulting Query, Key, and Value matrices, each of shape 6 x 4]

Query, Key, Value matrices
Now that we have all three matrices, let’s start calculating single-head
attention step by step.
[Figure: the Query matrix (6 x 4) multiplied by the transpose of the Key matrix (4 x 6) gives a 6 x 6 score matrix]

matrix multiplication between Query and Key
For scaling the resultant matrix, we have to reuse the dimension of our
embedding vector, which is 6.
[Figure: each entry of the 6 x 6 score matrix is divided by √d, where d (dimension) is 6]

scaling the resultant matrix with dimension 6
The next step of masking is optional, and we won't be calculating it. Masking
is like telling the model to focus only on what’s happened before a certain
point and not peek into the future while figuring out the importance of
different words in a sentence. It helps the model understand things in a step-
by-step manner, without cheating by looking ahead.
So now we will be applying the softmax operation on our scaled resultant
matrix.
SoftMax: s(x_i) = e^(x_i) / Σ_j e^(x_j)

[Figure: applying softmax row-wise to the scaled 6 x 6 matrix, so that each row sums to 1]

Applying softmax on resultant matrix
Doing the final multiplication step to obtain the resultant matrix from
single-head attention.
[Figure: the 6 x 6 softmax matrix multiplied by the 6 x 4 Value matrix gives the final 6 x 4 output of single-head attention]

calculating the final matrix of single head attention
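Here is a compact sketch of the whole single-head attention chain, continuing from the Q, K, V sketch above. Note that the walkthrough scales by the embedding dimension d = 6, while the original paper scales by the key dimension d_k (4 in this example):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q, K, V, scale, mask=None):
    scores = Q @ K.T / np.sqrt(scale)          # (seq_len, seq_len) score matrix
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # optional masking step
    return softmax(scores, axis=-1) @ V        # (seq_len, d_head)

# Q, K, V: the (6, 4) matrices from the earlier sketch; scale = 6 follows the blog.
attention_output = single_head_attention(Q, K, V, scale=6)
print(attention_output.shape)   # (6, 4)
```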
We have now calculated single-head attention. As I stated earlier, multi-head
attention comprises many single-head attentions. Below is a visual of how it
looks:
Multi Head Attention
Multi Head attention in Transformer
Each single-head attention has three inputs: query, key, and value, and each
head has its own set of weights. Once all single-head attentions output
their resultant matrices, they are all concatenated, and the final
concatenated matrix is once again transformed linearly by multiplying it
with a weights matrix initialized with random values, which will later
get updated when the transformer starts training.
In our case we are using only a single head, but this is how it looks if we
work with multi-head attention.
[Figure: our single-head attention output (6 x 4) compared with the real-world case of N heads, whose outputs are concatenated]

Single Head attention vs Multi Head attention
In either case, whether it is single-head or multi-head attention, the
resultant matrix needs to be transformed linearly once again by multiplying
it with a weights matrix.
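Putting the pieces together, a hedged sketch of multi-head attention, reusing the single_head_attention helper and the input X from the earlier snippets; the number of heads, head size, and weight values are arbitrary stand-ins:

```python
import numpy as np

def multi_head_attention(X, num_heads, d_head, rng):
    heads = []
    for _ in range(num_heads):
        # Each head gets its own randomly initialised Wq, Wk, Wv.
        W_q, W_k, W_v = (rng.random((X.shape[1], d_head)) for _ in range(3))
        heads.append(single_head_attention(X @ W_q, X @ W_k, X @ W_v, scale=X.shape[1]))
    concat = np.concatenate(heads, axis=-1)           # (seq_len, num_heads * d_head)
    W_o = rng.random((concat.shape[1], X.shape[1]))   # final linear projection back to d_model
    return concat @ W_o                               # (seq_len, d_model)

out = multi_head_attention(X, num_heads=2, d_head=4, rng=np.random.default_rng(2))
print(out.shape)   # (6, 6)
```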
[Figure: the 6 x 4 attention output multiplied by a random 4 x 6 weights matrix, whose column count matches the (word embedding + positional embedding) matrix]

linear transformation of the single head attention matrix
Make sure that the number of columns of this linear weights matrix equals the
number of columns of the matrix we computed earlier (word embedding +
positional embedding), because in the next step we will be adding the
resultant matrix to that (word embedding + positional embedding) matrix.
Output of Multi Head attention (6 x 6)

10.84   9.45   7.33   7.8    6.09   7.66
10.65   9.28   7.22   7.67   5.99   7.51
10.83   9.43   7.33   7.79   6.08   7.65
10.08   8.77   6.85   7.25   5.67   7.09
10.48   9.12   7.41   7.54   5.89   7.38
10.77   9.38   7.29   7.75   6.05   7.6

Output matrix of multi head attention
Now that we have computed the resultant matrix for multi-head attention, we
will work on the add and norm step.
Step 8 — Adding and Normalizing
Once we obtain the resultant matrix from multi-head attention, we have to
add it to our original matrix. Let’s do it first.
[Figure: the (word embedding + positional embedding) matrix and the multi-head attention output matrix, both 6 x 6, are added element-wise]
11.63   11.05   8.29   9.44   7.06   8.86
11.87    9.45   7.28   8.46   6.89   8.25
11.75   10.94   7.6    9.1    6.64   9.24
10.34    9.51   7.51   7.47   5.74   7.46
10.6     9.71   7.91   8.16   6.39   8.08
10.41   10.68   8.05   9.23   6.99   8.81

Adding matrices to perform add and norm step

To normalize the above matrix, we need to compute the mean and standard
deviation row-wise for each row.
Row-wise Implementation

Row                                            Mean    Standard Deviation
11.63  11.05  8.29  9.44  7.06  8.86   ———>    9.26    1.57
11.87   9.45  7.28  8.46  6.89  8.25   ———>    8.56    1.64
11.75  10.94  7.6   9.1   6.64  9.24   ———>    9.04    1.76
10.34   9.51  7.51  7.47  5.74  7.46   ———>    7.86    1.51
10.6    9.71  7.91  8.16  6.39  8.08   ———>    8.37    1.35
10.41  10.68  8.05  9.23  6.99  8.81   ———>    8.93    1.28

calculating mean and std
We subtract the corresponding row mean from each value of the matrix and
divide by the corresponding standard deviation.
(value − mean) / (std + error), e.g. (11.63 − 9.26) / (1.57 + 0.0001) ≈ 1.51

1.51   1.14  -0.62   0.11  -1.4   -0.25
2.02   0.54  -0.78  -0.06  -1.02  -0.19
1.54   1.08  -0.82   0.03  -1.36   0.11
1.64   1.09  -0.23  -0.26  -1.4   -0.26
1.65   0.99  -0.34  -0.16  -1.47  -0.21
1.16   1.37  -0.69   0.23  -1.52  -0.09

normalizing the resultant matrix
hntps:evelupgiteornectee,comunderstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 22183cs, 81084 Salvng Tantomery Hand A Stepy-Stp Mah Exaile by Fareed Kran | Level Up Cong
Adding a small value of error prevents the denominator from being zero and
avoids making the entire term infinity.
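In code, the add & norm step might look like the following sketch. Note that a full LayerNorm also has learnable scale and shift parameters, which this walkthrough (and this sketch) omits:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-4):
    """Residual connection followed by row-wise (per-token) normalisation."""
    summed = x + sublayer_out
    mean = summed.mean(axis=-1, keepdims=True)
    std = summed.std(axis=-1, keepdims=True)
    return (summed - mean) / (std + eps)   # eps keeps the denominator away from zero
```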
Step 9 — Feed Forward Network
After normalizing the matrix, it will be processed through a feed-forward
network. We will use a very basic network that contains only one linear
layer and one ReLU activation layer. This is how it looks visually:
[Figure: our case uses a single linear layer (X·W + b) followed by ReLU(x) = max(0, x); a real-world feed-forward network stacks multiple layers]

Feed Forward network comparison
First, we calculate the linear layer by multiplying our last calculated
matrix with a randomly initialized weights matrix (which will be updated when
the transformer starts learning) and adding a bias matrix that also contains
random values.
[Figure: the 6 x 6 matrix from the add and norm step is multiplied by a random 6 x 6 weights matrix W, and a random bias is added]

Result of the linear layer (6 x 6):

0.91   1.25   1.09    0.56   0.57   1.15
0.66   1.44   1.36    0.54   0.81   1.42
0.95   1.36  -0.57    0.81   0.68   1.04
0.95   1.15   1.23    0.57   0.51   0.97
0.98   1.29  -0.62    0.53   0.55   1.09
1.04   1.2    0.86    0.68   0.49   0.97

Calculating Linear Layer
After calculating the linear layer, we need to pass it through the ReLU layer
and use its formula.
ReLU(x) = max(0, x)

Applying ReLU element-wise, e.g. max(0, 0.91) = 0.91 and max(0, −0.57) = 0:

0.91   1.25   1.09   0.56   0.57   1.15
0.66   1.44   1.36   0.54   0.81   1.42
0.95   1.36   0      0.81   0.68   1.04
0.95   1.15   1.23   0.57   0.51   0.97
0.98   1.29   0      0.53   0.55   1.09
1.04   1.2    0.86   0.68   0.49   0.97

Calculating ReLU Layer
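A minimal sketch of this simplified feed-forward network. The original paper's FFN uses two linear layers with a larger hidden dimension; here we mirror the single linear layer plus ReLU used above, with random stand-in values:

```python
import numpy as np

def feed_forward(x, W, b):
    """The blog's simplified network: one linear layer followed by ReLU."""
    return np.maximum(0, x @ W + b)   # ReLU(x·W + b)

rng = np.random.default_rng(3)
d_model = 6
W = rng.random((d_model, d_model))    # random weights, updated during training
b = rng.random(d_model)               # random bias
ffn_out = feed_forward(rng.random((6, d_model)), W, b)   # stand-in input
print(ffn_out.shape)   # (6, 6)
```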
Step 10 — Adding and Normalizing Again
Once we obtain the resultant matrix from the feed-forward network, we have to
add it to the matrix obtained from the previous add and norm step, and then
normalize it using the row-wise mean and standard deviation.
[Figure: the feed-forward output matrix is added to the matrix from the previous add and norm step, then each row is normalized using its own mean and standard deviation]

Add and Norm after Feed Forward Network
The output matrix of this add and norm step will serve as the query and key
matrices in one of the multi-head attention mechanisms of the decoder part,
which you can easily see by tracing the arrow from this add and norm block to
the decoder section.
Step 11 — Decoder Part
The good news is that up to this point we have calculated the entire encoder
part, and all the steps we performed, from encoding our dataset to passing
our matrix through the feed-forward network, were new calculations. From now
on, the remaining architecture of the transformer (the decoder part) involves
similar kinds of matrix multiplications.
Take a look at our transformer architecture. What we have covered so far
and what we have to cover yet:
[Figure: the transformer architecture, with what we have covered so far (the encoder stack, left) and what we still have to cover (the decoder stack, right)]

Upcoming steps illustration
We won't be calculating the entire decoder because most of its portion
contains similar calculations to what we have already done in the encoder.
Calculating the decoder in detail would only make the blog lengthy due to
repetitive steps. Instead, we only need to focus on the calculations of the
input and output of the decoder.
When training, there are two inputs to the decoder. One is from the encoder,
where the output matrix of the last add and norm layer serves as the query
and key for the second multi-head attention layer in the decoder part. Below
is a visualization of it (from Batool Haider):
[Figure: the encoder output feeding the decoder's second multi-head attention block]

Visualization is from Batool Haider
While the value matrix comes from the decoder after the first add and norm
step.
The second input to the decoder is the predicted text. If you remember, our
input to the encoder is "when you play game of thrones", so the input to the
decoder is the predicted text, which in our case is "you win or you die".

But the predicted input text needs to follow a standard wrapping of tokens
that makes the transformer aware of where to start and where to end.

Encoder Input ———> When you play game of thrones
Decoder Input ———> <start> you win or you die <end>

input comparison of encoder and decoder

Here <start> and <end> are two new tokens being introduced. Moreover,
the decoder takes one token as input at a time. It means that <start> will
serve as an input, and "you" must be the predicted text for it.
[Figure: the <start> token is given a random 6-dimensional embedding vector (0.31, 0.21, 0.12, 0.64, 0.98, 0.2) plus its positional encoding]

Decoder input word
As we already know, these embeddings are filled with random values, which
will later be updated during the training process. The rest of the blocks are
computed in the same way as we computed them earlier in the encoder part.
[Figure: the decoder stack; the calculation inside each block is the same as in the encoder]

Calculating Decoder
Before diving into any further details, we need to understand what masked
multi-head attention is, using a simple mathematical example.
Step 12 — Understanding Mask Multi Head Attention
In a Transformer, masked multi-head attention is like a spotlight that the
model uses to focus on different parts of a sentence. It is special because
it does not let the model cheat by looking at words that come later in the
sentence. This helps the model understand and generate sentences step by
step, which is important in tasks like talking or translating words into
another language.
Suppose we have the following input matrix, where each row represents a
position in the sequence and each column represents a feature:

Input Matrix =
1  2  3
4  5  6
7  8  9

input matrix for masked multi head attention
Now, let’s understand the masked multi-head attention components having
two heads:
1. Linear Projections (Query, Key, Value): Assume the linear projections for each head: Head 1: Wq1, Wk1, Wv1 and Head 2: Wq2, Wk2, Wv2.
2. Calculate Attention Scores: For each head, calculate attention scores using the dot product of Query and Key, and apply the mask to prevent attending to future positions.
3. Apply Softmax: Apply the softmax function to obtain attention weights.
4. Weighted Summation (Value): Multiply the attention weights by the Value to get the weighted sum for each head.
5. Concatenate and Linear Transformation: Concatenate the outputs from both heads and apply a linear transformation.
Let’s do a simplified calculation:
Assume Wq1 = Wk1 = Wv1 = Wq2 = Wk2 = Wv2 = I, the identity matrix.

Head 1: Q1 = K1 = V1 = Input Matrix
Head 2: Q2 = K2 = V2 = Input Matrix

For each head we compute A = Q · K^T and mask the upper-triangular entries, so
that every position can only attend to itself and to earlier positions.
Applying softmax to the masked scores gives the attention weights, and
O1 = softmax(A1) · V1 (and likewise O2 for the second head).

Finally, Concatenate([O1, O2]) and apply a learnable linear transformation.

Mask Multi Head Attention (Two Heads)
The concatenation step combines the outputs from the two attention heads
into a single set of information. Imagine you have two friends who each give
you advice on a problem. Concatenating their advice means putting both
pieces of advice together so that you have a more complete view of what they
suggest. In the context of the transformer model, this step helps capture
different aspects of the input data from multiple perspectives, contributing
to a richer representation that the model can use for further processing.
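Here is a small NumPy sketch of the masking idea on the toy 3 x 3 input, assuming, as above, that the projection matrices are the identity:

```python
import numpy as np

X = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])   # the toy input matrix from above

# With Wq = Wk = Wv = I, we simply have Q = K = V = X.
scores = X @ X.T

# Causal mask: position i may only attend to positions <= i.
causal = np.tril(np.ones_like(scores, dtype=bool))
scores = np.where(causal, scores, -np.inf)

# Row-wise softmax; the -inf entries become exactly 0.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

output = weights @ X   # each position is a weighted sum of itself and earlier positions
print(output.round(3))
```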
Step 13 — Calculating the Predicted Word
The output matrix of the last add and norm block of the decoder must
contain the same number of rows as the input matrix, while the number of
columns can be any. Here, we work with 6.
Rows must be 6 while columns can be any length:

1.2    1.33   2.11   2.62   3.06   3.54
0.56   1.45   2.24   2.78   3.78   4.55
1.34   2.04   2.26   2.66   3.56   3.86
0.89   1.79   2.46   2.49   3.32   3.64
0.91   1.04   1.63   1.92   1.95   2.1
0.12   0.99   1.2    1.56   1.57   1.7

Add and Norm output of decoder
The last add and norm block resultant matrix of the decoder must be
flattened in order to match it with a linear layer to find the predicted
probability of each unique word in our dataset (corpus).
[Figure: the 6 x 6 matrix is flattened into a single 1 x 36 row vector]

flattened the last add and norm block matrix

This flattened layer will be passed through a linear layer to compute the
logits (scores) of each unique word in our dataset.
Linear Layer = X · W (no bias)

[Figure: the 1 x n flattened row vector (n = 36 in our case) is multiplied by an n x m weights matrix of random values, where m = 23 is the vocab size, producing one logit per unique word]

Calculating Logits
Once we obtain the logits, we can use the softmax function to normalize
them and find the word that contains the highest probability.
Applying softmax: s(x_i) = e^(x_i) / Σ_j e^(x_j)

                 I      drink   things   ...   you    ...   He
                 1        2       3             17           23
Probabilities = 0.21    0.05    0.001    ...   0.56   ...   0.12
                                                 ↑
                                        highest probability

Finding the Predicted word
So based on our calculations, the predicted word from the decoder is you.
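A sketch of this final step, with random stand-in values rather than the numbers from the figures:

```python
import numpy as np

rng = np.random.default_rng(4)
vocab_size = 23

decoder_out = rng.random((6, 6))               # stand-in for the decoder's last add & norm output
flat = decoder_out.reshape(1, -1)              # flatten to a 1 x 36 row vector

W = rng.random((flat.shape[1], vocab_size))    # linear layer without a bias term
logits = flat @ W                              # one score per word in the vocabulary

probs = np.exp(logits - logits.max())
probs /= probs.sum()                           # softmax over the 23 words
predicted_id = int(probs.argmax())             # index of the most probable word
print(predicted_id, probs.max())
```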
[Figure: "you" (ID 17) is predicted with probability 0.56; it will now act as an input to our decoder, and so on]

Final output of decoder
This predicted word, "you", will be treated as the next input word for the
decoder, and this process continues until the <end> token is predicted.
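Conceptually, the generation loop looks like this sketch; run_decoder is a hypothetical stand-in for the full decoder pass described above, not a function defined anywhere in this post:

```python
# Conceptual sketch only: run_decoder() stands in for the full decoder stack,
# returning the most probable next word given the encoder output and the
# tokens generated so far.
def generate(encoder_output, run_decoder, max_len=20):
    tokens = ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        tokens.append(run_decoder(encoder_output, tokens))
    return tokens[1:]   # the generated words (plus "<end>" if it was reached)
```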
Important Points
1. The above example is very simple, as it does not involve epochs or any
other important parameters that can only be visualized using a
programming language like Python.
2. It has shown the process only until training, while evaluation or testing
cannot be visually seen using this matrix approach.
3. Masked multi-head attention is used to prevent the transformer from
looking at future tokens, so the model cannot peek at the answer while learning to predict the next word.
Conclusion
hntps:evelupgitcornectee,comlunerstanding-transtormers-rom-star-to-end-a-step-by-step-math-example-t6ddesdebebt 38143cs, 81084 Salvng Tantomery Hand A Stepy-Stp Mah Exaile by Fareed Kran | Level Up Cong
In this blog, I have shown you a very basic way of how transformers
mathematically work using matrix approaches. We have applied positional
encoding, softmax, feedforward network, and most importantly, multi-head
attention.
In the future, I will be posting more blogs on transformers and LLM as my
core focus is on NLP. More importantly, if you want to build your own
million-parameter LLM from scratch using Python, I have written a blog on
it which has received a lot of appreciation on Medium. You can read it here:
Building a Million-Parameter LLM from Scratch Using Python
A Step-by-Step Guide to Replicating LLaMA Architecture
levelup.gitconnected.com
Have a great time reading!