
To appear as a part of Prof. Ali Ghodsi’s material on deep learning.

Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey

Benyamin Ghojogh BGHOJOGH@UWATERLOO.CA
Department of Electrical and Computer Engineering,
Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Ali Ghodsi ALI.GHODSI@UWATERLOO.CA
Department of Statistics and Actuarial Science & David R. Cheriton School of Computer Science,
Data Analytics Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract
This is a tutorial and survey paper on the attention mechanism, transformers, BERT, and GPT. We first explain the attention mechanism, the sequence-to-sequence model without and with attention, self-attention, and attention in different areas such as natural language processing and computer vision. Then, we explain transformers, which do not use any recurrence. We explain all the parts of the encoder and decoder in the transformer, including positional encoding, multihead self-attention and cross-attention, and masked multihead attention. Thereafter, we introduce the Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) as stacks of encoders and decoders of the transformer, respectively. We explain their characteristics and how they work.

Figure 1. Attention in the visual system for (a) seeing a picture by attending to the more important parts of the scene and (b) reading a sentence by attending to the more informative words in the sentence.

1. Introduction
When looking at a scene or picture, our visual system, like a machine learning model (Li et al., 2019b), focuses on or attends to some specific parts of the scene/image with more information and importance and ignores the less informative or less important parts. For example, when we look at the Mona Lisa portrait, our visual system attends to Mona Lisa's face and smile, as Fig. 1 illustrates. Moreover, when reading a text, especially when we want to read fast, one technique is skimming (Xu, 2011), in which our visual system or a model skims the data at a high pace and attends only to the more informative words of sentences (Yu et al., 2018). Figure 1 shows a sample sentence and highlights the words on which our visual system focuses more in skimming.

The concept of attention can be modeled in machine learning, where attention is a simple weighting of data. In the attention mechanism, explained in this tutorial paper, the more informative or more important parts of data are given larger weights for the sake of more attention. Many of the state-of-the-art Natural Language Processing (NLP) (Indurkhya & Damerau, 2010) and deep learning techniques in NLP (Socher et al., 2012) use the attention mechanism.

Transformers are also autoencoders which encode the input data to a hidden space and then decode it to another domain. Transfer learning is widely used in NLP (Wolf et al., 2019b), and transformers can also be used for transfer learning. Recently, transformers were proposed that are composed merely of attention modules, excluding recurrence and any recurrent modules (Vaswani et al., 2017). This was a great breakthrough. Prior to the proposal of transformers with only the attention mechanism, recurrent models such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Recurrent Neural Network (RNN) (Kombrink et al., 2011) were mostly used for NLP. Also, other NLP models such as word2vec (Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington et al., 2014) were state-of-the-art before the appearance of transformers. Recently, two NLP methods,
named Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), have been proposed, which are stacks of encoders and decoders of the transformer, respectively.

In this paper, we introduce and review the attention mechanism, transformers, BERT, and GPT. In Section 2, we introduce the sequence-to-sequence model with and without the attention mechanism as well as self-attention. Section 3 explains the different parts of the encoder and decoder of transformers. BERT and its variants are introduced in Section 4, while GPT and its variants are introduced in Section 5. Finally, Section 6 concludes the paper.

2. Attention Mechanism
2.1. Autoencoder and the Context Vector
Consider an autoencoder with encoder and decoder parts, where the encoder gets an input and converts it to a context vector and the decoder gets the context vector and converts it to an output. The output is related to the input through the context vector in the so-called hidden space. Figure 3-a illustrates an autoencoder with the encoder (left part of the autoencoder), hidden space (in the middle), and decoder (right part of the autoencoder) parts. For example, the input can be a sentence or a word in English and the output is the same sentence or word but in French. Assume the word "elephant" in English is fed to the encoder and the word "l'éléphant" in French is output. The context vector models the concept of an elephant, which also exists in the human mind when thinking of an elephant. This context is abstract in the mind and can refer to any fat, thin, huge, or small elephant (Perlovsky, 2006). Another example of transformation is transforming a cartoon image of an elephant to a picture of a real elephant (see Fig. 2). As the autoencoder transforms data from a domain to a hidden space and then to another domain, it can be used for domain transformation (Wang et al., 2020), domain adaptation (Ben-David et al., 2010), and domain generalization (Dou et al., 2019). Here, every context is modeled as a vector in the hidden space. Let the context vector be denoted by $c \in \mathbb{R}^p$ in the $p$-dimensional hidden space.

Figure 2. Using an autoencoder for transformation of one domain to another domain. The images used are taken from the PACS dataset (Li et al., 2017).

2.2. The Sequence-to-Sequence Model
Consider a sequence of ordered tokens, e.g., a sequence of words which make a sentence. We want to transform this sequence to another related sequence. For example, we want to take a sentence in English and translate it to the same sentence in French. The model which transforms a sequence to another related sequence is named the sequence-to-sequence model (Bahdanau et al., 2015).

Let the number of words in a document or considered sentence be $n$. Let the ordered input tokens or words of the sequence be denoted by $\{x_1, x_2, \dots, x_{i-1}, x_i, \dots, x_n\}$ and the output sequence be denoted by $\{y_1, y_2, \dots, y_{i-1}, y_i, \dots, y_n\}$. As Fig. 3-a illustrates, there exist latent vectors, denoted by $\{l_i\}_{i=1}^n$, in the decoder part for every word. In the sequence-to-sequence model, the probability of generating the $i$-th word conditioned on all the previous words is determined by a function $g(\cdot)$ whose inputs are the immediately previous word $y_{i-1}$, the $i$-th latent vector $l_i$, and the context vector $c$:

$$ P(y_i \,|\, y_1, \dots, y_{i-1}) = g(y_{i-1}, l_i, c). \tag{1} $$

Figure 3-a depicts the sequence-to-sequence model. In the sequence-to-sequence model, every word $x_i$ produces a hidden vector $h_i$ in the encoder part of the autoencoder. The hidden vector of every word, $h_i$, is fed to the next hidden vector, $h_{i+1}$, by a projection matrix $W$. In this model, for the whole sequence, there is only one context vector $c$, which is equal to the last hidden vector of the encoder, i.e., $c = h_n$. Note that the encoder and decoder in the sequence-to-sequence model can be any sequential model such as RNN (Kombrink et al., 2011) or LSTM (Hochreiter & Schmidhuber, 1997).

2.3. The Sequence-to-Sequence Model with Attention
The explained sequence-to-sequence model can be equipped with attention (Chorowski et al., 2014; Luong et al., 2015). In the sequence-to-sequence model with attention, the probability of generating the $i$-th word is determined as (Chorowski et al., 2014):

$$ P(y_i \,|\, y_1, \dots, y_{i-1}) = g(y_{i-1}, l_i, c_i). \tag{2} $$

Figure 3-b shows the sequence-to-sequence model with attention. In contrast to the plain sequence-to-sequence model, which has only one context vector for the whole sequence, this model has a context vector for every word. The context vector of every word is a linear combination, or weighted sum, of all the hidden vectors; hence,

Figure 3. Sequence-to-sequence model (a) without and (b) with attention.

the $i$-th context vector is:

$$ c_i = \sum_{j=1}^{n} a_{ij} h_j, \tag{3} $$

where $a_{ij} \geq 0$ is the weight of $h_j$ for the $i$-th context vector. This weighted sum for a specific word in the sequence determines which words in the sequence have more effect on that word. In other words, it determines which words this specific word "attends" to more. This notion of weighted impact, to see which parts have more impact, is called "attention". It is noteworthy that the original idea of an arithmetic linear combination of vectors for the purpose of word embedding, similar to Eq. (3), was in the word2vec method (Mikolov et al., 2013a;b).

The sequence-to-sequence model with attention considers a notion of similarity between the latent vector $l_{i-1}$ of the decoder and the hidden vector $h_j$ of the encoder (Bahdanau et al., 2015):

$$ \mathbb{R} \ni s_{ij} := \text{similarity}(l_{i-1}, h_j). \tag{4} $$

The intuition for this similarity score is as follows. The output word $y_i$ depends on the previous latent vector $l_{i-1}$ (see Fig. 3) and the hidden vector $h_j$ depends on the input word $x_j$. Hence, this similarity score relates to the impact of the input $x_j$ on the output $y_i$. In this way, the score $s_{ij}$ shows the impact of the $j$-th input word on generating the $i$-th output word in the sequence. This similarity notion can be a neural network learned by backpropagation (Rumelhart et al., 1986).

In order to make this score a probability, these scores should sum to one; hence, we take its softmax form (Chorowski et al., 2014):

$$ \mathbb{R} \ni a_{ij} := \frac{e^{s_{ij}}}{\sum_{k=1}^{n} e^{s_{ik}}}. \tag{5} $$

In this way, the score vector $[a_{i1}, a_{i2}, \dots, a_{in}]^\top$ behaves as a discrete probability distribution. Therefore, in Eq. (3), the weights sum to one and the weights with higher values attend more to their corresponding hidden vectors.

2.4. Self-Attention
2.4.1. The Need for Composite Embedding
Many of the previous methods for NLP, such as word2vec (Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington et al., 2014), learn a representation for every word. However, for understanding how the words relate to each other, we can have a composite embedding where the compositions of words also have some embedding representation (Cheng et al., 2016). For example, Fig. 4 shows a sentence which highlights the relation of words. This figure shows, when reading a word in a sentence, which previous words in the sentence we remember more. This relation of words shows that we need a composite embedding for natural language embedding.

2.4.2. Query-Retrieval Modeling
Consider a database with keys and values where a query is searched through the keys to retrieve a value (Garcia-Molina et al., 1999). Figure 5 shows such a database. We

can generalize this hard definition of query-retrieval to a soft query-retrieval where several keys, rather than only one key, can correspond to the query. For this, we calculate the similarity of the query with all the keys to see which keys are more similar to the query. This soft query-retrieval is formulated as:

$$ \text{attention}(q, \{k_i\}_{i=1}^n, \{v_i\}_{i=1}^n) := \sum_{i=1}^{n} a_i v_i, \tag{6} $$

where:

$$ \mathbb{R} \ni s_i := \text{similarity}(q, k_i), \tag{7} $$

$$ \mathbb{R} \ni a_i := \text{softmax}(s_i) = \frac{e^{s_i}}{\sum_{k=1}^{n} e^{s_k}}, \tag{8} $$

and $q$, $\{k_i\}_{i=1}^n$, and $\{v_i\}_{i=1}^n$ denote the query, keys, and values, respectively. Recall that the context vector of a sequence-to-sequence model with attention, introduced by Eq. (3), was also a linear combination with weights of normalized similarity (see Eq. (5)). Eq. (6) is the same kind of linear combination, where the weights are the normalized similarities of the query with the keys. An illustration of Eqs. (6), (7), and (8) is shown in Fig. 6. Note that the similarity $s_i$ can be any notion of similarity. Some of the well-known similarity measures are (Vaswani et al., 2017):

$$ \text{inner product:} \quad s_i = q^\top k_i, \tag{9} $$

$$ \text{scaled inner product:} \quad s_i = \frac{q^\top k_i}{\sqrt{p}}, \tag{10} $$

$$ \text{general inner product:} \quad s_i = q^\top W k_i, \tag{11} $$

$$ \text{additive similarity:} \quad s_i = w_q^\top q + w_k^\top k_i, \tag{12} $$

where $W \in \mathbb{R}^{p \times p}$, $w_q \in \mathbb{R}^p$, and $w_k \in \mathbb{R}^p$ are some learnable matrices and vectors. Among these similarity measures, the scaled inner product is used most.

Figure 4. The relation of words in a sentence, raising the need for composite embedding. The credit of this image is for (Cheng et al., 2016).

Figure 5. Query-retrieval in a database.

Figure 6. Illustration of Eqs. (6), (7), and (8) in the attention mechanism. In this example, it is assumed there exist five words (four keys and one query) in the sequence.

Eq. (6) calculates the attention of a target word (or query) with respect to every input word (or key), which are the previous and forthcoming words. As Fig. 4 illustrates, when processing a word which is considered as the query, the other words in the sequence are the keys. Using Eq. (6), we see how similar the other words of the sequence are to that word. In other words, we see how impactful the other previous and forthcoming words are for generating a missing word in the sequence.

We provide an example for Eq. (6) here. Consider the sentence "I am a student". Assume we are processing the word "student" in this sequence. Hence, we have a query corresponding to the word "student". The keys and values correspond to the previous words, which are "I", "am", and "a". Assume we calculate the normalized similarity of the query and the keys and obtain the weights 0.7, 0.2, and 0.1 for "I", "am", and "a", respectively, where the weights sum to one. Then, the attention value for the word "student" is $0.7\, v_{\text{I}} + 0.2\, v_{\text{am}} + 0.1\, v_{\text{a}}$.

2.4.3. Attention Formulation
Let the words of a sequence be in a $d$-dimensional space, i.e., the sequence is $\{x_i \in \mathbb{R}^d\}_{i=1}^n$. This $d$-dimensional representation of words can be taken from any word-embedding NLP method such as word2vec (Mikolov et al., 2013a;b; Goldberg & Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington et al., 2014). The query, key, and value are projections of the words into $p$-dimensional, $p$-dimensional, and $r$-dimensional subspaces, respectively:
$$ \mathbb{R}^p \ni q_i = W_Q^\top x_i, \tag{13} $$

$$ \mathbb{R}^p \ni k_i = W_K^\top x_i, \tag{14} $$

$$ \mathbb{R}^r \ni v_i = W_V^\top x_i, \tag{15} $$

where $W_Q \in \mathbb{R}^{d \times p}$, $W_K \in \mathbb{R}^{d \times p}$, and $W_V \in \mathbb{R}^{d \times r}$ are the projection matrices into the low-dimensional query, key, and value subspaces, respectively. Consider the words, queries, keys, and values in matrix forms as $X := [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$, $Q := [q_1, \dots, q_n] \in \mathbb{R}^{p \times n}$, $K := [k_1, \dots, k_n] \in \mathbb{R}^{p \times n}$, and $V := [v_1, \dots, v_n] \in \mathbb{R}^{r \times n}$, respectively. It is noteworthy that in the similarity measures, such as the scaled inner product, we have the inner product $q^\top k_i$. Hence, the similarity measures contain $q^\top k_i = x_i^\top W_Q W_K^\top x_i$. Note that if we had $W_Q = W_K$, this would be a kernel matrix, so its behaviour is similar to that of a kernel for measuring similarity.

Considering Eqs. (7) and (8) and the above definitions, Eq. (6) can be written in matrix form, for the whole sequence of $n$ words, as:

$$ \mathbb{R}^{r \times n} \ni Z := \text{attention}(Q, K, V) = V\, \text{softmax}\Big(\frac{1}{\sqrt{p}}\, Q^\top K\Big), \tag{16} $$

where $Z = [z_1, \dots, z_n]$ contains the attention values for all the words, which show how much every word attends to its previous and forthcoming words. In Eq. (16), the softmax operator applies the softmax function on every row of its input matrix so that every row sums to one.

Note that as the queries, keys, and values are all from the same words in the sequence, this attention is referred to as "self-attention" (Cheng et al., 2016).

2.5. Attention in Other Fields Such as Vision and Speech
Note that the concept of attention can be used in any field of research and not merely in NLP. The attention concept has widely been used in NLP (Chorowski et al., 2014; Luong et al., 2015). Attention can also be used in the field of computer vision (Xu et al., 2015). Attention in computer vision means attending to specific parts of the image which are more important and informative (see Fig. 1). This simulates attention and expectation in the human visual system (Summerfield & Egner, 2009), where our brain filters the observed scene to focus on its important parts.

For example, we can generate captions for an input image using an autoencoder, illustrated in Fig. 8-a, with the attention mechanism. As this figure shows, the encoder is a Convolutional Neural Network (CNN) (LeCun et al., 1998) for extracting visual features and the decoder consists of LSTM (Hochreiter & Schmidhuber, 1997) or RNN (Kombrink et al., 2011) modules for generating the caption text. The literature has shown that the lower convolutional layers in a CNN capture low-level features and different partitions of input images (Lee et al., 2009). Figure 7 shows an example of features extracted by CNN layers trained on facial images. Different facial organs and features have been extracted in the lower layers of the CNN. Therefore, as Fig. 8-b shows, we can consider the extracted low-layer features, which are different parts of the image, as the hidden vectors $\{h_j\}_{j=1}^{n'}$ in Eq. (4), where $n'$ is the number of features for extracted image partitions. Similarity with the latent vectors of the decoder (LSTM) is computed by Eq. (4), and the query-retrieval model of the attention mechanism, introduced before, is used to learn a self-attention on the images. Note that, as the partitions of the image are considered to be the hidden variables used for attention, the model attends to important parts of the input image; e.g., see Fig. 1.

Figure 7. The low-level and high-level features learned in the low and high convolutional layers, respectively. The credit of this image is for (Lee et al., 2009).

Note that using attention in different fields of science is usually referred to as "attend, tell, and do something...". Some examples of applications of attention are caption generation for images (Xu et al., 2015), caption generation for images with ownership protection (Lim et al., 2020), text reading from images containing a text (Li et al., 2019a), translation of one image to another related image (Zhang et al., 2018; Yang et al., 2019a), visual question answering (Kazemi & Elqursh, 2017), human-robot social interaction (Qureshi et al., 2017), and speech recognition

Figure 8. Using attention in computer vision: (a) transformation of an image to a caption with a CNN for the encoder and an RNN or LSTM for the decoder. The caption for this image is "Elephant is in water". (b) The convolutional filters take the values from the image for the query-retrieval modeling in the attention mechanism.

(Chan et al., 2015; 2016).

3. Transformers
3.1. The Concept of Transformation
As was explained in Section 2.1, we can have an autoencoder which takes an input, embeds data to a context vector which simulates a concept in the human mind (Perlovsky, 2006), and generates an output. The input and output are related to each other through the context vector. In other words, the autoencoder transforms the input to a related output. An example of transformation is translating a sentence from one language to the same sentence in another language. Another example is image captioning, in which the image is transformed to its caption explaining the content of the image. A pure computer vision example of transformation is converting a day-time input image to the same image at night time.

An autoencoder, named "transformer", is proposed in the literature for the task of transformation (Vaswani et al., 2017). The structure of the transformer is depicted in Fig. 9. As this figure shows, a transformer is an autoencoder consisting of an encoder and a decoder. In the following, we explain the details of the encoder and decoder of a transformer.

3.2. Encoder of Transformer
The encoder part of the transformer, illustrated in Fig. 9, embeds the input sequence of $n$ words $X \in \mathbb{R}^{d \times n}$ into context vectors with the attention mechanism. Different parts of the encoder are explained in the following.

3.2.1. Positional Encoding
We will explain in Section 3.4 that the transformer introduced here does not have any recurrence or any RNN or LSTM module. As there is no recurrence and no convolution, the model has no sense of order in the sequence. As the order of words is important for the meaning of sentences, we need a way to account for the order of tokens or words in the sequence. For this, we can add a vector accounting for the position to each input word embedding.

Consider the embedding of the $i$-th word in the sequence, denoted by $x_i \in \mathbb{R}^d$. For encoding the position of the $i$-th word in the sequence, the position vector $p_i \in \mathbb{R}^d$ can be set as:

$$ \begin{cases} p_i(2j+1) := \cos\Big(\dfrac{i}{10000^{2j/p}}\Big), \\[2mm] p_i(2j) := \sin\Big(\dfrac{i}{10000^{2j/p}}\Big), \end{cases} \tag{17} $$

for all $j \in \{0, 1, \dots, \lfloor d/2 \rfloor\}$, where $p_i(2j+1)$ and $p_i(2j)$ denote the odd and even elements of $p_i$, respectively. Figure 10 illustrates the dimensions of the position vectors across different positions. As can be seen in this figure, the position vectors for different positions of words are different, as expected. Moreover, this figure shows that the differences of position vectors concentrate more on the initial dimensions of the vectors. As Fig. 9 shows, for incorporating the information of position with the data, we add the positional encoding to the input embedding:

$$ x_i \leftarrow x_i + p_i. \tag{18} $$
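As an illustration of Eqs. (17) and (18), the following is a minimal NumPy sketch rather than a reference implementation. The function name `positional_encoding` is hypothetical, positions are assumed to be indexed from 0, and the embeddings are stored with one word per row (the transpose of the column layout $X \in \mathbb{R}^{d \times n}$ used in the text). The constant `p` in the exponent is kept as a separate argument, following Eq. (17).

```python
import numpy as np

def positional_encoding(n, d, p):
    """Return an (n, d) array whose i-th row is the position vector p_i of Eq. (17).

    n: number of positions, d: embedding dimension,
    p: constant in the exponent 2j/p (kept separate from d, as in Eq. (17)).
    Positions are indexed from 0 here, which is an assumption of this sketch.
    """
    positions = np.arange(n)[:, None]           # i = 0, ..., n-1 as a column
    j = np.arange((d + 1) // 2)[None, :]        # one j per (even, odd) pair
    angles = positions / (10000.0 ** (2 * j / p))
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                # even elements p_i(2j)
    pe[:, 1::2] = np.cos(angles[:, : d // 2])   # odd elements p_i(2j + 1)
    return pe

# Toy usage with the sizes quoted for Fig. 10 (n = 10, d = 60, p = 100).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 60))                   # hypothetical word embeddings
X = X + positional_encoding(n=10, d=60, p=100)  # Eq. (18)
```

Because the encoding depends only on the position and the dimension index, it can be computed once and reused for every input sequence of the same length.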

Figure 9. The encoder and decoder parts of a transformer. The credit of this image is for (Vaswani et al., 2017).

Figure 10. The vectors of positional encoding. In this example, it is assumed that $n = 10$ (number of positions of words), $d = 60$, and $p = 100$.

3.2.2. Multihead Attention with Self-Attention
After positional encoding, data are fed to a multihead attention module with self-attention. The multihead attention is illustrated in Fig. 11. This module applies the attention mechanism $h$ times. The reason for these several repeats of attention is as follows. The first attention determines how much every word attends to other words. The second repeat of attention calculates how much every pair of words attends to other pairs of words. Likewise, the third repeat of attention sees how much every pair of pairs of words attends to other pairs of pairs of words; and so on. Note that this measure of attention or similarity between hierarchical pairs of words reminds us of the maximum mean discrepancy (Gretton et al., 2007; 2012), which measures similarity between different moments of data distributions.

Figure 11. Multihead attention with $h$ heads.

As Fig. 11 shows, the data, which include positional encoding, are passed through linear layers for obtaining the queries, values, and keys. These linear layers model the linear projection introduced in Eqs. (13), (15), and (14). We have $h$ of these linear layers to generate $h$ sets of queries, values, and

keys as:

$$ \mathbb{R}^{p \times n} \ni Q_i = W_{Q,i}^\top X, \quad \forall i \in \{1, \dots, h\}, \tag{19} $$

$$ \mathbb{R}^{r \times n} \ni V_i = W_{V,i}^\top X, \quad \forall i \in \{1, \dots, h\}, \tag{20} $$

$$ \mathbb{R}^{p \times n} \ni K_i = W_{K,i}^\top X, \quad \forall i \in \{1, \dots, h\}. \tag{21} $$

Then, the scaled dot product similarity, defined in Eq. (10), is used to generate the $h$ attention values $\{Z_i\}_{i=1}^h$ defined by Eq. (16). These $h$ attention values are concatenated to make a new long flattened vector. Then, by a linear layer, which is a linear projection, the total attention value, $Z_t$, is obtained:

$$ Z_t := W_O^\top\, \text{concat}(Z_1, Z_2, \dots, Z_h). \tag{22} $$

3.2.3. Layer Normalization
As Fig. 9 shows, the data (containing positional encoding) and the total attention value are added:

$$ Z'_t \leftarrow Z_t + X. \tag{23} $$

This addition is inspired by the concept of residual introduced by ResNet (He et al., 2016). After this addition, a layer normalization is applied where for each hidden unit $h_i$ we have:

$$ h_i \leftarrow \frac{g}{\sigma}\,(h_i - \mu), \tag{24} $$

where $\mu$ and $\sigma$ are the empirical mean and standard deviation over $H$ hidden units:

$$ \mu := \frac{1}{H} \sum_{i=1}^{H} h_i, \tag{25} $$

$$ \sigma := \sqrt{\frac{1}{H} \sum_{i=1}^{H} (h_i - \mu)^2}. \tag{26} $$

This is a standardization which makes the mean zero and the variance one; it is closely related to batch normalization and reduces the covariate shift (Ioffe & Szegedy, 2015).

3.2.4. Feedforward Layer
Henceforth, let $Z'_t$ denote the total attention after both addition and layer normalization. We feed $Z'_t$ to a feedforward network, having nonlinear activation functions, and then, like before, we add the input of the feedforward network to its output:

$$ Z''_t \leftarrow R + Z'_t, \tag{27} $$

where $R$ denotes the output of the feedforward network. Again, layer normalization is applied and we, henceforth, denote the output of the encoder by $Z''_t$. This is the encoding for the whole input sequence or sentence, having the information of attention of words and hierarchical pairs of words on each other.

3.2.5. Stacking
As Fig. 9 shows, the encoder is a stack of $N$ identical layers. This stacking is for having more learnable parameters, to have enough degrees of freedom to learn the whole dictionary of words. Through experiments, a good number of stacks is found to be $N = 6$ (Vaswani et al., 2017).

3.3. Decoder of Transformer
The decoder part of the transformer is shown in Fig. 9. In the following, we explain the different parts of the decoder.

3.3.1. Masked Multihead Attention with Self-Attention
A part of the decoder is the masked multihead attention module, whose input is the output embeddings $\{y_i\}_{i=1}^n$ shifted one word to the right. Positional encoding is also added to the output embeddings for including the information of their positions. For this, we use Eq. (18) where $x_i$ is replaced by $y_i$.

The output embeddings added with the positional encodings are fed to the masked multihead attention module. This module is similar to the multihead attention module but masks away the forthcoming words after a word. Therefore, every output word only attends to its previous output words, every pair of output words attends to its previous pairs of output words, every pair of pairs of output words attends to its previous pairs of pairs of output words, and so on. The reason for using the masked version of multihead attention for the output embeddings is that when we are generating the output text, we do not have the next words yet because they are not generated yet. It is noteworthy that this masking imposes some idea of sparsity, which was also introduced by the dropout technique (Srivastava et al., 2014) but in a stochastic manner.

Recall Eq. (16), which was used for multihead attention (see Section 3.2.2). The masked multihead attention is defined as:

$$ \mathbb{R}^{r \times n} \ni Z_m := \text{maskedAttention}(Q, K, V) = V\, \text{softmax}\Big(\frac{1}{\sqrt{p}}\, Q^\top K + M\Big), \tag{28} $$

where the mask matrix $M \in \mathbb{R}^{n \times n}$ is:

$$ M(i, j) := \begin{cases} 0 & \text{if } j \leq i, \\ -\infty & \text{if } j > i. \end{cases} \tag{29} $$

As the softmax function has an exponential operator, the mask does not have any impact for $j \leq i$ (because the term is multiplied by $e^{0} = 1$) and masks away for $j > i$ (because it is multiplied by $e^{-\infty} = 0$). Note that $j \leq i$ and $j > i$ correspond to the previous and next words, respectively, in terms of position in the sequence.

Similar to before, the output of masked multihead attention is normalized and then is added to its input.

Table 1. Comparison of complexities between self-attention and recurrence (Vaswani et al., 2017).

                  Complexity per layer    Sequential operations    Maximum path length
Self-Attention    $O(n^2 p)$              $O(1)$                   $O(1)$
Recurrence        $O(n p^2)$              $O(n)$                   $O(n)$
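To get a feel for the per-layer entries of Table 1, the following small sketch compares the leading-order operation counts $n^2 p$ and $n p^2$ for a few sequence lengths. The function name `per_layer_cost` and the chosen values of $n$ and $p$ are illustrative assumptions; constants and lower-order terms are ignored, so the numbers are not runtime estimates.

```python
def per_layer_cost(n, p):
    """Leading-order operation counts from Table 1 (constants dropped)."""
    return {"self_attention": n * n * p, "recurrence": n * p * p}

# Hypothetical sequence lengths n for a fixed embedding dimension p.
for n, p in [(50, 512), (512, 512), (4096, 512)]:
    cost = per_layer_cost(n, p)
    ratio = cost["self_attention"] / cost["recurrence"]  # simplifies to n / p
    print(f"n={n}, p={p}: self-attention / recurrence = {ratio:.3f}")
```

Since the ratio simplifies to $n / p$, the per-layer count of self-attention exceeds that of recurrence only when the sequence length is larger than the embedding dimension.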

3.3.2. Multihead Attention with Cross-Attention
As Figs. 9 and 11 illustrate, the output of the masked multihead attention module is fed to a multihead attention module with cross-attention. This module is not self-attention because its values, keys, and queries are not all from the same sequence: its values and keys are from the output of the encoder and the queries are from the output of the masked multihead attention module in the decoder. In other words, the values and keys come from the processed input embeddings and the queries are from the processed output embeddings. The calculated multihead attention determines how much every output embedding attends to the input embeddings, how much every pair of output embeddings attends to the pairs of input embeddings, how much every pair of pairs of output embeddings attends to the pairs of pairs of input embeddings, and so on. This shows the connection between the input sequence and the generated output sequence.

3.3.3. Feedforward Layer and Softmax Activation
Again, the output of the multihead attention module with cross-attention is normalized and added to its input. Then, it is fed to a feedforward neural network with layer normalization and adding to its input afterwards. Note that the masked multihead attention, the multihead attention with cross-attention, and the feedforward network are stacked $N = 6$ times.

The output of the feedforward network passes through a linear layer by linear projection and a softmax activation function is applied finally. The number of output neurons with the softmax activation functions is the number of all words in the dictionary, which is a large number. The outputs of the decoder sum to one and are the probability of every word in the dictionary to be the generated next word. For the sake of sequence generation, the token or word with the largest probability is the next word.

3.4. Attention is All We Need!
3.4.1. No Need to RNN!
As Fig. 9 illustrates, the output of the decoder is fed to the masked multihead attention module of the decoder with some shift. Note that this is not a notion of recurrence because

former. We showed that we can learn a sequence using the transformer. Therefore, attention is all we need to learn a sequence and there is no need for any recurrence module. The proposal of transformers (Vaswani et al., 2017) was a breakthrough in NLP; the state-of-the-art NLP methods are all based on transformers nowadays.

3.4.2. Complexity Comparison
Table 1 reports the complexity of operations in the self-attention mechanism and compares them with those in recurrence such as RNN. In self-attention, we learn the attention of every word to every other word in the sequence of $n$ words. Also, we learn a $p$-dimensional embedding for every word. Hence, the complexity of operations per layer is $O(n^2 p)$, while the complexity per layer in recurrence is $O(n p^2)$. Although the complexity per layer in self-attention is worse than that of recurrence for long sequences ($n > p$), many of its operations can be performed in parallel because all the words of the sequence are processed simultaneously, as also explained in the following. Hence, the $O(n^2 p)$ cost is not very bad because it can be parallelized, whereas recurrence cannot be parallelized because of its sequential nature.

As for the number of sequential operations, the self-attention mechanism processes all the $n$ words simultaneously, so its number of sequential operations is of order $O(1)$. As recurrence should process the words sequentially, the number of its sequential operations is of order $O(n)$. As for the maximum path length between every two words, self-attention learns attention between every two words; hence, its maximum path length is of order $O(1)$. However, in recurrence, as every word requires a path with a length of a fraction of the sequence (a length of $n$ in the worst case) to reach the process of another word, its maximum path length is $O(n)$. This shows that attention reduces both sequential operations and maximum path length, compared to recurrence.

4. BERT: Bidirectional Encoder Representations from Transformers
BERT (Devlin et al., 2018) is one of the state-of-the-art methods for NLP. It is a stack of encoders of the transformer (see Fig. 9). In other words, it is built using transformer encoder blocks. Although some NLP methods, such as
it can be interpreted by the procedure of teacher-forcing XLNet (Yang et al., 2019b) have slightly outperformed it,
(Kolen & Kremer, 2001). Hence, we see that there is not BERT is still one of the best models for different NLP tasks
any recurrent module like RNN (Kombrink et al., 2011) such as question answering (Qu et al., 2019), natural lan-
and LSTM (Hochreiter & Schmidhuber, 1997) in trans- guage understanding (Dong et al., 2019), sentiment analy-
Attention Mechanism, Transformers, BERT, and GPT: Tutorial and Survey 10

and train a classifier on them for the task of spam detection


or sentiment analysis.
During training the BERT model, in addition to learning the
embeddings for the words and the whole sentence, paper
(Devlin et al., 2018) has also learned an additional task.
This task is given two sentences A and B, is B likely to be
the sentence that follows A or not?
The BERT model is usually not trained from the scratch as
its training has been done in a long time on huge amount
of Internet data. For using it in different NLP applications,
such as sentiment analysis, researchers usually add one or
several neural network layers on top of a pre-trained BERT
model and train the network for their own task. During
training, one can either lock the weights of the BERT model
or fine tune them by backpropagation.
Figure 12. Feeding sentences with missing words to the BERT
model for training.
The parameters of encoder in transformer (Vaswani et al.,
2017) are 6 encoder layers, 512 hidden layer units in the
fully connected network and 8 attention heads (h = 8).
sis, and language inference (Song et al., 2020). This is while BERT (Devlin et al., 2018) has 24 encoder
layers, 1024 hidden layer units in the fully connected net-
BERT uses the technique of masked language modeling. It
work and 16 attention heads (h = 16). Usually, when
masks 15% of words in the input document/corpus and asks
we say BERT, we mean the large BERT (Devlin et al.,
the model to predict the missing words. As Fig. 12 depicts,
2018) with the above mentioned parameters. As the BERT
a sentence with a missing word is given to every trans-
model is huge and requires a lot of memory for saving
former encoding block in the stack and the block is sup-
the model, it cannot easily be used in embedded systems.
posed to predict the missing word. 15% of words are miss-
Hence, many commercial smaller versions of BERT are
ing in the sentences and every missing word is assigned to
proposed with less number of parameters and number of
every encoder block in the stack. It is an unsupervised man-
stacks. Some of these smaller versions of BERT are small
ner because any word can be masked in a sentence and the
BERT (Tsai et al., 2019), tiny BERT (Jiao et al., 2019), Dis-
output is supposed to be that word. As it is unsupervised
tilBERT (Sanh et al., 2019), and Roberta BERT (Staliūnaitė
and does not require labels, the huge text data of Internet
& Iacobacci, 2020). Some BERT models, such as clinical
can be used for training the BERT model where words are
BERT (Alsentzer et al., 2019) and BioBERT (Lee et al.,
randomly selected to be masked.
2020), have also been trained on medical texts for the
Note that BERT learns to predict the missing word based biomedical applications.
on attention to its previous and forthcoming words, it is
bidirectional. Hence, BERT jointly conditions on both 5. GPT: Generative Pre-trained Transformer
left (previous) and right (forthcoming) context of every
word. Moreover, as the missing word is predicted based GPT, or GPT-1, (Radford et al., 2018) is another state-of-
on the other words of sentence, BERT embeddings for the-art method for NLP. It is a stack of decoders of trans-
words are context-aware embeddings. Therefore, in con- former (see Fig. 9). In other words, it is built using trans-
trast to word2vec (Mikolov et al., 2013a;b; Goldberg & former decoder blocks. In GPT, the multihead attention
Levy, 2014; Mikolov et al., 2015) and GloVe (Pennington module with cross-attention is removed from the decoder of
et al., 2014) which provide a single embedding per every transformer because there is no encoder in GPT. Hence, the
word, every word has different BERT embeddings in var- decoder blocks used in GPT have only positional encoding,
ious sentences. The BERT embeddings of words differ in masked multihead self-attention module and feedforward
different sentences based on their context. For example, the network with their adding, layer normalization, and activa-
word “bank” has different meanings and therefore different tion functions.
embeddings in the sentences “Money is in the bank” and Note that as GPT uses the masked multihead self-attention,
“Some plants grow in bank of rivers”. it considers attention of word, pairs of words, pairs of pairs
It is also noteworthy that, for an input sentence, BERT out- of words, and so on, only on the previous (left) words, pairs
puts an embedding for the whole sentence in addition to of words, pairs of pairs of words, and so on. In other words,
giving embeddings for every word of the sentence. This GPT only conditions on the previous words and not the
sentence embedding is not perfect but works well enough forthcoming words. As was explained before, the objec-
in applications. One can use the BERT sentence embedding tive of BERT was to predict a masked word in a sentence.
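To illustrate conditioning only on the previous words, the following is a minimal sketch of masked (causal) self-attention in plain Python. It is an illustration under simplifying assumptions, not GPT's actual implementation: there is a single head, the queries, keys, and values are the embeddings themselves (no learned projection matrices), and the function name `causal_self_attention` and the toy sequences are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head self-attention where position i attends only to positions <= i.

    x is a list of n embedding vectors (each a list of p floats).
    Assumption of this sketch: queries, keys, and values are all x itself.
    """
    n, p = len(x), len(x[0])
    out = []
    for i in range(n):
        # Scaled dot-product scores against the current and previous words only;
        # forthcoming words (j > i) are never looked at, which is the "mask".
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(p)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out.append(
            [sum(w * x[j][d] for j, w in enumerate(weights)) for d in range(p)]
        )
    return out


# Two toy sequences that differ only in their last word.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [-3.0, 2.0]]
out_a = causal_self_attention(seq_a)
out_b = causal_self_attention(seq_b)

# The outputs for the first two positions do not depend on the third word.
print(out_a[:2] == out_b[:2])
```

Changing the last word leaves the outputs of the earlier positions untouched, so every position can be used to predict its own next word without seeing it.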
GPT is used for language modeling (Rosenfeld, 2000; Jozefowicz et al., 2016; Jing & Xu, 2019), whose objective is to predict the next word in an incomplete sentence given all of the previous words. The predicted new word is then appended to the sequence, which is fed to GPT as input again, and the next word is predicted. This goes on until the sentences are completed with their following words. In other words, the GPT model takes some document and continues the text in the best and most related way. For example, if the input sentences are about psychology, the trained GPT model generates the next sentences also about psychology to complete the document. Note that as any text without labels can be used for predicting the next words in sentences, GPT is an unsupervised method, which makes it possible to train it on a huge amount of Internet data.

The successors of GPT-1 (Radford et al., 2018) are GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020). GPT-2 and GPT-3 are extensions of GPT-1 with a larger number of stacked transformer decoder blocks. Hence, they have more learnable parameters and can be trained with more data for better language modeling and inference. For example, GPT-2 has 1.5 billion parameters. GPT-2 and especially GPT-3 have been trained on much more Internet data covering various general and academic subjects to be able to generate text in any subject and style of interest. For example, GPT-2 has been trained on 8 million web pages which contain 40GB of Internet text data.

GPT-2 is a quite large model and cannot be easily used in embedded systems because it requires a large memory. Hence, different sizes of GPT-2, such as small, medium, large, XLarge, and DistilGPT-2, are provided for usage in embedded systems, where the number of stacked blocks and learnable parameters differ between these versions. These versions of GPT-2 can be found and used in the HuggingFace transformers Python package (Wolf et al., 2019a).

GPT-2 has been used in many different applications such as dialogue systems (Budzianowski & Vulić, 2019), patent claim generation (Lee & Hsiang, 2019), and medical text simplification (Van et al., 2020). A combination of GPT-2 and BERT has been used for question answering (Klein & Nabi, 2019). It is noteworthy that GPT can be seen as a few-shot learner (Brown et al., 2020). A comparison of GPT and BERT can also be found in (Ethayarajh, 2019).

GPT-3 is a very large version of GPT with a very large number of stacked blocks and learnable parameters. For comparison, note that GPT-2, NVIDIA Megatron (Shoeybi et al., 2019), Microsoft Turing-NLG (Microsoft, 2020), and GPT-3 (Brown et al., 2020) have 1.5 billion, 8 billion, 17 billion, and 175 billion learnable parameters, respectively. This huge number of parameters allows GPT-3 to be trained on a very large amount of Internet text data on various subjects and topics. Hence, GPT-3 has been able to learn almost all topics of documents, and some people are even discussing whether it can pass the writer’s Turing test (Elkins & Chun, 2020; Floridi & Chiriatti, 2020). Note that GPT-3 has, in a sense, memorized the texts of all subjects, not in a bad way (i.e., overfitting) but in a good way. This memorization is due to the capacity of its huge number of learnable parameters (Arpit et al., 2017), and the lack of overfitting is due to training on big enough Internet data.

GPT-3 has had many interesting applications such as fiction and poetry generation (Branwen, 2020). Of course, it poses some risks, too (McGuffie & Newhouse, 2020).

6. Conclusion

Transformers are essential tools in natural language processing and computer vision. This paper was a tutorial and survey on attention mechanism, transformers, BERT, and GPT. We explained the attention mechanism, the sequence-to-sequence model with and without attention, and self-attention. The different parts of the encoder and decoder of a transformer were explained. Finally, BERT and GPT were introduced as stacks of the encoders and decoders of the transformer, respectively.

Acknowledgment

The authors hugely thank Prof. Pascal Poupart, whose course partly covered some of the materials in this tutorial paper. Some of the materials of this paper can also be found in Prof. Ali Ghodsi’s course videos.

References

Alsentzer, Emily, Murphy, John R, Boag, Willie, Weng, Wei-Hung, Jin, Di, Naumann, Tristan, and McDermott, Matthew. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.

Arpit, Devansh, Jastrzebski, Stanisław, Ballas, Nicolas, Krueger, David, Bengio, Emmanuel, Kanwal, Maxinder S, Maharaj, Tegan, Fischer, Asja, Courville, Aaron, Bengio, Yoshua, and Lacoste-Julien, Simon. A closer look at memorization in deep networks. In International Conference on Machine Learning, 2017.

Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.

Ben-David, Shai, Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Vaughan, Jennifer Wortman. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.

Branwen, Gwern. GPT-3 creative fiction. https://www.gwern.net/GPT-3, 2020.
Brown, Tom B, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, et al. Language models are few-shot learners. In Advances in neural information processing systems, 2020.

Budzianowski, Paweł and Vulić, Ivan. Hello, it’s GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. arXiv preprint arXiv:1907.05774, 2019.

Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

Chan, William, Jaitly, Navdeep, Le, Quoc, and Vinyals, Oriol. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4960–4964. IEEE, 2016.

Cheng, Jianpeng, Dong, Li, and Lapata, Mirella. Long short-term memory-networks for machine reading. In Conference on Empirical Methods in Natural Language Processing, pp. 551–561, 2016.

Chorowski, Jan, Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv preprint arXiv:1412.1602, 2014.

Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dong, Li, Yang, Nan, Wang, Wenhui, Wei, Furu, Liu, Xiaodong, Wang, Yu, Gao, Jianfeng, Zhou, Ming, and Hon, Hsiao-Wuen. Unified language model pre-training for natural language understanding and generation. In Advances in Neural Information Processing Systems, pp. 13063–13075, 2019.

Dou, Qi, Coelho de Castro, Daniel, Kamnitsas, Konstantinos, and Glocker, Ben. Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems, 32:6450–6461, 2019.

Elkins, Katherine and Chun, Jon. Can GPT-3 pass a writer’s Turing test? Journal of Cultural Analytics, 2371:4549, 2020.

Ethayarajh, Kawin. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.

Floridi, Luciano and Chiriatti, Massimo. GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, pp. 1–14, 2020.

Garcia-Molina, Hector, Ullman, Jeffrey D, and Widom, Jennifer. Database systems: The complete book. Prentice Hall, 1999.

Goldberg, Yoav and Levy, Omer. word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

Gretton, Arthur, Borgwardt, Karsten, Rasch, Malte, Schölkopf, Bernhard, and Smola, Alex J. A kernel method for the two-sample-problem. In Advances in neural information processing systems, pp. 513–520, 2007.

Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Indurkhya, Nitin and Damerau, Fred J. Handbook of natural language processing, volume 2. CRC Press, 2010.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jiao, Xiaoqi, Yin, Yichun, Shang, Lifeng, Jiang, Xin, Chen, Xiao, Li, Linlin, Wang, Fang, and Liu, Qun. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.

Jing, Kun and Xu, Jungang. A survey on neural network language models. arXiv preprint arXiv:1906.03591, 2019.

Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.

Kazemi, Vahid and Elqursh, Ali. Show, ask, attend, and answer: A strong baseline for visual question answering. arXiv preprint arXiv:1704.03162, 2017.
Klein, Tassilo and Nabi, Moin. Learning to answer by learning to ask: Getting the best of GPT-2 and BERT worlds. arXiv preprint arXiv:1911.02365, 2019.

Kolen, John F and Kremer, Stefan C. A field guide to dynamical recurrent networks. John Wiley & Sons, 2001.

Kombrink, Stefan, Mikolov, Tomáš, Karafiát, Martin, and Burget, Lukáš. Recurrent neural network based language modeling in meeting recognition. In Twelfth annual conference of the international speech communication association, 2011.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, Honglak, Grosse, Roger, Ranganath, Rajesh, and Ng, Andrew Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, pp. 609–616, 2009.

Lee, Jieh-Sheng and Hsiang, Jieh. Patent claim generation by fine-tuning OpenAI GPT-2. arXiv preprint arXiv:1907.02052, 2019.

Lee, Jinhyuk, Yoon, Wonjin, Kim, Sungdong, Kim, Donghyeon, Kim, Sunkyu, So, Chan Ho, and Kang, Jaewoo. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.

Li, Da, Yang, Yongxin, Song, Yi-Zhe, and Hospedales, Timothy M. Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pp. 5542–5550, 2017.

Li, Hui, Wang, Peng, Shen, Chunhua, and Zhang, Guyu. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 8610–8617, 2019a.

Li, Yang, Kaiser, Lukasz, Bengio, Samy, and Si, Si. Area attention. In International Conference on Machine Learning, pp. 3846–3855. PMLR, 2019b.

Lim, Jian Han, Chan, Chee Seng, Ng, Kam Woh, Fan, Lixin, and Yang, Qiang. Protect, show, attend and tell: Image captioning model with ownership protection. arXiv preprint arXiv:2008.11009, 2020.

Luong, Minh-Thang, Pham, Hieu, and Manning, Christopher D. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.

McGuffie, Kris and Newhouse, Alex. The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807, 2020.

Microsoft. Turing-NLG: A 17-billion-parameter language model by Microsoft. Microsoft Blog, 2020.

Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013b.

Mikolov, Tomas, Chen, Kai, Corrado, Gregory S, and Dean, Jeffrey A. Computing numeric representations of words in a high-dimensional space, May 19 2015. US Patent 9,037,464.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing, pp. 1532–1543, 2014.

Perlovsky, Leonid I. Toward physics of the mind: Concepts, emotions, consciousness, and symbols. Physics of Life Reviews, 3(1):23–55, 2006.

Qu, Chen, Yang, Liu, Qiu, Minghui, Croft, W Bruce, Zhang, Yongfeng, and Iyyer, Mohit. BERT with history answer embedding for conversational question answering. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136, 2019.

Qureshi, Ahmed Hussain, Nakamura, Yutaka, Yoshikawa, Yuichiro, and Ishiguro, Hiroshi. Show, attend and interact: Perceivable human-robot social interaction through neural attention q-network. In 2017 IEEE International Conference on Robotics and Automation, pp. 1639–1645. IEEE, 2017.

Radford, Alec, Narasimhan, Karthik, Salimans, Tim, and Sutskever, Ilya. Improving language understanding by generative pre-training. Technical report, OpenAI, 2018.

Radford, Alec, Wu, Jeffrey, Child, Rewon, Luan, David, Amodei, Dario, and Sutskever, Ilya. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Rosenfeld, Ronald. Two decades of statistical language modeling: Where do we go from here? Proceedings of the IEEE, 88(8):1270–1278, 2000.
Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sanh, Victor, Debut, Lysandre, Chaumond, Julien, and Wolf, Thomas. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

Shoeybi, Mohammad, Patwary, Mostofa, Puri, Raul, LeGresley, Patrick, Casper, Jared, and Catanzaro, Bryan. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Socher, Richard, Bengio, Yoshua, and Manning, Chris. Deep learning for NLP. Tutorial at Association for Computational Linguistics (ACL), 2012.

Song, Youwei, Wang, Jiahai, Liang, Zhiwei, Liu, Zhiyue, and Jiang, Tao. Utilizing BERT intermediate layers for aspect based sentiment analysis and natural language inference. arXiv preprint arXiv:2002.04815, 2020.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.

Staliūnaitė, Ieva and Iacobacci, Ignacio. Compositional and lexical semantics in RoBERTa, BERT and DistilBERT: A case study on CoQA. arXiv preprint arXiv:2009.08257, 2020.

Summerfield, Christopher and Egner, Tobias. Expectation (and attention) in visual cognition. Trends in cognitive sciences, 13(9):403–409, 2009.

Tsai, Henry, Riesa, Jason, Johnson, Melvin, Arivazhagan, Naveen, Li, Xin, and Archer, Amelia. Small and practical BERT models for sequence labeling. arXiv preprint arXiv:1909.00100, 2019.

Van, Hoang, Kauchak, David, and Leroy, Gondy. AutoMeTS: The autocomplete for medical text simplification. arXiv preprint arXiv:2010.10573, 2020.

Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, and Polosukhin, Illia. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.

Wang, Yong, Wang, Longyue, Shi, Shuming, Li, Victor OK, and Tu, Zhaopeng. Go from the general to the particular: Multi-domain translation with domain transformation networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 9233–9241, 2020.

Wolf, Thomas, Debut, Lysandre, Sanh, Victor, Chaumond, Julien, Delangue, Clement, Moi, Anthony, Cistac, Pierric, Rault, Tim, Louf, Rémi, Funtowicz, Morgan, et al. HuggingFace’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019a.

Wolf, Thomas, Sanh, Victor, Chaumond, Julien, and Delangue, Clement. TransferTransfo: A transfer learning approach for neural network based conversational agents. In Proceedings of the AAAI Conference on Artificial Intelligence, 2019b.

Xu, Jun. On the techniques of English fast-reading. In Theory and Practice in Language Studies, volume 1, pp. 1416–1419. Academy Publisher, 2011.

Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, Cho, Kyunghyun, Courville, Aaron, Salakhudinov, Ruslan, Zemel, Rich, and Bengio, Yoshua. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057, 2015.

Yang, Chao, Kim, Taehwan, Wang, Ruizhe, Peng, Hao, and Kuo, C-C Jay. Show, attend, and translate: Unsupervised image translation with self-regularization and attention. IEEE Transactions on Image Processing, 28(10):4845–4856, 2019a.

Yang, Zhilin, Dai, Zihang, Yang, Yiming, Carbonell, Jaime, Salakhutdinov, Russ R, and Le, Quoc V. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763, 2019b.

Yu, Keyi, Liu, Yang, Schwing, Alexander G, and Peng, Jian. Fast and accurate text classification: Skimming, rereading and early stopping. In International Conference on Learning Representations, 2018.

Zhang, Honglun, Chen, Wenqing, Tian, Jidong, Wang, Yongkun, and Jin, Yaohui. Show, attend and translate: Unpaired multi-domain image-to-image translation with visual attention. arXiv preprint arXiv:1811.07483, 2018.