UNIT – 7
Natural Language Processing Models
Natural language processing is one of the most fascinating topics in artificial intelligence. Deep
learning models that have been trained on a large dataset to perform specific NLP tasks are referred
to as pre-trained models (PTMs) for NLP, and they can aid in downstream NLP tasks by avoiding
the need to train a new model from scratch.
List of natural language processing models:
1. BERT
BERT is short for Bidirectional Encoder Representations from Transformers; it was created by
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. It is a natural language
processing machine learning (ML) model, released in 2018, that serves as a Swiss Army Knife
solution for 11+ of the most common language tasks, such as sentiment analysis and named entity
recognition.
BERT, compared to recent language representation models, is intended to pre-train deep
bidirectional representations by conditioning on both the left and right contexts in all layers. As a
matter of fact, the pre-trained BERT representations can be fine-tuned with just one additional
output layer to produce cutting-edge models for a variety of tasks, including question answering
and language inference, without requiring significant task-specific modifications.
This bidirectional model can learn from both the left and the right directions of a token’s context,
which is very important when we want the model to understand the context very well.
Ex:
We went to the river bank.
I need to go to the bank to make a deposit.
We want to get the context of the word “bank”.
If we look only at the words to the left of “bank”, or only at the words to its right, we will
certainly get the context wrong.
We must consider both the left and the right parts of the sentence to get the right context of the
word “bank”.
This is exactly what BERT does.
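As a rough illustration (assuming the Hugging Face transformers and torch packages are installed; the helper function bank_vector is ours, for illustration only), the sketch below compares the contextual vectors BERT assigns to “bank” in the two sentences above:
```
# Compare BERT's contextual representation of "bank" in two sentences.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Encode the sentence and take the hidden state of the "bank" token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("We went to the river bank.")
v_money = bank_vector("I need to go to the bank to make a deposit.")

# A cosine similarity clearly below 1 shows that the two occurrences of
# "bank" get different, context-dependent representations.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```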
BERT’s continued success has been aided by a massive dataset of 3.3 billion words. It was trained
on English Wikipedia (about 2.5B words) and the BooksCorpus (about 800M words). These
massive informational datasets aided BERT’s deep understanding not only of the English language
but also of our world.
Key performances of BERT
• BERT provides a pre-trained model that can be applied to specific NLP tasks without any
significant architecture changes; a minimal fine-tuning sketch follows below.
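As a hedged sketch of this point (assuming the Hugging Face transformers and torch packages; the checkpoint name and the binary sentiment task are illustrative), the same pre-trained BERT encoder is reused and only a small classification head is added on top:
```
# Reuse the pre-trained BERT encoder; only one new output layer is added.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # adds a single classification head
)

inputs = tokenizer("This movie was surprisingly good!", return_tensors="pt")
logits = model(**inputs).logits          # shape: (1, 2)
# The head is untrained here; it would be fine-tuned on labelled data
# before its scores become meaningful.
print(logits.shape)
```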
2. GPT-2/3
GPT stands for Generative Pre-trained Transformer. GPT-2/3 are autoregressive language models
that use deep learning to produce human-like text.
It utilizes the concept of the multi-layer transformer decoder as the feature extractor.
Given the wide variety of possible tasks and the difficulty of collecting a large labeled training
dataset, researchers proposed an alternative solution, which was scaling up language models to
improve task-agnostic few-shot performance. They put their solution to the test by training and
evaluating a 175B-parameter autoregressive language model called GPT-3 on a variety of NLP
tasks.
A plain Transformer does not restrict feature extraction to one direction (i.e., from left to right);
it can also observe the following words while predicting the next word. GPT therefore uses the
masked multi-head attention concept, i.e., the transformer considers only the part of the input text
to the left of the current position, which makes it a one-way transformer. GPT-2 is a successor of
the GPT model, with about 1.5 billion parameters, trained on millions of web pages. The main
objective is to predict the next word given all the prior words in the context.
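A minimal sketch of this next-word objective (assuming the Hugging Face transformers and torch packages; the prompt is made up for illustration):
```
# Ask GPT-2 for the single most likely next token given all prior words.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The main objective of a language model is to predict the next"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # (1, seq_len, vocab_size)

# Masked (causal) self-attention guarantees that the score at the last
# position depends only on the tokens before it.
next_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_id))
```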
The evaluation results show that GPT-3 achieves promising results and occasionally outperforms
the state of the art achieved by fine-tuned models under few-shot learning, one-shot learning, and
zero-shot learning.
Key performances of GPT-3
• It can create anything with a text structure, and not just human language text.
• It can automatically generate text summarizations and even programming code.
3. XLNet
The XLNet model was proposed in XLNet: Generalized Autoregressive Pretraining for Language
Understanding by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan
Salakhutdinov, and Quoc V. Le.
XLNet is a Transformer-XL model extension that was pre-trained using an autoregressive method
to maximize the expected likelihood over all permutations of the input sequence factorization
order.
XLNet is a generalized autoregressive pretraining method that enables learning in bidirectional
contexts by maximizing the expected likelihood over all permutations of the factorization order and
overcomes the limitations of other models thanks to its autoregressive formulation.
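The permutation idea can be illustrated with a toy, pure-Python sketch that prints the autoregressive factorization of a three-token sequence under every ordering (the tokens are made up for illustration):
```
# Enumerate every factorization order of p(x1, x2, x3); XLNet maximizes the
# expected log-likelihood over such orders, so each token is eventually
# predicted with context from both sides while each order stays autoregressive.
from itertools import permutations

tokens = ["New", "York", "is"]            # positions 0, 1, 2
for order in permutations(range(len(tokens))):
    factors = []
    for t, pos in enumerate(order):
        context = ", ".join(tokens[p] for p in order[:t]) or "empty"
        factors.append(f"p({tokens[pos]} | {context})")
    print(" * ".join(factors))
```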
Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive
model, into pretraining. Empirically, XLNet outperforms BERT, for example, on 20 tasks, often by
a large margin, and achieves state-of-the-art results on 18 tasks, including question answering,
natural language inference, sentiment analysis, and document ranking.
Key performances of XLNet
• The new model outperforms previous models on 18 NLP tasks, including question
answering, natural language inference, sentiment analysis, and document ranking.
• XLNet consistently outperforms BERT, often by a wide margin.
4. ELMo
Embeddings from Language Models (ELMo) is another pre-trained autoregressive model used in
building NLP applications. It combines two one-way language models, a forward and a backward
long short-term memory (LSTM) network, for autoregressive pre-training.
Because of the way it is pre-trained, the LSTM pair is not used as a single bidirectional encoder.
Each LSTM produces its own contextual representation, which impacts the result of prediction.
ELMo is thus built from one-way models: it handles the encoding and characterization in each
direction separately and only combines the two directions afterwards.
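A conceptual PyTorch sketch of this forward/backward combination (the dimensions and layers are illustrative, not ELMo's actual character-aware configuration):
```
# Run a forward LSTM and a backward LSTM separately over the same sequence
# and combine their hidden states only afterwards, ELMo-style.
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 50, 64, 7
x = torch.randn(1, seq_len, embed_dim)              # one toy sentence

forward_lm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
backward_lm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

fwd_out, _ = forward_lm(x)                          # left-to-right pass
bwd_out, _ = backward_lm(torch.flip(x, dims=[1]))   # right-to-left pass
bwd_out = torch.flip(bwd_out, dims=[1])             # re-align with the input

# Each direction is a one-way model; the two representations are combined
# here by simple concatenation.
elmo_like = torch.cat([fwd_out, bwd_out], dim=-1)
print(elmo_like.shape)                              # torch.Size([1, 7, 128])
```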
5. ERNIE
BERT's masked language model (MLM), in its Chinese version, can capture linguistic information
from the input text, such as collocations, but it is comparatively weaker at capturing semantic and
entity-level information. Hence, the ERNIE model extends the widely used pre-trained models with
knowledge beyond plain-text comprehension to boost performance on knowledge-level workloads.
Its pre-training tasks, illustrated by the sketch after this list, are:
• Basic-level masking: individual words are masked, as in BERT; at this level it is difficult to
learn higher-level semantic information.
• Phrase-level masking: the input is still handled at the word level, but the mask covers a
continuous phrase.
• Entity-level masking: entity recognition is performed first, and the recognized entities are then
masked.
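A toy sketch of the three masking levels (the sentence, the spans, and the mask helper are hand-picked for illustration and are not ERNIE's actual pipeline):
```
# Apply word-, phrase-, and entity-level masks to one toy sentence.
tokens = ["Harry", "Potter", "is", "a", "series", "of", "fantasy", "novels"]

def mask(tokens, spans):
    out = list(tokens)
    for start, end in spans:          # mask each span as a whole
        for i in range(start, end):
            out[i] = "[MASK]"
    return " ".join(out)

# Basic level: individual words are masked, as in BERT.
print(mask(tokens, [(2, 3), (6, 7)]))
# Phrase level: a continuous phrase ("fantasy novels") is masked as a unit.
print(mask(tokens, [(6, 8)]))
# Entity level: a named entity ("Harry Potter") found by NER is masked.
print(mask(tokens, [(0, 2)]))
```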
6. ELECTRA
ELECTRA is a model for self-supervised language representation learning. It requires less
computation and can be used to pre-train transformer-based models. These models learn to
distinguish between real input tokens and fake tokens created by another network.
Existing pre-training methods generally fall under two categories: language models (LMs), such
as GPT, which process the input text left-to-right, predicting the next word given the previous
context, and masked language models (MLMs), such as BERT, RoBERTa, and ALBERT, which
instead predict the identities of a small number of words that have been masked out of the input.
MLMs have the advantage of being bidirectional rather than unidirectional: they “see” the text to
both the left and the right of the token being predicted, instead of only to one side. However, the
MLM objective (and related objectives such as XLNet’s) also has a disadvantage: instead of
predicting every single input token, those models predict only a small subset, the 15% that was
masked out, which reduces the amount learned from each sentence.
[Figure: Existing pre-training methods and their disadvantages. Arrows indicate which tokens are used to produce a given
output representation (rectangle). Left: traditional language models (e.g., GPT) use only context to the left of the current
word. Right: masked language models (e.g., BERT) use context from both the left and right, but predict only a small
subset of words for each input.]
Replaced token detection trains a bidirectional model while learning from all input positions.
The replacement tokens come from another neural network called the generator. While the
generator can be any model that produces an output distribution over tokens, we use a small
masked language model (i.e., a BERT model with small hidden size) that is trained jointly with
the discriminator. Although the structure of the generator feeding into the discriminator is similar
to a GAN, we train the generator with maximum likelihood to predict masked words, rather than
adversarially, due to the difficulty of applying GANs to text. The generator and discriminator share
the same input word embeddings. After pre-training, the generator is dropped and the discriminator
(the ELECTRA model) is fine-tuned on downstream tasks.
ELECTRA models all use the Transformer neural architecture.
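A minimal sketch of replaced token detection with a publicly released ELECTRA discriminator (assuming the Hugging Face transformers and torch packages; the replaced word is chosen by hand for illustration):
```
# Score every token of a corrupted sentence as "original" or "replaced".
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# The verb has been replaced by the implausible "flew".
sentence = "The chef flew the delicious meal"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits[0]        # one score per token

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, scores):
    # Higher scores mean the discriminator suspects a replaced token.
    print(f"{token:>10s}  {score.item(): .2f}")
```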
Text-to-Text Transfer Transformer
The model is called the Text-to-Text Transfer Transformer (T5). From the name, you may already
guess that the architecture of T5 is the Transformer and that it leverages transfer learning. The
details of the Transformer architecture are not repeated here.
The T5 model uses the transformer framework. It finds application in language translation,
document translation, question generation and answering, and text classification tasks.
It takes a text document as input for training and outputs text. This text-to-text formulation lets
the same model, loss function, and hyperparameters be used across various text-based tasks.
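A minimal sketch of the text-to-text interface (assuming the Hugging Face transformers and sentencepiece packages; the task prefix and sentence are illustrative):
```
# The task is named in the input prefix and the answer comes back as text.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The same model, loss, and hyperparameters handle translation,
# summarization, classification, etc.; only the textual prefix changes.
text = "translate English to German: The house is wonderful."
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```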
Text Summarization Techniques
Text Summarization is the process of creating a compact yet accurate summary of text
documents.
What is the need for Text Summarization?
With the ever-growing amount of information available, it is important to have shorter,
meaningful summaries that give better structure to that information.
Forms of Text Summarization
There are two primary approaches towards text summarization. They are -
1. Extractive
Within this approach, the most relevant sentences in the text document are reproduced as they are
in the summary. New words or phrases are thus not added.
2. Abstractive
This approach, on the other hand, focuses on interpreting the text within documents and generating
new phrases that best represent the essence of the document.
Extractive Approach
The Extractive Approach is mainly based on three independent steps as described below.
1. Generation of an Intermediate Representation
The text that is to be summarized is mapped to an intermediate representation, either a topic
representation or an indicator representation.
Each kind of representation differs in complexity and has several techniques for
performing it.
2. Assign a score to each sentence
The Sentence Score directly implies how important the sentence is to the text.
3. Select Sentences for the Summary
The most relevant k sentences are selected for the summary based on several factors, such as
eliminating redundancy, fulfilling the context, etc.
Abstractive Approach
The Abstractive Approach is mainly based on the following steps -
1. Establishing a context for the text
An abstractive approach works similarly to human text summarization.
Thus, the first step is to understand the context of the text.
2. Semantics
Words based on semantic understanding of the text are either reproduced from the
original text or newly generated.
Example
Text: There were bad weather conditions in the town. Subsequently, the roads were impassable.
Extractive Approach
There were bad weather conditions in the town. Subsequently, the roads were impassable.
Bad weather conditions town. Subsequently, roads impassable
Abstractive Approach
Bad weather conditions made town roads impassable.
Methods of Implementation
Following are the text summarization techniques:
• Luhn's Heuristic Method
• Edmundson's Heuristic Method
• SumBasic
• KL-Sum
• LexRank
• TextRank
• Reduction
• Latent Semantic Analysis
Listed below are some common methods of text summarization, their advantages and
disadvantages –
Luhn's Heuristic Method
• Luhn proposed that the significance of each word in a text document indicates how relevant it
is.
• Filler words like 'a', 'and', 'the' and the like are ignored, and more importance is assigned to
the sentences at the beginning of the document.
• The idea is that any sentence with the maximum occurrences of the highest-frequency words is
more important to the meaning of the document than the others.
This is one of the earliest approaches to text summarization and is not considered very accurate.
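A minimal sketch in the spirit of Luhn's heuristic (the stop-word list, regular expressions, and scoring are simplified stand-ins, not Luhn's exact procedure):
```
# Score sentences by the frequency of their non-filler words and keep the
# top-k sentences in their original order.
import re
from collections import Counter

STOP = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "at"}

def summarize(text, k=1):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP]
    freq = Counter(words)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOP)

    top = sorted(sentences, key=score, reverse=True)[:k]
    return " ".join(s for s in sentences if s in top)

text = ("Text summarization shortens documents. Frequent words hint at the "
        "main topic. Sentences full of frequent words summarize the topic well.")
print(summarize(text, k=1))
```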
Edmundson's Heuristic Method
• This method uses the idea of defining bonus words and stigma words, words that are of
high or low importance respectively.
• Words in the document title are given additional importance.
• It is one of the earlier methods of text summarization, along with Luhn's Method.
SumBasic
• It is generally used for generating multi-document summaries.
• It applies the basic idea of probability, assuming that the high-frequency words in the
bag-of-words model of the document have a higher possibility of occurring in the
summary of the document.
• Probabilities are assigned to each word on the basis of their term frequency in the
document, and these probabilities are updated as sentences are chosen for the summary.
KL-Sum
• This method is based on the concept of KL Divergence and Unigram distribution.
• It adds to the summary those sentences that minimize the KL divergence between the summary's
vocabulary distribution and that of the original input vocabulary.
This method has no explicit way of eliminating redundancy.
LexRank
It is "based on the concept of eigenvector centrality in a graph representation of sentences".
• Within this algorithm, each sentence recommends sentences similar to it.
• A graph is created with each node being a sentence, connected to the sentences similar to it
(the similarity measure is usually cosine similarity over TF-IDF vectors).
• Sentences with the most recommendations are more likely to be picked for the summary.
• The idea is that any sentence important to the text document will probably be repeated in
similar ways and will thus have a greater number of similar sentences.
TextRank
This algorithm is similar to LexRank but relatively simpler.
• It works on the same basic principle as LexRank, with the only difference being the similarity
measure, i.e., the metric used to construct the edges of the graph.
• In this algorithm, the number of common words measures sentence similarity (a minimal
graph-ranking sketch follows below).
• While LexRank can be applied to multiple documents, TextRank is primarily used for
single documents.
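A minimal graph-ranking sketch in the spirit of LexRank/TextRank (assuming scikit-learn and networkx; TF-IDF cosine similarity is used here for the edge weights rather than raw common-word counts):
```
# Build a sentence-similarity graph and rank sentences with PageRank.
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The storm flooded several streets in the town.",
    "Flooded streets kept residents of the town at home.",
    "The bakery introduced a new chocolate cake.",
]

tfidf = TfidfVectorizer().fit_transform(sentences)
similarity = cosine_similarity(tfidf)
np.fill_diagonal(similarity, 0)          # no self-recommendation

graph = nx.from_numpy_array(similarity)  # weighted, undirected graph
scores = nx.pagerank(graph)              # most-"recommended" sentences rank highest

best = max(scores, key=scores.get)
print(sentences[best])
```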
Reduction
• This method also works on the idea of graph-based modelling of the text document.
• It assigns importance to sentences in accordance with the sum of their edges to other
sentences.
Latent Semantic Analysis
• It works on the principle of Term Frequency along with Singular Value Decomposition.
• The idea is to resolve the document space to a "concept space", meaning the document is
broken down into the actual underlying concept and comparisons are made within that
space.
• This is a more complicated method as compared to others.
Applications
Text summarization finds a wide variety of applications in the creation of headlines, synopses,
reviews, book and movie summaries, résumés, and so on.
Extractive text summarization refers to extracting (summarizing) the relevant sentences from a
large document while retaining the most important information.
BERT
BERT (Bidirectional Encoder Representations from Transformers) introduces a rather advanced
approach to performing NLP tasks.
BERT (a bidirectional transformer) is used to overcome limitations of RNNs and other neural
networks, such as handling long-term dependencies. It is a pre-trained model that is naturally
bidirectional. This pre-trained model can easily be fine-tuned to perform the NLP tasks as
specified.
Points at a glance:
1. BERT models are pre-trained on huge datasets; thus no further training is required.
2. It uses a powerful flat architecture with inter-sentence transformer layers so as to get the best
results in summarization.
Advantages
1. It is one of the most efficient summarizers to date.
2. It is faster than RNNs.
• The summary sentences are assumed to represent the most important points of a
document.
Methodology
For a set of sentences {sent_1, sent_2, sent_3, ..., sent_n}, each sentence has two possibilities,
y_i = {0, 1}, which denotes whether that particular sentence will be picked for the summary or not.
Being trained as a masked language model, BERT's output vectors correspond to tokens rather than
sentences. Unlike other extractive summarizers, it makes use of segment embeddings to indicate
the different sentences, and it has only two labels, namely sentence A and sentence B, rather than
one label per sentence. These embeddings are modified accordingly to generate the required
summaries.
The complete process can be divided into several phases, as follows:
Encoding Multiple Sentences
In this step, the sentences from the input document are encoded for preprocessing. Each
sentence is preceded by a [CLS] tag and followed by a [SEP] tag. The [CLS] tag is used to
aggregate the features of one or more sentences.
Interval Segment Embeddings
This step is dedicated to distinguishing the sentences in a document. Each sentence is assigned
one of the two labels discussed above. For example,
sent_i is assigned E_A or E_B depending on whether i is odd or even: E_A for odd i and E_B for even i.
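A tiny sketch of this alternating assignment:
```
# Sentences alternate between the two segment embeddings.
sentences = ["sent1", "sent2", "sent3", "sent4", "sent5"]
segments = ["E_A" if i % 2 == 1 else "E_B" for i in range(1, len(sentences) + 1)]
print(list(zip(sentences, segments)))
# [('sent1', 'E_A'), ('sent2', 'E_B'), ('sent3', 'E_A'), ('sent4', 'E_B'), ('sent5', 'E_A')]
```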
Embeddings
It basically refers to the representation of words in their vector form, which helps make their
usage flexible. Even Google utilizes this feature of BERT for a better understanding of search
queries. It helps unlock various semantics-related functionality, from understanding the intent
of a document to developing a similarity model between words.
There are three types of embeddings applied to our text prior to feeding it to the BERT layer,
namely:
a. Token Embeddings - Words are converted into a fixed-dimension vector. [CLS] and [SEP] are
added at the beginning and end of the sentences, respectively.
b. Segment Embeddings - They are used to distinguish, or classify, the different inputs using
binary coding. For example, take input1 = "I love books" and input2 = "I love sports". After
processing through token embedding we would have
```[CLS],I,love,books,[SEP],I,love,sports ```
segment embedding would result into
```[0,0,0,0,0,1,1,1] ```
``` Input1=0 , Input2=1```
c. Position Embeddings - BERT supports input sequences of up to 512 tokens, so the resulting
vector dimensions are (512, 768). Positional embeddings are used because the position of a word
in a sentence may alter the contextual meaning of the sentence, so the same word should not
always have the same vector representation. For example, in "We did not play, however we were
spectating." the two occurrences of "we" must not have the same vector representation.
NOTE - Every word is stored as a 768-dimensional representation. The overall sum of these
embeddings is the input to BERT.
BERT uses a different approach to handle the different forms of a word: for instance, "playing"
and "played" are represented as play + ##ing and play + ##ed. Here, ## marks
the subwords.
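A minimal sketch (assuming the Hugging Face transformers package) showing the token and segment inputs a BERT tokenizer actually produces for the two example inputs above; note that it also appends a final [SEP], which the hand-worked example omits:
```
# Inspect the token and segment (token_type) ids for a pair of inputs.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("I love books", "I love sports")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'i', 'love', 'books', '[SEP]', 'i', 'love', 'sports', '[SEP]']
print(encoded["token_type_ids"])
# [0, 0, 0, 0, 0, 1, 1, 1, 1]   <- segment ids; position ids are added inside the model
```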
BERT Architecture
The following two BERT models were introduced:
1. BERT base
In the BERT base model we have 12 transformer layers along with 12 attention heads
and 110 million parameters.
2. BERT large
In the BERT large model we have 24 transformer layers along with 16 attention heads and
340 million parameters.
Transformer layer - A transformer layer is actually a combination of a complete set of encoder and
decoder layers and the intermediate connections. Each encoder includes attention layers along
with a feed-forward network. The decoder has the same architecture, but it includes another
attention layer in between, as does the seq2seq model. This helps the model concentrate on
important words.
Summarization layers
The one major noticeable difference between an RNN and BERT is the self-attention layer. The
model tries to identify the strongest links between the words, which helps in building their
representation.
We can have different types of layers on top of the BERT model, each having its own
specifications:
1. Simple Classifier - In the simple classifier method, a linear layer is added on top of BERT,
together with a sigmoid function, to predict the score Ŷ_i (a minimal sketch of this head
follows after this list):
Ŷ_i = σ(W_o T_i + b_o)
2. Inter-Sentence Transformer - In the inter-sentence transformer, the simple classifier is not
used. Instead, additional transformer layers are added to the model, applied only to the
sentence representations, making it more efficient. This helps in recognizing the important
points of the document.
h̃^l = LN(h^(l-1) + MHAtt(h^(l-1)))
h^l = LN(h̃^l + FFN(h̃^l))
where h^0 = PosEmb(T), T are the sentence vectors output by BERT, PosEmb is the function
that adds positional embeddings (indicating the position of each sentence) to T, LN is the layer
normalization operation, MHAtt is the multi-head attention operation, and the superscript l
indicates the depth of the stacked layer.
These layers are followed by the sigmoid output layer
Ŷ_i = σ(W_o h_i^L + b_o)
where h_i^L is the vector for sent_i from the top layer (the L-th layer) of the Transformer.
3. Recurrent Neural Network - An LSTM layer is added on top of the BERT output in
order to learn summarization-specific features, and each LSTM cell is layer-normalized.
At time step i, the input to the LSTM layer is the BERT output T_i:
C_i = σ(F_i) ⊙ C_(i-1) + σ(I_i) ⊙ tanh(G_i)
h_i = σ(O_i) ⊙ tanh(LN_c(C_i))
where F_i, I_i, O_i are the forget, input, and output gates; G_i is the hidden vector; C_i is the
memory vector; h_i is the output vector; and LN_h, LN_x, LN_c are different layer
normalization operations. The output layer is again a sigmoid layer.
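A minimal PyTorch sketch of the simple classifier head from point 1 (dimensions are illustrative and T is a random stand-in for the sentence vectors produced by BERT):
```
# A linear layer plus a sigmoid on top of each BERT sentence vector T_i.
import torch
import torch.nn as nn

hidden_size, n_sentences = 768, 5
T = torch.randn(n_sentences, hidden_size)   # stand-in for the [CLS] vectors

classifier = nn.Linear(hidden_size, 1)      # W_o and b_o
y_hat = torch.sigmoid(classifier(T)).squeeze(-1)

# Each score in [0, 1] estimates whether the corresponding sentence
# should be included in the extractive summary.
print(y_hat)
```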
Pseudocode
The BERT extractive summarizer library can be installed directly in Python using the following
command, for ease of implementation:
pip install bert-extractive-summarizer
Import the required module from the library and create its object.
from summarizer import Summarizer
model = Summarizer()
The text to be summarized is stored in a variable:
text='''
OpenGenus Foundation is an open-source non-profit organization with the aim to enable people
to work offline for a longer stretch, reduce the time spent on searching by exploiting the fact that
almost 90% of the searches are same for every generation and to make programming more
accessible.OpenGenus is all about positivity and innovation.Over 1000 people have contributed
to our missions and joined our family. We have been sponsored by three great companies namely
Discourse, GitHub and DigitalOcean. We run one of the most popular Internship program and
open-source projects and have made a positive impact over people's life.
'''
Finally, we call the model, passing our text for summarization:
summary=model(text)
print(summary)
OUTPUT-
OpenGenus Foundation is an open-source non-profit organization with the aim to enable people
to work offline for a longer stretch , reduce the time spent on searching by exploiting the fact that
almost 90 % of the searches are same for every generation and to make programming more
accessible. We run one of the most popular Internship program and open-source projects and have
made a positive impact over people 's life