Word Embedding & Language Modelling

The document discusses different techniques for word embeddings including frequency-based methods like one-hot encoding and TF-IDF as well as prediction-based methods. It provides examples and explanations of each technique.

Uploaded by

priyankap1624153

MODULE 3: SEQUENCE TO SEQUENCE & LANGUAGE MODELLING

Word embedding: skip-gram model, CBOW, GloVe, BERT;
Sequence to sequence theory and applications,
Attention theory and teacher forcing;
Language Modelling: Basic ideas, smoothing techniques,
Language modeling with RNN and LSTM

WORD EMBEDDING IN NLP
Used for representing words for text analysis in
the form of real-valued vectors.
It is defined as a numeric vector input that
allows words with similar meanings to have
similar representations.
It can approximate meaning and represent a word
in a lower-dimensional space.
These can be trained much faster than the
hand-built models that use graph embeddings
like WordNet.

FAMILIAR WITH TERMINOLOGIES

Document
A document is a single text data point, for
example a review of a particular product by a
user.

Corpus
It is a collection of all the documents present in our
dataset.

Feature
Every unique word in the corpus is considered a
feature.

FOR EXAMPLE

Let’s consider the 2 documents shown below:

Sentences:
DOC1 : Dog hates a cat.
DOC2 : It loves to go out and play.
DOC1 * DOC2 : Cat loves to play with a ball.

We can build a corpus from the above documents just by
combining them.

Corpus = “Dog hates a cat. It loves to go out and
play. Cat loves to play with a ball.”

The features will be all the unique words:

Features: ‘and’, ‘ball’, ‘cat’, ‘dog’, ‘go’, ‘hates’, ‘it’,
‘loves’, ‘out’, ‘play’, ‘to’, ‘with’.

(The one-letter word ‘a’ is left out of this list, as happens
with tokenizers that ignore single-character tokens.)
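As a minimal sketch of how the corpus and the feature list above can be built, the snippet below joins the three sentences and collects the unique words. The `tokenize` helper and its rule of keeping only tokens of two or more characters are assumptions made here so that the result matches the feature list on the slide.

```python
import re

documents = [
    "Dog hates a cat.",
    "It loves to go out and play.",
    "Cat loves to play with a ball.",
]

# The corpus is simply all documents joined together.
corpus = " ".join(documents)


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # This drops punctuation and one-letter words such as "a"
    # (an assumption chosen to reproduce the slide's feature list).
    return re.findall(r"\b\w\w+\b", text.lower())


# Every unique token in the corpus is a feature.
features = sorted(set(tokenize(corpus)))
print(features)
```

Changing the pattern to `r"\w+"` would also keep ‘a’ as a feature.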
THE PROBLEM
Given a supervised learning task to predict which tweets
are about real disasters and which ones are not
(classification).

Here the independent variable would be the tweets (text)
and the target variable would be the binary values
(1: Real Disaster, 0: Not real Disaster).

Now, Machine Learning and Deep Learning algorithms
only take numeric input.

So, how do we convert tweets to their numeric values?

We will dive deep into the techniques to solve such
problems, but first let’s look at the solution provided by
word embedding.
WHY DO WE NEED WORD EMBEDDINGS?
As we know, many Machine Learning algorithms
and almost all Deep Learning architectures are not
capable of processing strings or plain text in their
raw form.

In a broad sense, they require numbers as
inputs to perform any sort of task, such as
classification, regression, clustering, etc.

Also, given the huge amount of data that is present in
text format, it is imperative to extract
knowledge out of it and build useful applications.

DIFFERENT TYPES OF WORD EMBEDDINGS

Broadly, we can classify word embeddings into
the following two categories:

1. Frequency-based or Statistical based Word Embedding

2. Prediction based Word Embedding

1. FREQUENCY-BASED OR STATISTICAL BASED
WORD EMBEDDING

Methods :
Label Encoding
One-Hot Encoding (OHE)
Count Vector
TF-IDF Vectorization

WHAT IS LABEL ENCODING?

1. Label Encoding:
Is a popular encoding technique for handling categorical
variables. Each label is represented by a unique integer,
typically assigned in alphabetical order.

In the scenario above (a column of country names), the
country names do not have an order or rank.
But when label encoding is performed, the country names are
ranked alphabetically.
Due to this, there is a very high probability that the model
captures a spurious relationship between countries, such as
India < Japan < the US.
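The ordering effect described above can be seen in a small sketch. The `countries` column is a hypothetical example (the table from the slide is not available in the text); the encoding simply numbers the sorted unique labels.

```python
# Hypothetical column of country names (illustrative data only).
countries = ["Japan", "India", "US", "India", "Japan"]

# Label encoding: sort the unique labels, then number them.
classes = sorted(set(countries))
label_of = {name: idx for idx, name in enumerate(classes)}

encoded = [label_of[c] for c in countries]
print(label_of)
print(encoded)

# The integers now carry an order that the categories never had.
assert label_of["India"] < label_of["Japan"] < label_of["US"]
```

A model that treats these integers as magnitudes may read "US" as larger than "India", which is the risk the slide points out.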


2. ONE-HOT ENCODING (OHE)

In this technique, each unique word in the
vocabulary gets its own position in a vector; the
word is represented by setting that position to 1
and every other position to 0.

In simple words, a one-hot encoded vector consists
only of 1s and 0s,

where 1 stands for the position where the word
exists and 0 appears everywhere else.
LET US CONSIDER THE TWO SENTENCES
---EXAMPLE
Sentence 1: “You can scale your business”
Sentence 2: “You can grow your business”.

These two sentences have the same meaning. If we build a
vocabulary from these two sentences, it will consist
of these words: {You, can, scale, grow, your, business}.

A one-hot encoding of these words would create vectors of
length 6. The encodings for each of the words would look
like this:
You: [1,0,0,0,0,0], Can: [0,1,0,0,0,0], Scale: [0,0,1,0,0,0],
Grow: [0,0,0,1,0,0], Your: [0,0,0,0,1,0], Business:
[0,0,0,0,0,1]
In a 6-dimensional space, each word would occupy one of
the dimensions, meaning that none of these words has any
similarity with each other, irrespective of their literal
meanings.
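A short sketch of the example above: it builds the six one-hot vectors and checks that the dot product between any two different words is zero, which is what "no similarity" means here. The helper names `one_hot` and `dot` are illustrative.

```python
vocabulary = ["you", "can", "scale", "grow", "your", "business"]
index_of = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    # A vector of zeros with a single 1 at the word's position.
    vec = [0] * len(vocabulary)
    vec[index_of[word]] = 1
    return vec


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


for word in vocabulary:
    print(word, one_hot(word))

# "scale" and "grow" are close in meaning, yet their vectors are orthogonal.
print(dot(one_hot("scale"), one_hot("grow")))
```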
DISADVANTAGES

1) Sparsity: a single sentence creates a matrix of
size n*m, where n is the length of the sentence and m is the
number of unique words in the document, and almost all
of its values are zero (each row holds a single 1).
2) No fixed size: each document is of a
different length, which creates inputs of different
sizes that cannot be fed to the model directly.
3) Does not capture semantics: the core idea
is to convert text into numbers in a way that
preserves the actual meaning of a
sentence, and one-hot encoding does not do this.
3. COUNT VECTOR
A count vectorizer first tokenizes the text (tokenization
means breaking down a sentence into words) and
performs preprocessing tasks like converting
all words to lowercase, removing special
characters, etc.
An encoding vector is then returned with the
length of the entire vocabulary (all words) and
an integer count for the number of times each word
occurs in the sentence.

Let us understand this using a simple example.
D1: He is a lazy boy. She is also lazy.
D2: Neeraj is a lazy person.
The dictionary created may be a list of unique tokens (words)
in the corpus = [‘He’, ‘She’, ‘lazy’, ‘boy’, ‘Neeraj’, ‘person’]
(common words such as ‘is’, ‘a’ and ‘also’ are left out).
Here, D = 2 documents and N = 6 tokens, so the count matrix M
has size D X N = 2 X 6 and is represented as:

      He  She  lazy  boy  Neeraj  person
D1     1    1     2    1       0       0
D2     0    0     1    0       1       1

Now, a column can also be understood as the word vector for the corresponding
word in the matrix M. For example, the word vector for ‘lazy’ in the above
matrix is [2,1] and so on.

Here, the rows correspond to the documents in the corpus and the columns
correspond to the tokens in the dictionary.

The second row in the above matrix may be read as: D2 contains ‘lazy’
once, ‘Neeraj’ once and ‘person’ once.
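The count matrix above can be reproduced with a few lines of Python. This is a minimal sketch that uses the dictionary exactly as given on the slide; the `count_vector` helper is an illustrative name, and matching is case-sensitive so that ‘He’ and ‘She’ are counted as written.

```python
import re
from collections import Counter

documents = {
    "D1": "He is a lazy boy. She is also lazy.",
    "D2": "Neeraj is a lazy person.",
}

# Dictionary of tokens as listed in the example (common words omitted).
dictionary = ["He", "She", "lazy", "boy", "Neeraj", "person"]


def count_vector(text):
    # Count every word, then read off the counts in dictionary order.
    counts = Counter(re.findall(r"\w+", text))
    return [counts[token] for token in dictionary]


count_matrix = {name: count_vector(text) for name, text in documents.items()}
for name, row in count_matrix.items():
    print(name, row)

# A column of the matrix is the word vector of that token.
lazy_vector = [row[dictionary.index("lazy")] for row in count_matrix.values()]
print(lazy_vector)
```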
As another example, consider the following text:
My name is XYZ. Firstly, I completed my B.E. in
2019 from Gujarat Technology University. I like
playing cricket and reading books. Also, I am
from Amreli, which is located in Gujarat.
[Figure: count-vector representation of the text above]

Problem:

The way the dictionary is prepared.

Why? Because in real-world applications we might have
a corpus which contains millions of documents. And
with millions of documents, we can extract hundreds of
millions of unique words. So basically, a matrix
prepared like the one above will be very sparse
and inefficient for any computation. An alternative
to using every unique word as a dictionary element
would be to pick, say, the top 10,000 words based on
frequency and then prepare a dictionary.
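A small sketch of the frequency cut-off described above. The toy documents and the `build_dictionary` helper are hypothetical; in practice `top_k` would be something like 10,000 rather than 3.

```python
from collections import Counter


def build_dictionary(documents, top_k):
    # Count every token across the whole corpus ...
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    # ... and keep only the top_k most frequent ones.
    return [word for word, _ in counts.most_common(top_k)]


docs = ["the cat sat on the mat", "the dog sat", "the cat ran"]
top3 = build_dictionary(docs, top_k=3)
print(top3)
```

Words outside the dictionary are then simply ignored (or mapped to a shared "unknown" token) when the count matrix is built.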

TF-IDF VECTORIZATION
This is another method which is based on the
frequency method but it is different to the count
vectorization in the sense that it takes into account
not just the occurrence of a word in a single document
but in the entire corpus. So, what is the rationale
behind this? Let us try to understand.
Common words like ‘is’, ‘the’, ‘a’ etc. tend to appear
quite frequently in comparison to the words which
are important to a document.
For example, a document A on Lionel Messi is going
to contain more occurrences of the word “Messi” in
comparison to other documents. But common words
like “the” etc. are also going to be present in higher
frequency in almost every document.
Ideally, what we would want is to down weight
the common words occurring in almost all
documents and give more importance to words
that appear in a subset of documents.
TF-IDF works by penalizing these common words
by assigning them lower weights while giving
importance to words like Messi in a particular
document.
So, how exactly does TF-IDF work?
Consider the below sample table which gives the
count of terms(tokens/words) in two documents.

Now, let us define a few terms related to TF-IDF.
TF = (Number of times term t appears in a
document)/(Number of terms in the document)
So, TF(This, Document1) = 1/(1+1+2+4) = 1/8
TF(This, Document2) = 1/(1+2+1+1) = 1/5
EX: A document about Messi should contain the word
‘Messi’ many times.

IDF = log(N/n), where N is the number of
documents, n is the number of documents a
term t has appeared in, and log is the base-10 logarithm.
So, IDF(This) = log(2/2) = 0.
So, how do we explain the reasoning behind IDF?
Ideally, if a word has appeared in all the documents,
then probably that word is not relevant to a
particular document. But if it has appeared in a
subset of documents then probably the word is of
some relevance to the documents it is present in.
Let us compute IDF for the word ‘Messi’.
IDF(Messi) = log(2/1) = 0.301.
Now, let us compare the TF-IDF for a common
word ‘This’ and a word ‘Messi’ which seems to be
of relevance to Document 1.
TF-IDF(This, Document1) = (1/8) * (0) = 0
TF-IDF(This, Document2) = (1/5) * (0) = 0
TF-IDF(Messi, Document1) = (4/8) * 0.301 = 0.15

As you can see, for Document1 the TF-IDF method
heavily penalizes the word ‘This’ but assigns
greater weight to ‘Messi’. So, this may be
understood as ‘Messi’ being an important word for
Document1 in the context of the entire corpus.
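The arithmetic above can be checked with a short sketch. The count table itself is not reproduced in the text, so the `term_counts` below are an assumption: only the totals (8 and 5 terms), the count of ‘This’ (1 in each document) and the count of ‘Messi’ (4 in Document1) come from the slides, while the remaining term names are placeholders.

```python
import math

# Assumed term counts, consistent with the totals used on the slides.
term_counts = {
    "Document1": {"This": 1, "is": 1, "about": 2, "Messi": 4},
    "Document2": {"This": 1, "is": 2, "about": 1, "TF-IDF": 1},
}


def tf(term, doc):
    counts = term_counts[doc]
    return counts.get(term, 0) / sum(counts.values())


def idf(term):
    # N documents in total, n of them contain the term; base-10 log.
    n = sum(1 for counts in term_counts.values() if term in counts)
    return math.log10(len(term_counts) / n)


def tf_idf(term, doc):
    return tf(term, doc) * idf(term)


print(tf_idf("This", "Document1"))            # common word
print(round(tf_idf("Messi", "Document1"), 2))  # document-specific word
```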
CO-OCCURRENCE MATRIX WITH A FIXED CONTEXT WINDOW

Similar words tend to occur together and will
have similar contexts.
The matrix is symmetric (it is irrelevant whether the
context is to the left or to the right).
Let’s consider the following example for a better
understanding:
Example: I like deep learning. I like NLP. I
enjoy flying.
From the above corpus, the list of
unique words present is as follows:
Dictionary: [ 'I', 'like', 'enjoy', 'deep',
'learning', 'NLP', 'flying', '.' ]
[Figure: co-occurrence matrix for the above corpus]
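A minimal sketch of how such a co-occurrence matrix can be computed for the example corpus. The window size is not stated in the text, so `window = 1` (one word to the left and one to the right) is an assumption; the full stop is treated as a token because it appears in the dictionary.

```python
from collections import defaultdict

tokens = "I like deep learning . I like NLP . I enjoy flying .".split()
window = 1  # assumed context window size

# cooc[a][b] = number of times b appears within the window around a.
cooc = defaultdict(lambda: defaultdict(int))
for i, word in enumerate(tokens):
    start = max(0, i - window)
    stop = min(len(tokens), i + window + 1)
    for j in range(start, stop):
        if j != i:
            cooc[word][tokens[j]] += 1

dictionary = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
for row in dictionary:
    print(row.ljust(9), [cooc[row][col] for col in dictionary])
```

Because every left neighbour is also somebody's right neighbour, the resulting matrix is symmetric, as the slide notes.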

2. PREDICTION BASED WORD EMBEDDING

Word2Vec
Skip Gram
CBOW

Word2Vec:
• Word2Vec is a popular natural language processing (NLP) technique for
generating word embeddings, which are vector representations of words in
a continuous vector space.
• The fundamental idea behind Word2Vec is to learn dense vector
representations of words based on their context in a large corpus of text.
• There are two main approaches to Word2Vec: Continuous Bag of Words
(CBOW) and Skip-gram.

1. Continuous Bag of Words (CBOW):
   1. CBOW predicts the current word based on its context, which consists
   of the words before and after it.
   2. The model tries to minimize the difference between the predicted word
   and the actual word, given its context.

2. Skip-gram:
   1. Skip-gram predicts the context words based on the current word.
   2. The model tries to minimize the difference between the predicted
   context words and the actual context words for a given word.
SKIP GRAM MODEL
The skip-gram model is a method for learning
word embeddings, which are continuous,
dense, and low-dimensional representations of
words in a vocabulary.

It is trained using large amounts of
unstructured text data.

It captures the context and semantic similarity
between words.
The skip-gram model is a way of teaching a
computer to understand the meaning of words
based on the context they are used in.
An example would be training a computer to
understand the word "dog" by looking at
sentences where "dog" appears and seeing the
words that come before and after it.
By doing this, the computer will be able to
understand how the word "dog" is commonly used
and will be able to use that understanding in
other ways.

The dog fetched the ball.

If you are trying to train a skip-gram model for
the word "dog", the goal of the model is to predict
the context words "the" and "fetched" given the
input word "dog". So, the training data for the
model would be pairs of the form (input word =
"dog", context word = "the"), (input word = "dog",
context word = "fetched").
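The pairs above can be generated mechanically. The sketch below assumes a context window of one word on each side (the window size is not stated on the slide); `skipgram_pairs` is an illustrative helper name.

```python
def skipgram_pairs(tokens, window):
    # For every position, pair the input word with each word
    # that lies within `window` positions of it.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the dog fetched the ball".split()
pairs = skipgram_pairs(tokens, window=1)

dog_pairs = [p for p in pairs if p[0] == "dog"]
print(dog_pairs)
```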

ARCHITECTURE OF SKIP-GRAM MODEL
The architecture of the skip-gram model consists
of an input layer, a hidden layer, and an
output layer.
The input layer takes the target (input) word, and
the output layer predicts the context words.
The hidden layer represents the embedding of the
input word learned during training.
The skip-gram model uses a feedforward neural
network with a single hidden layer, as shown in
the diagram below:
Input Layer --> Hidden Layer --> Output Layer

The input and output layers are connected to the
hidden layer through weights, which are adjusted during
training to minimize the prediction error. The
skip-gram model is commonly trained with a negative
sampling objective function to optimize the weights and
learn the embeddings.

The skip-gram model is a method for learning word
embeddings that captures the context and
semantic similarity between words in a vocabulary.
It is trained using a feedforward neural network with
a single hidden layer and is widely used in NLP
tasks.
To implement a skip-gram model for word embeddings, you will
need a large corpus of text in a single language and tools
for preprocessing and tokenizing the text. You can download
and access the text using the Natural Language Toolkit
(NLTK) library.

We can use TensorFlow with the Keras API to build and train
the model. A skip-gram generator can create training data for
the model in pairs of words (the target word and the context
word) and labels indicating whether the context word appears
within a fixed window size of the target word in the input text.

The model architecture should include embedding layers for
the target and context words and a dense layer with a sigmoid
activation function to predict the probability of the context
word appearing within a fixed window of the target word.

The model can then be compiled with a loss
function and an optimizer, and fit on the skip-grams. Once the
model is trained, the word embeddings can be extracted from it.
Skip-gram is used to predict the context words for
a given target word. It is the reverse of the CBOW
algorithm: here, the target word is the input while
the context words are the output. Since there is more than
one context word to be predicted, this
problem is more difficult.

The word sat will be given and we’ll try to predict
the words cat and mat, at positions -1 and 3 respectively,
given that sat is at position 0. We do not predict common
or stop words such as ‘the’.

As we can see, w(t) is the target word, i.e. the input
given. There is one hidden layer, which performs
the dot product between the weight matrix and the
input vector w(t).

No activation function is used in the hidden layer.
The result of the dot product at the hidden
layer is passed to the output layer.

The output layer computes the dot product between
the output vector of the hidden layer and the
weight matrix of the output layer.

Then we apply the softmax activation function to
compute the probability of words appearing
in the context of w(t) at a given context location.

2. The word w(t) is passed to the hidden layer
from |v| neurons.

3. The hidden layer performs the dot product between the
weight matrix W[|v|, N] and the input vector
w(t). From this, we can conclude that the (t)th row of
W[|v|, N] will be the output (H[1, N]).

4. Remember there is no activation function used
at the hidden layer, so H[1, N] will be passed
directly to the output layer.

5. The output layer will apply the dot product between
H[1, N] and W’[N, |v|] and will give us the
vector U.
6. Now, to find the probability of each word, we apply
the softmax function to U. The resulting vector is compared
with the one-hot encoding of the actual context word.

7. The word with the highest probability is the result,
and if the predicted word for a given context position
is wrong then we’ll use backpropagation to modify
our weight matrices W and W’.

These steps will be executed for each word w(t) present
in the vocabulary, and each word w(t) will be passed k
times. So, we can see that forward propagation will
be processed |v|*k times in each epoch.
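The forward pass in steps 2 to 6 can be sketched in plain Python. The sizes `V = 5` and `N = 3`, the random initial weights and the names `W_out` (standing for W’) and `forward` are all assumptions for illustration; no training is performed here.

```python
import math
import random

random.seed(0)
V, N = 5, 3  # assumed vocabulary size |v| and embedding size N

# W has shape [|v|, N]; W_out (W' in the text) has shape [N, |v|].
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def softmax(u):
    m = max(u)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in u]
    total = sum(exps)
    return [e / total for e in exps]


def forward(t):
    # One-hot input vector for the word with index t.
    x = [1 if i == t else 0 for i in range(V)]
    # Hidden layer: x . W, with no activation function.
    H = [sum(x[i] * W[i][k] for i in range(V)) for k in range(N)]
    # Output layer: H . W_out gives the score vector U.
    U = [sum(H[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    return H, U, softmax(U)


H, U, probs = forward(2)
print(H)      # equals row 2 of W
print(probs)  # one probability per vocabulary word
```

Because the input is one-hot, the product in the hidden layer only selects a row of W, which is why that row serves as the word's embedding.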

PROBABILITY FUNCTION

w(c,j) is the jth word predicted on the cth context
position;
w(O,c) is the actual word present on the cth
context position;
w(I) is the only input word; and u(c,j) is the jth
value in the U vector when predicting the word
for the cth context position.
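The formula itself did not survive in the text. Assuming the standard softmax output of the skip-gram model, and using the symbols defined above, it can be written as:

```latex
p\bigl(w_{c,j} = w_{O,c} \mid w_I\bigr)
  = \frac{\exp\bigl(u_{c,j}\bigr)}{\sum_{j'=1}^{|v|} \exp\bigl(u_{c,j'}\bigr)}
```

Here the sum in the denominator runs over all |v| words of the vocabulary, so the probabilities for each context position add up to 1.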

CBOW (CONTINUOUS BAG-OF-WORDS)
The continuous bag-of-words (CBOW) model is a
neural network for natural language processing tasks
such as language translation and text classification.
It predicts a target word based on the context of
the surrounding words and is trained on a large
dataset of text using an optimization algorithm such as
stochastic gradient descent.
Once trained, the CBOW model generates
numerical vectors, known as word embeddings,
which capture the semantics of words in a continuous
vector space and can be used in various NLP tasks.
It is often combined with other techniques and models,
such as the skip-gram model, and can be implemented
using libraries like gensim in Python.
The way CBOW works is that it predicts
the probability of a word given a context. A
context may be a single word or a group of words.
But for simplicity, I will take a single context
word and try to predict a single target word.

Suppose we have a corpus C = “Hey, this is
sample corpus using only one context word.” and
we have defined a context window of 1. This
corpus may be converted into a training set for a
CBOW model as follows. The input is shown
below. The matrix on the right in the below image
contains the one-hot encoded form of the input on
the left.
The target for a single datapoint, say Datapoint 4, is shown below:

Hey  this  is  sample  corpus  using  only  one  context  word
  0     0   0       1       0      0     0    0        0     0
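A minimal sketch of how the corpus above can be turned into CBOW training data with a context window of 1. The order in which the slide numbers its datapoints is not recoverable from the text, so the snippet only shows that the pair with target "sample" exists and that its one-hot target matches the row above; `cbow_pairs` and `one_hot` are illustrative names.

```python
import re

corpus = "Hey, this is sample corpus using only one context word."
tokens = re.findall(r"\w+", corpus)

# Vocabulary in order of first appearance (10 words here).
vocab = list(dict.fromkeys(tokens))


def one_hot(word):
    return [1 if w == word else 0 for w in vocab]


def cbow_pairs(tokens, window):
    # Each pair is (context word, target word).
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((tokens[j], target))
    return pairs


pairs = cbow_pairs(tokens, window=1)
print(pairs[:4])
print(one_hot("sample"))
```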
The matrix shown in the above image is sent
into a shallow neural network with three layers:
an input layer, a hidden layer and an output
layer.

The output layer is a softmax layer, which
normalizes the outputs so that the probabilities
sum to 1.

Now let us see how the forward propagation will
work to calculate the hidden layer activation.

The input layer and the target are both one-hot
encoded, of size [1 X V]. Here V=10 in the above
example.

There are two sets of weights: one between the input
and the hidden layer, and a second between the hidden and
output layer.

Input-Hidden layer matrix size = [V X N],
Hidden-Output layer matrix size = [N X V], where N is
the number of dimensions we choose to represent our
word in. It is arbitrary and a hyper-parameter of the
neural network. Also, N is the number of neurons in
the hidden layer. Here, N=4.

There is no activation function between the layers
(more specifically, the activation is linear).
The input is multiplied by the input-hidden
weights; the result is called the hidden activation. It is simply
the corresponding row of the input-hidden matrix,
copied.

The hidden activation gets multiplied by the hidden-output
weights and the output is calculated.

The error between output and target is calculated
and propagated back to re-adjust the weights.

The weight between the hidden layer and the
output layer is taken as the word vector
representation of the word.
We saw the above steps for a single context word.
Now, what about if we have multiple context
words? The image below describes the
architecture for multiple context words.

The image above takes 3 context words and
predicts the probability of a target word. The
input can be assumed as taking three one-hot
encoded vectors in the input layer as shown
above in red, blue and green.

So, the input layer will have 3 [1 X V] vectors in
the input as shown above and 1 [1 X V] vector in the
output layer. The rest of the architecture is the same as
for a 1-context CBOW.

The steps remain the same, only the calculation of
hidden activation changes. Instead of just copying the
corresponding rows of the input-hidden weight
matrix to the hidden layer, an average is taken over
all the corresponding rows of the matrix.

We can understand this with the above figure. The
average vector calculated becomes the hidden
activation.

So, if we have three context words for a single target
word, we will have three initial hidden activations
which are then averaged element-wise to obtain the
final activation.
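The difference between the single-context and multi-context hidden activation can be sketched as follows. The weights are toy values chosen so the result is easy to check by eye (an assumption, not trained values); V = 10 and N = 4 follow the example in the text.

```python
V, N = 10, 4

# Toy input-hidden weight matrix of size [V x N]: row i is [i, i+0.1, i+0.2, i+0.3].
W = [[i + k / 10 for k in range(N)] for i in range(V)]


def hidden_activation(context_indices):
    # Pick the rows of W for the context words and average them element-wise.
    rows = [W[i] for i in context_indices]
    return [sum(column) / len(rows) for column in zip(*rows)]


single = hidden_activation([3])         # one context word: the row is copied
average = hidden_activation([0, 1, 2])  # three context words: rows are averaged

print(single)
print(average)
```

With one context word the function returns the row unchanged, so the single-context case is just a special case of the averaging rule.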

In both the single-context and the multiple-context
case, I have shown the images only up to the calculation of
the hidden activations, since this is the part where
CBOW differs from a simple MLP network.
Disadvantages of CBOW:

CBOW takes the average of the context of a word (as
seen above in the calculation of the hidden activation).

For example, Apple can be both a fruit and a
company, but CBOW takes an average of both
contexts and places the word in between the clusters for fruits
and companies.

Training a CBOW model from scratch can take forever if
not properly optimized.
Neural Networks

In deep learning, all problems are generally classified into two types:

▪ Fixed topological structure: For images having static data, with use
cases such as image classification. These are handled by convolutional
neural networks (CNNs).

▪ Sequential data: For text/audio with dynamic data, in tasks related to
text generation and voice recognition. These are handled by recurrent
neural networks (RNNs).

Neural Networks

RNNs and LSTM networks have applications in diverse fields, including:

Chatbots
Sequential pattern identification
Image/handwriting detection
Video and audio classification
Sentiment analysis
Time series modeling in finance
Recurrent Neural Networks
RNNs have varied sets of use cases and can implement a set of multiple
smaller programs, with each painting a separate picture on its own and
all learning in parallel, to finally reveal the intricate effect of the
collaboration of all such small programs.

RNNs are capable of performing such operations for two principal
reasons:

Hidden states, being distributed by nature, store a lot of past
information and pass it on efficiently.

Hidden states are updated by nonlinear methods.

What Is Recurrence?

Recurrence is a recursive process in which a recurring function is called
at each step to model the sets of temporal data.

What is temporal data?

-> Any unit of data that is dependent on the previous units of the data,
particularly sequential data.

Recurrent Neural Networks-Applications

Any time series problem, like predicting the prices of stocks in a
particular month, can be solved using an RNN.

Text mining and sentiment analysis can be carried out using an RNN
for Natural Language Processing (NLP).

Given an input in one language, RNNs can be used to translate the
input into different languages as output.
Differences Between Feedforward and
Recurrent Neural Networks

Following are the main limitations of feedforward neural networks:

• Unsuitable for sequences, time series data, video streaming, stock
data, etc.

• Do not bring a memory factor into modeling.

A feedforward neural network takes decisions based only on the
current input, whereas
an RNN takes decisions based on the current and previous inputs and
makes sure that the connections are built across the hidden layers as
well.

RNN
A Recurrent Neural Network (RNN) is a type of neural
network where the output from the previous step is fed as input
to the current step.

The main and most important feature of an RNN is its hidden state,
which remembers some information about a sequence.

An RNN has a “memory” which retains information
about what has been calculated so far.
It uses the same parameters for each input, as it performs the
same task on all the inputs or hidden layers to produce the
output.
How RNN works

❖ Suppose there is a deeper network with one
input layer, three hidden layers and one
output layer.
❖ Then, like other neural networks, each
hidden layer will have its own set of weights
and biases,
▪ let’s say (w1, b1) for the first hidden layer,
(w2, b2) for the second hidden
layer and (w3, b3) for the third hidden layer.
❖ This means that each of these layers is
independent of the others, i.e. they do not
memorize the previous outputs.
❖ RNN converts the independent activations
into dependent activations
▪ by providing the same weights and
biases to all the layers,
▪ thus reducing the complexity of
increasing parameters and memorizing
each previous output by giving each
output as input to the next hidden
layer.
❖ Hence these three layers can be joined
together into a single recurrent layer, such that
the weights and biases of all the hidden layers
are the same.


❖ Formula for calculating current state:

h_t = f(h_{t-1}, x_t)

Where:
h_t -> current state
h_{t-1} -> previous state
x_t -> input state

❖ Formula for applying Activation function (tanh):

h_t = tanh(W_hh · h_{t-1} + W_xh · x_t)

Where:
W_hh -> weight at recurrent neuron
W_xh -> weight at input neuron

❖ Formula for calculating output:

y_t = W_hy · h_t

Where:
y_t -> output
W_hy -> weight at output layer
How RNN works

The nodes in different layers of the neural network are compressed to
form a single layer of recurrent neural networks. A, B, and C are the
parameters of the network.

Training through RNN
❖ A single time step of the input is provided to the network.
❖ Then its current state is calculated using the current input and the
previous state.
❖ The current h_t becomes h_{t-1} for the next time step.
❖ One can go through as many time steps as the problem requires and join
the information from all the previous states.
❖ Once all the time steps are completed, the final current state is
used to calculate the output.
❖ The output is then compared to the actual output, i.e. the target
output, and the error is generated.
❖ The error is then back-propagated through the network to update the
weights, and hence the network (RNN) is trained.
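The training steps above can be illustrated with a tiny scalar RNN. This is only a sketch under stated assumptions: the weights, inputs and target are made-up values, and the gradient is estimated by finite differences as a simple stand-in for back-propagation through time, which a real implementation would use.

```python
import math


def forward(inputs, w_hh, w_xh, w_hy):
    h = 0.0
    for x_t in inputs:
        # The current state becomes the previous state of the next step.
        h = math.tanh(w_hh * h + w_xh * x_t)
    # The final state is used to calculate the output.
    return w_hy * h


def loss(params, inputs, target):
    prediction = forward(inputs, *params)
    return 0.5 * (prediction - target) ** 2


def numerical_gradient(params, inputs, target, eps=1e-6):
    # Finite-difference estimate (stand-in for back-propagation).
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, inputs, target) - loss(down, inputs, target)) / (2 * eps))
    return grads


params = [0.5, 0.5, 0.5]          # w_hh, w_xh, w_hy (illustrative)
inputs, target = [1.0, 0.0, 1.0], 0.8

initial_loss = loss(params, inputs, target)
for _ in range(200):
    grads = numerical_gradient(params, inputs, target)
    params = [p - 0.1 * g for p, g in zip(params, grads)]
final_loss = loss(params, inputs, target)

print(initial_loss, final_loss)
```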
Basics of RNN- activation functions

Linear Activation functions: A linear neuron takes a linear combination
of the weighted inputs, and the output can take any value from
-infinity to infinity.

Nonlinear Activation functions: These are the most used ones, and most of
them restrict the output to some range:

❖ Sigmoid or Logit Activation Function

❖ Softmax

❖ Tanh

❖ ReLU (Rectified Linear Unit)

Various non-linear activations in use

Sigmoid:

• computationally expensive (it evaluates an exponential),
• causes the vanishing gradient problem
• not zero-centered

NOTE: Sigmoid is used for binary classification problems.

Softmax:

• a more generalized form of the sigmoid
• used as the final layer in classification models.
• Softmax is similar to sigmoid, but it calculates the probabilities of
the event over ‘n’ different classes, which is useful to determine the
target in multiclass classification problems.

Tanh:

• compared to sigmoid, it solves just one problem: it is zero-centred.
• The range of the tanh function is (-1, 1), and the rest remains the
same as sigmoid.

ReLU (Rectified Linear Unit):

• widely used activation function,
especially with convolutional neural
networks.
• easy to compute,
• does not saturate for positive inputs, and
• so largely avoids the vanishing
gradient problem.

NOTE: Since the output is zero for all negative inputs,
it can cause some nodes to die completely and not learn
anything.
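The four activation functions discussed above, written out in plain Python as a reference sketch. The last lines illustrate the remark that softmax generalizes the sigmoid: for two classes with scores x and 0, the first softmax output equals sigmoid(x).

```python
import math


def sigmoid(x):
    # Output in (0, 1); not zero-centred.
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Output in (-1, 1); zero-centred.
    return math.tanh(x)


def relu(x):
    # Zero for all negative inputs, identity for positive ones.
    return max(0.0, x)


def softmax(scores):
    # Probabilities over n classes that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), tanh(0.0), relu(-3.0), relu(2.0))
print(softmax([1.0, 2.0, 3.0]))

# Two-class softmax reduces to the sigmoid.
x = 1.5
print(softmax([x, 0.0])[0], sigmoid(x))
```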

By Dr. Meenakshi Malhotra


Basics of Recurrent Neural Networks
An RNN requires a 3-D tensor as input, which can be broken
into the components shown (typically batch size, number of
time steps, and number of features per time step).

[Figure: RNN model architecture to compute the number of 1s
in a 20-length sequence of binary digits]
Types of Recurrent Neural Networks

1. One to One
2. One to Many
3. Many to One
4. Many to Many

1. One to One

This type of neural network is known as
the Vanilla Neural Network.
It's used for general machine learning
problems which have a single input and a
single output.

2. One to Many

This type of neural network
has a single input and
multiple outputs.
An example of this is
image captioning.

3. Many to One

This RNN takes a sequence of
inputs and generates a single
output.
Sentiment analysis is a good
example of this kind of
network, where a given
sentence can be classified as
expressing positive or
negative sentiment.

4. Many to Many

This RNN takes a sequence of
inputs and generates a
sequence of outputs.
Machine translation is one of
the examples.

Two Issues of Standard RNNs

❖ Vanishing Gradient Problem

❖ Exploding Gradient Problem

This problem arises when large error gradients accumulate, resulting in very large
updates to the neural network model weights during the training process.
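A rough numerical illustration of both issues, under a simplifying assumption: when the error is propagated back through many time steps, the gradient is multiplied by a similar factor at every step. The factors 0.5 and 1.5 and the 50 steps are made-up values.

```python
def gradient_scale(factor, steps):
    # Repeatedly multiply by the same per-step factor,
    # as happens when an error is propagated back through time.
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


vanishing = gradient_scale(0.5, 50)  # factor below 1: gradient shrinks towards 0
exploding = gradient_scale(1.5, 50)  # factor above 1: gradient blows up

print(vanishing)
print(exploding)
```

A factor slightly below 1 makes early time steps contribute almost nothing to the update, while a factor slightly above 1 produces the very large weight updates described above.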
Solution to Gradient Problem

RNN

All RNNs are in the form of a chain of repeating modules of a neural
network.
In standard RNNs, this repeating module will have a very simple structure,
such as a single tanh layer.
LSTM

LSTMs also have a chain-like structure, but the repeating module has a
different structure.
Instead of having a single neural network layer, there are four layers
interacting in a special way.

Workings of LSTMs in RNN
Step 1: Decide How Much Past Data It Should Remember

• The sigmoid function determines this.
• It looks at the previous state (h_{t-1})
along with the current input x_t and
computes the function.

Step 2: Decide How Much This Unit Adds to the Current State

• The sigmoid function decides
which values to let through (between 0 and 1).
• The tanh function gives weightage to the
values which are passed, deciding their
level of importance (-1 to 1).
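The two steps can be sketched with scalar values. The formulas used here are the standard LSTM gate equations (a sigmoid gate and a tanh candidate, each a function of h_{t-1} and x_t), which is an assumption since the slide's own formulas are not in the text; all weights and the helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forget_gate(h_prev, x_t, w_f, b_f):
    # Step 1: how much of the past to remember (between 0 and 1).
    # w_f = (weight on h_prev, weight on x_t)
    return sigmoid(w_f[0] * h_prev + w_f[1] * x_t + b_f)


def input_contribution(h_prev, x_t, w_i, b_i, w_c, b_c):
    # Step 2: the sigmoid decides which values to let through,
    # the tanh weighs the candidate values (between -1 and 1).
    i_t = sigmoid(w_i[0] * h_prev + w_i[1] * x_t + b_i)
    candidate = math.tanh(w_c[0] * h_prev + w_c[1] * x_t + b_c)
    return i_t * candidate


f_t = forget_gate(0.0, 0.0, (1.0, 1.0), 0.0)
added = input_contribution(0.2, 1.0, (0.5, 0.5), 0.0, (1.0, 1.0), 0.0)
print(f_t, added)
```

In a full LSTM cell these two quantities are combined with the previous cell state; that update is left out here because it lies beyond the two steps described above.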
Conceptualized

Pre-trained word embeddings, such as Word2vec and GloVe,
compute a single static representation for each word.

Approaches to learning representations from unlabeled text, i.e.
word embeddings, include non-neural approaches and neural
approaches.

Using pre-trained word embeddings instead of training them from
scratch has been shown to yield significant improvements in performance.