
Rishabh Indoria

21F3001823
Deep Learning December 6, 2024

1 History of Deep Learning


1.1 Biological Neurons
• Reticular Theory: Nervous system is a single continuous network as opposed to a network of many
discrete cells.
• Staining Technique: A chemical reaction that allows nervous tissue to be examined in much greater detail.
• Neuron Doctrine: Nervous system is actually made up of discrete individual cells forming a network.

1.2 From Spring to Winter of AI


• McCulloch Pitts Neuron: A highly simplified model of the neuron.
• Perceptron: ”The perceptron may eventually be able to learn, make decisions, and translate languages.”
• ”The embryo of an electric computer that the Navy expects will be able to walk, talk, see, write, reproduce
itself and be conscious of its existence.”
• Book ”Perceptrons” by Minsky and Papert outlined the limits of what perceptrons could do.
• Backpropagation: First used in the context of artificial neural networks.
• Universal Approximation Theorem: A multilayered network of neurons with a single hidden layer can
be used to approximate any continuous function to any desired precision.

(a) Simple Model of the Neuron (b) Model of the Perceptron

Figure 1: First models of neurons

1.3 The Deep Revival


• Unsupervised Pre-Training: Described an effective way of initializing the weights that allows deep autoencoder networks to learn a low-dimensional representation of data.

1.4 From Cats to Convolutional Neural Networks


• Hubel and Wiesel Experiment: Experimentally showed that each neuron has a fixed receptive field, i.e. a neuron will fire only in response to visual stimuli in a specific region of the visual space (motivation for CNNs).

1.5 Faster, higher, stronger


• Better Optimization Methods: Faster convergence, better accuracies. Examples: AdaGrad, RMSProp,
Adam, Nadam, etc.
• Better Activation Functions: Many new functions have been proposed, leading to better convergence and performance. Example: tanh, ReLU, Leaky ReLU, etc.

1.6 The Curious Case of Sequences
• Sequences: Each unit in the sequence interacts with other units. Need models to capture this interaction.
Example: Speech, Videos, etc.
• Hopfield Network: Content-addressable memory systems for storing and retrieving patterns.
• Jordan Network: The output state of each time step is fed to the next time step, thereby allowing
interactions between time steps in the sequence.
• Elman Network: The hidden state of each time step is fed to the next time step, thereby allowing
interactions between time steps in the sequence.
• Very hard to train RNNs.
• Long Short Term Memory: Solved complex long-time-lag tasks that could never be solved before (addressed the problem of vanishing gradients).
• Sequence to Sequence Models: Introduced attention! However, they were unable to capture the contextual information of a sentence.
• Transformers: Introduced a paradigm shift in the field of NLP. GPT and BERT are the most commonly
used transformer based architectures.

1.7 Beating humans at their own game


• Human-level control through deep reinforcement learning for playing Atari Game
• OpenAI Gym: Toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball.
• OpenAI Gym Retro: A platform for reinforcement learning research on games, which contains 1,000
games across a variety of backing emulators.
• MuZero: Masters Go, chess, shogi and Atari without needing to be told the rules, thanks to its ability to
plan winning strategies in unknown environments.
• Player of Games (PoG): A general-purpose algorithm that unifies all previous approaches. It learns to play both perfect- and imperfect-information games.

1.8 The rise of Transformers


• Rule Based Systems: Initial Machine Translation systems used handcrafted rules and dictionaries to translate sentences between a few politically important language pairs.
• Statistical MT: The IBM Models for Machine Translation gave a boost to the idea of data driven statistical
NLP, probability based models.
• Neural MT: The introduction of seq2seq models and attention. Bigger, hungrier, better models.
• From Language to Vision: A vision model based as closely as possible on the Transformer architecture,
originally designed for text-based tasks.
• From Discrimination to Generation: Sample (generate) data from the learned probability distribution. Variational Auto Encoders (VAE), Generative Adversarial Networks (GAN), and Flow-based models achieve this.
• GANs don’t scale, are unstable and capture less diversity. Diffusion models are one of the alternatives to
GAN. Inspired by an idea from non-equilibrium thermodynamics.

1.9 Call for Sanity


• Paradox of Deep Learning: Deep learning works so well despite high capacity(susceptible to overfitting),
numerical instability(vanishing/exploding gradients), sharp minima(leads to overfitting), non-robustness.
• Interpretable Machine Learning: A Guide for Making Black Box Models Explainable by
Christoph Molnar.
• AI Audit challenge: AI systems must be evaluated for legal compliance, in particular laws protecting
people from illegal discrimination. This challenge seeks to broaden the tools available to people who want
to analyze and regulate them.

• Analog AI: Programmable resistors are the key building blocks in analog deep learning, just like transistors
are the core element of digital processors(faster computation).

1.10 The AI revolution in basic Science Research


• Protein-Folding Problem: Model proposed to predict protein structure, which would lead to better drug
development.

• Astronomy: Galaxy Evolution: Predict how a galaxy will look as it gets older.

1.11 Efficient Deep Learning


• Build models that run on small devices such as phones, i.e. deploy models on resource-constrained devices.

2 Multi Layered Network of Perceptrons


2.1 Biological Neurons
• The most fundamental unit of a deep neural network is called an artificial neuron.
• The inspiration comes from biology(more specifically, from the brain).
• biological neurons = neural cells = neural processing units

• Dendrite: Receives signals from other neurons


Synapse: Point of connection to other neurons
Soma: Processes the information
Axon: Transmits the output of this neuron.
• This massively parallel network also ensures that there is division of work. Each neuron may perform a certain role or respond to a certain stimulus. The neurons in the brain are arranged in a hierarchy.

(a) Biological Neuron (b) Artificial Neuron (c) A McCulloch Pitts Unit

Figure 2: Basic Neuron Structure

• McCulloch Pitts Neuron: Proposed a highly simplified computational model of the neuron. g aggregates the inputs, and the function f takes a decision based on this aggregation. The inputs can be excitatory or inhibitory.
Example: y = 0 if any x_i is inhibitory, else g(x_1, ..., x_n) = g(x) = ∑_{i=1}^{n} x_i
y = f(g(x)) = 1 if g(x) ≥ θ, else 0
θ is called the thresholding parameter.
• A circle at the end of an input indicates an inhibitory input: if any inhibitory input is 1 then the output will be 0.
• Linear separability(for boolean functions): There exists a line(plane) such that all inputs which produce a 1
lie on one side of the line(plane) and all inputs which produce a 0 lie on the other side of the line(plane).

• A single McCulloch Pitts Neuron can be used to represent boolean functions which are linearly separable.

• It can be trivially seen that θ = 0 for the Tautology(always ON) function.

(a) AND function (b) OR function (c) NOR function (d) NOT function

Figure 3: Basic Boolean functions

2.2 Perceptrons
• Classical Perceptron: Frank Rosenblatt proposed this model, which was refined and carefully analyzed
by Minsky and Papert.
• Main difference is the introduction of numerical weights for inputs and a mechanism for learning these
weights.
• y = 1 if ∑_{i=1}^{n} w_i × x_i ≥ θ, else 0.
If we take θ to the other side and represent w_0 = −θ and x_0 = 1, then we have
y = 1 if ∑_{i=0}^{n} w_i × x_i ≥ 0, else 0.

• w0 is called the bias as it represents the prior(prejudice).


• From the equation, it can be clearly seen that a perceptron separates the input space into two halves. A
single perceptron can only be used to implement linearly separable functions.

• The difference between this and the MP neuron is that now we have weights that can be learned, and the
inputs can be real values.

(a) Perceptron (b) Perceptron with w0

Figure 4: Classical Perceptrons

2.3 Perceptron Learning Algorithm


• Perceptron learning algorithm: Consider two vectors w = [w_0, ..., w_n] and x = [1, x_1, ..., x_n]; then w · x = w^T x is the dot product. We can rewrite the perceptron rule as y = 1 if w^T x ≥ 0, else 0.

• We are interested in finding the line w^T x = 0 which divides the input space into two halves. The angle α between w and any point x which lies on the line must be 90° because cos α = (w^T x) / (||w|| ||x||) = 0. Since the vector w is perpendicular to every point on the line, it is perpendicular to the line itself.

• For any point such that wT x > 0 we will have α < 90◦ and for any point such that wT x < 0 we will have
α > 90◦ .

Algorithm 1 Perceptron Learning Algorithm

1: P ← inputs with label 1;
2: N ← inputs with label 0;
3: Initialize w randomly
4: while !convergence do
5:    Pick random x ∈ P ∪ N;
6:    if x ∈ P and ∑_{i=0}^{n} w_i × x_i < 0 then
7:        w = w + x;
8:    end if
9:    if x ∈ N and ∑_{i=0}^{n} w_i × x_i ≥ 0 then
10:       w = w − x;
11:   end if
12: end while
13: ▷ the algorithm converges when all the inputs are classified correctly
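• A minimal NumPy sketch of Algorithm 1 (the toy data, the convergence check, and the iteration cap are illustrative assumptions, not part of the original pseudocode):

import numpy as np

def perceptron_train(P, N, max_iters=1000, seed=0):
    # P, N: arrays of shape (num_points, n) with labels 1 and 0 respectively.
    rng = np.random.default_rng(seed)
    # Prepend x0 = 1 to every point so that w0 plays the role of -theta.
    P = np.hstack([np.ones((len(P), 1)), P])
    N = np.hstack([np.ones((len(N), 1)), N])
    w = rng.normal(size=P.shape[1])                      # initialize w randomly
    for _ in range(max_iters):
        if np.all(P @ w >= 0) and np.all(N @ w < 0):     # all inputs classified correctly
            break
        idx = rng.integers(len(P) + len(N))              # pick a random x in P ∪ N
        if idx < len(P):
            x = P[idx]
            if w @ x < 0:
                w = w + x                                # x ∈ P misclassified: move w towards x
        else:
            x = N[idx - len(P)]
            if w @ x >= 0:
                w = w - x                                # x ∈ N misclassified: move w away from x
    return w

# Example: learn the (linearly separable) OR function.
w = perceptron_train(P=np.array([[0, 1], [1, 0], [1, 1]]), N=np.array([[0, 0]]))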

• For x ∈ P, if w^T x < 0 then the angle α between this x and the current w is greater than 90°, but we want it to be < 90°.
When we do w_new = w + x, α changes as follows:
cos α_new ∝ w_new^T x = (w + x)^T x = w^T x + x^T x, and since x^T x > 0,
cos α_new > cos α  ⟹  α_new < α
α_new becomes less than α, which is exactly what we wanted. We may not get < 90° in one shot, so we keep repeating the update. Similarly, it can be shown for x ∈ N.
• Definition: Two sets P and N of points in an n-dimensional space are called absolutely linearly separable if n + 1 real numbers w_0, ..., w_n exist such that every point (x_1, ..., x_n) ∈ P satisfies ∑_{i=1}^{n} w_i × x_i ≥ w_0 and every point (x_1, ..., x_n) ∈ N satisfies ∑_{i=1}^{n} w_i × x_i < w_0.
• Proposition: If the sets P and N are finite and linearly separable, the perceptron learning algorithm
updates the weight vector w a finite number of times. In other words: if the vectors in P and N are tested
cyclically one after the other, a weight vector w is found after a finite number of steps t which can separate
the two sets.
• Proof of Convergence: If x ∈ N , then −x ∈ P (∴ wT x < 0 =⇒ wT (−x) ≥ 0).
We can now consider only a single set P ′ = P ∪ N − and for every element p ∈ P ′ ensure that wT p ≥ 0.

Algorithm 2 Modified Perceptron Learning Algorithm

1: P ← inputs with label 1;
2: N ← inputs with label 0;
3: N⁻ ← negations of all points in N;
4: P′ ← P ∪ N⁻
5: Initialize w randomly
6: while !convergence do
7:    Pick random p ∈ P′;
8:    if ∑_{i=0}^{n} w_i × p_i < 0 then
9:        w = w + p;
10:   end if
11: end while
12: ▷ the algorithm converges when all the inputs are classified correctly

Further, we will normalize all the p's so that ||p|| = 1; notice this does not change anything in our step.
Let w* be the normalized solution vector (we know one exists, we just don't know its value).
Now suppose at some time step t we inspected the point p_i and found that w^T p_i < 0; then we make the correction w_{t+1} = w_t + p_i.

Let β be the angle between w* and w_{t+1}; then we have cos β = (w* · w_{t+1}) / ||w_{t+1}||.

Numerator = w* · w_{t+1} = w* · (w_t + p_i) = w* · w_t + w* · p_i ≥ w* · w_t + δ   (δ = min_i (w* · p_i))
≥ w* · (w_{t−1} + p_j) + δ ≥ w* · w_{t−1} + w* · p_j + δ ≥ w* · w_{t−1} + 2δ ≥ ... ≥ w* · w_0 + kδ   (by induction, after k corrections)

Denominator² = ||w_{t+1}||² = (w_t + p_i) · (w_t + p_i) = ||w_t||² + 2 w_t · p_i + ||p_i||² ≤ ||w_t||² + ||p_i||²   (since w_t · p_i < 0 whenever a correction is made)
≤ ||w_t||² + 1 ≤ (||w_{t−1}||² + 1) + 1 ≤ ... ≤ ||w_0||² + k

So, we have Numerator ≥ w* · w_0 + kδ and Denominator² ≤ ||w_0||² + k, which gives

cos β ≥ (w* · w_0 + kδ) / √(||w_0||² + k)

The right-hand side grows proportionally to √k.
As k (the number of corrections) increases, this bound can become arbitrarily large, but since cos β ≤ 1, k must be bounded.
Thus, there can only be a finite number of corrections (k) to w and the algorithm will converge!

2.4 Linearly Separable Boolean Function


• One simple example that is not linearly separable is XOR

x1   x2   XOR   condition on the perceptron
0    0    0     w_0 + ∑_{i=1}^{2} w_i x_i < 0
1    0    1     w_0 + ∑_{i=1}^{2} w_i x_i ≥ 0
0    1    1     w_0 + ∑_{i=1}^{2} w_i x_i ≥ 0
1    1    0     w_0 + ∑_{i=1}^{2} w_i x_i < 0

Table 1: XOR Truth Table

• Most real world data is not linearly separable and will always contain some outliers.
• How many boolean functions can we design from 2 inputs?

x1 x2 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16


0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
1 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
0 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

Table 2: Functions of 2 inputs

Out of these, only XOR and !XOR are not linearly separable.
• In general, we can have 2^(2^n) boolean functions of n inputs.

2.5 Representation Power of a Network of Perceptrons


• We will assume True = +1 and False = −1, we consider 2 inputs and 4 perceptrons with specific weights.
The bias of each perceptron is −2. Each of these perceptrons is connected to an output perceptron by
weights (which need to be learned). The output of this perceptron (y) is the output of this network.

Figure 5: Network of Perceptrons

red edge indicates w = −1, and blue edge indicates w = +1. x1 has weights −1, −1, +1, +1 and x2 has
weights −1, +1, −1, +1 from left to right respectively.

• The above network contains 3 layers.


The layer containing the inputs (x1 , x2 ) is called the input layer.
The middle layer containing the 4 perceptrons is called the hidden layer.
The final layer containing one output neuron is called the output layer.
The output of the 4 perceptrons in the hidden layer are denoted by h1 , h2 , h3 , h4 .
The red and blue edges are called layer 1 weights
w1 , w2 , w3 , w4 are called layer 2 weights.
• This network can be used to implement any boolean function linearly separable or not. Each perceptron in
the middle layer fires only for a specific input and no perceptrons fire for the same input.
x1   x2   XOR   h1   h2   h3   h4   ∑_{i=1}^{4} w_i h_i
0    0    0     1    0    0    0    w_1
1    0    1     0    1    0    0    w_2
0    1    1     0    0    1    0    w_3
1    1    0     0    0    0    1    w_4

Table 3: Truth Table for the Network

This results in four independent conditions.


• In case of 3 inputs we would now have 8 perceptrons in the hidden layer.

• Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing one hidden layer with 2^n perceptrons and one output layer containing 1 perceptron.
• Note: A network of 2^n + 1 perceptrons is not necessary but sufficient.
• Catch: As n increases, the number of perceptrons in the hidden layer increases exponentially.

• We care about boolean functions, because we can model real world examples into boolean functions or
classification problems.
• The network we saw is formally known as Multilayer Perceptron(MLP) or more appropriately ”Multilayered
Network of Perceptrons”.

3 Sigmoid Neurons
3.1 What are they?
• The thresholding logic used by a perceptron is very harsh! It is a characteristic of the perceptron function
itself which behaves like a step function.
• For most real world applications we would expect a smoother decision function which gradually changes
from 0 to 1.
• Instead we would use the sigmoid function

y = 1 / (1 + exp(−(ω_0 + ∑_{i=1}^{n} ω_i x_i)))

• We no longer see a sharp transition around the threshold ω0 . Also, y now takes any real value in [0, 1].

• This value can also be interpreted as probability.

3.2 Typical Supervised Machine Learning Setup


• Data: {x_i, y_i}_{i=1}^{n}; we have n data points where x_i is a vector in R^m. Assume y = f̂(x; θ).

• Model: Our approximation of the relation between x and y.


For example, ŷ = 1 / (1 + e^{−w^T x}), or ŷ = w^T x, or ŷ = x^T W x

• Learning algorithm: An algorithm for learning the parameters w of the model.
• Objective/Loss/Error function: To guide the learning algorithm. One possibility is ∑_{i=1}^{n} (ŷ_i − y_i)².
• The learning algorithm should aim to minimize the loss function.

3.3 Learning Parameters


• For ease of explanation we will take f(x) = 1 / (1 + e^{−(wx+b)}), i.e. a single scalar input with parameters w and b.

• Assume the training data is {x_i, y_i}_{i=1}^{N} → N pairs of (x, y).

• Training Objective: Find w and b such that L(w, b) = (1/N) ∑_{i=1}^{N} (y_i − f(x_i))² is minimized.
• Guess Work(Infeasible): Intuitively guess what should be the values of w and b. May never end and is
not feasible for writing algorithms.
• Update rule: θnew = θ + η∆θ, how to find ∆θ?
• Taylor Series: A way of approximating any continuously differentiable function L(w) using polynomials of degree n. The higher the degree, the better the approximation!

L(w) = L(w_0) + (L′(w_0)/1!)(w − w_0) + (L′′(w_0)/2!)(w − w_0)² + ...

• Linear approximation is the first-order approximation of the Taylor series. Quadratic approximation is the second-order approximation of the Taylor series.
• Key point: You can only do this for a very small ∆ = w − w0 .
• Let’s assume ∆θ = u; then from the Taylor series we have

L(θ + ηu) = L(θ) + ηu^T ∇_θ L(θ) + (η²/2!) u^T ∇²_θ L(θ) u + ... ≈ L(θ) + ηu^T ∇_θ L(θ)

• η is typically small, so η², η³, ... → 0
• Now, we want the new loss to be less than the current loss:
L(θ + ηu) − L(θ) < 0  ⟹  u^T ∇_θ L(θ) < 0

• Let β be the angle between u and ∇_θ L(θ), and let k = ∥u∥∥∇_θ L(θ)∥; then we can write

−1 ≤ cos(β) = (u^T ∇_θ L(θ)) / (∥u∥∥∇_θ L(θ)∥) ≤ 1  ⟹  −k ≤ u^T ∇_θ L(θ) ≤ k

• u^T ∇_θ L(θ) is most negative when cos(β) = −1, i.e. β = 180°


• Parameter Update Rule: From the above we can now say
wt+1 = wt − η∇wt
bt+1 = bt − η∇bt
• We can now write the gradient descent algorithm for our problem as follows

Algorithm 3 Gradient Descent


GradientDescent
1: t←0
2: max iterations ← 1000
3: w, b ← initialize randomly
4: while t < max iterations do
5: wt+1 ← wt − η∇wt
6: bt+1 ← bt − η∇bt
7: t← t+1
8: end while

• where ∇w and ∇b are defined for the sigmoid neuron as

∇w = ∂L(x)/∂w = (f(x) − y) · ∂/∂w (1 / (1 + e^{−(wx+b)})) = (f(x) − y) · f(x) · (1 − f(x)) · x

∇b = ∂L(x)/∂b = (f(x) − y) · f(x) · (1 − f(x))
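• A minimal sketch of Algorithm 3 for this single-input sigmoid neuron using the gradients above; the toy data, learning rate, and iteration count are assumptions made only for illustration:

import numpy as np

def f(x, w, b):
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

def gradient_descent(X, Y, eta=1.0, max_iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    w, b = rng.normal(), rng.normal()                # initialize randomly
    for _ in range(max_iterations):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):                       # accumulate gradients over the data
            err = f(x, w, b) - y
            dw += err * f(x, w, b) * (1 - f(x, w, b)) * x
            db += err * f(x, w, b) * (1 - f(x, w, b))
        w, b = w - eta * dw, b - eta * db            # w_{t+1} = w_t - eta * grad_w, same for b
    return w, b

w, b = gradient_descent(X=[0.5, 2.5], Y=[0.2, 0.9])  # toy data with two points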

3.4 Representation Power of a multilayer network of sigmoid neurons
• A multilayer network of neurons with a single hidden layer can be used to approximate any continuous
function to any desired precision.

4 Feed Forward Neural Networks


4.1 Structure of Feed Forward Neural Network
• The input to the network is an n−dimensional vector.
• The network contains L − 1 hidden layers having n neurons each. Value of n could be different in each layer.

• Finally, there is one output layer containing k neurons, corresponding to k classes.


• Each neuron in the hidden layer and output layer can be split into two parts: pre-activation and activation(ai
and hi are vectors).
• The input layer can be called the 0−th layer and the output layer can be called the L−th layer.

• Wi ∈ Rn×n and bi ∈ Rn are the weight and bias between layers i − 1 and i (0 < i < L).
• WL ∈ Rk×n and bL ∈ Rk are the weight and bias between the last hidden layer and the output layer.
• The pre-activation at layer i is given by

ai (x) = bi + Wi hi−1 (x)

• The activation at layer i is given by


hi (x) = g(ai (x))

• g is an element wise function and is called the activation function.


• The activation at the output layer is given by

f (x) = hL (x) = O(aL (x))

where O is the output activation function


• For ease of notation ai (x) = ai and hi (x) = hi .

• We also have h0 = x and hL = ŷ = fˆ(x)


• We can write ŷ as
ŷi = fˆ(xi ) = O(W3 g(W2 g(W1 xi + b1 ) + b2 ) + b3 )
This becomes our model as specified in 3.2.
• Parameters become θ = W1 , ..., WL , b1 , ..., bL
• We can now write our gradient descent algorithm more concisely as

Algorithm 4 Gradient Descent Modified

Gradient-Descent( )
1: t ← 0;
2: max_iterations ← 1000;
3: Initialize θ_0 = [W_1^0, ..., W_L^0, b_1^0, ..., b_L^0]
4: while t++ < max_iterations do
5:    θ_{t+1} ← θ_t − η∇θ_t
6: end while

where we have
θ = [W_1, ..., W_L, b_1, ..., b_L] and ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]^T

Figure 6: Feed Forward Network

4.2 Output functions and Loss functions


• Assume we are trying to predict values in the range of all real values, an appropriate loss function would be
mean squared error function.
L(θ) = (1/N) ∑_{i=1}^{N} ∑_{j=1}^{k} (ŷ_{ij} − y_{ij})²

• If we want to predict values in the range of real values, regression, then we can take the output function as
a linear function
f (x) = hL = O(aL ) = WO aL + bO

• Given that we are performing classification, then we can use cross entropy loss function. This is known as
negative log likelihood function.
−(1/N) ∑_{i=1}^{N} ∑_{j=1}^{k} (y_{ij} log(ŷ_{ij}) + (1 − y_{ij}) log(1 − ŷ_{ij}))

• Ensure that ŷ is a probability distribution, one appropriate output function can be the softmax function
ŷ_j = O(a_L)_j = e^{a_{L,j}} / ∑_{i=1}^{k} e^{a_{L,i}}

Real Values Probabilities


Output Activation Linear Softmax
Loss Function Squared Error Cross Entropy

Table 4: Choice of Output and Loss functions based on type of Output

4.3 Backpropagation
• Intuitively it can be written as

Figure 7: Backpropagation Intuition

• For further discussion, the output function is considered to be softmax and the loss function is considered
to be cross entropy.
• Gradient w.r.t output layer: We start with the part ”Talk to the output layer”.

L(θ) = − log ŷl (l = true class label)


∂L(θ)/∂ŷ_i = −1/ŷ_l if i = l, and 0 otherwise

so ∇_ŷ L(θ) = [0, ..., 0, −1/ŷ_l, 0, ..., 0]^T = −(1/ŷ_l) e_l

e_l is a k-dimensional vector whose l-th element is 1 and the rest are 0.


∂L(θ)/∂a_{L,i} = −(1/ŷ_l) ∂ŷ_l/∂a_{L,i}

Since ŷ_l is calculated using softmax, it depends on all the a_{L,i}:

ŷ_j = e^{a_{L,j}} / ∑_i e^{a_{L,i}}

∂ŷ_l/∂a_{L,i} = ŷ_l(1 − ŷ_l) if i = l, and −ŷ_l ŷ_i if i ≠ l
We can now write the entire derivative in one shot as
∂L(θ)/∂a_{L,i} = −(1/ŷ_l) ŷ_l (1_{l=i} − ŷ_i) = −(1_{l=i} − ŷ_i), i.e. ∇_{a_L} L(θ) = −(e_l − ŷ)

• Chain rule among multiple paths: If a function p(z) can be written as a function of intermediate results q_i(z), then we have

∂p(z)/∂z = ∑_m (∂p(z)/∂q_m)(∂q_m/∂z)

• Gradient w.r.t hidden units: Now we ”Talk to the hidden layers”; in this case p(z) is the loss function, z = h_{ij} and q_m(z) = a_{i+1,m}. We have

∂L(θ)/∂h_{ij} = ∑_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) W_{i+1,m,j}

∂L(θ)/∂a_{ij} = (∂L(θ)/∂h_{ij}) (∂h_{ij}/∂a_{ij}) = (∂L(θ)/∂h_{ij}) g′(a_{ij})

• Gradient w.r.t Parameters: Finally, we ”talk to the weights”

∂L(θ)/∂W_{k,i,j} = (∂L(θ)/∂a_{k,i}) (∂a_{k,i}/∂W_{k,i,j}) = (∂L(θ)/∂a_{k,i}) h_{k−1,j}

∂L(θ)/∂b_{k,i} = (∂L(θ)/∂a_{k,i}) (∂a_{k,i}/∂b_{k,i}) = ∂L(θ)/∂a_{k,i}

• We can now write the full pseudocode as

Algorithm 5 Forward Propagation


1: for k = 1 to L − 1 do
2: ak = bk + Wk hk−1
3: hk = g(ak )
4: end for
5: aL = bL + WL hL−1
6: ŷ = O(aL )

Algorithm 6 Backward Propagation


1: ∇aL L(θ) = −(e(y) − ŷ)
2: for k = L − 1 to 1 do
3: ∇Wk L(θ) = ∇ak L(θ)hTk−1
4: ∇bk L(θ) = ∇ak L(θ)
5: ∇hk−1 L(θ) = WkT ∇ak L(θ)
6: ∇ak−1 L(θ) = ∇hk−1 L(θ) ⊙ [..., g ′ (ak−1,j ), ...]
7: end for

• g′ for the logistic function σ(z) is g(z)(1 − g(z)) and for the tanh function it is 1 − (g(z))².
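• A compact NumPy sketch of Algorithms 5 and 6, assuming a logistic activation g for the hidden layers and a softmax output with cross-entropy loss; the layer sizes in the example are arbitrary:

import numpy as np

def g(a):                                   # logistic activation
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, W, b):                       # Algorithm 5
    L = len(W)
    a, h = [None], [x]                      # h[0] = x
    for k in range(1, L + 1):
        a.append(b[k] + W[k] @ h[k - 1])    # pre-activation
        h.append(g(a[k]) if k < L else softmax(a[k]))
    return a, h                             # h[L] is y_hat

def backward(a, h, e_y, W):                 # Algorithm 6 (e_y is the one-hot true label)
    L = len(W)
    grad_a = -(e_y - h[L])                  # gradient w.r.t. the output pre-activation
    grads_W, grads_b = {}, {}
    for k in range(L, 0, -1):
        grads_W[k] = np.outer(grad_a, h[k - 1])
        grads_b[k] = grad_a
        if k > 1:
            grad_h = W[k].T @ grad_a
            grad_a = grad_h * g(a[k - 1]) * (1 - g(a[k - 1]))   # g'(a) for the logistic function
    return grads_W, grads_b

rng = np.random.default_rng(0)              # tiny example: 4 inputs, 3 hidden units, 2 classes
W = {1: rng.normal(size=(3, 4)), 2: rng.normal(size=(2, 3))}
b = {1: np.zeros(3), 2: np.zeros(2)}
a, h = forward(rng.normal(size=4), W, b)
grads_W, grads_b = backward(a, h, np.array([1.0, 0.0]), W)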
• On flat surfaces, gradient descent moves very slowly. How do we solve this?

5 Gradient Descent Types


5.1 Momentum based Gradient Descent
• Intuition: If I am repeatedly being asked to move in the same direction then I should probably gain some
confidence and start taking bigger steps in that direction. Just as a ball gains momentum while rolling
down a slope.
• Update rule for momentum based gradient descent is
ut = βut−1 + ∇wt , u0 = ∇w0 , u−1 = 0
wt+1 = wt − ηut

• We are not only considering the current gradient, but we are also giving some importance to past history. β
is typically less than 1, so we give decreasing importance to previous histories. We have
u_t = βu_{t−1} + ∇w_t = β²u_{t−2} + β∇w_{t−1} + ∇w_t = ... = ∑_{τ=0}^{t} β^{t−τ} ∇w_τ

• In addition to current update, also look at the history of updates.


• Even in the regions having gentle slopes, momentum based gradient descent is able to take large steps
because the momentum carries it along.
• However, there is a possibility of overshooting our goal. This overshoot could lead to oscillations as well.
• Despite these oscillations, it will converge faster than gradient descent.

5.2 Nesterov Accelerated Gradient Descent


• Can we do something to reduce the oscillations observed in momentum based gradient descent?
• Intuition: Look before you leap, we ask the weights to move by two parts βut−1 and ∇wt . The idea
is to first move by βut−1 and then compute ∇wt . So instead of relying only on current gradient we are
essentially ”looking ahead” and computing new gradient.
• Update rule for NAG is
ut = βut−1 + ∇(wt − βut−1 )
wt+1 = wt − ηut
with u−1 = 0, and 0 ≤ β ≤ 1
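• The two update rules side by side as a minimal sketch; grad_L stands for any function returning ∇w at a given point and is an assumption of this sketch:

import numpy as np

def momentum_gd(w, grad_L, eta=0.1, beta=0.9, steps=100):
    u = np.zeros_like(w)
    for _ in range(steps):
        u = beta * u + grad_L(w)                  # accumulate a history of gradients
        w = w - eta * u
    return w

def nesterov_gd(w, grad_L, eta=0.1, beta=0.9, steps=100):
    u = np.zeros_like(w)
    for _ in range(steps):
        u = beta * u + grad_L(w - beta * u)       # "look ahead" before computing the gradient
        w = w - eta * u
    return w

w0 = np.array([5.0, -3.0])                        # example: L(w) = ||w||^2 / 2, so grad_L(w) = w
print(momentum_gd(w0, lambda w: w), nesterov_gd(w0, lambda w: w))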

5.3 Stochastic vs Batch Gradient Descent
• Regular gradient descent goes over the entire data once before updating the parameters. Because this is the
true gradient of the loss as derived earlier. Hence, theoretical guarantees hold.
• Imagine we have a million points in the training data, then it will take very long.
• In stochastic version, we update the parameters for every point in the data. Now if we have a million data
points then we will make a million updates in each epoch.

• However, this is an approximate gradient. Hence, we have no guarantee that each step will decrease the loss.
• So, going over a large data once is bad, going over a single point and updating is bad, but going over some
data points and updating is ok.
• This is the idea for mini-batch gradient descent.

• 1 epoch is one pass over the entire data, 1 step is one update of the parameters, N is the number of data
points, and B is the mini batch size.
• So, for regular gradient descent the number of epochs is the same as the number of steps, for stochastic gradient descent the number of steps is N, and for mini-batch gradient descent the number of steps is N/B.

• It is intuitive that a larger batch size is better, but we sacrifice on time.

Algorithm                            # of steps in 1 epoch

Vanilla (Batch) Gradient Descent     1
Stochastic Gradient Descent          N
Mini-Batch Gradient Descent          N/B

Table 5: Stochastic vs Mini Batch Gradient Descent
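• A minimal sketch of the mini-batch loop; grad_L(w, X, Y) computes the gradient on the given batch and is an assumption of this sketch. B = 1 recovers stochastic gradient descent and B = N recovers batch gradient descent.

import numpy as np

def minibatch_gradient_descent(w, X, Y, grad_L, eta=0.1, B=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(epochs):                        # 1 epoch = one pass over the entire data
        order = rng.permutation(N)
        for start in range(0, N, B):               # N/B steps (parameter updates) per epoch
            idx = order[start:start + B]
            w = w - eta * grad_L(w, X[idx], Y[idx])
    return w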

5.4 Scheduling Learning Rate


• Instead of using momentum and NAG we could have simply increased the learning rate, but on regions
which have a steep slope, the already large gradient would blow up farther.

• What we need is for the learning rate to be small when gradient is high and vice versa.
• Tune the learning rate by trying different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0.
• Run a few epochs with each of these and figure out a learning rate which works best. Now do a finer search
around this value.

• These are just heuristics, no clear winning strategy.


• Annealing learning rate: Decrease the learning rate as we get close to the minima.
• Step Decay: Halve the learning rate after every 5 epochs or halve the learning rate after an epoch if the
validation error is more than what it was at the end of the previous epoch.

• Exponential Decay: η = η_0 e^{−kt}, where η_0 and k are hyperparameters and t is the step number. However, choosing k becomes complex.
• 1/t Decay: η = η_0 / (1 + kt); again choosing k is a bit tricky, and ideally we won’t use these learning rates.
• For momentum the following schedule was suggested by Sutskever et al., 2013

β_t = min(1 − 2^{−1−log₂(⌊t/250⌋+1)}, β_max)

β_max is chosen from {0.999, 0.995, 0.99, 0.9, 0}


• In practice, often a line search is done to find a relatively better value of η. Update w using different values
of η then pick the best w based on loss function.

6 Adaptive Learning Rates
• Intuition: Decay the learning rate for parameters in proportion to their update history(more updates
means more decay).
• Update rule for AdaGrad

v_t = v_{t−1} + (∇w_t)²
w_{t+1} = w_t − (η / √(v_t + ε)) × ∇w_t
and similar set of equations for bt .
• Intuition: AdaGrad decays the learning rate very aggressively(as the denominator grows). As a result,
after a while, the frequent parameters will start receiving very small updates. To avoid this we can decay
the denominator.
• Update rule for RMSprop

v_t = βv_{t−1} + (1 − β)(∇w_t)²
w_{t+1} = w_t − (η / √(v_t + ε)) × ∇w_t
Give lower and lower weightage to previous gradients
• RMSprop converges more quickly than AdaGrad by being less aggressive on decay.
• In RMSprop, the deltas start oscillating after a while, learning rate can increase, decrease or remain constant.
Whereas in AdaGrad learning rate can only decrease as the denominator constantly grows.
• In case of oscillations, setting η0 properly could solve the problem.
• Both are sensitive to initial learning rate, initial conditions of parameters and corresponding gradients.
• AdaDelta avoids setting the initial learning rate η_0

v_t = βv_{t−1} + (1 − β)(∇w_t)²
∆w_t = −(√(u_{t−1} + ε) / √(v_t + ε)) ∇w_t
w_{t+1} = w_t + ∆w_t
u_t = βu_{t−1} + (1 − β)(∆w_t)²

Now the numerator, in the effective learning rate, is a function of past gradients. Observe that the ut that
we compute at t will be used only in the next iteration. Essentially one history runs behind the other.
• After some i iterations, vt will start decreasing and the ratio of the numerator to the denominator starts
increasing. If the gradient remains low for a subsequent time steps, then the learning rate grows accordingly.
• Therefore, AdaDelta allows the numerator to increase or decrease based on the current and past gradients.
• Intuition: Do everything that RMSprop does to solve the decay problem of AdaGrad, plus used a
cumulative history of the gradients.
• Update rule for Adam (Adaptive Moments)

m_t = β_1 m_{t−1} + (1 − β_1)∇w_t,    m̂_t = m_t / (1 − β_1^t)   ← incorporating classical momentum
v_t = β_2 v_{t−1} + (1 − β_2)(∇w_t)²,    v̂_t = v_t / (1 − β_2^t)
w_{t+1} = w_t − (η / (√(v̂_t) + ε)) m̂_t
Note that we are taking a running average of the gradients as mt .
• One way of looking at this is that we are interested in the expected value of the gradient and not on a
single point estimate computed at time t.
• m̂t and v̂t are bias correction terms, we do this to avoid very high learning rate update. It can also be
derived by taking expectation of mt .
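• A minimal sketch of the Adam update for one parameter vector; the defaults β1 = 0.9 and β2 = 0.999 are common choices, and grad_L is again an assumed gradient function:

import numpy as np

def adam(w, grad_L, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = np.zeros_like(w)                          # running average of gradients
    v = np.zeros_like(w)                          # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_L(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)              # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w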

• Bias correction addresses the initialization bias that comes with exponential averaging.
• The general form of the L_p norm is (|x_1|^p + |x_2|^p + ... + |x_n|^p)^{1/p}; as p → ∞, L_p boils down to taking the max value.

• In Adam, v_t is based on the L2 norm of the gradients; why not replace it with the L∞ norm? Therefore, we have

v_t = max(β_2^{t−1}|∇w_1|, β_2^{t−2}|∇w_2|, ..., |∇w_t|)
v_t = max(β_2 v_{t−1}, |∇w_t|)
w_{t+1} = w_t − (η_0 / v_t) ∇w_t
Observe that we didn’t use bias corrected vt as max norm is not susceptible to initial zero bias.
• Suppose that we initialize w_0 such that the gradient at w_0 is high. Suppose further that the gradients for the subsequent iterations are zero because x is sparse. Ideally, we don’t want the learning rate to change its value when ∇w_t = 0.
• The update rule for MaxProp is similar to RMSprop

v_t = max(βv_{t−1}, |∇w_t|)
w_{t+1} = w_t − (η / (v_t + ε)) ∇w_t

• We can extend the same idea to Adam and call it AdaMax

m_t = β_1 m_{t−1} + (1 − β_1)∇w_t,    m̂_t = m_t / (1 − β_1^t)
v_t = max(β_2 v_{t−1}, |∇w_t|)
w_{t+1} = w_t − (η / (v_t + ε)) m̂_t

• Intuition: We know NAG is better than momentum based GD, why not just incorporate it with Adam.
• We can rewrite NAG such that there is no t − 1 term, as

gt+1 = ∇wt
mt+1 = βmt + ηgt+1
wt+1 = wt − η(βmt+1 + gt+1 )

• Update rule for NAdam (Nesterov Adam)

m_t = β_1 m_{t−1} + (1 − β_1)∇w_t,    m̂_t = m_t / (1 − β_1^t)
v_t = β_2 v_{t−1} + (1 − β_2)(∇w_t)²,    v̂_t = v_t / (1 − β_2^t)
w_{t+1} = w_t − (η / (√(v̂_{t+1}) + ε)) (m̂_{t+1} + (1 − β_1)∇w_t / (1 − β_1^{t+1}))

• It is common for new learners to start out using off the shelf optimizers, which are later replaced by custom
designed ones.
• Cyclical Learning Rate: Suppose the loss surface has a saddle point. Suppose further that the parameters
are initialized, and the learning rate is decreased exponentially. After some iterations, the parameters will
reach near the saddle point. Since, the learning rate has decreased exponentially, the algorithm has no way
of coming out of the saddle point.
What if we allow the learning rate to increase after some iterations.
• Rationale: Often, difficulty in minimizing the loss arises from saddle points rather than poor local minima.
• A simple way to solve this is to vary the learning rate cyclically. One example is triangular learning rate.
• Cosine Annealing(Warm Re-Start): also based on cyclical learning rate

η_t = η_min + ((η_max − η_min)/2)(1 + cos(π · (t % (T + 1)) / T)),    T is the restart interval

• Warm-start: Using a low initial learning rate helps the model to warm and converge better.
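• A small sketch of the warm-restart schedule above; the restart interval T and the η range are arbitrary illustrative choices:

import numpy as np

def cosine_annealing(t, T=50, eta_min=0.001, eta_max=0.1):
    # eta restarts at eta_max every T + 1 steps and decays towards eta_min in between
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * (t % (T + 1)) / T))

schedule = [cosine_annealing(t) for t in range(200)]   # covers a few warm restarts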

7 Regularization
7.1 Introduction to Bias and Variance
• Deep learning models typically have BILLIONS of parameters whereas the training data may have only
MILLIONS of samples. Therefore, they are called over-parameterized models.
• Over-parameterized models are prone to a phenomenon called over-fitting.

• Simple models trained on different samples of the data do not differ much from each other. However, they
are very far from the true sinusoidal curve (under fitting).
• Bias: The difference between the expected value of the predictions and the true value.

Bias fˆ(x) = E[fˆ(x)] − f (x)

Simple models will have high bias and complex models will have low bias.
• Variance is defined as

Variance fˆ(x) = E[(fˆ(x) − E[fˆ(x)])2 ] = E[fˆ2 (x)] − E[fˆ(x)]2

Roughly speaking it tells us how much the different fˆ(x)′ s trained on different samples of the data differ
from each other. Simple models have a low variance whereas complex model have a high variance.

Bias Variance
Simple Model High Low
Complex Model Low High

• There is always a trade-off between the bias and variance. Both contribute to the MSE loss.
• Consider a new point (x, y) which was not seen during training. If we use the model fˆ(x) to predict the
value of y then the mean squared error is given by E[(y − fˆ(x))2 ] (MSE), we can show that

E[(y − fˆ(x))2 ] = Bias2 + Variance + σ 2 (irreducible error)

• The parameters of fˆ(x) are trained using a training set. However, at test time we are interested in evaluating
the model on a validation(unseen) set which was not used for training. This gives rise to two entities
trainerr and testerr . Typically, this gives rise to the following graph, parabola represents test error.

Figure 8: Train vs Test Error

• Complex model is more sensitive to minor changes in the data, as compared to simple models. Hence while
training, instead of minimizing the training error we should minimize train error +Ω(θ), which is high for
complex models and small for simple models.
• This is the basis for all the regularization methods.

7.2 Regularization Methods
• For L2 Regularization define the loss as

L̄(w) = L(w) + (α/2)∥w∥²

and the gradient then looks like

∇L̄(w) = ∇L(w) + αw

So it is very easy to implement this.
• It essentially pushes the weights closer to 0; non-essential weights end up much closer to 0 than essential weights.
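• Since the regularized gradient is just ∇L(w) + αw, L2 regularization is a one-line change to any of the update rules seen earlier; a minimal sketch with an assumed grad_L:

def l2_regularized_step(w, grad_L, eta=0.1, alpha=0.01):
    # gradient of L(w) + (alpha/2) * ||w||^2 is grad_L(w) + alpha * w ("weight decay")
    return w - eta * (grad_L(w) + alpha * w)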
• Data Augmentation: Consider images, rotate them, shift them, blur them, change some pixels, etc.
• We exploit the fact that certain transformations to the image do not change the label of the image.
• Typically, more data = better learning.
• Augmentation works well for image classification/object recognition tasks.
• It has been shown, that it works well for speech as well.
• Parameter Sharing is used in CNNs, essentially same filter is applied at different positions of the image,
or same weight matrix acts on different input neurons.
• Parameter sharing is also typically used in autoencoders, between the encoder and the decoder.
• Injecting Noise at input: We can show that for a simple input output neural network, adding Gaussian
noise to the input is equivalent to weight decay(L2 regularization).
• We essentially shift the input by a slight amount at every epoch, so the model sees a different input at every epoch, making it harder to overfit.
• This can also be viewed as data augmentation.
• We can similarly inject noise at output as well.
• Early stopping: Keep a patience parameter p; if the validation error does not decrease for p steps, stop and return the model from p steps earlier.
• This is very effective and widely used, it can be used even with other regularizers.
• Ensemble Methods: Combine the output of different models to reduce generalization error. These models
can correspond to different classifiers, it could be different instances of the same classifier trained with
different hyperparameters, features, samples etc.
• Bagging: Form an ensemble using different instances of the same classifier. For a given dataset, construct
multiple training sets by sampling with replacement.
• Bagging would work well when the errors of the model are independent or uncorrelated, if they are correlated
then the mse of the ensemble is as bad as the individual model.
• On average, the ensemble will perform at least as well as its individual members.
• Training several neural networks for making an ensemble is prohibitively expensive.
• Dropout: Refers to dropping out units, temporarily remove a node and all its incoming and outgoing
connections resulting in a thinned network.
• Each node is retained with a fixed probability, typically p = 0.5 for hidden nodes and p = 0.8 for visible nodes.
• We initialize all the parameters of the network and start training. For the first training instance(or
mini-batch), we apply dropout resulting in the thinned network. Then update only those parameters that
are active.
• For the second training instance, we again apply dropout resulting in a different thinned network. We again
compute the loss and back propagate.
• Each individual thinned network gets trained rarely, but the parameter sharing ensures that no model has untrained or poorly trained parameters.
• Dropout essentially applies a masking noise to the hidden units. Prevents hidden units from coadapting.
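• A minimal sketch of dropout at training time for one layer's activations; the retention probability p and the inverted-dropout rescaling (dividing by p so that no rescaling is needed at test time) are standard choices assumed here, not taken from the notes:

import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng(0)):
    if not train:
        return h                            # at test time the full network is used
    mask = rng.random(h.shape) < p          # keep each unit with probability p
    return (h * mask) / p                   # rescale so the expected activation is unchanged

h_thinned = dropout(np.array([0.3, 1.2, -0.7, 0.5]))   # a different thinned network every call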

8 Deep Learning Revival
8.1 Unsupervised Pre-Training
• We will first train the weights between the layers using an unsupervised objective, essentially we will try
to reconstruct x.
• If we are able to do this, it would mean that the hidden layers are capturing all the important information
of the image. We will be learning one layer at a time.
• h1 will try to reconstruct x, h2 will try to reconstruct h1 , and so on...
• After this layerwise pre-training, we add the output layer and train the whole network.
• In effect we have initialized the weights of the network using the greedy unsupervised objective and are now
fine-tuning these weights using the supervised objective.
• This helps in optimization and regularization.
• Unsupervised objective ensures that the learning is not greedy w.r.t the supervised objective.
• Some other experiments have also shown that pre-training is more robust to random initializations.

• This led to people thinking that deep networks are sensitive to initial weights, and that better initial weights might lead to a better network.

8.2 Better Activation Functions


• Sigmoid: σ(x) = 1 / (1 + e^{−x}), compresses all its inputs to the range [0, 1]. ∂σ(x)/∂x = σ(x)(1 − σ(x)); once a sigmoid neuron is saturated (σ(x) = 0 or 1), training will halt as the derivative becomes zero.
Another issue is that sigmoids are not zero-centered, leading to all-positive or all-negative gradients and a lack of diversity. They are also computationally expensive because of the exponential term.

• tanh: tanh(x) compresses all its inputs to the range [−1, 1], so it is zero-centered. ∂tanh(x)/∂x = 1 − tanh²(x), but we still face the issue of vanishing gradients, and it is still computationally expensive.
• Rectified Linear Unit: ReLU (x) = max (0, x), does not saturate in the positive region, it is computa-
tionally efficient and in practice converges much faster than sigmoid and tanh.
Clearly the derivative is 1 for x > 0 and 0 otherwise, leading to no updates for negative neurons, which
makes the neuron dead.
In practice a large fraction of ReLU units can die if the learning rate is set too high.
It is advised to initialize the bias to a positive value.
• Leaky ReLU: f (x) = max (0.1x, x), no saturation, computationally efficient, close to zero centered outputs,
and will not die.

• Parametric ReLU: f (x) = max (αx, x), α can be learned.


• Exponential Linear Unit: ELU(x) = x if x > 0, and a(e^x − 1) if x ≤ 0; close to zero-centered outputs, but expensive.

(a) Sigmoid Function (b) Tanh Function (c) ReLU Function

Figure 9: Activation Functions - 1

(a) Leaky ReLU Function (b) Parametric ReLU Function (c) ELU Function

Figure 10: Activation Functions - 2
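• For reference, the activations discussed above written out in NumPy; the α and β defaults are the commonly used values and are assumptions of this sketch:

import numpy as np

def sigmoid(x):                 return 1.0 / (1.0 + np.exp(-x))
def tanh(x):                    return np.tanh(x)
def relu(x):                    return np.maximum(0.0, x)
def leaky_relu(x):              return np.maximum(0.1 * x, x)
def parametric_relu(x, alpha):  return np.maximum(alpha * x, x)     # alpha is learned
def elu(x, a=1.0):              return np.where(x > 0, x, a * (np.exp(x) - 1))
def swish(x, beta=1.0):         return x * sigmoid(beta * x)        # beta ~ 1.702 approximates GELU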

• Model Averaging: Train k models with independent weights(w1 , w2 , ..., wk ) on independently sampled
data points from the original dataset. Each model is expected to be good at predicting a subset of training
samples well.
During inference, the prediction is done by averaging the predictions from all the models, it is often
arithmetic or geometric mean for regression problems.
• In bagging, the subset of training samples seen by each of the models does not change across epochs.
Therefore, the model weights get optimized for those samples.
• Each sub-model sees only some parts of the training data, therefore we want updates to be larger.
• We don’t have this luxury in case of dropout.
• MaxOut: In a way is a variation of Dropout, but instead of having probability, we take say 3 neurons in a
layer and propagate the max of them. Can also be thought of as a generalization of ReLU and Leaky ReLU.
Two MaxOut neurons with sufficient number of affine transformations, act as a universal approximator.
• Gaussian Error Linear Unit: GELU = mx where m ∼ Bernoulli(Φ(x))
• Swish: xσ(βx), taking β = 1.702 we get GELU, and taking 1 we get SILU(Sigmoid-weighted Linear Unit).

9 Convolutional Neural Networks


• Convolution operation can be defined as

x ∗ w = ∑_{a=0}^{∞} x(t − a)w(a) = ∑_{a=0}^{∞} x(a)w(t − a)

where w is called a filter and x is the input.


• Similarly 2D convolution operation can be defined as

(I ∗ K)(i, j) = ∑_{a=0}^{m−1} ∑_{b=0}^{n−1} I(i − a, j − b)K(a, b)

We can use this for images.


• We can define correlation operator as
g(i, j) = ∑_{a=0}^{m−1} ∑_{b=0}^{n−1} I(i + a, j + b)K(a, b)

Correlation and Convolution operator will be same when K is a symmetric matrix.

• For practical implementations most libraries implement correlation operator rather than convolution.
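• A direct loop-based NumPy implementation of the correlation operator above (which is what most libraries compute under the name convolution); no padding or stride is used in this sketch, and the example image and filter are arbitrary:

import numpy as np

def correlate2d(I, K):
    # valid correlation of image I (H x W) with kernel K (m x n)
    m, n = K.shape
    H, W = I.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + m, j:j + n] * K)   # sum_a sum_b I(i+a, j+b) K(a, b)
    return out

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # a simple vertical-edge filter
feature_map = correlate2d(I, K)                               # 3 x 3 feature map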

Figure 11: Convolution Example

• When we convolve a filter with an image the resulting image is called a feature map.
• If we use multiple such filters we will get multiple feature maps, and then we stack them, this number will
in a way act as the number of color channels.

• Let the image be of size W1 × H1 × D1, let the number of filters be K, let the output be of size W2 × H2 × D2, and let the stride be S and the spatial extent F. The filters are of size F × F × D1.

W2 = 1 + (W1 − F)/S    H2 = 1 + (H1 − F)/S    D2 = K

• If we want the output to be of the same size as the input, we add padding P.

W2 = 1 + (W1 − F + 2P)/S    H2 = 1 + (H1 − F + 2P)/S    D2 = K
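• The formulas wrapped in a small helper; the AlexNet-style numbers in the example are purely illustrative:

def conv_output_size(W1, H1, D1, K, F, S, P=0):
    # output volume for K filters of size F x F x D1, stride S, padding P
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

print(conv_output_size(227, 227, 3, K=96, F=11, S=4))   # (55, 55, 96)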

• One important characteristic of CNNs is weight sharing.


• Pooling: Basically the same as a filter, except instead of multiplying with numbers, we take the max/min/average of the pixel intensities within the filter window.

• Pooling essentially helps lower the size of the feature maps, and extract only essential or important features.
• A simple CNN will have (Convolution→Pooling)×2 layers, then a normal ANN.
• Pooling layers simply propagate the gradient during backpropagation, as they do not have any weights attached to them.

• For Max pooling, the gradient from the next layer is propagated only to the location of the maximum value
in the pooling window.
• For Average pooling, the gradient from the next layer is distributed equally to all elements in the pooling
region because each element contributes equally to the average.

• For convolution, the backpropagation update would be convolving the input(X) to the convolution layer
with the loss or gradient values given by the pooling layer.
• And when we want to propagate the gradient we will convolve weights(W ) instead of X.

9.1 Different Architectures
• AlexNet: Input size is of 227 × 227 × 3, the architecture can be seen below. Approx 27M parameters.

Figure 12: AlexNet Architecture

• ZFNet: Input size is the same, the architecture is basically the same as AlexNet except the three consecutive
convolution layers have filters 512, 1024, 512 respectively.
• VGGNet: Input is again the same, the key idea in this is kernel size is always 3 × 3.

• GoogLeNet: Key idea is to have multiple filter sizes to capture features at varying ranges.

Figure 13: Inception Model

• ResNet: Train a shallow network, then add layers and train deeper network. We can ensure this by adding
a skip connection.

10 Word Representations
10.1 Direct Representations
• Corpus: Group of documents.
• Consider the set V of all unique words across all input streams; V is called the vocabulary of the corpus. We need a representation for every word in V.

• We will be using the following corpus, to understand each representation


1. Human machine interface for computer applications
2. User opinion of computer system response time
3. User interface management system
4. System engineering for improved response time
• The vocabulary for this corpus is V = [human, machine, interface, for, computer, applications, user, opinion, of, system, response, time, management, engineering, improved]
• One very simple way of doing this is to use one-hot vectors of size |V |, essentially switch the bit on if the
word is present else it is off.
• machine would be represented as [0 1 0 ... 0 0]

• V tends to be very large, e.g. 50K for PTB and 13M for the Google 1T corpus.
• These representations do not capture any notion of similarity.

• With one-hot representations, the Euclidean distance between any two words is √2 and the cosine similarity between any two words is 0.
• We want representations of cat and dog to be closer to each other than the representations of cat and truck.
• These are also called sparse representations.

• A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term.
• The context is defined as a window of k words around the terms.
• With k = 2, the co-occurrence matrix for the above corpus would be

human machine system for ... user


human 0 1 0 1 ... 0
machine 1 0 0 1 ... 0
system 0 0 0 1 ... 2
for 1 1 1 0 ... 0
. . . . . . .
. . . . . . .
. . . . . . .
user 0 0 2 0 ... 0

Table 6: Co-Occurrence Matrix

• Essentially for the word w1 , we check a window of k words around it and count how many times other
words appear.
• This is also known as a word × context matrix.
• We could choose the set of words and contexts to be same or different.

• Each row (column) of the co-occurrence matrix gives a representation of the corresponding word (context).
• Stop words(a, the, for, etc.) are very frequent → these counts will be very high.
• We can solve this by ignoring very frequent words or use a threshold t, and if a word, context pair goes
above a threshold then we cap it.

• Instead of counts we can also use PMI

PMI(w, c) = log (p(c|w) / p(c)) = log ((count(w, c) · N) / (count(c) · count(w)))

Obviously, when count(w, c) = 0 then PMI becomes −∞, which is an issue.

• We can solve this by taking PMI value as 0 when count(w, c) ≤ 0 and using the formula when count(w, c) > 0,
this is called P M I0 (w, c).
• Another solution is instead of count(w, c) > 0, we replace with P M I(w, c) > 0. This is called Positive PMI
or P P M I.

human machine system for ... user


human 0 2.944 0 2.25 ... 0
machine 2.944 0 0 2.25 ... 0
system 0 0 0 1.15 ... 1.84
for 2.25 2.25 1.15 0 ... 0
. . . . . . .
. . . . . . .
. . . . . . .
user 0 0 1.84 0 ... 0

Table 7: PMI Co-Occurrence Matrix
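• A small sketch that turns a co-occurrence count matrix into a PPMI matrix following the formula above; the example matrix is arbitrary:

import numpy as np

def ppmi(counts):
    # counts: word x context co-occurrence matrix
    N = counts.sum()
    word_counts = counts.sum(axis=1, keepdims=True)        # count(w)
    context_counts = counts.sum(axis=0, keepdims=True)     # count(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * N) / (word_counts * context_counts))
    pmi[counts == 0] = 0.0                                  # PMI_0: zero when count(w, c) = 0
    return np.maximum(pmi, 0.0)                             # PPMI: clip negative PMI to 0

X = np.array([[0., 1., 0., 2.],
              [1., 0., 1., 1.],
              [0., 1., 0., 1.]])
X_ppmi = ppmi(X)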

• This is however still very large and very sparse, and it grows with vocabulary.

• Consider X = X_PPMI ∈ R^{m×n}; we can use singular value decomposition to get a rank-k approximation.

X̂_{m×n} = U_{m×k} Σ_{k×k} V^T_{k×n}

• SVD gives the best rank k approximation of the original data, Discovers latent semantics in the corpus.
• SVD will essentially extract the most important features from the matrix.

• Conventionally, we take the word representation as

W_word = UΣ ∈ R^{m×k}

This is the representation of the m words in the vocabulary, and W_context = V is taken as the representation of the context words.

• This idea comes from the fact that X̂X̂^T and W_word W_word^T are the same matrix, i.e., they roughly capture the same cosine similarities.
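• A minimal sketch of obtaining rank-k word representations from such a matrix with SVD; the random stand-in matrix and k = 2 are assumptions:

import numpy as np

def svd_word_vectors(X, k):
    # rank-k approximation X_hat = U_k Sigma_k V_k^T; W_word = U Sigma, W_context = V
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    W_word = U[:, :k] * S[:k]       # m x k word representations
    W_context = Vt[:k, :].T         # n x k context representations
    return W_word, W_context

X = np.abs(np.random.default_rng(0).normal(size=(6, 4)))   # stand-in for a PPMI matrix
W_word, W_context = svd_word_vectors(X, k=2)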
• These methods are called count based models because they use the co-occurrence counts of words.

10.2 Bag of Words Model


• Task: Predict the nth word given previous n − 1 words.
• Training data: All n word windows in the corpus.
• Consider doing this for n = 2, predict the second word given the first.

• We can model this problem using a feedforward neural network with one hidden layer. This network will
take input a one-hot representation of the context word and output a probability vector of words.
• Parameters are Wcontext ∈ Rk×|V | and Wword ∈ R|V |×k , and we use softmax to get probabilities.
• The product W_context x is simply the i-th column of W_context, given x is a one-hot vector.

• When the ith word is present the ith element in the one hot vector is ON and the ith column of Wcontext
gets selected.
• There is a one to one correspondence between the words and columns of Wcontext .

• We can treat the ith column of Wcontext as the representation of the context i.

• P (word = i|v) depends on the ith column of Wword .
• We thus treat the ith column of Wword as the representation of the word i.

• Consider the context word with index c and the correct output with index w

L(θ) = − log ŷw = − log P (w|c)


h = W_context x_c = u_c

ŷ_w = e^{u_c · v_w} / ∑_{w′∈V} e^{u_c · v_{w′}}

uc is the column of Wcontext corresponding to context word c and vw is the column of Wword corresponding
to the word w.
• Given a context u_c, if we are predicting v_w then the update rule for v_w derived from backpropagation is

v_w = v_w + ηu_c(1 − ŷ_w)

This increases the cosine similarity between u_c and v_w.


• The training objective ensures that the cosine similarity between word (vw ) and context word (uc ) is
maximized
• In practice, instead of window size of 1 it is common to use a window size of d.

• We could simply concatenate all the one-hot representations, with the first-layer weights changing to [W_context, W_context, ... d times], but this becomes very complex. Instead, we simply add up the columns where the bit is on:

h = ∑_{i=1}^{d−1} u_{c_i}, where u_{c_i} is the column of W_context corresponding to the i-th context word

• Computation of softmax is expensive as denominator requires sum of all words in vocabulary.

10.3 Skip Gram Model


• Predicts context words given an input word.
• The roles of context and word are now interchanged.
• It essentially becomes the same as bag of words, and we run into the same problems.

• Negative Sampling
• Let D be the set of all correct (w, c) pairs in the corpus
• Let D′ be the set of all incorrect (w, r) pairs in the corpus

• D′ can be constructed by randomly sampling a context word r which has never appeared with w and
creating a pair (w, r)
• We are interested in maximizing

∏_{(w,c)∈D} P(z = 1|w, c) ∏_{(w,r)∈D′} P(z = 0|w, r),

or equivalently its log: ∑_{(w,c)∈D} log σ(v_c^T v_w) + ∑_{(w,r)∈D′} log σ(−v_r^T v_w)

• The size of D′ is k times the size of D.


• Contrastive Estimation: Consider that instead of outputting a probability we output a score.
• Now, we want the score of correct word s to be higher than wrong word sr , so we try to maximize s − sr .

• But we would like the difference to be at least m, so we maximize s − (sr + m)


• If s > sr + m then we don’t do anything.
• Hierarchical Softmax: Construct a binary tree such that there are |V | leaf nodes each corresponding to
one word in the vocabulary.

• There exists a unique path from the root node to a leaf node.

• Let l(w1 ), l(w2 ), ..., l(wp ) be the nodes on the path from root to w
• Let π(w) be a binary vector such that
π(w)_k = 1 if the path branches left at node l(w_k), and 0 otherwise

• Finally each internal node is associated with a vector ui


• So the parameters of the model are W_context and u_1, u_2, ..., u_{|V|}
• The probability of predicting a word is the same as predicting the correct unique path from the root node
to that word.

• We model

P(π(on)_i = 1) = 1 / (1 + e^{−v_c^T u_i})   and   P(π(on)_i = 0) = 1 − P(π(on)_i = 1)
• Note that p(w|vc ) can now be computed using |π(w)| computations instead of |V | required by softmax
• Turns out that even a random arrangement of the words on leaf nodes does well in practice.

10.4 GloVe Representations


• Count based methods (SVD) rely on global co-occurrence counts from the corpus for computing word
representations
• Predict based methods learn word representations using co-occurrence information
• We will now combine the two.

• Consider X to be the co-occurrence matrix built from the previous corpus.


• Xij encodes important global information about the co-occurrence between i and j.
• Now we enforce
viT vj = log P (j|i) = log Xij − log Xi
vjT vi = log Xji − log Xj = log Xij − log Xj (Xij = Xji )

• Adding the two equations we get

2v_i^T v_j = 2 log X_{ij} − log X_i − log X_j
v_i^T v_j = log X_{ij} − (1/2) log X_i − (1/2) log X_j = log X_{ij} − b_i − b_j
v_i^T v_j + b_i + b_j = log X_{ij}

min_{v_i, v_j, b_i, b_j} ∑_{i,j} (v_i^T v_j + b_i + b_j − log X_{ij})²

where viT vj + bi + bj is the predicted value using model parameters and log Xij is the actual value calculated
from the given corpus.
• One drawback is that, this weighs all co-occurrences equally.

• A simple fix is to multiply a weighing function f (Xij )


f(x) = (x/x_max)^α if x < x_max, and 1 otherwise

where α can be tuned for a given dataset.
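• A sketch of one full pass of gradient descent on the weighted GloVe objective above, written with explicit loops for clarity; the dimensions, learning rate, and x_max/α values are illustrative assumptions:

import numpy as np

def glove_epoch(X, v, b, eta=0.05, x_max=100.0, alpha=0.75):
    # one pass over sum_ij f(X_ij) (v_i^T v_j + b_i + b_j - log X_ij)^2
    grad_v, grad_b = np.zeros_like(v), np.zeros_like(b)
    n = X.shape[0]
    for i in range(n):
        for j in range(n):
            if X[i, j] == 0:
                continue                                          # only observed co-occurrences
            f = (X[i, j] / x_max) ** alpha if X[i, j] < x_max else 1.0
            err = v[i] @ v[j] + b[i] + b[j] - np.log(X[i, j])
            grad_v[i] += 2 * f * err * v[j]
            grad_v[j] += 2 * f * err * v[i]
            grad_b[i] += 2 * f * err
            grad_b[j] += 2 * f * err
    return v - eta * grad_v, b - eta * grad_b

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(10, 10)).astype(float)               # toy co-occurrence counts
v, b = rng.normal(scale=0.1, size=(10, 8)), np.zeros(10)          # 8-dimensional word vectors
for _ in range(50):
    v, b = glove_epoch(X, v, b)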

10.5 Evaluating Word Representations
• Semantic Relatedness
1. Ask humans to judge the relatedness between a pair of words
2. Compute the cosine similarity between the corresponding word vectors learned by the model
3. Given many such word pairs, compute the correlation between Smodel & Shuman , and compare different
models
4. Model 1 is better than Model 2 if correlation(Smodel1 , Shuman ) > correlation(Smodel2 , Shuman )
• Synonym Detection
1. Given: a term and four candidate synonyms
2. Pick the candidate which has the largest cosine similarity with the term
3. Compute the accuracy of different models and compare

11 Sequential Learning
11.1 Introduction
• In many applications the input is not of a fixed size
• Further successive inputs may not be independent of each other
• We need to look at a sequence of (dependent) inputs and produce an output (or outputs)
• Each input corresponds to one time step
• Consider the task of predicting the part-of-speech tag (noun, adverb, adjective, verb) of each word in a sentence
• Once we see an adjective (social) we are almost sure that the next word should be a noun (man)
• Thus the current output (noun) depends on the current input as well as the previous input
• Further, the size of the input is not fixed (sentences can have an arbitrary number of words)
• Notice that here we are interested in producing an output at each time step
• Each network is performing the same task (input : word, output : tag)
• Sometimes we may not be interested in producing an output at every stage
• Instead we would look at the full sequence and then produce an output
• Consider the task of predicting the polarity of a movie review
• The prediction clearly does not depend only on the last word but also on some words which appear before
• Here again we could think that the network is performing the same task at each step (input : word, output
: +/−) but it’s just that we don’t care about intermediate outputs
• Sequences could be composed of anything (not just words)

11.2 Recurrent Neural Networks


• The function being executed at each time step is

si = σ(U xi + b)
yi = O(V si + c)

where i is the timestep


• Since we want the same function to be executed at each timestep we should share the same network
• This parameter sharing also ensures that the network becomes agnostic to the size of the input
• Since we are simply going to compute the same function at each timestep, the number of timesteps doesn’t
matter

• We just create multiple copies of the network and execute them at each timestep
• Suppose instead that we fed all previous inputs to the current network; this would be infeasible, since the
network would no longer compute the same function at each time step and parameter sharing would not be possible.
• The solution is to add a recurrent connection in the network
si = σ(U xi + W si−1 + b)
yi = O(V si + c)

• This can be represented as follows

Figure 14: Recurrent Neural Network

• Consider x_i ∈ R^n, s_i ∈ R^d and y_i ∈ R^k; for the products U x_i, W s_{i−1} ∈ R^d and V s_i ∈ R^k to be well defined, the dimensions of the parameters are

U ∈ R^{d×n},  V ∈ R^{k×d}  and  W ∈ R^{d×d}

(a small numerical sketch of this recurrence is given at the end of this subsection)

• The total loss is simply the sum of loss over all time steps.
• The gradients of V and U follow from straightforward backpropagation.
• However, the gradient of W involves a chain of gradients going back through time, i.e. backpropagation through time.
• For W there is a high chance of vanishing/exploding gradients, so in practice we truncate backpropagation to τ time steps.
• The state (si ) of an RNN records information from all previous time steps
• At each new timestep the old information gets morphed by the current input
• One could imagine that after t steps the information stored at time step t − k gets morphed so much that it
would be impossible to extract the original information stored at time step t − k.
• A similar problem occurs when the information flows backwards
• It is very hard to assign the responsibility of the error caused at time step t to the events that occurred at
time step t − k
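• A minimal NumPy sketch of the recurrence above, unrolled over a toy input sequence (dimensions and random parameters are illustrative; σ is taken as tanh and the output function O as a softmax, one common choice):

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

n, d, k = 8, 16, 4                            # input, state and output dimensions
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d, n))        # input  -> state
W = rng.normal(scale=0.1, size=(d, d))        # state  -> state (recurrent connection)
V = rng.normal(scale=0.1, size=(k, d))        # state  -> output
b, c = np.zeros(d), np.zeros(k)

xs = [rng.normal(size=n) for _ in range(5)]   # a sequence with 5 time steps
s = np.zeros(d)                               # s_0
for x in xs:                                  # the same parameters are reused at every step
    s = np.tanh(U @ x + W @ s + b)            # s_i = sigma(U x_i + W s_{i-1} + b)
    y = softmax(V @ s + c)                    # y_i = O(V s_i + c)
print(y)                                      # output distribution at the final time step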

11.3 Long Short Term Memory


• Consider the task of predicting the sentiment of a review
• RNN reads the document from left to right and after every word updates the state
• By the time we reach the end of the document the information obtained from the first few words is
completely lost
• Ideally we want to
1. forget the information added by stop words
2. selectively read the information added by previous sentiment bearing words
3. selectively write new information from the current word to the state

• Suppose we have computed a state s_{t−1} at timestep t − 1, and now we want to overload it with new information (x_t)
and compute a new state (s_t).
• We introduce a vector ot−1 which decides what fraction of each element of st−1 should be passed to the
next state
• Each element of ot−1 gets multiplied with the corresponding element of st−1
• Each element of ot−1 is restricted to be between 0 and 1

ot−1 = σ(Wo ht−2 + Uo xt−1 + bo )


ht−1 = ot−1 ⊙ σ(st−1 )

Wo , Uo , bo need to be learned.
• ot is called the output gate as it decides how much to pass (write) to the next time step
• Now we calculate state information as

s̃t = σ(W ht−1 + U xt + b)

• However, we may not want to use all this new information and only selectively read from it before
constructing the new cell state st
• To do this we introduce another gate called the input gate

it = σ(Wi ht−1 + Ui xt + bi )

and then take the element-wise product i_t ⊙ s̃_t to selectively read from this new information
• To selectively forget, we introduce another forget gate

ft = σ(Wf ht−1 + Uf xt + bf )

• Now we have the full set of equations for LSTM, the gates are

ot = σ(Wo ht−1 + Uo xt + bo )
it = σ(Wi ht−1 + Ui xt + bi )
ft = σ(Wf ht−1 + Uf xt + bf )

and the state equations are

s̃t = σ(W ht−1 + U xt + b)


st = ft ⊙ st−1 + it ⊙ s̃t
ht = ot ⊙ σ(st ) and rnnoutput = ht

Figure 15: Long Short Term Memory
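• A minimal NumPy sketch of a single LSTM step implementing the gate and state equations above (shapes and random parameters are illustrative; the gates use the sigmoid, while the nonlinearity applied to s̃_t and s_t is taken as tanh here, a common practical choice):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, m = 8, 6                                       # state and input dimensions
rng = np.random.default_rng(0)
def gate_params():                                # one (W, U, b) triple per gate / state
    return (rng.normal(scale=0.1, size=(d, d)),
            rng.normal(scale=0.1, size=(d, m)),
            np.zeros(d))
(Wo, Uo, bo), (Wi, Ui, bi), (Wf, Uf, bf), (W, U, b) = [gate_params() for _ in range(4)]

def lstm_step(s_prev, h_prev, x):
    o = sigmoid(Wo @ h_prev + Uo @ x + bo)        # output gate
    i = sigmoid(Wi @ h_prev + Ui @ x + bi)        # input gate
    f = sigmoid(Wf @ h_prev + Uf @ x + bf)        # forget gate
    s_tilde = np.tanh(W @ h_prev + U @ x + b)     # candidate state (selective write)
    s = f * s_prev + i * s_tilde                  # selectively forget old, read new
    h = o * np.tanh(s)                            # what is exposed to the next time step
    return s, h

s, h = np.zeros(d), np.zeros(d)
for x in [rng.normal(size=m) for _ in range(4)]:
    s, h = lstm_step(s, h, x)
print(h)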

• Another variant of LSTM is Gated Recurrent Unit, which has the following set of equations

ot = σ(Wo st−1 + Uo xt + bo )
it = σ(Wi st−1 + Ui xt + bi )
s̃t = σ(W (ot ⊙ st−1 ) + U xt + b)
st = (1 − it ) ⊙ st−1 + it ⊙ s̃t

• There is no explicit forget gate; the forget and input gates are tied (the forget weight is 1 − i_t).
• During forward propagation, the gates control the flow of information
• They prevent any irrelevant information from being written to the state
• Similarly, during backward propagation they control the flow of gradients
• It is easy to see that during the backward pass the gradients get multiplied by the gate values
• If the state at time t − 1 did not contribute much to the state at time t then during backpropagation the
gradients flowing into st−1 will vanish
• But this kind of vanishing gradient is fine, since st−1 did not contribute to st we don’t want to hold it
responsible for the crimes of st

12 Encoder-Decoder Models
12.1 Introduction
• What if we want to generate a sentence given an image?
• We are now interested in P (yt |yt−1 , I) instead of P (yt |yt−1 ), where I is an image.
• We could now model P(y_t = j|y_{t−1}, I) as P(y_t = j|s_t, fc7(I)) where fc7(I) is the representation of the image
obtained from the fc7 (fully connected) layer of a CNN
• There are many ways of making P (yt = j) conditional on fc7 (I)
• Option 1: Set s0 = fc7 (I)
• Now s0 and hence all subsequent st ’s depend on fc7 (I)
• We can thus say that P (yt = j) depends on fc7 (I)
• Option 2: Another more explicit way of doing this is to compute

st = RN N (st−1 , [xt , fc7 (I)])

• We are explicitly using fc7 (I) to compute st and hence P (yt = j)


• A CNN is first used to encode the image
• An RNN is then used to decode (generate) a sentence from the encoding
• This is a typical encoder decoder architecture
• Both the encoder and decoder use a neural network
• Alternatively, the encoder’s output can be fed to every step of the decoder
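• A minimal sketch of Option 2: the image encoding is concatenated with the word embedding at every decoder step (the tanh RNN cell and the random stand-in for fc7(I) are illustrative placeholders, not a specific library API):

import numpy as np

d, e, f = 16, 8, 12                          # state, word-embedding and image-feature dims
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(d, e + f))   # acts on the concatenation [x_t, fc7(I)]
W = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)

def decoder_step(s_prev, x_t, img_feat):
    inp = np.concatenate([x_t, img_feat])    # feed fc7(I) explicitly at every time step
    return np.tanh(U @ inp + W @ s_prev + b)

img_feat = rng.normal(size=f)                # stand-in for fc7(I) from the CNN encoder
s = np.zeros(d)                              # Option 1 would instead set s_0 = fc7(I)
for x_t in [rng.normal(size=e) for _ in range(3)]:
    s = decoder_step(s, x_t, img_feat)
print(s.shape)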

Figure 16: Image Captioning

Figure 17: Textual Entailment

Figure 18: Image Question Answering

Figure 19: Video Captioning

12.2 Attention Mechanism
• Encoder decoder models can be made even more expressive by adding an “attention” mechanism

• At each time-step we should feed only the relevant information (i.e. the encodings of the relevant input words) to the
decoder
• We could just take a weighted average of the corresponding word representations and feed it to the decoder.

• The machine will have to learn this from the data


• To enable this, we define a function ejt = fAT T (st−1 , hj )
• This quantity captures the importance of the j th input word for decoding the tth output word.
• We can normalize these weights by using the softmax function

α_{jt} = exp(e_{jt}) / Σ_{j=1}^{M} exp(e_{jt})

• αjt denotes the probability of focusing on the j th word to produce the tth output word.
• One possible choice for f_ATT is

e_{jt} = V_att^T tanh(U_att s_{t−1} + W_att h_j)

• We are essentially asking the model to approach the problem in a better, more natural way

Figure 20: Architecture with Attention

• We can check whether the attention model learns something meaningful by plotting the attention weights
as a heatmap.

• In the case of text, we have a representation for every location of the input sequence.
• But for images, we typically use representation from one of the fully connected layers. This representation
does not contain any location information.
• Instead of the fc7 representation, we use the output of one of the convolution layers, which has spatial
information.
• For example, the output of the 5th convolutional layer of VGGNet is a 14 × 14 × 512 size feature map. We
could think of this as 196 locations, each having a 512 dimensional representation. The model will then
learn an attention over these locations.
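• A minimal NumPy sketch of this attention step: score each encoder state against the previous decoder state, normalize with a softmax, and form the weighted average that is fed to the decoder (dimensions and random parameters are illustrative):

import numpy as np

d_enc, d_dec, d_att, M = 10, 12, 8, 6              # M = number of input positions
rng = np.random.default_rng(0)
U_att = rng.normal(scale=0.1, size=(d_att, d_dec))
W_att = rng.normal(scale=0.1, size=(d_att, d_enc))
V_att = rng.normal(scale=0.1, size=d_att)

H = rng.normal(size=(M, d_enc))                    # encoder states h_1 ... h_M
s_prev = rng.normal(size=d_dec)                    # decoder state s_{t-1}

# e_{jt} = V_att^T tanh(U_att s_{t-1} + W_att h_j), for every input position j
e = np.array([V_att @ np.tanh(U_att @ s_prev + W_att @ h_j) for h_j in H])
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                               # softmax over the M positions
context = alpha @ H                                # weighted average fed to the decoder
print(alpha.round(2), context.shape)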

12.3 Hierarchical Attention
• Consider a dialog between a user (u) and a bot (B)
• The dialog contains a sequence of utterances between the user and the bot
• Each utterance in turn is a sequence of words
• Thus what we have here is a “sequence of sequences” as input.
• We could think of a two level hierarchical RNN encoder
• The first level RNN operates on the sequence of words in each utterance and gives us a representation
• We now have a sequence of utterance representations (red vectors in the image)
• We can now have another RNN which encodes this sequence and gives a single representation for the
sequence of utterances
• The decoder can then produce an output sequence conditioned on this representation

Figure 21: Two layer RNN
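• A minimal sketch of this two-level encoding (a plain tanh-RNN is used at both levels purely for illustration; the utterance lengths and dimensions are made up):

import numpy as np

rng = np.random.default_rng(0)
d = 8

def rnn_encode(seq, U, W):
    # Plain tanh-RNN encoder; the final state is the representation of the sequence
    s = np.zeros(d)
    for x in seq:
        s = np.tanh(U @ x + W @ s)
    return s

U_w, W_w = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))  # word level
U_u, W_u = rng.normal(scale=0.1, size=(d, d)), rng.normal(scale=0.1, size=(d, d))  # utterance level

# A "sequence of sequences": 3 utterances with 4, 6 and 3 word vectors respectively
dialog = [[rng.normal(size=d) for _ in range(n)] for n in (4, 6, 3)]

utterance_reps = [rnn_encode(u, U_w, W_w) for u in dialog]  # first-level RNN, one vector per utterance
dialog_rep = rnn_encode(utterance_reps, U_u, W_u)           # second-level RNN over utterance vectors
print(dialog_rep.shape)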

• Consider another example of classification of documents, each document is a sequence of sentences and
each sentence is in turn a sequence of words.

Figure 22: Document Classification

• We need attention at two levels
1. First, we need to attend to important (most informative) words in a sentence.
2. Then we need to attend to important (most informative) sentences in a document.

Figure 23: Encoder Architecture

Figure 24: Decoder Architecture

13 Transformer Models
• In the encoder-decoder (RNN) architecture, we have to wait until s_{t−1} is computed before we can compute s_t.
• The idea is to compute all of these states in parallel instead of waiting.

Figure 25: Transformer Architecture

13.1 Self-Attention
• The inputs are vectors (word embeddings), and the outputs are also vectors.
• Given a word in a sentence, we want to compute the relational score between the word and the rest of the
words in the sentence, such that the score is higher if they are related contextually.
• Now both the vectors s_i and h_j, for all i and j, are available at the same time (unlike in the RNN decoder, where s_{t−1} had to be computed sequentially).
• The score function we used was

score(s_{t−1}, h_j) = V_att^T tanh(U_att s_{t−1} + W_att h_j)

• There are three vectors involved in computing the score at each time step.
• We get them by transforming hj
• qj = WQ hj , called the query vector
• kj = WK hj called the key vector
• vj = WV hj called the value vector

Figure 26: Self Attention

• We will first calculate score(q_1, k_j): e_{1j} = [q_1 · k_1, q_1 · k_2, ..., q_1 · k_5]
• Then we get α_{1j} = softmax(e_{1j}), and finally z_1 = Σ_{j=1}^{5} α_{1j} v_j
• The query and key vectors participate in computing the weights α, while the value vectors participate in computing the final output.
• All query, key and value vectors can be computed in one go, as follows

Q = [q_1, q_2, ..., q_T] = W_Q [h_1, h_2, ..., h_T]   (and similarly K = W_K [h_1, ..., h_T], V = W_V [h_1, ..., h_T])

• The entire output can also be computed in parallel as follows

Z = [z_1, z_2, ..., z_T] = softmax(Q^T K / √d_k) V^T

where d_k is the dimension of the key vectors.

Figure 27: Self Attention Block
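• A minimal NumPy sketch of one self-attention block in the matrix form above, with one column per position as in these notes (dimensions and random weights are illustrative):

import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, d_k, d_v, T = 16, 8, 8, 5
rng = np.random.default_rng(0)
W_Q = rng.normal(scale=0.1, size=(d_k, d_model))
W_K = rng.normal(scale=0.1, size=(d_k, d_model))
W_V = rng.normal(scale=0.1, size=(d_v, d_model))

H = rng.normal(size=(d_model, T))            # input vectors h_1 ... h_T as columns
Q, K, V = W_Q @ H, W_K @ H, W_V @ H          # all queries, keys and values in one go

A = softmax_rows(Q.T @ K / np.sqrt(d_k))     # T x T weights; entry (t, j) is alpha_{tj}
Z = A @ V.T                                  # row t is z_t = sum_j alpha_{tj} v_j
print(Z.shape)                               # (T, d_v)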

13.2 Multi-Headed Attention


• The idea behind multi-headed attention is much the same as the idea behind having multiple filters in a CNN.
• Two-headed attention: Have two self attention block side-by-side.
• We simply concatenate the output from each block.

Figure 28: Multi-Headed Attention Block
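• A short, self-contained sketch of two-headed attention: run two independent self-attention blocks (each with its own W_Q, W_K, W_V) and concatenate their outputs position-wise (shapes and random weights are illustrative):

import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    Q, K, V = W_Q @ H, W_K @ H, W_V @ H
    A = softmax_rows(Q.T @ K / np.sqrt(W_K.shape[0]))
    return A @ V.T                               # one output row per position

d_model, d_head, T, n_heads = 16, 8, 5, 2
rng = np.random.default_rng(0)
H = rng.normal(size=(d_model, T))

heads = []
for _ in range(n_heads):                         # each head has its own projections
    W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_head, d_model)) for _ in range(3))
    heads.append(self_attention(H, W_Q, W_K, W_V))

Z = np.concatenate(heads, axis=-1)               # concatenate the head outputs per position
print(Z.shape)                                   # (T, n_heads * d_head)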

• The outputs are then passed to a feed-forward network, which is our usual ANN.

Figure 29: Feed Forward Network

• An identical network is applied at each position: FFN(z) = W_2 max(0, W_1 z + b_1) + b_2


• The encoder is composed of identical layers, and each layer is composed of 2 sub-layers.

• The computation is parallelized in the horizontal direction (i.e. across time steps within a training sample) of the
encoder stack, not along the vertical direction (across the stacked layers).

Figure 30: Encoder Stack

13.3 Decoder Architecture


• The decoder takes input from the encoder and also has self-inputs: the words that have been predicted/decoded
so far.
• Each layer is composed of three sub-layers.

Figure 31: Decoder Block

• We don’t know in advance how long the output is going to be, hence we fix a maximum length T.
• At the beginning only < GO > makes sense; the positions that have not been decoded yet would contain junk, hence we use
”masked” self-attention. Essentially, we set α_{tj} = 0 for all positions j beyond the current position t.

• Masking is done by inserting negative infinity at the respective score positions before the softmax; since exp(−∞) = 0,
the masked positions receive zero attention weight (see the sketch at the end of this section).
• We will also use masking in cross attention.
• We take the vectors coming from the encoder to compute K and V, and use the output of the (masked) self-attention
sub-layer as Q; cross-attention then proceeds exactly like self-attention.
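• A minimal sketch of the causal masking trick mentioned above: set the scores of future positions to −∞ before the softmax, so each position attends only to itself and earlier positions (shapes and random values are illustrative; queries, keys and values are given directly as rows here):

import numpy as np

def softmax_rows(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d_k = 5, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(T, d_k))                 # one query per decoded position (rows)
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_k))

scores = Q @ K.T / np.sqrt(d_k)               # T x T raw scores
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[future] = -np.inf                      # exp(-inf) = 0, so these weights become 0
alpha = softmax_rows(scores)
Z = alpha @ V
print(alpha.round(2))                         # strictly upper-triangular part is all zeros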

