Deep Learning
21F3001823
December 6, 2024
1.6 The Curious Case of Sequences
• Sequences: Each unit in the sequence interacts with other units. Need models to capture this interaction.
Example: Speech, Videos, etc.
• Hopfield Network: Content-addressable memory systems for storing and retrieving patterns.
• Jordan Network: The output state of each time step is fed to the next time step, thereby allowing
interactions between time steps in the sequence.
• Elman Network: The hidden state of each time step is fed to the next time step, thereby allowing
interactions between time steps in the sequence.
• Very hard to train RNNs.
• Long Short Term Memory: Solved complex long time lag tasks that could never be solved before (addressed the problem of vanishing gradients).
• Sequence to Sequence Models: Introduced attention! However, they were unable to capture the contextual information of a sentence.
• Transformers: Introduced a paradigm shift in the field of NLP. GPT and BERT are the most commonly
used transformer based architectures.
• Analog AI: Programmable resistors are the key building blocks in analog deep learning, just like transistors
are the core element of digital processors(faster computation).
• Astronomy: Galaxy Evolution: Predict how galaxies will look as they get older.
Figure: (a) Biological Neuron (b) Artificial Neuron (c) A McCulloch Pitts Unit
• McCulloch Pitts Neuron: Proposed a highly simplified computational model of the neuron. g aggregates
the inputs, and the function f takes a decision based on this aggregation. The inputs can be excitatory or
inhibitory.
Example: y = 0 if any x_i is inhibitory, else g(x_1, ..., x_n) = g(x) = Σ_{i=1}^{n} x_i and
y = f(g(x)) = 1 if g(x) ≥ θ, else 0
θ is called the thresholding parameter.
• A circle at the end indicates an inhibitory input: if any inhibitory input is 1, then the output will be 0.
• Linear separability(for boolean functions): There exists a line(plane) such that all inputs which produce a 1
lie on one side of the line(plane) and all inputs which produce a 0 lie on the other side of the line(plane).
• A single McCulloch Pitts Neuron can be used to represent boolean functions which are linearly separable.
• It can be trivially seen that θ = 0 for the Tautology(always ON) function.
Figure: (a) AND function (b) OR function (c) NOR function (d) NOT function
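• A minimal numerical sketch of McCulloch Pitts units realizing AND and OR through the threshold θ (the function and variable names here are illustrative):

```python
import numpy as np

def mp_neuron(x, theta, inhibitory=()):
    """McCulloch Pitts unit: output 0 if any inhibitory input is on,
    else 1 if the sum of the inputs reaches the threshold theta."""
    x = np.asarray(x)
    if any(x[i] == 1 for i in inhibitory):
        return 0
    return int(x.sum() >= theta)

# AND of 2 inputs: fires only when both inputs are 1 (theta = 2)
print([mp_neuron([a, b], theta=2) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
# OR of 2 inputs: fires when at least one input is 1 (theta = 1)
print([mp_neuron([a, b], theta=1) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 1]
```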
2.2 Perceptrons
• Classical Perceptron: Frank Rosenblatt proposed this model, which was refined and carefully analyzed
by Minsky and Papert.
• Main difference is the introduction of numerical weights for inputs and a mechanism for learning these
weights.
• y = 1 if Σ_{i=1}^{n} w_i × x_i ≥ θ, else 0.
If we move θ to the other side and set w_0 = −θ and x_0 = 1, then we have
y = 1 if Σ_{i=0}^{n} w_i × x_i ≥ 0, else 0.
• The difference between this and the MP neuron is that now we have weights that can be learned, and the
inputs can be real values.
• We are interested in finding the line wᵀx = 0 which divides the input space into two halves. The angle α between w and any point x which lies on the line must be 90° because cos α = wᵀx / (∥w∥ ∥x∥) = 0. So the vector w is perpendicular to every point on the line, and hence it is perpendicular to the line itself.
• For any point such that wT x > 0 we will have α < 90◦ and for any point such that wT x < 0 we will have
α > 90◦ .
• For x ∈ P, if wᵀx < 0 then the angle α between this x and the current w is greater than 90°, but we want it to be < 90°.
When we do w_new = w + x, α changes as follows:
cos α_new ∝ w_newᵀ x = (w + x)ᵀ x = wᵀx + xᵀx ∝ cos α + xᵀx
Since xᵀx > 0, cos α_new increases and the angle moves toward being < 90°, as desired.
Further we will normalize all the p’s so that ||p|| = 1, notice this does not change anything in our step.
Let w* be the normalized solution vector (we know one exists, though we do not know its value).
Now suppose at some time step t we inspect the point p_i and find that w_tᵀ p_i < 0; then we make the correction w_{t+1} = w_t + p_i.
Let β be the angle between w* and w_{t+1}; since ∥w*∥ = 1, we have cos β = (w* · w_{t+1}) / ∥w_{t+1}∥
Numerator = w* · w_{t+1} = w* · (w_t + p_i) = w* · w_t + w* · p_i ≥ w* · w_t + δ ≥ ... ≥ w* · w_0 + kδ, where δ = min_i (w* · p_i) > 0
Denominator² = ∥w_{t+1}∥² = (w_t + p_i) · (w_t + p_i) = ∥w_t∥² + 2 w_t · p_i + ∥p_i∥² ≤ ∥w_t∥² + ∥p_i∥² (since w_t · p_i < 0 when a correction is made)
≤ ∥w_t∥² + 1 ≤ (∥w_{t−1}∥² + 1) + 1 ≤ ... ≤ ∥w_0∥² + k
So, we have Numerator ≥ w* · w_0 + kδ and Denominator² ≤ ∥w_0∥² + k, which gives
cos β ≥ (w* · w_0 + kδ) / √(∥w_0∥² + k)
so cos β grows proportional to √k.
As k (the number of corrections) increases, this lower bound on cos β becomes arbitrarily large; but since cos β ≤ 1, k must be bounded.
Thus, there can only be a finite number of corrections (k) to w and the algorithm will converge!
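• A short sketch of the perceptron learning algorithm described above (adding x when a point from P is misclassified and subtracting it for a point from N); the OR data below is an illustrative linearly separable example:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """X: (N, d) inputs, y: (N,) labels in {0, 1}.
    A leading column of 1s absorbs the bias, i.e. w0 = -theta and x0 = 1."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        corrections = 0
        for xi, yi in zip(X, y):
            if yi == 1 and xi @ w < 0:      # x in P but w.x < 0  ->  w = w + x
                w += xi; corrections += 1
            elif yi == 0 and xi @ w >= 0:   # x in N but w.x >= 0 ->  w = w - x
                w -= xi; corrections += 1
        if corrections == 0:                # converged: every point classified correctly
            break
    return w

# The OR function is linearly separable, so the algorithm converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = train_perceptron(X, y)
print(w, [int(np.r_[1, x] @ w >= 0) for x in X])   # predictions match y
```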
For the XOR function, a single perceptron would have to satisfy all of the following conditions simultaneously:

x1  x2  XOR   Condition on the perceptron
0   0   0     w_0 + Σ_{i=1}^{2} w_i x_i < 0
1   0   1     w_0 + Σ_{i=1}^{2} w_i x_i ≥ 0
0   1   1     w_0 + Σ_{i=1}^{2} w_i x_i ≥ 0
1   1   0     w_0 + Σ_{i=1}^{2} w_i x_i < 0

These four inequalities cannot be satisfied simultaneously, so XOR is not linearly separable and cannot be represented by a single perceptron.
• Most real world data is not linearly separable and will always contain some outliers.
• How many boolean functions can we design from 2 inputs? There are 2^(2^2) = 16 of them.
Out of these, only XOR and !XOR are not linearly separable.
• In general, we can have 2^(2^n) boolean functions of n inputs.
Red edges indicate w = −1, and blue edges indicate w = +1. x_1 has weights −1, −1, +1, +1 and x_2 has weights −1, +1, −1, +1 from left to right respectively.
• Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing one hidden layer with 2^n perceptrons and one output layer containing 1 perceptron.
• Note: A network of 2^n + 1 perceptrons is not necessary but sufficient.
• Catch: As n increases the number of perceptrons in the hidden layers increases exponentially.
• We care about boolean functions, because we can model real world examples into boolean functions or
classification problems.
• The network we saw is formally known as Multilayer Perceptron(MLP) or more appropriately ”Multilayered
Network of Perceptrons”.
3 Sigmoid Neurons
3.1 What are they?
• The thresholding logic used by a perceptron is very harsh! It is a characteristic of the perceptron function
itself which behaves like a step function.
• For most real world applications we would expect a smoother decision function which gradually changes
from 0 to 1.
• Instead, we use the sigmoid function
y = 1 / (1 + exp(−(ω_0 + Σ_{i=1}^{n} ω_i x_i)))
• We no longer see a sharp transition around the threshold ω_0. Also, y now takes any real value in (0, 1).
• Learning algorithm: An algorithm for learning the parameters w of the model.
• Objective/Loss/Error function: To guide the learning algorithm. One possibility is Σ_{i=1}^{n} (ŷ_i − y_i)².
• The learning algorithm should aim to minimize the loss function.
• Let β be the angle between u and ∇_θ L(θ), and let k = ∥u∥ ∥∇_θ L(θ)∥; then we can write
−1 ≤ cos β = uᵀ ∇_θ L(θ) / (∥u∥ ∥∇_θ L(θ)∥) ≤ 1 ⟹ −k ≤ uᵀ ∇_θ L(θ) ≤ k
The most negative value, −k, is attained when cos β = −1, i.e. when the move u points in the direction opposite to the gradient. Hence gradient descent updates the parameters as θ = θ − η ∇_θ L(θ).
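• A minimal gradient descent sketch for a single sigmoid neuron with the squared-error loss above; the two toy data points and the learning rate η are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, 2.5])      # toy inputs (illustrative)
Y = np.array([0.2, 0.9])      # toy targets

def loss(w, b):
    return 0.5 * np.sum((sigmoid(w * X + b) - Y) ** 2)

def grad(w, b):
    y_hat = sigmoid(w * X + b)
    dz = (y_hat - Y) * y_hat * (1 - y_hat)   # chain rule through the sigmoid
    return np.sum(dz * X), np.sum(dz)

w, b, eta = 0.0, 0.0, 1.0
print("initial loss:", round(loss(w, b), 4))
for _ in range(1000):
    dw, db = grad(w, b)
    w, b = w - eta * dw, b - eta * db        # move opposite to the gradient (u = -grad L)
print("final loss:", round(loss(w, b), 4))   # much smaller than the initial loss
```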
3.4 Representation Power of a multilayer network of sigmoid neurons
• A multilayer network of neurons with a single hidden layer can be used to approximate any continuous
function to any desired precision.
• Wi ∈ Rn×n and bi ∈ Rn are the weight and bias between layers i − 1 and i (0 < i < L).
• WL ∈ Rk×n and bL ∈ Rk are the weight and bias between the last hidden layer and the output layer.
• The pre-activation at layer i is given by a_i = b_i + W_i h_{i−1} (with h_0 = x), and the parameters are updated with gradient descent, θ_{t+1} = θ_t − η ∇θ_t, where we have
θ = [W_1, ..., W_L, b_1, ..., b_L] and ∇θ_t = [∂L(θ)/∂W_{1,t}, ..., ∂L(θ)/∂W_{L,t}, ∂L(θ)/∂b_{1,t}, ..., ∂L(θ)/∂b_{L,t}]ᵀ
Figure 6: Feed Forward Network
• If we want to predict real values (regression), then we can take the output function to be a linear function:
f(x) = h_L = O(a_L) = W_O a_L + b_O
• Given that we are performing classification, then we can use cross entropy loss function. This is known as
negative log likelihood function.
−(1/N) Σ_{i=1}^{N} Σ_{j=1}^{k} [ y_{ij} log(ŷ_{ij}) + (1 − y_{ij}) log(1 − ŷ_{ij}) ]
• To ensure that ŷ is a probability distribution, an appropriate output function is the softmax function
ŷ_j = O(a_L)_j = e^{a_{L,j}} / Σ_{i=1}^{k} e^{a_{L,i}}
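• A sketch of the forward pass for such a network with sigmoid hidden activations, a softmax output, and the cross-entropy loss written in its one-hot form −Σ_j y_j log(ŷ_j); layer sizes and data are illustrative:

```python
import numpy as np

def forward(x, Ws, bs):
    """Ws, bs: weight matrices and bias vectors for layers 1..L (last layer is the output)."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        a = W @ h + b                       # pre-activation a_i = b_i + W_i h_{i-1}
        h = 1.0 / (1.0 + np.exp(-a))        # activation h_i = g(a_i), here the logistic function
    a_L = Ws[-1] @ h + bs[-1]               # output pre-activation
    e = np.exp(a_L - a_L.max())             # softmax (shifted for numerical stability)
    return e / e.sum()

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat + 1e-12))

rng = np.random.default_rng(0)
sizes = [4, 3, 3]                           # input dim 4, one hidden layer of 3, k = 3 classes
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = rng.normal(size=4), np.array([0.0, 1.0, 0.0])
y_hat = forward(x, Ws, bs)
print(y_hat.sum().round(3), cross_entropy(y, y_hat))   # probabilities sum to 1
```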
4.3 Backpropagation
• Intuitively, the gradient with respect to any parameter can be written, using the chain rule, as a product of local gradients along the path from the loss back to that parameter.
• For further discussion, the output function is considered to be softmax and the loss function is considered
to be cross entropy.
• Gradient w.r.t output layer: We start with the part ”Talk to the output layer”.
• Chain rule among multiple paths: If a function p(z) can be written as a function of intermediate
results qi (z) then we have
∂p(z)/∂z = Σ_m (∂p(z)/∂q_m) (∂q_m/∂z)
• Gradient w.r.t hidden units: Now we "talk to the hidden layers"; in this case p(z) is the loss function, z = h_{ij} and q_m(z) = a_{i+1,m}. We have
∂L(θ)/∂h_{ij} = Σ_{m=1}^{k} (∂L(θ)/∂a_{i+1,m}) W_{i+1,m,j}
• We can now write the full backpropagation pseudocode; a sketch is given below.
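• One possible rendering of that pseudocode, for sigmoid hidden units with a softmax output and cross-entropy loss (a sketch; variable names are illustrative):

```python
import numpy as np

def backprop(x, y, Ws, bs):
    """Gradients for an L-layer net: sigmoid hidden units, softmax output, cross-entropy loss."""
    # Forward pass, caching the activations h_0 = x, h_1, ..., h_{L-1}
    hs = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        hs.append(1.0 / (1.0 + np.exp(-(W @ hs[-1] + b))))
    a_L = Ws[-1] @ hs[-1] + bs[-1]
    y_hat = np.exp(a_L - a_L.max()); y_hat /= y_hat.sum()        # softmax output

    # "Talk to the output layer": for softmax + cross-entropy, dL/da_L = y_hat - y
    da = y_hat - y
    dWs, dbs = [None] * len(Ws), [None] * len(Ws)
    for i in range(len(Ws) - 1, -1, -1):
        dWs[i] = np.outer(da, hs[i])                             # dL/dW for layer i+1
        dbs[i] = da                                              # dL/db for layer i+1
        if i > 0:
            # "Talk to the hidden layer", then through the sigmoid: g'(a) = g(a)(1 - g(a))
            da = (Ws[i].T @ da) * hs[i] * (1 - hs[i])
    return dWs, dbs
```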
• g ′ for logistic function σ(z) is g(z)(1 − g(z)) and for tanh function is 1 − (g(z))2 .
• On flat surfaces, gradient descent moves very slowly. How do we solve this?
• We are not only considering the current gradient, but we are also giving some importance to past history. β
is typically less than 1, so we give decreasing importance to previous histories. We have
u_t = β u_{t−1} + ∇w_t = β² u_{t−2} + β ∇w_{t−1} + ∇w_t = ... = Σ_{τ=0}^{t} β^{t−τ} ∇w_τ
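• A sketch of the momentum update u_t = βu_{t−1} + ∇w_t, w_{t+1} = w_t − ηu_t (the toy objective, η and β here are illustrative):

```python
import numpy as np

def momentum_gd(w0, grad_fn, eta=0.1, beta=0.9, steps=300):
    w = np.asarray(w0, dtype=float)
    u = np.zeros_like(w)
    for _ in range(steps):
        u = beta * u + grad_fn(w)     # exponentially weighted history of past gradients
        w = w - eta * u
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3)
print(momentum_gd(np.array([0.0]), lambda w: 2 * (w - 3)))   # close to [3.]
```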
5.3 Stochastic vs Batch Gradient Descent
• Regular (batch) gradient descent goes over the entire data once before updating the parameters. This is the true gradient of the loss as derived earlier, so the theoretical guarantees hold.
• Imagine we have a million points in the training data, then it will take very long.
• In stochastic version, we update the parameters for every point in the data. Now if we have a million data
points then we will make a million updates in each epoch.
• However, this is an approximate gradient. Hence, we have no guarantee that each step will decrease the loss.
• So, going over a large data once is bad, going over a single point and updating is bad, but going over some
data points and updating is ok.
• This is the idea for mini-batch gradient descent.
• 1 epoch is one pass over the entire data, 1 step is one update of the parameters, N is the number of data
points, and B is the mini batch size.
• So, for regular gradient descent the number of steps equals the number of epochs (one update per pass over the data), for stochastic gradient descent the number of steps per epoch is N, and for mini-batch gradient descent the number of steps per epoch is N/B.
• What we need is for the learning rate to be small when gradient is high and vice versa.
• Tune the learning rate by trying different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0.
• Run a few epochs with each of these and figure out a learning rate which works best. Now do a finer search
around this value.
• Exponential Decay: η = η_0 e^{−kt}, where η_0 and k are hyperparameters and t is the step number. However, choosing k becomes complex.
• 1/t Decay: η = η_0 / (1 + kt); again choosing k is a bit tricky, and ideally we won't use these learning rates.
• For momentum the following schedule was suggested by Sutskever et al., 2013
β_t = min(1 − 2^{−1−log₂(⌊t/250⌋+1)}, β_max)
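• The decay schedules above written as plain functions (assuming the exponential form η = η_0 e^{−kt}; the η_0, k and β_max values are illustrative):

```python
import numpy as np

def exp_decay(t, eta0=0.1, k=0.01):
    return eta0 * np.exp(-k * t)                     # eta = eta0 * e^(-kt)

def inv_t_decay(t, eta0=0.1, k=0.01):
    return eta0 / (1 + k * t)                        # eta = eta0 / (1 + kt)

def momentum_schedule(t, beta_max=0.99):             # Sutskever et al., 2013
    return min(1 - 2 ** (-1 - np.log2(t // 250 + 1)), beta_max)

print(exp_decay(100), inv_t_decay(100), momentum_schedule(1000))
```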
6 Adaptive Learning Rates
• Intuition: Decay the learning rate for parameters in proportion to their update history(more updates
means more decay).
• Update rule for AdaGrad
v_t = v_{t−1} + (∇w_t)²
w_{t+1} = w_t − (η / √(v_t + ε)) ∇w_t
and similar set of equations for bt .
• Intuition: AdaGrad decays the learning rate very aggressively(as the denominator grows). As a result,
after a while, the frequent parameters will start receiving very small updates. To avoid this we can decay
the denominator.
• Update rule for RMSprop
v_t = β v_{t−1} + (1 − β)(∇w_t)²
w_{t+1} = w_t − (η / √(v_t + ε)) ∇w_t
This gives lower and lower weightage to previous gradients.
• RMSprop converges more quickly than AdaGrad by being less aggressive on decay.
• In RMSprop, the deltas start oscillating after a while, learning rate can increase, decrease or remain constant.
Whereas in AdaGrad learning rate can only decrease as the denominator constantly grows.
• In case of oscillations, setting η0 properly could solve the problem.
• Both are sensitive to initial learning rate, initial conditions of parameters and corresponding gradients.
• AdaDelta avoids setting initial learning rate η0
v_t = β v_{t−1} + (1 − β)(∇w_t)²
Δw_t = −(√(u_{t−1} + ε) / √(v_t + ε)) ∇w_t
w_{t+1} = w_t + Δw_t
u_t = β u_{t−1} + (1 − β)(Δw_t)²
Now the numerator, in the effective learning rate, is a function of past gradients. Observe that the ut that
we compute at t will be used only in the next iteration. Essentially one history runs behind the other.
• After some iterations, v_t will start decreasing and the ratio of the numerator to the denominator starts increasing. If the gradient remains low for subsequent time steps, then the learning rate grows accordingly.
• Therefore, AdaDelta allows the numerator to increase or decrease based on the current and past gradients.
• Intuition: Do everything that RMSprop does to solve the decay problem of AdaGrad, plus use a cumulative history of the gradients.
• Update rule for Adam(Adaptive Moments)
m_t = β₁ m_{t−1} + (1 − β₁) ∇w_t,    m̂_t = m_t / (1 − β₁^t)   ← incorporating classical momentum
v_t = β₂ v_{t−1} + (1 − β₂)(∇w_t)²,    v̂_t = v_t / (1 − β₂^t)
w_{t+1} = w_t − (η / (√v̂_t + ε)) m̂_t
Note that we are taking a running average of the gradients as mt .
• One way of looking at this is that we are interested in the expected value of the gradient and not on a
single point estimate computed at time t.
• m̂t and v̂t are bias correction terms, we do this to avoid very high learning rate update. It can also be
derived by taking expectation of mt .
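• A side-by-side sketch of the AdaGrad, RMSprop and Adam updates for a single parameter (the default hyperparameter values here are common choices and are illustrative):

```python
import numpy as np

def adagrad_step(w, grad, v, eta=0.01, eps=1e-8):
    v = v + grad ** 2                                  # accumulate all past squared gradients
    return w - eta / np.sqrt(v + eps) * grad, v

def rmsprop_step(w, grad, v, eta=0.001, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * grad ** 2              # exponentially decayed history
    return w - eta / np.sqrt(v + eps) * grad, v

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                 # first moment (running average of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2            # second moment
    m_hat = m / (1 - beta1 ** t)                       # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - eta / (np.sqrt(v_hat) + eps) * m_hat, m, v

w, m, v = 0.0, 0.0, 0.0
for t in range(1, 501):                                # minimize (w - 3)^2 with Adam
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, eta=0.1)
print(round(w, 2))                                     # approaches 3
```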
• Bias correction addresses a problem with exponential averaging: at early time steps the running averages m_t and v_t are biased toward zero (they are initialized at 0), which without correction can lead to very large updates.
• The general form of the L_p norm is (|x_1|^p + |x_2|^p + ... + |x_n|^p)^{1/p}; as p → ∞, the L_p norm boils down to taking the max value.
• In Adam, √v_t plays the role of an L_2 norm of the past gradients; why not replace it with the L_∞ norm? Therefore, we have
v_t = max(β v_{t−1}, |∇w_t|)
w_{t+1} = w_t − (η / (v_t + ε)) ∇w_t
• Intuition: We know NAG is better than momentum based GD, why not just incorporate it with Adam.
• We can rewrite NAG such that there is no t − 1 term, as
g_{t+1} = ∇w_t
m_{t+1} = β m_t + g_{t+1}
w_{t+1} = w_t − η (β m_{t+1} + g_{t+1})
• It is common for new learners to start out using off the shelf optimizers, which are later replaced by custom
designed ones.
• Cyclical Learning Rate: Suppose the loss surface has a saddle point. Suppose further that the parameters
are initialized, and the learning rate is decreased exponentially. After some iterations, the parameters will
reach near the saddle point. Since, the learning rate has decreased exponentially, the algorithm has no way
of coming out of the saddle point.
What if we allow the learning rate to increase again after some iterations?
• Rationale: Often, difficulty in minimizing the loss arises from saddle points rather than poor local minima.
• A simple way to solve this is to vary the learning rate cyclically. One example is triangular learning rate.
• Cosine Annealing (Warm Re-Start): also based on a cyclical learning rate.
• Warm-start: Using a low initial learning rate helps the model warm up and converge better.
7 Regularization
7.1 Introduction to Bias and Variance
• Deep learning models typically have BILLIONS of parameters whereas the training data may have only
MILLIONS of samples. Therefore, they are called over-parameterized models.
• Over-parameterized models are prone to a phenomenon called over-fitting.
• Simple models trained on different samples of the data do not differ much from each other. However, they
are very far from the true sinusoidal curve (under fitting).
• Bias: The difference between the expected value of the predictions and the true value, Bias(f̂(x)) = E[f̂(x)] − f(x).
Simple models will have high bias and complex models will have low bias.
• Variance is defined as Variance(f̂(x)) = E[(f̂(x) − E[f̂(x)])²].
Roughly speaking, it tells us how much the different f̂(x)'s trained on different samples of the data differ from each other. Simple models have a low variance whereas complex models have a high variance.
                Bias    Variance
Simple Model    High    Low
Complex Model   Low     High
• There is always a trade-off between the bias and variance. Both contribute to the MSE loss.
• Consider a new point (x, y) which was not seen during training. If we use the model f̂(x) to predict the value of y, then the mean squared error is E[(y − f̂(x))²] (MSE); we can show that it decomposes as
E[(y − f̂(x))²] = Bias² + Variance + σ² (the irreducible error due to noise in y)
• The parameters of f̂(x) are trained using a training set. However, at test time we are interested in evaluating the model on a validation (unseen) set which was not used for training. This gives rise to two quantities, train_err and test_err. Typically, as model complexity grows, train_err keeps decreasing while test_err follows a U-shaped (parabola-like) curve.
• A complex model is more sensitive to minor changes in the data than a simple model. Hence, while training, instead of minimizing just the training error we should minimize train_err + Ω(θ), where Ω(θ) is high for complex models and small for simple models.
• This is the basis for all the regularization methods.
7.2 Regularization Methods
• For L2 Regularization define the loss as
L̄(w) = L(w) + (α/2) ∥w∥²
and the gradient then becomes
∇L̄(w) = ∇L(w) + αw
So it is very easy to implement this.
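• A minimal sketch of the resulting update, assuming some gradient grad_L of the unregularized loss is available:

```python
import numpy as np

def l2_regularized_step(w, grad_L, eta=0.1, alpha=0.01):
    """Gradient of L_bar(w) = L(w) + (alpha/2)||w||^2 is grad_L(w) + alpha * w."""
    return w - eta * (grad_L + alpha * w)

w = np.array([2.0, -1.0, 0.5])
# Even with a zero loss gradient, the regularizer keeps shrinking the weights toward 0
print(l2_regularized_step(w, grad_L=np.zeros(3)))
```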
• It essentially pushes the weights closer to 0; non-essential weights end up much closer to 0 than essential weights.
• Data Augmentation: Consider images, rotate them, shift them, blur them, change some pixels, etc.
• We exploit the fact that certain transformations to the image do not change the label of the image.
• Typically, more data = better learning.
• Augmentation works well for image classification/object recognition tasks.
• It has been shown that it works well for speech as well.
• Parameter Sharing is used in CNNs, essentially same filter is applied at different positions of the image,
or same weight matrix acts on different input neurons.
• This is also typically used in autoencoders, where the encoder and decoder weights can be tied.
• Injecting Noise at input: We can show that for a simple input output neural network, adding Gaussian
noise to the input is equivalent to weight decay(L2 regularization).
• We essentially shift the input by a slight amount at every epoch, so the model sees a different input at
every epoch making it harder to over fit.
• This can also be viewed as data augmentation.
• We can similarly inject noise at output as well.
• Early stopping: Have a patience parameter p, after p steps if validation error does not decrease then we
stop and return the model p steps before.
• This is very effective and widely used, it can be used even with other regularizers.
• Ensemble Methods: Combine the output of different models to reduce generalization error. These models
can correspond to different classifiers, it could be different instances of the same classifier trained with
different hyperparameters, features, samples etc.
• Bagging: Form an ensemble using different instances of the same classifier. For a given dataset, construct
multiple training sets by sampling with replacement.
• Bagging would work well when the errors of the model are independent or uncorrelated, if they are correlated
then the mse of the ensemble is as bad as the individual model.
• On average, the ensemble will perform at least as well as its individual members.
• Training several neural networks for making an ensemble is prohibitively expensive.
• Dropout: Refers to dropping out units, temporarily remove a node and all its incoming and outgoing
connections resulting in a thinned network.
• Each node is retained with a fixed probability, typically p = 0.5 for hidden nodes and p = 0.8 for visible (input) nodes.
• We initialize all the parameters of the network and start training. For the first training instance(or
mini-batch), we apply dropout resulting in the thinned network. Then update only those parameters that
are active.
• For the second training instance, we again apply dropout resulting in a different thinned network. We again
compute the loss and back propagate.
• Each thinned network rarely gets trained, but parameter sharing ensures that no model has untrained or poorly trained parameters.
• Dropout essentially applies a masking noise to the hidden units. Prevents hidden units from coadapting.
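• A sketch of dropout applied to a layer's activations during training; the inverted scaling by 1/p used here is one common convention (an assumption — the original formulation instead scales the weights at test time):

```python
import numpy as np

def dropout(h, p=0.5, training=True, rng=None):
    """Each unit is retained with probability p; dropped units are set to 0."""
    if not training:
        return h                                   # no masking at test time
    rng = rng or np.random.default_rng(0)
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask / p                            # rescale so the expected activation is unchanged

h = np.ones(10)
print(dropout(h, p=0.5))                           # roughly half the units are zeroed, survivors become 2.0
```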
8 Deep Learning Revival
8.1 Unsupervised Pre-Training
• We will first train the weights between the layers using an unsupervised objective, essentially we will try
to reconstruct x.
• If we are able to do this, it would mean that the hidden layers are capturing all the important information
of the image. We will be learning one layer at a time.
• h1 will try to reconstruct x, h2 will try to reconstruct h1 , and so on...
• After this layerwise pre-training, we add the output layer and train the whole network.
• In effect we have initialized the weights of the network using the greedy unsupervised objective and are now
fine-tuning these weights using the supervised objective.
• This helps in optimization and regularization.
• Unsupervised objective ensures that the learning is not greedy w.r.t the supervised objective.
• Some other experiments have also shown that pre-training is more robust to random initializations.
• This led people to think that deep networks are sensitive to initial weights, and that better initial weights might lead to better networks.
• tanh: tanh(x) compresses all its input to the range [−1, 1], so it is zero centered. Its derivative is ∂tanh(x)/∂x = 1 − tanh²(x), but we still face the issue of vanishing gradients, and it is still computationally expensive.
• Rectified Linear Unit: ReLU(x) = max(0, x) does not saturate in the positive region, is computationally efficient, and in practice converges much faster than sigmoid and tanh.
The derivative is 1 for x > 0 and 0 otherwise, so a neuron whose pre-activation stays negative receives no updates, which makes the neuron dead.
In practice a large fraction of ReLU units can die if the learning rate is set too high.
It is advised to initialize the bias to a positive value.
• Leaky ReLU: f (x) = max (0.1x, x), no saturation, computationally efficient, close to zero centered outputs,
and will not die.
Figure: (a) Leaky ReLU Function (b) Parametric ReLU Function (c) ELU Function
• Model Averaging: Train k models with independent weights(w1 , w2 , ..., wk ) on independently sampled
data points from the original dataset. Each model is expected to be good at predicting a subset of training
samples well.
During inference, the prediction is done by averaging the predictions from all the models, it is often
arithmetic or geometric mean for regression problems.
• In bagging, the subset of training samples seen by each of the models does not change across epochs.
Therefore, the model weights get optimized for those samples.
• Each sub-model sees only some parts of the training data, therefore we want updates to be larger.
• We don’t have this luxury in case of dropout.
• MaxOut: In a way is a variation of Dropout, but instead of having probability, we take say 3 neurons in a
layer and propagate the max of them. Can also be thought of as a generalization of ReLU and Leaky ReLU.
Two MaxOut neurons with sufficient number of affine transformations, act as a universal approximator.
• Gaussian Error Linear Unit: GELU(x) = m·x where m ∼ Bernoulli(Φ(x)); in expectation this gives GELU(x) = x Φ(x).
• Swish: x σ(βx); taking β = 1.702 gives a close approximation of GELU, and taking β = 1 gives SiLU (Sigmoid-weighted Linear Unit).
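• Sketches of these activations (the erf-based GELU and the β values follow the definitions in this section; scipy is assumed to be available for erf):

```python
import numpy as np
from scipy.special import erf

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)

def gelu(x):
    return x * 0.5 * (1 + erf(x / np.sqrt(2)))     # x * Phi(x), the expected value of m*x

def swish(x, beta=1.702):
    return x / (1 + np.exp(-beta * x))             # x * sigmoid(beta * x); beta = 1 gives SiLU

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), gelu(x).round(3), swish(x, beta=1.0).round(3))
```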
9 Convolutional Neural Networks
• The 2-D convolution of an image I with an m × n kernel K is given by
(I ∗ K)(i, j) = Σ_{a=0}^{m−1} Σ_{b=0}^{n−1} I(i − a, j − b) K(a, b)
• For practical implementations most libraries implement correlation operator rather than convolution.
Figure 11: Convolution Example
• When we convolve a filter with an image the resulting image is called a feature map.
• If we use multiple such filters we will get multiple feature maps, and then we stack them, this number will
in a way act as the number of color channels.
• Let image be of size W1 × H1 × D1 , let the number of filters be K, let the output be of size W2 × H2 × D2 ,
and let stride be S and spatial extent F . The filters are of size F × F × D1 .
W2 = (W1 − F)/S + 1,    H2 = (H1 − F)/S + 1,    D2 = K
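• A sketch of the output-size computation and of a naive (valid, unpadded) correlation, as implemented by most libraries per the note above; sizes are illustrative:

```python
import numpy as np

def conv_output_size(W1, H1, D1, K, F, S):
    """W2 = (W1 - F)/S + 1,  H2 = (H1 - F)/S + 1,  D2 = K."""
    return (W1 - F) // S + 1, (H1 - F) // S + 1, K

def correlate2d(I, Kern, S=1):
    """Naive cross-correlation of a single-channel image with one F x F filter."""
    H1, W1 = I.shape
    F = Kern.shape[0]
    H2, W2 = (H1 - F) // S + 1, (W1 - F) // S + 1
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = np.sum(I[i * S:i * S + F, j * S:j * S + F] * Kern)
    return out

print(conv_output_size(227, 227, 3, K=96, F=11, S=4))                      # (55, 55, 96), e.g. AlexNet's first layer
print(correlate2d(np.arange(16.0).reshape(4, 4), np.ones((2, 2))).shape)   # (3, 3)
```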
• Pooling essentially helps lower the size of the feature maps, and extract only essential or important features.
• A simple CNN will have (Convolution→Pooling)×2 layers, then a normal ANN.
• Pooling layers have no weights attached to them, so during backpropagation they simply pass the gradient through to the previous layer.
• For Max pooling, the gradient from the next layer is propagated only to the location of the maximum value
in the pooling window.
• For Average pooling, the gradient from the next layer is distributed equally to all elements in the pooling
region because each element contributes equally to the average.
• For convolution, the backpropagation update would be convolving the input(X) to the convolution layer
with the loss or gradient values given by the pooling layer.
• And when we want to propagate the gradient we will convolve weights(W ) instead of X.
9.1 Different Architectures
• AlexNet: Input size is of 227 × 227 × 3, the architecture can be seen below. Approx 27M parameters.
• ZFNet: Input size is the same, the architecture is basically the same as AlexNet except the three consecutive
convolution layers have filters 512, 1024, 512 respectively.
• VGGNet: Input is again the same, the key idea in this is kernel size is always 3 × 3.
• GoogLeNet: Key idea is to have filters of multiple sizes to capture features at varying ranges.
• ResNet: Train a shallow network, then add layers and train a deeper network; the deeper network should do at least as well as the shallow one. We can ensure this by adding skip (identity) connections, which make it easy for the added layers to learn the identity mapping.
10 Word Representations
10.1 Direct Representations
• Corpus: Group of documents.
• Consider the set V of all unique words across all input documents; V is called the vocabulary of the corpus. We need a representation for every word in V.
• V tends to be very large: 50K for PTB, 13M for the Google 1T corpus.
• These representations do not capture any notion of similarity.
√
• With one hot representations, the Euclidean distance between any two words is 2 and the cosine similarity
between any two words is 0.
• We want representations of cat and dog to be closer to each other than the representations of cat and truck.
• These are also called sparse representations.
• A co-occurrence matrix is a terms × terms matrix which captures the number of times a term appears in the context of another term.
• The context is defined as a window of k words around the term.
• With k = 2, the co-occurrence matrix for the corpus is built as described below.
• Essentially for the word w1 , we check a window of k words around it and count how many times other
words appear.
• This is also known as a word × context matrix.
• We could choose the set of words and contexts to be same or different.
• Each row (column) of the co-occurrence matrix gives a representation of the corresponding word (context).
• Stop words (a, the, for, etc.) are very frequent, so these counts will be very high.
• We can solve this by ignoring very frequent words, or by using a threshold t: if a (word, context) count goes above the threshold, we cap it at t.
• Instead of raw counts we can also use PMI:
PMI(w, c) = log( p(c|w) / p(c) ) = log( (count(w, c) · N) / (count(c) · count(w)) )
• PMI is −∞ when count(w, c) = 0. We can solve this by taking the PMI value as 0 when count(w, c) = 0 and using the formula when count(w, c) > 0; this is called PMI_0(w, c).
• Another solution is to replace the condition count(w, c) > 0 with PMI(w, c) > 0, i.e. keep only positive PMI values. This is called Positive PMI or PPMI.
• This is however still very large and very sparse, and it grows with vocabulary.
• Consider X = X_PPMI of size m × n; we can use singular value decomposition to get a rank-k approximation:
X̂_{m×n} = U_{m×k} Σ_{k×k} Vᵀ_{k×n}
• SVD gives the best rank k approximation of the original data, Discovers latent semantics in the corpus.
• SVD will essentially extract the most important features from the matrix.
W_word = UΣ ∈ R^{m×k}
This is the representation of the m words in the vocabulary, and W_context = V is taken as the representation of the context words.
• This idea comes from the fact that X̂X̂ᵀ and W_word W_wordᵀ are the same matrices, i.e., they roughly capture the same cosine similarity.
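• A sketch going from a small co-occurrence count matrix to PPMI and then to rank-k word vectors W_word = UΣ via SVD (the toy counts are illustrative):

```python
import numpy as np

def ppmi(C):
    """C: (words x contexts) co-occurrence counts."""
    N = C.sum()
    pw = C.sum(axis=1, keepdims=True) / N
    pc = C.sum(axis=0, keepdims=True) / N
    with np.errstate(divide="ignore"):
        pmi = np.log((C / N) / (pw * pc))
    return np.maximum(pmi, 0)                        # Positive PMI: negative / -inf values become 0

def svd_word_vectors(X, k):
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * S[:k]                          # W_word = U Sigma, one k-dim row per word

C = np.array([[0, 4, 2], [4, 0, 3], [2, 3, 0]], dtype=float)   # toy co-occurrence counts
W = svd_word_vectors(ppmi(C), k=2)
print(W.shape)                                       # (3, 2): 3 words, 2-dimensional vectors
```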
• These methods are called count based models because they use the co-ocurrence counts of words.
• We can model this problem using a feedforward neural network with one hidden layer. This network will
take input a one-hot representation of the context word and output a probability vector of words.
• Parameters are Wcontext ∈ Rk×|V | and Wword ∈ R|V |×k , and we use softmax to get probabilities.
• The product W_context x is simply the i-th column of W_context, given that x is a one-hot vector.
• When the ith word is present the ith element in the one hot vector is ON and the ith column of Wcontext
gets selected.
• There is a one to one correspondence between the words and columns of Wcontext .
• We can treat the ith column of Wcontext as the representation of the context i.
• P (word = i|v) depends on the ith column of Wword .
• We thus treat the ith column of Wword as the representation of the word i.
• Consider the context word with index c and the correct output with index w
uc is the column of Wcontext corresponding to context word c and vw is the column of Wword corresponding
to the word w.
• Given a context vector u_c, if we are predicting the word w then the update rule for v_w derived from backpropagation is
v_w = v_w + η u_c (1 − ŷ_w)
• With multiple context words we can simply concatenate all the one-hot representations, and the first layer weight matrix becomes [W_context, W_context, ... d times], which is very large. Instead, we simply add the columns whose bits are on:
h = Σ_{i=1}^{d−1} u_{c_i}, where u_{c_i} is the column of W_context whose bit was on
• Negative Sampling
• Let D be the set of all correct (w, c) pairs in the corpus
• Let D′ be the set of all incorrect (w, r) pairs in the corpus
• D′ can be constructed by randomly sampling a context word r which has never appeared with w and
creating a pair (w, r)
• We are interested in maximizing
Π_{(w,c)∈D} P(z = 1|w, c) · Π_{(w,r)∈D′} P(z = 0|w, r)
or equivalently, taking logs, maximizing
Σ_{(w,c)∈D} log σ(v_cᵀ v_w) + Σ_{(w,r)∈D′} log σ(−v_rᵀ v_w)
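• A sketch of the negative-sampling objective for one (w, c) pair with a few sampled negative contexts (the random vectors stand in for learned embeddings):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_loss(v_w, u_c, u_negs):
    """Negative of: log sigma(u_c . v_w) + sum_r log sigma(-u_r . v_w)."""
    pos = np.log(sigmoid(u_c @ v_w))                            # pull the true context closer
    neg = sum(np.log(sigmoid(-u_r @ v_w)) for u_r in u_negs)    # push sampled negatives away
    return -(pos + neg)

rng = np.random.default_rng(0)
d = 5
v_w, u_c = rng.normal(size=d), rng.normal(size=d)
u_negs = rng.normal(size=(3, d))                                # 3 randomly sampled negative contexts
print(neg_sampling_loss(v_w, u_c, u_negs))
```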
• There exists a unique path from the root node to a leaf node.
• Let l(w1 ), l(w2 ), ..., l(wp ) be the nodes on the path from root to w
• Let π(w) be a binary vector such that
π(w)_k = 1 if the path branches left at node l(w_k), and 0 otherwise
• We model
P(π(on)_i = 1) = 1 / (1 + e^{−v_cᵀ u_i}) and P(π(on)_i = 0) = 1 − P(π(on)_i = 1)
• Note that p(w|vc ) can now be computed using |π(w)| computations instead of |V | required by softmax
• Turns out that even a random arrangement of the words on leaf nodes does well in practice.
• Another approach learns word vectors by directly fitting the co-occurrence counts, minimizing Σ_{i,j} (v_iᵀ v_j + b_i + b_j − log X_{ij})², where v_iᵀ v_j + b_i + b_j is the predicted value using model parameters and log X_{ij} is the actual value calculated from the given corpus.
• One drawback is that this weighs all co-occurrences equally.
10.5 Evaluating Word Representations
• Semantic Relatedness
1. Ask humans to judge the relatedness between a pair of words
2. Compute the cosine similarity between the corresponding word vectors learned by the model
3. Given many such word pairs, compute the correlation between Smodel & Shuman , and compare different
models
4. Model 1 is better than Model 2 if correlation(Smodel1 , Shuman ) > correlation(Smodel2 , Shuman )
• Synonym Detection
1. Given: a term and four candidate synonyms
2. Pick the candidate which has the largest cosine similarity with the term
3. Compute the accuracy of different models and compare
11 Sequential Learning
11.1 Introduction
• In many applications the input is not of a fixed size
• Further successive inputs may not be independent of each other
• We need to look at a sequence of (dependent) inputs and produce an output (or outputs)
• Each input corresponds to one time step
• Consider the task of predicting the part-of-speech tag (noun, adverb, adjective, verb) of each word in a sentence.
• Once we see an adjective (social) we are almost sure that the next word should be a noun (man)
• Thus the current output (noun) depends on the current input as well as the previous input
• Further, the size of the input is not fixed (sentences could have an arbitrary number of words).
• Notice that here we are interested in producing an output at each time step
• Each network is performing the same task (input : word, output : tag)
• Sometimes we may not be interested in producing an output at every stage
• Instead we would look at the full sequence and then produce an output
• Consider the task of predicting the polarity of a movie review.
• The prediction clearly does not depend only on the last word but also on some words which appear before
• Here again we could think that the network is performing the same task at each step (input : word, output
: +/−) but it’s just that we don’t care about intermediate outputs
• Sequences could be composed of anything (not just words)
si = σ(U xi + b)
yi = O(V si + c)
• We just create multiple copies of the network and execute them at each timestep
• Suppose instead that we feed all previous inputs to the current network; this is infeasible because the network would no longer compute the same function at each time step, and hence parameter sharing could not take place.
• The solution is to add a recurrent connection in the network
si = σ(U xi + W si−1 + b)
yi = O(V si + c)
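• A sketch of unrolling the recurrent equations above over a sequence, taking O to be softmax (dimensions and the random inputs are illustrative):

```python
import numpy as np

def rnn_forward(xs, U, W, V, b, c, s0):
    """s_i = sigma(U x_i + W s_{i-1} + b),  y_i = softmax(V s_i + c)."""
    s, ys = s0, []
    for x in xs:
        s = 1.0 / (1.0 + np.exp(-(U @ x + W @ s + b)))   # the same U, W, b are reused at every step
        o = V @ s + c
        e = np.exp(o - o.max())
        ys.append(e / e.sum())
    return ys

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 4, 3, 2, 5
U = rng.normal(size=(d_hid, d_in))
W = rng.normal(size=(d_hid, d_hid))
V = rng.normal(size=(d_out, d_hid))
b, c, s0 = np.zeros(d_hid), np.zeros(d_out), np.zeros(d_hid)
ys = rnn_forward(rng.normal(size=(T, d_in)), U, W, V, b, c, s0)
print(len(ys), ys[0].round(3))                           # one output distribution per time step
```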
• The total loss is simply the sum of loss over all time steps.
• The gradient of V and U is straightforward backpropagation.
• However, gradient of W will be a chain of gradients going through time, aka backpropagation through time.
• For W there is a high chance of vanishing/exploding gradients; in practice we restrict backpropagation to τ time steps (truncated backpropagation through time).
• The state (si ) of an RNN records information from all previous time steps
• At each new timestep the old information gets morphed by the current input
• One could imagine that after t steps the information stored at time step t − k gets completely morphed so
much that it would be impossible to extract the original information stored at time step t − k.
• A similar problem occurs when the information flows backwards
• It is very hard to assign the responsibility of the error caused at time step t to the events that occurred at
time step t − k
• We have computed a state s_{t−1} at time step t − 1, and now we want to overload it with new information (x_t) and compute a new state (s_t).
• We introduce a vector ot−1 which decides what fraction of each element of st−1 should be passed to the
next state
• Each element of ot−1 gets multiplied with the corresponding element of st−1
• Each element of o_{t−1} is restricted to be between 0 and 1.
The parameters W_o, U_o, b_o used to compute this gate need to be learned.
• ot is called the output gate as it decides how much to pass (write) to the next time step
• Now we calculate the candidate (new) state information from the previous state and the current input.
• However, we may not want to use all this new information and would rather only selectively read from it before constructing the new cell state s_t.
• To do this we introduce another gate called the input gate
i_t = σ(W_i h_{t−1} + U_i x_t + b_i)
and then take an elementwise (Hadamard) product with the candidate to obtain the selectively read new state information i_t ⊙ s̃_t.
• To selectively forget, we introduce another forget gate
ft = σ(Wf ht−1 + Uf xt + bf )
• Now we have the full set of equations for LSTM. The gates are
o_t = σ(W_o h_{t−1} + U_o x_t + b_o)
i_t = σ(W_i h_{t−1} + U_i x_t + b_i)
f_t = σ(W_f h_{t−1} + U_f x_t + b_f)
and the states are computed as
s̃_t = σ(W h_{t−1} + U x_t + b)
s_t = f_t ⊙ s_{t−1} + i_t ⊙ s̃_t
h_t = o_t ⊙ σ(s_t)
• Another variant of LSTM is Gated Recurrent Unit, which has the following set of equations
ot = σ(Wo st−1 + Uo xt + bo )
it = σ(Wi st−1 + Ui xt + bi )
s̃t = σ(W (ot ⊙ st−1 ) + U xt + b)
st = (1 − it ) ⊙ st−1 + it ⊙ s̃t
• No explicit forget gate, the forget gate and input gates are tied.
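• A sketch of one GRU step implementing the four equations above (the notes use σ for the candidate state, which is kept here; dimensions are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(s_prev, x, Wo, Uo, bo, Wi, Ui, bi, W, U, b):
    o = sigmoid(Wo @ s_prev + Uo @ x + bo)            # gate on the previous state
    i = sigmoid(Wi @ s_prev + Ui @ x + bi)            # input gate
    s_tilde = sigmoid(W @ (o * s_prev) + U @ x + b)   # candidate state
    return (1 - i) * s_prev + i * s_tilde             # forget gate is tied to the input gate

rng = np.random.default_rng(0)
d_h, d_x = 3, 4
mat = lambda m, n: rng.normal(size=(m, n))
s = gru_step(np.zeros(d_h), rng.normal(size=d_x),
             mat(d_h, d_h), mat(d_h, d_x), np.zeros(d_h),   # Wo, Uo, bo
             mat(d_h, d_h), mat(d_h, d_x), np.zeros(d_h),   # Wi, Ui, bi
             mat(d_h, d_h), mat(d_h, d_x), np.zeros(d_h))   # W, U, b
print(s.round(3))
```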
• During forward propagation, the gates control the flow of information
• They prevent any irrelevant information from being written to the state
• Similarly, during backward propagation they control the flow of gradients
• It is easy to see that during backward pass the gradients will get multiplied by the gate
• If the state at time t − 1 did not contribute much to the state at time t then during backpropagation the
gradients flowing into st−1 will vanish
• But this kind of vanishing gradient is fine, since st−1 did not contribute to st we don’t want to hold it
responsible for the crimes of st
12 Encoder-Decoder Models
12.1 Introduction
• What if we want to generate a sentence given an image?
• We are now interested in P (yt |yt−1 , I) instead of P (yt |yt−1 ), where I is an image.
• We could now model P(y_t = j|y_{t−1}, I) as P(y_t = j|s_t, fc7(I)), where fc7(I) is the representation of the image obtained from the fc7 layer of a convolutional neural network.
• There are many ways of making P (yt = j) conditional on fc7 (I)
• Option 1: Set s0 = fc7 (I)
• Now s0 and hence all subsequent st ’s depend on fc7 (I)
• We can thus say that P (yt = j) depends on fc7 (I)
• Option 2: Another, more explicit, way of doing this is to feed fc7(I) as an additional input to the decoder at every time step.
Figure 17: Textual Entailment
12.2 Attention Mechanism
• Encoder decoder models can be made even more expressive by adding an “attention” mechanism
• At each time-step we should feed only this relevant information (i.e. encodings of relevant words) to the
decoder
• We could just take a weighted average of the corresponding word representations and feed it to the decoder.
α_{jt} = exp(e_{jt}) / Σ_{j′=1}^{M} exp(e_{j′t})
• αjt denotes the probability of focusing on the j th word to produce the tth output word.
• One possible choice for f_ATT is
e_{jt} = V_attᵀ tanh(U_att s_{t−1} + W_att h_j)
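• A sketch of computing e_{jt}, the attention weights α_{jt}, and the weighted context vector (dimensions and parameters are illustrative):

```python
import numpy as np

def attention_context(s_prev, H, U_att, W_att, V_att):
    """H: (M, d_h) encoder states;  e_j = V_att^T tanh(U_att s_prev + W_att h_j)."""
    e = np.array([V_att @ np.tanh(U_att @ s_prev + W_att @ h) for h in H])
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()        # softmax over the M input words
    return alpha, alpha @ H            # weighted average of the encoder states

rng = np.random.default_rng(0)
M, d_h, d_s, d_a = 6, 4, 5, 3
H = rng.normal(size=(M, d_h))
U_att, W_att, V_att = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a)
alpha, context = attention_context(rng.normal(size=d_s), H, U_att, W_att, V_att)
print(alpha.round(2), context.shape)   # the weights sum to 1; context has the encoder-state dimension
```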
• We are essentially asking the model to approach the problem in a better, more natural way
• We can check whether the attention model learns something meaningful by plotting the attention weights
as a heatmap.
• In the case of text, we have a representation for every location of the input sequence.
• But for images, we typically use representation from one of the fully connected layers. This representation
does not contain any location information.
• Instead of the fc7 representation, we use the output of one of the convolution layers, which has spatial
information.
• For example, the output of the 5th convolutional layer of VGGNet is a 14 × 14 × 512 size feature map. We
could think of this as 196 locations, each having a 512 dimensional representation. The model will then
learn an attention over these locations.
12.3 Hierarchical Attention
• Consider a dialog between a user (u) and a bot (B)
• The dialog contains a sequence of utterances between the user and the bot
• Each utterance in turn is a sequence of words
• Thus what we have here is a “sequence of sequences” as input.
• We could think of a two level hierarchical RNN encoder
• The first level RNN operates on the sequence of words in each utterance and gives us a representation
• We now have a sequence of utterance representations (red vectors in the image)
• We can now have another RNN which encodes this sequence and gives a single representation for the sequence of utterances.
• The decoder can then produce an output sequence conditioned on this utterance representation.
• Consider another example of classification of documents, each document is a sequence of sentences and
each sentence is in turn a sequence of words.
• We need attention at two levels
First, we need to attend to important (most informative) words in a sentence.
Then we need to attend to important (most informative) sentences in a document.
13 Transformer Models
• In the encoder-decoder architecture, we have to wait until s_{t−1} is calculated before we can calculate s_t.
• The idea is to calculate all these in parallel, instead of waiting.
13.1 Self-Attention
• Inputs are vectors, word embeddings, and the outputs are also vectors.
• Given a word in a sentence, we want to compute the relational score between the word and the rest of the
words in the sentence, such that the score is higher if they are related contextually.
• Now both the vectors s_i and h_j, for all i, j, are available at all times.
• The score function we used was
score(s_{t−1}, h_j) = V_attᵀ tanh(U_att s_{t−1} + W_att h_j)
• There are three vectors involved in computing the score at each time step.
• We get them by transforming hj
• qj = WQ hj , called the query vector
• kj = WK hj called the key vector
• vj = WV hj called the value vector
• We will first calculate score(q1 , kj ), e1j = [q1 · k1 , q1 · k2 , ..., q1 · k5 ]
• Then we get α_{1j} = softmax(e_{1j}); finally we get z_1 = Σ_{j=1}^{5} α_{1j} v_j
• Query and key vector participate in calculating α, and value vector will participate in getting final output.
• All query, key and value vectors can be computed in one go, as follows
Z = [z_1, z_2, ..., z_T] = softmax( QᵀK / √d_k ) Vᵀ
where d_k is the dimension of the key vector.
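• A sketch of scaled dot-product self-attention as written above, with one word embedding per column of H so that Z = softmax(QᵀK/√d_k)Vᵀ (dimensions are illustrative):

```python
import numpy as np

def softmax_rows(A):
    e = np.exp(A - A.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """H: (d_model, T), one word embedding per column."""
    Q, K, V = W_Q @ H, W_K @ H, W_V @ H            # all query, key and value vectors in one go
    d_k = K.shape[0]
    scores = Q.T @ K / np.sqrt(d_k)                # scores[i, j] = q_i . k_j / sqrt(d_k)
    alpha = softmax_rows(scores)                   # attention weights for each position i
    return alpha @ V.T                             # Z: row i is z_i = sum_j alpha_ij v_j

rng = np.random.default_rng(0)
d_model, d_k, d_v, T = 8, 4, 4, 5
H = rng.normal(size=(d_model, T))
W_Q, W_K = rng.normal(size=(d_k, d_model)), rng.normal(size=(d_k, d_model))
W_V = rng.normal(size=(d_v, d_model))
print(self_attention(H, W_Q, W_K, W_V).shape)      # (5, 4): one z vector per input word
```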
• The outputs are then passed to a feed forward network, this is our usual ANN.
• The computation is parallelized along the horizontal direction of the encoder stack (i.e., across positions within a training sample), not along the vertical direction (across stacked layers).
Figure 31: Decoder Block
• We don't know how long the output is going to be, hence we consider a maximum length T1.
• At the beginning only <GO> makes sense; all other inputs will be junk, hence we use "masked" self-attention. Essentially, we set the attention weights α for all positions beyond the current decoded word to 0.
• Masking is done by inserting negative infinity at the respective positions before the softmax, so that those positions receive zero attention weight after exponentiation.
• We will also use masking in cross attention.
• We take the values coming from the encoder to get K and V, and use the output of the decoder's self-attention as Q; then cross-attention proceeds exactly like self-attention.