DL Module 2

This document covers the training, optimization, and regularization of Deep Neural Networks (DNN), focusing on various activation functions such as Sigmoid, Tanh, ReLU, and Softmax, along with their advantages and disadvantages. It also discusses loss functions, including Mean Squared Error and Cross Entropy, and the importance of choosing the right activation and loss functions based on the type of problem being solved. Additionally, the document explains the backpropagation algorithm and gradient descent optimization techniques used to update weights in neural networks.

MODULE-2

Training, Optimization & Regularization of DNN


Training of DNN
Learning factors
▪ Network Architecture (number of hidden layers)
▪ Initial weights
▪ Choice of Activation Function
▪ Learning rate / learning constant
▪ Loss Function / Cost Function
▪ Epochs and Batch size
▪ Momentum
Activation Functions
▪ Linear function
▪ Sigmoid / Logistic Activation Function – Smooth Function
▪ Tanh Function (Hyperbolic Tangent) – Smooth Function
▪ Rectified Linear Unit (ReLU) Function – Piecewise Linear Function, Rectifier
▪ Leaky ReLU Function – Piecewise Linear Function, Rectifier
▪ Softmax Function – Not an activation function in the usual sense, but an operation applied at the output neurons
Linear function

Mathematically it can be represented as: f(x) = x

It is also called "no activation" or the "identity function".
Cons of Linear function
❑ Backpropagation cannot be used effectively, as the derivative of the function is a constant and has no relation to the input x.

❑ All layers of the neural network will collapse into one if a linear activation function is used. No matter the number of layers in the neural network, the last layer will still be a linear function of the first layer. So, essentially, a linear activation function turns the neural network into just one layer.
Non-Linear Activation Functions
❑ They allow backpropagation because the derivative is now related to the input, making it possible to go back and understand which weights in the input neurons can provide a better prediction.

❑ They allow the stacking of multiple layers of neurons, as the output is now a non-linear combination of the input passed through multiple layers. This lets the network represent far more complex functions than a single linear layer could.
Logistic / Sigmoid function

Mathematically it can be represented as: f(x) = 1 / (1 + e^(-x))
Pros of Sigmoid function
It is commonly used for models where we have to predict the probability as an output. Since probability of anything exists only between the range of 0 and 1, sigmoid is the right choice because of its range.

The function is differentiable and provides a smooth gradient, i.e., preventing jumps in output values. This is represented by the S-shape of the sigmoid activation function.
Cons of Sigmoid function
The derivative of the function is f'(x) = sigmoid(x) * (1 - sigmoid(x)). As the gradient value approaches zero, the network ceases to learn and suffers from the vanishing gradient problem.

The output of the logistic function is not symmetric around zero, so the output of all the neurons will be of the same sign. This makes the training of the neural network more difficult and unstable.
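A minimal NumPy sketch (not from the slides) of the sigmoid and its derivative, illustrating how the gradient shrinks toward zero for large positive or negative inputs, which is the vanishing gradient problem described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # f'(x) = sigmoid(x) * (1 - sigmoid(x))

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_derivative(x):.5f}")
# The gradient peaks at 0.25 (at x = 0) and is nearly zero for |x| >= 10,
# which is why deep stacks of sigmoid layers learn slowly.
```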
Tanh / Hyperbolic tangent function

Mathematically it can be represented as: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Advantages of Tanh function
▪ The output of the tanh activation function is zero centered; hence we can easily map the output values as strongly negative, neutral, or strongly positive.

▪ It is usually used in hidden layers of a neural network as its values lie between -1 and 1; therefore, the mean for the hidden layer comes out to be 0 or very close to it. This helps in centering the data and makes learning for the next layer much easier.
Have a look at the gradient of the tanh activation function
From the graph it is clear that it also faces the problem of vanishing gradients, similar to the sigmoid activation function.

Also, the gradient of the tanh function is much steeper as compared to the sigmoid function.
Rectified Linear Unit (ReLU)
Mathematically it can be represented as: f(x) = max(0, x)
Advantages of ReLU function
❑ Since only a certain number of neurons are activated, the ReLU function is far more computationally efficient when compared to the sigmoid and tanh functions.

❑ ReLU accelerates the convergence of gradient descent towards the global minimum of the loss function due to its linear, non-saturating property.
The dying ReLU problem
▪ The negative side of the graph makes the gradient value zero. Due to this reason, during the backpropagation process, the weights and biases for some neurons are not updated. This can create dead neurons which never get activated.
❑ All the negative input values become zero immediately, which decreases the model's ability to fit or train from the data properly.
Leaky ReLU function
An improved version of the ReLU function designed to solve the Dying ReLU problem, as it has a small positive slope in the negative area: f(x) = max(αx, x), where α is a small constant (e.g., 0.01).
Advantages of Leaky ReLU function
❑ It enables backpropagation even for negative input values.

❑ By making this minor modification for negative input values, the gradient of the left side of the graph comes out to be a non-zero value. Therefore, we would no longer encounter dead neurons in that region.
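A small illustrative sketch (assuming the commonly used negative-side slope of 0.01, which the slides do not specify) comparing ReLU and Leaky ReLU with their gradients; note that only Leaky ReLU keeps a non-zero gradient for negative inputs:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):            # alpha is the small negative-side slope
    return np.where(x > 0, x, alpha * x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)      # zero gradient for x <= 0: dying ReLU

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)    # small non-zero gradient for x <= 0

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x), relu_grad(x))
print(leaky_relu(x), leaky_relu_grad(x))
```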
Limitation of Leaky ReLU function

❑ The predictions may not be consistent for negative input values.

❑ The gradient for negative values is a small value, which makes the learning of model parameters time-consuming.
Softmax function
o There's another kind of operation that we typically apply only at the output neurons of a classifier neural network, and even then only if there are 2 or more such neurons.

o It's not an activation function in the sense we've been using the term, because it applies simultaneously to all the output neurons, not just one.
Softmax function
▪ The softmax function is often described as a combination of multiple sigmoids.

▪ The softmax function takes all the network's outputs and modifies them simultaneously.

▪ The result is that the scores are turned into probabilities.

▪ It is most commonly used as an activation function for the last layer of the neural network in the case of multi-class classification.
Softmax function – An example
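A worked sketch with made-up logits (the figures from the slide's example are not reproduced here), showing how softmax turns a score vector into probabilities that sum to 1:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])   # hypothetical output-layer scores
probs = softmax(scores)
print(probs)          # approximately [0.659 0.242 0.099]
print(probs.sum())    # 1.0
```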
Choosing the right Activation Function
▪ Sigmoid functions and their combinations generally work better in the case of classifiers.
▪ Sigmoid/Logistic and Tanh functions should be avoided in hidden layers due to the vanishing gradient problem.
▪ If we encounter a case of dead neurons in our networks, the Leaky ReLU function is the best choice.
▪ Always keep in mind that the ReLU function should only be used in the hidden layers.
▪ As a rule of thumb, you can begin with the ReLU function and then move over to other activation functions in case ReLU doesn't provide optimum results.
Choosing the right Activation Function
• Rules for choosing the activation function for the output layer, based on the type of prediction problem that you are solving:
◦ Regression - Linear Activation Function
◦ Binary Classification - Sigmoid/Logistic Activation Function
◦ Multiclass Classification - Softmax
◦ Multilabel Classification - Sigmoid

▪ The activation function used in hidden layers is typically chosen based on the type of neural network architecture:
◦ Convolutional Neural Network (CNN): ReLU activation function.
◦ Recurrent Neural Network: Tanh and/or Sigmoid activation function.
Loss function
❑ A loss function is a function that compares the target and predicted output values.
❑ It measures how well the neural network models the training data.
❑ This loss essentially tells something about the performance of the network: the higher it is, the worse the network performs overall.

❑ When training, we aim to minimize this loss between the predicted and target outputs.
Types of Loss function
REGRESSION LOSS FUNCTIONS:
1) Mean Squared Error
2) Mean Absolute Error
3) Huber Loss

CLASSIFICATION LOSS FUNCTIONS:
1) Binary Cross-Entropy
2) Categorical Cross-Entropy
3) Sparse Categorical Cross-Entropy
4) Hinge Loss
Mean Squared Error (MSE)
❑ Mean Squared Error (MSE), also called L2 Loss, is a loss function used for regression.
❑ It represents the difference between the original and predicted values, computed by squaring the differences and averaging them over the data set: MSE = (1/n) Σ (y_i - ŷ_i)².
Mean Squared Error (MSE)
Use MSE when doing regression, believing that your target, conditioned on the input, is normally distributed, and you want large errors to be significantly (quadratically) more penalized than small ones.

MSE is sensitive towards outliers, and given several examples with the same input feature values, the optimal prediction will be their mean target value.

Advantages:
1) Easy to interpret
2) Differentiable
3) One local minimum

Disadvantages:
1) The error is in squared units of the target, which exaggerates big differences
2) Not robust to outliers
Mean Absolute Error (MAE)

Advantages:
1) Easy to interpret
2) Unit is the same as the output (target)
3) Robust to outliers

Disadvantages:
1) Not differentiable at zero

Huber Loss
Huber loss combines the two: it is quadratic (like MSE) for errors smaller than a threshold δ and linear (like MAE) for larger errors, making it differentiable yet robust to outliers.
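A minimal NumPy sketch of the three regression losses above (the Huber threshold δ = 1.0 here is only an illustrative choice, not a value from the slides):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    # Quadratic for small errors, linear for large errors (robust to outliers).
    err = y_true - y_pred
    small = np.abs(err) <= delta
    return np.mean(np.where(small, 0.5 * err ** 2,
                            delta * (np.abs(err) - 0.5 * delta)))

y_true = np.array([1.0, 2.0, 3.0, 10.0])   # the 10.0 acts as an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```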
Entropy – Information Theory Basics
❑ Information theory deals with the problem of encoding, decoding, transmitting, and manipulating information.
❑ The central idea in information theory is to quantify the amount of information contained in data.
❑ If it is always easy for us to predict the next token, then this data is easy to compress.
❑ However, if we cannot perfectly predict every event, then we might sometimes be surprised.
❑ Our surprise is greater when we assigned an event lower probability.

❑ Claude Shannon settled on log(1/P(j)) = -log P(j) to quantify one's surprisal at observing an event j having assigned it a (subjective) probability P(j).

❑ The entropy is then the expected surprisal when one assigned the correct probabilities that truly match the data-generating process.
Cross Entropy
❑ We can think of the cross-entropy classification objective in two ways:
❑ (i) as maximizing the likelihood of the observed data; and
❑ (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.
Cross Entropy
❑ Cross-Entropy loss is also called logarithmic loss, log loss, or logistic loss.
❑ Each predicted class probability is compared to the actual class desired output (0 or 1),
❑ and a score/loss is calculated that penalizes the probability based on how far it is from the actual expected value.
❑ The penalty is logarithmic in nature,
❑ yielding a large score for large differences close to 1 and a small score for small differences tending to 0.
Cross Entropy

H(p, q) = - Σ_x p(x) log q(x)

where p(x) is the probability distribution of the "true" label and q(x) depicts the estimated distribution.
Cross Entropy – An example
❑ Consider a classification problem with the predicted probabilities (S) and the labels (T).
❑ The objective is to calculate the cross-entropy loss given this information.

S - logits (turned into probabilities, e.g. via softmax)
T - one-hot encoded truth label
LCE - Cross-Entropy loss function
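A sketch of the computation described above, using hypothetical logits S and a one-hot label T (the actual numbers from the slide's example are not reproduced):

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

def cross_entropy(probs, one_hot):
    # L_CE = -sum_x p(x) * log q(x); for a one-hot p this is -log q(true class)
    return -np.sum(one_hot * np.log(probs))

S = np.array([1.0, 3.0, 0.5])    # hypothetical logits
T = np.array([0.0, 1.0, 0.0])    # one-hot encoded truth label
q = softmax(S)
print(q, cross_entropy(q, T))
```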
Activation function & Loss function

Type of Problem            | Activation Function | Loss Function
Regression                 | Linear              | Mean squared error
Binary Classification      | Sigmoid             | Binary cross entropy
Multi-Class Classification | Softmax             | Categorical cross entropy

Example
Optimization – Learning with backpropagation
Training of NN
The error is the difference between the target and the actual output:

e = target - output

We will later use a squared error function, because it has better characteristics for the algorithm.

In this case the error is

E = ½ (target - output)²
Updating weights & bias
▪Depending on this error, we have to change the weights from the
incoming values accordingly.
▪Using some basic calculus, it is possible to derive a rule to update the weights
on the connections coming into a neuron given a measure of the neuron’s
output error so as to reduce the neuron’s error.
▪The difficulty in training a neural network is that the weight-update rule
requires an estimate of the error at a neuron, and although it is
straightforward to calculate the error for each neuron in the output layer of
the network, it is difficult to calculate the error for the neurons in the earlier
layers.
▪The standard way to train a neural network is to use an algorithm called the
backpropagation algorithm to calculate the error for each neuron in the
network and then use the weight-update rule to modify the weights in the
network.
Backpropagation Algorithm
The main steps in the algorithm are as follows:
1.Calculate the error for the neurons in the output layer and use the weight-update rule to
update the weights coming into these neurons.
2.Share the error calculated at a neuron with each of the neurons in the preceding layer that is
connected to that neuron in proportion to the weight of the connection between the two
neurons.
3.For each neuron in the preceding layer, calculate the overall error of the network that the
neuron is responsible for by summing the errors that have been backpropagated to it and
use the result of this error summation to update the weights on the connections coming into
this neuron.
4.Work back through the rest of the layers in the network by repeating steps 2 and 3 until the
weights between the input neurons and the first layer of hidden neurons have been updated.
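The steps above can be made concrete with a small NumPy sketch: a tiny 2-2-1 sigmoid network trained on a single toy example, where the output-layer error is computed first and then pushed back through the hidden layer in proportion to the connection weights. This is a simplified illustration with made-up data, not the slides' worked example:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.1])          # one training input
t = np.array([1.0])               # its target
W1 = rng.normal(size=(2, 2))      # input -> hidden weights
W2 = rng.normal(size=(2, 1))      # hidden -> output weights
eta = 0.5                         # learning rate

for _ in range(1000):
    # Forward pass
    h = sigmoid(x @ W1)
    y = sigmoid(h @ W2)
    # Step 1: error at the output layer (squared-error derivative * sigmoid derivative)
    delta_out = (y - t) * y * (1 - y)
    # Steps 2-3: share the error back to the hidden layer through the weights
    delta_hidden = (delta_out @ W2.T) * h * (1 - h)
    # Weight-update rule for both layers
    W2 -= eta * np.outer(h, delta_out)
    W1 -= eta * np.outer(x, delta_hidden)

print(sigmoid(sigmoid(x @ W1) @ W2))   # approaches the target 1.0
```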
How to update the weights on the connections??

1. Naïve solution – expensive & inefficient
2. Gradient based
Gradient
▪ All operations used in the network are "differentiable".

▪ We can then compute the gradient of the loss with respect to the network coefficients.

▪ The process of tying together the loss function and model parameters to update the network based on the output of the loss function is known as optimization.
Gradient
❑ We have to calculate what a small change in each individual weight would do to the loss function.
❑ So we adjust each parameter based on its gradient.
❑ It is a small step in the determined direction.
❑ A gradient is a degree of inclination.
❑ It expresses the rate of change of one variable with respect to another variable.
Gradient

A curve in two dimensions. Values of X increase as we move right, and values of Y increase as we move up.
Gradient

Marking the tangent lines by whether they have a positive slope (+), negative slope (–), or are flat (0). To indicate the range of slopes shown, the more positive or negative each slope, the more + or – signs we've drawn.
Derivative

The derivative at a point tells us which way to move to find larger or smaller values of the curve. At the marked point, the derivative is positive, so if we take a step to the right (along positive X), we'll find a larger value of the curve.
Gradient

Out of all the ways to move, the water will always follow the steepest route downhill. The direction followed by the water is called the direction of maximum descent.
Gradient

The direction of maximum ascent (in orange) always points in exactly the opposite direction as maximum descent (in black).
Gradient
❑ We can then move the coefficients in the opposite direction from the gradient, thus decreasing the loss.
▪ A gradient of a function is a vector of partial derivatives w.r.t. all independent variables.
▪ It always points toward the steepest increase in the function.
Gradient
The list of popular gradient-based optimization algorithms is as follows:
1. Gradient Descent
2. Stochastic Gradient Descent
3. Mini Batch Gradient Descent
4. AdaGrad
5. RMSProp
6. Adam
7. Momentum Based Gradient Descent
8. Nesterov Accelerated Gradient Descent
Example:
Gradient Descent – The grand daddy of Optimization
▪ It is an iterative algorithm that starts off at a random point on the loss function and travels down its slope in steps until it reaches the lowest point (minimum) of the function.
Gradient Descent – The grand daddy of Optimization
Gradient descent works as follows:

1. It starts with some coefficients, sees their cost, and searches for a cost value lower than what it is now.
2. It moves towards the lower weight and updates the values of the coefficients.
3. The process repeats until the local minimum is reached. A local minimum is a point beyond which it cannot proceed.
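A minimal sketch of those three steps on a simple convex loss L(w) = (w - 3)², whose gradient is 2(w - 3); this is an illustration, not the course's code:

```python
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

w = -5.0            # start at some random point on the loss curve
eta = 0.1           # learning rate
for step in range(100):
    w = w - eta * gradient(w)   # take a small step against the gradient
print(w, loss(w))    # w approaches 3.0, the minimum of the loss
```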
Gradient Descent – The grand daddy of Optimization
Gradient descent works best for most purposes.

However, it has some downsides too.

It is expensive to calculate the gradients if the size of the data is huge.

Gradient descent works well for convex functions, but it doesn't know how far to travel along the gradient for nonconvex functions.
Stochastic Gradient Descent - SGD
❑ The term stochastic refers to the randomness on which the algorithm is based.
❑ In stochastic gradient descent, instead of taking the whole dataset for each iteration, we randomly select batches of data.
❑ That means we only take a few samples from the dataset.

❑ The procedure is first to select the initial parameters w and learning rate η. Then randomly shuffle the data at each iteration to reach an approximate minimum.
Stochastic Gradient Descent - SGD
❑ We call SGD an online algorithm, because it doesn't require the samples to be stored or even consistent from one epoch to the next.
❑ It just handles each sample as it arrives and updates the network immediately.
❑ Since we are not using the whole dataset but batches of it for each iteration, the path taken by the algorithm is full of noise as compared to the gradient descent algorithm.
❑ Thus, SGD uses a higher number of iterations to reach the local minima.
❑ Due to the increase in the number of iterations, the overall computation time increases.
❑ But even after increasing the number of iterations, the computation cost is still less than that of the gradient descent optimizer.
Mini Batch Gradient Descent
❑ We can find a nice middle ground between the extremes of batch gradient descent, which updates once per epoch, and stochastic gradient descent, which updates after every sample.
❑ We update the weights after some fixed number of samples have been evaluated.
❑ This number is almost always considerably smaller than the full training set size.
❑ A set of that many samples drawn from the training set is a mini-batch.
❑ The mini-batch size is frequently a power of 2 between about 32 and 256, and often chosen to fully use the parallel capabilities of our GPU, if we have one.
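A sketch of the mini-batch loop for a simple linear-regression loss, shuffling each epoch and updating after every batch of 32 samples; the data here is synthetic and the hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))                  # synthetic training set
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))               # shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]   # one mini-batch
        Xb, yb = X[batch], y[batch]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)   # MSE gradient on the batch
        w -= eta * grad                          # update after every mini-batch
print(w)    # close to [2.0, -1.0, 0.5]
```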
Momentum based GD
Suppose we're near the left side of the figure. As we roll down the hill, we'll reach the plateau starting at around -0.5.

With regular gradient descent, we'd stop on the plateau since the gradient there is zero.
Momentum based GD
❑ If we could get our learning process to use this kind of momentum, that would be great, because it would help us cross over plateaus like the one in the figure.

❑ The technique of momentum gradient descent [Qian99] is based on this idea.
❑ For each step, once we calculate how much we want each weight to change, we then add in a small amount of its change from the previous step.
❑ So if the change on a given step is 0, or nearly 0, but we had some larger change on the last step, we'll use some of that prior motion now, pushing us along over the plateau.
Finding the step for gradient descent with momentum.
(a) When we're at point B, we look back to the previous point A and find the "momentum," or the change we applied to point A. We call this "m." This momentum is scaled by γ, a number from 0 to 1, giving us the shorter arrow labeled γm.
(b) As with basic gradient descent, we find the gradient at B and scale it by η.
(c) Starting at point B, we add the scaled gradient ηg and the scaled momentum γm to find the next point.
Momentum based GD – Accelerating the speed of learning

Momentum based GD
❑ In order to understand this point, consider a setting in which one is performing gradient descent with respect to the parameter vector W.
❑ The normal updates for gradient descent with respect to loss function L are as follows:

V ← -α ∂L/∂W ;  W ← W + V

❑ Here, α is the learning rate. In momentum-based descent, the vector V is modified with exponential smoothing, where β ∈ (0, 1) is a smoothing parameter:

V ← βV - α ∂L/∂W ;  W ← W + V
Momentum based GD
❑ Larger values of β help the approach pick up a consistent velocity V in the correct direction.
❑ Setting β = 0 specializes to straightforward mini-batch gradient descent.
❑ The parameter β is also referred to as the momentum parameter or the friction parameter. The word "friction" is derived from the fact that small values of β act as "brakes," much like friction.
❑ The basic idea is to give greater preference to consistent directions over multiple steps, which have greater importance in the descent.
❑ This allows the use of larger steps in the correct direction without causing overflows or "explosions" in the sideways direction.
❑ As a result, learning is accelerated.
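A sketch of the two update rules side by side (α the learning rate, β the smoothing/momentum parameter), using a simple illustrative quadratic loss rather than the text's example:

```python
import numpy as np

def grad_L(W):                       # illustrative loss: L(W) = ||W||^2 / 2
    return W

W = np.array([4.0, -2.0])
V = np.zeros_like(W)
alpha, beta = 0.1, 0.9

for _ in range(50):
    # Plain gradient descent would use:  V = -alpha * grad_L(W)
    # Momentum keeps an exponentially smoothed velocity instead:
    V = beta * V - alpha * grad_L(W)
    W = W + V
print(W)     # moves toward the minimum at the origin
```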
Proper learning rate
Learning Rate Decay
❑ A constant learning rate is not desirable because it poses a dilemma to the analyst.

❑ A lower learning rate used early on will cause the algorithm to take too long to come even close to an optimal solution.
❑ On the other hand, a large initial learning rate will allow the algorithm to come reasonably close to a good solution at first; however, the algorithm will then oscillate around the point for a very long time, or diverge in an unstable way, if the high rate of learning is maintained.
❑ In either case, maintaining a constant learning rate is not ideal.

❑ Allowing the learning rate to decay over time can naturally achieve the desired learning-rate adjustment and avoid these challenges.
Momentum based GD
The ball rolls for a little while over the plateau, losing energy and slowing. But it reaches the next valley before it comes to a halt, giving it the chance to drop into the valley and get down to the minimum error.
Nesterov Accelerated Gradient (NAG)
"Look before you leap"

NAG = momentum GD with a look-ahead step: the gradient is evaluated at the position the accumulated momentum would carry us to, rather than at the current position.
Fact Analysis
▪ The frequently updated parameters will start receiving very small updates.
▪ Therefore, decay the learning rate for parameters in proportion to their update history.

Intuition

Decay the learning rate for parameters in proportion to their update history (more updates means more decay).
AdaGrad — Adaptive Gradient Algorithm

Update rule: v_t = v_(t-1) + (∇w_t)² ;  w_(t+1) = w_t - (η / √(v_t + ε)) ∇w_t

It is clear from the update rule that the history of the gradient is accumulated in v. The smaller the accumulated gradient, the smaller the v value will be, leading to a bigger learning rate (because v divides η).
AdaGrad — Adaptive Gradient Algorithm
By using a parameter-specific learning rate, AdaGrad ensures that despite sparsity, w gets a higher learning rate and hence larger updates.
Furthermore, it also ensures that if b undergoes a lot of updates, its effective learning rate decreases because of the growing denominator.

BUT …
In practice, over time the effective learning rate for b will decay to an extent that there will be no further updates to b.
Can we avoid this? RMSProp can!
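A sketch of the AdaGrad update described above on a one-parameter quadratic loss (ε is a small constant to avoid division by zero; the loss and hyperparameters are illustrative assumptions):

```python
import numpy as np

def grad_L(w):                 # illustrative gradient of L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

w, v = 0.0, 0.0
eta, eps = 1.0, 1e-8
for _ in range(100):
    g = grad_L(w)
    v = v + g ** 2                            # accumulate squared-gradient history in v
    w = w - (eta / (np.sqrt(v) + eps)) * g    # per-parameter effective learning rate
print(w)   # approaches 3.0, but the effective step size keeps shrinking as v grows
```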
RMSProp — Root Mean Square Propagation
Intuition
AdaGrad decays the learning rate very aggressively (as the denominator grows). As a result, after a while, the frequent parameters will start receiving very small updates because of the decayed learning rate. To avoid this, why not decay the denominator itself and prevent its rapid growth?
Adam — Adaptive Moment Estimation
In addition to the decayed history of squared gradients (as in RMSProp), Adam also uses a cumulative, exponentially decayed history of the gradients themselves.
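A sketch of the RMSProp and Adam update steps on the same illustrative quadratic loss, using commonly quoted default hyperparameters (β = 0.9 for RMSProp; β1 = 0.9, β2 = 0.999 for Adam); these defaults are assumptions, not values from the slides:

```python
import numpy as np

def grad_L(w):
    return 2.0 * (w - 3.0)     # illustrative gradient of L(w) = (w - 3)^2

eta, eps = 0.1, 1e-8

# RMSProp: exponentially decay the squared-gradient history, so the
# denominator cannot grow without bound (fixing AdaGrad's aggressive decay).
w, v = 0.0, 0.0
beta = 0.9
for _ in range(500):
    g = grad_L(w)
    v = beta * v + (1 - beta) * g ** 2
    w -= eta * g / (np.sqrt(v) + eps)
print("RMSProp:", w)

# Adam: RMSProp's squared-gradient average plus a decayed history
# (first moment) of the gradients themselves, with bias correction.
w, m, v = 0.0, 0.0, 0.0
beta1, beta2 = 0.9, 0.999
for t in range(1, 501):
    g = grad_L(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
print("Adam:", w)
# Both runs drive w toward the minimum at w = 3.0.
```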
Training FF DNN - Process
❑ The first input is fed to the network, represented as the matrix [x1, x2, 1], where 1 is the bias value.

❑ Each input is multiplied by a weight with respect to the first and second model to obtain their probability of being in the positive region in each model.
So, we will multiply our inputs by a matrix of weights using matrix multiplication.
Training FF DNN
❑ After that, we take the sigmoid of our scores, which gives us the probability of the point being in the positive region in both models.

❑ We multiply the probabilities obtained from the previous step with the second set of weights.

❑ And, as we know, to obtain the probability of the point being in the positive region of this model, we take the sigmoid, thus producing our final output in a feed-forward process.
Training FF DNN – Ex1
Now, we have to multiply our probabilities from the first layer with the second set of weights.

Then we take the sigmoid of our final score.
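A sketch of that two-layer feed-forward computation with hypothetical numbers (the weight values from the slide's example are not reproduced here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([2.0, 1.0, 1.0])          # inputs x1, x2 and the bias term 1
W1 = np.array([[ 0.5, -1.0],           # first set of weights: two hidden "models"
               [ 1.0,  0.5],
               [-0.5,  0.5]])
W2 = np.array([1.5, -1.0, 0.5])        # second set of weights (last entry for the bias)

h = sigmoid(x @ W1)                    # probabilities from the first layer
h_with_bias = np.append(h, 1.0)
y = sigmoid(h_with_bias @ W2)          # sigmoid of the final score = final output
print(h, y)
```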
Training FF DNN – Ex2
REGULARIZATION
MODEL EVALUATION
◾ We often split our data into nonoverlapping training and test sets in order to fairly evaluate
our model
INTRODUCTION

◾ Whether we're a person or a computer, learning general rules about a subject from a finite set of examples is a tough challenge.
◾ If we don't pay enough attention to the details of the examples, our rules will be too general, and when we get to working on new data we're likely to come to wrong conclusions.
◾ On the other hand, if we pay too much attention to the details in the examples, our rules will be too specific, and again we'll be likely to come to wrong conclusions.
◾ These phenomena are respectively called underfitting and overfitting.
UNDERFITTING
◾ Underfitting happens when a model is unable to capture the underlying pattern of the data.
◾ These models usually have high bias and low variance.
◾ It happens when we have very little data to build an accurate model, or when we try to build a linear model with nonlinear data.
◾ Also, these kinds of models are too simple to capture complex patterns in data; examples are linear and logistic regression.
OVERFITTING
◾ Overfitting happens when our model captures the noise along with the underlying pattern in the data.
◾ It happens when we train our model a lot over a noisy dataset.
◾ These models have low bias and high variance.
◾ These models are very complex, like decision trees, which are prone to overfitting.
WHAT IS THE BIAS-VARIANCE TRADEOFF?

◾ Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
◾ Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data.

◾ If our model is too simple and has very few parameters, then it may have high bias and low variance.
◾ On the other hand, if our model has a large number of parameters, then it's going to have high variance and low bias.
REGULARIZATION
◾ We always want to squeeze as much information as we can out of our training data, stopping just short of overfitting.
◾ The techniques for controlling overfitting are collectively known as regularization.
◾ It's quite likely that while we're training on our data, there's a point in time where, instead of learning useful features, we start overfitting to the training set.
◾ To avoid that, we want to be able to stop the training process as soon as we start overfitting, to prevent poor generalization.
◾ To do this, we divide our training process into epochs.
◾ An epoch is a single iteration over the entire training set.
◾ In other words, if we have a training set of size d and we are doing mini-batch gradient descent with batch size b, then an epoch would be equivalent to d / b model updates.
◾ At the end of each epoch, we want to measure how well our model is generalizing.
◾ To do this, we use an additional validation set.
REGULARIZATION METHODS

1. L1 regularization
2. L2 regularization
3. Parameter sharing
4. Dropout
5. Weight decay
6. Batch normalization
7. Early stopping
8. Data augmentation
9. Adding noise to input & output
REGULARIZATION
PARAMETER NORM PENALTIES
◾ Many regularization approaches are based on limiting the capacity of models,
◾ by adding a parameter norm penalty Ω(θ) to the objective function J.
◾ We denote the regularized objective function by J̃(θ; X, y) = J(θ; X, y) + α Ω(θ), where α weights the penalty relative to the data loss.
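A sketch of what the penalty term Ω(θ) looks like for L1 and L2 regularization added to a data loss J (the regularization strength α and the weight values are illustrative):

```python
import numpy as np

def l1_penalty(weights, alpha=0.01):
    return alpha * np.sum(np.abs(weights))     # Ω(θ) = Σ|w|, encourages sparsity

def l2_penalty(weights, alpha=0.01):
    return alpha * np.sum(weights ** 2)        # Ω(θ) = Σw², keeps weights small

def regularized_loss(data_loss, weights, alpha=0.01, kind="l2"):
    penalty = l1_penalty if kind == "l1" else l2_penalty
    return data_loss + penalty(weights, alpha)  # J~ = J + α Ω(θ)

w = np.array([0.5, -2.0, 0.0, 3.0])
print(regularized_loss(1.25, w, kind="l1"), regularized_loss(1.25, w, kind="l2"))
```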
REGULARIZATION
DROP OUT

◾ During training, some layer outputs are ignored or dropped at random.

◾ This makes the layer appear as, and be treated like, a layer with a different number of nodes and connectivity to the preceding layer.
◾ In practice, each layer update during training is carried out with a different "view" of the configured layer.
◾ Dropout makes the training process noisy, requiring nodes within a layer to take on more or less responsibility for the inputs on a probabilistic basis.
◾ According to this conception, dropout may break up situations in which network layers co-adapt to fix mistakes committed by prior layers, making the model more robust.
DROP OUT

❑ It can't rely on one input, as it might be randomly dropped out.
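A sketch of (inverted) dropout applied to a layer's activations during training, assuming a drop probability p = 0.5 purely for illustration; at inference time the layer is used unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5, training=True):
    if not training:
        return activations                      # no dropout at inference time
    mask = rng.random(activations.shape) >= p   # keep each node with probability 1 - p
    return activations * mask / (1.0 - p)       # rescale so the expected activation is unchanged

h = np.array([0.2, 0.9, 0.4, 0.7, 0.1, 0.8])
print(dropout(h, p=0.5))      # roughly half of the outputs are zeroed at random
```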
EARLY STOPPING

◾ Early stopping is a form of regularization technique that applies when we train a model with an iterative method, such as gradient descent.
◾ As we know, too much training in neural networks results in the network overfitting the training data.
◾ Up to a certain point, the model performance on the test set improves. Beyond that point, however, improving the model's fit to the training data leads to increased generalization error. This technique of regularization provides us a guide on how many iterations can be run before the model begins to overfit.
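A sketch of the early-stopping logic, using a made-up validation-loss curve that first improves and then degrades; the "patience" counter is a common refinement, not something prescribed by the slides:

```python
# A toy validation-loss curve that improves and then starts overfitting.
val_losses = [0.90, 0.70, 0.55, 0.48, 0.45, 0.44, 0.46, 0.50, 0.57, 0.66]

best_loss, best_epoch, patience, bad_epochs = float("inf"), 0, 2, 0
for epoch, loss in enumerate(val_losses):
    if loss < best_loss:
        best_loss, best_epoch, bad_epochs = loss, epoch, 0   # remember the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # stop once validation loss stops improving
            break
print(f"stopped at epoch {epoch}, best model from epoch {best_epoch} (val loss {best_loss})")
```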
PARAMETER SHARING

◾ Instead of penalizing model parameters, it forces a group of parameters to be equal.
◾ This can be seen as a way to apply prior domain knowledge to the training process.
◾ CNNs take advantage of the spatial structure of images by sharing parameters across different locations in the input.
BATCH NORMALIZATION

◾ It normalizes the means and variances of a layer's inputs, bringing the features into the same range.
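A sketch of the normalization step itself (per-feature mean and variance over a mini-batch); the learnable scale γ and shift β that full batch normalization adds are shown at their default values, since the slides do not detail them:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # features brought into the same range
    return gamma * x_hat + beta               # learnable scale and shift

batch = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 180.0]])
print(batch_norm(batch))
```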
CHALLENGE OF SMALL TRAINING DATASETS

◾ The first problem is that the network may effectively memorize the training dataset.

◾ The second problem is that a small dataset provides less opportunity to describe the structure of the input space and its relationship to the output.
DATA AUGMENTATION
◾ It refers to the process of generating new training examples for our dataset.
◾ More training data means lower model variance.
◾ It can also be seen as a form of noise injection into the training dataset.
HOW AND WHERE TO ADD NOISE

◾ The most common type of noise used during training is the addition of Gaussian noise to the input variables.
◾ Gaussian noise, or white noise, has a mean of zero and a standard deviation of one and can be generated as needed using a pseudorandom number generator.
◾ The amount of noise added (e.g., the spread or standard deviation) is a configurable hyperparameter.
◾ Too little noise has no effect, whereas too much noise makes the mapping function too challenging to learn.
◾ Although adding noise to the inputs is the most common and widely studied approach, random noise can be added to other parts of the network during training. Some examples include:
◾ Add noise to activations, i.e. the outputs of each layer.
◾ Add noise to the gradients, i.e. the direction used to update weights.
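A sketch of Gaussian noise injection on the inputs as a simple form of data augmentation; the standard deviation of the added noise is the configurable hyperparameter mentioned above, and the values used here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(x, std=0.1):
    # Zero-mean Gaussian noise; std controls how strongly the inputs are perturbed.
    return x + rng.normal(loc=0.0, scale=std, size=x.shape)

X_train = np.array([[0.2, 0.8], [0.5, 0.1], [0.9, 0.4]])
X_augmented = add_gaussian_noise(X_train, std=0.05)
print(X_augmented)   # each epoch can see a slightly different, noisy copy of the data
```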
