Optimization
Optimization algorithms are responsible for reducing the loss and providing
the most accurate results possible. The weights are initialized using some
initialization strategy and are updated after each epoch according to an
update equation. The best results are achieved using an optimization
strategy or algorithm, called an optimizer.
Gradient Descent
The goal of gradient descent is to minimize a given function which, in our case,
is the loss function of the neural network. To achieve this goal, it performs two steps
iteratively:
1. Forward pass
2. Backpropagation
What is the objective of Gradient Descent?
Gradient, in plain terms, means the slope or slant of a surface, so
gradient descent literally means descending a slope to reach the
lowest point on that surface. For the function to reach its minimum
value, the weights must be altered. With the help of backpropagation,
the loss is passed back from one layer to another, and the weights are
modified according to the loss so that the loss is minimized.
new weight = old weight − η × (∂error / ∂old weight)
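For example, with an old weight of 0.50, a learning rate η = 0.1, and ∂error/∂weight = 0.2, the new weight is 0.50 − 0.1 × 0.2 = 0.48.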
The objective of the gradient descent algorithm is to find the value of "x"
such that "y" is minimum, where y is the cost function and x is the weights.
So the idea is to pass the training set through the hidden layers of the
neural network and then update the parameters of the layers by
computing gradients on the training samples.
Think of it like this: suppose a man is at the top of a valley and wants
to get to the bottom. He goes down the slope, deciding his next position
based on his current position, and stops when he reaches the bottom of
the valley, which was his goal.
There are different ways in which that man (the weights) can go down the
slope:
1. Batch Gradient Descent
Batch Gradient Descent performs its calculations over the full training set at each
step, as a result of which it is very slow on very large training data; Batch GD thus
becomes computationally very expensive. However, it works well for convex or
relatively smooth error manifolds, and Batch GD scales well with the number of
features.
Drawbacks of Batch Gradient Descent:
In the Batch Gradient Descent algorithm, the entire dataset is loaded at once, which
makes it computationally intensive.
Another drawback is that the iterates may get stuck at a local minimum or a saddle
point and never converge to the global minimum.
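To make the procedure concrete, here is a minimal NumPy sketch of batch gradient descent on a toy linear-regression problem; the dataset, learning rate, and epoch count are made-up values for illustration:

```python
import numpy as np

# Toy data: y = 3x + noise (made-up example data)
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0   # initial weights
lr = 0.1          # learning rate (eta)

for epoch in range(100):
    # Forward pass over the FULL training set (this is what makes it "batch")
    y_pred = w * X + b
    error = y_pred - y
    # Gradients of the mean squared error w.r.t. w and b
    dw = 2 * np.mean(error * X)
    db = 2 * np.mean(error)
    # new weight = old weight - lr * gradient
    w -= lr * dw
    b -= lr * db
```

Note that dw and db are averaged over every sample on every step, which is exactly the full-dataset pass that makes Batch GD expensive on large data.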
2. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent is an extension of Gradient Descent.
Suppose our dataset has 5 million examples; then, just to take one step, the model
has to calculate the gradients of all 5 million examples. This is not an efficient
approach. To tackle this problem we have Stochastic Gradient Descent.
SGD tries to solve the main problem in Batch Gradient Descent, which is the use of
the whole training data to calculate gradients at each step.
SGD is stochastic in nature, i.e. it picks a "random" instance of the training data at
each step and then computes the gradient, making it much faster than Batch GD,
since there is far less data to manipulate at a single time.
In Stochastic Gradient Descent (SGD), we consider just one example at a time to take
a single step. We do the following steps in one epoch of SGD:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
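A minimal sketch of the five steps above on a toy linear model (the data and hyperparameters are illustrative assumptions, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + rng.normal(0, 0.1, size=100)

w, b, lr = 0.0, 0.0, 0.05

for epoch in range(20):
    for i in rng.permutation(len(X)):      # shuffle every epoch
        xi, yi = X[i], y[i]                # 1. take ONE example
        error = (w * xi + b) - yi          # 2. feed it to the network
        dw = 2 * error * xi                # 3. gradient of this single example
        db = 2 * error
        w -= lr * dw                       # 4. update the weights immediately
        b -= lr * db                       # 5. loop repeats for every example
```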
Drawbacks:
SGD takes a larger number of iterations than GD to reach the minimum, and its
path contains some noise compared to Gradient Descent.
As SGD computes the derivative of only one point at a time, the time taken to
complete one epoch is large compared to the Gradient Descent algorithm.
Difference between Batch Gradient Descent and Stochastic Gradient Descent

| S.No. | Batch Gradient Descent | Stochastic Gradient Descent |
| --- | --- | --- |
| 1. | Computes the gradient using the whole training sample. | Computes the gradient using a single training sample. |
| 2. | Slow and computationally expensive. | Faster and less computationally expensive than Batch GD. |
| 3. | Not suggested for huge training samples. | Can be used for large training samples. |
| 4. | Deterministic in nature. | Stochastic in nature. |
| 5. | Gives the optimal solution, given sufficient time to converge. | Gives a good solution, but not the optimal one. |
| 6. | No random shuffling of points is required. | The samples should be in random order, which is why we shuffle the training set every epoch. |
| 7. | Cannot escape shallow local minima easily. | Can escape shallow local minima more easily. |
| 8. | Convergence is slow. | Reaches convergence much faster. |
3. Mini-Batch Stochastic Gradient Descent (MB-SGD)
MB-SGD is an extension of the SGD algorithm.
It is also common to sample a small number of data points, instead of just one point,
at each step; this is called "mini-batch" gradient descent. Mini-batch tries to
strike a balance between the goodness of gradient descent and the speed of SGD.
It overcomes the time-consuming complexity of SGD by taking a batch of points
(a subset of the dataset) to compute the derivative.
After creating mini-batches of fixed size, we do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to the neural network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the mini-batches we created
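A minimal sketch of one epoch of mini-batch gradient descent under the same toy setup (the batch size of 16 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3 * X + rng.normal(0, 0.1, size=100)

w, b, lr, batch_size = 0.0, 0.0, 0.05, 16

for epoch in range(20):
    order = rng.permutation(len(X))                  # shuffle every epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]      # 1. pick a mini-batch
        error = (w * X[batch] + b) - y[batch]        # 2. feed it to the network
        dw = 2 * np.mean(error * X[batch])           # 3. MEAN gradient of the batch
        db = 2 * np.mean(error)
        w -= lr * dw                                 # 4. update with the mean gradient
        b -= lr * db                                 # 5. loop repeats for all batches
```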
The drawback is that the weight updates are noisier, because the mini-batch
derivative does not always point towards the minimum.
Drawbacks of the base optimizers (GD, SGD, mini-batch GD):
Gradient Descent uses the whole training data to update the weights and biases.
If we have millions of records, training becomes slow and computationally very
expensive.
SGD solves this problem by using only a single record to update the parameters.
But SGD is still slow to converge, because it needs a forward and a backward
propagation for every record, and the path to the global minimum becomes very
noisy.
Mini-batch GD overcomes the SGD drawbacks by using a batch of records for each
update. Since it does not use the entire dataset per update, its path to the global
minimum is not as smooth as that of Gradient Descent.
The figure above plots the number of epochs on the x-axis against the loss on the
y-axis. We can clearly see that with Gradient Descent the loss decreases smoothly,
whereas with SGD the loss value oscillates heavily.
SGD with momentum:
Instead of depending only on the current gradient to update the weights,
gradient descent with momentum replaces the current gradient with m
("momentum"), an aggregate of gradients. This aggregate is the
exponential moving average of the current and past gradients (i.e. up to time t).
Equation to update the weights in SGD:
Wt+1 = Wt − αt · (∂L/∂Wt)
Equation to update the weights in SGD with momentum:
Wt+1 = Wt − αt · mt, where mt = β · mt−1 + (1 − β) · (∂L/∂Wt)
Where
mt = aggregate of gradients at time t [current] (initially, m0 = 0)
mt−1 = aggregate of gradients at time t−1 [previous]
Wt = weights at time t
Wt+1 = weights at time t+1
αt = learning rate at time t
∂L/∂Wt = derivative of the loss function with respect to the weights at time t
β = moving-average parameter (constant)
Common default value:
β = 0.9
In SGD with momentum, we have thus added a momentum term to the gradient update.
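A minimal sketch of this update rule; the helper name sgd_momentum_step and the sample values are illustrative assumptions (some libraries use mt = β · mt−1 + ∂L/∂Wt without the (1 − β) factor, but this sketch follows the exponential-moving-average form described above):

```python
import numpy as np

def sgd_momentum_step(w, grad, m, lr=0.01, beta=0.9):
    """One SGD-with-momentum update for a weight array w.
    m_t     = beta * m_{t-1} + (1 - beta) * grad   (EMA of gradients)
    w_{t+1} = w_t - lr * m_t
    """
    m = beta * m + (1 - beta) * grad
    return w - lr * m, m

# Usage: m starts at zero, with the same shape as the weights
w = np.array([0.5, -0.3])
m = np.zeros_like(w)
grad = np.array([0.2, -0.1])    # dL/dW at this step (made-up values)
w, m = sgd_momentum_step(w, grad, m)
```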
Adagrad (Adaptive Gradient Algorithm):
The idea behind Adagrad is to use a different learning rate for each parameter,
adapted at every iteration.
It adapts the learning rate to the parameters, performing smaller updates (i.e. lower
learning rates) for parameters associated with frequently occurring (dense) features,
and larger updates (i.e. higher learning rates) for parameters associated with
infrequent (sparse) features. For this reason, it is well suited to dealing with sparse data.
Advantages of using AdaGrad:
It eliminates the need to manually tune the learning rate.
Convergence is faster and more reliable than with simple SGD when the scaling of
the weights is unequal.
Default values (from Keras):
η = 0.01
ε = 10⁻⁷
The Adagrad update modifies the learning rate per parameter:
Wt+1 = Wt − (η / √(αt + ε)) · (∂L/∂Wt), where αt = αt−1 + (∂L/∂Wt)²
In this equation, the effective learning rate automatically decreases, because αt,
the sum of the squared past gradients, keeps increasing after every time step.
Drawback: due to the monotonically decreasing learning rate, at some point in time
the model will stop learning, as the learning rate is almost 0.
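A minimal sketch of the Adagrad update described above, using the Keras-style defaults; the function name adagrad_step and the sample gradient are illustrative assumptions:

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-7):
    """One Adagrad update.
    accum_t = accum_{t-1} + grad**2                 (running sum of squared gradients)
    w_{t+1} = w_t - lr / sqrt(accum_t + eps) * grad
    """
    accum = accum + grad ** 2
    return w - lr / np.sqrt(accum + eps) * grad, accum

# Usage: accum starts at zero and only ever grows,
# so the effective learning rate only ever shrinks.
w = np.array([0.5, -0.3])
accum = np.zeros_like(w)
grad = np.array([0.2, -0.1])    # dL/dW at this step (made-up values)
w, accum = adagrad_step(w, grad, accum)
```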
Adadelta and RMSProp
Adadelta is an extension of Adagrad that attempts to solve its radically diminishing
learning rates.
The equation above shows that as the time step t increases, the sum of squared
gradients αt increases, which leads to a decrease in the effective learning rate
η / √(αt + ε).
To resolve the unbounded growth of the sum of squared gradients, we replace the
raw sum with an exponentially weighted average of the squared gradients.
So, unlike αt in Adagrad, which grows monotonically after every time step, in
Adadelta/RMSProp the exponentially weighted average over the past squared
gradients keeps the growth of Sdw under control:
Sdw(t) = β · Sdw(t−1) + (1 − β) · (∂L/∂Wt)²
Wt+1 = Wt − (η / √(Sdw(t) + ε)) · (∂L/∂Wt)
The typical β value is 0.9 or 0.95.
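A minimal sketch of this exponentially weighted variant (the RMSProp form); the function name rmsprop_step and the default learning rate are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(w, grad, s_dw, lr=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update.
    Sdw_t   = beta * Sdw_{t-1} + (1 - beta) * grad**2   (EMA, not a raw sum)
    w_{t+1} = w_t - lr / sqrt(Sdw_t + eps) * grad
    """
    s_dw = beta * s_dw + (1 - beta) * grad ** 2
    return w - lr / np.sqrt(s_dw + eps) * grad, s_dw

# Unlike Adagrad's accumulator, s_dw can shrink again when
# recent gradients are small, so the learning rate recovers.
w = np.array([0.5, -0.3])
s_dw = np.zeros_like(w)
grad = np.array([0.2, -0.1])    # dL/dW at this step (made-up values)
w, s_dw = rmsprop_step(w, grad, s_dw)
```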
Adam Optimizer
Adaptive Moment Estimation (Adam) is an optimization algorithm for gradient
descent. The method is very efficient when working with large problems involving a
lot of data or parameters, and it requires little memory. Intuitively, it is a
combination of the 'gradient descent with momentum' algorithm and the 'RMSProp'
algorithm.
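A minimal sketch showing how Adam combines a momentum-style first moment and an RMSProp-style second moment, plus the bias correction Adam applies because both moving averages start at zero; the function name and sample values are illustrative assumptions:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t counts steps starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # 1st moment: momentum part
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2nd moment: RMSProp part
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Usage: m and v start at zero, same shape as the weights
w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
grad = np.array([0.2, -0.1])    # dL/dW at this step (made-up values)
w, m, v = adam_step(w, grad, m, v, t=1)
```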
Problems with Gradients
1. Exploding Gradients
2. Vanishing Gradients
When training a deep neural network with gradient-based learning
and backpropagation, we find the partial derivatives by traversing the
network from the final layer (y_hat) back to the initial layer.
Using the chain rule, layers deeper in the network go through repeated
matrix multiplications in order to compute their derivatives.
Exploding Gradients:
In a network of n hidden layers, n derivatives are multiplied
together. If the derivatives are large, the gradient increases
exponentially as we propagate back through the model until it eventually
explodes; this is what we call the problem of exploding gradients.
In the case of exploding gradients, the accumulation of large
derivatives makes the model very unstable and incapable of
effective learning.
The large changes in the model's weights create a very unstable
network; at extreme values the weights become so large that they
overflow, resulting in NaN weight values that can no longer
be updated.
Vanishing Gradients:
Alternatively, if the derivatives are small, the gradient decreases
exponentially as we propagate back through the model until it
eventually vanishes; this is the vanishing gradient problem.
The accumulation of small gradients results in a model that is
incapable of learning meaningful insights, since the weights and biases
of the initial layers, which tend to learn the core features from the
input data (X), are not updated effectively. In the worst-case scenario
the gradient becomes 0, which in turn stops the network from training
any further.
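A tiny numerical sketch of the mechanism: a product of 50 per-layer derivative factors either vanishes or explodes depending on whether each factor is below or above 1 (the factor values 0.5 and 1.5 are made up for illustration):

```python
n_layers = 50

# Each backpropagated gradient is roughly a product of one
# derivative factor per layer (chain rule).
print(0.5 ** n_layers)   # ~8.9e-16 -> the gradient effectively vanishes
print(1.5 ** n_layers)   # ~6.4e+08 -> the gradient explodes
```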
How to know?
Exploding Gradients
There are a few subtle signs that your model may be suffering from
exploding gradients:
1. The model is not learning much on the training data, resulting
in a poor loss.
2. The loss changes dramatically on each update due to the
model's instability.
3. The model's loss becomes NaN during training.
When faced with these problems, there are some much more transparent
signs that confirm the problem is due to exploding gradients, for
instance:
Model weights grow exponentially and become very large while
training the model.
The model weights become NaN during the training phase.
Vanishing Gradient
There are also ways to detect whether your deep network is suffering from
the vanishing gradient problem:
The model improves very slowly during the training phase,
and training may stop very early, meaning that further
training does not improve the model.
The weights closer to the output layer change the most,
whereas the layers closer to the input layer barely change
(if at all).
Model weights shrink exponentially and become very small while
training the model.
The model weights become 0 during the training phase.
Solutions
There are many approaches to addressing exploding and vanishing
gradients; this section lists three approaches that you can use.
1. Reducing the Number of Layers
This solution can be used in both scenarios (exploding and vanishing
gradients). However, by reducing the number of layers in our network,
we give up some of our model's complexity, since more layers make the
network more capable of representing complex mappings.
2. Gradient Clipping (Exploding Gradients)
Another solution is to check for and limit the size of the gradients while
the model trains.
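A minimal sketch of clipping by L2 norm; the function name clip_by_norm and the threshold are illustrative assumptions:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its L2 norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

grad = np.array([30.0, -40.0])   # norm 50, far above the threshold
print(clip_by_norm(grad))        # rescaled to norm 1.0: [0.6 -0.8]
```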
3. Weight Initialization
A more careful choice of the random initialization for your
network tends to be a partial solution, since it does not solve the problem
completely.
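As a sketch of what a more careful initialization can look like in practice, here are the widely used Xavier/Glorot and He schemes; the layer sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot uniform init: variance scaled by layer width,
    commonly paired with tanh/sigmoid activations."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He init: variance 2/fan_in, commonly paired with ReLU."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

W1 = xavier_init(784, 256)   # e.g. input layer -> hidden layer
W2 = he_init(256, 128)       # e.g. hidden layer -> hidden layer
```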