
Training Deep Neural Networks
Glorot and He Initialization
• The initialization strategy for the ReLU activation function (and its variants) is sometimes called He initialization
• Some papers have provided similar strategies for different activation functions. These strategies differ only by the scale of the variance and whether they use fan_avg or fan_in
• By default, Keras uses Glorot initialization with a uniform distribution. You can change this to He initialization by setting kernel_initializer="he_normal" or kernel_initializer="he_uniform", as in the sketch below
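For example, a minimal sketch assuming TensorFlow's Keras API (the layer size is illustrative):

import tensorflow as tf

# Default: Glorot initialization with a uniform distribution
glorot_layer = tf.keras.layers.Dense(50, activation="relu")

# He initialization, better suited to ReLU and its variants
he_layer = tf.keras.layers.Dense(50, activation="relu",
                                 kernel_initializer="he_normal")

# He initialization with a uniform distribution instead of a normal one
he_uniform_layer = tf.keras.layers.Dense(50, activation="relu",
                                         kernel_initializer="he_uniform")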
Batch Normalization
• Batch Normalization (BN)
• To address the vanishing/exploding gradients problems, and more
generally the problem that the distribution of each layer’s inputs
changes during training, as the parameters of the previous layers
change
• Just before the activation function of each layer, simply zero-center and normalize the inputs, then scale and shift the result using two new parameter vectors per layer (one for scaling, the other for shifting)
• This operation lets the model learn the optimal scale and mean of
the inputs for each layer
• In many cases, if you add a BN layer as the very first layer of your
network, you do not need to standardize your training set
Batch Normalization
• μ_B is the empirical mean, evaluated over the whole mini-batch B.
• σ_B is the empirical standard deviation, also evaluated over the whole mini-batch.
• m_B is the number of instances in the mini-batch.
• x̂⁽ⁱ⁾ is the zero-centered and normalized input.
• γ is the scaling parameter vector for the layer.
• β is the shifting parameter (offset) vector for the layer.
• ϵ is a tiny number to avoid division by zero (typically 10⁻⁵). This is called a smoothing term.
• z⁽ⁱ⁾ is the output of the BN operation: it is a scaled and shifted version of the inputs.
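Putting these symbols together, the BN operation follows the standard algorithm from the original Batch Normalization paper (reconstructed here for reference):

μ_B = (1 / m_B) Σᵢ x⁽ⁱ⁾                 (empirical mean of the mini-batch)
σ_B² = (1 / m_B) Σᵢ (x⁽ⁱ⁾ − μ_B)²       (empirical variance of the mini-batch)
x̂⁽ⁱ⁾ = (x⁽ⁱ⁾ − μ_B) ⊘ √(σ_B² + ϵ)       (zero-center and normalize)
z⁽ⁱ⁾ = γ ⊗ x̂⁽ⁱ⁾ + β                     (scale and shift)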
Implementing Batch Normalization with
Keras
• Just add a BatchNormalization layer before or after each
hidden layer’s activation function
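For example, a minimal sketch assuming tf.keras and a flattened 28 × 28 input (the layer sizes are illustrative), with a BN layer after each hidden layer's activation function:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),           # as the first layer, it standardizes the inputs
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),           # BN after the activation
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])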
Implementing Batch Normalization with
Keras
• The authors of the BN paper argued in favor of adding the BN
layers before the activation function, rather than after.
• There is some debate about this, as which is preferable seems to depend
on the task
• To add the BN layers before the activation functions, you must remove the activation functions from the hidden layers and add them as separate layers after the BN layers
• Moreover, since a BN layer includes one offset parameter per input, you can remove the bias term from the previous layer, as in the sketch below
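A sketch of the same kind of model with the BN layers moved before the activation functions; the activations become separate layers and the Dense layers drop their bias terms:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, use_bias=False),     # bias is redundant with BN's offset β
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),             # activation added as a separate layer after BN
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])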
Implementing Batch Normalization with
Keras
• axis parameter
• It determines which axis should be normalized
• It defaults to -1, meaning the last axis
• When the input batch is 2D ([batch size, features]), this means that each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch
• For a flattened 28 × 28 image, the BN layer will independently normalize each of the 784 input features
• When the input batch is 3D ([batch size, height, width]), the BN layer will compute 28 means and 28 standard deviations (one per column of pixels, computed across all rows and all instances in the batch)
• If instead you still want to treat each of the 784 pixels independently, you should set axis=[1, 2], as in the sketch below
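A short sketch of the axis parameter (the shapes are assumptions for illustration):

import tensorflow as tf

# 2D batch [batch size, 784]: the default axis=-1 normalizes each of the 784 features
bn_2d = tf.keras.layers.BatchNormalization()

# 3D batch [batch size, 28, 28]: the default axis=-1 computes 28 means and
# 28 standard deviations, one per column of pixels
bn_3d_default = tf.keras.layers.BatchNormalization()

# To treat each of the 784 pixels independently, normalize over both spatial axes
bn_3d_per_pixel = tf.keras.layers.BatchNormalization(axis=[1, 2])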
Faster optimizers
• Momentum optimization
• Imagine a bowling ball rolling down a gentle slope on a smooth surface
• It will start out slowly, but it will quickly pick up momentum until it eventually
reaches terminal velocity (if there is some friction or air resistance)
• Momentum optimization cares a great deal about what previous gradients
were

m ← βm − η ∇θ J(θ)
θ ← θ + m
• To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, called the momentum, which must be set between 0 (high friction) and 1 (no friction)
• A typical momentum value is 0.9
Faster optimizers
• Momentum optimization
• If the gradient remains constant, the terminal velocity is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β)
• If β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than Gradient Descent
• This allows momentum optimization to escape from plateaus much faster than
Gradient Descent
• The momentum value of 0.9 usually works well in practice and
almost always goes faster than regular Gradient Descent
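A minimal sketch of momentum optimization in Keras (assuming tf.keras; the learning rate value is illustrative):

import tensorflow as tf

# SGD with momentum: β = 0.9 is the typical value
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)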
Faster optimizers
• Nesterov Accelerated Gradient
• It uses the gradient measured a bit farther in the
direction of the momentum vector rather than using
the gradient at the original position

m ← βm − η ∇θ J(θ + βm)
θ ← θ + m
• In general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position
• NAG is almost always faster than regular momentum optimization
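A minimal sketch of NAG in Keras (assuming tf.keras; the learning rate value is illustrative):

import tensorflow as tf

# Nesterov Accelerated Gradient: just set nesterov=True on the SGD optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)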
Faster optimizers
• AdaGrad
s ← s + ∇θ J(θ) ⊗ ∇θ J(θ)
θ ← θ − η ∇θ J(θ) ⊘ √(s + ϵ)
• If the cost function is steep along one dimension, then Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum
• The gradient vector is scaled down by a factor of √(s + ϵ)
• This algorithm decays the learning rate. It
often stops too early when training
neural networks
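A minimal sketch of AdaGrad in Keras (assuming tf.keras; the learning rate value is illustrative):

import tensorflow as tf

# AdaGrad: fine for simple problems, but it often stops too early for deep networks
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.001)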
Faster optimizers
• RMSProp
• The RMSProp algorithm fixes AdaGrad's premature-stopping problem by accumulating only the gradients from the most recent iterations, rather than all the gradients since the beginning of training
• The decay rate β is typically set to 0.9; this default value often works well
s ← βs + (1 − β) ∇θ J(θ) ⊗ ∇θ J(θ)
θ ← θ − η ∇θ J(θ) ⊘ √(s + ϵ)
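A minimal sketch of RMSProp in Keras (assuming tf.keras; rho corresponds to the decay rate β above):

import tensorflow as tf

# RMSProp with the typical decay rate of 0.9
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)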
Faster optimizers
• Adam Optimization
• Adam, which stands for adaptive moment estimation, combines the ideas
of Momentum optimization and RMSProp
m ← β₁ m − (1 − β₁) ∇θ J(θ)
s ← β₂ s + (1 − β₂) ∇θ J(θ) ⊗ ∇θ J(θ)
m̂ ← m / (1 − β₁^t)
ŝ ← s / (1 − β₂^t)
θ ← θ + η m̂ ⊘ √(ŝ + ϵ)
• Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), so it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001
• t represents the iteration number (starting at 1)
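A minimal sketch of Adam in Keras (assuming tf.keras; these are the usual default values):

import tensorflow as tf

# Adam: beta_1 is the momentum decay β1, beta_2 is the scaling decay β2
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)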
Faster optimizers
• Nadam
• Nadam optimization is Adam optimization plus the Nesterov trick
• It will often converge slightly faster than Adam
• In his report introducing the technique, the researcher Timothy
Dozat compares many different optimizers on various tasks and
finds that Nadam generally outperforms Adam but is sometimes
outperformed by RMSProp
• Adaptive optimization methods can generalize poorly on some datasets, so when you are disappointed by your model's performance, try using plain Nesterov Accelerated Gradient instead
• Your dataset may just be allergic to adaptive gradients
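A minimal sketch of Nadam in Keras (assuming tf.keras; these are the usual default values):

import tensorflow as tf

# Nadam: Adam plus the Nesterov trick
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)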
