
Training Deep Neural Networks
Glorot and He Initialization
• The initialization strategy for the ReLU activation function (and its variants) is sometimes called He initialization
• Some papers have provided similar strategies for different activation functions. These strategies differ only by the scale of the variance and whether they use fan_avg or fan_in
• By default, Keras uses Glorot initialization with a uniform distribution. You can change this to He initialization by setting kernel_initializer="he_normal" or kernel_initializer="he_uniform", as in the sketch below
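For example, a minimal sketch assuming TensorFlow's Keras API (the layer size is illustrative):

import tensorflow as tf

# Default: Glorot initialization with a uniform distribution
glorot_layer = tf.keras.layers.Dense(50, activation="relu")

# He initialization, better suited to ReLU and its variants
he_layer = tf.keras.layers.Dense(50, activation="relu",
                                 kernel_initializer="he_normal")

# He initialization with a uniform distribution instead of a normal one
he_uniform_layer = tf.keras.layers.Dense(50, activation="relu",
                                         kernel_initializer="he_uniform")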
Batch Normalization
• Batch Normalization (BN)
• To address the vanishing/exploding gradients problems, and more
generally the problem that the distribution of each layer’s inputs
changes during training, as the parameters of the previous layers
change
• Just before the activation function of each layer, simply zero-center and normalize the inputs, then scale and shift the result using two new parameter vectors per layer (one for scaling, the other for shifting)
• This operation lets the model learn the optimal scale and mean of
the inputs for each layer
• In many cases, if you add a BN layer as the very first layer of your
network, you do not need to standardize your training set
Batch Normalization
• μ_B is the empirical mean, evaluated over the whole mini-batch B.
• σ_B is the empirical standard deviation, also evaluated over the whole mini-batch.
• m_B is the number of instances in the mini-batch.
• x̂⁽ⁱ⁾ is the zero-centered and normalized input.
• γ is the scaling parameter vector for the layer.
• β is the shifting parameter (offset) vector for the layer.
• ϵ is a tiny number to avoid division by zero (typically 10⁻⁵). This is called a smoothing term.
• z⁽ⁱ⁾ is the output of the BN operation: it is a scaled and shifted version of the inputs.
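Putting these symbols together, the BN operation follows the standard algorithm from the original Batch Normalization paper (reconstructed here for reference):

μ_B = (1 / m_B) Σᵢ x⁽ⁱ⁾                 (empirical mean of the mini-batch)
σ_B² = (1 / m_B) Σᵢ (x⁽ⁱ⁾ − μ_B)²       (empirical variance of the mini-batch)
x̂⁽ⁱ⁾ = (x⁽ⁱ⁾ − μ_B) ⊘ √(σ_B² + ϵ)       (zero-center and normalize)
z⁽ⁱ⁾ = γ ⊗ x̂⁽ⁱ⁾ + β                     (scale and shift)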
Implementing Batch Normalization with
Keras
• Just add a BatchNormalization layer before or after each
hidden layer’s activation function
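For example, a minimal sketch assuming tf.keras and a flattened 28 × 28 input (the layer sizes are illustrative), with a BN layer after each hidden layer's activation function:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),           # as the first layer, it standardizes the inputs
    tf.keras.layers.Dense(300, activation="relu"),
    tf.keras.layers.BatchNormalization(),           # BN after the activation
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation="softmax"),
])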
Implementing Batch Normalization with
Keras
• The authors of the BN paper argued in favor of adding the BN
layers before the activation function, rather than after.
• There is some debate about this, as which is preferable seems to depend
on the task
• To add the BN layers before the activation functions, you must remove the activation functions from the hidden layers and add them as separate layers after the BN layers
• Moreover, since a BN layer includes one offset parameter per input, you can remove the bias term from the previous layer, as in the sketch below
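A sketch of the same kind of model with the BN layers moved before the activation functions; the activations become separate layers and the Dense layers drop their bias terms:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, use_bias=False),     # bias is redundant with BN's offset β
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),             # activation added as a separate layer after BN
    tf.keras.layers.Dense(100, use_bias=False),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])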
Implementing Batch Normalization with
Keras
• axis parameter
• It determines which axis should be normalized
• It defaults to -1, meaning the last axis
• When the input batch is 2D ([batch size, features]), this means that each input feature will be normalized based on the mean and standard deviation computed across all the instances in the batch
• For a flattened 28 × 28 image, the BN layer will independently normalize each of the 784 input features
• When the input batch is 3D ([batch size, height, width]), the BN layer will compute 28 means and 28 standard deviations (one per column of pixels, computed across all rows and all instances in the batch)
• If instead you still want to treat each of the 784 pixels independently, you should set axis=[1, 2], as in the sketch below
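A short sketch of the axis parameter (the shapes are assumptions for illustration):

import tensorflow as tf

# 2D batch [batch size, 784]: the default axis=-1 normalizes each of the 784 features
bn_2d = tf.keras.layers.BatchNormalization()

# 3D batch [batch size, 28, 28]: the default axis=-1 computes 28 means and
# 28 standard deviations, one per column of pixels
bn_3d_default = tf.keras.layers.BatchNormalization()

# To treat each of the 784 pixels independently, normalize over both spatial axes
bn_3d_per_pixel = tf.keras.layers.BatchNormalization(axis=[1, 2])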
Faster optimizers
• Momentum optimization
• Imagine a bowling ball rolling down a gentle slope on a smooth surface
• It will start out slowly, but it will quickly pick up momentum until it eventually
reaches terminal velocity (if there is some friction or air resistance)
• Momentum optimization cares a great deal about what previous gradients
were

m ← βm − η ∇θ J(θ)
θ ← θ + m
• To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, called the momentum, which must be set between 0 (high friction) and 1 (no friction)
• A typical momentum value is 0.9
Faster optimizers
• Momentum optimization
• If the gradient remains constant, the terminal velocity is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β)
• If β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than Gradient Descent
• This allows momentum optimization to escape from plateaus much faster than
Gradient Descent
• The momentum value of 0.9 usually works well in practice and
almost always goes faster than regular Gradient Descent
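A minimal sketch of momentum optimization in Keras (assuming tf.keras; the learning rate value is illustrative):

import tensorflow as tf

# SGD with momentum: β = 0.9 is the typical value
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)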
Faster optimizers
• Nesterov Accelerated Gradient
• It uses the gradient measured a bit farther in the
direction of the momentum vector rather than using
the gradient at the original position

m ← βm − η ∇θ J(θ + βm)
θ ← θ + m
• In general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position
• NAG is almost always faster than regular momentum optimization
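A minimal sketch of NAG in Keras (assuming tf.keras; the learning rate value is illustrative):

import tensorflow as tf

# Nesterov Accelerated Gradient: just set nesterov=True on the SGD optimizer
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)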
Faster optimizers
• AdaGrad
s ← s + ∇θ J(θ) ⊗ ∇θ J(θ)
θ ← θ − η ∇θ J(θ) ⊘ √(s + ϵ)
• If the cost function is steep along one dimension, then Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum
• The gradient vector is scaled down by a factor of √(s + ϵ)
• This algorithm decays the learning rate. It
often stops too early when training
neural networks
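A minimal sketch of AdaGrad in Keras (assuming tf.keras; the learning rate value is illustrative):

import tensorflow as tf

# AdaGrad: fine for simple problems, but it often stops too early for deep networks
optimizer = tf.keras.optimizers.Adagrad(learning_rate=0.001)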
Faster optimizers
• RMSProp
• The RMSProp algorithm fixes AdaGrad's premature-stopping problem by accumulating only the gradients from the most recent iterations, rather than all the gradients since the beginning of training
• The decay rate β is typically set to 0.9; this default value often works well
s ← βs + (1 − β) ∇θ J(θ) ⊗ ∇θ J(θ)
θ ← θ − η ∇θ J(θ) ⊘ √(s + ϵ)
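A minimal sketch of RMSProp in Keras (assuming tf.keras; rho corresponds to the decay rate β above):

import tensorflow as tf

# RMSProp with the typical decay rate of 0.9
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)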
Faster optimizers
• Adam Optimization
• Adam, which stands for adaptive moment estimation, combines the ideas
of Momentum optimization and RMSProp
m ← β₁ m − (1 − β₁) ∇θ J(θ)
s ← β₂ s + (1 − β₂) ∇θ J(θ) ⊗ ∇θ J(θ)
m̂ ← m / (1 − β₁^t)
ŝ ← s / (1 − β₂^t)
θ ← θ + η m̂ ⊘ √(ŝ + ϵ)
• Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), so it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001
• t represents the iteration number (starting at 1)
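A minimal sketch of Adam in Keras (assuming tf.keras; these are the usual default values):

import tensorflow as tf

# Adam: beta_1 is the momentum decay β1, beta_2 is the scaling decay β2
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)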
Faster optimizers
• Nadam
• Nadam optimization is Adam optimization plus the Nesterov trick
• It will often converge slightly faster than Adam
• In his report introducing the technique, the researcher Timothy
Dozat compares many different optimizers on various tasks and
finds that Nadam generally outperforms Adam but is sometimes
outperformed by RMSProp
• Adaptive optimization methods can generalize poorly on some datasets, so when you are disappointed by your model's performance, try using plain Nesterov Accelerated Gradient instead
• Your dataset may just be allergic to adaptive gradients
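A minimal sketch of Nadam in Keras (assuming tf.keras; these are the usual default values):

import tensorflow as tf

# Nadam: Adam plus the Nesterov trick
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)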
