Training deep neural networks
Glorot and He Initialization
• The initialization strategy for the ReLU activation function (and its variants) is sometimes called He initialization
• Some papers have provided similar strategies for different activation functions. These strategies differ only by the scale of the variance and whether they use fan_in or fan_avg
• By default, Keras uses Glorot initialization with a uniform distribution. You can change this to He initialization by setting kernel_initializer="he_normal" (or "he_uniform"), as sketched below
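• A minimal sketch (the layer size is just a placeholder):

from tensorflow import keras

# Dense hidden layer using He initialization instead of the Glorot default
he_layer = keras.layers.Dense(50, activation="relu",
                              kernel_initializer="he_normal")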
Batch Normalization
• Batch Normalization (BN)
• To address the vanishing/exploding gradients problems, and more
generally the problem that the distribution of each layer’s inputs
changes during training, as the parameters of the previous layers
change
• Just before the activation function of each layer, simply zero-center
and normalize the inputs, then scale and shift the result using two
new parameters per layer (one for scaling, the other for shifting)
• This operation lets the model learn the optimal scale and mean of
the inputs for each layer
• In many cases, if you add a BN layer as the very first layer of your
network, you do not need to standardize your training set
Batch Normalization
• μ_B is the empirical mean, evaluated over the whole mini-batch B.
• σ_B is the empirical standard deviation, also evaluated over the whole mini-batch.
• m_B is the number of instances in the mini-batch.
• x̂ is the zero-centered and normalized input.
• γ is the scaling parameter vector for the layer.
• β is the shifting parameter (offset) vector for the layer.
• ϵ is a tiny number to avoid division by zero (typically 10⁻⁵). This is called a smoothing term.
• z is the output of the BN operation: it is a scaled and shifted version of the inputs.
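• Putting these symbols together, the standard BN equations compute the following for each instance i in the mini-batch during training:

\mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} \mathbf{x}^{(i)}
\sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} \left( \mathbf{x}^{(i)} - \mu_B \right)^2
\hat{\mathbf{x}}^{(i)} = \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\mathbf{z}^{(i)} = \gamma \otimes \hat{\mathbf{x}}^{(i)} + \beta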
Implementing Batch Normalization with
Keras
• Just add a BatchNormalization layer before or after each
hidden layer’s activation function
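• For example, a minimal Keras sketch (the 28×28 input shape and layer sizes are just illustrative):

from tensorflow import keras

model = keras.models.Sequential([
    keras.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax"),
])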
Implementing Batch Normalization with
Keras
• The authors of the BN paper argued in favor of adding the BN
layers before the activation function, rather than after.
• There is some debate about this, as which is preferable seems to depend
on the task
• To add the BN layers before the activation functions, you must remove the activation functions from the hidden layers and add them as separate layers after the BN layers (as in the sketch below)
• Moreover, since a BN layer includes one offset parameter per input, you
can remove the bias term from the previous layer
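• A minimal sketch of this variant (same illustrative architecture as above):

from tensorflow import keras

model = keras.models.Sequential([
    keras.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.BatchNormalization(),
    # No activation and no bias on the hidden Dense layers: BN supplies the offset,
    # and the activation is applied as a separate layer after BN
    keras.layers.Dense(300, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("relu"),
    keras.layers.Dense(10, activation="softmax"),
])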
Implementing Batch Normalization with
Keras
• axis parameter
• It determines which axis should be normalized
• It defaults to -1, meaning the last axis
• When the input batch is 2D ([batch size, features]), this means that
each input feature will be normalized based on the mean and
standard deviation across all the instances in the batch
• The BN layer will independently normalize each of the 784 input features
• When the input batch is 3D ([batch size, height, width]), the BN layer will compute 28 means and 28 standard deviations (one per column of pixels, computed across all the instances in the batch and all the rows in the column)
• If instead you still want to treat each of the 784 pixels independently, then you should set axis=[1, 2] (see the sketch below)
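• A minimal sketch of normalizing each pixel position independently before flattening. This assumes a tf.keras version whose BatchNormalization accepts a list of axes (TF 2.x does); the 28×28 shape is just illustrative:

from tensorflow import keras

model = keras.models.Sequential([
    keras.Input(shape=[28, 28]),
    # axis=[1, 2]: one mean/std per pixel position (784 in total)
    keras.layers.BatchNormalization(axis=[1, 2]),
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])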
Faster optimizers
• Momentum optimization
• Imagine a bowling ball rolling down a gentle slope on a smooth surface
• It will start out slowly, but it will quickly pick up momentum until it eventually
reaches terminal velocity (if there is some friction or air resistance)
• Momentum optimization cares a great deal about what previous gradients
were
m ← βm − η∇θJ(θ)
θ ← θ + m
• To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, called the momentum, which must be set between 0 (high friction) and 1 (no friction)
• A typical momentum value is 0.9
Faster optimizers
• Momentum optimization
• If the gradient remains constant, the terminal velocity is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β)
• If β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so momentum optimization ends up going 10 times faster than Gradient Descent
• This allows momentum optimization to escape from plateaus much faster than
Gradient Descent
• The momentum value of 0.9 usually works well in practice and
almost always goes faster than regular Gradient Descent
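• In Keras, momentum optimization is a one-line change to the SGD optimizer (the learning rate shown is just a common starting point):

from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)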
Faster optimizers
• Nesterov Accelerated Gradient
• It uses the gradient measured a bit farther in the
direction of the momentum vector rather than using
the gradient at the original position
m ← βm − η∇θJ(θ + βm)
θ ← θ + m
• In general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it is slightly more accurate to use the gradient measured a bit farther in that direction rather than the gradient at the original position
• NAG is almost always faster than regular momentum optimization
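• In Keras, NAG just requires setting nesterov=True on the SGD optimizer:

from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)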
Faster optimizers
• AdaGrad
s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)
θ ← θ − η∇θJ(θ) ⊘ √(s + ϵ)
• If the cost function is steep along one dimension, Gradient Descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum; AdaGrad corrects this by scaling down the gradient vector along the steepest dimensions
• The gradient vector is scaled down by a factor of √(s + ϵ)
• This algorithm decays the learning rate. It
often stops too early when training
neural networks
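• Keras provides it as keras.optimizers.Adagrad, though for the reason above it is rarely the best choice for deep networks:

from tensorflow import keras

optimizer = keras.optimizers.Adagrad(learning_rate=0.001)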
Faster optimizers
• RMSProp
• The RMSProp algorithm modifies AdaGrad by accumulating only the gradients from the most recent iterations (rather than all the gradients since the beginning of training)
• The decay rate β is typically set to 0.9; this default value often works well
s ← βs + (1 − β)∇θJ(θ) ⊗ ∇θJ(θ)
θ ← θ − η∇θJ(θ) ⊘ √(s + ϵ)
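• In Keras, the rho argument corresponds to the decay rate β above:

from tensorflow import keras

optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)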
Faster optimizers
• Adam Optimization
• Adam, which stands for adaptive moment estimation, combines the ideas
of Momentum optimization and RMSProp
m ← β₁m − (1 − β₁)∇θJ(θ)
s ← β₂s + (1 − β₂)∇θJ(θ) ⊗ ∇θJ(θ)
m̂ ← m / (1 − β₁^t)
ŝ ← s / (1 − β₂^t)
θ ← θ + η m̂ ⊘ √(ŝ + ϵ)
• Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), so it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001
• t represents the iteration number (starting at 1)
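• In Keras (beta_1 and beta_2 are shown at their usual defaults of 0.9 and 0.999):

from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)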
Faster optimizers
• Nadam
• Nadam optimization is Adam optimization plus the Nesterov trick
• It will often converge slightly faster than Adam
• In his report introducing the technique, the researcher Timothy
Dozat compares many different optimizers on various tasks and
finds that Nadam generally outperforms Adam but is sometimes
outperformed by RMSProp
• When you are disappointed by your model’s performance, try
using plain Nesterov Accelerated Gradient instead
• Your dataset may just be allergic to adaptive gradients
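• In Keras, both the Nadam optimizer and the plain-NAG fallback are one-liners:

from tensorflow import keras

# Nadam: Adam plus the Nesterov trick
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Fallback if adaptive methods disappoint: plain Nesterov Accelerated Gradient
fallback = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)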