Deep Learning Book
by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Chapter 8: Optimization for Training Deep Models
Introduction
• Purpose of Learning
– find the parameters θ of a neural network model that
significantly reduce a cost function J(θ)
– where J(θ) includes a performance measure evaluated on
the training set as well as additional regularization terms
How Learning Differs from Pure Optimization
• Machine Learning
– usually acts indirectly, whereas pure optimization focuses on
minimizing the cost function itself
– goal of machine learning: to reduce the expected
generalization error (risk)
– however, learning algorithms reduce cost functions
(empirical risk)
• by minimizing the expected loss on the training
dataset
• in the hope that this indirect optimization will
improve the overall performance on unseen data
How Learning Differs from Pure Optimization
• Empirical Risk Minimization
– the goal of a machine learning algorithm is to reduce the
expected generalization error
– a.k.a. the risk
– the expectation is taken over the true underlying data
distribution → if it is known, then we have an optimization
problem
– if the data distribution is unknown, but a set of training
samples is available, then we have a machine learning problem
How Learning Differs from Pure Optimization
• Empirical Risk Minimization
– a machine learning problem can be converted back into
an optimization problem
• by minimizing the expected loss on the training set
(empirical risk)
• this means replacing the true data distribution with the
empirical distribution defined by the training set
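As a minimal Python sketch of this conversion (the function and argument names are illustrative, not from the book), the empirical risk is just the average loss over the training examples:

```python
import numpy as np

def empirical_risk(loss_fn, model, X_train, y_train):
    """Average per-example loss over the training set.

    This average over the empirical distribution stands in for the
    expectation under the true, unknown data distribution.
    """
    losses = [loss_fn(model(x), y) for x, y in zip(X_train, y_train)]
    return np.mean(losses)
```

Minimizing this quantity over the model parameters is the optimization problem that stands in for minimizing the true risk.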
How Learning Differs from Pure Optimization
• Surrogate Loss Functions
– a relevant loss function (say, classification error) sometimes
cannot be optimized efficiently
• e.g., exactly minimizing expected 0-1 loss is typically
intractable (exponential in the input dimension)
– a solution is to optimize a surrogate loss function
• acts as a proxy but has advantages
• e.g. the negative log-likelihood (NLL) of the correct class
is used as a surrogate for the 0-1 loss
• it allows the model to estimate the conditional probability
of the classes, given the input
– if the model can do that well, then it can pick the
classes that yield the least classification error
How Learning Differs from Pure Optimization
• Surrogate Loss Functions
– NLL = ∑i −log(yi), summed over all training samples
– yi is the (0-1 normalized) probability the model assigns
to the correct class of training input xi
– by minimizing NLL, we maximize the likelihood of the
correct classes
– NLL is differentiable, whereas the 0-1 loss is not
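A small NumPy sketch contrasting the two losses (names illustrative; assumes the model outputs one normalized probability per class):

```python
import numpy as np

def nll_loss(probs, labels):
    """Negative log-likelihood surrogate (differentiable in probs).

    probs:  (m, k) predicted class probabilities, rows summing to 1
    labels: (m,) integer indices of the correct classes
    """
    m = probs.shape[0]
    p_correct = probs[np.arange(m), labels]  # probability of each true class
    return -np.sum(np.log(p_correct))

def zero_one_loss(probs, labels):
    """0-1 loss: a count of misclassifications (not differentiable)."""
    return np.sum(np.argmax(probs, axis=1) != labels)
```

Gradient-based training optimizes nll_loss; zero_one_loss is only useful for evaluation.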
Minibatch Algorithms
• Optimization is iterative in machine learning
– e.g., gradient descent
• Using the entire training set (batch algorithm) in every
iteration is computationally infeasible
– e.g., the ImageNet dataset has millions of images
• A minibatch algorithm uses only 𝑏 examples in each
iteration
– 𝑚: the size of the entire training set
– 𝑏: minibatch size (1 ≤ 𝑏 < 𝑚)
– it is crucial that the minibatches are selected randomly
• unbiased gradient estimates require independent samples
– we can compute entirely separate parameter updates over
different sets of examples in parallel (see the sketch below)
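A minimal sketch of random minibatch selection (names illustrative):

```python
import numpy as np

def minibatches(X, y, b, rng):
    """Yield random minibatches of size b covering one epoch.

    Shuffling all m indices before slicing makes each batch an
    (approximately) independent sample, which keeps the minibatch
    gradient an unbiased estimate of the full-batch gradient.
    """
    m = X.shape[0]
    perm = rng.permutation(m)          # random order of all m examples
    for start in range(0, m, b):
        idx = perm[start:start + b]
        yield X[idx], y[idx]
```

Usage (illustrative): for X_b, y_b in minibatches(X, y, 64, np.random.default_rng(0)): ...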
Minibatch Algorithms
• Some thoughts on taking a random sample of the
dataset for training
– the standard error of the mean estimated from n samples is
given by σ/√n
• σ is the true standard deviation of the sampled values
– the √n in the denominator shows that there are less-than-linear
returns to using more examples
– let's compare n = 100 and n = 10,000
• n = 10,000 requires 100 times more computation than n = 100
– but reduces the standard error only by a factor of 10
– most optimization algorithms converge much faster (in total
computation) if they are allowed to
• rapidly compute approximate estimates of the gradient
rather than slowly computing the exact gradient
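The 1/√n effect is easy to verify numerically; a tiny sketch:

```python
import numpy as np

sigma = 1.0  # assume the per-sample values have true standard deviation 1
for n in (100, 10_000):
    print(n, sigma / np.sqrt(n))   # standard error of the mean
# 100   -> 0.1
# 10000 -> 0.01: 100x the computation for only a 10x smaller error
```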
Minibatch Algorithms
• Optimization algorithms that use only a single
example at a time are sometimes called stochastic
(or online)
• However, the term "online" is mostly used when
the examples are drawn from a stream of
continually created examples
• "Stochastic" has become the general term for what were
traditionally called minibatch or minibatch stochastic methods
Minibatch Algorithms
• Minibatch size
– larger batches → a more accurate estimate of the gradient, but
with less-than-linear returns
– parallel processing of the batch → memory scales with the
batch size; for much hardware (hw) this is the limiting factor
on batch size
– multicore hw gets underutilized with batches that are too small
– some hw achieves better runtime with specific array sizes; on
GPUs, power-of-2 batch sizes often offer better runtime
– small batches can offer a regularizing effect, possibly due to
the noise they add to the learning process
Batch vs. Minibatch
• Each minibatch iteration may have poorer
optimization performance than a full-batch iteration
• However, after many iterations, the minibatch
algorithm generally converges to a good solution
Saddle Point
• For many high-dimensional non-convex functions,
local minima are rare
– a local minimum must satisfy: zero gradient (∇θJ(θ) = 0) and
all n eigenvalues of the Hessian positive, where n is the # of
model parameters → exponentially unlikely as n grows
• Most zero-gradient points in deep neural
networks are saddle points (see the toy example below)
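A toy example (not from the slides) makes this concrete: f(x, y) = x² − y² has zero gradient at the origin, yet the origin is a saddle point, not a minimum:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the origin (where the gradient is zero)
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> mixed signs = saddle point
# A local minimum needs all n Hessian eigenvalues positive; if each sign
# were an independent coin flip, that happens with probability ~(1/2)^n,
# so saddle points dominate as n grows.
```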
Challenges
• Limitations of Gradient Descent
– Local minima
• Convex optimization → problem can be reduced to
finding a local minimum
– any local minimum is guaranteed to be a global
minimum
• With nonconvex functions, such as neural nets,
it is possible to have many local minima
– A deep model will have many local minima
» local minima can be problematic only if they have
high cost in comparison to the global minimum
– It is believed that, for very large neural networks,
most local minima have a low cost function value
» it is not crucial to find a true global minimum →
a set of parameters with a sufficiently low
cost will be a good solution to the problem
Challenges
• Limitation of Gradient Descent
– Local minima
Challenges
• Limitation of Gradient Descent
– Saddle Points
• For many high-dimensional nonconvex functions, local minima
(and maxima) are rare compared to saddle points (also
points with zero gradient)
• Some points around a saddle point have greater cost than the
saddle point, while others have a lower cost
• Optimization methods that are designed to solve for a point of
zero gradient (e.g., Newton’s method) can be attracted to these points
Challenges
• Limitation of Gradient Descent
– Cliffs
• cliffs are very steep regions in the objective function's landscape
• highly nonlinear deep neural networks often contain such sharp
nonlinearities in parameter space, resulting from the
multiplication of several parameters
• these nonlinearities give rise to very high derivatives in some places
• on an extremely steep cliff structure, the gradient update step
can move the parameters extremely far, usually jumping off the
cliff structure altogether
• this problem can be avoided using gradient clipping:
– when a very large step is imminent, the gradient clipping
heuristic intervenes to reduce the step size (see the sketch below)
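A minimal sketch of norm-based clipping, one common variant of the heuristic (names illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so its norm is at most `threshold`.

    The descent direction is preserved; only the step size is
    reduced, so a cliff no longer catapults the parameters.
    """
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g
```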
Challenges
• Limitation of Gradient Descent
– Gradient Clipping for Handling Cliffs
Basic Learning Algorithms
• Forward Propagation (review)
Basic Learning Algorithms
• Back Propagation (review)
Basic Learning Algorithms
• Parameter Update (review)
Basic Learning Algorithms
• Next iteration (review)
Basic Learning Algorithms
• Stochastic Gradient Descent (SGD)
– Minibatch of the training set:
data {𝒙(1), …, 𝒙(𝑚)} with targets {𝑦(1), …, 𝑦(𝑚)}
– Gradient: 𝒈 ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Apply update: θ ← θ − 𝜖𝒈
– learning rate 𝜖 is a critical hyperparameter for SGD
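A sketch of one SGD iteration (grad_fn is an assumed helper that returns the gradient of the average minibatch loss; it is not defined on the slides):

```python
def sgd_step(theta, grad_fn, X_batch, y_batch, eps):
    """One SGD update on a minibatch.

    eps is the learning rate, the critical hyperparameter
    noted on the slide.
    """
    g = grad_fn(theta, X_batch, y_batch)  # minibatch gradient estimate
    return theta - eps * g                # apply update
```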
Basic Learning Algorithms
• Momentum
– Designed to accelerate learning
• accumulates an exponentially decaying moving average
of past gradients
• continues to move in the direction of the accumulated gradients
– Comparison of GD and GD with momentum
Basic Learning Algorithms
• SGD with Momentum
– Minibatch of the training set:
data {𝒙(1), …, 𝒙(𝑚)} with targets {𝑦(1), …, 𝑦(𝑚)}
– Gradient: 𝒈 ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Accumulate velocity: 𝒗 ← α𝒗 − 𝜖𝒈
– Apply update: θ ← θ + 𝒗
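The same sketch extended with the velocity term (grad_fn again an assumed helper; alpha is the momentum decay coefficient):

```python
def momentum_step(theta, v, grad_fn, X_batch, y_batch, eps, alpha=0.9):
    """One SGD-with-momentum update on a minibatch.

    v is the exponentially decaying moving average of past
    gradients; alpha controls how quickly old gradients decay.
    """
    g = grad_fn(theta, X_batch, y_batch)  # minibatch gradient estimate
    v = alpha * v - eps * g               # accumulate velocity
    theta = theta + v                     # apply update
    return theta, v
```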
Algorithms with Adaptive
Learning Rates
• Adaptive gradient (AdaGrad)
– modified SGD with a per-parameter learning rate (lr)
• parameters with small accumulated gradients: lr shrinks slowly
• parameters with large accumulated gradients: lr shrinks rapidly
– Gradient: 𝒈 ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Accumulate squared gradients: 𝒓 ← 𝒓 + 𝒈 ⊙ 𝒈
– Apply update: θ ← θ − (𝜖 / (δ + √𝒓)) ⊙ 𝒈, applied element-wise,
with a small constant δ (e.g., 10⁻⁷) for numerical stability
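A sketch of one AdaGrad iteration under the same assumptions (grad_fn an assumed helper; delta is the stability constant):

```python
import numpy as np

def adagrad_step(theta, r, grad_fn, X_batch, y_batch, eps, delta=1e-7):
    """One AdaGrad update on a minibatch.

    r accumulates element-wise squared gradients, so parameters with
    a history of large gradients see their effective learning rate
    shrink fastest; delta guards against division by zero.
    """
    g = grad_fn(theta, X_batch, y_batch)              # minibatch gradient
    r = r + g * g                                     # accumulate squared gradients
    theta = theta - (eps / (delta + np.sqrt(r))) * g  # per-parameter update
    return theta, r
```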
Questions?