Chapter 8 - Deep Learning Book

Chapter 8 of the Deep Learning Book discusses optimization techniques for training deep learning models, emphasizing the distinction between machine learning and pure optimization. It covers concepts such as empirical risk minimization, surrogate loss functions, and the use of minibatch algorithms to improve computational efficiency. The chapter also addresses challenges faced in optimization, including local minima, saddle points, and cliffs, along with methods like gradient clipping and momentum to enhance learning algorithms.

Deep Learning Book

by Ian Goodfellow, Yoshua Bengio and Aaron Courville

Chapter 8: Optimization for Training Deep Learning Models
2

Introduction

• Purpose of Learning: find the parameters θ of a neural network model that significantly reduce a cost function J(θ), where J(θ) includes a performance measure evaluated on the training set as well as additional regularization terms; a small sketch of this decomposition follows.
3
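To make the decomposition concrete, here is a minimal Python sketch (not from the slides) of a cost of this form for a hypothetical linear model with squared error and an L2 penalty; the names cost_J and lam and the choice of model are illustrative assumptions only.

    import numpy as np

    def cost_J(theta, X, y, lam=1e-2):
        """Hypothetical cost J(theta): a performance measure on the training set
        plus an additional regularization term."""
        preds = X @ theta                      # linear model, chosen only for illustration
        data_loss = np.mean((preds - y) ** 2)  # performance measure on the training set
        reg = lam * np.sum(theta ** 2)         # L2 regularization term
        return data_loss + reg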
How Does Learning Differ from Pure Optimization?

• Machine Learning
– usually acts indirectly, whereas pure optimization focuses on minimizing the cost function itself
– goal of machine learning: to reduce the expected generalization error (risk)
– however, learning algorithms actually reduce a different cost function (the empirical risk)
• by minimizing the expected loss on the training dataset
• in the hope that this indirect optimization will improve performance on unseen data
4
How Does Learning Differ from Pure Optimization?

• Empirical Risk Minimization
– the goal of a machine learning algorithm is to reduce the expected generalization error
– a.k.a. the risk
– the expectation is taken over the true underlying data distribution → if that distribution is known, we have a pure optimization problem
– if the data distribution is unknown, but a set of training samples is available, then we have a machine learning problem
5
How Does Learning Differ from Pure Optimization?

• Empirical Risk Minimization
– a machine learning problem can be converted back into an optimization problem
• by minimizing the expected loss on the training set (the empirical risk, sketched below)
• this means replacing the true data distribution with the empirical distribution defined by the training set
6
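To make the replacement concrete, a minimal sketch of the empirical risk, assuming a generic per-example loss function loss(model, x, y) (a placeholder, not defined in the slides):

    import numpy as np

    def empirical_risk(model, loss, X, Y):
        """Average the per-example loss over the training set; this empirical risk
        stands in for the true risk, which would be an expectation over the
        unknown data-generating distribution."""
        return np.mean([loss(model, x, y) for x, y in zip(X, Y)])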
How Does Learning Differ from Pure Optimization?
• Surrogate Loss Functions
– a relevant loss function (say, classification error) sometimes
cannot be optimized efficiently
• e.g., exactly minimizing expected 0-1 loss is typically
intractable (exponential in the input dimension)

– a solution is to optimize a surrogate loss function


• acts as a proxy but has advantages
• e.g. the negative log-likelihood (NLL) of the correct class
is used as a surrogate for the 0-1 loss
• it allows the model to estimate the conditional probability
of the classes, given the input
– if the model can do that well, then it can pick the
classes that yield the least classification error
7
How Does Learning Differ from Pure Optimization?

• Surrogate Loss Functions
– NLL = −∑i log(yi), summed over all training samples (a small numerical contrast with the 0-1 loss follows this slide)
– here yi is the model's predicted probability, normalized to [0, 1], of the correct class for training input xi
– by minimizing the NLL, we maximize the likelihood of the correct classes
– the NLL is differentiable, whereas the 0-1 loss is not
8
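A small sketch contrasting the two losses on a toy batch of class probabilities; the numbers below are made up purely for illustration:

    import numpy as np

    # toy predicted class probabilities (rows: samples, columns: classes) and true labels
    probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.3, 0.4, 0.3]])
    labels = np.array([0, 1, 2])

    zero_one = np.mean(np.argmax(probs, axis=1) != labels)         # non-differentiable 0-1 loss
    nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))  # differentiable surrogate (mean NLL)
    print(zero_one, nll)                                           # 0.333..., ~0.59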

Minibatch Algorithms

• Optimization is iterative in machine learning
– e.g., gradient descent
• Using the entire training set (batch algorithm) in every iteration is computationally infeasible
– e.g., the ImageNet dataset has millions of images
• A minibatch algorithm uses only 𝑏 examples in each iteration
– 𝑚: the size of the entire training set
– 𝑏: minibatch size (1 ≤ 𝑏 < 𝑚)
– it is crucial that the minibatches are randomly selected (see the sketch after this slide)
• unbiased gradient estimates require independent samples
– we can compute entirely separate parameter updates over different sets of examples in parallel
9
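A minimal sketch of drawing random minibatches by shuffling indices once per epoch; X, Y, and batch_size are assumed names, not objects defined in the slides:

    import numpy as np

    def minibatches(X, Y, batch_size, rng=None):
        """Yield randomly selected minibatches; shuffling every epoch keeps the
        gradient estimates (approximately) unbiased."""
        if rng is None:
            rng = np.random.default_rng()
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], Y[batch]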

Minibatch Algorithms

• Some thoughts on taking a random sample of the dataset for training
– the standard error of the mean estimated from n samples is σ/√n
• σ is the true standard deviation of the sampled values
– the √n in the denominator shows that there are less than linear returns to using more examples
– compare n = 100 and n = 10,000 (a quick numerical check follows this slide)
• n = 10,000 requires 100 times more computation than n = 100
– but reduces the standard error only by a factor of 10
– most optimization algorithms converge much faster (in total computation) if they are allowed to
• rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient
10
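A quick numerical check of the σ/√n scaling, using synthetic Gaussian per-example values invented only to illustrate the less-than-linear return:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0
    for n in (100, 10_000):
        # standard deviation of the sample mean, measured over 500 repeated draws
        means = rng.normal(0.0, sigma, size=(500, n)).mean(axis=1)
        print(n, means.std(), sigma / np.sqrt(n))   # empirical vs. theoretical sigma/sqrt(n)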

Minibatch Algorithms

• Optimization algorithms that use only a single example at a time are sometimes called stochastic (or online)
• However, the term "online" is mostly reserved for the case where the examples are drawn from a stream of continually created examples
• "Stochastic" has become the general term for what were traditionally called minibatch or minibatch stochastic methods
11

Minibatch Algorithms

• Minibatch size
– larger batches → more accurate estimate of the gradient, but with less than linear returns
– parallel processing of the batch → memory scales with the batch size; for much hardware this is the limiting factor on batch size
– multicore hardware is underutilized by batches that are too small
– some hardware achieves better runtime with specific array sizes; on GPUs, power-of-2 batch sizes often offer better runtime
– small batches can offer a regularizing effect, possibly due to the noise they add to the learning process
12

Batch vs. Minibatch

• Each minibatch iteration may show poorer optimization performance than a full-batch iteration
• However, after many iterations, the minibatch algorithm generally converges to a near-optimal solution
13

Saddle Point

• For many high-dimensional non-convex functions, local minima are rare
– a local minimum must have zero gradient and positive curvature (positive Hessian eigenvalues) along every direction; if each eigenvalue's sign behaved like an independent coin flip, this would occur with probability of only about (1/2)ⁿ
– n is the # of model parameters
• Most zero-gradient points in deep neural networks are saddle points (illustrated in the sketch below)
14
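As a minimal illustration of how a zero-gradient point is classified from the signs of the Hessian eigenvalues, the quadratic test function below is an assumption chosen only to produce an obvious saddle:

    import numpy as np

    # Hessian of f(x, y) = x^2 - y^2 at its only critical point (0, 0)
    H = np.array([[2.0,  0.0],
                  [0.0, -2.0]])
    eigvals = np.linalg.eigvalsh(H)

    if np.all(eigvals > 0):
        kind = "local minimum"        # curves upward in every direction
    elif np.all(eigvals < 0):
        kind = "local maximum"
    else:
        kind = "saddle point"         # mixed signs: up in some directions, down in others
    print(eigvals, kind)              # [-2.  2.] saddle point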

Challenges

• Limitations of Gradient Descent
– Local minima
• Convex optimization → the problem can be reduced to finding a local minimum
– any local minimum is guaranteed to be a global minimum
• With nonconvex functions, such as neural nets, it is possible to have many local minima
– a deep model will have many local minima
» local minima are problematic only if they have high cost in comparison to the global minimum
– it is believed that, for very large neural networks, most local minima have a low cost function value
» it is not crucial to find a true global minimum → a set of parameters with sufficiently low cost is a good solution to the problem
15

Challenges

• Limitation of Gradient Descent


– Local minima
16

Challenges

• Limitation of Gradient Descent
– Saddle Points
• For many high-dimensional nonconvex functions, local minima (and maxima) are rare compared to saddle points (also points of zero gradient)
• Some points around a saddle point have greater cost than the saddle point, while others have lower cost
• Optimization methods designed to solve for a point of zero gradient (e.g., Newton's method) can be attracted to saddle points and fail to escape them
17

Challenges

• Limitation of Gradient Descent
– Cliffs
• cliffs are very steep regions in the objective function space
• highly nonlinear deep neural networks often contain such sharp nonlinearities in parameter space, resulting from the multiplication of several parameters
• these nonlinearities give rise to very high derivatives in some places
• on an extremely steep cliff structure, the gradient update step can move the parameters extremely far, usually jumping off the cliff structure altogether
• this problem can be avoided using gradient clipping:
– when a very large step is imminent, the gradient clipping heuristic intervenes to reduce the step size
18

Challenges

• Limitation of Gradient Descent

– Gradient Clipping for Handling Cliffs


19
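A minimal sketch of the norm-clipping variant of this heuristic; the gradient vector g and the threshold max_norm are assumed names chosen for illustration:

    import numpy as np

    def clip_gradient(g, max_norm=1.0):
        """Rescale the gradient if its norm exceeds max_norm, so that a single
        update step cannot jump extremely far off a cliff structure."""
        norm = np.linalg.norm(g)
        if norm > max_norm:
            g = g * (max_norm / norm)
        return g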

Basic Learning Algorithms

• Forward Propagation (review)


20

Basic Learning Algorithms

• Back Propagation (review)


21

Basic Learning Algorithms

• Parameter Update (review)


22

Basic Learning Algorithms

• Next iteration (review)


23

Basic Learning Algorithms

• Stochastic Gradient Descent (SGD)
– Minibatch of the training set: data {𝒙(1),…,𝒙(𝑚)} with targets {𝑦(1),…,𝑦(𝑚)}
– Gradient estimate: ĝ ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Apply update: θ ← θ − 𝜖 ĝ (a minimal Python sketch follows this slide)
– the learning rate 𝜖 is a critical hyperparameter for SGD
24
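A compact sketch of a training loop along these lines; grad_loss(theta, x, y) is a hypothetical per-example gradient function, not something defined in the slides:

    import numpy as np

    def sgd(theta, grad_loss, minibatches, eps=0.01):
        """Plain SGD: estimate the gradient on each minibatch and step against it."""
        for Xb, Yb in minibatches:
            g = np.mean([grad_loss(theta, x, y) for x, y in zip(Xb, Yb)], axis=0)
            theta = theta - eps * g           # theta <- theta - eps * g_hat
        return theta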

Basic Learning Algorithms

• Momentum
– Designed to accelerate learning
• accumulates an exponentially decaying moving average of past gradients
• continues to move in their direction
– Comparison of GD and GD with momentum
25

Basic Learning Algorithms

• SGD with Momentum
– Minibatch of the training set: data {𝒙(1),…,𝒙(𝑚)} with targets {𝑦(1),…,𝑦(𝑚)}
– Gradient estimate: ĝ ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Accumulate velocity: 𝒗 ← α𝒗 − 𝜖 ĝ
– Apply update: θ ← θ + 𝒗 (a sketch follows this slide)
26
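A sketch of the velocity update, reusing the hypothetical grad_loss helper from the SGD sketch above; the momentum coefficient alpha = 0.9 is a common default, not a value taken from the slides:

    import numpy as np

    def sgd_momentum(theta, grad_loss, minibatches, eps=0.01, alpha=0.9):
        """SGD with momentum: keep an exponentially decaying moving average of
        past gradients (the velocity v) and update the parameters with it."""
        v = np.zeros_like(theta)
        for Xb, Yb in minibatches:
            g = np.mean([grad_loss(theta, x, y) for x, y in zip(Xb, Yb)], axis=0)
            v = alpha * v - eps * g           # accumulate velocity
            theta = theta + v                 # apply update
        return theta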
Algorithms with Adaptive
Learning Rates
• Adaptive gradient (AdaGrad)
– modified SGD with a per-parameter learning rate (lr)
• parameters with small accumulated gradients keep a relatively large effective lr
• parameters with large accumulated gradients have their effective lr decreased rapidly
– Gradient estimate: ĝ ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Accumulate squared gradients: 𝒓 ← 𝒓 + ĝ ⊙ ĝ
– Apply update: θ ← θ − (𝜖/(δ + √𝒓)) ⊙ ĝ, where δ is a small constant for numerical stability (a sketch follows this slide)
27
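A sketch of the accumulation and the scaled update, again reusing the hypothetical grad_loss helper; delta is the usual small constant added for numerical stability:

    import numpy as np

    def adagrad(theta, grad_loss, minibatches, eps=0.01, delta=1e-7):
        """AdaGrad: scale each parameter's step by the inverse square root of its
        accumulated squared gradients."""
        r = np.zeros_like(theta)
        for Xb, Yb in minibatches:
            g = np.mean([grad_loss(theta, x, y) for x, y in zip(Xb, Yb)], axis=0)
            r = r + g * g                                  # accumulate squared gradients
            theta = theta - (eps / (delta + np.sqrt(r))) * g
        return theta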

Questions?
