Deep Learning Book
by Ian Goodfellow, Yoshua Bengio and Aaron Courville
Chapter 8: Optimization for Training Deep Models
Introduction
• Purpose of Learning
– find the parameters θ of a neural network model that
significantly reduce a cost function J(θ)
– where J(θ) includes a performance measure evaluated on
the training set as well as additional regularization terms
How Learning Differs from Pure Optimization
• Machine Learning
– usually acts indirectly, whereas pure optimization focuses on
minimizing the cost function itself
– goal of machine learning: to reduce the expected
generalization error (risk)
– however, learning algorithms reduce cost functions
(empirical risk)
• by minimizing the expected loss on the training
dataset
• in the hope that this indirect optimization will
improve the overall performance on unseen data
How Learning Differs from Pure Optimization
• Empirical Risk Minimization
– the goal of a machine learning algorithm is to reduce the
expected generalization error
– a.k.a. the risk
– the expectation is taken over the true underlying data
distribution → if it is known, then we have an optimization
problem
– if the data distribution is unknown, but a set of training
samples is available, then we have a machine learning problem
How Learning Differs from Pure Optimization
• Empirical Risk Minimization
– a machine learning problem can be converted back into
an optimization problem
• by minimizing the expected loss on the training set
(empirical risk)
• this means replacing the true data distribution with the
empirical distribution defined by the training set
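As a minimal Python sketch of this conversion (the function and argument names are illustrative, not from the book), the empirical risk is just the average loss over the training examples:

```python
import numpy as np

def empirical_risk(loss_fn, model, X_train, y_train):
    """Average per-example loss over the training set.

    This average over the empirical distribution stands in for the
    expectation under the true, unknown data distribution.
    """
    losses = [loss_fn(model(x), y) for x, y in zip(X_train, y_train)]
    return np.mean(losses)
```

Minimizing this quantity over the model parameters is the optimization problem that stands in for minimizing the true risk.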
How Learning Differs from Pure Optimization
• Surrogate Loss Functions
– a relevant loss function (say, classification error) sometimes
cannot be optimized efficiently
• e.g., exactly minimizing expected 0-1 loss is typically
intractable (exponential in the input dimension)
– a solution is to optimize a surrogate loss function
• acts as a proxy but has advantages
• e.g. the negative log-likelihood (NLL) of the correct class
is used as a surrogate for the 0-1 loss
• it allows the model to estimate the conditional probability
of the classes, given the input
– if the model can do that well, then it can pick the
classes that yield the least classification error
How Learning Differs from Pure Optimization
• Surrogate Loss Functions
– NLL = ∑i −log(yi), summed over all training samples
– yi is the (0-1 normalized) probability the model assigns
to the correct class of training input xi
– by minimizing NLL, we maximize the likelihood of the
correct classes
– NLL is differentiable, whereas the 0-1 loss is not
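A small NumPy sketch contrasting the two losses (names illustrative; assumes the model outputs one normalized probability per class):

```python
import numpy as np

def nll_loss(probs, labels):
    """Negative log-likelihood surrogate (differentiable in probs).

    probs:  (m, k) predicted class probabilities, rows summing to 1
    labels: (m,) integer indices of the correct classes
    """
    m = probs.shape[0]
    p_correct = probs[np.arange(m), labels]  # probability of each true class
    return -np.sum(np.log(p_correct))

def zero_one_loss(probs, labels):
    """0-1 loss: a count of misclassifications (not differentiable)."""
    return np.sum(np.argmax(probs, axis=1) != labels)
```

Gradient-based training optimizes nll_loss; zero_one_loss is only useful for evaluation.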
Minibatch Algorithms
• Optimization is iterative in machine learning
– e.g., gradient descent
• Using the entire training set (batch algorithm) in every
iteration is computationally infeasible
– e.g., the ImageNet dataset has millions of images
• A minibatch algorithm uses only 𝑏 examples in each
iteration
– 𝑚: the size of the entire training set
– 𝑏: minibatch size (1 ≤ 𝑏 < 𝑚)
– it is crucial that the minibatches are selected randomly
• unbiased gradient estimates require independent samples
– we can compute entirely separate parameter updates over
different sets of examples in parallel (see the sketch below)
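A minimal sketch of random minibatch selection (names illustrative):

```python
import numpy as np

def minibatches(X, y, b, rng):
    """Yield random minibatches of size b covering one epoch.

    Shuffling all m indices before slicing makes each batch an
    (approximately) independent sample, which keeps the minibatch
    gradient an unbiased estimate of the full-batch gradient.
    """
    m = X.shape[0]
    perm = rng.permutation(m)          # random order of all m examples
    for start in range(0, m, b):
        idx = perm[start:start + b]
        yield X[idx], y[idx]
```

Usage (illustrative): for X_b, y_b in minibatches(X, y, 64, np.random.default_rng(0)): ...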
Minibatch Algorithms
• Some thoughts on taking a random sample of the
dataset for training
– the standard error of the mean estimated from n samples is
given by σ/√n
• σ is the true standard deviation of the sampled values
– the √n in the denominator shows that there are less-than-linear
returns to using more examples
– let's compare n = 100 and n = 10,000
• n = 10,000 requires 100 times more computation than n = 100
– but reduces the standard error only by a factor of 10
– most optimization algorithms converge much faster (in total
computation) if they are allowed to
• rapidly compute approximate estimates of the gradient
rather than slowly computing the exact gradient
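The 1/√n effect is easy to verify numerically; a tiny sketch:

```python
import numpy as np

sigma = 1.0  # assume the per-sample values have true standard deviation 1
for n in (100, 10_000):
    print(n, sigma / np.sqrt(n))   # standard error of the mean
# 100   -> 0.1
# 10000 -> 0.01: 100x the computation for only a 10x smaller error
```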
Minibatch Algorithms
• Optimization algorithms that use only a single
example at a time are sometimes called stochastic
(or online)
• However, the term "online" is mostly used when
the examples are drawn from a stream of
continually created examples
• "Stochastic" has become the general term for what were
traditionally called minibatch or minibatch stochastic methods
Minibatch Algorithms
• Minibatch size
– larger batches → a more accurate estimate of the gradient, but
with less-than-linear returns
– parallel processing of the batch → memory scales with the
batch size; for much hardware (hw) this is the limiting factor
on batch size
– multicore hw gets underutilized with batches that are too small
– some hw achieves better runtime with specific array sizes; on
GPUs, power-of-2 batch sizes often offer better runtime
– small batches can offer a regularizing effect, possibly due to
the noise they add to the learning process
Batch vs. Minibatch
• Each minibatch iteration may have poorer
optimization performance than a full-batch iteration
• However, after many iterations, the minibatch
algorithm generally converges to a good solution
Saddle Point
• For many high-dimensional non-convex functions,
local minima are rare
– a local minimum must satisfy: zero gradient (∇θJ(θ) = 0) and
all n eigenvalues of the Hessian positive, where n is the # of
model parameters → exponentially unlikely as n grows
• Most zero-gradient points in deep neural
networks are saddle points (see the toy example below)
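A toy example (not from the slides) makes this concrete: f(x, y) = x² − y² has zero gradient at the origin, yet the origin is a saddle point, not a minimum:

```python
import numpy as np

# Hessian of f(x, y) = x^2 - y^2 at the origin (where the gradient is zero)
hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> mixed signs = saddle point
# A local minimum needs all n Hessian eigenvalues positive; if each sign
# were an independent coin flip, that happens with probability ~(1/2)^n,
# so saddle points dominate as n grows.
```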
Challenges
• Limitations of Gradient Descent
– Local minima
• Convex optimization → problem can be reduced to
finding a local minimum
– any local minimum is guaranteed to be a global
minimum
• With nonconvex functions, such as neural nets,
it is possible to have many local minima
– A deep model will have many local minima
» local minima can be problematic only if they have
high cost in comparison to the global minimum
– It is believed that, for very large neural networks,
most local minima have a low cost function value
» it is not crucial to find a true global minimum →
a set of parameters with a sufficiently low
cost will be a good solution to the problem
Challenges
• Limitation of Gradient Descent
– Local minima
Challenges
• Limitation of Gradient Descent
– Saddle Points
• For many high-dimensional nonconvex functions, local minima
(and maxima) are rare compared to saddle points (also
points with zero gradient)
• Some points around a saddle point have greater cost than the
saddle point, while others have a lower cost
• Optimization methods that are designed to solve for a point of
zero gradient (e.g., Newton’s method) can be attracted to these points
Challenges
• Limitation of Gradient Descent
– Cliffs
• cliffs are very steep regions in the objective function's landscape
• highly nonlinear deep neural networks often contain such sharp
nonlinearities in parameter space, resulting from the
multiplication of several parameters
• these nonlinearities give rise to very high derivatives in some places
• on an extremely steep cliff structure, the gradient update step
can move the parameters extremely far, usually jumping off the
cliff structure altogether
• this problem can be avoided using gradient clipping:
– when a very large step is imminent, the gradient clipping
heuristic intervenes to reduce the step size (see the sketch below)
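A minimal sketch of norm-based clipping, one common variant of the heuristic (names illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so its norm is at most `threshold`.

    The descent direction is preserved; only the step size is
    reduced, so a cliff no longer catapults the parameters.
    """
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g
```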
Challenges
• Limitation of Gradient Descent
– Gradient Clipping for Handling Cliffs
Basic Learning Algorithms
• Forward Propagation (review)
Basic Learning Algorithms
• Back Propagation (review)
Basic Learning Algorithms
• Parameter Update (review)
Basic Learning Algorithms
• Next iteration (review)
Basic Learning Algorithms
• Stochastic Gradient Descent (SGD)
– Minibatch of the training set:
data {𝒙(1), …, 𝒙(𝑚)} with targets {𝑦(1), …, 𝑦(𝑚)}
– Gradient: 𝒈 ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Apply update: θ ← θ − 𝜖𝒈
– learning rate 𝜖 is a critical hyperparameter for SGD
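A sketch of one SGD iteration (grad_fn is an assumed helper that returns the gradient of the average minibatch loss; it is not defined on the slides):

```python
def sgd_step(theta, grad_fn, X_batch, y_batch, eps):
    """One SGD update on a minibatch.

    eps is the learning rate, the critical hyperparameter
    noted on the slide.
    """
    g = grad_fn(theta, X_batch, y_batch)  # minibatch gradient estimate
    return theta - eps * g                # apply update
```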
Basic Learning Algorithms
• Momentum
– Designed to accelerate learning
• accumulates an exponentially decaying moving average
of past gradients
• continues to move in the direction of the accumulated gradients
– Comparison of GD and GD with momentum
Basic Learning Algorithms
• SGD with Momentum
– Minibatch of the training set:
data {𝒙(1), …, 𝒙(𝑚)} with targets {𝑦(1), …, 𝑦(𝑚)}
– Gradient: 𝒈 ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Accumulate velocity: 𝒗 ← α𝒗 − 𝜖𝒈
– Apply update: θ ← θ + 𝒗
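The same sketch extended with the velocity term (grad_fn again an assumed helper; alpha is the momentum decay coefficient):

```python
def momentum_step(theta, v, grad_fn, X_batch, y_batch, eps, alpha=0.9):
    """One SGD-with-momentum update on a minibatch.

    v is the exponentially decaying moving average of past
    gradients; alpha controls how quickly old gradients decay.
    """
    g = grad_fn(theta, X_batch, y_batch)  # minibatch gradient estimate
    v = alpha * v - eps * g               # accumulate velocity
    theta = theta + v                     # apply update
    return theta, v
```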
Algorithms with Adaptive
Learning Rates
• Adaptive gradient (AdaGrad)
– modified SGD with a per-parameter learning rate (lr)
• parameters with small accumulated gradients: lr shrinks slowly
• parameters with large accumulated gradients: lr shrinks rapidly
– Gradient: 𝒈 ← (1/𝑚) ∇θ ∑i L(f(𝒙(i); θ), 𝑦(i))
– Accumulate squared gradients: 𝒓 ← 𝒓 + 𝒈 ⊙ 𝒈
– Apply update: θ ← θ − (𝜖 / (δ + √𝒓)) ⊙ 𝒈, applied element-wise,
with a small constant δ (e.g., 10⁻⁷) for numerical stability
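A sketch of one AdaGrad iteration under the same assumptions (grad_fn an assumed helper; delta is the stability constant):

```python
import numpy as np

def adagrad_step(theta, r, grad_fn, X_batch, y_batch, eps, delta=1e-7):
    """One AdaGrad update on a minibatch.

    r accumulates element-wise squared gradients, so parameters with
    a history of large gradients see their effective learning rate
    shrink fastest; delta guards against division by zero.
    """
    g = grad_fn(theta, X_batch, y_batch)              # minibatch gradient
    r = r + g * g                                     # accumulate squared gradients
    theta = theta - (eps / (delta + np.sqrt(r))) * g  # per-parameter update
    return theta, r
```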
Questions?