CSC411: Optimization for Machine Learning
University of Toronto
September 20–26, 2018
based on slides by Eleni Triantafillou, Ladislav Rampasek, Jake Snell, Kevin
Swersky, Shenlong Wang, and others
Optimization
- Many machine learning problems reduce to finding the parameter setting that minimizes an objective function:
  θ* = argmin_θ f(θ)
- θ ∈ R^D: the parameters
- f : R^D → R: the objective (loss) function
- θ* is a global minimum if f(θ*) ≤ f(θ) for all θ
- θ* is a local minimum if f(θ*) ≤ f(θ) for all θ in a neighbourhood of θ*
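As a concrete illustration of the argmin notation (a toy example of my own, not from the slides), the sketch below takes f(θ) = (θ − 3)², whose minimizer is θ* = 3, and recovers it by brute force over a grid.

```python
import numpy as np

# Toy objective: f(theta) = (theta - 3)^2, minimized at theta* = 3.
f = lambda theta: (theta - 3.0) ** 2

thetas = np.linspace(-10.0, 10.0, 2001)    # grid of candidate parameters
theta_star = thetas[np.argmin(f(thetas))]  # brute-force argmin over the grid
print(theta_star)                          # approximately 3.0
```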
Optimization in Machine Learning
- Training data: a set of input–target pairs (x, y)
- A model with parameters θ defines a likelihood p(y | x, θ)
- A standard objective is the negative log-likelihood −log p(y | x, θ), summed over the training set
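To make the negative log-likelihood concrete, here is a minimal sketch (my own toy example, not from the slides) of the loss for a logistic-regression model p(y = 1 | x, θ) = σ(θᵀx) on a tiny synthetic dataset.

```python
import numpy as np

def nll(theta, X, y):
    """Negative log-likelihood of logistic regression: -sum_i log p(y_i | x_i, theta)."""
    logits = X @ theta
    p = 1.0 / (1.0 + np.exp(-logits))   # p(y = 1 | x, theta)
    eps = 1e-12                          # avoid log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Tiny synthetic dataset (hypothetical numbers, purely for illustration).
X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
theta = np.zeros(2)
print(nll(theta, X, y))   # loss at the initial parameters
```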
Gradients
- At a minimum θ*, the partial derivatives vanish:
  ∂f(θ*)/∂θ = 0
- Such a point θ* is called a stationary point
- In general this equation has no closed-form solution, so we search for θ* iteratively
- The gradient collects all the partial derivatives:
  ∇θ f = (∂f/∂θ_1, ∂f/∂θ_2, ..., ∂f/∂θ_D)
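For a concrete gradient, consider the least-squares objective f(θ) = ½∥Xθ − y∥² (a running toy example of my own, not prescribed by the slides); its gradient is ∇θ f = Xᵀ(Xθ − y), as the sketch below computes.

```python
import numpy as np

def f(theta, X, y):
    r = X @ theta - y
    return 0.5 * r @ r               # f(theta) = 0.5 * ||X theta - y||^2

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)     # gradient: X^T (X theta - y)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
print(grad_f(theta, X, y))           # one partial derivative per parameter
```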
Gradient Descent
- Choose a learning rate η
- Initialize θ_0
- For t = 1 : T:
  - δ_t ← −η ∇θ f(θ_{t−1})
  - θ_t ← θ_{t−1} + δ_t
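A minimal NumPy sketch of this update rule, assuming the toy least-squares objective from the example above (the data, learning rate, and iteration count are illustrative choices, not from the slides):

```python
import numpy as np

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)     # gradient of 0.5 * ||X theta - y||^2

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

eta, T = 0.01, 500                   # learning rate and number of iterations (hypothetical)
theta = np.zeros(2)                  # initialize theta_0
for t in range(T):
    delta = -eta * grad_f(theta, X, y)   # delta_t <- -eta * grad f(theta_{t-1})
    theta = theta + delta                # theta_t <- theta_{t-1} + delta_t
print(theta)
```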
Gradient Descent with Line Search
- Instead of a fixed learning rate η, choose a step size η_t at every iteration
- Initialize θ_0
- For t = 1 : T:
  - Choose η_t such that f(θ_{t−1} − η_t ∇θ f(θ_{t−1})) < f(θ_{t−1})
  - δ_t ← −η_t ∇θ f(θ_{t−1})
  - θ_t ← θ_{t−1} + δ_t
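One standard way to pick an η_t satisfying the decrease condition above is backtracking: start from a large step and halve it until the objective goes down. Below is a sketch on the same toy least-squares setup; the backtracking constants are my own illustrative choices, not the slides' prescription.

```python
import numpy as np

def f(theta, X, y):
    r = X @ theta - y
    return 0.5 * r @ r

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

theta = np.zeros(2)
for t in range(100):
    g = grad_f(theta, X, y)
    eta_t = 1.0                                   # start with a large step
    while f(theta - eta_t * g, X, y) >= f(theta, X, y) and eta_t > 1e-10:
        eta_t *= 0.5                              # backtrack until the objective decreases
    theta = theta - eta_t * g                     # theta_t <- theta_{t-1} + delta_t
print(theta, f(theta, X, y))
```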
Gradient Descent with Momentum
- Momentum parameter α ∈ [0, 1)
- Initialize θ_0
- Initialize δ_0 = 0
- For t = 1 : T:
  - δ_t ← −η ∇θ f(θ_{t−1}) + α δ_{t−1}
  - θ_t ← θ_{t−1} + δ_t
- α controls how much of the previous update is carried over into the current step
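A sketch of the momentum update on the same toy least-squares objective (η, α, and the data are illustrative choices):

```python
import numpy as np

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)        # gradient of 0.5 * ||X theta - y||^2

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

eta, alpha, T = 0.01, 0.9, 500          # learning rate, momentum in [0, 1), iterations
theta = np.zeros(2)                     # theta_0
delta = np.zeros(2)                     # delta_0 = 0
for t in range(T):
    delta = -eta * grad_f(theta, X, y) + alpha * delta   # delta_t <- -eta grad f + alpha delta_{t-1}
    theta = theta + delta                                # theta_t <- theta_{t-1} + delta_t
print(theta)
```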
When Do We Stop?
- Choose a learning rate η
- Initialize θ
- Repeat:
  - δ ← −η ∇θ f(θ)
  - θ ← θ + δ
- Until a stopping criterion is met, for example:
  - the objective stops changing: |f(θ_{t+1}) − f(θ_t)| < ϵ
  - the gradient is small: ∥∇θ f∥ < ϵ
  - a maximum number of iterations is reached
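A sketch of the same loop run until one of these criteria fires; the tolerance ϵ, learning rate, and toy objective are illustrative choices:

```python
import numpy as np

def f(theta, X, y):
    r = X @ theta - y
    return 0.5 * r @ r

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

eta, eps, max_iters = 0.01, 1e-8, 100000
theta = np.zeros(2)
prev = f(theta, X, y)
for t in range(max_iters):
    g = grad_f(theta, X, y)
    if np.linalg.norm(g) < eps:              # stop when the gradient is small
        break
    theta = theta - eta * g
    curr = f(theta, X, y)
    if abs(curr - prev) < eps:               # stop when the objective barely changes
        break
    prev = curr
print(t, theta)
```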
Learning Rate (Step Size)
In gradient descent, the learning rate η is a hyperparameter we need to tune. Here are some things that can go wrong:
- η too small: slow progress
- η too large: oscillations
- η much too large: instability
Good values are typically between 0.001 and 0.1. You should do a grid search if you want good performance (e.g. try 0.1, 0.03, 0.01, ...).
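A sketch of that grid search: run gradient descent with each candidate learning rate on the toy least-squares objective and compare final losses (the grid, data, and iteration budget are illustrative choices):

```python
import numpy as np

def f(theta, X, y):
    r = X @ theta - y
    return 0.5 * r @ r

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)

def run_gd(eta, X, y, T=200):
    theta = np.zeros(X.shape[1])
    for _ in range(T):
        theta = theta - eta * grad_f(theta, X, y)
    return f(theta, X, y)

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

for eta in [0.1, 0.03, 0.01, 0.003, 0.001]:   # grid of candidate learning rates
    print(eta, run_gd(eta, X, y))             # too-large values blow up, too-small ones barely move
```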
Checking Gradients with Finite Differences
- How do we know that the gradient ∇θ f we derived and implemented is correct?
- Compare it against a numerical estimate, the central difference:
  ∂f/∂θ_i ≈ [f(θ_1, ..., θ_i + ϵ, ..., θ_D) − f(θ_1, ..., θ_i − ϵ, ..., θ_D)] / (2ϵ)
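A sketch of this central-difference check against the analytic gradient of the toy least-squares objective (ϵ, the data, and the test point are illustrative choices):

```python
import numpy as np

def f(theta, X, y):
    r = X @ theta - y
    return 0.5 * r @ r

def grad_f(theta, X, y):
    return X.T @ (X @ theta - y)

def numerical_grad(f, theta, eps=1e-5, **kwargs):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        # central difference: (f(theta + eps e_i) - f(theta - eps e_i)) / (2 eps)
        g[i] = (f(theta + e, **kwargs) - f(theta - e, **kwargs)) / (2 * eps)
    return g

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.3, -0.7])

print(grad_f(theta, X, y))                 # analytic gradient
print(numerical_grad(f, theta, X=X, y=y))  # should agree to several decimal places
```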
Stochastic Gradient Descent
Batch gradient descent moves directly downhill. SGD takes steps
in a noisy direction, but moves downhill on average.
(Figure: batch gradient descent vs. stochastic gradient descent.)
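A sketch of mini-batch SGD on a synthetic regression problem; the batch size, learning rate, and data generation are illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (hypothetical): y = X w_true + noise.
N, D = 1000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

def grad_minibatch(theta, Xb, yb):
    # Gradient of the mean squared-error loss on one mini-batch.
    return Xb.T @ (Xb @ theta - yb) / len(yb)

eta, batch_size, epochs = 0.1, 32, 20
theta = np.zeros(D)
for epoch in range(epochs):
    idx = rng.permutation(N)                      # reshuffle the data each epoch
    for start in range(0, N, batch_size):
        b = idx[start:start + batch_size]
        theta = theta - eta * grad_minibatch(theta, X[b], y[b])   # noisy step, downhill on average
print(np.linalg.norm(theta - w_true))             # should be small
```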
SGD Learning Rate
In stochastic training, the learning rate also influences the
fluctuations due to the stochasticity of the gradients.
Typical strategy:
- Use a large learning rate early in training so you can get close to the optimum
- Gradually decay the learning rate to reduce the fluctuations
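One common way to implement such a decay is a 1/t-style schedule; the constants below are my own illustrative choices, not values from the slides.

```python
def decayed_learning_rate(t, eta0=0.5, decay=0.01):
    """Simple 1/t-style decay: large steps early in training, smaller steps later."""
    return eta0 / (1.0 + decay * t)

# Example: the step size shrinks gradually over training iterations.
for t in [0, 100, 1000, 10000]:
    print(t, decayed_learning_rate(t))
```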
SGD Learning Rate
Warning: by reducing the learning rate, you reduce the
fluctuations, which can appear to make the loss drop suddenly.
But this can come at the expense of long-run performance.
SGD and Non-convex optimization
(Figure: a non-convex objective on which stochastic gradient descent updates escape a local minimum and reach the global minimum.)
Stochastic methods have a chance of escaping from bad minima.
Gradient descent with a small step size converges to the first minimum it finds.