
15 Optimization Script

This document provides an introduction to optimization techniques. It outlines several types of optimization problems including blackbox, gradient-based, and problems where second derivatives can be evaluated. Examples of optimization in machine learning like support vector machines and robotics are provided. The document also lists various optimization methods that will be covered like gradient descent, Newton's method, constrained optimization, convex optimization, and blackbox search methods.


Introduction to Optimization

Marc Toussaint

July 23, 2015

This is a direct concatenation and reformatting of all lecture slides and exercises from the Optimization course (summer term 2015, U Stuttgart), including indexing to help prepare for exams. Printing on A4 paper: 3 columns in landscape.

Contents
1 Introduction
Types of optimization problems (1:3)

2 Unconstrained Optimization Basics
Plain gradient descent (2:1) Stepsize and step direction as core issues (2:2) Stepsize adaptation (2:4) Backtracking (2:5) Line search (2:5) Wolfe conditions (2:7) Gradient descent convergence (2:8) Steepest descent direction (2:11) Covariant gradient descent (2:13) Newton direction (2:14) Newton method (2:15) Gauss-Newton method (2:20) Quasi-Newton methods (2:23) Broyden-Fletcher-Goldfarb-Shanno (BFGS) (2:25) Conjugate gradient (2:28) Rprop (2:35)

3 Constrained Optimization
Constrained optimization (3:1) Log barrier method (3:6) Central path (3:9) Squared penalty method (3:12) Augmented Lagrangian method (3:14) Lagrangian: definition (3:21) Lagrangian: relation to KKT (3:24) Karush-Kuhn-Tucker (KKT) conditions (3:25) Lagrangian: saddle point view (3:27) Lagrange dual problem (3:29) Log barrier as approximate KKT (3:33) Primal-dual interior-point Newton method (3:36) Phase I optimization (3:40) Trust region (3:41)

4 Convex Optimization
Function types: convex, quasi-convex, uni-modal (4:1) Linear program (LP) (4:6) Quadratic program (QP) (4:6) LP in standard form (4:7) Simplex method (4:11) LP-relaxations of integer programs (4:15) Sequential quadratic programming (4:23)

5 Global & Bayesian Optimization
Bandits (5:4) Exploration, Exploitation (5:6) Belief planning (5:8) Upper Confidence Bound (UCB) (5:12) Global Optimization as infinite bandits (5:17) Gaussian Processes as belief (5:19) Expected Improvement (5:24) Maximal Probability of Improvement (5:24) GP-UCB (5:24)

6 Blackbox Optimization: Local, Stochastic & Model-based Search
Blackbox optimization: definition (6:1) Blackbox optimization: overview (6:3) Greedy local search (6:5) Stochastic local search (6:6) Simulated annealing (6:7) Random restarts (6:10) Iterated local search (6:11) Variable neighborhood search (6:13) Coordinate search (6:14) Pattern search (6:15) Nelder-Mead simplex method (6:16) General stochastic search (6:20) Evolutionary algorithms (6:23) Covariance Matrix Adaptation (CMA) (6:24) Estimation of Distribution Algorithms (EDAs) (6:28) Model-based optimization (6:31) Implicit filtering (6:34)

7 Exercises
7.1 Exercise 1
7.2 Exercise 2
7.3 Exercise 3
7.4 Exercise 4
7.5 Exercise 6
7.6 Exercise 5
7.7 Exercise 6
7.8 Exercise 7
7.9 Exercise 7
7.10 Exercise 8
7.11 Exercise 8
7.12 Exercise 10
7.13 Exercise 10
7.14 Exercise 11

8 Bullet points to help learning
8.1 Optimization Problems in General
8.2 Basic Unconstrained Optimization
8.3 Constrained Optimization
8.4 Convex Optimization
8.5 Search methods for Blackbox optimization
8.6 Bayesian Optimization

Index
1 Introduction

Why Optimization is interesting!


• In an otherwise unfortunate interview I was asked why “we guys” (AI, ML, optimal control people) always talk about optimality. “People are by no means optimal”, the interviewer said. I think that statement pinpoints the whole misunderstanding of the role and concept of optimality principles.
– Optimality principles are a means of scientific (or engineering) description.
– It is often easier to describe a thing (natural or artificial) via an optimality principle than directly

• Which science does not use optimality principles to describe nature & artifacts?
– Physics, Chemistry, Biology, Mechanics, ...
– Operations research, scheduling, ...
– Computer Vision, Speech Recognition, Machine Learning, Robotics, ...

• Endless applications
1:1

Teaching optimization
• Standard: Convex Optimization, Numerical Optimization
• Discrete Optimization (Stefan Funke)
• Exotics: Evolutionary Algorithms, Swarm optimization, etc

• In this lecture I try to cover the standard topics, but include as well work on
stochastic search & global optimization
1:2

Rough Types of Optimization Problems


• Generic optimization problem:
Let $x \in \mathbb{R}^n$, $f: \mathbb{R}^n \to \mathbb{R}$, $g: \mathbb{R}^n \to \mathbb{R}^m$, $h: \mathbb{R}^n \to \mathbb{R}^l$. Find

$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0,\; h(x) = 0$$

• Blackbox: only $f(x)$ can be evaluated
• Gradient: $\nabla f(x)$ can be evaluated
• Gauss-Newton type: $f(x) = \phi(x)^\top \phi(x)$ and $\nabla\phi(x)$ can be evaluated
• 2nd order: $\nabla^2 f(x)$ can be evaluated

• “Approximate upgrade”:
– Use samples of $f(x)$ to approximate $\nabla f(x)$ locally
– Use samples of $\nabla f(x)$ to approximate $\nabla^2 f(x)$ locally
1:3

Optimization in Machine Learning: SVMs


• optimization problem
$$\max_{\beta,\,\|\beta\|=1} M \quad \text{subject to} \quad y_i\,(\phi(x_i)^\top \beta) \ge M,\quad i = 1, \dots, n$$
• can be rephrased as
$$\min_{\beta} \|\beta\| \quad \text{subject to} \quad y_i\,(\phi(x_i)^\top \beta) \ge 1,\quad i = 1, \dots, n$$
Ridge regularization like ridge regression, but different loss
1:4

Optimization in Robotics
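• Trajectories:
Let $x_t \in \mathbb{R}^n$ be a joint configuration and $x = x_{1:T} = (x_1, \dots, x_T)$ a trajectory of length $T$. Find

$$\min_x \sum_{t=0}^{T} f_t(x_{t-k:t})^\top f_t(x_{t-k:t}) \quad \text{s.t.} \quad \forall t:\; g_t(x_t) \le 0,\; h_t(x_t) = 0 \qquad (1)$$

• Control:

$$\min_{u,\ddot q,\lambda} \|u - a\|_H^2 \qquad (2)$$
$$\text{s.t.} \quad u = M\ddot q + h + J_g^\top \lambda \qquad (3)$$
$$J_\phi \ddot q = c \qquad (4)$$
$$\lambda = \lambda^* \qquad (5)$$
$$J_g \ddot q = b \qquad (6)$$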

1:5

Optimization in Computer Vision

• Andres Bruhn’s lectures


• Flow estimation, (relaxed) min-cut problems, segmentation, ...
1:6

Planned Outline
• Unconstrained Optimization: Gradient- and 2nd order methods
– stepsize & direction, plain gradient descent, steepest descent, line search &
trust region methods, conjugate gradient
– Newton, Gauss-Newton, Quasi-Newton, (L)BFGS
• Constrained Optimization
– log barrier, squared penalties, augmented Lagrangian
– Lagrangian, KKT conditions, Lagrange dual, log barrier ↔ approx. KKT
• Special convex cases
– Linear Programming, (sequential) Quadratic Programming
– Simplex algorithm
– Relaxation of integer linear programs
• Global Optimization
– infinite bandits, probabilistic modelling, exploration vs. exploitation, GP-UCB
• Stochastic search
– Blackbox optimization (0th order methods), MCMC, downhill simplex
1:7
Books

Boyd and Vandenberghe: Convex Optimization.
http://www.stanford.edu/~boyd/cvxbook/

(this course will not go to the full depth in math of Boyd et al.)
1:8

Books

Nocedal & Wright: Numerical Optimization
www.bioinfo.org.cn/~wangchao/maa/Numerical_Optimization.pdf

1:9

Organisation
• Webpage:
http://ipvs.informatik.uni-stuttgart.de/mlr/marc/teaching/15-Optimization/
– Slides, Exercises & Software (C++)
– Links to books and other resources
• Admin things, please first ask:
Carola Stahl, Carola.Stahl@ipvs.uni-stuttgart.de, Raum 2.217

• Rules for the tutorials:


– Doing the exercises is crucial!
– At the beginning of each tutorial:
– sign into a list
– mark which exercises you have (successfully) worked on
– Students are randomly selected to present their solutions
– You need to have completed 50% of the exercises to be admitted to the exam
– Please check 2 weeks before the end of the term whether you can take the exam
1:10
2 Unconstrained Optimization Basics
Descent direction & stepsize, plain gradient descent, stepsize adaptation & monotonicity, line search, trust region, steepest descent, Newton, Gauss-Newton, Quasi-Newton, BFGS, conjugate gradient, exotic: Rprop

Gradient descent

• Objective function: $f: \mathbb{R}^n \to \mathbb{R}$
Gradient vector: $\nabla f(x) = \left[\frac{\partial}{\partial x} f(x)\right]^\top \in \mathbb{R}^n$

• Problem:

$$\min_x f(x)$$

where we can evaluate $f(x)$ and $\nabla f(x)$ for any $x \in \mathbb{R}^n$

• Plain gradient descent: iterative steps in the direction $-\nabla f(x)$.

Input: initial $x \in \mathbb{R}^n$, function $\nabla f(x)$, stepsize $\alpha$, tolerance $\theta$
Output: $x$
1: repeat
2:   $x \leftarrow x - \alpha \nabla f(x)$
3: until $|\Delta x| < \theta$ [perhaps for 10 iterations in sequence]

2:1
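The plain gradient descent loop can be sketched in a few lines of Python. This is only a minimal illustration, not the course software: the test function, the default stepsize `alpha=0.05` and the iteration cap `max_iter` are assumptions made for this example.

```python
def gradient_descent(grad, x, alpha=0.05, theta=1e-8, max_iter=10000):
    """Plain gradient descent: x <- x - alpha * grad(x) until the step is tiny."""
    for _ in range(max_iter):
        step = [-alpha * g_i for g_i in grad(x)]
        x = [x_i + s_i for x_i, s_i in zip(x, step)]
        # stop once the change in x (max norm) falls below the tolerance
        if max(abs(s_i) for s_i in step) < theta:
            break
    return x


# Hypothetical test problem: f(x) = (x1 - 1)^2 + 10 (x2 + 2)^2, minimum at (1, -2)
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_min = gradient_descent(grad_f, [0.0, 0.0])
print(x_min)
```

With this fixed stepsize the second coordinate converges in one step while the first contracts only by a factor 0.9 per iteration, which already hints at why the stepsize is a core issue.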

• Plain gradient descent is really not efficient


• Two core issues of unconstrained optimization:

A. Stepsize
B. Descent direction
2:2

Stepsize
• Making steps proportional to $\nabla f(x)$?

[Figure: a non-convex function; labels: small gradient, small step? large gradient, large step?]

• We need methods that
– robustly adapt stepsize
– exploit convexity, if known
– are perhaps independent of $|\nabla f(x)|$ (e.g. if non-convex as above)
2:3
Stepsize Adaptation

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$ and $\nabla f(x)$, tolerance $\theta$, parameters (defaults: $\varrho_\alpha^+ = 1.2$, $\varrho_\alpha^- = 0.5$, $\varrho_{ls} = 0.01$)
Output: $x$
1: initialize stepsize $\alpha = 1$
2: repeat
3:   $d \leftarrow -\frac{\nabla f(x)}{|\nabla f(x)|}$ // (alternative: $d = -\nabla f(x)$)
4:   while $f(x + \alpha d) > f(x) + \varrho_{ls} \nabla f(x)^\top (\alpha d)$ do // line search
5:     $\alpha \leftarrow \varrho_\alpha^- \alpha$ // decrease stepsize
6:   end while
7:   $x \leftarrow x + \alpha d$
8:   $\alpha \leftarrow \varrho_\alpha^+ \alpha$ // increase stepsize (alternative: $\alpha = 1$)
9: until $|\alpha d| < \theta$ [perhaps for 10 iterations in sequence]

• $\alpha$ determines the absolute stepsize
• Guaranteed monotonicity (by construction)
(“Typically” ensures convergence to locally convex minima; see later)
2:4
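A possible Python rendering of the stepsize adaptation scheme, using the normalized direction and the default parameters from the algorithm box. The test function and the iteration cap `max_iter` are assumptions made for this sketch.

```python
import math


def adaptive_gradient_descent(f, grad, x, theta=1e-8,
                              rho_inc=1.2, rho_dec=0.5, rho_ls=0.01,
                              max_iter=10000):
    alpha = 1.0
    for _ in range(max_iter):
        g = grad(x)
        norm = math.sqrt(sum(g_i * g_i for g_i in g))
        if norm == 0.0:                      # already stationary
            break
        d = [-g_i / norm for g_i in g]       # normalized descent direction
        fx = f(x)
        slope = -norm                        # grad(x)^T d for the normalized direction
        # backtracking: shrink alpha until the sufficient-decrease test passes
        while f([x_i + alpha * d_i for x_i, d_i in zip(x, d)]) > fx + rho_ls * alpha * slope:
            alpha *= rho_dec
        x = [x_i + alpha * d_i for x_i, d_i in zip(x, d)]
        step_len = alpha                     # |alpha d| = alpha because |d| = 1
        alpha *= rho_inc                     # be optimistic for the next iteration
        if step_len < theta:
            break
    return x


# Hypothetical test problem: f(x) = (x1 - 1)^2 + 10 (x2 + 2)^2
def f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_min = adaptive_gradient_descent(f, grad_f, [0.0, 0.0])
print(x_min)
```

Because every accepted step passes the sufficient-decrease test, the sequence of function values is monotonically decreasing, as stated above.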

Backtracking line search


• Line search in general denotes the problem

$$\min_{\alpha \ge 0} f(x + \alpha d)$$

for some step direction $d$.

• The most common line search is backtracking, which decreases $\alpha$ as long as

$$f(x + \alpha d) > f(x) + \varrho_{ls} \nabla f(x)^\top (\alpha d)$$

$\varrho_\alpha^-$ describes the stepsize decrement in case of a rejected step
$\varrho_{ls}$ describes a minimum desired decrease in $f(x)$

• Boyd et al.: typically $\varrho_{ls} \in [0.01, 0.3]$ and $\varrho_\alpha^- \in [0.1, 0.8]$
2:5

Backtracking line search

2:6

Wolfe Conditions

• The 1st Wolfe condition (“sufficient decrease condition”)

$$f(x + \alpha d) \le f(x) + \varrho_{ls} \nabla f(x)^\top (\alpha d)$$

requires a decrease of $f$ at least $\varrho_{ls}$-times “as expected” (this is exactly the acceptance condition of backtracking)

• The 2nd (stronger) Wolfe condition (“curvature condition”)

$$|\nabla f(x + \alpha d)^\top d| \le \varrho_{ls2}\, |\nabla f(x)^\top d|$$

requires a decrease of the slope by a factor $\varrho_{ls2}$.

$\varrho_{ls2} \in (\varrho_{ls}, \frac{1}{2})$ (for conjugate gradient)
• See Nocedal et al., Section 3.1 & 3.2 for more general proofs of convergence of any method that ensures the Wolfe conditions after each line search
2:7
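To make the two conditions concrete, here is a small Python check that evaluates both for a given step. The function name `wolfe_conditions`, the default `rho_ls2 = 0.4` and the 1D test function are assumptions for this sketch; the sufficient-decrease test uses the same form as the backtracking condition.

```python
def wolfe_conditions(f, grad, x, d, alpha, rho_ls=0.01, rho_ls2=0.4):
    """Return (sufficient_decrease, curvature) for the step x + alpha * d."""
    def dot(a, b):
        return sum(a_i * b_i for a_i, b_i in zip(a, b))

    y = [x_i + alpha * d_i for x_i, d_i in zip(x, d)]
    slope0 = dot(grad(x), d)      # directional derivative at x (negative for descent)
    slope1 = dot(grad(y), d)      # directional derivative at the candidate point
    sufficient = f(y) <= f(x) + rho_ls * alpha * slope0
    curvature = abs(slope1) <= rho_ls2 * abs(slope0)
    return sufficient, curvature


# Hypothetical 1D test function f(x) = x^2, stored as a length-1 vector
def f(x):
    return x[0] ** 2

def grad_f(x):
    return [2.0 * x[0]]


exact = wolfe_conditions(f, grad_f, [1.0], [-2.0], 0.5)     # step lands on the minimum
short = wolfe_conditions(f, grad_f, [1.0], [-2.0], 0.01)    # too short: slope barely changes
long_ = wolfe_conditions(f, grad_f, [1.0], [-2.0], 1.0)     # too long: no decrease in f
print(exact, short, long_)
```

The short step passes the first condition but fails the curvature condition, which is what rules out needlessly small steps.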
Convergence for (locally) convex functions
following Boyd et al. Sec 9.3.1
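• Assume that $\forall x$ the Hessian satisfies $m \le \text{eig}(\nabla^2 f(x)) \le M$. It follows

$$f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2}(y - x)^2 \;\le\; f(y) \;\le\; f(x) + \nabla f(x)^\top (y - x) + \frac{M}{2}(y - x)^2$$
$$f(x) - \frac{1}{2m}|\nabla f(x)|^2 \;\le\; f_{\min} \;\le\; f(x) - \frac{1}{2M}|\nabla f(x)|^2$$
$$|\nabla f(x)|^2 \ge 2m\,(f(x) - f_{\min})$$

• Consider a perfect line search with $y = x - \alpha^* \nabla f(x)$, $\alpha^* = \operatorname{argmin}_\alpha f(y(\alpha))$. The following eqn. holds as $M$ also upper-bounds $\nabla^2 f(x)$ along $-\nabla f(x)$:

$$\begin{aligned}
f(y) &\le f(x) - \frac{1}{2M}|\nabla f(x)|^2 \\
f(y) - f_{\min} &\le f(x) - f_{\min} - \frac{1}{2M}|\nabla f(x)|^2 \\
&\le f(x) - f_{\min} - \frac{2m}{2M}\,(f(x) - f_{\min}) \\
&\le \Big[1 - \frac{m}{M}\Big]\,(f(x) - f_{\min})
\end{aligned}$$

→ each step is contracting at least by $1 - \frac{m}{M} < 1$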
2:8

Convergence for (locally) convex functions


following Boyd et al. Sec 9.3.1
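• In the case of backtracking line search, backtracking will terminate at the latest when $\alpha \le \frac{1}{M}$, because for $y = x - \alpha \nabla f(x)$ and $\alpha \le \frac{1}{M}$ we have

$$\begin{aligned}
f(y) &\le f(x) - \alpha|\nabla f(x)|^2 + \frac{M\alpha^2}{2}|\nabla f(x)|^2 \\
&\le f(x) - \frac{\alpha}{2}|\nabla f(x)|^2 \\
&\le f(x) - \varrho_{ls}\,\alpha\,|\nabla f(x)|^2
\end{aligned}$$

As backtracking terminates for any $\alpha \le \frac{1}{M}$, a step $\alpha \ge \frac{\varrho_\alpha^-}{M}$ is chosen, such that

$$\begin{aligned}
f(y) &\le f(x) - \frac{\varrho_{ls}\varrho_\alpha^-}{M}|\nabla f(x)|^2 \\
f(y) - f_{\min} &\le f(x) - f_{\min} - \frac{\varrho_{ls}\varrho_\alpha^-}{M}|\nabla f(x)|^2 \\
&\le f(x) - f_{\min} - \frac{2m\varrho_{ls}\varrho_\alpha^-}{M}\,(f(x) - f_{\min}) \\
&\le \Big[1 - \frac{2m\varrho_{ls}\varrho_\alpha^-}{M}\Big]\,(f(x) - f_{\min})
\end{aligned}$$

→ each step is contracting at least by $1 - \frac{2m\varrho_{ls}\varrho_\alpha^-}{M} < 1$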
2:9

B. Descent Direction
2:10

Steepest Descent Direction


• The gradient $\nabla f(x)$ is sometimes called steepest descent direction

Is it really?

• Here is a possible definition:

The steepest descent direction is the one where, when I make a step of length 1, I get the largest decrease of $f$ in its linear approximation.

$$\operatorname*{argmin}_\delta \nabla f(x)^\top \delta \quad \text{s.t.} \quad \|\delta\| = 1$$

2:11
Steepest Descent Direction
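• But the norm $\|\delta\|^2 = \delta^\top A \delta$ depends on the metric $A$!

Let $A = B^\top B$ (Cholesky decomposition) and $z = B\delta$

$$\begin{aligned}
\delta^* &= \operatorname*{argmin}_\delta \nabla f^\top \delta \quad \text{s.t.} \quad \delta^\top A \delta = 1 \\
&= B^{-1} \operatorname*{argmin}_z (B^{-1} z)^\top \nabla f \quad \text{s.t.} \quad z^\top z = 1 \\
&= B^{-1} \operatorname*{argmin}_z z^\top B^{-\top} \nabla f \quad \text{s.t.} \quad z^\top z = 1 \\
&= B^{-1} [-B^{-\top} \nabla f] = -A^{-1} \nabla f
\end{aligned}$$

The steepest descent direction is $\delta = -A^{-1} \nabla f$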


2:12
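A small numerical illustration of the metric dependence, as a sketch only: the 2×2 metric `A` and the gradient `g` are made-up values, and the direction is rescaled so that $\delta^\top A \delta = 1$ to compare steps of equal length in the metric.

```python
import math


def steepest_descent_direction(A, g):
    """Return delta proportional to -A^{-1} g for a symmetric positive definite
    2x2 metric A, rescaled so that delta^T A delta = 1."""
    (a, b), (c, d) = A
    det = a * d - b * c
    # solve A v = g by Cramer's rule (fine for a 2x2 illustration)
    v = [(d * g[0] - b * g[1]) / det, (a * g[1] - c * g[0]) / det]
    a_norm = math.sqrt(v[0] * (a * v[0] + b * v[1]) + v[1] * (c * v[0] + d * v[1]))
    return [-v[0] / a_norm, -v[1] / a_norm]


A = [[1.0, 0.0], [0.0, 4.0]]     # hypothetical metric
g = [1.0, 1.0]                   # hypothetical gradient
delta = steepest_descent_direction(A, g)

# compare with the plain negative gradient rescaled to the same A-norm
plain_norm = math.sqrt(g[0] * (A[0][0] * g[0] + A[0][1] * g[1])
                       + g[1] * (A[1][0] * g[0] + A[1][1] * g[1]))
plain = [-g[0] / plain_norm, -g[1] / plain_norm]

decrease_metric = g[0] * delta[0] + g[1] * delta[1]   # linear decrease along -A^{-1} g
decrease_plain = g[0] * plain[0] + g[1] * plain[1]    # linear decrease along -g
print(decrease_metric, decrease_plain)
```

For this metric the direction $-A^{-1}\nabla f$ gives a larger linear decrease than $-\nabla f$ at equal $A$-norm, so the plain gradient is "steepest" only for the Euclidean metric.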

Behavior under linear coordinate transformations


• Let $B$ be a matrix that describes a linear transformation in coordinates

• A coordinate vector $x$ transforms as $z = Bx$
• The gradient vector $\nabla_x f(x)$ transforms as $\nabla_z f(z) = B^{-\top} \nabla_x f(x)$
• The metric $A$ transforms as $A_z = B^{-\top} A_x B^{-1}$
• The steepest descent transforms as $A_z^{-1} \nabla_z f(z) = B A_x^{-1} \nabla_x f(x)$

The steepest descent transforms like a normal coordinate vector (covariant)


2:13

Newton Direction
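• Assume we have access to the symmetric Hessian

$$\nabla^2 f(x) = \begin{pmatrix}
\frac{\partial^2}{\partial x_1 \partial x_1} f(x) & \frac{\partial^2}{\partial x_1 \partial x_2} f(x) & \cdots & \frac{\partial^2}{\partial x_1 \partial x_n} f(x) \\
\frac{\partial^2}{\partial x_1 \partial x_2} f(x) & & & \vdots \\
\vdots & & & \vdots \\
\frac{\partial^2}{\partial x_n \partial x_1} f(x) & \cdots & \cdots & \frac{\partial^2}{\partial x_n \partial x_n} f(x)
\end{pmatrix} \in \mathbb{R}^{n \times n}$$

• which defines the Taylor expansion:

$$f(x + \delta) \approx f(x) + \nabla f(x)^\top \delta + \frac{1}{2} \delta^\top \nabla^2 f(x)\, \delta$$

Note: $\nabla^2 f(x)$ acts like a metric for $\delta$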


2:14

Newton method
• For finding roots (zero points) of $f(x)$

$$x \leftarrow x - \frac{f(x)}{f'(x)}$$

• For finding optima of $f(x)$ in 1D:

$$x \leftarrow x - \frac{f'(x)}{f''(x)}$$

For $x \in \mathbb{R}^n$:

$$x \leftarrow x - \nabla^2 f(x)^{-1} \nabla f(x)$$
2:15
Why 2nd order information is better

• Better direction:

[Figure: optimization paths labeled 2nd Order, Plain Gradient, Conjugate Gradient]

• Better stepsize:
– a full step jumps directly to the minimum of the local quadratic approx.
– often this is already a good heuristic
– additional stepsize reduction and damping are straightforward
2:16

Newton method with adaptive stepsize

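Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $\nabla f(x)$, $\nabla^2 f(x)$, tolerance $\theta$, parameters (defaults: $\varrho_\alpha^+ = 1.2$, $\varrho_\alpha^- = 0.5$, $\varrho_\lambda^+ = 1$, $\varrho_\lambda^- = 0.5$, $\varrho_{ls} = 0.01$)
Output: $x$
1: initialize stepsize $\alpha = 1$ and damping $\lambda = \lambda_0$
2: repeat
3:   compute $d$ to solve $(\nabla^2 f(x) + \lambda I)\, d = -\nabla f(x)$
4:   while $f(x + \alpha d) > f(x) + \varrho_{ls} \nabla f(x)^\top (\alpha d)$ do // line search
5:     $\alpha \leftarrow \varrho_\alpha^- \alpha$ // decrease stepsize
6:     optionally: $\lambda \leftarrow \varrho_\lambda^+ \lambda$ and recompute $d$ // increase damping
7:   end while
8:   $x \leftarrow x + \alpha d$ // step is accepted
9:   $\alpha \leftarrow \min\{\varrho_\alpha^+ \alpha, 1\}$ // increase stepsize
10:  optionally: $\lambda \leftarrow \varrho_\lambda^- \lambda$ // decrease damping
11: until $\|\alpha d\|_\infty < \theta$

• Notes:
– Line 3 computes the Newton step $d = -\nabla^2 f(x)^{-1} \nabla f(x)$ (for $\lambda = 0$); use the special LAPACK routine dposv to solve $Ax = b$ (using Cholesky)
– $\lambda$ is called damping; it is related to trust region methods and makes the parabola steeper around the current $x$
for $\lambda \to \infty$: $d$ becomes collinear with $-\nabla f(x)$ but $|d| \to 0$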
2:17
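The algorithm above can be sketched in Python for $x \in \mathbb{R}^2$. This is a simplified illustration under stated assumptions: the damping `lam` is kept fixed (the optional lines 6 and 10 are omitted), the 2×2 system is solved by Cramer's rule instead of a Cholesky routine, and the strictly convex test function is made up for the example.

```python
def solve2(M, b):
    """Solve the 2x2 system M d = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]


def newton(f, grad, hess, x, theta=1e-10, lam=0.0,
           rho_inc=1.2, rho_dec=0.5, rho_ls=0.01, max_iter=100):
    alpha = 1.0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        # line 3: solve (H + lam I) d = -g
        d = solve2([[H[0][0] + lam, H[0][1]], [H[1][0], H[1][1] + lam]],
                   [-g[0], -g[1]])
        fx, slope = f(x), g[0] * d[0] + g[1] * d[1]
        # lines 4-7: backtracking line search on the stepsize
        while f([x[0] + alpha * d[0], x[1] + alpha * d[1]]) > fx + rho_ls * alpha * slope:
            alpha *= rho_dec
        x = [x[0] + alpha * d[0], x[1] + alpha * d[1]]
        step = max(abs(alpha * d[0]), abs(alpha * d[1]))
        alpha = min(rho_inc * alpha, 1.0)
        if step < theta:
            break
    return x


# Hypothetical strictly convex test function
def f(x):
    return x[0] ** 4 + x[1] ** 4 + (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + x[0] * x[1]

def grad_f(x):
    return [4.0 * x[0] ** 3 + 2.0 * (x[0] - 1.0) + x[1],
            4.0 * x[1] ** 3 + 2.0 * (x[1] + 2.0) + x[0]]

def hess_f(x):
    return [[12.0 * x[0] ** 2 + 2.0, 1.0], [1.0, 12.0 * x[1] ** 2 + 2.0]]


x_star = newton(f, grad_f, hess_f, [0.0, 0.0])
print(x_star, grad_f(x_star))
```

Fixing the damping keeps the sketch short; with a non-convex $f$ one would want the optional damping updates so that $d$ stays a descent direction.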

Demo
2:18

• In the remainder: Extensions of the Newton approach:


– Gauss-Newton
– Quasi-Newton
– BFGS, (L)BFGS
– Conjugate Gradient

• And a crazy method: Rprop

• Postponed: trust region methods properly


2:19

Gauss-Newton method
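• Consider a sum-of-squares problem:

$$\min_x f(x) \quad \text{where} \quad f(x) = \phi(x)^\top \phi(x) = \sum_i \phi_i(x)^2$$

and we can evaluate $\phi(x)$, $\nabla\phi(x)$ for any $x \in \mathbb{R}^n$
• $\phi(x) \in \mathbb{R}^d$ is a vector; each entry contributes a squared cost term to $f(x)$
• $\nabla\phi(x)$ is the Jacobian ($d \times n$-matrix)

$$\nabla\phi(x) = \begin{pmatrix}
\frac{\partial}{\partial x_1} \phi_1(x) & \frac{\partial}{\partial x_2} \phi_1(x) & \cdots & \frac{\partial}{\partial x_n} \phi_1(x) \\
\frac{\partial}{\partial x_1} \phi_2(x) & & & \vdots \\
\vdots & & & \vdots \\
\frac{\partial}{\partial x_1} \phi_d(x) & \cdots & \cdots & \frac{\partial}{\partial x_n} \phi_d(x)
\end{pmatrix} \in \mathbb{R}^{d \times n}$$

with 1st-order Taylor expansion $\phi(x + \delta) = \phi(x) + \nabla\phi(x)\delta$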


2:20

Gauss-Newton method
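• The gradient and Hessian of $f(x)$ become

$$\begin{aligned}
f(x) &= \phi(x)^\top \phi(x) \\
\nabla f(x) &= 2\nabla\phi(x)^\top \phi(x) \\
\nabla^2 f(x) &= 2\nabla\phi(x)^\top \nabla\phi(x) + 2\phi(x)^\top \nabla^2\phi(x)
\end{aligned}$$

• The Gauss-Newton method is the Newton method for $f(x) = \phi(x)^\top \phi(x)$ with the approximation $\nabla^2\phi(x) \approx 0$

In the Newton algorithm, replace line 3 by
3: compute $d$ to solve $(2\nabla\phi(x)^\top \nabla\phi(x) + \lambda I)\, d = -2\nabla\phi(x)^\top \phi(x)$

• The approximate Hessian $2\nabla\phi(x)^\top \nabla\phi(x)$ is always semi-pos-def!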


2:21
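A minimal Gauss-Newton sketch in Python for $x \in \mathbb{R}^2$, assuming full steps ($\alpha = 1$) and no damping ($\lambda = 0$), so it is simpler than the Newton algorithm with line 3 replaced. The example problem (Rosenbrock written as a sum of squares) and all function names are choices made for this illustration.

```python
def gauss_newton(phi, jac, x, theta=1e-10, max_iter=50):
    """Gauss-Newton for f(x) = phi(x)^T phi(x) with x in R^2 (full steps, no damping)."""
    for _ in range(max_iter):
        r, J = phi(x), jac(x)              # residual vector and d x 2 Jacobian
        # approximate Hessian 2 J^T J and gradient 2 J^T r (the factor 2 cancels)
        a = sum(row[0] * row[0] for row in J)
        b = sum(row[0] * row[1] for row in J)
        c = sum(row[1] * row[1] for row in J)
        g0 = sum(row[0] * r_i for row, r_i in zip(J, r))
        g1 = sum(row[1] * r_i for row, r_i in zip(J, r))
        det = a * c - b * b
        # solve [[a, b], [b, c]] d = -[g0, g1] by Cramer's rule
        d = [-(c * g0 - b * g1) / det, -(a * g1 - b * g0) / det]
        x = [x[0] + d[0], x[1] + d[1]]
        if max(abs(d[0]), abs(d[1])) < theta:
            break
    return x


# Hypothetical example: Rosenbrock written as least squares, minimum at (1, 1)
def phi(x):
    return [1.0 - x[0], 10.0 * (x[1] - x[0] ** 2)]

def jac(x):
    return [[-1.0, 0.0], [-20.0 * x[0], 10.0]]


x_star = gauss_newton(phi, jac, [-1.2, 1.0])
print(x_star)
```

Only first derivatives of $\phi$ are needed, yet on this zero-residual problem the iteration reaches the minimum within a few steps.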

Quasi-Newton methods
2:22

Quasi-Newton methods

• Assume we cannot evaluate $\nabla^2 f(x)$.

Can we still use 2nd order methods?

• Yes: We can approximate $\nabla^2 f(x)$ from the data $\{(x_i, \nabla f(x_i))\}_{i=1}^k$ of previous iterations
2:23

Basic example
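• We’ve seen already two data points $(x_1, \nabla f(x_1))$ and $(x_2, \nabla f(x_2))$
How can we estimate $\nabla^2 f(x)$?

• In 1D:

$$\nabla^2 f(x) \approx \frac{\nabla f(x_2) - \nabla f(x_1)}{x_2 - x_1}$$

• In $\mathbb{R}^n$: let $y = \nabla f(x_2) - \nabla f(x_1)$, $\delta = x_2 - x_1$

$$\nabla^2 f(x)\, \delta \stackrel{!}{=} y \qquad\qquad \delta \stackrel{!}{=} \nabla^2 f(x)^{-1} y$$
$$\nabla^2 f(x) = \frac{y\, y^\top}{y^\top \delta} \qquad\qquad \nabla^2 f(x)^{-1} = \frac{\delta \delta^\top}{\delta^\top y}$$

Convince yourself that the last line solves the desired relations
[Left: how to update $\nabla^2 f(x)$. Right: how to update directly $\nabla^2 f(x)^{-1}$.]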
2:24
BFGS
• Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

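Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $\nabla f(x)$, tolerance $\theta$
Output: $x$
1: initialize $H^{-1} = I_n$
2: repeat
3:   compute $d = -H^{-1} \nabla f(x)$
4:   perform a line search $\min_\alpha f(x + \alpha d)$
5:   $\delta \leftarrow \alpha d$
6:   $y \leftarrow \nabla f(x + \delta) - \nabla f(x)$
7:   $x \leftarrow x + \delta$
8:   update $H^{-1} \leftarrow \Big(I - \frac{y\delta^\top}{\delta^\top y}\Big)^\top H^{-1} \Big(I - \frac{y\delta^\top}{\delta^\top y}\Big) + \frac{\delta\delta^\top}{\delta^\top y}$
9: until $\|\delta\|_\infty < \theta$

• Notes:
– The last term $\frac{\delta\delta^\top}{\delta^\top y}$ (blue in the original slides) is the $H^{-1}$-update as on the previous slide
– The factors $\big(I - \frac{y\delta^\top}{\delta^\top y}\big)$ (red in the original slides) “delete” previous $H^{-1}$-components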
2:25
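A compact pure-Python sketch of the BFGS loop above. Assumptions of this illustration: the line search in step 4 is replaced by simple backtracking with sufficient decrease, the $H^{-1}$-update is skipped when $\delta^\top y \le 0$, and the convex test function is made up.

```python
def bfgs(f, grad, x, theta=1e-8, max_iter=200):
    n = len(x)

    def dot(a, b):
        return sum(a_i * b_i for a_i, b_i in zip(a, b))

    Hinv = [[float(i == j) for j in range(n)] for i in range(n)]   # H^{-1} = I
    g = grad(x)
    for _ in range(max_iter):
        d = [-dot(row, g) for row in Hinv]
        # backtracking line search (an assumption of this sketch)
        alpha, fx, slope = 1.0, f(x), dot(g, d)
        while f([x_i + alpha * d_i for x_i, d_i in zip(x, d)]) > fx + 0.01 * alpha * slope:
            alpha *= 0.5
        delta = [alpha * d_i for d_i in d]
        x = [x_i + s_i for x_i, s_i in zip(x, delta)]
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]
        g = g_new
        dy = dot(delta, y)
        if dy > 0.0:
            # V = I - y delta^T / (delta^T y);  H^{-1} <- V^T H^{-1} V + delta delta^T / (delta^T y)
            V = [[float(i == j) - y[i] * delta[j] / dy for j in range(n)] for i in range(n)]
            HV = [[sum(Hinv[i][k] * V[k][j] for k in range(n)) for j in range(n)]
                  for i in range(n)]
            Hinv = [[sum(V[k][i] * HV[k][j] for k in range(n)) + delta[i] * delta[j] / dy
                     for j in range(n)] for i in range(n)]
        if max(abs(s_i) for s_i in delta) < theta:
            break
    return x


# Hypothetical strictly convex test function
def f(x):
    return x[0] ** 4 + x[1] ** 4 + (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2 + x[0] * x[1]

def grad_f(x):
    return [4.0 * x[0] ** 3 + 2.0 * (x[0] - 1.0) + x[1],
            4.0 * x[1] ** 3 + 2.0 * (x[1] + 2.0) + x[0]]


x_star = bfgs(f, grad_f, [0.0, 0.0])
print(x_star, grad_f(x_star))
```

Note that only $f$ and $\nabla f$ are evaluated; the curvature information is collected entirely from the pairs $(\delta, y)$.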

Quasi-Newton methods
• BFGS is the most popular of all Quasi-Newton methods
Others exist, which differ in the exact $H^{-1}$-update

• L-BFGS (limited memory BFGS) is a version which does not require explicitly storing $H^{-1}$ but instead stores the previous data $\{(x_i, \nabla f(x_i))\}_{i=1}^k$ and manages to compute $d = -H^{-1} \nabla f(x)$ directly from this data

• Some thought:
In principle, there are alternative ways to estimate $H^{-1}$ from the data $\{(x_i, f(x_i), \nabla f(x_i))\}_{i=1}^k$, e.g. using Gaussian Process regression with derivative observations
– Not only the derivatives but also the value $f(x_i)$ should give information on $H(x)$ for non-quadratic functions
– Should one weight ‘local’ data stronger than ‘far away’?
(GP covariance function)
2:26

(Nonlinear) Conjugate Gradient


2:27

Conjugate Gradient
• The “Conjugate Gradient Method” is a method for solving (large, or sparse) linear eqn. systems $Ax + b = 0$, without inverting or decomposing $A$. The steps will be “$A$-orthogonal” (=conjugate).
We mention its extension for optimizing nonlinear functions $f(x)$

• A key insight:
– at $x_k$ we computed $g' = \nabla f(x_k)$
– assume we made an exact line-search step to $x_{k+1}$
– at $x_{k+1}$ we computed $g = \nabla f(x_{k+1})$

What conclusions can we draw about the “local quadratic shape” of f ?


2:28

Conjugate Gradient
Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $\nabla f(x)$, tolerance $\theta$
Output: $x$
1: initialize descent direction $d = g = -\nabla f(x)$
2: repeat
3:   $\alpha \leftarrow \operatorname{argmin}_\alpha f(x + \alpha d)$ // line search
4:   $x \leftarrow x + \alpha d$
5:   $g' \leftarrow g$, $g = -\nabla f(x)$ // store and compute grad
6:   $\beta \leftarrow \max\Big\{\frac{g^\top (g - g')}{g'^\top g'}, 0\Big\}$
7:   $d \leftarrow g + \beta d$ // conjugate descent direction
8: until $|\Delta x| < \theta$

• Notes:
– $\beta > 0$: The new descent direction always adds a bit of the old direction!
– This essentially provides 2nd order information
– The equation for $\beta$ is by Polak-Ribière: On a quadratic function $f(x) = x^\top A x + b^\top x$ this leads to conjugate search directions, $d'^\top A d = 0$.
– Line search can be replaced by 1st and 2nd Wolfe condition with $\varrho_{ls2} < \frac{1}{2}$
2:29
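On a quadratic $f(x) = x^\top A x + b^\top x$ the line search in step 3 has a closed form, which allows a short Python sketch of the algorithm. The matrix `A`, the vector `b` and the slightly reordered stopping test (checked right after the step) are assumptions of this illustration.

```python
def conjugate_gradient(A, b, x, theta=1e-10, max_iter=100):
    """Minimize f(x) = x^T A x + b^T x (A symmetric positive definite) by
    conjugate gradient with the Polak-Ribiere beta and an exact line search."""
    def dot(u, v):
        return sum(u_i * v_i for u_i, v_i in zip(u, v))

    def matvec(M, v):
        return [dot(row, v) for row in M]

    def neg_grad(z):
        return [-(2.0 * Az_i + b_i) for Az_i, b_i in zip(matvec(A, z), b)]

    g = neg_grad(x)
    d = list(g)
    for _ in range(max_iter):
        curvature = 2.0 * dot(d, matvec(A, d))   # d^T (Hessian) d with Hessian = 2A
        if curvature <= 0.0:                     # d = 0: nothing left to do
            break
        alpha = dot(g, d) / curvature            # exact minimizer along d for a quadratic
        x = [x_i + alpha * d_i for x_i, d_i in zip(x, d)]
        if max(abs(alpha * d_i) for d_i in d) < theta:
            break
        g_old, g = g, neg_grad(x)
        beta = max(dot(g, [gi - go for gi, go in zip(g, g_old)]) / dot(g_old, g_old), 0.0)
        d = [g_i + beta * d_i for g_i, d_i in zip(g, d)]
    return x


A = [[3.0, 1.0], [1.0, 2.0]]     # hypothetical positive definite matrix
b = [-1.0, 4.0]
x_star = conjugate_gradient(A, b, [0.0, 0.0])
print(x_star)                    # the exact minimizer of this example is (0.6, -1.3)
```

For this 2D quadratic the two conjugate directions suffice, so the third pass through the loop only confirms that the step has become negligible.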

Conjugate Gradient

• For quadratic functions CG converges in n iterations. But each iteration does line
search
2:30

Convergence Rates Notes


2:31

Convergence Rates Notes

• Linear, quadratic convergence (for $p = 1, 2$):

$$\lim_k \frac{|x_{k+1} - x^*|}{|x_k - x^*|^p} = r$$

with rate $r$. E.g. $x_k = r^k$ (linear) or $x_{k+1} = r x_k^2$ (quadratic)


2:32

Convergence Rates Notes


• Theorem 3.3 in Nocedal et al.:
Plain gradient descent with exact line search applied to $f(x) = x^\top A x$, $A$ with eigenvalues $0 < \lambda_1 \le \dots \le \lambda_n$, satisfies

$$\|x_{k+1} - x^*\|_A^2 \le \Big(\frac{\lambda_n - \lambda_1}{\lambda_n + \lambda_1}\Big)^2 \|x_k - x^*\|_A^2$$

• same on a smooth, locally pos-def function $f(x)$: For sufficiently large $k$

$$f(x_{k+1}) - f(x^*) \le r^2\, [f(x_k) - f(x^*)]$$

• Newton steps (with $\alpha = 1$) on smooth locally pos-def function $f(x)$:
– $x_k$ converges quadratically to $x^*$
– $|\nabla f(x_k)|$ converges quadratically to zero
• Quasi-Newton methods also converge superlinearly if the Hessian approximation is sufficiently precise (Thm. 3.7)
2:33

Rprop
2:34

Rprop
“Resilient Back Propagation” (outdated name from NN times...)

Input: initial $x \in \mathbb{R}^n$, function $f(x)$, $\nabla f(x)$, initial stepsize $\alpha$, tolerance $\theta$
Output: $x$
1: initialize $x = x_0$, all $\alpha_i = \alpha$, all $g'_i = 0$
2: repeat
3:   $g \leftarrow \nabla f(x)$
4:   $x' \leftarrow x$
5:   for $i = 1 : n$ do
6:     if $g_i g'_i > 0$ then // same direction as last time
7:       $\alpha_i \leftarrow 1.2\alpha_i$
8:       $x_i \leftarrow x_i - \alpha_i\, \text{sign}(g_i)$
9:       $g'_i \leftarrow g_i$
10:    else if $g_i g'_i < 0$ then // change of direction
11:      $\alpha_i \leftarrow 0.5\alpha_i$
12:      $x_i \leftarrow x_i - \alpha_i\, \text{sign}(g_i)$
13:      $g'_i \leftarrow 0$ // force last case next time
14:    else
15:      $x_i \leftarrow x_i - \alpha_i\, \text{sign}(g_i)$
16:      $g'_i \leftarrow g_i$
17:    end if
18:    optionally: cap $\alpha_i \in [\alpha_{\min} x_i, \alpha_{\max} x_i]$
19:  end for
20: until $|x' - x| < \theta$ for 10 iterations in sequence

2:35
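The Rprop box translates almost line by line into Python. In this sketch the optional capping of $\alpha_i$ (line 18) is left out, and the test function, the default `alpha=0.1` and the iteration cap are assumptions for the example.

```python
def sign(v):
    return (v > 0) - (v < 0)


def rprop(grad, x, alpha=0.1, theta=1e-8, max_iter=10000):
    x = list(x)                      # work on a copy of the start point
    n = len(x)
    alphas = [alpha] * n             # one stepsize per dimension
    g_old = [0.0] * n
    small_steps = 0
    for _ in range(max_iter):
        g = grad(x)
        x_old = list(x)
        for i in range(n):
            if g[i] * g_old[i] > 0:        # same direction as last time
                alphas[i] *= 1.2
                x[i] -= alphas[i] * sign(g[i])
                g_old[i] = g[i]
            elif g[i] * g_old[i] < 0:      # change of direction
                alphas[i] *= 0.5
                x[i] -= alphas[i] * sign(g[i])
                g_old[i] = 0.0             # force the last case next time
            else:
                x[i] -= alphas[i] * sign(g[i])
                g_old[i] = g[i]
        step = max(abs(a - b) for a, b in zip(x, x_old))
        small_steps = small_steps + 1 if step < theta else 0
        if small_steps >= 10:              # small step for 10 iterations in sequence
            break
    return x


# Hypothetical test problem: f(x) = (x1 - 1)^2 + 10 (x2 + 2)^2
def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


x_min = rprop(grad_f, [0.0, 0.0])
print(x_min)
```

Only the sign of each gradient component enters the update, so rescaling $f$ or a single coordinate's gradient leaves the iterates unchanged.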

Rprop
• Rprop is a bit crazy:
– stepsize adaptation in each dimension separately
– it not only ignores $|\nabla f|$ but also its exact direction
step directions may differ up to < 90° from $-\nabla f$
– Often works very robustly
– Guarantees? See work by Ch. Igel

• If you like, have a look at:
Christian Igel, Marc Toussaint, W. Weishui (2005): Rprop using the natural gradient compared to Levenberg-Marquardt optimization. In Trends and Applications in Constructive Approximation. International Series of Numerical Mathematics, volume 151, 259-272.
2:36

Appendix
2:37

Stopping Criteria
• Standard references (Boyd) define stopping criteria based on the “change” in $f(x)$, e.g. $|\Delta f(x)| < \theta$ or $|\nabla f(x)| < \theta$.

• Throughout I will define stopping criteria based on the change in $x$, e.g. $|\Delta x| < \theta$! In my experience with certain applications this is more meaningful, and invariant to the scaling of $f$. But this is application dependent.
2:38
Evaluating optimization costs
• Standard references (Boyd) assume line search is cheap and measure optimization costs as the number of iterations (counting 1 per line search).

• Throughout I will assume that every evaluation of $f(x)$ or $(f(x), \nabla f(x))$ or $(f(x), \nabla f(x), \nabla^2 f(x))$ is approx. equally expensive, as is the case in certain applications.
2:39
3 Constrained Optimization
General definition, log barriers, central path, squared penalties, augmented Lagrangian (equalities & inequalities), the Lagrangian, force balance view & KKT conditions, saddle point view, dual problem, min-max max-min duality, modified KKT & log barriers, Phase I

Constrained Optimization
• General constrained optimization problem:
Let $x \in \mathbb{R}^n$, $f: \mathbb{R}^n \to \mathbb{R}$, $g: \mathbb{R}^n \to \mathbb{R}^m$, $h: \mathbb{R}^n \to \mathbb{R}^l$. Find

$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0,\; h(x) = 0$$

In this lecture I’ll mostly focus on inequality constraints g, equality constraints are
analogous/easier

• Applications
– Find an optimal, non-colliding trajectory in robotics
– Optimize the shape of a turbine blade, s.t. it must not break
– Optimize the train schedule, s.t. consistency/possibility
3:1

General approaches
• Try to somehow transform the constrained problem to
– a series of unconstrained problems
– a single but larger unconstrained problem
– another constrained problem, hopefully simpler (dual, convex)


3:2

General approaches
• Penalty & Barriers
– Associate an (adaptive) penalty cost with violation of the constraint
– Associate an additional “force compensating the gradient into the constraint” (augmented Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)
– For ‘active’ constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods


– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)


– Walk along the constraint boundaries
3:3

Barriers & Penalties


• Convention:

A barrier is really ∞ for g(x) > 0

A penalty is zero for g(x) ≤ 0 and increases with g(x) > 0


3:4
Log barrier method or Interior Point method
3:5

Log barrier method


• Instead of

$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0$$

we address

$$\min_x f(x) - \mu \sum_i \log(-g_i(x))$$

3:6

Log barrier

• For $\mu \to 0$, $-\mu \log(-g)$ converges to $\infty\,[g > 0]$
Notation: [boolean expression] ∈ {0, 1}
• The barrier gradient $\nabla\big(-\log(-g)\big) = \frac{\nabla g}{-g}$ pushes away from the constraint

• Eventually we want to have a very small $\mu$, but choosing small $\mu$ makes the barrier very non-smooth, which might be bad for gradient and 2nd order methods
3:7

Log barrier method

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $g(x)$, $\nabla f(x)$, $\nabla g(x)$, tolerance $\theta$, parameters (defaults: $\varrho_\mu^- = 0.5$, $\mu_0 = 1$)
Output: $x$
1: initialize $\mu = \mu_0$
2: repeat
3:   find $x \leftarrow \operatorname{argmin}_x f(x) - \mu \sum_i \log(-g_i(x))$ with tolerance $\sim 10\theta$
4:   decrease $\mu \leftarrow \varrho_\mu^- \mu$
5: until $|\Delta x| < \theta$

Note: See Boyd & Vandenberghe for alternative stopping criteria based on $f$ precision (duality gap) and better choice of initial $\mu$ (which is called $t$ there).
3:8
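A one-dimensional Python sketch of the log barrier method. The problem ($\min (x-2)^2$ s.t. $x - 1 \le 0$), the inner solver (Newton steps with backtracking, which rejects infeasible points because the barrier is set to $+\infty$ there) and the iteration caps are assumptions of this illustration.

```python
import math


def log_barrier_1d(theta=1e-6, mu0=1.0, rho_mu=0.5, x=0.0, max_outer=100):
    """Log barrier method for the hypothetical problem
    min (x - 2)^2  s.t.  g(x) = x - 1 <= 0, starting from a strictly feasible x."""

    def F(z, mu):
        # barrier objective; +inf outside the interior so that line search rejects it
        return (z - 2.0) ** 2 - mu * math.log(1.0 - z) if z < 1.0 else math.inf

    def inner_newton(z, mu, tol):
        for _ in range(100):
            dF = 2.0 * (z - 2.0) + mu / (1.0 - z)
            ddF = 2.0 + mu / (1.0 - z) ** 2
            d, alpha = -dF / ddF, 1.0
            while F(z + alpha * d, mu) > F(z, mu) + 0.01 * alpha * dF * d:
                alpha *= 0.5
            z += alpha * d
            if abs(alpha * d) < tol:
                break
        return z

    mu = mu0
    for _ in range(max_outer):
        x_old = x
        x = inner_newton(x, mu, 10.0 * theta)    # step 3: inner minimization
        mu *= rho_mu                             # step 4: decrease mu
        if abs(x - x_old) < theta:
            break
    return x


x_star = log_barrier_1d()
print(x_star)    # approaches the constrained optimum x = 1 from the inside
```

Each inner problem is warm-started from the previous solution, so the iterates stay strictly feasible and move toward the boundary as $\mu$ shrinks.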

Central Path
• Every $\mu$ defines a different optimal $x^*(\mu)$

$$x^*(\mu) = \operatorname*{argmin}_x f(x) - \mu \sum_i \log(-g_i(x))$$

• Each point on the path can be understood as the optimal compromise of minimizing $f(x)$ and a repelling force of the constraints. (Which corresponds to dual variables $\lambda^*(\mu)$.)
3:9

We will revisit the log barrier method later, once we have introduced the Lagrangian...
3:10

Squared Penalty Method


3:11

Squared Penalty Method


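• This is perhaps the simplest approach
• Instead of

$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0$$

we address

$$\min_x f(x) + \mu \sum_{i=1}^m [g_i(x) > 0]\; g_i(x)^2$$

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $g(x)$, $\nabla f(x)$, $\nabla g(x)$, tol. $\theta$, $\epsilon$, parameters (defaults: $\varrho_\mu^+ = 10$, $\mu_0 = 1$)
Output: $x$
1: initialize $\mu = \mu_0$
2: repeat
3:   find $x \leftarrow \operatorname{argmin}_x f(x) + \mu \sum_i [g_i(x) > 0]\; g_i(x)^2$ with tolerance $\sim 10\theta$
4:   $\mu \leftarrow \varrho_\mu^+ \mu$
5: until $|\Delta x| < \theta$ and $\forall i: g_i(x) < \epsilon$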

3:12
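The same toy problem ($\min (x-2)^2$ s.t. $x - 1 \le 0$) with the squared penalty method, as a Python sketch. The problem, the Newton-type inner solver and the iteration caps are assumptions of this illustration; for this problem the penalized minimizer is $x = 1 + \frac{1}{1+\mu}$, so the violation never vanishes for finite $\mu$.

```python
def squared_penalty_1d(theta=1e-6, eps=1e-4, mu0=1.0, rho_mu=10.0, x=0.0):
    """Squared penalty method for the hypothetical problem
    min (x - 2)^2  s.t.  g(x) = x - 1 <= 0."""

    def inner_newton(z, mu, tol):
        for _ in range(100):
            active = 1.0 if z > 1.0 else 0.0          # indicator [g(z) > 0]
            dF = 2.0 * (z - 2.0) + 2.0 * mu * active * (z - 1.0)
            ddF = 2.0 + 2.0 * mu * active
            step = -dF / ddF
            z += step
            if abs(step) < tol:
                break
        return z

    mu = mu0
    for _ in range(50):
        x_old = x
        x = inner_newton(x, mu, 10.0 * theta)         # step 3: inner minimization
        mu *= rho_mu                                  # step 4: increase the penalty
        if abs(x - x_old) < theta and x - 1.0 < eps:
            break
    return x


x_star = squared_penalty_1d()
print(x_star, x_star - 1.0)    # slightly infeasible: x_star > 1
```

The returned point is within the tolerance of the constraint but still on the infeasible side, in contrast to the log barrier iterates.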

Squared Penalty Method


• The method is ok, but will always lead to some violation of constraints

• A better idea would be to add an out-pushing gradient/force −∇gi (x) for every
constraint gi (x) > 0 that is violated

Ideally, the out-pushing gradient mixes with −∇f (x) exactly such that the result
becomes tangential to the constraint!

This idea leads to the augmented Lagrangian approach


3:13

Augmented Lagrangian
(We can introduce this in a self-contained manner, without yet defining the “Lagrangian”)
3:14
Augmented Lagrangian (equality constraint)
• We first consider an equality constraint before addressing inequalities
• Instead of

$$\min_x f(x) \quad \text{s.t.} \quad h(x) = 0$$

we address

$$\min_x f(x) + \mu \sum_{i=1}^m h_i(x)^2 + \sum_{i=1}^m \lambda_i h_i(x) \qquad (7)$$

• Note:
– The gradient $\nabla h_i(x)$ is always orthogonal to the constraint
– By tuning $\lambda_i$ we can induce a “virtual gradient” $\lambda_i \nabla h_i(x)$
– The term $\mu \sum_{i=1}^m h_i(x)^2$ penalizes as before

• Here is the trick:
– First minimize (7) for some $\mu$ and $\lambda_i$
– This will in general lead to a (slight) penalty $\mu \sum_{i=1}^m h_i(x)^2$
– For the next iteration, choose $\lambda_i$ to generate exactly the gradient that was previously generated by the penalty
3:15

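• Optimality condition after an iteration:

$$x' = \operatorname*{argmin}_x f(x) + \mu \sum_{i=1}^m h_i(x)^2 + \sum_{i=1}^m \lambda_i h_i(x)$$
$$\Rightarrow \quad 0 = \nabla f(x') + \mu \sum_{i=1}^m 2 h_i(x') \nabla h_i(x') + \sum_{i=1}^m \lambda_i \nabla h_i(x')$$

• Update $\lambda$’s for the next iteration:

$$\sum_{i=1}^m \lambda_i^{\text{new}} \nabla h_i(x') = \mu \sum_{i=1}^m 2 h_i(x') \nabla h_i(x') + \sum_{i=1}^m \lambda_i^{\text{old}} \nabla h_i(x')$$
$$\lambda_i^{\text{new}} = \lambda_i^{\text{old}} + 2\mu h_i(x')$$

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $h(x)$, $\nabla f(x)$, $\nabla h(x)$, tol. $\theta$, $\epsilon$, parameters (defaults: $\varrho_\mu^+ = 1$, $\mu_0 = 1$)
Output: $x$
1: initialize $\mu = \mu_0$, $\lambda_i = 0$
2: repeat
3:   find $x \leftarrow \operatorname{argmin}_x f(x) + \mu \sum_i h_i(x)^2 + \sum_i \lambda_i h_i(x)$
4:   $\forall i: \lambda_i \leftarrow \lambda_i + 2\mu h_i(x')$
5:   optionally, $\mu \leftarrow \varrho_\mu^+ \mu$
6: until $|\Delta x| < \theta$ and $|h_i(x)| < \epsilon$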

3:16
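A small Python sketch of the equality-constrained augmented Lagrangian loop on a toy problem, $\min x_1^2 + x_2^2$ s.t. $x_1 + x_2 - 1 = 0$. The problem, the fixed-step gradient descent used as inner solver and the iteration counts are assumptions of this illustration; $\mu$ is kept at 1 throughout (default $\varrho_\mu^+ = 1$).

```python
def augmented_lagrangian(theta=1e-8, eps=1e-8, mu=1.0, max_outer=100):
    """Augmented Lagrangian for the hypothetical problem
    min x1^2 + x2^2  s.t.  h(x) = x1 + x2 - 1 = 0."""
    def h(x):
        return x[0] + x[1] - 1.0

    def inner_minimize(x, lam):
        # fixed-step gradient descent on f(x) + mu h(x)^2 + lam h(x);
        # the stepsize 0.1 is an assumption that suits this small quadratic
        for _ in range(300):
            common = 2.0 * mu * h(x) + lam       # (2 mu h + lam) times grad h = (1, 1)
            x = [x_i - 0.1 * (2.0 * x_i + common) for x_i in x]
        return x

    x, lam = [0.0, 0.0], 0.0
    for _ in range(max_outer):
        x_old = x
        x = inner_minimize(x, lam)               # step 3
        lam += 2.0 * mu * h(x)                   # step 4: multiplier update
        if max(abs(a - b) for a, b in zip(x, x_old)) < theta and abs(h(x)) < eps:
            break
    return x, lam


x_star, lam_star = augmented_lagrangian()
print(x_star, lam_star)    # expected to approach x = (0.5, 0.5), lambda = -1
```

The constraint becomes exact although $\mu$ never grows: the multiplier update alone removes the residual violation, here by roughly a factor 3 per outer iteration.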

This adaptation of λi is really elegant:


– We do not have to take the penalty limit µ → ∞ but still can have exact
constraints
– If f and h were linear (∇f and ∇hi constant), the updated λi is exactly right:
In the next iteration we would exactly hit the constraint (by construction)
– The penalty term is like a measuring device for the necessary “virtual gradient”, which is generated by the augmentation term in the next iteration
– The λi are very meaningful: they give the force/gradient that a constraint
exerts on the solution
3:17
Augmented Lagrangian (inequality constraint)
• Instead of

$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0$$

we address

$$\min_x f(x) + \mu \sum_{i=1}^m [g_i(x) \ge 0 \vee \lambda_i > 0]\; g_i(x)^2 + \sum_{i=1}^m \lambda_i g_i(x)$$

• A constraint is either active or inactive:
– When active ($g_i(x) \ge 0 \vee \lambda_i > 0$) we aim for equality $g_i(x) = 0$
– When inactive ($g_i(x) < 0 \wedge \lambda_i = 0$) we don’t penalize/augment
– $\lambda_i$ are zero or positive, but never negative

Input: initial $x \in \mathbb{R}^n$, functions $f(x)$, $g(x)$, $\nabla f(x)$, $\nabla g(x)$, tol. $\theta$, $\epsilon$, parameters (defaults: $\varrho_\mu^+ = 1$, $\mu_0 = 1$)
Output: $x$
1: initialize $\mu = \mu_0$, $\lambda_i = 0$
2: repeat
3:   find $x \leftarrow \operatorname{argmin}_x f(x) + \mu \sum_i [g_i(x) \ge 0 \vee \lambda_i > 0]\; g_i(x)^2 + \sum_i \lambda_i g_i(x)$
4:   $\forall i: \lambda_i \leftarrow \max(\lambda_i + 2\mu g_i(x'), 0)$
5:   optionally, $\mu \leftarrow \varrho_\mu^+ \mu$
6: until $|\Delta x| < \theta$ and $g_i(x) < \epsilon$

3:18

• See also:
M. Toussaint: A Novel Augmented Lagrangian Approach for Inequalities and Convergent
Any-Time Non-Central Updates. e-Print arXiv:1412.4329, 2014.
3:19

The Lagrangian
3:20

The Lagrangian
• Given a constrained problem

$$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0$$

we define the Lagrangian as

$$L(x, \lambda) = f(x) + \sum_{i=1}^m \lambda_i g_i(x)$$

• The $\lambda_i \ge 0$ are called dual variables or Lagrange multipliers


3:21

What’s the point of this definition?

• The Lagrangian is useful to compute optima analytically, on paper – that’s why physicists learn it early on

• The Lagrangian implies the KKT conditions of optimality

• Optima are necessarily at saddle points of the Lagrangian

• The Lagrangian implies a dual problem, which is sometimes easier to solve than
the primal
3:22
Example: Some calculus using the Lagrangian
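• For $x \in \mathbb{R}^2$, what is

$$\min_x x^2 \quad \text{s.t.} \quad x_1 + x_2 = 1$$

• Solution:

$$L(x, \lambda) = x^2 + \lambda(x_1 + x_2 - 1)$$
$$0 = \nabla_x L(x, \lambda) = 2x + \lambda \begin{pmatrix} 1 \\ 1 \end{pmatrix} \quad \Rightarrow \quad x_1 = x_2 = -\lambda/2$$
$$0 = \nabla_\lambda L(x, \lambda) = x_1 + x_2 - 1 = -\lambda/2 - \lambda/2 - 1 \quad \Rightarrow \quad \lambda = -1$$
$$\Rightarrow \quad x_1 = x_2 = 1/2$$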

3:23

The “force” & KKT view on the Lagrangian

• At the optimum there must be a balance between the cost gradient −∇f (x) and
the gradient of the active constraints −∇gi (x)

3:24

The “force” & KKT view on the Lagrangian


• At the optimum there must be a balance between the cost gradient −∇f (x) and
the gradient of the active constraints −∇gi (x)
• Formally: for optimal x: ∇f (x) ∈ span{∇gi (x)}
• Or: for optimal x there must exist λ_i such that $-\nabla f(x) = -\big[\sum_i (-\lambda_i \nabla g_i(x))\big]$

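• For optimal x it must hold (necessary condition): ∃λ s.t.
$$\begin{aligned}
\nabla f(x) + \sum_{i=1}^m \lambda_i \nabla g_i(x) &= 0 && \text{(“stationarity”)}\\
\forall i:\ g_i(x) &\le 0 && \text{(primal feasibility)}\\
\forall i:\ \lambda_i &\ge 0 && \text{(dual feasibility)}\\
\forall i:\ \lambda_i g_i(x) &= 0 && \text{(complementarity)}
\end{aligned}$$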

The last condition says that λi > 0 only for active constraints.
These are the Karush-Kuhn-Tucker conditions (KKT, neglecting equality con-
straints)
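
As a small illustration, the four conditions can be checked numerically for a candidate pair (x, λ). The sketch below is only an example under assumptions: the helper name `kkt_residuals` and the toy problem (min x1² + x2² with the constraints 1 − x1 − x2 ≤ 0 and −x1 − 5 ≤ 0) are made up for this purpose. Each returned value is a violation measure that is zero when the corresponding condition holds.

```python
def kkt_residuals(grad_f, g, grad_g, x, lam):
    """Return the violation of each KKT condition at (x, lam)."""
    df, gx, G = grad_f(x), g(x), grad_g(x)
    stationarity = [df[j] + sum(lam[i] * G[i][j] for i in range(len(lam)))
                    for j in range(len(x))]
    return {
        "stationarity": max(abs(s) for s in stationarity),
        "primal feasibility": max(max(gi, 0.0) for gi in gx),
        "dual feasibility": max(max(-li, 0.0) for li in lam),
        "complementarity": max(abs(li * gi) for li, gi in zip(lam, gx)),
    }


# Toy problem (assumption): min x1^2 + x2^2
#   s.t. g1(x) = 1 - x1 - x2 <= 0 (active at the optimum)
#        g2(x) = -x1 - 5     <= 0 (inactive at the optimum)
grad_f = lambda x: [2.0 * x[0], 2.0 * x[1]]
g = lambda x: [1.0 - x[0] - x[1], -x[0] - 5.0]
grad_g = lambda x: [[-1.0, -1.0], [-1.0, 0.0]]

at_optimum = kkt_residuals(grad_f, g, grad_g, [0.5, 0.5], [1.0, 0.0])
elsewhere = kkt_residuals(grad_f, g, grad_g, [1.0, 1.0], [1.0, 0.0])
```

At the optimum the inactive constraint g2 carries λ2 = 0, which is exactly what the complementarity condition expresses.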
3:25

The “force” & KKT view on the Lagrangian


• The first condition (“stationarity”), ∃λ s.t.
$$\nabla f(x) + \sum_{i=1}^m \lambda_i \nabla g_i(x) = 0$$
can be equivalently expressed as, ∃λ s.t.
$$\nabla_x L(x,\lambda) = 0$$
• In that sense, the Lagrangian can be viewed as the “energy function” that gener-
ates (for good choice of λ) the right balance between cost and constraint gradients

• This is exactly as in the augmented Lagrangian approach, where however we have an addi-
tional (“augmented”) squared penalty that is used to tune the λi
3:26

Saddle point view on the Lagrangian


• Let’s briefly consider the equality case again:

$$\min_x f(x) \quad\text{s.t.}\quad h(x) = 0$$
with the Lagrangian
$$L(x,\lambda) = f(x) + \sum_{i=1}^m \lambda_i h_i(x)$$

• Note:

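$$\min_x L(x,\lambda) \;\Rightarrow\; 0 = \nabla_x L(x,\lambda) \quad\leftrightarrow\quad \text{stationarity}$$
$$\max_\lambda L(x,\lambda) \;\Rightarrow\; 0 = \nabla_\lambda L(x,\lambda) = h(x) \quad\leftrightarrow\quad \text{constraint}$$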

• Optima (x∗ , λ∗ ) are saddle points where


∇x L = 0 ensures stationarity and
∇λ L = 0 ensures the primal feasibility
3:27

Saddle point view on the Lagrangian


• In the inequality case:
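$$\max_{\lambda\ge 0} L(x,\lambda) = \begin{cases} f(x) & \text{if } g(x)\le 0\\ \infty & \text{otherwise}\end{cases}$$
$$\lambda = \operatorname*{argmax}_{\lambda\ge 0} L(x,\lambda) \;\Rightarrow\; \begin{cases}\lambda_i = 0 & \text{if } g_i(x) < 0\\ 0 = \nabla_{\lambda_i} L(x,\lambda) = g_i(x) & \text{otherwise}\end{cases}$$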

This implies either (λi = 0 ∧ gi (x) < 0) or gi (x) = 0, which is exactly equivalent
to the complementarity and primal feasibility conditions
• Again, optima (x∗ , λ∗ ) are saddle points where
minx L enforces stationarity and
maxλ≥0 L enforces complementarity and primal feasibility

Together, minx L and maxλ≥0 L enforce the KKT conditions!


3:28

The Lagrange dual problem


• Finding the saddle point can be written in two ways:

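$$\min_x \max_{\lambda\ge 0} L(x,\lambda) \qquad \text{primal problem}$$
$$\max_{\lambda\ge 0} \min_x L(x,\lambda) \qquad \text{dual problem}$$

• Let’s define the Lagrange dual function as
$$l(\lambda) = \min_x L(x,\lambda)$$
Then we have
$$\min_x f(x) \quad\text{s.t.}\quad g(x)\le 0 \qquad \text{primal problem}$$
$$\max_\lambda l(\lambda) \quad\text{s.t.}\quad \lambda\ge 0 \qquad \text{dual problem}$$
The dual problem is convex (objective=concave, constraints=convex), even if the primal is non-convex!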
3:29
The Lagrange dual problem
• The dual function is always a lower bound (for any λ_i ≥ 0)
$$l(\lambda) = \min_x L(x,\lambda) \;\le\; \Big[\min_x f(x) \ \text{s.t.}\ g(x)\le 0\Big]$$
And consequently
$$\max_{\lambda\ge 0}\min_x L(x,\lambda) \;\le\; \min_x\max_{\lambda\ge 0} L(x,\lambda) = \min_{x:\,g(x)\le 0} f(x)$$

• We say strong duality holds iff
$$\max_{\lambda\ge 0}\min_x L(x,\lambda) = \min_x\max_{\lambda\ge 0} L(x,\lambda)$$

• If the primal is convex, and there exists an interior point
$$\exists x:\ \forall i:\ g_i(x) < 0$$
(which is called the Slater condition), then we have strong duality
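
A one-dimensional example makes the lower bound and strong duality concrete. The sketch below assumes the toy problem min x² s.t. 1 − x ≤ 0 (primal optimum f = 1 at x = 1), for which the inner minimization of the Lagrangian has the closed form x = λ/2; all names are illustrative.

```python
def lagrangian(x, lam):
    """L(x, lam) = f(x) + lam * g(x) with f(x) = x^2 and g(x) = 1 - x."""
    return x * x + lam * (1.0 - x)


def dual_function(lam):
    """l(lam) = min_x L(x, lam); the minimizer follows from 0 = 2x - lam."""
    x_min = lam / 2.0
    return lagrangian(x_min, lam)


primal_optimum = 1.0  # f(1) = 1, since the constraint forces x >= 1

# Evaluate the dual function on a grid of lam >= 0.
lams = [k / 100.0 for k in range(401)]
dual_values = [dual_function(lam) for lam in lams]
best_index = max(range(len(lams)), key=lambda k: dual_values[k])
best_lam, best_dual = lams[best_index], dual_values[best_index]
```

Every dual value stays below the primal optimum, and the maximum over the grid is attained at λ = 2, where the bound is tight. The problem is convex and x = 2 is a strictly feasible point, so the Slater condition applies.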


3:30

And what about algorithms?


• So far we’ve only introduced a whole lot of formalism, and seen that the La-
grangian sort of represents the constrained problem

• What are the algorithms we can get out of this?


3:31

Log barrier method revisited


3:32

Log barrier method revisited


• Log barrier method: Instead of

$$\min_x f(x)\quad\text{s.t.}\quad g(x)\le 0$$
we address
$$\min_x f(x) - \mu\sum_i \log(-g_i(x))$$

• For given µ the optimality condition is
$$\nabla f(x) - \sum_i \frac{\mu}{g_i(x)}\nabla g_i(x) = 0$$
or equivalently
$$\nabla f(x) + \sum_i \lambda_i\nabla g_i(x) = 0\ ,\quad \lambda_i g_i(x) = -\mu$$
These are called modified (=approximate) KKT conditions.


3:33

Log barrier method revisited

Centering (the unconstrained minimization) in the log barrier method is equivalent to solving the modified KKT conditions.

Note also: On the central path, the duality gap is mµ:
$$l(\lambda^*(\mu)) = f(x^*(\mu)) + \sum_i \lambda_i g_i(x^*(\mu)) = f(x^*(\mu)) - m\mu$$
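
The duality gap on the central path can be verified on a toy problem. The sketch assumes min x² s.t. 1 − x ≤ 0 with a single constraint (m = 1); the central point and the dual function are written in closed form for this specific problem, so the code illustrates the statement rather than a general solver.

```python
import math


def central_point(mu):
    """Minimizer of x^2 - mu*log(x - 1) and its multiplier.

    Setting the derivative 2x - mu/(x - 1) to zero gives 2x^2 - 2x - mu = 0.
    """
    x = (1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0
    lam = -mu / (1.0 - x)  # from lam * g(x) = -mu with g(x) = 1 - x
    return x, lam


def dual_function(lam):
    """l(lam) = min_x x^2 + lam*(1 - x) = lam - lam^2/4."""
    return lam - lam * lam / 4.0


def duality_gap(mu):
    x, lam = central_point(mu)
    return x * x - dual_function(lam)


gaps = {mu: duality_gap(mu) for mu in (1.0, 0.1, 0.01)}
```

For each µ the computed gap equals mµ = µ up to rounding, so shrinking µ drives the barrier solution towards the constrained optimum.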
3:34

Primal-Dual interior-point Newton Method


3:35
Primal-Dual interior-point Newton Method
• A core outcome of the Lagrangian theory was the shift in problem formulation:
find x to min_x f(x) s.t. g(x) ≤ 0

→ find x to solve the KKT conditions

Optimization problem ⟶ Solve KKT conditions

• We think of the KKT conditions as an equation system r(x, λ) = 0, and can use the Newton method for solving it:
$$\nabla r \begin{pmatrix}\Delta x\\ \Delta\lambda\end{pmatrix} = -r$$

This leads to primal-dual algorithms that adapt x and λ concurrently. Roughly, this
uses the curvature ∇2 f to estimate the right λ to push out of the constraint.
3:36

Primal-Dual interior-point Newton Method


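• The first and last modified (=approximate) KKT conditions
$$\begin{aligned}
\nabla f(x) + \sum_{i=1}^m \lambda_i \nabla g_i(x) &= 0 && \text{(“force balance”)}\\
\forall i:\ g_i(x) &\le 0 && \text{(primal feasibility)}\\
\forall i:\ \lambda_i &\ge 0 && \text{(dual feasibility)}\\
\forall i:\ \lambda_i g_i(x) &= -\mu && \text{(complementarity)}
\end{aligned}$$
can be written as the n + m-dimensional equation system
$$r(x,\lambda) = 0\ ,\quad r(x,\lambda) := \begin{pmatrix}\nabla f(x) + \nabla g(x)^\top\lambda\\ -\mathrm{diag}(\lambda)\,g(x) - \mu 1_m\end{pmatrix}$$

• Newton method to find the root r(x, λ) = 0
$$\begin{pmatrix}x\\ \lambda\end{pmatrix} \leftarrow \begin{pmatrix}x\\ \lambda\end{pmatrix} - \nabla r(x,\lambda)^{-1}\, r(x,\lambda)$$
$$\nabla r(x,\lambda) = \begin{pmatrix}\nabla^2 f(x) + \sum_i\lambda_i\nabla^2 g_i(x) & \nabla g(x)^\top\\ -\mathrm{diag}(\lambda)\nabla g(x) & -\mathrm{diag}(g(x))\end{pmatrix} \in R^{(n+m)\times(n+m)}$$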

3:37
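
To make the Newton iteration on r(x, λ) concrete, here is a sketch for the scalar toy problem min x² s.t. 1 − x ≤ 0 (an assumption chosen so that the 2×2 system can be solved by hand with Cramer's rule). The backtracking rule that halves the step until g(x) < 0 and λ > 0 is one simple way to keep the iterate strictly feasible, not necessarily the rule used in the lecture.

```python
import math


def residual(x, lam, mu):
    """r(x, lam) for f(x) = x^2, g(x) = 1 - x."""
    g = 1.0 - x
    return 2.0 * x - lam, -lam * g - mu


def jacobian(x, lam):
    """Rows of grad r: [hess f + lam*hess g, grad g], [-lam*grad g, -g]."""
    g = 1.0 - x
    return (2.0, -1.0), (lam, -g)


def primal_dual_newton(x, lam, mu, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        r1, r2 = residual(x, lam, mu)
        if max(abs(r1), abs(r2)) < tol:
            break
        (a, b), (c, d) = jacobian(x, lam)
        det = a * d - b * c
        # Solve grad r * (dx, dlam) = -r by Cramer's rule.
        dx = (-r1 * d + b * r2) / det
        dlam = (-a * r2 + c * r1) / det
        # Backtrack until the new point is strictly feasible.
        alpha = 1.0
        while 1.0 - (x + alpha * dx) >= 0.0 or lam + alpha * dlam <= 0.0:
            alpha *= 0.5
        x, lam = x + alpha * dx, lam + alpha * dlam
    return x, lam


mu = 1e-3
x_pd, lam_pd = primal_dual_newton(x=2.0, lam=1.0, mu=mu)
x_expected = (1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0  # root of 2x(x - 1) = mu
```

Both x and λ are updated in the same step, so no nested inner minimization is needed.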

Primal-Dual interior-point Newton Method


• The method requires the Hessians ∇2f (x) and ∇2gi (x)
– One can approximate the constraint Hessians ∇2gi (x) ≈ 0
– Gauss-Newton case: f (x) = φ(x)>φ(x) only requires ∇φ(x)

• This primal-dual method does a joint update of both


– the solution x
– the Lagrange multipliers (constraint forces) λ
No need for nested iterations, as with penalty/barrier methods!

• The above formulation allows for a duality gap µ; choose µ = 0 or consult Boyd & Vandenberghe on how to update it on the fly (sec. 11.7.3)

• The feasibility constraints g_i(x) ≤ 0 and λ_i ≥ 0 need to be handled explicitly by the root finder (the line search needs to ensure these constraints)
3:38

Phase I: Finding a feasible initialization


3:39
Phase I: Finding a feasible initialization
• An elegant method for finding a feasible point x:
$$\min_{(x,s)\in R^{n+1}} s \quad\text{s.t.}\quad \forall i:\ g_i(x)\le s,\ s\ge 0$$
or
$$\min_{(x,s)\in R^{n+m}} \sum_{i=1}^m s_i \quad\text{s.t.}\quad \forall i:\ g_i(x)\le s_i,\ s_i\ge 0$$

3:40

Trust Region
• Instead of adapting the stepsize along a fixed direction, an alternative is to adapt
the trust region
• Roughly, while f(x + δ) > f(x) + ϱ_ls ∇f(x)^⊤δ:
– Reduce trust region radius β
– try δ = argmin_{δ:|δ|<β} f(x + δ) using a local quadratic model of f(x + δ)

• The constrained optimization min_{δ:|δ|<β} f(x + δ) can be translated into an unconstrained min_δ f(x + δ) + λδ^2 for suitable λ. The λ is equivalent to a regularization of the Hessian; see damped Newton.
• We’ll not go into more details of trust region methods; see Nocedal Section 4.
3:41

General approaches
• Penalty & Barriers
– Associate a (adaptive) penalty cost with violation of the constraint
– Associate an additional “force compensating the gradient into the constraint” (augmented
Lagrangian)
– Associate a log barrier with a constraint, becoming ∞ for violation (interior point method)

• Gradient projection methods (mostly for linear constraints)

– For ‘active’ constraints, project the step direction to become tangential
– When checking a step, always pull it back to the feasible region

• Lagrangian & dual methods


– Rewrite the constrained problem into an unconstrained one
– Or rewrite it as a (convex) dual problem

• Simplex methods (linear constraints)


– Walk along the constraint boundaries
3:42
4 Convex Optimization
Convex, quasiconvex, unimodal, convex optimization problem, linear program (LP),
standard form, simplex algorithm, LP-relaxation of integer linear programs, quadratic
programming (QP), sequential quadratic programming

Function types
• A function is defined convex iff

f (ax + (1−a)y) ≤ a f (x) + (1−a) f (y)

for all x, y ∈ Rn and a ∈ [0, 1].

• A function is quasiconvex iff

f (ax + (1−a)y) ≤ max{f (x), f (y)}

for any x, y ∈ Rn and a ∈ [0, 1].


..alternatively, iff every sublevel set {x|f (x) ≤ α} is convex.

• [Subjective!] I call a function unimodal iff it has only 1 local minimum, which is the
global minimum
Note: in dimensions n > 1 quasiconvexity is stronger than unimodality

• A general non-linear function is unconstrained and can have multiple local minima
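
The two defining inequalities can be probed numerically. The sketch below tests them on a finite grid of points and weights, which can only refute a property, never prove it; the example functions x² and sqrt(|x|) and all helper names are assumptions chosen for illustration.

```python
import math

TOL = 1e-12  # slack for floating-point rounding


def violates_convexity(f, x, y, a):
    return f(a * x + (1 - a) * y) > a * f(x) + (1 - a) * f(y) + TOL


def violates_quasiconvexity(f, x, y, a):
    return f(a * x + (1 - a) * y) > max(f(x), f(y)) + TOL


points = [k / 4.0 for k in range(-8, 9)]    # grid on [-2, 2]
weights = [k / 10.0 for k in range(11)]     # a in [0, 1]


def holds_on_grid(violates, f):
    """True if no grid triple (x, y, a) violates the inequality."""
    return not any(violates(f, x, y, a)
                   for x in points for y in points for a in weights)


square = lambda x: x * x
root = lambda x: math.sqrt(abs(x))

square_convex = holds_on_grid(violates_convexity, square)
root_convex = holds_on_grid(violates_convexity, root)
root_quasiconvex = holds_on_grid(violates_quasiconvexity, root)
```

The function sqrt(|x|) passes the quasiconvexity test but fails the convexity test (for instance at x = 0, y = 1, a = 0.5), which illustrates that the inclusion convex ⊂ quasiconvex is strict.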
4:1

convex ⊂ quasiconvex ⊂ unimodal ⊂ general


4:2

Local optimization
• So far I avoided making explicit assumptions about problem convexity: To empha-
size that all methods we considered – except for Newton – are applicable also on
non-convex problems.

• The methods we considered are local optimization methods, which can be defined
as
– a method that adapts the solution locally
– a method that is guaranteed to converge to a local minimum only

• Local methods are efficient


– if the problem is (strictly) unimodal (strictly: no plateaux)
– if time is critical and a local optimum is a sufficiently good solution
– if the algorithm is restarted very often to hit multiple local optima
4:3

Convex problems
• Convexity is a strong assumption

• But solving convex problems is an important case


– theoretically (convergence proofs!)
– many real world applications are actually convex
– convexity around a local optimum → efficient local optimization

• Roughly:
“global optimization = finding local optima + multiple convex problems”
4:4
Convex problems
• A constrained optimization problem

$$\min_x f(x) \quad\text{s.t.}\quad g(x)\le 0,\ h(x) = 0$$
is called convex iff

– f is convex
– each g_i, i = 1, .., m is convex
– h is linear: h(x) = Ax − b, A ∈ R^{l×n}, b ∈ R^l

• Alternative definition:
f convex and feasible region is a convex set
4:5

Linear and Quadratic Programs


• Linear Program (LP)
$$\min_x c^\top x \quad\text{s.t.}\quad Gx\le h,\ Ax = b$$
LP in standard form
$$\min_x c^\top x \quad\text{s.t.}\quad x\ge 0,\ Ax = b$$

• Quadratic Program (QP)
$$\min_x \tfrac12 x^\top Q x + c^\top x \quad\text{s.t.}\quad Gx\le h,\ Ax=b$$
where Q is positive definite.

(One also defines Quadratically Constrained Quadratic Programs (QCQP))


4:6

Transforming an LP problem into standard form


• LP problem:
$$\min_x c^\top x \quad\text{s.t.}\quad Gx\le h,\ Ax=b$$

• Define slack variables:
$$\min_{x,\xi} c^\top x \quad\text{s.t.}\quad Gx+\xi = h,\ Ax = b,\ \xi\ge 0$$

• Express x = x^+ − x^- with x^+, x^- ≥ 0:
$$\min_{x^+,x^-,\xi} c^\top(x^+-x^-) \quad\text{s.t.}\quad G(x^+-x^-)+\xi = h,\ A(x^+-x^-) = b,\ \xi\ge 0,\ x^+\ge 0,\ x^-\ge 0$$
where (x^+, x^-, ξ) ∈ R^{2n+m}

• Now this conforms to the standard form (replacing (x^+, x^-, ξ) ≡ x, etc)
$$\min_x c^\top x \quad\text{s.t.}\quad x\ge 0,\ Ax = b$$
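
The transformation is mechanical, so it is easy to write down as code. The sketch below uses plain Python lists and the variable ordering (x^+, x^-, ξ); the function names `to_standard_form` and `lift` as well as the small example data are assumptions for illustration.

```python
def to_standard_form(c, G, h, A, b):
    """Rewrite min c'x s.t. Gx <= h, Ax = b as min c_std'z s.t. A_std z = b_std, z >= 0.

    The new variable is z = (x_plus, x_minus, xi) with x = x_plus - x_minus.
    """
    m = len(G)
    c_std = list(c) + [-ci for ci in c] + [0.0] * m
    A_std = []
    for k, row in enumerate(G):
        slack = [1.0 if j == k else 0.0 for j in range(m)]
        A_std.append(list(row) + [-v for v in row] + slack)   # G x+ - G x- + xi = h
    for row in A:
        A_std.append(list(row) + [-v for v in row] + [0.0] * m)  # A x+ - A x- = b
    b_std = list(h) + list(b)
    return c_std, A_std, b_std


def lift(x, G, h):
    """Map a feasible x of the original LP to the standard-form variable z."""
    x_plus = [max(v, 0.0) for v in x]
    x_minus = [max(-v, 0.0) for v in x]
    xi = [hk - sum(gv * v for gv, v in zip(row, x)) for row, hk in zip(G, h)]
    return x_plus + x_minus + xi


# Example data (assumption): one inequality and one equality constraint in R^2.
c, G, h, A, b = [1.0, 2.0], [[1.0, 1.0]], [4.0], [[1.0, -1.0]], [1.0]
c_std, A_std, b_std = to_standard_form(c, G, h, A, b)
z = lift([2.0, 1.0], G, h)  # x = (2, 1) is feasible: Gx = 3 <= 4, Ax = 1
A_std_z = [sum(a * v for a, v in zip(row, z)) for row in A_std]
cost_std = sum(ci * v for ci, v in zip(c_std, z))
```

A feasible point of the original problem maps to a nonnegative z that satisfies the equality constraints with the same cost.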

4:7

Example LPs

Browse through the exercises 4.8-4.20 of Boyd & Vandenberghe!


4:8

Linear Programming
– Algorithms
– Application: LP relaxation of discrete problems
4:9
Algorithms for Linear Programming

• All of which we know!


– augmented Lagrangian (LANCELOT software), penalty
– log barrier (“interior point method”, “[central] path following”)
– primal-dual Newton

• The simplex algorithm, walking on the constraints

(The emphasis in the notion of interior point methods is to distinguish from con-
straint walking methods.)

• Interior point and simplex methods are comparably efficient


Which is better depends on the problem
4:10

Simplex Algorithm
George Dantzig (1947)
Note: Not to be confused with the Nelder–Mead method (downhill simplex method)

• We consider an LP in standard form
$$\min_x c^\top x \quad\text{s.t.}\quad x\ge 0,\ Ax = b$$

• Note that in a linear program the optimum, if it is finite, is always attained at a corner (vertex) of the feasible polytope

4:11

Simplex Algorithm

• The Simplex Algorithm walks along the edges of the polytope, at every corner
choosing the edge that decreases c>x most
• This either terminates at a corner, or leads to an unconstrained edge (−∞ opti-
mum)

• In practice this procedure is done by “pivoting on the simplex tableau”


4:12

Simplex Algorithm
• The simplex algorithm is often efficient, but in worst case exponential in n and m.

• Interior point methods (log barrier) and, more recently again, augmented La-
grangian methods have become somewhat more popular than the simplex algo-
rithm
4:13

LP-relaxations of discrete problems


4:14

Integer linear programming (ILP)


• An integer linear program (for simplicity binary) is
$$\min_x c^\top x \quad\text{s.t.}\quad Ax = b,\ x_i\in\{0,1\}$$

• Examples:
– Travelling Salesman: $\min_{x_{ij}} \sum_{ij} c_{ij} x_{ij}$ with $x_{ij}\in\{0,1\}$ and constraints $\forall j: \sum_i x_{ij} = 1$ (columns sum to 1), $\forall j: \sum_i x_{ji} = 1$, $\forall ij: t_j - t_i \le n - 1 + n x_{ij}$ (where t_i are additional integer variables).
– MaxSAT problem: In conjunctive normal form, each clause contributes an ad-
ditional variable and a term in the objective function; each clause contributes
a constraint
– Search the web for The Power of Semidefinite Programming Relaxations for
MAXSAT
4:15

LP relaxations of integer linear programs


• Instead of solving
$$\min_x c^\top x \quad\text{s.t.}\quad Ax = b,\ x_i\in\{0,1\}$$
we solve
$$\min_x c^\top x \quad\text{s.t.}\quad Ax = b,\ x_i\in[0,1]$$

• Clearly, the relaxed solution will be a lower bound on the integer solution (some-
times also called “outer bound” because [0, 1] ⊃ {0, 1})

• Computing the relaxed solution is interesting


– as an “approximation” or initialization to the integer problem
– to be aware of a lower bound
– in cases where the optimal relaxed solution happens to be integer
4:16

Example: MAP inference in MRFs


• Given integer random variables x_i, i = 1, .., n, a pairwise Markov Random Field (MRF) is defined as
$$f(x) = \sum_{(ij)\in E} f_{ij}(x_i,x_j) + \sum_i f_i(x_i)$$
where E denotes the set of edges. Problem: find max_x f(x).

(Note: any general (non-pairwise) MRF can be converted into a pair-wise one, blowing up the number of variables)

• Reformulate with indicator variables
$$b_i(x) = [x_i = x]\ ,\quad b_{ij}(x,y) = [x_i = x]\,[x_j = y]$$
These are nm + |E|m^2 binary variables

• The indicator variables need to fulfil the constraints
$$b_i(x),\ b_{ij}(x,y) \in \{0,1\}$$
$$\sum_x b_i(x) = 1 \qquad\text{because } x_i \text{ takes exactly one value}$$
$$\sum_y b_{ij}(x,y) = b_i(x) \qquad\text{consistency between indicators}$$

4:17
Example: MAP inference in MRFs
• Finding max_x f(x) of a MRF is then equivalent to
$$\max_{b_i(x),\,b_{ij}(x,y)}\ \sum_{(ij)\in E}\sum_{x,y} b_{ij}(x,y)\, f_{ij}(x,y) + \sum_i\sum_x b_i(x)\, f_i(x)$$
such that
$$b_i(x),\ b_{ij}(x,y)\in\{0,1\}\ ,\quad \sum_x b_i(x) = 1\ ,\quad \sum_y b_{ij}(x,y) = b_i(x)$$

• The LP-relaxation replaces the constraint to be
$$b_i(x),\ b_{ij}(x,y)\in[0,1]\ ,\quad \sum_x b_i(x) = 1\ ,\quad \sum_y b_{ij}(x,y) = b_i(x)$$
This set of feasible b’s is called marginal polytope (because it describes the space of “probability distributions” that are marginally consistent (but not necessarily globally normalized!))
4:18

Example: MAP inference in MRFs


• Solving the original MAP problem is NP-hard
Solving the LP-relaxation is really efficient

• If the solution of the LP-relaxation turns out to be integer, we’ve solved the origi-
nally NP-hard problem!
If not, the relaxed problem can be discretized to be a good initialization for discrete
optimization

• For binary attractive MRFs (a common case) the solution will always be integer
4:19

Quadratic Programming
4:20

Quadratic Programming

$$\min_x \tfrac12 x^\top Q x + c^\top x \quad\text{s.t.}\quad Gx\le h,\ Ax=b$$

• Efficient Algorithms:
– Interior point (log barrier)
– Augmented Lagrangian
– Penalty

• Highly relevant applications:


– Support Vector Machines
– Similar types of max-margin modelling methods
4:21

Example: Support Vector Machine


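• Primal:
$$\max_{\beta,\,\|\beta\|=1} M \quad\text{s.t.}\quad \forall i:\ y_i(\phi(x_i)^\top\beta)\ge M$$

• Dual:
$$\min_\beta \|\beta\|^2 \quad\text{s.t.}\quad \forall i:\ y_i(\phi(x_i)^\top\beta)\ge 1$$

[Figure with axes x and y]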
4:22

Sequential Quadratic Programming


• We considered general non-linear problems

$$\min_x f(x) \quad\text{s.t.}\quad g(x)\le 0$$
where we can evaluate f(x), ∇f(x), ∇²f(x) and g(x), ∇g(x), ∇²g(x) for any x ∈ R^n
→ Newton method

• In the unconstrained case, the standard step direction δ is (∇²f(x) + λI) δ = −∇f(x)

• In the constrained case, a natural step direction δ can be found by solving the local QP-approximation to the problem
$$\min_\delta\ f(x) + \nabla f(x)^\top\delta + \tfrac12\,\delta^\top\nabla^2 f(x)\,\delta \quad\text{s.t.}\quad g(x) + \nabla g(x)^\top\delta \le 0$$
This is an optimization problem over δ and only requires the evaluation of f(x), ∇f(x), ∇²f(x), g(x), ∇g(x) once.
4:23
5 Global & Bayesian Optimization
Multi-armed bandits, exploration vs. exploitation, navigation through belief space, up-
per confidence bound (UCB), global optimization = infinite bandits, Gaussian Pro-
cesses, probability of improvement, expected improvement, UCB

Global Optimization
• Is there an optimal way to optimize (in the Blackbox case)?
• Is there a way to find the global optimum instead of only local?
5:1

Outline
• Play a game

• Multi-armed bandits
– Belief state & belief planning
– Upper Confidence Bound (UCB)

• Optimization as infinite bandits


– GPs as belief state

• Standard heuristics:
– Upper Confidence Bound (GP-UCB)
– Maximal Probability of Improvement (MPI)
– Expected Improvement (EI)
5:2

Bandits
5:3

Bandits

• There are n machines.


• Each machine i returns a reward y ∼ P (y; θi )
The machine’s parameter θi is unknown
5:4

Bandits
• Let a_t ∈ {1, .., n} be the choice of machine at time t
Let y_t ∈ R be the outcome with mean ⟨y_{a_t}⟩

• A policy or strategy maps all the history to a new choice:
$$\pi:\ [(a_1,y_1),(a_2,y_2),...,(a_{t-1},y_{t-1})] \mapsto a_t$$

• Problem: Find a policy π that
$$\max \Big\langle \sum_{t=1}^T y_t \Big\rangle$$
or
$$\max \langle y_T\rangle$$
or other objectives like discounted infinite horizon $\max \big\langle\sum_{t=1}^\infty \gamma^t y_t\big\rangle$
5:5

Exploration, Exploitation
• “Two effects” of choosing a machine:
– You collect more data about the machine → knowledge
– You collect reward

• For example
– Exploration: Choose the next action a_t to min ⟨H(b_t)⟩
– Exploitation: Choose the next action a_t to max ⟨y_t⟩
5:6

The Belief State


• “Knowledge” can be represented in two ways:
– as the full history

$$h_t = [(a_1,y_1),(a_2,y_2),...,(a_{t-1},y_{t-1})]$$
– as the belief
$$b_t(\theta) = P(\theta\,|\,h_t)$$
where θ are the unknown parameters θ = (θ_1, .., θ_n) of all machines

• In the bandit case:
– The belief factorizes $b_t(\theta) = P(\theta|h_t) = \prod_i b_t(\theta_i|h_t)$
e.g. for Gaussian bandits with constant noise, θ_i = µ_i
$$b_t(\mu_i|h_t) = N(\mu_i\,|\,\hat y_i,\hat s_i)$$
e.g. for binary bandits, θ_i = p_i, with prior Beta(p_i|α, β):
$$b_t(p_i|h_t) = \text{Beta}(p_i\,|\,\alpha + a_{i,t},\ \beta+b_{i,t})$$
$$a_{i,t} = \sum_{s=1}^{t-1}[a_s = i][y_s=0]\ ,\quad b_{i,t} = \sum_{s=1}^{t-1}[a_s=i][y_s=1]$$

5:7

The Belief MDP


• The process can be modelled as
[Figure: graphical model over a_1, y_1, a_2, y_2, a_3, y_3 and θ]
or as Belief MDP
[Figure: graphical model over a_1, y_1, a_2, y_2, a_3, y_3 and the beliefs b_0, b_1, b_2, b_3]
$$P(b'|y,a,b) = \begin{cases}1 & \text{if } b' = b'_{[b,a,y]}\\ 0 & \text{otherwise}\end{cases}\ ,\quad P(y|a,b) = \int_{\theta_a} b(\theta_a)\,P(y|\theta_a)$$

• The Belief MDP describes a different process: the interaction between the information available to the agent (b_t or h_t) and its actions, where the agent uses its current belief to anticipate outcomes, P(y|a, b).
• The belief (or history h_t) is all the information the agent has available; P(y|a, b) is the “best” possible anticipation of observations. If it acts optimally in the Belief MDP, it acts optimally in the original problem.
Optimality in the Belief MDP ⇒ optimality in the original problem
5:8
Optimal policies via Belief Planning
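• The Belief MDP:
[Figure: graphical model over a_1, y_1, a_2, y_2, a_3, y_3 and the beliefs b_0, b_1, b_2, b_3]
$$P(b'|y,a,b) = \begin{cases}1 & \text{if } b' = b'_{[b,a,y]}\\ 0 & \text{otherwise}\end{cases}\ ,\quad P(y|a,b) = \int_{\theta_a} b(\theta_a)\,P(y|\theta_a)$$

• Belief Planning: Dynamic Programming on the value function
$$\begin{aligned}
\forall b:\ V_{t-1}(b) &= \max_\pi \Big\langle \sum_{t=t}^T y_t \Big\rangle\\
&= \max_\pi \Big[\langle y_t\rangle + \Big\langle \sum_{t=t+1}^T y_t \Big\rangle\Big]\\
&= \max_{a_t} \int_{y_t} P(y_t|a_t,b)\,\Big[y_t + V_t(b'_{[b,a_t,y_t]})\Big]
\end{aligned}$$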

5:9

Optimal policies
• The value function assigns a value (maximal achievable return) to a state of knowl-
edge
• The optimal policy is greedy w.r.t. the value function (in the sense of the maxat
above)
• Computationally heavy: bt is a probability distribution, Vt a function over probability
distributions

• The term $\int_{y_t} P(y_t|a_t,b_{t-1})\,\big[y_t + V_t(b_{t-1}[a_t,y_t])\big]$ is related to the Gittins Index: it can be computed for each bandit separately.
5:10

Example exercise
• Consider 3 binary bandits for T = 10.
– The belief is 3 Beta distributions Beta(pi |α + ai , β + bi ) → 6 integers
– T = 10 → each integer ≤ 10
– V_t(b_t) is a function over {0, .., 10}^6

• Given a prior α = β = 1,
a) compute the optimal value function and policy for the final reward and the aver-
age reward problems,
b) compare with the UCB policy.
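
The dynamic programming recursion for binary bandits fits in a few lines, because the belief is fully described by success and failure counts. The sketch below is a starting point under stated assumptions rather than a solution of the exercise: rewards are in {0, 1}, the objective is the sum of rewards, the prior is Beta(1, 1), and the belief of machine i is stored as a pair (successes, failures), which is a naming choice of this example.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def value(counts, steps_left):
    """Optimal expected sum of rewards from belief `counts`.

    counts is a tuple of (successes, failures) pairs, one per machine.
    """
    if steps_left == 0:
        return Fraction(0)
    best = Fraction(-1)
    for i, (s, f) in enumerate(counts):
        # Predictive probability of reward 1 under Beta(1 + s, 1 + f).
        p = Fraction(1 + s, 2 + s + f)
        after_win = counts[:i] + ((s + 1, f),) + counts[i + 1:]
        after_loss = counts[:i] + ((s, f + 1),) + counts[i + 1:]
        q = (p * (1 + value(after_win, steps_left - 1))
             + (1 - p) * value(after_loss, steps_left - 1))
        best = max(best, q)
    return best


two_bandits = ((0, 0), (0, 0))
v1 = value(two_bandits, 1)   # one play left: expected reward 1/2
v2 = value(two_bandits, 2)   # two plays left: information from the first play helps
```

With two plays the value is 13/12 rather than 2 · 1/2, because the outcome of the first play changes which machine looks better for the second. The same function accepts three machines and ten steps, since the number of reachable count tuples stays small.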
5:11

Greedy heuristic: Upper Confidence Bound (UCB)

1: Initialization: Play each machine once
2: repeat
3:    Play the machine i that maximizes ŷ_i + β sqrt(2 ln n / n_i)
4: until

ŷ_i is the average reward of machine i so far
n_i is how often machine i has been played so far
n = Σ_i n_i is the number of rounds so far
β is often chosen as β = 1

See Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi & Fischer, Machine learn-
ing, 2002.
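
A direct transcription of the UCB rule is short. In the sketch below the machines are abstracted as a callback `pull(i)` that returns a reward, and the loop stops after a fixed number of rounds; both are assumptions of this example, as are the Bernoulli machines in the usage lines.

```python
import math
import random


def ucb(pull, n_arms, rounds, beta=1.0):
    """Play `rounds` rounds with the UCB rule; return play counts and mean rewards."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    # Initialization: play each machine once.
    for i in range(n_arms):
        sums[i] += pull(i)
        counts[i] += 1
    for _ in range(rounds - n_arms):
        n = sum(counts)  # number of rounds so far
        scores = [sums[i] / counts[i]
                  + beta * math.sqrt(2.0 * math.log(n) / counts[i])
                  for i in range(n_arms)]
        i = max(range(n_arms), key=lambda k: scores[k])
        sums[i] += pull(i)
        counts[i] += 1
    means = [sums[i] / counts[i] for i in range(n_arms)]
    return counts, means


# Usage with two Bernoulli machines (success probabilities are an assumption).
rng = random.Random(0)
probabilities = [0.2, 0.8]
counts, means = ucb(lambda i: 1.0 if rng.random() < probabilities[i] else 0.0,
                    n_arms=2, rounds=2000)
```

The exploration bonus shrinks as a machine is played more often, so a clearly worse machine is only revisited at a rate that grows logarithmically with the number of rounds.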
5:12
UCB algorithms
• UCB algorithms determine a confidence interval such that
ŷ_i − σ_i < ⟨y_i⟩ < ŷ_i + σ_i
with high probability.
UCB chooses the upper bound of this confidence interval

• Optimism in the face of uncertainty

• Strong bounds on the regret (sub-optimality) of UCB (e.g. Auer et al.)


5:13

Conclusions
• The bandit problem is an archetype for
– Sequential decision making
– Decisions that influence knowledge as well as rewards/states
– Exploration/exploitation

• The same aspects are inherent also in global optimization, active learning & RL

• Belief Planning in principle gives the optimal solution

• Greedy Heuristics (UCB) are computationally much more efficient and guarantee
bounded regret
5:14

Further reading
• ICML 2011 Tutorial Introduction to Bandits: Algorithms and Theory, Jean-Yves
Audibert, Rémi Munos
• Finite-time analysis of the multiarmed bandit problem, Auer, Cesa-Bianchi & Fis-
cher, Machine learning, 2002.
• On the Gittins Index for Multiarmed Bandits, Richard Weber, Annals of Applied
Probability, 1992.
Optimal Value function is submodular.
5:15

Global Optimization
5:16

Global Optimization
• Let x ∈ Rn , f : Rn → R, find
min f (x)
x

(I neglect constraints g(x) ≤ 0 and h(x) = 0 here – but could be included.)

• Blackbox optimization: find the optimum by sampling values y_t = f(x_t)

No access to ∇f or ∇²f
Observations may be noisy y ∼ N(y | f(x_t), σ)
5:17

Global Optimization = infinite bandits


• In global optimization f (x) defines a reward for every x ∈ Rn
– Instead of a finite number of actions at we now have xt

• Optimal Optimization could be defined as: find π : h_t ↦ x_t that
$$\min \Big\langle \sum_{t=1}^T f(x_t) \Big\rangle$$
or
$$\min \langle f(x_T)\rangle$$
5:18
Gaussian Processes as belief
• The unknown “world property” is the function θ = f
• Given a Gaussian Process prior GP(f|µ, C) over f and a history
$$D_t = [(x_1,y_1),(x_2,y_2),...,(x_{t-1},y_{t-1})]$$
the belief is
$$b_t(f) = P(f\,|\,D_t) = \text{GP}(f\,|\,D_t,\mu,C)$$
$$\text{Mean}(f(x)) = \hat f(x) = \kappa(x)(K+\sigma^2 I)^{-1} y \qquad\text{response surface}$$
$$\text{Var}(f(x)) = \hat\sigma(x) = k(x,x) - \kappa(x)(K+\sigma^2 I_n)^{-1}\kappa(x) \qquad\text{confidence interval}$$

• Side notes:
– Don’t forget that Var(y ∗ |x∗ , D) = σ 2 + Var(f (x∗ )|D)
– We can also handle discrete-valued functions f using GP classification
5:19

5:20

Optimal optimization via belief planning


• As for bandits it holds
$$\begin{aligned}
V_{t-1}(b_{t-1}) &= \max_\pi \Big\langle\sum_{t=t}^T y_t\Big\rangle\\
&= \max_{x_t}\int_{y_t} P(y_t|x_t,b_{t-1})\,\Big[y_t + V_t(b_{t-1}[x_t,y_t])\Big]
\end{aligned}$$
V_{t-1}(b_{t-1}) is a function over the GP-belief!
If we could compute V_{t-1}(b_{t-1}) we “optimally optimize”

• I don’t know of a minimalistic case where this might be feasible


5:21

Conclusions
• Optimization as a problem of
– Computation of the belief
– Belief planning

• Crucial in all of this: the prior P (f )


– GP prior: smoothness; but also limited: only local correlations!
No “discovery” of non-local/structural correlations through the space
– The latter would require different priors, e.g. over different function classes
5:22

Heuristics
5:23
1-step heuristics based on GPs

from Jones (2001)

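• Maximize Probability of Improvement (MPI)
$$x_t = \operatorname*{argmax}_x \int_{-\infty}^{y^*} N(y\,|\,\hat f(x),\hat\sigma(x))$$

• Maximize Expected Improvement (EI)
$$x_t = \operatorname*{argmax}_x \int_{-\infty}^{y^*} N(y\,|\,\hat f(x),\hat\sigma(x))\ (y^* - y)$$

• Maximize UCB
$$x_t = \operatorname*{argmax}_x\ \hat f(x) + \beta_t\,\hat\sigma(x)$$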

(Often, βt = 1 is chosen. UCB theory allows for better choices. See Srinivas et al. citation below.)
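
Both integrals have closed forms in terms of the standard normal density and distribution function. The sketch below assumes a minimization setting with best observed value y*, and it treats the second argument of the Gaussian as a standard deviation `std`; function names are illustrative.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mean, std, y_best):
    """Integral of N(y | mean, std^2) over y < y_best."""
    return normal_cdf((y_best - mean) / std)


def expected_improvement(mean, std, y_best):
    """Integral of N(y | mean, std^2) * (y_best - y) over y < y_best."""
    z = (y_best - mean) / std
    return (y_best - mean) * normal_cdf(z) + std * normal_pdf(z)


pi_value = probability_of_improvement(mean=0.0, std=1.0, y_best=0.0)
ei_value = expected_improvement(mean=0.0, std=1.0, y_best=0.0)
```

Unlike MPI, the expected improvement also rewards a large predictive spread: for the same mean, a point with larger `std` gets a larger EI, which is what drives exploration.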
5:24

Each step requires solving an optimization problem


• Note: each argmax on the previous slide is an optimization problem
• As fˆ, σ̂ are given analytically, we have gradients and Hessians. BUT: multi-modal
problem.
• In practice:
– Many restarts of gradient/2nd-order optimization runs
– Restarts from a grid; from many random points

• We put a lot of effort into carefully selecting just the next query point
5:25

From: Information-theoretic regret bounds for gaussian process optimization in the bandit setting Srinivas,
Krause, Kakade & Seeger, Information Theory, 2012.

5:26
5:27

Pitfall of this approach


• A real issue, in my view, is the choice of kernel (i.e. prior P (f ))
– ’small’ kernel: almost exhaustive search
– ’wide’ kernel: miss local optima
– adapting/choosing kernel online (with CV): might fail
– real f might be non-stationary
– non RBF kernels? Too strong prior, strange extrapolation

• Assuming that we have the right prior P (f ) is really a strong assumption


5:28

Further reading
• Classically, such methods are known as Kriging

• Information-theoretic regret bounds for gaussian process optimization in the bandit


setting Srinivas, Krause, Kakade & Seeger, Information Theory, 2012.

• Efficient global optimization of expensive black-box functions. Jones, Schonlau, &


Welch, Journal of Global Optimization, 1998.
• A taxonomy of global optimization methods based on response surfaces Jones,
Journal of Global Optimization, 2001.
• Explicit local models: Towards optimal optimization algorithms, Poland, Technical
Report No. IDSIA-09-04, 2004.
5:29

Entropy Search
slides by Philipp Hennig
P. Hennig & C. Schuler: Entropy Search for Information-Efficient Global Optimiza-
tion, JMLR 13 (2012).
5:30

Predictive Entropy Search


• Hernández-Lobato, Hoffman & Ghahraman: Predictive Entropy Search for Effi-
cient Global Optimization of Black-box Functions, NIPS 2014.
• Also for constraints!
• Code: https://github.com/HIPS/Spearmint/
5:31
6 Blackbox Optimization: Local, Stochastic & Model-based Search

“Blackbox Optimization”
• We use the term to denote the problem: Let x ∈ R^n, f : R^n → R, find
$$\min_x f(x)$$
where we can only evaluate f(x) for any x ∈ R^n

∇f(x) or ∇²f(x) are not (directly) accessible

• A constrained version: Let x ∈ R^n, f : R^n → R, g : R^n → {0, 1}, find
$$\min_x f(x) \quad\text{s.t.}\quad g(x) = 1$$
where we can only evaluate f(x) and g(x) for any x ∈ R^n


I haven’t seen much work on this. Would be interesting to consider this more rigorously.
6:1

“Blackbox Optimization” – terminology/subareas


• Stochastic Optimization (aka. Stochastic Search, Metaheuristics)
– Simulated Annealing, Stochastic Hill Climbing, Tabu Search
– Evolutionary Algorithms, esp. Evolution Strategies, Covariance Matrix Adap-
tation, Estimation of Distribution Algorithms
– Some of them (implicitly or explicitly) locally approximating gradients or 2nd
order models

• Derivative-Free Optimization (see Nocedal et al.)


– Methods for (locally) convex/unimodal functions; extending gradient/2nd-order
methods
– Gradient estimation (finite differencing), model-based, Implicit Filtering

• Bayesian/Global Optimization
– Methods for arbitrary (smooth) blackbox functions that do not get stuck in local optima.
– Very interesting domain – close analogies to (active) Machine Learning, ban-
dits, POMDPs, optimal decision making/planning, optimal experimental de-
sign
6:2

Outline
• Basic downhill running
– Greedy local search, stochastic local search, simulated annealing
– Iterated local search, variable neighborhood search, Tabu search
– Coordinate & pattern search, Nelder-Mead downhill simplex

• Memorize or model something


– General stochastic search
– Evolutionary Algorithms, Evolution Strategies, CMA, EDAs
– Model-based optimization, implicit filtering

• Bayesian/Global optimization: Learn & approximate optimal optimization


– Belief planning view on optimal optimization
– GPs & Bayesian regression methods for belief tracking
– bandits, UCB, expected improvement, etc for decision making
6:3
Basic downhill running
– Greedy local search, stochastic local search, simulated annealing
– Iterated local search, variable neighborhood search, Tabu search
– Coordinate & pattern search, Nelder-Mead downhill simplex
6:4

Greedy local search (greedy downhill, hill climbing)


• Let x ∈ X be continuous or discrete
• We assume there is a finite neighborhood N(x) ⊂ X defined for every x

• Greedy local search (variant 1):

Input: initial x, function f (x)


1: repeat
2: x ← argminy∈N(x) f (y) // convention: we assume x ∈ N(x)
3: until x converges

• Variant 2: x ← the “first” y ∈ N(x) such that f (y) < f (x)


• Greedy downhill is a basic ingredient of discrete optimization
• In the continuous case: what is N(x)? Why should it be fixed or finite?
6:5

Stochastic local search


• Let x ∈ Rn
• We assume a “neighborhood” probability distribution q(y|x), typically a Gaussian $q(y|x) \propto \exp\{-\tfrac12 (y-x)^\top \Sigma^{-1} (y-x)\}$

Input: initial x, function f (x), proposal distribution q(y|x)


1: repeat
2: Sample y ∼ q(y|x)
3: If f (y) < f (x) then x ← y
4: until x converges

• The choice of q(y|x) is crucial, e.g. of the covariance matrix Σ


• Simple heuristic: decrease variance if many steps “fail”; increase variance if suffi-
cient success steps
• Covariance Matrix Adaptation (discussed later) memorizes the recent successful
steps and adapts Σ based on this.
6:6

Simulated Annealing (run also uphill)


• An extension to avoid getting stuck in local optima is to also accept steps with
f (y) > f (x):

Input: initial x, function f (x), proposal distribution q(y|x)


1: initialize the temperature T = 1
2: repeat
3:    Sample y ∼ q(y|x)
4:    Acceptance probability A = min{1, e^{(f(x)−f(y))/T} q(x|y)/q(y|x)}
5:    With probability A update x ← y
6:    Decrease T, e.g. T ← (1 − ε)T for small ε
7: until x converges

• Typically: q(y|x) ∝ exp{−1/2 (y − x)^2/σ^2}
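
The algorithm translates almost line by line into code. The sketch below assumes a scalar x and a symmetric Gaussian proposal, so the ratio q(x|y)/q(y|x) equals 1 and drops out. Two details are additions of this example rather than part of the pseudocode: the loop runs for a fixed number of steps instead of testing convergence, and the best point seen so far is remembered and returned.

```python
import math
import random


def simulated_annealing(f, x, sigma=0.5, T=1.0, eps=1e-3, steps=5000, rng=None):
    """Simulated annealing with a symmetric Gaussian proposal (scalar x)."""
    rng = rng or random.Random(0)
    best = x
    for _ in range(steps):
        y = x + rng.gauss(0.0, sigma)       # sample y ~ q(y|x)
        delta = f(x) - f(y)
        # A = min(1, exp(delta / T)); downhill steps are always accepted,
        # which also avoids overflow of exp for large positive delta / T.
        accept = delta >= 0.0 or rng.random() < math.exp(delta / T)
        if accept:
            x = y
        if f(x) < f(best):
            best = x                        # bookkeeping, not in the pseudocode
        T *= (1.0 - eps)                    # cooling schedule
    return best


def objective(x):
    """A multi-modal test function (assumption)."""
    return x * x + 2.0 * math.sin(5.0 * x)


x_start = 3.0
x_best = simulated_annealing(objective, x_start)
```

At high temperature uphill moves are accepted frequently, which lets the chain leave local minima; as T decreases the method behaves more and more like stochastic local search.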


6:7

Simulated Annealing
• Simulated Annealing is a Markov chain Monte Carlo (MCMC) method.
– Must read!: An Introduction to MCMC for Machine Learning
– These are iterative methods to sample from a distribution, in our case
$$p(x) \propto e^{-f(x)/T}$$

• For a fixed temperature T, one can prove that the set of accepted points is distributed as p(x) (but non-i.i.d.!) The acceptance probability
$$A = \min\Big\{1,\ e^{\frac{f(x)-f(y)}{T}}\,\frac{q(x|y)}{q(y|x)}\Big\}$$
compares the f(y) and f(x), but also the reversibility of q(y|x)
• When cooling the temperature, samples focus at the extrema. Guaranteed to
sample all extrema eventually
• Of high theoretical relevance, less of practical
6:8

Simulated Annealing

6:9

Random Restarts (run downhill multiple times)


• Greedy local search is typically only used as an ingredient of more robust methods
• We assume to have a start distribution q(x)

• Random restarts:

1: repeat
2: Sample x ∼ q(x)
3: x ← GreedySearch(x) or StochasticSearch(x)
4: If f (x) < f (x∗ ) then x∗ ← x
5: until run out of budget

• Greedy local search requires a neighborhood function N(x)


Stochastic local search requires a transition proposal q(y|x)
6:10

Iterated Local Search


• Random restarts may be rather expensive, sampling x ∼ q(x) is fully uninformed
• Iterated Local Search picks up the last visited local minimum x and restarts in a
meta-neighborhood N∗ (x)

• Iterated Local Search (variant 1):

Input: initial x, function f (x)


1: repeat
2: x ← argmin_{y′ ∈ {GreedySearch(y) : y ∈ N∗(x)}} f(y′)
3: until x converges
– This version evaluates a GreedySearch for all meta-neighbors y ∈ N∗(x) of the last local optimum x
– The inner GreedySearch uses another neighborhood function N(x)
• Variant 2: x ← the “first” y ∈ N∗ (x) such that f (GS(y)) < f (x)
• Stochastic variant: Neighborhoods N(x) and N∗ (x) are replaced by transition pro-
posals q(y|x) and q ∗ (y|x)
6:11

Iterated Local Search


• Application to Travelling Salesman Problem:
k-opt neighbourhood: solutions which differ by at most k edges

from Hoos & Stützle: Tutorial: Stochastic Search Algorithms

• GreedySearch uses 2-opt or 3-opt neighborhood


Iterated Local Search uses 4-opt meta-neighborhood (double bridges)
6:12

Very briefly...
• Variable Neighborhood Search:
– Switch the neighborhood function in different phases
– Similar to Iterated Local Search

• Tabu Search:
– Maintain a tabu list of points (or point features) which may not be visited again
– The list has a fixed finite size: FIFO
– Intensification and diversification heuristics make it more global
6:13

Coordinate Search

Input: Initial x ∈ Rn
1: repeat
2: for i = 1, .., n do
3: α∗ = argminα f (x + αei ) // Line Search
4: x ← x + α∗ ei
5: end for
6: until x converges

• The LineSearch must be approximated


– E.g. abort on any improvement, when f (x + αei ) < f (x)
– Remember the last successful stepsize αi for each coordinate
• Twiddle:

Input: Initial x ∈ Rn , initial stepsizes αi for all i = 1 : n


1: repeat
2: for i = 1, .., n do
3: x ← argminy∈{x−αi ei ,x,x+αi ei } f (y) // twiddle xi
4: Increase αi if x changed; decrease αi otherwise
5: end for
6: until x converges

6:14
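
A minimal Python sketch of Twiddle follows. The growth and shrink factors (1.2 and 0.5), the termination tolerance, and the test function are assumptions chosen for illustration; the slide only says "increase" and "decrease".

```python
def twiddle(f, x, alpha, grow=1.2, shrink=0.5, tol=1e-8, max_iters=10000):
    """Coordinate-wise search with one adaptive stepsize per coordinate."""
    x, alpha = list(x), list(alpha)
    fx = f(x)
    for _ in range(max_iters):
        if max(alpha) < tol:          # assumed convergence test: all steps tiny
            break
        for i in range(len(x)):
            best_step, best_f = 0.0, fx
            for step in (-alpha[i], alpha[i]):   # twiddle x_i in both directions
                y = x.copy()
                y[i] += step
                fy = f(y)
                if fy < best_f:
                    best_step, best_f = step, fy
            if best_step != 0.0:      # x changed: accept and increase stepsize
                x[i] += best_step
                fx = best_f
                alpha[i] *= grow
            else:                     # x unchanged: decrease stepsize
                alpha[i] *= shrink
    return x, fx

# Hypothetical test function with minimum at (1, -2)
x_opt, f_opt = twiddle(lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2,
                       [0.0, 0.0], [1.0, 1.0])
print(x_opt, f_opt)
```

Each sweep costs 2n function evaluations, which is worth keeping in mind when comparing against population-based methods by number of evaluations.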

Pattern Search

– In each iteration k, have a (new) set of search directions Dk = {dki } and test
steps of length αk in these directions
– In each iteration, adapt the search directions Dk and step length αk
Details: See Nocedal et al.
6:15

Nelder-Mead method – Downhill Simplex Method

6:16

Nelder-Mead method – Downhill Simplex Method


• Let x ∈ Rn
• Maintain n + 1 points x0 , .., xn , sorted by f (x0 ) < ... < f (xn )
• Compute center c of points
• Reflect: y = c + α(c − xn )
• If f (y) < f (x0 ): Expand: y = c + γ(c − xn )
• If f (y) > f (xn-1 ): Contract: y = c + ρ(c − xn )
• If still f (y) > f (xn ): Shrink ∀i=1,..,n xi ← x0 + σ(xi − x0 )

• Typical parameters: α = 1, γ = 2, ρ = −1/2, σ = 1/2
6:17
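
The steps above can be sketched in Python as follows. This is one possible reading of the slide, with assumptions stated here: the center c is taken over all points except the worst, an expansion is kept only if it beats the plain reflection, and the initial simplex and stopping rule are hypothetical choices.

```python
import numpy as np

def nelder_mead(f, x_init, step=1.0, alpha=1.0, gamma=2.0, rho=-0.5, sigma=0.5,
                tol=1e-10, max_iters=2000):
    n = len(x_init)
    # Assumed initial simplex: start point plus one offset point per coordinate
    pts = [np.array(x_init, dtype=float)]
    for i in range(n):
        p = pts[0].copy()
        p[i] += step
        pts.append(p)
    for _ in range(max_iters):
        pts.sort(key=f)                          # f(x0) <= ... <= f(xn)
        if abs(f(pts[-1]) - f(pts[0])) < tol:    # assumed stopping rule
            break
        worst = pts[-1]
        c = np.mean(pts[:-1], axis=0)            # center of all but the worst point
        y = c + alpha * (c - worst)              # reflect
        if f(y) < f(pts[0]):
            y_exp = c + gamma * (c - worst)      # expand
            if f(y_exp) < f(y):
                y = y_exp
        elif f(y) > f(pts[-2]):
            y = c + rho * (c - worst)            # contract (rho < 0: towards xn)
        if f(y) > f(worst):
            pts = [pts[0] + sigma * (p - pts[0]) for p in pts]   # shrink
        else:
            pts[-1] = y                          # replace the worst point
    return min(pts, key=f)

# Hypothetical test function with minimum at (1, -2)
quad = lambda x: (x[0] - 1) ** 2 + 10 * (x[1] + 2) ** 2
x_nm = nelder_mead(quad, [0.0, 0.0])
print(x_nm, quad(x_nm))
```

For practical work an existing implementation is usually preferable, for example `scipy.optimize.minimize` with `method="Nelder-Mead"`.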

Summary: Basic downhill running


• These methods are highly relevant! Despite their simplicity
• Essential ingredient to iterative approaches that try to find as many local minima
as possible

• Methods essentially differ in the notion of neighborhood, transition proposal, or
pattern of next search points to consider
• Iterated downhill can be very effective

• However: There should be ways to better exploit data!


– Learn from previous evaluations where to test new point
– Learn from previous local minima where to restart
6:18

Memorize or model something


– Stochastic search schemes
– Evolutionary Algorithms, Evolution Strategies, CMA, EDAs
– Model-based optimization, implicit filtering
6:19
A general stochastic search scheme
• The general scheme:
– The algorithm maintains a probability distribution pθ (x)
– In each iteration it takes n samples {x_i}_{i=1}^n ∼ pθ (x)
– Each x_i is evaluated → data {(x_i, f (x_i))}_{i=1}^n

– That data is used to update θ

Input: initial parameter θ, function f (x), distribution model pθ (x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2: Sample {x_i}_{i=1}^n ∼ pθ (x)
3: Evaluate samples, D = {(x_i, f (x_i))}_{i=1}^n
4: Update θ ← h(θ, D)
5: until θ converges

6:20

Example: Gaussian search distribution “(µ, λ)-ES”


From 1960s/70s. Rechenberg/Schwefel

• The simplest distribution family

θ = (x̂) , pθ (x) = N(x | x̂, σ 2 )

an n-dimensional isotropic Gaussian with fixed variance σ²

• Update heuristic:
– Given D = {(x_i, f (x_i))}_{i=1}^λ, select µ best: D' = bestOfµ (D)
– Compute the new mean x̂ from D'

• This algorithm is called “Evolution Strategy (µ, λ)-ES”


– The Gaussian is meant to represent a “species”
– λ offspring are generated
– the best µ selected
6:21
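
The (µ, λ)-ES update can be sketched with NumPy as below. This is a sketch under assumptions: the new mean is taken as the plain average of the µ selected points, and the values of σ, µ, λ, the iteration count, and the random seed are hypothetical defaults.

```python
import numpy as np

def mu_lambda_es(f, x_mean, sigma=0.5, mu=5, lam=20, n_iters=200, seed=0):
    """(mu, lambda)-ES with an isotropic Gaussian N(x | x_mean, sigma^2)."""
    rng = np.random.default_rng(seed)
    x_mean = np.asarray(x_mean, dtype=float)
    x_best, f_best = x_mean.copy(), f(x_mean)
    for _ in range(n_iters):
        # lambda offspring sampled from the current search distribution
        X = x_mean + sigma * rng.standard_normal((lam, x_mean.size))
        fX = np.array([f(x) for x in X])
        elite = np.argsort(fX)[:mu]          # D' = bestOf_mu(D)
        x_mean = X[elite].mean(axis=0)       # new mean from D' (assumed: average)
        if fX[elite[0]] < f_best:            # track the best point seen so far
            x_best, f_best = X[elite[0]].copy(), fX[elite[0]]
    return x_mean, x_best, f_best

x_mean, x_best, f_best = mu_lambda_es(lambda x: float(np.sum(x ** 2)), [3.0, 3.0])
print(x_best, f_best)
```

Because σ is fixed, the mean keeps jittering around the optimum at a scale set by σ; this is the limitation that stepsize and covariance adaptation address.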

θ is the “knowledge/information” gained

• The parameter θ is the only “knowledge/information” that is being propagated
between iterations
θ encodes what has been learned from the history
θ defines where to search in the future

• The downhill methods of the previous section did not store any information other
than the current x. (Exception: Tabu search, Nelder-Mead)

• Evolutionary Algorithms are a special case of this stochastic search scheme


6:22

Evolutionary Algorithms (EAs)


• EAs can well be described as special kinds of parameterizing pθ (x) and updating
θ
– The θ typically is a set of good points found so far (parents)
– Mutation & Crossover define pθ (x)
– The samples D are called offspring
– The θ-update is often a selection of the best, or “fitness-proportional” or
rank-based

• Categories of EAs:
– Evolution Strategies: x ∈ Rn , often Gaussian pθ (x)
– Genetic Algorithms: x ∈ {0, 1}n , crossover & mutation define pθ (x)
– Genetic Programming: x are programs/trees, crossover & mutation
– Estimation of Distribution Algorithms: θ directly defines pθ (x)
6:23

Covariance Matrix Adaptation (CMA-ES)


• An obvious critique of the simple Evolution Strategies:
– The search distribution N(x | x̂, σ 2 ) is isotropic
(no going forward, no preferred direction)
– The variance σ² is fixed!

• Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

6:24

Covariance Matrix Adaptation (CMA-ES)


• In Covariance Matrix Adaptation

θ = (x̂, σ, C, ρ_σ, ρ_C) , pθ (x) = N(x | x̂, σ²C)

where C is the covariance matrix of the search distribution


• The θ maintains two more pieces of information: ρ_σ and ρ_C capture the “path”
(motion) of the mean x̂ in recent iterations
• Rough outline of the θ-update:
– Let D' = bestOfµ (D) be the set of selected points
– Compute the new mean x̂ from D'
– Update ρ_σ and ρ_C proportional to x̂_{k+1} − x̂_k
– Update σ depending on |ρ_σ|
– Update C depending on ρ_C ρ_C⊤ (rank-1-update) and Var(D')
6:25

CMA references
Hansen, N. (2006), ”The CMA evolution strategy: a comparing review”
Hansen et al.: Evaluating the CMA Evolution Strategy on Multimodal Test
Functions, PPSN 2004.

• For “large enough” populations local minima are avoided


6:26
CMA conclusions
• It is a good starting point for an off-the-shelf blackbox algorithm
• It includes components like estimating the local gradient (ρ_σ, ρ_C), the local
“Hessian” (Var(D')), smoothing out local minima (large populations)
6:27

Estimation of Distribution Algorithms (EDAs)


• Generally, EDAs fit the distribution pθ (x) to model the distribution of previously
good search points
For instance, if in all previous distributions, the 3rd bit equals the 7th bit, then the search distribution pθ (x)
should put higher probability on such candidates.
pθ (x) is meant to capture the structure in previously good points, i.e. the dependencies/correlation
between variables.

• A rather successful class of EDAs on discrete spaces uses graphical models to
learn the dependencies between variables, e.g. Bayesian Optimization Algorithm (BOA)

• In continuous domains, CMA is an example for an EDA


6:28

Stochastic search conclusions

Input: initial parameter θ, function f (x), distribution model pθ (x), update heuristic h(θ, D)
Output: final θ and best point x
1: repeat
2: Sample {x_i}_{i=1}^n ∼ pθ (x)
3: Evaluate samples, D = {(x_i, f (x_i))}_{i=1}^n
4: Update θ ← h(θ, D)
5: until θ converges

• The framework is very general


• The crucial difference between algorithms is their choice of pθ (x)
6:29

Model-based optimization
following Nocedal et al. “Derivative-free optimization”
6:30

Model-based optimization
• The previous stochastic search methods are heuristics to update θ
Why not store the previous data directly?

• Model-based optimization takes the approach


– Store a data set θ = D = {(x_i, y_i)}_{i=1}^n of previously explored points
(let x̂ be the current minimum in D)
– Compute a (quadratic) model D ↦ f̂(x) = φ2 (x)⊤β
– Choose the next point as

x+ = argmin_x f̂(x) s.t. |x − x̂| < α

– Update D and α depending on f (x+ )


• The argmin is solved with constrained optimization methods
6:31

Model-based optimization
1: Initialize D with at least ½(n + 1)(n + 2) data points
2: repeat
3:   Compute a regression f̂(x) = φ2 (x)⊤β on D
4:   Compute x+ = argmin_x f̂(x) s.t. |x − x̂| < α
5:   Compute the improvement ratio ρ = (f (x̂) − f (x+ )) / (f̂(x̂) − f̂(x+ ))
6:   if ρ > ε then
7:     Increase the stepsize α
8:     Accept x̂ ← x+
9:     Add to data, D ← D ∪ {(x+ , f (x+ ))}
10:   else
11:     if det(D) is too small then // Data improvement
12:       Compute x+ = argmax_x det(D ∪ {x}) s.t. |x − x̂| < α
13:       Add to data, D ← D ∪ {(x+ , f (x+ ))}
14:     else
15:       Decrease the stepsize α
16:     end if
17:   end if
18:   Prune the data, e.g., remove argmax_{x∈∆} det(D \ {x})
19: until x converges


• Variant: Initialize with only n + 1 data points and fit a linear model as long as
|D| < ½(n + 1)(n + 2) = dim(φ2 (x))
6:32

Model-based optimization
• Optimal parameters (with data matrix X ∈ R^{n×dim(β)})

β̂_ls = (X⊤X)⁻¹ X⊤y

The determinant det(X⊤X) or det(X) (denoted det(D) on the previous slide)
is a measure of how well the data supports the regression. The data improvement
step explicitly selects a next evaluation point to increase det(D).
• Nocedal describes in more detail a geometry-improving procedure to update D.

• Model-based optimization is closely related to Bayesian approaches. But


– Should we really prune data to have only a minimal set D (of size dim(β)?)
– Is there another way to think about the “data improvement” selection of x+ ?
(→ maximizing uncertainty/information gain)
6:33
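
The regression step of model-based optimization can be sketched as follows. This is only the model fit, not the full trust-region loop. The feature ordering in `phi2` and the function names are assumptions for illustration, and `numpy.linalg.lstsq` is used in place of the explicit normal-equation inverse for numerical stability.

```python
import numpy as np

def phi2(x):
    """Quadratic features [1, x_i, x_i*x_j for i <= j]; dimension (n+1)(n+2)/2."""
    x = np.asarray(x, dtype=float)
    n = x.size
    quad = [x[i] * x[j] for i in range(n) for j in range(i, n)]
    return np.concatenate(([1.0], x, quad))

def fit_quadratic_model(X_data, y):
    """Least-squares fit of beta, plus det(X^T X) as a data-quality measure."""
    Phi = np.array([phi2(x) for x in X_data])       # one feature row per data point
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # solves min |Phi beta - y|^2
    det = np.linalg.det(Phi.T @ Phi)
    return beta, det

# Hypothetical check: recover the coefficients of a known quadratic from 10 samples
rng = np.random.default_rng(0)
X_data = rng.uniform(-1, 1, size=(10, 2))
true_beta = np.array([1.0, 2.0, -1.0, 3.0, 1.0, 0.5])
y = np.array([phi2(x) @ true_beta for x in X_data])
beta, det = fit_quadratic_model(X_data, y)
print(beta, det)
```

A small determinant indicates that the data points are nearly degenerate for the chosen features (for example almost collinear), which is when the data improvement step becomes useful.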

Implicit Filtering (briefly)


• Estimates the local gradient using finite differencing

∇_ε f (x) ≈ [ (f (x + ε e_i) − f (x − ε e_i)) / (2ε) ]_{i=1,..,n}

• Line search along the gradient; if not successful, decrease ε

• Can be extended by using ∇_ε f (x) to update an approximation of the Hessian (as
in BFGS)
6:34
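
The central finite-difference estimate can be sketched in a few lines. The function name `fd_gradient` and the default ε are illustrative assumptions; the line search and the ε reduction of implicit filtering are not shown.

```python
import numpy as np

def fd_gradient(f, x, eps=1e-4):
    """Central finite-difference gradient estimate with 2n function evaluations."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps                                   # eps * e_i
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# Hypothetical test function: f(x) = x0^2 + 3 x0 x1 has gradient (2 x0 + 3 x1, 3 x0)
g = fd_gradient(lambda x: x[0] ** 2 + 3 * x[0] * x[1], [1.0, 2.0])
print(g)
```

A large ε smooths over small-scale noise in f, while a small ε approaches the true gradient; implicit filtering exploits this by starting coarse and decreasing ε when the line search fails.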

Conclusions
• We covered
– “downhill running”
– Two flavors of methods that exploit the recent data:
– stochastic search (& EAs), maintaining θ that defines pθ (x)
– model-based opt., maintaining local data D that defines fˆ(x)

• These methods can be very efficient, but somehow the problem formalization is
unsatisfactory:
– What would be optimal optimization?
– What exactly is the information that we can gain from data about the optimum?
– If the optimization algorithm were an “AI agent”, with selecting points as its actions
and seeing f (x) as its observations, what would be its optimal decision-making
strategy?
– And what about global blackbox optimization?
6:35
7 Exercises

7.1 Exercise 1

7.1.1 Boyd & Vandenberghe

Read sections 1.1, 1.3 & 1.4 of Boyd & Vandenberghe “Convex Optimization”. This is
for you to get an impression of the book. Learn in particular about their categories of
convex and non-linear optimization problems.

7.1.2 Getting started

Consider the following functions over x ∈ Rn :

f_sq(x) = x⊤Cx , (8)

f_hole(x) = 1 − exp(−x⊤Cx) . (9)

For C = I (identity matrix) these would be fairly simple to optimize. The C matrix
changes the conditioning (“skewedness of the Hessian”) of these functions to make
them a bit more interesting. We assume that C is a diagonal matrix with entries
C(i, i) = c^((i−1)/(n−1)). We choose a conditioning¹ c = 10.

a) What are the gradients ∇fsq (x) and ∇fhole (x)?

b) What are the Hessians ∇2fsq (x) and ∇2fhole (x)?

c) Implement these functions and display them for c = 10 over x ∈ [−1, 1]². You can
use any language, but we recommend Python, Octave/Matlab, or C++ (iff you are
experienced with numerics in C++). Plotting is often quite a laborious part of coding...
For plotting a function over a 2D input one evaluates the function on a grid of points,
e.g. in Python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

space = np.linspace(-1, 1, 20)


X0, X1 = np.meshgrid(space, space)
Y = X0**2 + X1**2

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(X0, X1, Y)
plt.show()
Or in Octave:
[X0,X1] = meshgrid(linspace(-1,1,20),linspace(-1,1,20));
Y = X0.^2 + X1.^2;
mesh(X0,X1,Y);
save 'datafile' Y -ascii
Or you can store the grid data in a file and use gnuplot, e.g.:
splot [-1:1][-1:1] 'datafile' matrix us ($1/10-1):($2/10-1):3 with lines

d) Implement a simple fixed stepsize gradient descent, iterating xk+1 = xk − α∇f (xk ),
with start point x0 = (1, 1), c = 10 and heuristically chosen α.

e) If you use Python or Octave, use an off-the-shelf optimization routine (ideally
IPopt). In Python, scipy.optimize is a standard go-to solution for general optimization
problems.

¹ The word “conditioning” generally denotes the ratio of the largest and smallest
eigenvalue of the Hessian.
7.2 Exercise 2

7.2.1 Quadratics

Take the quadratic function f_sq = x⊤Cx with diagonal matrix C and entries C(i, i) = λ_i.

a) Which 3 fundamental shapes does a 2-dimensional quadratic take? Plot the surface
of fsq for various values of λ1 , λ2 (big/small, positive/negative/zero). Could you predict
these shapes before plotting them?

b) For which values of λ1 , λ2 does minx fsq (x) not have a solution? For which does it
have infinite solutions? For which does it have exactly 1 solution? Find out empirically
first, if you have to, then analytically.

c) Use the eigen-decomposition of a generic (non-diagonal) matrix C to prove that
the same 3 basic shapes appear and that the values of λ1 and λ2 have the same
implications on the existence of one or more solutions. (In this scenario, λ1 and λ2
don’t indicate the diagonal entries of C, but its eigenvalues).

7.2.2 Backtracking

Consider again the functions:

f_sq(x) = x⊤Cx (10)

f_hole(x) = 1 − exp(−x⊤Cx) (11)

with diagonal matrix C and entries C(i, i) = c^((i−1)/(n−1)). We choose a conditioning² c = 10.

a) Implement gradient descent with backtracking, as described on slide 02-05 (with
default parameters ρ). Test the algorithm on f_sq(x) and f_hole(x) with start point x0 =
(1, 1). To judge the performance, create the following plots:

• function value over the number of function evaluations.

• number of inner (line search) loops over the number of outer (gradient descent)
loops.

• function surface, this time including algorithm’s search trajectory.

b) Test also the alternative in step 3. Further, how does the performance change with
ρ_ls (the backtracking stop criterion)?

7.3 Exercise 3

7.3.1 Misc

a) How do you have to choose the “damping” λ depending on ∇2f (x) in line 3 of the
Newton method (slide 02-18) to ensure that the d is always well defined (i.e., finite)?

b) The Gauss-Newton method uses the “approximate Hessian” 2∇φ(x)⊤∇φ(x). First
show that for any vector v ∈ R^n the matrix vv⊤ is symmetric and semi-positive-definite.³
From this, how can you argue that ∇φ(x)⊤∇φ(x) is also symmetric and
semi-positive-definite?

² The word “conditioning” generally denotes the ratio of the largest and smallest
eigenvalue of the Hessian.
³ A matrix A ∈ R^{n×n} is semi-positive-definite simply when for any x ∈ R^n it holds
x⊤Ax ≥ 0. Intuitively: A might be a metric as it “measures” the norm of any x as
positive. Or: If A is a Hessian, the function is (locally) convex.

c) In the context of BFGS, convince yourself that choosing H⁻¹ = δδ⊤/(δ⊤y) indeed fulfills the
desired relation δ = H⁻¹y, where δ and y are defined as on slide 02-23. Are there other
choices of H⁻¹ that fulfill the relation? Which?

7.3.2 Gauss-Newton

In x ∈ R² consider the function

f (x) = φ(x)⊤φ(x) , φ(x) = ( sin(a x1), sin(a c x2), 2 x1, 2 c x2 )⊤

The function is plotted above for a = 4 (left) and a = 5 (right, having local minima), and
conditioning c = 1. The function is non-convex.

a) Extend your backtracking method implemented in last week’s exercise to a
Gauss-Newton method (with constant λ) to solve the unconstrained minimization
problem min_x f (x) for a random start point in x ∈ [−1, 1]². Compare the algorithm for a = 4
and a = 5 and conditioning c = 3 with gradient descent.

b) Optimize the function using your optimization library of choice (If you can, use a
BFGS implementation.)

7.4 Exercise 4

7.4.1 Alternative Barriers & Penalties

Propose 3 alternative barrier functions, and 3 alternative penalty functions. To display
functions, gnuplot is useful, e.g., plot -log(-x).

7.4.2 Squared Penalties & Log Barriers

In a previous exercise we defined the “hole function” f^c_hole(x), where we now assume a
conditioning c = 4.

Consider the optimization problem

min_x f^c_hole(x) s.t. g(x) ≤ 0 (12)

g(x) = ( x⊤x − 1 , x_n + 1/c )⊤ (13)


a) First, assume n = 2 (x ∈ R² is 2-dimensional), c = 4, and draw on paper what the
problem looks like and where you expect the optimum.

b) Implement the Squared Penalty Method. (In the inner loop you may choose any
method, including simple gradient methods.) Choose as a start point x = (1/2, 1/2). Plot
its optimization path and report on the number of total function/gradient evaluations
needed.

c) Test the scaling of the method for n = 10 dimensions.

d) Implement the Log Barrier Method and test as in b) and c). Compare the func-
tion/gradient evaluations needed.
7.5 Exercise 6

7.5.1 Min-max ≠ max-min

Give a function f (x, y) such that

max_y min_x f (x, y) ≠ min_x max_y f (x, y)

7.5.2 Primal-dual Newton method

Slide 03:38 describes the primal-dual Newton method. Implement it to solve the same
constrained problem we considered in the last exercise.

a) d = −∇r(x, λ)⁻¹ r(x, λ) defines the search direction. Ideally one can make a step
with factor α = 1 in this direction. However, line search needs to ensure (i) dual
feasibility λ > 0, (ii) primal feasibility g(x) ≤ 0, and (iii) sufficient decrease (the Wolfe
condition). Line search decreases the step factor α to ensure these conditions (in this
order), where the Wolfe condition here reads

|r(z + αd)| ≤ (1 − ρ_ls α)|r(z)| , z = (x, λ)


b) Initialize µ = 1. In each iteration decrease it by some factor.

c) Optionally, regularize ∇r(x, λ) to robustify inversion.

7.5.3 Maybe skip: Phase I & Log Barriers

Consider the same problem 14.

a) Use the method you implemented above to find a feasible initialization (Phase I). Do
this by solving the n + 1-dimensional problem

min_{(x,s)∈R^{n+1}} s s.t. ∀i : g_i(x) ≤ s, s ≥ −ε

for some very small ε. Initialize this with the infeasible point (1, 1) ∈ R².

b) Once you’ve found a feasible point, use the standard log barrier method to find the
solution to the original problem (14). Start with µ = 1, and decrease it by µ ← µ/2 in
each iteration. In each iteration also report λ_i := −µ/g_i(x) for i = 1, 2.

7.6 Exercise 5

7.6.1 Equality Constraint Penalties and augmented Lagrangian

Take a squared penalty approach to solving a constrained optimization problem


min_x f (x) + µ Σ_{i=1}^m h_i(x)² (14)

The Augmented Lagrangian method adds yet another penalty term


min_x f (x) + µ Σ_{i=1}^m h_i(x)² + Σ_{i=1}^m λ_i h_i(x) (15)

Assume that if we minimize (14) we end up at a solution x̄ for which each h_i(x̄) is
reasonably small, but not exactly zero. Prove, in the context of the Augmented
Lagrangian method, that setting λ_i = 2µh_i(x̄) will, if we assume that the gradients ∇f (x)
and ∇h(x) are (locally) constant, ensure that the minimum of (15) fulfills the constraints
h(x) = 0.

Tip: Think intuitively. Think about how the gradient that arises from the penalty in (14) is
now generated via the λ_i.
7.6.2 Lagrangian and dual function

(Taken roughly from ‘Convex Optimization’, Ex. 5.1)

A simple example. Consider the optimization problem

min x2 + 1 s.t. (x − 2)(x − 4) ≤ 0

with variable x ∈ R.

a) Derive the optimal solution x∗ and the optimal value p∗ = f (x∗ ) by hand.

b) Write down the Lagrangian L(x, λ). Plot (using gnuplot or so) L(x, λ) over x for
various values of λ ≥ 0. Verify the lower bound property minx L(x, λ) ≤ p∗ , where p∗
is the optimum value of the primal problem.

c) Derive the dual function l(λ) = minx L(x, λ) and plot it (for λ ≥ 0). Derive the dual
optimal solution λ∗ = argmaxλ l(λ). Is maxλ l(λ) = p∗ (strong duality)?

7.6.3 Augmented Lagrangian Programming

Take last week’s programming exercise on Squared Penalty and “augment” it so that it
becomes the Augmented Lagrangian method. Compare the function/gradient evalua-
tions between the simple Squared Penalty method and the Augmented method.

7.7 Exercise 6

7.7.1 Min-max ≠ max-min

Give a function f (x, y) such that

max_y min_x f (x, y) ≠ min_x max_y f (x, y)

7.7.2 Lagrangian Method of Multipliers

We have previously defined the “hole function” as f^c_hole(x) = 1 − exp(−x⊤Cx), where
C is an n × n diagonal matrix with C_ii = c^((i−1)/(n−1)). Assume conditioning c = 10 and
use the Lagrangian Method of Multipliers to solve on paper the following constrained
optimization problem in 2D.

min_x f^c_hole(x) s.t. h(x) = 0 (16)

h(x) = v⊤x − 1 (17)

Near the very end, you won’t be able to proceed until you have special values for v. Go
as far as you can without the need for these values.

7.7.3 Primal-dual Newton method

Slide 03:38 describes the primal-dual Newton method. Implement it to solve the same
constrained problem we considered in the last exercise.

a) d = −∇r(x, λ)⁻¹ r(x, λ) defines the search direction. Ideally one can make a step
with factor α = 1 in this direction. However, line search needs to ensure (i) dual
feasibility λ > 0, (ii) primal feasibility g(x) ≤ 0, and (iii) sufficient decrease (the Wolfe
condition). Line search decreases the step factor α to ensure these conditions (in this
order), where the Wolfe condition here reads

|r(z + αd)| ≤ (1 − ρ_ls α)|r(z)| , z = (x, λ)

b) Initialize µ = 1. In each iteration decrease it by some factor.

c) Optionally, regularize ∇r(x, λ) to robustify inversion.

7.8 Exercise 7

Solving real-world problems involves 2 subproblems:

1) formulating the problem as an optimization problem (conform to a standard
optimization problem category) (→ human)

2) the actual optimization problem (→ algorithm)

These exercises focus on the first type, which is just as important as the second, as it
enables the use of a wider range of solvers. Exercises from Boyd et al
http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf:

7.8.1 Network flow problem

Solve Exercise 4.12 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.

7.8.2 Minimum fuel optimal control

Solve Exercise 4.16 (pdf page 208) from Boyd & Vandenberghe, Convex Optimization.

7.8.3 Primal-Dual Newton for Quadratic Programming

Derive an explicit equation for the primal-dual Newton update of (x, λ) (slide 03:38) in
the case of Quadratic Programming. Use the special method for solving block matrix
linear equations using the Schur complements (Wikipedia “Schur complement”).

What is the update for a general Linear Program?

7.9 Exercise 7

7.9.1 CMA vs. twiddle search

At https://www.lri.fr/~hansen/cmaes_inmatlab.html there is code for CMA
for all languages (I do not recommend the C++ versions).

a) Test CMA with a standard parameter setting on a log-variant of the Rosenbrock function
(see Wikipedia). My implementation of this function in C++ is:
double LogRosenbrock(const arr& x) {
  double f=0.;
  for(uint i=1; i<x.N; i++)
    f += sqr(x(i)-sqr(x(i-1)))
       + .01*sqr(1-x(i-1));
  f = log(1.+f);
  return f;
}
where sqr computes the square of a double.
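
For readers working in Python, the same function can be written with NumPy as below. This is a direct port of the C++ snippet above; the name `log_rosenbrock` is chosen here for illustration.

```python
import numpy as np

def log_rosenbrock(x):
    """log(1 + sum_i [(x_i - x_{i-1}^2)^2 + 0.01 (1 - x_{i-1})^2]) for i = 1..N-1."""
    x = np.asarray(x, dtype=float)
    f = np.sum((x[1:] - x[:-1] ** 2) ** 2 + 0.01 * (1.0 - x[:-1]) ** 2)
    return np.log(1.0 + f)

print(log_rosenbrock([1.0, 10.0]))   # value at the suggested 2D start point
```

The minimum is at x = (1, .., 1), where the sum vanishes and the function value is log(1) = 0.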

Test CMA for the n = 2 and n = 10 dimensional Rosenbrock function. Initialize around
the start point (1, 10) and (1, 10, .., 10) ∈ R10 with standard deviation 0.1. You might
require up to 1000 iterations.

CMA should have no problem in optimizing this function – but as it always samples a
whole population of size λ, the number of evaluations is rather large. Plot f (xbest ) for
the best point found so far versus the total number of function evaluations.
b) Implement Twiddle Search (slide 05:15) and test it on the same function under same
conditions. Also plot f (xbest ) versus the total number of function evaluations and com-
pare to the CMA results.

7.10 Exercise 8

7.10.1 More on LP formulation

A few more exercises on standard techniques to convert problems into linear programs:

Solve Exercise 4.11 (pdf page 207) from Boyd & Vandenberghe, Convex Optimization.

7.10.2 Grocery Shopping

You’re at the market and you find n offers, each represented by a set of items Ai and
the respective price ci . Your goal is to buy at least one of each item for as little as
possible.

Formulate as an ILP and then define a relaxation. If possible, come up with an inter-
pretation for the relaxed problem.

7.10.3 Facility Location

There are n facilities with which to satisfy the needs of m clients. The cost for opening
facility j is fj , and the cost for servicing client i through facility j is cij . You have to find
an optimal way to open facilities and to associate clients to facilities.

Formulate as an ILP and then define a relaxation. If possible, come up with an inter-
pretation for the relaxed problem.

7.10.4 Taxicab Driver (optional)

You’re a taxicab driver in hyper-space (R^d) and have to service n clients. Each client i
has a known initial position c_i ∈ R^d and a destination d_i ∈ R^d. You start out at position
p_0 ∈ R^d and have to service all the clients while minimizing fuel use, which is
proportional to covered distance. Hyper-space is funny, so the geometry is not Euclidean and
distances are Manhattan distances.

Formulate as an ILP and then define a relaxation. If possible, come up with an inter-
pretation for the relaxed problem.

7.10.5 Programming

Use the primal-dual interior point Newton method you programmed in the previous
exercises to solve the relaxed facility location for n facilities and m clients (n and m
small enough that you can find the solution by hand, so about n ∈ {3, ..., 5} and m ∈
{5, 10}).

Sample the positions of facilities and clients uniformly in [−10, 10]2 . Also sample the
cost of opening a facility randomly in [1, 10]. Set the cost for servicing client i with
facility j as the euclidean distance between the two. (important: keep track of what
seed you are using for your RNG).

Compare the solution you found by hand to the relaxed solution, and to the relaxed
solution after rounding it to the nearest integral solution. Try to find a seed for which
the rounded solution is relatively good, and one for which the rounded solution is pretty
bad.
7.11 Exercise 8

7.11.1 Global Optimization

Find an implementation of Gaussian Processes for your language of choice (e.g. Python:
scikit-learn, or Sheffield/GPy; Octave/Matlab: gpml) and implement UCB. Test your
implementation with different hyperparameters (find the best combination of kernel and
its parameters in the GP) on the following 2D global optimization problems:

• the 2d Rosenbrock function.

• the Rastrigin function as defined in exercise e03 with a = 6.

7.11.2 Constrained Global Bayes Optimization?

On slide 5:18 it is speculated that one could consider a constrained blackbox optimiza-
tion problem as well. How could one approach this in the UCB manner?

7.12 Exercise 10

7.12.1 BlackBox Local Search Programming

1a) Implement Greedy Local Search and Stochastic Local Search

1b) Implement Random Restarts

1c) Implement Iterated Local Search

Use the above methods on the Rosenbrock and the Rastrigin functions.

7.12.2 Neighborhoods

2a) Define a deterministic and a stochastic notion of neighborhood which is appropriate
for the Grocery Shopping problem defined in e08.

2b) Define a deterministic and a stochastic notion of neighborhood which is appropriate
for the Facility Location problem defined in e08.

2c) Define a deterministic and a stochastic notion of neighborhood which is appropriate
for the Taxicab problem defined in e08.

7.13 Exercise 10

7.13.1 BlackBox Local Search Programming

1a) Implement Greedy Local Search and Stochastic Local Search

1b) Implement Random Restarts

1c) Implement Iterated Local Search

Use the above methods on the Rosenbrock and the Rastrigin functions.

7.13.2 Neighborhoods

2a) Define a deterministic and a stochastic notion of neighborhood which is appropriate
for the Grocery Shopping problem defined in e08.

2b) Define a deterministic and a stochastic notion of neighborhood which is appropriate
for the Facility Location problem defined in e08.

2c) Define a deterministic and a stochastic notion of neighborhood which is appropriate
for the Taxicab problem defined in e08.

7.14 Exercise 11

7.14.1 Model Based Optimization

1) Implement Model-based optimization as described in slide 06:33 and use it to solve
the usual Rastrigin and Rosenbrock functions.

Visualize the current estimated model at each step of the optimization procedure.

Hints:

• Try out both a quadratic model φ2 (x) = [1, x1, x2, x1², x2², x1 x2] and a linear one
φ1 (x) = [1, x1, x2].

• Initialize the model sampling .5(n + 1)(n + 2) (in the case of quadratic model)
or n + 1 (in the case of linear model) points around the starting position.

• For any given set of datapoints D, compute β = (X⊤X)⁻¹ X⊤y, where X contains
(row-wise) the data points (either φ1 or φ2 ) in D, and y are the respective
function evaluations.

• Compute det(D) as det(X⊤X).

• Solve the local maximization problem in line 12 directly by inspecting a finite
number of points in a grid-like structure around the current point.

• The ∆ in line 18 should just be the current dataset D.

• Always maintain at least .5(n + 1)(n + 2) (quadratic) or n + 1 (linear) datapoints
in D. This almost-surely ensures that the regression parameter β is well defined.

7.14.2 No Free Lunch Theorems

Broadly speaking, the No Free Lunch Theorems state that an algorithm can be said
to outperform another one only if certain assumptions are made about the problem
which is being solved itself. In other words, algorithms perform on average exactly the
same, if no restriction or assumption is made on the type of problem itself. Algorithms
outperform each other only w.r.t. specific classes of problems.

2a) Read the publication “No Free Lunch Theorems for Optimization” by Wolpert and
Macready and get a better feel for what the statements are about.

2b, 2c, 2d) You are given an optimization problem where the search space is a set X
with size 100, and the cost space Y is the set of integers {1, . . . , 100}. Come up with
three different algorithms, and three different assumptions about the problem-space
such that each algorithm outperforms the others in one of the assumptions.

Try to be creative, or you will all come up with the same “obvious” answers.
8 Bullet points to help learning

This is a summary list of core topics in the lecture and intended as a guide for prepara-
tion for the exam. Test yourself also on the bullet points in the table of contents. Going
through all exercises is equally important.

8.1 Optimization Problems in General

• Types of optimization problems


– General constrained optimization problem definition
– Blackbox, gradient-based, 2nd order
– Understand the differences

• There are hardly any coherent texts that cover all three


– constrained & convex optimization
– local & adaptive search
– global/Bayesian optimization

• In the lecture we usually only consider inequality constraints (for simplicity of
presentation)
– Understand in all cases how also equality constraints could be handled

8.2 Basic Unconstrained Optimization

• Plain gradient descent


– Understand the stepsize problem
– Stepsize adaptation
– Backtracking line search (2:21)

• Steepest descent
– Is the gradient the steepest direction?
– Covariance (= invariance under linear transformations) of the steepest descent
direction

• 2nd-order information
– 2nd order information can improve direction & stepsize
– Hessian needs to be pos-def (↔ f (x) is convex) or modified/approximated as
pos-def (Gauss-Newton, damping)

• Newton method
– Definition
– Adaptive stepsize & damping

• Gauss-Newton
– f (x) is a sum of squared cost terms
– The approx. Hessian 2∇φ(x)>∇φ(x) is always semi-pos-def!

• Quasi-Newton
– Accumulate gradient information to approximate a Hessian
– BFGS, understand the term δδ⊤/(δ⊤y)

• Conjugate gradient
– New direction d' should be “orthogonal” to the previous d, but relative to the
local quadratic shape, d'⊤Ad = 0 (= d' and d are conjugate)
– On quadratic functions CG converges in n iterations
• Rprop
– Seems awfully hacky
– Every coordinate is treated separately. No invariance under rotations/transformations
– Change in gradient sign → reduce stepsize; else increase
– Works surprisingly well and robust in practice

• Convergence
– With perfect line search, the extreme (finite & positive!) eigenvalues of the
Hessian ensure convergence
– The Wolfe conditions (acceptance criterion for backtracking line search) en-
sure a “significant” decrease in f (x), which also leads to convergence

• Trust region
– Alternative to stepsize adaptation and backtracking

• Evaluating optimization costs


– Be aware of differences in convention. Sometimes “1 iteration”=many function
evaluations (line search)
– Best: always report on # function evaluations

8.3 Constrained Optimization


• Overview
– General problem definition
– Convert to series of unconstrained problems: penalty, log barrier, & Augmented
Lagrangian methods
– Convert to series of QPs and line search: Sequential Quadratic Programming
– Convert to larger unconstrained problem: primal-dual Newton method
– Convert to other constrained problem: dual problem

• Log barrier method


– Definition
– Understand how the barrier gets steeper with µ → 0 (not µ → ∞!)
– Iteratively decreasing µ generates the central path
– The gradient of the log barrier generates a Lagrange term with $\lambda_i = -\frac{\mu}{g_i(x)}$!
→ Each iteration solves the modified (approximate) KKT condition
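
A sketch of the central path on an assumed toy problem, min x² subject to g(x) = 1 − x ≤ 0, whose solution is x = 1 with multiplier λ = 2. For this problem the barrier minimizer of x² − µ log(−g(x)) has a closed form, so no inner optimizer is needed.

```python
import math


def central_path_point(mu):
    """Minimizer of x^2 - mu*log(x - 1) and the implied multiplier -mu / g(x)."""
    # Stationarity: 2x - mu/(x - 1) = 0  <=>  2x^2 - 2x - mu = 0 (root with x > 1).
    x = (1.0 + math.sqrt(1.0 + 2.0 * mu)) / 2.0
    g = 1.0 - x                 # constraint value, strictly negative in the interior
    lam = -mu / g               # Lagrange term generated by the barrier gradient
    return x, lam


path = [(mu,) + central_path_point(mu) for mu in (1.0, 1e-1, 1e-2, 1e-4, 1e-6)]
for mu, x, lam in path:
    # Modified complementarity lam * g(x) = -mu holds on the whole path.
    print(f"mu={mu:g}  x={x:.6f}  lambda={lam:.6f}  lambda*g={lam * (1.0 - x):.2e}")
```

As µ → 0 the points approach x = 1 from the interior and λ approaches 2, while λ g(x) = −µ relaxes the exact complementarity condition λ g(x) = 0.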

• Squared penalty method


– Definition
– Motivates the Augmented Lagrangian

• Augmented Lagrangian
– Definition
– Role of the squared penalty: “measure” how strongly f pushes into the constraint
– Role of the Lagrangian term: generate counter force
– Understand that the λ update generates the “desired gradient”
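
A sketch of the Augmented Lagrangian iteration on an assumed equality-constrained toy problem, min x² subject to h(x) = x − 1 = 0 (solution x = 1, λ = −2). The inner problem is quadratic and is solved in closed form; the penalty weight `mu` stays fixed.

```python
def augmented_lagrangian(mu=1.0, iters=50):
    """Alternate an inner minimization over x with the update lam <- lam + 2*mu*h(x)."""
    lam = 0.0
    x = 0.0
    history = []
    for _ in range(iters):
        # Inner problem: min_x x^2 + mu*(x - 1)^2 + lam*(x - 1)
        # Stationarity: 2x + 2*mu*(x - 1) + lam = 0
        x = (2.0 * mu - lam) / (2.0 + 2.0 * mu)
        lam = lam + 2.0 * mu * (x - 1.0)     # dual update
        # After the update, grad f(x) + lam * grad h(x) = 2x + lam vanishes:
        # the Lagrange term alone now generates the required counter force.
        history.append((x, lam, 2.0 * x + lam))
    return x, lam, history


x_al, lam_al, hist_al = augmented_lagrangian()
```

The squared penalty alone would need µ → ∞ to satisfy the constraint exactly. Here the λ update takes over the counter force, so the iterates converge with finite µ.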

• The Lagrangian
– Definition
– Using the Lagrangian to solve constrained problems on paper (set both
$\nabla_x L(x,\lambda) = 0$ and $\nabla_\lambda L(x,\lambda) = 0$)
– “Balance of gradients” and the first KKT condition
– Understand in detail the full KKT conditions
– Optima are necessarily saddle points of the Lagrangian
– $\min_x L$ ↔ first KKT ↔ balance of gradients
– $\max_\lambda L$ ↔ complementarity KKT ↔ constraints

• Lagrange dual problem
– primal problem: $\min_x \max_{\lambda \ge 0} L(x,\lambda)$
– dual problem: $\max_{\lambda \ge 0} \min_x L(x,\lambda)$
– Definition of Lagrange dual
– Lower bound and strong duality
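
A worked example may help with the lower bound and strong duality. The toy problem $\min_x x^2$ s.t. $1 - x \le 0$ is an assumption chosen because everything can be computed by hand:

```latex
\begin{align*}
L(x,\lambda) &= x^2 + \lambda (1 - x) \\
\nabla_x L = 2x - \lambda = 0 \;&\Rightarrow\; x = \tfrac{\lambda}{2} \\
l(\lambda) = \min_x L(x,\lambda) &= \lambda - \tfrac{\lambda^2}{4} \\
l(\lambda) - 1 = -\left(\tfrac{\lambda}{2} - 1\right)^2 \le 0 \;&\Rightarrow\; l(\lambda) \le f(x^*) = 1 \quad \text{(lower bound)} \\
\max_{\lambda \ge 0} l(\lambda) = l(2) &= 1 = f(x^*) \quad \text{(strong duality, } x^* = 1\text{)}
\end{align*}
```

The problem is convex and strictly feasible points exist, so the duality gap is zero, as the last line confirms.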

• Primal-dual Newton method to solve KKT conditions


– Definition & description

• Phase I optimization
– Nice trick to find feasible initialization

8.4 Convex Optimization

• Definitions
– Convex, quasi-convex, uni-modal functions
– Convex optimization problem

• Linear Programming
– General and standard form definition
– Converting into standard form
– LPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian
or primal-dual methods
– Simplex Algorithm is the classical alternative; walks on the constraint edges
instead of the interior

• Application of LP:
– Very important application of LPs: LP-relaxations of integer linear programs

• Quadratic Programming
– Definition
– QPs are efficiently solved using 2nd-order log barrier, augmented Lagrangian
or primal-dual methods
– Sequential QP solves general (non-quadratic) problems by defining a local QP
for the step direction followed by a line search in that direction

8.5 Search methods for Blackbox optimization

• Overview
– Basic downhill running: mostly ignore the collected data
– Use the data to shape search: stochastic search, EAs, model-based search
– Bayesian (global) optimization

• Basic downhill running


– Greedy local search: defined by neighborhood N
– Stochastic local search: defined by transition probability q(y|x)
– Simulated Annealing: also accepts “bad” steps depending on temperature;
theoretically highly relevant, practically less
– Random restarts of local search can be efficient
– Iterated local search: use meta-neighborhood N∗ to restart
– Coordinate & Pattern search, Twiddle: use heuristics to walk along coordinates
– Nelder-Mead simplex method: reflect, expand, contract, shrink
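
A sketch of simulated annealing as a stochastic local search that also accepts “bad” steps. The Gaussian proposal, the geometric cooling schedule, all parameter values, and the multimodal test function are illustrative assumptions.

```python
import math
import random


def simulated_annealing(f, x, sigma=0.5, T0=1.0, cooling=0.99, iters=2000, seed=0):
    """Metropolis-style acceptance with temperature T; returns the best point seen."""
    rng = random.Random(seed)
    fx = f(x)
    best, f_best = list(x), fx
    T = T0
    for _ in range(iters):
        y = [xi + rng.gauss(0.0, sigma) for xi in x]   # proposal q(y|x)
        fy = f(y)
        # Always accept improvements; accept worse points with prob. exp((f(x)-f(y))/T).
        if fy <= fx or rng.random() < math.exp((fx - fy) / T):
            x, fx = y, fy
            if fx < f_best:
                best, f_best = list(x), fx
        T *= cooling                                   # lower the temperature
    return best, f_best


def f_multi(x):
    # Assumed test function with several local minima.
    return x[0] ** 2 + x[1] ** 2 + 2.0 * math.sin(3.0 * x[0]) ** 2


x0_sa = [2.0, 2.0]
best_sa, f_best_sa = simulated_annealing(f_multi, x0_sa)
```

At high temperature the search behaves almost like a random walk; as T → 0 it turns into greedy stochastic local search.
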
• Stochastic Search
– General scheme: sample from $p_\theta(x)$, update θ
– Understand the crucial role of θ: θ captures all that is maintained and updated
depending on the data; in EAs, θ is a population; in ESs, θ are parameters of a
Gaussian
– Categories of EAs: ES, GA, GP, EDA
– CMA: adapting C and σ based on the path of the mean
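
A sketch of the general scheme “sample from p_θ(x), update θ” with θ = (mean, σ) of an isotropic Gaussian. This is a simple elitist (1+λ)-style strategy with an assumed stepsize heuristic; it is not CMA and not a specific algorithm from the lecture.

```python
import random


def one_plus_lambda_es(f, mean, sigma=1.0, lam=10, iters=200, seed=0):
    """theta = (mean, sigma): sample lam points, keep the best if it improves on the mean."""
    rng = random.Random(seed)
    f_mean = f(mean)
    for _ in range(iters):
        samples = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                   for _ in range(lam)]
        best = min(samples, key=f)
        f_best = f(best)
        if f_best < f_mean:
            mean, f_mean = best, f_best   # update theta: move the mean
            sigma *= 1.5                  # success: widen the search (assumed factor)
        else:
            sigma *= 0.9                  # failure: narrow the search (assumed factor)
    return mean, sigma


def f_sphere(x):
    return sum(xi * xi for xi in x)


start_es = [3.0, -2.0]
mean_es, sigma_es = one_plus_lambda_es(f_sphere, start_es)
```

Everything the algorithm remembers between iterations is in θ; the samples themselves are discarded after each update.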

• Model-based Optimization
– Precursor of Bayesian Optimization
– Core: smart ways to keep data D healthy

8.6 Bayesian Optimization


• Multi-armed bandit framework
– Problem definition
– Understand the concepts of exploration, exploitation & belief
– Optimal optimization would mean planning (exactly) through belief space
– Upper Confidence Bound (UCB) and confidence interval
– UCB is optimistic
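
A sketch of a UCB1-style rule on Bernoulli bandits: play every arm once, then always pick the arm with the highest optimistic estimate (mean plus confidence bonus). The arm probabilities, the horizon, and the scaling `beta` are assumptions for illustration.

```python
import math
import random


def ucb1(means, T=2000, beta=1.0, seed=0):
    """Return how often each arm was played under the UCB rule."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, T + 1):
        if t <= k:
            i = t - 1                                  # initialization: play each arm once
        else:
            # Optimism: empirical mean plus a bonus that shrinks with the play count.
            i = max(range(k),
                    key=lambda a: sums[a] / counts[a]
                    + beta * math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[i] else 0.0   # Bernoulli reward
        counts[i] += 1
        sums[i] += reward
    return counts


arm_means = [0.2, 0.5, 0.8]
play_counts = ucb1(arm_means)
```

Rarely played arms keep a large bonus, so they are revisited occasionally (exploration), while arms with high empirical means are played most of the time (exploitation).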

• Global optimization
– Global optimization = infinite bandits
– Locally correlated bandits → Gaussian Process beliefs
– Maximum Probability of Improvement
– Expected Improvement
– GP-UCB

• Potential pitfalls
– Choice of prior belief (e.g. kernel of the GP) is crucial
– Pure variance-based sampling for radially symmetric kernel ≈ grid sampling
Index

Augmented Lagrangian method (3:14)

Backtracking (2:5)
Bandits (5:4)
Belief planning (5:8)
Blackbox optimization: definition (6:1)
Blackbox optimization: overview (6:3)
Broyden-Fletcher-Goldfarb-Shanno (BFGS) (2:25)

Central path (3:9)
Conjugate gradient (2:28)
Constrained optimization (3:1)
Coordinate search (6:14)
Covariance Matrix Adaptation (CMA) (6:24)
Covariant gradient descent (2:13)

Estimation of Distribution Algorithms (EDAs) (6:28)
Evolutionary algorithms (6:23)
Expected Improvement (5:24)
Exploration, Exploitation (5:6)

Function types: convex, quasi-convex, uni-modal (4:1)

Gauss-Newton method (2:20)
Gaussian Processes as belief (5:19)
General stochastic search (6:20)
Global Optimization as infinite bandits (5:17)
GP-UCB (5:24)
Gradient descent convergence (2:8)
Greedy local search (6:5)

Implicit filtering (6:34)
Iterated local search (6:11)

Karush-Kuhn-Tucker (KKT) conditions (3:25)

Lagrange dual problem (3:29)
Lagrangian: definition (3:21)
Lagrangian: relation to KKT (3:24)
Lagrangian: saddle point view (3:27)
Line search (2:5)
Linear program (LP) (4:6)
Log barrier as approximate KKT (3:33)
Log barrier method (3:6)
LP in standard form (4:7)
LP-relaxations of integer programs (4:15)

Maximal Probability of Improvement (5:24)
Model-based optimization (6:31)

Nelder-Mead simplex method (6:16)
Newton direction (2:14)
Newton method (2:15)

Pattern search (6:15)
Phase I optimization (3:40)
Plain gradient descent (2:1)
Primal-dual interior-point Newton method (3:36)

Quadratic program (QP) (4:6)
Quasi-Newton methods (2:23)

Random restarts (6:10)
Rprop (2:35)

Sequential quadratic programming (4:23)
Simplex method (4:11)
Simulated annealing (6:7)
Squared penalty method (3:12)
Steepest descent direction (2:11)
Stepsize adaptation (2:4)
Stepsize and step direction as core issues (2:2)
Stochastic local search (6:6)

Trust region (3:41)
Types of optimization problems (1:3)

Upper Confidence Bound (UCB) (5:12)

Variable neighborhood search (6:13)

Wolfe conditions (2:7)
