Advanced Algorithms in Nonlinear Optimization
Philippe Toint
Department of Mathematics, University of Namur, Belgium
( philippe.toint@fundp.ac.be )
Belgian Francqui Chair, Leuven, April 2009
Outline
1. Nonlinear optimization: motivation, past and perspectives
2. Trust region methods for unconstrained problems
3. Derivative free optimization, filters and other topics
4. Convex constraints and interior-point methods
5. The use of problem structure for large-scale applications
6. Regularization methods and nonlinear step control
7. Conclusions
Acknowledgements
This course would not have been possible without
the Francqui Foundation and the Katholieke Universiteit Leuven,
Moritz Diehl, Dirk Roose and Stefan Vandewalle (the gentle
organizers),
Fabian Bastin, Stefania Bellavia, Cinzia Cirillo, Coralia Cartis, Andy
Conn, Nick Gould, Serge Gratton, Sven Leyffer, Vincent Malmedy,
Benedetta Morini, Melodie Mouffe, Annick Sartenaer, Katya
Scheinberg, Dimitri Tomanos, Melissa Weber-Mendonca (my patient
co-authors).
Ke Chen, Patrick Laloyaux (who supplied pictures)
My grateful thanks to them all.
Nonlinear optimization: motivation, past and perspectives
Definition and examples
What is optimization?
The best choice subject to constraints

best: criterion, objective function
choice: variables whose value may be chosen
constraints: restrictions on allowed values of the variables
Nonlinear optimization: motivation, past and perspectives
Definition and examples
More formally
variables: x = (x1, x2, . . . , xn)
objective function: minimize/maximize f(x)
constraints: c(x) ≥ 0

Note: maximizing f(x) is equivalent to minimizing −f(x).

   min_x f(x)   such that   c(x) ≥ 0

(the general nonlinear optimization problem)
(+ conditions on x, f and c)
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Nature optimizes
Philippe Toint (Namur)
April 2009
6 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
People optimize (daily)
Philippe Toint (Namur)
April 2009
7 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: PAL design (1)
Design of modern Progressive Adaptive Lenses:
vary optical power of lenses while minimizing astigmatism
(figure: (a) power, (b) astigmatism)
Loos, Greiner, Seidel (1997)
Philippe Toint (Namur)
April 2009
8 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: PAL design (2)
Achievements: Loos, Greiner, Seidel (1997)
(figure: simulated view through an uncorrected lens vs. a PAL, at long and short distance)
April 2009
9 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: PAL design (3)
Is this nonlinear (= difficult)?

Assume the lens surface is z = z(x, y). The optical power is

   p(x, y) = (N³/2) [ (1 + z_x²) z_yy − 2 z_x z_y z_xy + (1 + z_y²) z_xx ]

where subscripts denote partial derivatives of z and

   N = N(x, y) = 1 / √( 1 + z_x² + z_y² ).

The surface astigmatism is then

   a(x, y) = 2 √( p(x, y)² − N⁴ ( z_xx z_yy − z_xy² ) )
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: Food sterilization (1)
A common problem in the food processing industry:
keep a max of vitamins while killing a prescribed fraction of the bacteria
heating in steam/hot water autoclaves
Sachs (2003)
Philippe Toint (Namur)
April 2009
11 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: Food sterilization (2)
Model: coupled PDEs

Concentration of micro-organisms and other nutrients:

   ∂C/∂t (x, t) = −K[θ(x, t)] C(x, t),

where θ(x, t) is the temperature, and where

   K[θ] = K₁ exp[ −K₂ (1/θ − 1/θ_ref) ]      (Arrhenius equation)

Evolution of temperature:

   c(θ) ∂θ/∂t = ∇·[ k(θ) ∇θ ],

(with suitable boundary conditions: coolant, initial temperature, . . . )
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: biological parameter estimation (1)
K-channel in the model of a neuron membrane:
Sansom (2001)
Doyle et al. (1998)
Philippe Toint (Namur)
April 2009
13 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: biological parameter estimation (2)
Where are these neurons?
in a Pacific spiny lobster!
Simmers, Meyrand and Moulin (1995)
Philippe Toint (Namur)
April 2009
14 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: biological parameter estimation (3)
After gathering experimental data (applying a current to the cell):
estimate the biological model parameters that best fit experiments
Model:
Activation: p independent gates
Deactivation: n_h gates with different dynamics
n_h + 2 coupled ODEs for the voltage, the activation level, and the partial
inactivation levels
5-point BDF over 50000 time steps
very nonlinear!
Philippe Toint (Namur)
April 2009
15 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: data assimilation for weather forecasting (1)
(Attempt to) predict. . .
tomorrow's weather
the ocean's average temperature next month
future gravity field
future currents in the ionosphere
...
Philippe Toint (Namur)
April 2009
16 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: data assimilation for weather forecasting (2)
Data: temperature, wind, pressure, . . . everywhere and at all times!
May involve up to 250000000 variables!
Philippe Toint (Namur)
April 2009
17 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: data assimilation for weather forecasting (3)
The principle:

  Known situation 2.5 days ago and background prediction
  Record temperature for the past 2.5 days
  Run the model to minimize the difference between model and observations
  Predict temperature for the next day

(figure: temperature vs. days over the 2.5-day assimilation window)

   min_{x₀}  ½ ‖x₀ − x_b‖²_{B⁻¹}  +  ½ Σ_{i=0}^{N} ‖ H M(tᵢ, x₀) − bᵢ ‖²_{Rᵢ⁻¹}
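As a toy illustration of this cost function (not from the slides: the time-stepping model M_step and observation operator H below are made-up stand-ins), one could evaluate it as follows:

```python
import numpy as np

def fourdvar_cost(x0, xb, B, obs, R, M_step, H):
    """Toy 4D-Var cost: 0.5*||x0-xb||^2_{B^-1} + 0.5*sum_i ||H M(t_i,x0) - b_i||^2_{R^-1}."""
    Binv, Rinv = np.linalg.inv(B), np.linalg.inv(R)
    d0 = x0 - xb
    cost = 0.5 * d0 @ Binv @ d0              # background (prior) term
    x = x0.copy()
    for b in obs:                            # misfit to each observation in the window
        r = H @ x - b
        cost += 0.5 * r @ Rinv @ r
        x = M_step(x)                        # advance the model to the next observation time
    return cost
```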
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: data assimilation for weather forecasting (4)
Analysis of the ocean's heat content:
CERFACS (2009)
Much better fit!
Philippe Toint (Namur)
April 2009
19 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: aeronautical structure design
minimize weight while maintaining structural integrity
mass reduction during optimization
SAMTECH (2009)
Philippe Toint (Namur)
April 2009
20 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: asteroid trajectory matching
find today's asteroid whose orbital parameters
best match one observed 50 years ago
Milani, Sansaturio et al. (2005)
Philippe Toint (Namur)
April 2009
21 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: discrete choice modelling (1)
Context: simulation of individual choices in Transportation (or other)
(mode, route, time of departure,. . . )
Random utility theory
An individual i assigns to alternative j the utility
Uij = [ parameters explaining factors ] + [ random error ]
Illustration :
U_bus = α · distance − 1.2 · price of ticket − 2.1 · delay w.r.t. car travel + ε
Philippe Toint (Namur)
April 2009
22 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: discrete choice modelling (2)
Probability that individual i chooses alternative j rather than
alternative k given by
prob( Uij ≥ Uik for all k )
Data: mobility surveys (MOBEL)
find the parameters in the utility function to
maximize likelihood of observed behaviours
Philippe Toint (Namur)
April 2009
23 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: discrete choice modelling (3)
(figure: estimated distributions of the value of time (Normal, Lognormal, Spline),
 using standard statistics vs. advanced optimization)

Estimation of the value of time lost in congested traffic
(with and without advanced optimization)
Philippe Toint (Namur)
April 2009
24 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: Poisson image denoising (1)
Consider a two-dimensional image with noise proportional to the signal:

   z_ij = u_ij + n f(u_ij)

where n is a random Gaussian noise. How to recover the original u_ij?

  use the pixel values as much as possible
  while minimizing sharp transitions (gradients)

This leads to the optimization problem

   min_u  Σ_ij ( u_ij − z_ij log(u_ij) ) + β ∫ ‖∇u‖
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: Poisson image denoising (2)
Some spectacular results: a 512 × 512 picture with 95% noise
Chan and Chen (2007)
Philippe Toint (Namur)
April 2009
26 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: shock simulation in video games
Optimize the realism of the motion of multiple rigid bodies in space
complementarity problem
   ∇_q Φ[q(t)] v(t) ≥ 0,    Φ(q(t)) ≥ 0

(q(t) = positions, v(t) = dq/dt (t) = velocities)
system of inequalities and equalities
used in realtime for video animation
Anitescu and Potra (1996)
Philippe Toint (Namur)
April 2009
27 / 323
Nonlinear optimization: motivation, past and perspectives
Definition and examples
Applications: finance
1  risk management
2  portfolio analysis
3  FX markets

(figure: investment distribution for the BoJ 1991-2004, standard vs. optimized,
 Normal vs. Spline estimates)

4  . . .

Everybody loves an optimizer!
Nonlinear optimization: motivation, past and perspectives
History
Where does optimization come from?
Nous sommes comme des nains juchés sur des épaules de géants, de telle
sorte que nous puissions voir plus de choses et de plus éloignées que n'en
voyaient ces derniers. Et cela, non point parce que notre vue serait
puissante ou notre taille avantageuse, mais parce que nous sommes portés
et exhaussés par la haute stature des géants.

We are like dwarfs standing on the shoulders of giants, such that we can
see more things and further away than they could. And this, not because
our sight would be more powerful or our height more advantageous, but
because we are carried and heightened by the high stature of the giants.
Bernard de Chartres (1130-1160)
Philippe Toint (Namur)
April 2009
29 / 323
Nonlinear optimization: motivation, past and perspectives
Euclid (300 BC)
Philippe Toint (Namur)
History
Al-Khwarizmi (783-850)
April 2009
30 / 323
Nonlinear optimization: motivation, past and perspectives
Isaac Newton (1642-1727)
Philippe Toint (Namur)
History
Leonhard Euler (1707-1783)
April 2009
31 / 323
Nonlinear optimization: motivation, past and perspectives
J. de Lagrange (1735-1813)
Philippe Toint (Namur)
History
Friedrich Gauss (1777-1855)
April 2009
32 / 323
Nonlinear optimization: motivation, past and perspectives
History
Augustin Cauchy (1789-1857) George Dantzig (1914-2005)
Philippe Toint (Namur)
April 2009
33 / 323
Nonlinear optimization: motivation, past and perspectives
Michael Powell
Philippe Toint (Namur)
History
Roger Fletcher
April 2009
34 / 323
Nonlinear optimization: motivation, past and perspectives
Basic concepts
Return to the mathematical problem
   min_x f(x)   such that   c(x) ≥ 0

Difficulties:
  the objective function f(x) is typically complicated (nonlinear)
  it is also often costly to compute
  there may be many variables
  the constraints c(x) may define a complicated geometry
Philippe Toint (Namur)
April 2009
35 / 323
Nonlinear optimization: motivation, past and perspectives
Basic concepts
An example unconstrained problem
minimize:  f(x, y) = −10x² + 10y² + 4 sin(xy) − 2x + x⁴

Two local minima: (−2.20, 0.32) and (2.30, −0.34)
How to find them?
Philippe Toint (Namur)
April 2009
36 / 323
Nonlinear optimization: motivation, past and perspectives
Basic concepts
Trust-region methods
iterative algorithms
find local solutions only
Algorithm 1.1: The trust-region framework
Until an (approximate) solution is found:
Step 1: use a model of the nonlinear function(s)
within region where it can be trusted
Step 2: notion of sufficient decrease
Step 3: measure achieved and predicted reductions
Step 4: decrease the region radius if unsuccessful
Philippe Toint (Namur)
April 2009
37 / 323
Nonlinear optimization: motivation, past and perspectives
Illustration
minimize:  f(x, y) = −10x² + 10y² + 4 sin(xy) − 2x + x⁴

Two local minima: (−2.20, 0.32) and (2.30, −0.34)
Philippe Toint (Namur)
April 2009
38 / 323
Nonlinear optimization: motivation, past and perspectives
Illustration

x0 = (0.71, −3.27)   and   f(x0) = 97.630

(figure: contours of f, and contours of the quadratic model m0 around x0)
Philippe Toint (Namur)
April 2009
39 / 323
Nonlinear optimization: motivation, past and perspectives
Illustration

  k | Δk |      sk      | f(xk + sk) | ρk = Δf/Δmk |  xk+1
  0 |  1 | (0.05, 0.93) |   43.742   |    0.998    | x0 + s0
  1 |  2 | (0.62, 1.78) |    2.306   |    1.354    | x1 + s1
  2 |  4 | (3.21, 0.00) |    6.295   |    0.004    | x2
  3 |  2 | (1.90, 0.08) |   29.392   |    0.649    | x2 + s2
  4 |  2 | (0.32, 0.15) |   31.131   |    0.857    | x3 + s3
  5 |  4 | (0.03, 0.02) |   31.176   |    1.009    | x4 + s4
  6 |  8 | (0.02, 0.00) |   31.179   |    1.013    | x5 + s5
Nonlinear optimization: motivation, past and perspectives
Illustration
Path of iterates:
From another x0 :
Philippe Toint (Namur)
April 2009
41 / 323
Nonlinear optimization: motivation, past and perspectives
Illustration
And then. . .
Does it (always) work?
The answer tomorrow!
(and subsequent days for a (biased) survey of new optimization methods)
Thank you for your attention
Philippe Toint (Namur)
April 2009
42 / 323
Trust region methods for unconstrained problems
Lesson 2:
Trust-region methods
for unconstrained problems
Philippe Toint (Namur)
April 2009
43 / 323
Trust region methods for unconstrained problems
The basic text for this course
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
Trust-Region Methods,
Nr 01 in the MPS-SIAM Series on Optimization,
SIAM, Philadelphia, USA, 2000.
Philippe Toint (Namur)
April 2009
44 / 323
Trust region methods for unconstrained problems
Background material
2.1: Background material
Philippe Toint (Namur)
April 2009
45 / 323
Trust region methods for unconstrained problems
Background material
Scalar mean-value theorems
Let S be an open subset of ℝⁿ, and suppose f : S → ℝ is
continuously differentiable throughout S. Then, if the segment
x + αs ∈ S for all α ∈ [0, 1],

   f(x + s) = f(x) + ⟨∇x f(x + ξs), s⟩

for some ξ ∈ [0, 1].

Let S be an open subset of ℝⁿ, and suppose f : S → ℝ is twice
continuously differentiable throughout S. Then, if the segment
x + αs ∈ S for all α ∈ [0, 1],

   f(x + s) = f(x) + ⟨∇x f(x), s⟩ + ½ ⟨s, ∇xx f(x + ξs) s⟩

for some ξ ∈ [0, 1].
Trust region methods for unconstrained problems
Background material
Vector mean-value theorem
Let S be an open subset of ℝⁿ, and suppose F : S → ℝᵐ is
continuously differentiable throughout S. Then, if the segment
x + αs ∈ S for all α ∈ [0, 1],

   F(x + s) = F(x) + ∫₀¹ ∇x F(x + αs) s dα.
Trust region methods for unconstrained problems
Background material
Taylors scalar approximation theorems (1)
Let S be an open subset of ℝⁿ, and suppose f : S → ℝ is
continuously differentiable throughout S. Suppose further that
∇x f(x) is Lipschitz continuous at x, with Lipschitz constant
γ(x) in some appropriate vector norm. Then, if the segment
x + αs ∈ S for all α ∈ [0, 1],

   |f(x + s) − m(x + s)| ≤ ½ γ(x) ‖s‖²,

where

   m(x + s) = f(x) + ⟨∇x f(x), s⟩.
Trust region methods for unconstrained problems
Background material
Taylors scalar approximation theorems (2)
Let S be an open subset of ℝⁿ, and suppose f : S → ℝ is twice
continuously differentiable throughout S. Suppose further that
∇xx f(x) is Lipschitz continuous at x, with Lipschitz constant
γ(x) in some appropriate vector norm and its induced matrix
norm. Then, if the segment x + αs ∈ S for all α ∈ [0, 1],

   |f(x + s) − m(x + s)| ≤ ⅙ γ(x) ‖s‖³,

where

   m(x + s) = f(x) + ⟨∇x f(x), s⟩ + ½ ⟨s, ∇xx f(x) s⟩.
Trust region methods for unconstrained problems
Background material
Taylors vector approximation theorem
Let S be an open subset of ℝⁿ, and suppose F : S → ℝᵐ is
continuously differentiable throughout S. Suppose further that
∇x F(x) is Lipschitz continuous at x, with Lipschitz constant
γ(x) in some appropriate vector norm and its induced matrix
norm. Then, if the segment x + αs ∈ S for all α ∈ [0, 1],

   ‖F(x + s) − M(x + s)‖ ≤ ½ γ(x) ‖s‖²,

where

   M(x + s) = F(x) + ∇x F(x) s.
Trust region methods for unconstrained problems
Background material
Newtons method
Solve
F (x) = 0
Idea: solve linear approximation
F (x) + J(x)s = 0
quadratic local convergence
. . . but not globally convergent
Yet the basis of everything that follows
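For concreteness, a minimal sketch (not from the slides) of the plain Newton iteration for F(x) = 0, assuming routines for F and its Jacobian J are available:

```python
import numpy as np

def newton(F, J, x, tol=1e-10, max_iter=50):
    """Plain Newton iteration for F(x) = 0 (no globalization: may diverge far from a root)."""
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) <= tol:          # converged
            break
        s = np.linalg.solve(J(x), -Fx)         # solve the linearized system J(x) s = -F(x)
        x = x + s
    return x

# example: solve x^2 - 2 = 0, starting from x = 1.5
root = newton(lambda x: np.array([x[0] ** 2 - 2.0]),
              lambda x: np.array([[2.0 * x[0]]]),
              np.array([1.5]))
```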
Trust region methods for unconstrained problems
Background material
Unconstrained optimality conditions
Suppose that f ∈ C¹, and that x* is a local minimizer of f(x).
Then
   ∇x f(x*) = 0.

Suppose that f ∈ C², and that x* is a local minimizer of f(x).
Then the above holds and the objective function's Hessian at
x* is positive semi-definite, that is

   ⟨s, ∇xx f(x*) s⟩ ≥ 0 for all s ∈ ℝⁿ.

⟨s, ∇xx f(x*) s⟩ > 0 for all s ≠ 0 ∈ ℝⁿ   ⇒   strict local solution
Trust region methods for unconstrained problems
Background material
Constrained optimality conditions (1)
minimize     f(x)
subject to   ci(x) = 0, for i ∈ E,
and          ci(x) ≥ 0, for i ∈ I.

Active set:

(figure: a feasible region bounded by c1(x) = 0, c2(x) = 0 and c3(x) = 0, with
 A(x1) = {1, 2},  A(x2) = ∅,  A(x3) = {3})
Trust region methods for unconstrained problems
Background material
Constrained optimality conditions (2): first order
Ignore constraint qualification!

Suppose that f, c ∈ C¹, and that x* is a local solution. Then
there exists a vector of Lagrange multipliers y* such that

   ∇x f(x*) = Σ_{i ∈ E ∪ I} [y*]ᵢ ∇x cᵢ(x*)

   cᵢ(x*) = 0 for all i ∈ E

   cᵢ(x*) ≥ 0 and [y*]ᵢ ≥ 0 for all i ∈ I

   cᵢ(x*) [y*]ᵢ = 0 for all i ∈ I.

Lagrangian:   ℓ(x, y) = f(x) − Σ_{i ∈ E ∪ I} yᵢ cᵢ(x)
Trust region methods for unconstrained problems
Background material
Constrained optimality conditions (3): second order
Suppose that f, c ∈ C², and that x* is a local minimizer of f(x).
Then there exists a vector of Lagrange multipliers y* such that the
first-order conditions hold and

   ⟨s, ∇xx ℓ(x*, y*) s⟩ ≥ 0 for all s ∈ N₊

where N₊ is the set of vectors s such that

   ⟨s, ∇x cᵢ(x*)⟩ = 0 for all i ∈ E ∪ {j ∈ A(x*) ∩ I | [y*]ⱼ > 0}

and

   ⟨s, ∇x cᵢ(x*)⟩ ≥ 0 for all i ∈ {j ∈ A(x*) ∩ I | [y*]ⱼ = 0}

strict complementarity + ⟨s, ∇xx ℓ(x*, y*) s⟩ > 0 for all s ∈ N₊ (s ≠ 0)
   ⇒ strict local solution
Trust region methods for unconstrained problems
Background material
Optimality conditions (convex 1)

Assume now that C is convex

normal cone of C at x ∈ C:

   N(x) := {y ∈ ℝⁿ | ⟨y, u − x⟩ ≤ 0, ∀u ∈ C}

tangent cone of C at x ∈ C:

   T(x) := N(x)⁰ = cl{λ(u − x) | λ ≥ 0 and u ∈ C}
Trust region methods for unconstrained problems
Background material
Optimality conditions (convex 2)
(figure: the Moreau decomposition)
Trust region methods for unconstrained problems
Background material
Optimality conditions (convex 2)

Suppose that C ≠ ∅ is convex and closed, that f is continuously
differentiable in C, and that x* is a first-order critical point for
the minimization of f over C. Then, provided that constraint
qualification holds,

   −∇x f(x*) ∈ N(x*).
Trust region methods for unconstrained problems
Background material
Conjugate gradients
Idea: minimize a convex quadratic on successive nested Krylov subspaces

Algorithm 2.1: Conjugate gradients (CG)

Given x0, set g0 = Hx0 + c and let p0 = −g0.
For k = 0, 1, . . . , until convergence, perform the iteration

   αk   = ‖gk‖₂² / ⟨pk, H pk⟩
   xk+1 = xk + αk pk
   gk+1 = gk + αk H pk
   βk   = ‖gk+1‖₂² / ‖gk‖₂²
   pk+1 = −gk+1 + βk pk
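A minimal runnable sketch of Algorithm 2.1 (illustrative only; H is assumed to be a symmetric positive-definite numpy array):

```python
import numpy as np

def conjugate_gradients(H, c, x0, tol=1e-10, max_iter=None):
    """CG sketch for minimizing (1/2) x'Hx + c'x, i.e. solving Hx = -c."""
    x = x0.copy()
    g = H @ x + c                          # gradient of the quadratic at x
    p = -g
    max_iter = max_iter or len(c)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        Hp = H @ p
        alpha = (g @ g) / (p @ Hp)
        x = x + alpha * p
        g_new = g + alpha * Hp
        beta = (g_new @ g_new) / (g @ g)
        p = -g_new + beta * p
        g = g_new
    return x
```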
Trust region methods for unconstrained problems
Background material
Preconditioning
Idea: change the variables x̄ = Rx and define M = RᵀR.

Algorithm 2.2: Preconditioned CG

Given x0, set g0 = Hx0 + c, and let v0 = M⁻¹ g0 and p0 = −v0.
For k = 0, 1, . . . , until convergence, perform the iteration

   αk   = ⟨gk, vk⟩ / ⟨pk, H pk⟩
   xk+1 = xk + αk pk
   gk+1 = gk + αk H pk
   vk+1 = M⁻¹ gk+1
   βk   = ⟨gk+1, vk+1⟩ / ⟨gk, vk⟩
   pk+1 = −vk+1 + βk pk
Trust region methods for unconstrained problems
Background material
Lanczos method
Idea: compute an orthonormal basis of the successive nested Krylov
subspaces  ⇒  makes QkᵀHQk tridiagonal

Algorithm 2.3: Lanczos

Given r0, set y0 = r0 and q_{−1} = 0.
For k = 0, 1, . . ., perform the iteration

   γk   = ‖yk‖₂
   qk   = yk / γk
   δk   = ⟨qk, H qk⟩
   yk+1 = H qk − δk qk − γk q_{k−1}
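An illustrative implementation sketch (assumptions: H symmetric, dense numpy array, no reorthogonalization):

```python
import numpy as np

def lanczos(H, r0, m):
    """Lanczos sketch: orthonormal basis Q of span{r0, H r0, ..., H^(m-1) r0};
    Q.T @ H @ Q is tridiagonal with diagonal `delta` and off-diagonal `gamma[1:]`."""
    y = np.array(r0, dtype=float)
    q_prev = np.zeros_like(y)
    gamma = np.linalg.norm(y)
    Q, delta, gammas = [], [], []
    for _ in range(m):
        if gamma < 1e-14:                      # Krylov space exhausted
            break
        q = y / gamma
        Q.append(q)
        d = q @ (H @ q)
        y = H @ q - d * q - gamma * q_prev
        delta.append(d)
        gammas.append(gamma)
        q_prev, gamma = q, np.linalg.norm(y)
    return np.array(Q).T, np.array(delta), np.array(gammas)
```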
Trust region methods for unconstrained problems
Background material
Another view on the Conjugate-Gradients method
Conjugate Gradients = Lanczos + LDLᵀ (Cholesky)

   Lanczos + Cholesky  =  conjugate gradients in one of the Krylov subspaces
Philippe Toint (Namur)
April 2009
62 / 323
Trust region methods for unconstrained problems
The Trust-region algorithm
2.2: The trust-region algorithm
Philippe Toint (Namur)
April 2009
63 / 323
Trust region methods for unconstrained problems
The Trust-region algorithm
The trust-region idea
use a model of the objective function
define a trust-region where it is thought adequate

   Bk = {x ∈ ℝⁿ | ‖x − xk‖_k ≤ Δk}

find a trial point by sufficiently decreasing the model in Bk
compute the objective function at the trial point
compare achieved vs. predicted reductions
reduce Δk if unsatisfactory
Philippe Toint (Namur)
April 2009
64 / 323
Trust region methods for unconstrained problems
The Trust-region algorithm
The basic trust-region algorithm
Algorithm 2.4: Basic trust-region algorithm (BTR)

Step 0: Initialization. x0 and Δ0 given, compute f(x0) and set k = 0.
Step 1: Model definition. Choose ‖·‖_k and define a model mk in Bk.
Step 2: Step calculation. Compute sk that sufficiently reduces the
        model mk with xk + sk ∈ Bk.
Step 3: Acceptance of the trial point. Compute f(xk + sk) and define

           ρk = [ f(xk) − f(xk + sk) ] / [ mk(xk) − mk(xk + sk) ].

        If ρk ≥ η1, then define xk+1 = xk + sk; otherwise define xk+1 = xk.
Step 4: Trust-region radius update. Set

           Δk+1 ∈ [Δk, ∞)        if ρk ≥ η2,
                   [γ2 Δk, Δk]    if ρk ∈ [η1, η2),
                   [γ1 Δk, γ2 Δk] if ρk < η1.

        Increment k by one and go to Step 1.
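To make the framework concrete, here is a compact illustrative sketch of Algorithm 2.4 (my own simple choices, not prescribed by the slides: exact derivatives, a Cauchy-point step as the "sufficient decrease" step, and η1 = 0.01, η2 = 0.9, γ1 = 0.25, γ2 = 0.5):

```python
import numpy as np

def btr(f, grad, hess, x0, delta0=1.0, tol=1e-6, max_iter=500):
    """Sketch of the basic trust-region algorithm (BTR) with a Cauchy-point step."""
    eta1, eta2, gamma1, gamma2 = 0.01, 0.9, 0.25, 0.5
    x, delta = np.asarray(x0, dtype=float), delta0
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        # Step 2: Cauchy step (minimize the quadratic model along -g inside the region)
        gHg = g @ H @ g
        t = delta / np.linalg.norm(g)
        if gHg > 0:
            t = min(t, (g @ g) / gHg)
        s = -t * g
        # Step 3: achieved versus predicted reduction
        pred = -(g @ s + 0.5 * s @ H @ s)
        rho = (f(x) - f(x + s)) / pred
        if rho >= eta1:                      # accept the trial point
            x = x + s
        # Step 4: radius update
        if rho >= eta2:
            delta = 2.0 * delta
        elif rho >= eta1:
            delta = gamma2 * delta
        else:
            delta = gamma1 * delta
    return x
```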
Trust region methods for unconstrained problems
Basic convergence theory
2.3: Basic convergence theory
Philippe Toint (Namur)
April 2009
66 / 323
Trust region methods for unconstrained problems
Basic convergence theory
Assumptions
f ∈ C²
f(x) ≥ κ_lbf            (f bounded below)
‖∇xx f(x)‖ ≤ κ_ufh

mk ∈ C²(Bk)
mk(xk) = f(xk)
gk := ∇x mk(xk) = ∇x f(xk)
‖∇xx mk(x)‖ ≤ κ_umh − 1 for all x ∈ Bk

   (1/κ_une) ‖x‖_k ≤ ‖x‖ ≤ κ_une ‖x‖_k

. . . but use ‖·‖_k = ‖·‖₂ in what follows!
Trust region methods for unconstrained problems
Basic convergence theory
The Cauchy step
Idea: minimize mk on the Cauchy arc

   x_k^C(t) := {x | x = xk − t gk, t ≥ 0 and x ∈ Bk}.

(figure: model contours, the Cauchy arc and the Cauchy point)
Trust region methods for unconstrained problems
Basic convergence theory
The Cauchy point for quadratic models
Three cases when minimizing the quadratic mk along the Cauchy arc:

(figure: the model along the Cauchy arc in the three cases)

   mk(xk) − mk(x_k^C) ≥ ½ ‖gk‖ min[ ‖gk‖ / βk , Δk ]
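The closed-form step length behind the quadratic-model case is easy to code; an illustrative sketch (not from the slides):

```python
import numpy as np

def cauchy_point(g, H, delta):
    """Minimizer of the quadratic model along -g inside an l2 ball of radius delta."""
    gnorm = np.linalg.norm(g)
    gHg = g @ H @ g
    t_boundary = delta / gnorm                 # step length reaching the boundary
    if gHg <= 0:                               # non-positive curvature along -g: go to the boundary
        t = t_boundary
    else:
        t = min(t_boundary, gnorm ** 2 / gHg)  # unconstrained 1-D minimizer, clipped
    return -t * g
```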
Trust region methods for unconstrained problems
Basic convergence theory
The Cauchy point for general models
Three cases when minimizing the general mk along the Cauchy arc:

(figure: the model along the Cauchy arc in the three cases)

   mk(xk) − mk(x_k^{AC}) ≥ κ_dcp ‖gk‖ min[ ‖gk‖ / βk , Δk ]
Trust region methods for unconstrained problems
Basic convergence theory
The meaning of sufficient decrease
In both cases, we get:

Sufficient decrease condition:

   mk(xk) − mk(xk + sk) ≥ κ_mdc ‖gk‖ min[ ‖gk‖ / βk , Δk ]

Immediate consequence:

   Suppose that ∇x f(xk) ≠ 0. Then mk(xk + sk) < mk(xk) and sk ≠ 0.

⇒ ρk is well defined!
Trust region methods for unconstrained problems
Basic convergence theory
The exact minimizer is OK
Suppose that, for all k, sk ensures that

   mk(xk) − mk(xk + sk) ≥ κ_amm [ mk(xk) − mk(x_k^M) ],

where x_k^M minimizes the model within the trust region. Then sufficient decrease is obtained.

(figure: model contours, the trust region and the exact model minimizer)
Trust region methods for unconstrained problems
Basic convergence theory
Taylor and minimum radius
For all k,

   |f(xk + sk) − mk(xk + sk)| ≤ κ_ubh Δk².

Suppose that gk ≠ 0 and that

   Δk ≤ κ_mdc ‖gk‖ (1 − η2) / κ_ubh.

Then iteration k is very successful and Δk+1 ≥ Δk.

Suppose that ‖gk‖ ≥ κ_lbg > 0 for all k. Then there is a constant
κ_lbd > 0 such that, for all k,

   Δk ≥ κ_lbd.
Trust region methods for unconstrained problems
Basic convergence theory
First-order convergence (1)
Suppose that there are only finitely many successful iterations.
Then xk = x* for all sufficiently large k and x* is first-order
critical.

Suppose that there are infinitely many successful iterations.
Then
   lim inf_{k→∞} ‖∇x f(xk)‖ = 0.

idea: infinite descent if not critical
Trust region methods for unconstrained problems
Basic convergence theory
First-order convergence (2)
Suppose that there are infinitely many successful iterations.
Then
   lim_{k→∞} ‖∇x f(xk)‖ = 0.

(figure: proof idea, ‖gk‖ against k, with the subsequences {ti} and {ℓi} of iterations)
Trust region methods for unconstrained problems
Basic convergence theory
Convex models (1)
Suppose that λ_min[∇xx mk(x)] ≥ ε for all x ∈ [xk, xk + sk] and
for some ε > 0. Then

   ‖sk‖ ≤ (2/ε) ‖gk‖.

idea: mk curves upwards!

Suppose that {x_{ki}} → x*, that x* is first-order critical, and that
there is a constant κ_smh > 0 such that

   min_{x ∈ Bk} λ_min[∇xx mk(x)] ≥ κ_smh

whenever xk is sufficiently close to x*. Suppose finally that
∇xx f(x*) is nonsingular. Then the complete sequence of iterates {xk} converges to x*.

idea: steps too short to escape the local basin
Trust region methods for unconstrained problems
Basic convergence theory
Convex models (2)
But. . .
Trust region methods for unconstrained problems
Basic convergence theory
Asymptotically exact Hessians
Assume also that

   lim_{k→∞} ‖∇xx f(xk) − ∇xx mk(xk)‖ = 0   whenever   lim_{k→∞} ‖gk‖ = 0

Suppose that {x_{ki}} → x*, that x* is first-order critical, that
sk ≠ 0 for all k sufficiently large, and that ∇xx f(x*) is positive
definite. Then the complete sequence of iterates {xk} converges to x*, all iterations are eventually very successful and
the trust-region radius Δk is bounded away from zero.

idea: sufficient decrease implies that

   mk(xk) − mk(xk + sk) ≥ κ_mqd ‖sk‖² > 0.

Then ρk → 1.
Trust region methods for unconstrained problems
Basic convergence theory
Second order: the eigen point
Assume λ_min(Hk) < 0.
Then define the eigen direction uk such that

   ⟨uk, gk⟩ ≤ 0,   ‖uk‖_k = Δk,   ⟨uk, Hk uk⟩ ≤ κ_snc λ_min(Hk) Δk²,

and minimize the model along uk to compute the eigen point:

   mk(x_k^E) = mk(xk + t_k^E uk) = min_{t ∈ (0,1]} mk(xk + t uk)

(figure: the eigen direction and the eigen point in the trust region)
Trust region methods for unconstrained problems
Basic convergence theory
Model decrease at the eigen point
Suppose that λ_min(Hk) < 0, that uk is an eigen direction, and that

   ‖∇xx mk(x) − ∇xx mk(y)‖ ≤ κ_lch ‖x − y‖

for all x, y ∈ Bk. Then

   mk(xk) − mk(x_k^E) ≥ κ_sod |λ_min(Hk)| min[ λ_min(Hk)², Δk² ].

(quadratic or general model)

(figure: the model along the eigen direction in both cases)
Trust region methods for unconstrained problems
Basic convergence theory
Second order: convergence theorems
   lim sup_{k→∞} λ_min[∇xx f(xk)] ≥ 0.

Suppose that x* is an isolated limit point of the sequence of
iterates {xk}. Then x* is a second-order critical point.

Assume also that, for γ3 > 1,

   ρk ≥ η2 and Δk ≤ Δmax   ⇒   Δk+1 ∈ [γ3 Δk, γ4 Δk].

Let x* be any limit point of the sequence of iterates. Then x*
is a second-order critical point.
Trust region methods for unconstrained problems
Basic convergence theory
Different trust-region norms
Trust region methods for unconstrained problems
Basic convergence theory
Using norms for scaling
Idea: change the variables  Sk w = s.  Then

   m_k^S(xk + w) ≈ f(xk + Sk w) := f^S(w),
   B_k^S = {xk + w | ‖w‖ ≤ Δk},
   m_k^S(xk) = f(xk),
   g_k^S = ∇w f^S(0) = Skᵀ ∇x f(xk),
   H_k^S ≈ ∇ww f^S(0) = Skᵀ ∇xx f(xk) Sk.

Thus

   m_k^S(xk + w) = f(xk) + ⟨g_k^S, w⟩ + ½ ⟨w, H_k^S w⟩
                 = f(xk) + ⟨Skᵀ ∇x f(xk), w⟩ + ½ ⟨w, Skᵀ Hk Sk w⟩
                 = f(xk) + ⟨∇x f(xk), Sk w⟩ + ½ ⟨Sk w, Hk Sk w⟩
                 = f(xk) + ⟨∇x f(xk), s⟩ + ½ ⟨s, Hk s⟩
                 = mk(xk + s)
Trust region methods for unconstrained problems
Basic convergence theory
Scaling: the geometry
Trust region methods for unconstrained problems
Solving the subproblem
2.4: Solving the subproblem
Philippe Toint (Namur)
April 2009
85 / 323
Trust region methods for unconstrained problems
Solving the subproblem
The subproblem
Assume
  Euclidean norm
  quadratic model (possibly non-convex)
  (drop the index k)

   min_{s ∈ ℝⁿ}  q(s) := ⟨g, s⟩ + ½ ⟨s, H s⟩   subject to   ‖s‖₂ ≤ Δ
Trust region methods for unconstrained problems
Solving the subproblem
Possible approaches
exact minimization
truncated conjugate-gradients
CG + Lanczos (GLTR)
doglegs
eigenvalue based methods
(projection methods)
Philippe Toint (Namur)
April 2009
87 / 323
Trust region methods for unconstrained problems
Solving the subproblem
The exact minimizer
Any global minimizer of q(s) subject to ‖s‖₂ ≤ Δ satisfies the
equation

   H(λ^M) s^M = −g,

where

   H(λ^M) := H + λ^M I is positive semi-definite,
   λ^M ≥ 0, and
   λ^M ( ‖s^M‖₂ − Δ ) = 0.

If H(λ^M) is positive definite, s^M is unique.

Note: λ^M is the Lagrange multiplier
Trust region methods for unconstrained problems
Solving the subproblem
The exact minimizer: a geometrical view
Trust region methods for unconstrained problems
Solving the subproblem
Finding the exact minimizer
Eigenvalue decomposition of H:

   H = Uᵀ Λ U

where λ1 ≤ λ2 ≤ · · · ≤ λn. The characterization implies that

   λ^M ≥ −λ1

Suppose that λ > −λ1 and define

   s(λ) = −H(λ)⁻¹ g = −Uᵀ (Λ + λI)⁻¹ U g

New formulation (one dimensional): find λ with ‖s(λ)‖₂ = Δ, where

   ‖s(λ)‖₂² = ‖Uᵀ(Λ + λI)⁻¹ U g‖₂² = ‖(Λ + λI)⁻¹ U g‖₂² = Σ_{i=1}^{n} γᵢ² / (λᵢ + λ)²

and γᵢ = [U g]ᵢ.
Trust region methods for unconstrained problems
Solving the subproblem
The convex case
(figure: ‖s(λ)‖₂ as a function of λ in the convex case; the solution lies where the
 curve crosses the level Δ)
Trust region methods for unconstrained problems
Solving the subproblem
A nonconvex case
(figure: ‖s(λ)‖₂ as a function of λ in a nonconvex case)
Trust region methods for unconstrained problems
Solving the subproblem
The hard case: γ1 = 0

(figure: ‖s(λ)‖₂ as a function of λ when γ1 = 0; no root larger than 2, root near 2.2)
Trust region methods for unconstrained problems
Solving the subproblem
Near the hard case: γ1 ≈ 0

(figure: ‖s(λ)‖₂ versus λ for γ1 = 1, 0.1, 0.01, 0.00001)
Trust region methods for unconstrained problems
Solving the subproblem
The secular equation
Idea: consider the secular equation

   φ(λ) := 1/‖s(λ)‖₂ − 1/Δ = 0

(figure: φ(λ) for γ1 = 1, 0.1, 0.01, 0.00001)

⇒ apply Newton's method to φ(λ) = 0 :    λ⁺ = λ − φ(λ)/φ′(λ)
Trust region methods for unconstrained problems
Solving the subproblem
The derivatives of ()
Suppose g ≠ 0. Then
  φ(λ) is strictly increasing (λ > −λ1), and concave.

   φ′(λ) = −⟨s(λ), ∇λ s(λ)⟩ / ‖s(λ)‖₂³,   where   ∇λ s(λ) = −H(λ)⁻¹ s(λ).

Note: if H(λ) = L Lᵀ and L w = s(λ), then

   ⟨s(λ), H(λ)⁻¹ s(λ)⟩ = ⟨s(λ), L⁻ᵀ L⁻¹ s(λ)⟩ = ‖w‖²
Trust region methods for unconstrained problems
Solving the subproblem
Newtons method on the secular equation
Algorithm 2.5: Exact trust-region solver

Let λ > −λ1 and Δ > 0 be given. Repeat:

   1. Factorize H(λ) = L Lᵀ.
   2. Solve L Lᵀ s = −g.
   3. Solve L w = s.
   4. Replace λ by   λ + [ (‖s‖₂ − Δ) / Δ ] · ( ‖s‖₂² / ‖w‖₂² ).
But . . . more complications due to
bracketing the root (initial + update)
termination rule
may be preconditioned
More (1978), More-Sorensen (1983), Dollar-Gould-Robinson (2009)
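An illustrative sketch of the iteration above (assumptions: λ is started with H + λI positive definite; bracketing, termination safeguards and the hard case are deliberately omitted):

```python
import numpy as np

def exact_tr_solver(H, g, delta, lam, n_iter=20):
    """Newton iteration on the secular equation (sketch of Algorithm 2.5)."""
    n = len(g)
    for _ in range(n_iter):
        L = np.linalg.cholesky(H + lam * np.eye(n))        # H(lam) = L L^T
        s = np.linalg.solve(L.T, np.linalg.solve(L, -g))   # solve H(lam) s = -g
        w = np.linalg.solve(L, s)                          # solve L w = s
        snorm = np.linalg.norm(s)
        if abs(snorm - delta) <= 1e-10 * delta:            # ||s(lam)|| = Delta reached
            break
        lam += ((snorm - delta) / delta) * (snorm ** 2 / (w @ w))
        lam = max(lam, 0.0)                                # keep the multiplier nonnegative
    return s, lam
```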
Trust region methods for unconstrained problems
Solving the subproblem
Approximate solution by truncated CG
Fact: CG never reenters the `2 trust-region
(figure: the truncated CG path and the ℓ2 trust-region boundary)
May be preconditioned
Steihaug (1983), T. (1981)
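A compact sketch of the Steihaug-Toint truncated CG idea (illustrative only, with simplified stopping rules):

```python
import numpy as np

def truncated_cg(H, g, delta, tol=1e-10):
    """Truncated CG for min <g,s> + 0.5 <s,Hs> subject to ||s||_2 <= delta (sketch)."""
    s = np.zeros_like(g, dtype=float)
    r = np.array(g, dtype=float)              # gradient of the model at s
    p = -r
    for _ in range(len(g)):
        if np.linalg.norm(r) <= tol:
            break
        Hp = H @ p
        pHp = p @ Hp
        if pHp <= 0:
            hit_boundary = True               # negative curvature: go to the boundary
        else:
            alpha = (r @ r) / pHp
            hit_boundary = np.linalg.norm(s + alpha * p) >= delta
        if hit_boundary:
            # largest tau >= 0 with ||s + tau p|| = delta (the path never re-enters)
            a, b, c = p @ p, 2.0 * (s @ p), s @ s - delta ** 2
            tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
            return s + tau * p
        s = s + alpha * p
        r_new = r + alpha * Hp
        p = -r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return s
```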
Trust region methods for unconstrained problems
Solving the subproblem
Approximate solution by the GLTR
The Steihaug-Toint method may already hit the boundary at the steepest-descent step ⇒ sometimes slow
Idea: solve the subproblem on the nested Krylov subspaces
Algorithm 2.6: Two-phase GLTR algorithm
as long as interior: conjugate-gradients
on the boundary: Lanczos method + subproblem solution in
Krylov space
(smooth transition)
Gould-Lucidi-Roma-T. (1999)
Philippe Toint (Namur)
April 2009
99 / 323
Trust region methods for unconstrained problems
Solving the subproblem
Doglegs
Idea: use steepest descent and the full Newton step (requires convexity?)

(figure: the dogleg and double-dogleg curves joining the Cauchy point s_C to the Newton
 point s_N, and the trust-region boundary)
Powell (1970), Dennis-Mei (1979)
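An illustrative dogleg step (assuming H is positive definite so that the Newton step is well defined):

```python
import numpy as np

def dogleg_step(g, H, delta):
    """Dogleg approximation of the trust-region step (H assumed positive definite)."""
    s_newton = np.linalg.solve(H, -g)
    if np.linalg.norm(s_newton) <= delta:
        return s_newton                                   # Newton step is inside: take it
    s_cauchy = -((g @ g) / (g @ H @ g)) * g               # unconstrained Cauchy step
    if np.linalg.norm(s_cauchy) >= delta:
        return -(delta / np.linalg.norm(g)) * g           # scaled steepest-descent step
    # otherwise walk from the Cauchy point towards the Newton point until the boundary
    d = s_newton - s_cauchy
    a, b, c = d @ d, 2.0 * (s_cauchy @ d), s_cauchy @ s_cauchy - delta ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return s_cauchy + tau * d
```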
Trust region methods for unconstrained problems
Solving the subproblem
An eigenvalue approach
Rewrite
   (H + λM) s = −g
as
   ( H   g ) ( s ; 1 ) = −λ M s

or (introducing the parameter α)

   [ H   g ] [ s ]          [ M  0 ] [ s ]
   [ gᵀ  α ] [ 1 ]  =  θ(α) [ 0  1 ] [ 1 ]

and choose α such that
   λ ≥ 0,
   H + λM is positive semi-definite,
   λ ( ‖s‖_M − Δ ) = 0

Rendl-Wolkowicz (1997), Rojas-Santos-Sorensen (1999)
Trust region methods for unconstrained problems
Bibliography
Bibliography for lesson 2 (1)
J. E. Dennis and H. H. W. Mei,
Two New Unconstrained Optimization Algorithms Which Use Function and Gradient Values,
Journal of Optimization Theory and Applications, 28(4):453-482, 1979.
H. S. Dollar, N. I. M. Gould and D. P. Robinson,
On solving trust-region and other regularised subproblems in optimization,
Rutherford-Appleton Laboratory, Chilton, UK, Report RAL-TR-2009-003, 2009.
S. M. Goldfeldt, R. E. Quandt and H. F. Trotter,
Maximization by quadratic hill-climbing,
Econometrica, 34:541-551, 1966.
N. I. M. Gould, S. Lucidi, M. Roma and Ph. L. Toint,
Solving the trust-region subproblem using the Lanczos method,
SIAM Journal on Optimization, 9(2):504-525, 1999.
K. Levenberg,
A Method For The Solution Of Certain Problems In Least Squares,
Quarterly Journal on Applied Mathematics, 2:164-168, 1944.
D. Marquardt,
An Algorithm For Least-Squares Estimation Of Nonlinear Parameters,
SIAM Journal on Applied Mathematics, 11:431-441, 1963.
J. J. Moré,
The Levenberg-Marquardt algorithm: implementation and theory,
Numerical Analysis, Dundee 1977 (A. Watson, ed.), Springer Verlag, Heidelberg, 1978.
J. J. Moré,
Recent developments in algorithms and software for trust region methods,
in Mathematical Programming: The State of the Art (A. Bachem, M. Grötschel and B. Korte, eds.), Springer Verlag,
Heidelberg, pp. 258-287, 1983.
J. J. Moré and D. C. Sorensen,
Computing A Trust Region Step,
SIAM Journal on Scientific and Statistical Computing, 4(3):553-572, 1983.
Philippe Toint (Namur)
April 2009
102 / 323
Trust region methods for unconstrained problems
Bibliography
Bibliography for lesson 2 (2)
D. D. Morrison,
Methods for nonlinear least squares problems and convergence proofs,
Proceedings of the Seminar on Tracking Programs and Orbit Determination, (J. Lorell and F. Yagi, eds.), Jet Propulsion
Laboratory, Pasadena, USA, pp. 1-9, 1960.
M. J. D. Powell,
A New Algorithm for Unconstrained Optimization,
in Nonlinear Programming (J. B. Rosen, O. L. Mangasarian and K. Ritter, eds.), Academic Press, London, pp. 31-65,
1970.
F. Rendl and H. Wolkowicz,
A Semidefinite Framework for Trust Region Subproblems with Applications to Large Scale Minimization,
Mathematical Programming, 77(2):273-299, 1997.
M. Rojas, S. A. Santos and D. C. Sorensen,
A new matrix-free algorithm for the large-scale trust-region subproblem,
CAAM, Rice University, TR99-19, 1999.
D. Winfield,
Function and functional optimization by interpolation in data tables,
Ph.D. Thesis, Harvard University, Cambridge, USA, 1969.
Philippe Toint (Namur)
April 2009
103 / 323
Derivative free optimization, filters and other topics
Lesson 3:
Derivative-free optimization,
infinite dimensions and filters
Philippe Toint (Namur)
April 2009
104 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
3.1: Derivative-free optimization
Philippe Toint (Namur)
April 2009
105 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
An application of trust-regions: unconstrained DFO
Consider the unconstrained problem
min f (x)
x
Gradient (and Hessian) of f (x) unavailable
physical measurement
object code
typically small-scale (but not always. . . )
Derivative free optimization (DFO)
f (x) typically very costly
Exploit each evaluation of f (x) to the utmost possible
considerable interest of practitioners
Philippe Toint (Namur)
April 2009
106 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Interpolation methods for DFO
Idea: Winfield (1973), Powell (1994)
Until convergence:
Use the available function values to build a polynomial
interpolation model mk :
   mk(yi) = f(yi)   for all yi ∈ Y;
Minimize the model in a trust region, yielding a new
potentially good point;
Compute a new function value.
Y = interpolation set { points yi at which f (yi ) is known }
Philippe Toint (Namur)
April 2009
107 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
A naive trust-region method for DFO: illustration
(figure: the interpolation model and the trust region)
Derivative free optimization, filters and other topics
Derivative free optimization
Interpolation methods for DFO (2)
To be considered:
poisedness of the interpolation set Y
choice of models (linear, quadratic, in between, beyond)
convergence theory
numerical performance
Philippe Toint (Namur)
April 2009
109 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Poisedness
Assume a quadratic model

   mk(xk + s) = fk + ⟨gk, s⟩ + ½ ⟨s, Hk s⟩

Thus

   p = 1 + n + ½ n(n + 1) = ½ (n + 1)(n + 2)

parameters to determine ⇒ need p function values (|Y| = p)
Not sufficient!
need geometric conditions for the points in Y . . .
Philippe Toint (Namur)
April 2009
110 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Poisedness: geometry with n = 2, p = 6
(figure: six interpolation points for n = 2, p = 6, and the function values at these points)

With these 6 data points . . .
. . . is this the correct interpolation? . . . or this? . . . or this?

The difference between two such interpolants is zero on a quadratic curve containing Y!
Derivative free optimization, filters and other topics
Derivative free optimization
Poisedness: geometry (2)
If {φi(·)}, i = 1, . . . , p, is a basis for quadratic polynomials, interpolation requires

   Σ_{i=1}^{p} αi φi(yj) = f(yj),    j = 1, . . . , p

Possible poisedness measure:

   δ(Y) = det [ φi(yj) ]_{i,j = 1,...,p}

Y (well) poised  ⇔  |δ(Y)| not too small
(relative to a scale for the spread of the yi's)

⇒ notion of geometry improvement
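As an illustration (not from the slides), the determinant measure can be computed directly for the monomial quadratic basis when n = 2, p = 6; the badly poised set below lies on a quadratic curve (a circle), exactly the degenerate situation shown earlier:

```python
import numpy as np

def quadratic_basis_2d(y):
    """Monomial basis {1, x1, x2, x1^2, x1*x2, x2^2} evaluated at a point y in R^2."""
    x1, x2 = y
    return np.array([1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2])

def poisedness_determinant(Y):
    """delta(Y) = det of the p x p interpolation matrix for p = 6 points in R^2."""
    M = np.array([quadratic_basis_2d(y) for y in Y])   # row j contains phi_i(y_j)
    return np.linalg.det(M)

# a well-poised set (nonzero determinant) ...
Y_good = [(0, 0), (1, 0), (0, 1), (2, 0), (1, 1), (0, 2)]
# ... and a badly poised one: all six points lie on the circle x1^2 + x2^2 = 1,
# a quadratic curve, so the determinant (numerically) vanishes.
t = np.linspace(0, 2 * np.pi, 7)[:-1]
Y_bad = list(zip(np.cos(t), np.sin(t)))
print(poisedness_determinant(Y_good), poisedness_determinant(Y_bad))
```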
Derivative free optimization, filters and other topics
Derivative free optimization
Lagrange polynomials
Remarkable: replace ȳ by y⁺ in Y. Then

   δ(Y⁺) / δ(Y) = L(y⁺, ȳ)    is independent of the basis {φi(·)}

where, for y ∈ Y,

   L(y, ȳ) = 1 if y = ȳ,   and   L(y, ȳ) = 0 if y ≠ ȳ,

is the Lagrange fundamental polynomial.

Note: for quadratic interpolation, L(·, ȳ) is a quadratic polynomial!

Powell (1994)
Derivative free optimization, filters and other topics
Derivative free optimization
Interpolation using Lagrange polynomials
Idea: use the Lagrange polynomials to define the (quadratic) interpolant by

   mk(xk + s) = Σ_{y ∈ Yk} f(y) Lk(xk + s, y)

And then. . .

   |f(xk + s) − mk(xk + s)| ≤ κ Σ_{y ∈ Yk} ‖xk + s − y‖² |Lk(xk + s, y)|
Derivative free optimization, filters and other topics
Derivative free optimization
Interpolation using Lagrange polynomials: construction

(figures: the original function and the interpolation set, the six Lagrange polynomials
 of the set, and the final interpolating quadratic)
Derivative free optimization, filters and other topics
Derivative free optimization
Other algorithmic ingredients
include a new point in the interpolation set
need to drop an existing interpolation point?
select which one to drop: make Y as poised as possible
Note: model/function minimizer may produce bad geometry!!
geometry improvement procedure . . .
trust-region radius management
trust region = Bk = {xk + s | ‖s‖ ≤ Δk}
standard: reduce Δk when no progress
DFO: more complicated! (Could reduce too fast and prevent
convergence. . . )
⇒ verify that Y is poised before reducing Δk
Philippe Toint (Namur)
April 2009
116 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Improving the geometry in a ball
attempt to reuse past points that are close to xk
attempt to replace a distant point of Y
attempt to replace a close point of Y
good geometry for the current Δk ⇔ improvement impossible
Philippe Toint (Namur)
April 2009
117 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Self-correction at unsuccessful iterations (1)
At iteration k, define the set of exchangeable far points:

   Fk = {y ∈ Yk | ‖y − xk‖ > Δk and Lk(xk + sk, y) ≠ 0}

and the set of exchangeable close points (for some Λ > 1):

   Ck = {y ∈ Yk \ {xk} | ‖y − xk‖ ≤ Δk and |Lk(xk + sk, y)| ≥ Λ}
Philippe Toint (Namur)
April 2009
118 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Self-correction at unsuccessful iterations (2)
Remarkably,

Whenever
  iteration k is unsuccessful,
  Fk = ∅,
  Δk is small w.r.t. ‖gk‖,
then Ck ≠ ∅.

(an improvement of the geometry by a factor Λ is always possible at
unsuccessful iterations when Δk is small and all exchangeable far points
have been considered)

⇒ no need to reduce Δk forever!
Philippe Toint (Namur)
April 2009
119 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Trust-region algorithm for DFO (1)
Algorithm 3.1: TR for DFO
Step 0: Initialization. Given: x0 , 0 , Y0 ( L0 (, y )). Set k = 0.
Step 1: Criticality test [complicated and not discussed here]
Step 2: Solve the subproblem. Compute sk that sufficiently reduces mk (xk + s)
within the trust region,
Step 3: Evaluation.
Compute f (xk + sk ) and
   ρk = [ f(xk) − f(xk + sk) ] / [ mk(xk) − mk(xk + sk) ].
Step 4: Define the next iterate and interpolation set.
the big question
Step 5: Update the Lagrange polynomials.
Philippe Toint (Namur)
April 2009
120 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
Trust-region algorithm for DFO (2)
Algorithm 3.2: Step 4: Define xk+1 and Yk+1

Step 4a: Successful iteration. If ρk ≥ η1, accept xk + sk, increase Δk and exchange
         xk + sk with
            y* = arg max_{y ∈ Yk} ‖y − (xk + sk)‖² |Lk(xk + sk, y)|

Step 4b: Replace far point. If ρk < η1 (+ other technical condition) and Fk ≠ ∅, reject
         xk + sk, keep Δk and exchange xk + sk with
            y* = arg max_{y ∈ Fk} ‖y − (xk + sk)‖² |Lk(xk + sk, y)|

Step 4c: Replace close point. If ρk < η1 (+ other technical condition) and Ck ≠ ∅, reject
         xk + sk, keep Δk and exchange xk + sk with
            y* = arg max_{y ∈ Ck} ‖y − (xk + sk)‖² |Lk(xk + sk, y)|

Step 4d: Decrease the radius. Otherwise, reject xk + sk, keep Yk, and reduce Δk.
Derivative free optimization, filters and other topics
Derivative free optimization
Global convergence results
If the model is at least fully linear, then

   lim inf_{k→∞} ‖∇x f(xk)‖ = lim inf_{k→∞} ‖gk‖ = 0

Scheinberg and T. (2009)

With a more costly algorithm:

If the model is at least fully linear, then

   lim_{k→∞} ‖∇x f(xk)‖ = lim_{k→∞} ‖gk‖ = 0

If the model is at least fully quadratic, then the iterates converge to
2nd-order critical points
Philippe Toint (Namur)
April 2009
122 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
For an efficient numerical method. . .
Many more issues:
which Hessian approximation?
(full vs. diagonal or structured)
details of criticality tests difficult
details for numerically handling interpolation polynomials
(Lagrange, Newton),
reference shifts,
...
good codes around: NEWUOA, DFO efficient solvers
Powell (2008 and previously), Conn, Scheinberg and T. (1998)
Conn, Scheinberg and Vicente (2008)
Philippe Toint (Namur)
April 2009
123 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
(figure: contours of the banana function and the iterates of the DFO method)
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Derivative free optimization
On the ever famous banana function. . .
25
20
15
10
5
0
5
10
1.5
1.2
1
0.8
0.6
0.5
0.4
0
Philippe Toint (Namur)
0.2
0
April 2009
124 / 323
Derivative free optimization, filters and other topics
Infinite dimensional problems
3.2: Infinite dimensional problems
Philippe Toint (Namur)
April 2009
125 / 323
Derivative free optimization, filters and other topics
Infinite dimensional problems
Why consider infinite dimensions?
Main motivation:
large-scale finite dimensional problems often result from discretized
continuous ones (surfaces, time-trajectories, optimal control, . . . )
behaviour on these problems dominated by infinite dimensional
properties
Need to investigate infinite dimensions to ensure consistency!
Two main cases: Hilbert and Banach spaces.
Philippe Toint (Namur)
April 2009
126 / 323
Derivative free optimization, filters and other topics
Infinite dimensional problems
Convergence in Hilbert spaces
The trust-region algorithm is well-defined and globally
convergent in Hilbert spaces.
Riesz representation theorem: $\mathcal{V}' \simeq \mathcal{V}$
Cauchy point results from one-dimensional minimization
(but $x_k^M$ may not exist!)
$\beta_k \stackrel{\rm def}{=} 1 + \sup_{x \in \mathcal{B}_k} \|\nabla_{xx} m_k(x)\|_{\mathcal{V},\mathcal{V}'}, \qquad \lambda_{\min}[H] \stackrel{\rm def}{=} \inf_{d \in \mathcal{V},\, d \neq 0} \frac{\langle d, Hd\rangle}{\langle d, d\rangle}$
Philippe Toint (Namur)
April 2009
127 / 323
Derivative free optimization, filters and other topics
Infinite dimensional problems
What happens in Banach spaces?
Problem: dual space different from the primal!
Need further assumptions:
$\nabla_x f(x) \in \mathcal{V}$ for all $x \in \mathcal{V}$.
$\nabla_x f$ is uniformly continuous from $\mathcal{V}$ to $\mathcal{V}$.
For every $x \in \{x \in \mathcal{V} \mid f(x) \le f(x_0)\}$,
$\langle \nabla_x f(x), \nabla_x f(x)\rangle \;\ge\; \psi\big(\|\nabla_x f(x)\|_{\mathcal{V}'}\big)\,\|\nabla_x f(x)\|_{\mathcal{V}},$
for some continuous, monotonically increasing real function $\psi$ from
$[0, \infty]$ to itself, independent of $x$ and such that $\psi(0) = 0$
and $\psi(t) > 0$ for $t > 0$.
Philippe Toint (Namur)
April 2009
128 / 323
Derivative free optimization, filters and other topics
Infinite dimensional problems
Convergence in Banach spaces, nevertheless
The last assumption implies
$\langle g_k, g_k\rangle \;\ge\; \psi\big(\|g_k\|_{\mathcal{V}'}\big)\,\|g_k\|_{\mathcal{V}}$
. . . and sufficient decrease follows!
Is this realistic?
The additional assumptions always hold for $\mathcal{V} = L^p(\Omega)$ and
$2 \le p < \infty$, when $\|g\|_{L^p(\Omega)} \le \kappa_{\rm ubg}$.
Under these additional assumptions, the trust-region algorithm
is well-defined and globally convergent in Banach spaces.
Philippe Toint (Namur)
April 2009
129 / 323
Derivative free optimization, filters and other topics
Filter algorithms
3.3: Filter algorithms
Philippe Toint (Namur)
April 2009
130 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Monotonicity (1)
Global convergence theoretically ensured by
some global measure. . .
unconstrained : f (xk )
(constrained : some merit function at xk )
. . . with strong monotonic behaviour (Lyapunov function)
Also practically enforced by
algorithmic safeguards around Newton method
(linesearches, trust regions)
Philippe Toint (Namur)
April 2009
131 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Monotonicity (2)
But, unfortunately,
classical safeguards limit efficiency!
Of interest: design less obstructive safeguards while
ensuring better numerical performance
(the Newton Liberation Front!)
continuing to guarantee global convergence properties
Is this possible?
Typically:
abandon strict monotonicity of usual measures
but insist on average behaviour instead
Philippe Toint (Namur)
April 2009
132 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Non-monotone trust-regions
Idea: $f(x_{k+1}) < f(x_k)$ replaced by $f(x_{k+1}) < f_{r(k)}$
with
$f_{r(k)} < f_{r(k-1)}$
Further issues:
suitably define the reference iteration r (k)
adapt the trust-region algorithm: also compare achieved and
predicted reductions since reference iteration
T. (1997)
Philippe Toint (Namur)
April 2009
133 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Non-monotone TR algorithm
Algorithm 3.3: Non-monotone TR algorithm (NMTR)
Step 0: Initialization. Given: $x_0$, $\Delta_0$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$. Compute $f(x_0)$, set $k = 0$.
Step 1: Model definition. Choose $\|\cdot\|_k$ and define $m_k$ in $\mathcal{B}_k$.
Step 2: Step calculation. Compute $s_k$ that sufficiently reduces $m_k$ and $x_k + s_k \in \mathcal{B}_k$.
Step 3: Acceptance of the trial point. Compute $f(x_k + s_k)$. Define the reference iteration $r(k) \le k$ and
$\delta_k^h = \sum_{i=r(k),\, i \in \mathcal{S}}^{k-1} \big[ m_i(x_i) - m_i(x_i + s_i) \big].$
Define
$\rho_k = \max\left[ \frac{f(x_{r(k)}) - f(x_k + s_k)}{\delta_k^h + m_k(x_k) - m_k(x_k + s_k)},\; \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \right].$
If $\rho_k \ge \eta_1$, then define $x_{k+1} = x_k + s_k$; otherwise define $x_{k+1} = x_k$.
Step 4: Trust-region radius update. Set
$\Delta_{k+1} \in \begin{cases} [\Delta_k, \infty) & \text{if } \rho_k \ge \eta_2, \\ [\gamma_2 \Delta_k, \Delta_k) & \text{if } \rho_k \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_k, \gamma_2 \Delta_k] & \text{if } \rho_k < \eta_1. \end{cases}$
Increment $k$ by one and go to Step 1.
Philippe Toint (Namur)
April 2009
134 / 323
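To make Step 3 of Algorithm 3.3 concrete, here is a minimal Python sketch (not from the original course) of the non-monotone acceptance ratio; the function name `nonmonotone_ratio`, the parameter names and the numerical values are illustrative assumptions.

```python
def nonmonotone_ratio(f_ref, f_k, f_trial, pred_hist, pred_k):
    """Non-monotone acceptance ratio of Step 3 (sketch).

    f_ref     : objective value at the reference iteration r(k)
    f_k       : objective value at the current iterate x_k
    f_trial   : objective value at the trial point x_k + s_k
    pred_hist : accumulated predicted reductions since r(k) (delta_k^h)
    pred_k    : current predicted reduction m_k(x_k) - m_k(x_k + s_k)
    """
    rho_hist = (f_ref - f_trial) / (pred_hist + pred_k)   # "historical" ratio
    rho_std = (f_k - f_trial) / pred_k                     # classical TR ratio
    return max(rho_hist, rho_std)

# The trial point is slightly worse than f(x_k) but still far better than the
# reference value, so the non-monotone test accepts it (eta_1 = 0.1).
rho = nonmonotone_ratio(f_ref=10.0, f_k=2.0, f_trial=2.5,
                        pred_hist=7.0, pred_k=1.0)
print(rho >= 0.1)   # True: accepted, although f increased locally
```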
Derivative free optimization, filters and other topics
Filter algorithms
Sufficient decrease for NMTR
$f(x_{p(k)}) - f(x_{k+1}) \;\ge\; \eta_1\, \kappa_{\rm mdc} \sum_{j=p(k),\, j \in \mathcal{S}}^{k} \|g_j\| \min\!\left[ \frac{\|g_j\|}{\beta_j},\, \Delta_j \right]$
with $p(k) = r(k)$ when $\delta_k^h \ge \kappa\,\delta_k^c$, or $p(k) = k$ otherwise
[Figure: the objective values $f(x_k)$ plotted against $k$, with the reference iterations marked; the sequence is allowed to increase locally but decreases on average.]
$f(x_0) - f(x_{k+1}) \;\ge\; \eta_1\, \kappa_{\rm mdc} \sum_{t=0,\, t \in \mathcal{S}}^{k} \|g_t\| \min\!\left[ \frac{\|g_t\|}{\beta_t},\, \Delta_t \right].$
Philippe Toint (Namur)
April 2009
135 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Choosing the reference iteration (1)
Algorithm 3.4: Choosing r(k)
Step 3: Acceptance of the trial point.
Step 3a: update the iterate. Compute $f(x_k + s_k)$ and set
$\rho_k = \max\left[ \frac{f_r - f(x_k + s_k)}{\sigma_r + m_k(x_k) - m_k(x_k + s_k)},\; \frac{f(x_k) - f(x_k + s_k)}{m_k(x_k) - m_k(x_k + s_k)} \right].$
If $\rho_k < \eta_1$, then $x_{k+1} = x_k$ and go to Step 4; otherwise $x_{k+1} = x_k + s_k$ and
$\sigma_c \leftarrow \sigma_c + m_k(x_k) - m_k(x_{k+1})$ and $\sigma_r \leftarrow \sigma_r + m_k(x_k) - m_k(x_{k+1})$.
Step 3b: update the best value. If $f(x_{k+1}) < f_{\min}$ then set $f_c = f_{\min} = f(x_{k+1})$,
$\sigma_c = 0$ and $\ell = 0$ and go to Step 4; otherwise, $\ell \leftarrow \ell + 1$.
Step 3c: update the reference candidate. If $f(x_{k+1}) > f_c$, set $f_c = f(x_{k+1})$ and
$\sigma_c = 0$.
Step 3d: possibly reset the reference value. If $\ell = m$, set $f_r = f_c$ and $\sigma_r = \sigma_c$.
Philippe Toint (Namur)
April 2009
136 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Choosing the reference iteration (2): example with m = 2
[Figure: the objective values $f(x_k)$ against $k$; circled points mark new best values, arrows mark the reference iterations, and the counter $\ell$ takes the values 0 0 0 0 1 2 0 1 0 1 1 1 2 1 2 0 1 1 1 1 2 0 1 1 2 0 0 1 2 0 along the run.]
reference iteration
new best value
? : reference iteration redefined ($\ell = m$)
Philippe Toint (Namur)
April 2009
137 / 323
Derivative free optimization, filters and other topics
Filter algorithms
An unconstrained example
[Figure: log of the objective decrease against iteration count (0 to 70) for the monotone and non-monotone variants.]
Monotone and non-monotone TR (using LANCELOT B) on EXTROSNB
Philippe Toint (Namur)
April 2009
138 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Introducing the filter
A fruitful alternative: filter methods
Constrained optimization :
using the SQP step, at the same time:
reduce the objective function f (x)
reduce constraint violation $\theta(x)$
CONFLICT
Philippe Toint (Namur)
April 2009
139 / 323
Derivative free optimization, filters and other topics
Filter algorithms
The filter point of view
Fletcher and Leyffer replace question:
What is a better point?
by:
What is a worse point?
Of course, y is worse than x when
$f(x) \le f(y)$ and $\theta(x) \le \theta(y)$
(y is dominated by x)
When is $x_k + s_k$ acceptable?
Fletcher and Leyffer (2002), Fletcher, Gould, Leyffer, T. and Wächter (2002)
Philippe Toint (Namur)
April 2009
140 / 323
Derivative free optimization, filters and other topics
Filter algorithms
The standard filter
Idea: accept non-dominated points
no monotonicity of merit function implied
[Figure: the filter in the $(\theta, f)$ plane; each filter entry (dot) blocks the region it dominates, to its upper right.]
Philippe Toint (Namur)
April 2009
141 / 323
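A minimal Python sketch of the dominance test behind the filter idea; the names `dominates`, `acceptable` and `add_to_filter` are invented for the example, and the small envelope (margin) used by the actual method is omitted here.

```python
def dominates(p, q):
    """p = (f_p, theta_p) dominates q when it is at least as good in both
    the objective value and the constraint violation."""
    return p[0] <= q[0] and p[1] <= q[1]

def acceptable(point, filter_set):
    """A trial point is acceptable when no filter entry dominates it."""
    return not any(dominates(entry, point) for entry in filter_set)

def add_to_filter(point, filter_set):
    """Insert the point and discard any entries it now dominates."""
    return [e for e in filter_set if not dominates(point, e)] + [point]

F = []
for pt in [(3.0, 0.5), (2.0, 0.8), (2.5, 0.3), (2.6, 0.9)]:
    if acceptable(pt, F):
        F = add_to_filter(pt, F)
print(F)   # [(2.0, 0.8), (2.5, 0.3)]; the last point was dominated by (2.5, 0.3)
```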
Derivative free optimization, filters and other topics
Filter algorithms
Filling up the standard filter
Note: filter area is bounded in the $(f, \theta)$ space!
[Figure: adding the entry for $x_k$ with a small margin — a new point must improve either the objective below $f(x_k) - \gamma\theta_k$ or the violation below $(1-\gamma)\theta_k$ — so the remaining acceptable area shrinks.]
filter area (non)-monotonically decreasing
Philippe Toint (Namur)
April 2009
142 / 323
Derivative free optimization, filters and other topics
Filter algorithms
The (unconstrained) feasibility problem
Feasibility
Find x such that
$c(x) \ge 0, \qquad e(x) = 0$
for general smooth c and e.
Least-squares
Find x such that
$\min_x \; \sum_i \big[\theta_i(x)\big]^2$
where $\theta_i$ measures the violation of the i-th constraint.
Philippe Toint (Namur)
April 2009
143 / 323
Derivative free optimization, filters and other topics
Filter algorithms
A multidimensional filter (1)
(Simple) idea: more dimensions in filter space
[Figure: a two-dimensional filter in the $(\theta_1, \theta_2)$ plane, with entries blocking the regions they dominate.]
(full dimension vs. grouping)
Philippe Toint (Namur)
April 2009
144 / 323
Derivative free optimization, filters and other topics
Filter algorithms
A multidimensional filter (2)
Additionally
possibly consider unsigned filter entries
use a trust-region algorithm when
trial point unacceptable
convergence to non-zero solution
($\Rightarrow$ internal restoration)
Sound convergence theory
Gould, Leyffer and T. (2005)
Philippe Toint (Namur)
April 2009
145 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience: FILTRANE
Fortran 95 package
large scale problems (CUTEr interface)
includes several variants of the method
signed/unsigned filters
Gauss-Newton, Newton or adaptive models
pure trust-region option
uses preconditioned conjugate-gradients
+ Lanczos for subproblem solution
part of the GALAHAD library
Gould, Orban and T. (2003), Gould and T. (2007)
Philippe Toint (Namur)
April 2009
146 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience (1)
[Figure: performance profile — fraction of problems for which each solver is within a factor of the best (CPU time); curves for the default (filter) variant and the pure trust-region variant.]
Filter vs. trust-region (CPU time)
Philippe Toint (Namur)
April 2009
147 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience (2)
[Figure: performance profile $p(\alpha)$ comparing the filter method with LANCELOT B (CPU time).]
Filter vs. LANCELOT B (CPU time)
Philippe Toint (Namur)
April 2009
148 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience (3)
[Figure: performance profile $p(\alpha)$ comparing the filter method with the unsafeguarded ("unfettered") Newton variant (CPU time).]
Filter vs. free Newton (CPU time)
Philippe Toint (Namur)
April 2009
149 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Filter for unconstrained optimization
Again simple idea: use the gradient components $g_i$ instead of $\theta_i$
[Figure: a two-dimensional filter in the $(g_1, g_2)$ plane.]
(full dimension vs. grouping)
Philippe Toint (Namur)
April 2009
150 / 323
Derivative free optimization, filters and other topics
Filter algorithms
A few complications. . .
But . . .
g (x) = 0 not sufficient for nonconvex problems!
When negative curvature found:
reset filter
set upper bound on acceptable f (x)
(or. . . add a dimension for f in the filter)
reasonable convergence theory
Philippe Toint (Namur)
April 2009
151 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience (1)
[Figure: performance profile $p(\alpha)$ for the filter (default), pure trust-region and LANCELOT B variants (iteration counts).]
Filter vs. trust-region and LANCELOT B (iterations)
Philippe Toint (Namur)
April 2009
152 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience: HEART6
[Figure: log of the residual against iterations (up to about 80) on problem HEART6.]
Filter vs. trust-region and LANCELOT B
Philippe Toint (Namur)
April 2009
153 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience: EXTROSNB
[Figure: log of the residual against iterations (up to about 300) on problem EXTROSNB.]
Filter vs. trust-region and LANCELOT B
Philippe Toint (Namur)
April 2009
154 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Numerical experience: LOBSTERZ
[Figure: log of the residual against iterations (up to about 300) on problem LOBSTERZ.]
Filter vs. trust-region
Philippe Toint (Namur)
April 2009
155 / 323
Derivative free optimization, filters and other topics
Filter algorithms
Conclusions
derivative-free optimization possible and efficient
non-monotonicity definitely helpful
filter methods very efficient
Newton's behaviour unexplained
. . . more research needed?
Philippe Toint (Namur)
April 2009
156 / 323
Derivative free optimization, filters and other topics
Bibliography
Bibliography for lesson 3 (1)
1
A. R. Conn, K. Scheinberg and Ph. L. Toint,
A Derivative Free Optimization Algorithm in Practice,
Proceedings of the 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St.
Louis, Missouri, September 1998.
A. R. Conn, K. Scheinberg and L. N. Vicente,
Introduction to Derivative-free Optimization,
SIAM-MPS Series on Optimization, 2008.
R. Fletcher and S. Leyffer,
Nonlinear Programming without a penalty function,
Mathematical Programming A, 91(2):239-269, 2002.
R. Fletcher, N. I. M. Gould, S. Leyffer, Ph. L. Toint and A. Wächter,
Global Convergence of Trust-Region SQP-Filter Algorithms for Nonlinear Programming,
SIAM Journal on Optimization, 13(3):635-659, 2002.
N. I. M. Gould and S. Leyffer and Ph. L. Toint,
A Multidimensional Filter Algorithm for Nonlinear Equations and Nonlinear Least-Squares,
SIAM Journal on Optimization, 15(1):17-38, 2005.
N. I. M. Gould, D. Orban and Ph. L. Toint,
GALAHAD, a library of thread-safe Fortran 90 packages for large-scale nonlinear optimization, ACM Transactions on
Mathematical Software, 29(4):353-372, 2003.
N. I. M. Gould, C. Sainvitu and Ph. L. Toint,
A Filter-Trust-Region Method for Unconstrained Optimization,
SIAM Journal on Optimization, 16(2):341-357, 2005.
N. I. M. Gould and Ph. L. Toint,
FILTRANE, a Fortran 95 filter-trust-region package for solving systems of nonlinear equalities, nonlinear inequalities
and nonlinear least-squares problems,
ACM Transactions on Mathematical Software, 33(1):3-25, 2007.
M. J. D. Powell,
A direct search optimization method that models the objective by quadratic interpolation,
Presentation at the 5th Stockholm Optimization Days, Stockholm, 1994.
Philippe Toint (Namur)
April 2009
157 / 323
Derivative free optimization, filters and other topics
Bibliography
Bibliography for lesson 3 (2)
10 M. J. D. Powell,
Developments of NEWUOA for minimization without derivatives,
IMA Journal of Numerical Analysis, 28(4):649-664, 2008.
11 K. Scheinberg and Ph. L. Toint,
Self-correcting geometry in model-based algorithms for derivative-free unconstrained optimization,
Report TR09/06, FUNDP, Namur, 2009.
12 Ph. L. Toint,
A non-monotone trust-region algorithm for nonlinear optimization subject to convex constraints,
Mathematical Programming, 77(1):69-94, 1997.
13 D. Winfield,
Function Minimization by Interpolation in a Data Table,
Journal of the IMA, 12:339-347, 1973.
Philippe Toint (Namur)
April 2009
158 / 323
Convex constraints and interior-point methods
Lesson 4:
Optimization with
convex constraints
Philippe Toint (Namur)
April 2009
159 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
4.1: Projection algorithms
Philippe Toint (Namur)
April 2009
160 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Projections on simple convex domains (1)
[Figure: projections $P_C(y)$ of several points $y$ onto a box; a point already inside $C$ is its own projection.]
$[P_C(y)]_i \stackrel{\rm def}{=} \begin{cases} [x_\ell]_i & \text{if } [y]_i \le [x_\ell]_i, \\ [y]_i & \text{if } [x_\ell]_i < [y]_i < [x_u]_i, \\ [x_u]_i & \text{if } [x_u]_i \le [y]_i \end{cases}$
Philippe Toint (Namur)
April 2009
161 / 323
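The componentwise projection formula above is trivially implemented; a minimal Python sketch, assuming box (bound) constraints and the hypothetical helper name `project_box`:

```python
import numpy as np

def project_box(y, lower, upper):
    """Componentwise projection onto the box {x : lower <= x <= upper},
    i.e. the formula [P_C(y)]_i on the slide."""
    return np.minimum(np.maximum(y, lower), upper)

y = np.array([-2.0, 0.3, 5.0])
lo = np.array([0.0, 0.0, 0.0])
up = np.array([1.0, 1.0, 1.0])
print(project_box(y, lo, up))   # [0.  0.3 1. ]
```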
Convex constraints and interior-point methods
Projections and the projected gradient path
Projections on simple convex domains (2)
[Figure: projection onto a general convex set; a point outside the set maps to the nearest point of the set, a point inside is unchanged, $y = P_C(y)$.]
Philippe Toint (Namur)
April 2009
162 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Projections on simple convex domains (2)
. . . but also the ordered simplex . . .
[Figure: the ordered simplex in the three variables $(x_1, x_2, x_3)$ — another simple convex set onto which projection is computationally cheap.]
Idea: use those simple projections!
Philippe Toint (Namur)
April 2009
163 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
The projected gradient path
Define the projected gradient path = the Cauchy arc
$p(t, x) = P_C\big[x - t\,\nabla_x f(x)\big]$
[Figure: the ray $x - t\nabla_x f(x)$ leaves the feasible box; its projection $p(t,x)$ bends along the boundary and becomes constant, $p(t,x) = p(t_m, x)$, for $t \ge t_m$.]
Philippe Toint (Namur)
April 2009
164 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Two projections
$P_{T(x)}\big[-\nabla_x f(x)\big]$ versus $P_C\big[x - \nabla_x f(x)\big] - x$
[Figure: the two natural criticality measures — projecting the negative gradient onto the tangent cone $T(x)$, or projecting the gradient step onto $C$ and measuring its displacement from $x$; both vanish at critical points.]
Philippe Toint (Namur)
April 2009
165 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Measuring criticality
Measure the gain in linearized objective function per step of length $\Delta$:
$\chi(x, \Delta) \stackrel{\rm def}{=} \left| \min_{x+d \in \mathcal{F},\, \|d\| \le \Delta} \langle \nabla_x f(x), d\rangle \right|$
$\pi(x, t) = \|P_\mathcal{F}(x - t\,g(x)) - x\|, \qquad \omega(x, \Delta) = \frac{\chi(x, \Delta)}{\Delta}$
Philippe Toint (Namur)
April 2009
166 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
The criticality measure
$\chi(x) \stackrel{\rm def}{=} \chi(x, 1) = \left| \min_{x+d \in \mathcal{F},\, \|d\| \le 1} \langle \nabla_x f(x), d\rangle \right|$
the feasible reduction in the linearized objective for unit steps
reduces to $\|\nabla_x f(x)\|_2$ in the unconstrained case
Philippe Toint (Namur)
April 2009
167 / 323
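For box constraints the measure decouples componentwise if the unit step is taken in the infinity norm; this is an assumption of the sketch below (the slides leave the norm generic — the Euclidean choice is what gives $\|\nabla_x f(x)\|_2$ in the unconstrained case), and the helper name `chi_box` is invented for the example.

```python
import numpy as np

def chi_box(g, x, lower, upper):
    """Criticality measure chi(x) for F = {lower <= x <= upper}, with the
    unit step measured in the infinity norm so that the inner minimization
    separates over the components."""
    d_lo = np.maximum(-1.0, lower - x)      # feasible range for each d_i
    d_hi = np.minimum(1.0, upper - x)
    d = np.where(g > 0, d_lo, d_hi)         # minimize g_i * d_i over [d_lo, d_hi]
    return abs(np.dot(g, d))

x = np.array([0.0, 0.5])
g = np.array([-1.0, 2.0])                   # gradient at x
print(chi_box(g, x, np.zeros(2), np.ones(2)))
# component 1: x at its lower bound but the gradient pushes inwards, d_1 = 1
# component 2: d_2 = max(-1, -0.5) = -0.5, contribution -1; chi = 2
```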
Convex constraints and interior-point methods
Projections and the projected gradient path
The projected gradient path and the trust region
[Figure: the projected gradient path starting at $x_k$ bends along the feasible region; the point $x_k + d_k$ lies on the path inside the trust region of radius $\Delta_k$.]
Philippe Toint (Namur)
. projected path
April 2009
168 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
The generalized Cauchy point
Approximately minimize $m_k(\cdot)$ on the PG path
Find
$x_k^{GC} \stackrel{\rm def}{=} P_\mathcal{F}\big[x_k - t_k^{GC} g_k\big] = x_k + s_k^{GC} \qquad (t_k^{GC} > 0)$
such that
$m_k(x_k^{GC}) \le f(x_k) + \kappa_{\rm ubs}\, \langle g_k, s_k^{GC}\rangle$ (below linear approximation)
and either
$m_k(x_k^{GC}) \ge f(x_k) + \kappa_{\rm lbs}\, \langle g_k, s_k^{GC}\rangle$ (above linear approximation)
or
$\|P_{T(x_k^{GC})}[-g_k]\| \le \kappa_{\rm epp}\, |\langle g_k, s_k^{GC}\rangle|$ (close to path's end)
or
$\|s_k^{GC}\| \ge \kappa_{\rm frd}\, \Delta_k$ (close to TR boundary)
Philippe Toint (Namur)
April 2009
169 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Searching for the GCP (1)
[Figure: the model value along the projected gradient path as a function of $t$, with the brackets generated by the generalized Cauchy point search.]
mk (0 + s) = 3.57s1 1.5s2 s3 + s1 s2 + 3s2 + s2 s3 2s3 such that s 1.5 and 2.8
Philippe Toint (Namur)
April 2009
170 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Searching for the GCP (2)
[Figure: the same search on the same model, but with a smaller trust-region radius, so the search stops earlier along the path.]
mk (0 + s) = 3.57s1 1.5s2 s3 + s1 s2 + 3s2 + s2 s3 2s3 such that s 1.5 and 1.8
Philippe Toint (Namur)
April 2009
171 / 323
Convex constraints and interior-point methods
Projections and the projected gradient path
Useful properties
Piecewise search for $x_k^{GC}$ well-defined and finite
$\pi(\cdot,\cdot)$, $\chi(\cdot,\cdot)$ and $\omega(\cdot,\cdot)$ are continuous
$\pi(x, t)$ is non-decreasing (in $t$)
$\chi(x, \Delta)$ is non-decreasing (in $\Delta$)
$\omega(x, \Delta)$ is non-increasing (in $\Delta$)
$\chi(x_k) \le \chi(x_k, \|s_k^{GC}\|) + 2\,\|P_{T(x_k^{GC})}[-g_k]\|$
$\langle g_k, s_k^{GC}\rangle = -\chi(x_k, \|s_k^{GC}\|) \le 0$
$\chi(x_k, t) \le t\, \|P_{T(x(t))}[-\nabla_x f(x_k)]\|$
$|\chi(x) - \chi(y)| \le L\,\|x - y\|$
if $\nabla_x f(x)$ is continuous on a bounded level set
Philippe Toint (Namur)
April 2009
172 / 323
Convex constraints and interior-point methods
Trust-region method for convex constraints
Cauchy decrease along the projected gradient path
The Cauchy condition: minimize $m_k$ along the projected gradient path
$m_k(x_k) - m_k(x_k + s_k) \;\ge\; \kappa_{\rm CR}\, \chi_k \min\!\left[ \frac{\chi_k}{1 + \|H_k\|},\, \Delta_k,\, 1 \right]$
Idea: Linesearch conditions imply
$m_k(x_k) - m_k(x_k^{GC}) \;\ge\; \kappa_{\rm ubs}\, |\langle g_k, s_k^{GC}\rangle| = \kappa_{\rm ubs}\, \chi(x_k, \|s_k^{GC}\|)$
but need
$\|P_{T(P[x_k - t_j g_k])}[-g_k]\| \;\le\; \kappa_{\rm epp}\, \frac{|\langle g_k, s_k(t_j)\rangle|}{\Delta_k}$
Now define $\tilde\chi_k \stackrel{\rm def}{=} \min[1, \Delta_k]\, \chi_k$. Then
$m_k(x_k) - m_k(x_k^{GC}) \;\ge\; \kappa_{\rm dcp}\, \tilde\chi_k \min\!\left[ \frac{\tilde\chi_k}{\beta_k},\, \Delta_k \right]$
Philippe Toint (Namur)
April 2009
173 / 323
Convex constraints and interior-point methods
Trust-region method for convex constraints
How far can we turn the handle?
As above. . .
All limit points are first-order critical, i.e.
$\lim_{k\to\infty} \chi_k = 0$
But . . .
does the active set settle ?
(needed for 2nd-order convergence or rate)
Philippe Toint (Namur)
April 2009
174 / 323
Convex constraints and interior-point methods
Trust-region method for convex constraints
Active constraints identification (1)
Require further assumptions: let $\mathcal{L}_* = \{\text{limit points of } \{x_k\}\}$
$\forall x_* \in \mathcal{L}_*$, $\{\nabla_x c_i(x_*)\}_{i \in \mathcal{A}(x_*)}$ are linearly independent
$\forall x_* \in \mathcal{L}_*$, $-\nabla_x f(x_*) \in \mathrm{ri}\{\mathcal{N}(x_*)\}$
$\forall k$, $\mathcal{A}(x_k^{GC}) \subseteq \mathcal{A}(x_k + s_k)$
For each connected component of limit points $\mathcal{L}(x_*) \subseteq \mathcal{L}_*$,
there exists a set $\mathcal{A}_* \subseteq \{1, \ldots, m\}$ for which
$\mathcal{A}(x_*) = \mathcal{A}_*$ for all $x_* \in \mathcal{L}(x_*)$.
Idea: connectivity + uniqueness of Lagrange multipliers
each L(x ) belongs to a single facet of C
Philippe Toint (Namur)
April 2009
175 / 323
Convex constraints and interior-point methods
Trust-region method for convex constraints
Active constraints identification (2)
There exists a $\delta \in (0, 1)$ such that
$\mathrm{dist}(x_*, \mathcal{L}') \ge \delta$
for every $x_* \in \mathcal{L}_*$ and each compact connected component of
limit points $\mathcal{L}'$ such that $\mathcal{A}(\mathcal{L}') \neq \mathcal{A}(x_*)$.
Idea: continuity + compactness $\Rightarrow$ well separated
There exist $\epsilon \in (0, \tfrac14\delta)$, $\nu \in (0, 1)$, and $k_1 \ge 0$ such that, for
$k \ge k_1$, there is a $\mathcal{L}_k$ such that
$x_k \in \mathcal{V}(\mathcal{L}_k, \epsilon) = \{x \in \mathrm{I\!R}^n \mid \mathrm{dist}(x, \mathcal{L}_k) \le \epsilon\}$
and
$\mathcal{A}(x) \subseteq \mathcal{A}(\mathcal{L}_k)$ for all $x \in \mathcal{V}(\mathcal{L}_k, \epsilon)$.
Idea: partition the complete sequence into convergent subsequences
each xk near a unique Lk
Philippe Toint (Namur)
April 2009
176 / 323
Convex constraints and interior-point methods
Trust-region method for convex constraints
Active constraints identification (3)
There exists $k_2 \ge k_1$ such that, if for some $k \ge k_2$,
$j \in \mathcal{A}(\mathcal{L}_k)$ and $j \notin \mathcal{A}(x_k^{GC})$,
then, for some $\kappa \in (0, 1)$ independent of $k$ and $j$,
$\Delta_k \le \kappa\,\epsilon$.
Idea: complicated (uses criticality measures for incomplete constraint sets)
incomplete local $\mathcal{A}(x_k)$ implies not critical
(more technical arguments here)
There exists an active set $\mathcal{A}_*$, such that
$\mathcal{A}(x_*) = \mathcal{A}_*$ for all $x_* \in \mathcal{L}_*$
and, for all k sufficiently large,
$\mathcal{A}(x_k) = \mathcal{A}(x_k^{GC}) = \mathcal{A}_*$
Philippe Toint (Namur)
April 2009
177 / 323
Convex constraints and interior-point methods
Trust-region method for convex constraints
Further convergence results
. . . and now it works in $T(x_k)$ (now continuous for large $k$) with
$\nabla_{xx} m_k$ replaced by $\nabla_{xx} m_k^\ell \approx \nabla_{xx} \ell(x_k, y_k)$
convergence to isolated critical points
(generalized) eigen-points for the Lagrangian
(needs consistent multiplier estimates!)
convergence to second-order points
fast asymptotic rate of convergence
Philippe Toint (Namur)
April 2009
178 / 323
Convex constraints and interior-point methods
Barriers and interior points
4.2: Barrier methods
Philippe Toint (Namur)
April 2009
179 / 323
Convex constraints and interior-point methods
Barriers and interior points
A simple case
Consider $C = \{x \in \mathrm{I\!R}^n \mid x \ge 0\}$ and build
$\phi^{\rm log}(x, \mu) \stackrel{\rm def}{=} f(x) - \mu\,\langle e, \log(x)\rangle = f(x) - \mu \sum_{i=1}^{n} \log(x_i)$
Under acceptable assumptions,
$x_*(\mu) = \arg\min_x \phi^{\rm log}(x, \mu)$
converge to the solution of the problem
$\min_{x \in C} f(x)$
when $\mu \searrow 0$.
Philippe Toint (Namur)
April 2009
180 / 323
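A minimal numerical illustration of this statement, assuming the simple bound $x \ge 0$ and a crude grid search as the "minimizer"; the toy objective and the helper name `log_barrier` are illustrative choices, not part of the course material.

```python
import numpy as np

def log_barrier(f, x, mu):
    """phi^log(x, mu) = f(x) - mu * sum_i log(x_i) for the constraint x >= 0."""
    if np.any(x <= 0):
        return np.inf                       # the barrier only exists in the interior
    return f(x) - mu * np.sum(np.log(x))

# Toy example: min_{x >= 0} x has its solution at the boundary x = 0.
# The barrier minimizer is x*(mu) = mu, which indeed tends to 0 with mu.
f = lambda x: x[0]
for mu in [1.0, 0.1, 0.01]:
    grid = np.linspace(1e-4, 2.0, 20001).reshape(-1, 1)
    vals = [log_barrier(f, z, mu) for z in grid]
    print(mu, grid[int(np.argmin(vals))][0])   # approximately mu each time
```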
Convex constraints and interior-point methods
Barriers and interior points
How it works. . .
Example: $\min_{x_1, x_2 \ge 0}\; \big[\,120\, x_1^2 (x_1 - 1) - x_2 + 1\,\big]^2 + 10(4 + x_1)^2 - 150$
[Figure: a sequence of contour plots over successive slides — first the original objective function, then the barrier function for $\mu = 50$, $25$, $10$, $5$ and $2$; as $\mu$ decreases, the contours of the barrier function approach those of the original objective while remaining repelled from the boundary $x_1 = 0$, $x_2 = 0$.]
Philippe Toint (Namur)
April 2009
181 / 323
Convex constraints and interior-point methods
Barriers and interior points
Other barriers: reciprocals
$b^{R(p)}(x, \mu) = \mu \sum_{i=1}^{n} \frac{1}{[x]_i^{\,p}}$
[Figure: four contour plots for $\mu = 2$ comparing the log barrier with reciprocal barriers, including $R(1)$ and $R(2)$.]
Philippe Toint (Namur)
April 2009
182 / 323
Convex constraints and interior-point methods
Barriers and interior points
The barrier function
$\phi(x, \mu) \stackrel{\rm def}{=} f(x) + b(x, \mu) = f(x) - \mu\,\langle e, \log(x)\rangle$
Assume:
$b(x, \mu)$ is defined for all $x \in \mathrm{ri}\{C\}$ and all $\mu > 0$, and is
$C^2(\mathrm{ri}\{C\})$ w.r.t. $x$.
$\forall \mu > 0$, $\forall \epsilon > 0$, $\exists\, \kappa_{\rm bbh}(\epsilon, \mu) \ge 1$ such that
$\|\nabla_{xx} b(x, \mu)\| \le \kappa_{\rm bbh}(\epsilon, \mu)$
$\forall x \in C$ such that $\mathrm{dist}(x, \partial C) \ge \epsilon$.
$\lim_{p\to\infty} b(y_p, \mu) = +\infty$ for all $\mu > 0$ and all $\{y_p\}_{p=0}^{\infty}$ such that
$y_p \in \mathrm{ri}\{C\}$ and $\lim_{p\to\infty} \mathrm{dist}(y_p, \partial C) = 0$.
Philippe Toint (Namur)
April 2009
183 / 323
Convex constraints and interior-point methods
Barriers and interior points
An elementary barrier algorithm
Algorithm 4.1: A simple barrier algorithm
Step 0: Initialization. Given: $x_0 \in \mathrm{ri}\{C\}$, $\mu_0 > 0$. Set $k = 0$.
Step 1: Inner minimization. (Approximately) solve the problem
$\min_x \phi(x, \mu_k)$
by applying an unconstrained (inner) algorithm, starting from
a suitable starting point $x_{k,0} \in \mathrm{ri}\{C\}$.
Let $x_{k+1}$ be the corresponding (approximate) solution.
Step 2: Update the barrier parameter. Choose $\mu_{k+1} > 0$ such that
$\lim_{k\to\infty} \mu_k = 0$.
Increment $k$ by one and return to Step 1.
Philippe Toint (Namur)
April 2009
184 / 323
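A rough Python sketch of this outer loop for the bound-constrained case $x \ge 0$, using a naive gradient-descent inner solver with a backtracking line search that keeps the iterates strictly feasible; all tolerances, the shrink factor for $\mu$ and the helper name `barrier_solve` are illustrative assumptions and not the inner algorithm actually advocated in the course.

```python
import numpy as np

def barrier_solve(f, grad_f, x0, mu0=1.0, shrink=0.1, outer_iters=5):
    """Minimal sketch of Algorithm 4.1 for min f(x) s.t. x >= 0."""
    x, mu = x0.astype(float), mu0
    for _ in range(outer_iters):
        for _ in range(200):                       # inner minimization of phi(., mu)
            g = grad_f(x) - mu / x                 # gradient of the barrier function
            if np.linalg.norm(g) <= 1e-8 + mu:     # loose inner stopping test
                break
            phi = lambda z: f(z) - mu * np.sum(np.log(z))
            t = 1.0
            # backtrack: stay strictly feasible (x > 0) and decrease phi (Armijo)
            while np.any(x - t * g <= 0) or phi(x - t * g) > phi(x) - 1e-4 * t * g @ g:
                t *= 0.5
                if t < 1e-16:
                    break
            x = x - t * g
        mu *= shrink                               # Step 2: drive mu towards 0
    return x

# min_{x >= 0} (x1 - 1)^2 + x2 has solution (1, 0)
f = lambda x: (x[0] - 1.0) ** 2 + x[1]
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 1.0])
print(barrier_solve(f, grad_f, np.array([0.5, 0.5])))   # close to [1, 0]
```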
Convex constraints and interior-point methods
Barriers and interior points
A first inner primal algorithm
Algorithm 4.2: Inner primal 1
Step 0: Initialization. Given: $x_{k,0} \in \mathrm{ri}\{C\}$, $\Delta_{k,0}$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\kappa_k \in (0, 1)$.
Compute $\phi(x_{k,0}, \mu_k)$, set $j = 0$.
Step 1: Model definition. Define a model $m_{k,j}$ of $\phi(x_{k,j} + s, \mu_k)$ in $\mathcal{B}_{k,j}$ of the form
$m_{k,j}(x_{k,j} + s) = m^f_{k,j}(x_{k,j} + s) + m^b_{k,j}(x_{k,j} + s).$
Step 2: Step calculation. Compute $s_{k,j}$ that sufficiently reduces $m_{k,j}$ and
such that $x_{k,j} + s_{k,j} \in \mathcal{B}_{k,j}$.
Step 3: Acceptance of the trial point. If $x_{k,j} + s_{k,j} \notin C$ or if $\mathrm{dist}(x_{k,j} + s_{k,j}, \partial C) <$
$\kappa_k\, \mathrm{dist}(x_{k,j}, \partial C)$, set $\rho_{k,j} = -\infty$, $x_{k,j+1} = x_{k,j}$ and go to Step 4.
Otherwise compute $\phi(x_{k,j} + s_{k,j}, \mu_k)$ and
$\rho_{k,j} = \frac{\phi(x_{k,j}, \mu_k) - \phi(x_{k,j} + s_{k,j}, \mu_k)}{m_{k,j}(x_{k,j}) - m_{k,j}(x_{k,j} + s_{k,j})}.$
Then if $\rho_{k,j} \ge \eta_1$, define $x_{k,j+1} = x_{k,j} + s_{k,j}$; otherwise define $x_{k,j+1} = x_{k,j}$.
Step 4: Trust-region radius update. Set
$\Delta_{k,j+1} \in \begin{cases} [\Delta_{k,j}, \infty) & \text{if } \rho_{k,j} \ge \eta_2, \\ [\gamma_2 \Delta_{k,j}, \Delta_{k,j}] & \text{if } \rho_{k,j} \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_{k,j}, \gamma_2 \Delta_{k,j}] & \text{if } \rho_{k,j} < \eta_1. \end{cases}$
Increment $j$ by one and go to Step 1.
Philippe Toint (Namur)
April 2009
185 / 323
Convex constraints and interior-point methods
Barriers and interior points
Models and assumptions
Use separate models for f and b!
$m_{k,j}(x_{k,j} + s) = m^f_{k,j}(x_{k,j} + s) + m^b_{k,j}(x_{k,j} + s)$
Assume:
$\forall k$, $\forall \epsilon > 0$, $\exists\, \kappa_{\rm bbmh}(\epsilon, \mu_k) \ge 1$ such that $\forall k$, $j \ge 0$,
$\|\nabla_{xx} m^b_{k,j}(x, \mu_k)\| \le \kappa_{\rm bbmh}(\epsilon, \mu_k)$
$\forall x \in \mathcal{B}_{k,j} \cap C$ such that $\mathrm{dist}(x, \partial C) \ge \epsilon$.
$\forall k$, $j \ge 0$ and $\forall x \in \mathcal{B}_{k,j} \cap \mathrm{ri}\{C\}$,
$\|\nabla_{xx} m^f_{k,j}(x)\| \le \kappa_{\rm umh}$
Philippe Toint (Namur)
April 2009
186 / 323
Convex constraints and interior-point methods
Barriers and interior points
(Inner) convergence properties
There exists $\kappa_{\rm mdb}(k) \in (0, 1)$ such that
$\mathrm{dist}(x_{k,j}, \partial C) \ge \kappa_{\rm mdb}(k)$
for all j. Moreover, for all j and all x such that $\|x - x_{k,j}\| \le$
$(1 - \kappa_k)\,\mathrm{dist}(x_{k,j}, \partial C)$, we have that
$\|\nabla_{xx} b(x, \mu)\| \le \kappa_{\rm bbh}(\kappa_k\,\kappa_{\rm mdb}(k), \mu_k)$
and
$\|\nabla_{xx} m^b_{k,j}(x_{k,j}, \mu)\| \le \kappa_{\rm bbmh}(\kappa_k\,\kappa_{\rm mdb}(k), \mu_k)$
If $\Delta_{k,j} \le (1 - \kappa_k)\,\kappa_{\rm mdb}(k)$, then
$|\phi(x_{k,j} + s_{k,j}, \mu_k) - m_{k,j}(x_{k,j} + s_{k,j})| \le \kappa_{\rm ubh}(k)\,\Delta^2_{k,j}$
. . . and all the nice convergence properties follow!
Philippe Toint (Namur)
April 2009
187 / 323
Convex constraints and interior-point methods
Barriers and interior points
Constrained Cauchy and eigen-points (1)
Idea: restrict the step, not the trust region!
[Figure: the trust region around $x_{k,j}$ may cross the boundary of $C$; the step is restricted so that $x_{k,j} + s_{k,j}$ keeps a distance of at least $\kappa_k\,\mathrm{dist}(x_{k,j}, \partial C)$ from the boundary, inside $\mathrm{ri}\{C\}$.]
But . . . what of sufficient decrease ???
Philippe Toint (Namur)
April 2009
188 / 323
Convex constraints and interior-point methods
Barriers and interior points
Constrained Cauchy and eigen-points (2)
Redefine the Cauchy arc:
$x^{CC}_{k,j}(t) \stackrel{\rm def}{=} \{x \mid x = x_{k,j} - t\,g_{k,j},\; t \ge 0,\; t\,\|g_{k,j}\| \le (1 - \kappa_k)\,d_{k,j} \text{ and } x \in \mathcal{B}_k\},$
$m_{k,j}(x_{k,j}) - m_{k,j}(x^{CC}_{k,j}) \;\ge\; \tfrac12\, \|g_{k,j}\| \min\!\left[ \frac{\|g_{k,j}\|}{\beta_{k,j}},\, \Delta_{k,j},\, (1 - \kappa_k)\,d_{k,j} \right]$
. . . etc, etc, etc . . .
Philippe Toint (Namur)
April 2009
189 / 323
Convex constraints and interior-point methods
Barriers and interior points
A second inner primal algorithm
Algorithm 4.3: Inner primal 2
Step 0: Initialization. Given: $x_{k,0} \in \mathrm{ri}\{C\}$, $\Delta_{k,0}$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\kappa_k \in (0, 1)$.
Compute $\phi(x_{k,0}, \mu_k)$, set $j = 0$.
Step 1: Model definition.
Define $m_{k,j}(x_{k,j} + s) = m^f_{k,j}(x_{k,j} + s) + m^b_{k,j}(x_{k,j} + s)$.
Step 2: Step calculation. Define $d_{k,j} = \mathrm{dist}(x_{k,j}, \partial C)$. Compute $s_{k,j}$ such that
$x_{k,j} + s_{k,j} \in \mathcal{B}_{k,j} \cap C$ and $\mathrm{dist}(x_{k,j} + s_{k,j}, \partial C) \ge \kappa_k\, d_{k,j}$
and such that it sufficiently reduces $m_{k,j}$.
Step 3: Acceptance of the trial point. Compute $\phi(x_{k,j} + s_{k,j}, \mu_k)$ and
$\rho_{k,j} = \frac{\phi(x_{k,j}, \mu_k) - \phi(x_{k,j} + s_{k,j}, \mu_k)}{m_{k,j}(x_{k,j}) - m_{k,j}(x_{k,j} + s_{k,j})}.$
Then if $\rho_{k,j} \ge \eta_1$, define $x_{k,j+1} = x_{k,j} + s_{k,j}$; otherwise define $x_{k,j+1} = x_{k,j}$.
Step 4: Trust-region radius update. Set
$\Delta_{k,j+1} \in \begin{cases} [\Delta_{k,j}, \infty) & \text{if } \rho_{k,j} \ge \eta_2, \\ [\gamma_2 \Delta_{k,j}, \Delta_{k,j}] & \text{if } \rho_{k,j} \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_{k,j}, \gamma_2 \Delta_{k,j}] & \text{if } \rho_{k,j} < \eta_1. \end{cases}$
Increment $j$ by one and go to Step 1.
Philippe Toint (Namur)
if k,j 2 ,
if k,j [1 , 2 ),
if k,j < 1 .
April 2009
190 / 323
Convex constraints and interior-point methods
Barriers and interior points
The log barrier and its derivatives
Return to:
$\min_{x \ge 0} f(x)$
The log barrier
$b(x, \mu) = -\mu\,\langle e, \log(x)\rangle$
giving
$\phi^{\rm log}(x, \mu) = f(x) - \mu\,\langle e, \log(x)\rangle$
Using the notation $X = \mathrm{diag}(x_1, \ldots, x_n)$, we obtain that
$\nabla_x b(x, \mu) = -\mu X^{-1} e \qquad \text{and} \qquad \nabla_{xx} b(x, \mu) = \mu X^{-2}$
Philippe Toint (Namur)
April 2009
191 / 323
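These two formulas are easy to verify numerically; a small Python sketch comparing the analytic barrier gradient with central differences (the problem data below are chosen arbitrarily for the check).

```python
import numpy as np

mu, n = 0.5, 4
x = np.array([0.3, 1.2, 0.8, 2.0])

b = lambda z: -mu * np.sum(np.log(z))       # barrier term only
grad_b = -mu * (1.0 / x)                     # = -mu * X^{-1} e
hess_b = mu * np.diag(1.0 / x**2)            # = mu * X^{-2}

# finite-difference check of the gradient formula
h, fd = 1e-6, np.zeros(n)
for i in range(n):
    e = np.zeros(n); e[i] = h
    fd[i] = (b(x + e) - b(x - e)) / (2 * h)
print(np.allclose(fd, grad_b, atol=1e-5))    # True
```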
Convex constraints and interior-point methods
Barriers and interior points
The primal log-barrier algorithm
Algorithm 4.4: Primal log-barrier algorithm
Step 0: Initialization. Given: $x_0 > 0$, $\mu_0 > 0$, and the forcing functions $\epsilon^D(\cdot)$ and
$\epsilon^E(\cdot)$. Set $k = 0$.
Step 1: Inner minimization. Choose a value $\kappa_k \in (0, 1)$. Approximately minimize
the log-barrier function $\phi^{\rm log}(x, \mu_k) = f(x) - \mu_k \langle e, \log(x)\rangle$ starting
from $x_k$ and using an inner algorithm in which
$m^b_{k,j}(x_{k,j} + s) = -\mu_k \langle e, \log(x_{k,j})\rangle - \mu_k \langle X^{-1}_{k,j} e, s\rangle + \tfrac12 \mu_k \langle s, X^{-2}_{k,j}\, s\rangle$
Stop this algorithm as soon as an iterate $x_{k,j} = x_{k+1}$ is found for which
$\|\nabla_x f(x_{k+1}) - \mu_k X^{-1}_{k+1} e\| \le \epsilon^D(\mu_k),$
$\lambda_{\min}[\nabla_{xx} f(x_{k+1}) + \mu_k X^{-2}_{k+1}] \ge -\epsilon^E(\mu_k)$
and $x_{k+1} > 0$.
Step 2: Update the barrier parameter. Choose $\mu_{k+1} > 0$ such that
$\lim_{k\to\infty} \mu_k = 0$. Increment $k$ by one and return to Step 1.
Philippe Toint (Namur)
April 2009
192 / 323
Convex constraints and interior-point methods
Barriers and interior points
Convergence of the primal log-barrier algorithm (1)
OK for first order! . . . but existence of limit points not guaranteed
Define
A subsequence {xkj } is consistently active w.r.t. the bounds if,
for each i = 1, . . . , n, either
$\lim_{j\to\infty} [x_{k_j}]_i = 0 \quad\text{or}\quad \liminf_{j\to\infty} [x_{k_j}]_i > 0.$
(Each bound constraint is asymptotically active or inactive for the
complete subsequence.)
$\mathcal{A}\{x_{k_j}\} \stackrel{\rm def}{=} \{i \in \{1, \ldots, n\} \mid \lim_{j\to\infty} [x_{k_j}]_i = 0\}.$
Note: finite number of such subsequences $\Rightarrow$ a partition of $\{x_k\}$
Philippe Toint (Namur)
April 2009
193 / 323
Convex constraints and interior-point methods
Barriers and interior points
Convergence of the primal log-barrier algorithm (2)
Finally,
Under appropriate assumptions,
$\liminf_{k\to\infty}\, [\nabla_x f(x_k)]_i \ge 0, \qquad (i = 1, \ldots, n).$
Furthermore, for every consistently active subsequence $\{x_{k_\ell}\}$,
$\lim_{\ell\to\infty}\, [\nabla_x f(x_{k_\ell})]_i = 0, \qquad (i \notin \mathcal{A}\{x_{k_\ell}\})$
and
$\liminf_{\ell\to\infty}\, \langle u, [\nabla_{xx} f(x_{k_\ell})]\,u\rangle \ge 0$
for each $u$ with $[u]_i = 0$ whenever $i \in \mathcal{A}\{x_{k_\ell}\}$.
Philippe Toint (Namur)
April 2009
194 / 323
Convex constraints and interior-point methods
Barriers and interior points
The primal-dual framework (1)
In practice, as $x_k \searrow 0$, $\nabla_{xx} m_{k,j}(x_{k,j}) + \mu_k X^{-2}_{k,j}$
causes slow progress.
Idea: replace this by
$\nabla_{xx} m_{k,j}(x_{k,j}) + X^{-1}_{k,j} Z_{k,j}$
where $Z_{k,j}$ is a bounded positive diagonal.
Alternatively: KKT conditions for the original problem:
$\nabla_x m(x) - z = 0, \quad XZe = 0, \quad x \ge 0, \quad z \ge 0.$
Perturb:
$\nabla_x m(x) - z = 0, \quad XZe = \mu e, \quad x \ge 0, \quad z \ge 0.$
Philippe Toint (Namur)
April 2009
195 / 323
Convex constraints and interior-point methods
Barriers and interior points
The primal-dual framework (2)
Now write Newton's method for the perturbed problem:
$\nabla_{xx} m_{k,j}(x_{k,j})\,\Delta x_{k,j} - \Delta z_{k,j} = -g_{k,j} + z_{k,j},$
$X_{k,j}\,\Delta z_{k,j} + Z_{k,j}\,\Delta x_{k,j} = \mu_k e - X_{k,j} Z_{k,j} e,$
$x_{k,j} + \Delta x_{k,j} \ge 0, \qquad z_{k,j} + \Delta z_{k,j} \ge 0.$
Substituting the 2nd equation into the 1st:
$\big[\nabla_{xx} m_{k,j}(x_{k,j}) + X^{-1}_{k,j} Z_{k,j}\big]\,\Delta x_{k,j} = -\big[g_{k,j} - \mu_k X^{-1}_{k,j} e\big]$
But
$g_{k,j} - \mu_k X^{-1}_{k,j} e = \nabla_x \phi^{\rm log}(x, \mu_k)$
Hence
$\big[\nabla_{xx} m_{k,j}(x_{k,j}) + X^{-1}_{k,j} Z_{k,j}\big]\,\Delta x_{k,j} = -\nabla_x \phi^{\rm log}(x, \mu_k)$
Philippe Toint (Namur)
April 2009
196 / 323
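A minimal Python sketch of one such primal-dual Newton step, solving the condensed system $[\nabla_{xx} m + X^{-1}Z]\,\Delta x = -(g - \mu X^{-1}e)$ and then recovering $\Delta z$ from the second Newton equation; the function name and the tiny quadratic model are assumptions made for the example.

```python
import numpy as np

def primal_dual_step(H, g, x, z, mu):
    """One Newton step on the perturbed KKT system (sketch):
       [H + X^{-1} Z] dx = -(g - mu X^{-1} e),   dz = X^{-1}(mu e - XZe - Z dx).
    H is the model Hessian and g its gradient at x; z are the dual variables."""
    Xinv = np.diag(1.0 / x)
    Z = np.diag(z)
    M = H + Xinv @ Z                         # condensed primal-dual matrix
    dx = np.linalg.solve(M, -(g - mu / x))
    dz = (mu - x * z - z * dx) / x           # componentwise form of the 2nd equation
    return dx, dz

# tiny quadratic model m(x) = 1/2 x^T H x + c^T x with bound x >= 0
H = np.array([[2.0, 0.0], [0.0, 1.0]])
c = np.array([-1.0, 1.0])
x = np.array([0.5, 0.5]); z = np.array([1.0, 1.0]); mu = 0.1
g = H @ x + c
dx, dz = primal_dual_step(H, g, x, z, mu)
print(dx, dz)
```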
Convex constraints and interior-point methods
Barriers and interior points
The primal-dual inner algorithm (1)
Algorithm 4.5: Inner primal-dual algorithm
Step 0: Initialization. Given: $x_{k,0} \in \mathrm{ri}\{C\}$, $z_{k,0} > 0$, $\Delta_{k,0}$, $\eta_1$, $\eta_2$, $\gamma_1$, $\gamma_2$, $\kappa_k$.
Compute $f(x_{k,0})$, set $j = 0$.
Step 1: Model definition. In $\mathcal{B}_{k,j}$, define
$m_{k,j}(x_{k,j} + s) = m^f_{k,j}(x_{k,j} + s) - \mu_k \langle e, \log(x_{k,j})\rangle - \mu_k \langle X^{-1}_{k,j} e, s\rangle + \tfrac12 \langle s, X^{-1}_{k,j} Z_{k,j}\, s\rangle$
Step 2: Step calculation. Define $d_{k,j} = \mathrm{dist}(x_{k,j}, \partial C)$. Compute a step $s_{k,j}$ such
that $x_{k,j} + s_{k,j} \in \mathcal{B}_{k,j}$, $\mathrm{dist}(x_{k,j} + s_{k,j}, \partial C) \ge \kappa_k\, d_{k,j}$, and
$m_{k,j}(x_{k,j}) - m_{k,j}(x_{k,j} + s_{k,j}) \ge \max\Big\{ \kappa\, \|g_{k,j}\| \min\Big[\frac{\|g_{k,j}\|}{\beta_{k,j}},\, \Delta_{k,j},\, (1-\kappa_k)\,d_{k,j}\Big],\; \kappa\, \epsilon_{k,j} \min\big[\epsilon^2_{k,j},\, \Delta^2_{k,j},\, (1-\kappa_k)^2 d^2_{k,j}\big] \Big\}$
Step 3: Acceptance of the trial point. Compute $\phi^{\rm log}(x_{k,j} + s_{k,j}, \mu_k)$ and
$\rho_{k,j} = \frac{\phi^{\rm log}(x_{k,j}, \mu_k) - \phi^{\rm log}(x_{k,j} + s_{k,j}, \mu_k)}{m_{k,j}(x_{k,j}) - m_{k,j}(x_{k,j} + s_{k,j})}.$
If $\rho_{k,j} \ge \eta_1$, then $x_{k,j+1} = x_{k,j} + s_{k,j}$, else $x_{k,j+1} = x_{k,j}$.
Philippe Toint (Namur)
April 2009
197 / 323
Convex constraints and interior-point methods
Barriers and interior points
The primal-dual inner algorithm (2)
Algorithm 4.6: Inner primal-dual algorithm (2)
Step 4: Trust-region radius update. Set
$\Delta_{k,j+1} \in \begin{cases} [\Delta_{k,j}, \infty) & \text{if } \rho_{k,j} \ge \eta_2, \\ [\gamma_2 \Delta_{k,j}, \Delta_{k,j}] & \text{if } \rho_{k,j} \in [\eta_1, \eta_2), \\ [\gamma_1 \Delta_{k,j}, \gamma_2 \Delta_{k,j}] & \text{if } \rho_{k,j} < \eta_1. \end{cases}$
Step 5: Update the dual variables. Set $z_{k,j+1} > 0$. Increment $j$ by one, go to
Step 1.
Philippe Toint (Namur)
April 2009
198 / 323
Convex constraints and interior-point methods
Barriers and interior points
The primal-dual outer algorithm
Algorithm 4.7: Outer primal-dual algorithm
Step 0: Initialization. Given: $x_0 > 0$, $z_0 > 0$, $\mu_0 > 0$ and the forcing functions
$\epsilon^D(\cdot)$, $\epsilon^E(\cdot)$, $\epsilon^C(\cdot)$. Set $k = 0$.
Step 1: Inner minimization. Choose $\kappa_k \in (0, 1)$. Approximately minimize $\phi^{\rm log}(x, \mu_k)$
from $x_k$ using the primal-dual inner algorithm. Stop as soon as an iterate
$(x_{k,j}, z_{k,j}) = (x_{k+1}, z_{k+1})$ is found for which
$\|\nabla_x f(x_{k+1}) - z_{k+1}\| \le \epsilon^D(\mu_k),$
$\|X_{k+1} Z_{k+1} - \mu_k I\| \le \epsilon^C(\mu_k),$
$\lambda_{\min}[\nabla_{xx} f(x_{k+1}) + X^{-1}_{k+1} Z_{k+1}] \ge -\epsilon^E(\mu_k)$
and $x_{k+1} > 0$ and $z_{k+1} > 0$.
Step 2: Update the barrier parameter. Choose $\mu_{k+1} > 0$ such that $\lim_{k\to\infty} \mu_k = 0$.
Increment $k$ by one and return to Step 1.
Note: choosing $z_{k,j} = \mu_k X^{-1}_{k,j} e$ $\Rightarrow$ primal algorithm!
Philippe Toint (Namur)
April 2009
199 / 323
Convex constraints and interior-point methods
Barriers and interior points
Updating the dual variables
How to compute $z_{k,j+1}$ in practice? Newton's equations give
$\bar z_{k,j+1} = \mu_k X^{-1}_{k,j} e - X^{-1}_{k,j} Z_{k,j}\, s_{k,j}.$
. . . but what about $z_{k,j+1} \ge 0$?
Define the box
$\mathcal{I} = \Big[ \kappa_{\rm zul} \min\big[e,\; z_{k,j},\; \mu_k X^{-1}_{k,j+1} e\big],\; \kappa_{\rm zuu} \max\big[e,\; z_{k,j},\; \kappa_k^{-1} e,\; \mu_k X^{-1}_{k,j+1} e\big] \Big]$
and choose
$z_{k,j+1} = \begin{cases} P_{\mathcal{I}}[\bar z_{k,j+1}] & \text{if } x_{k,j+1} = x_{k,j} + s_{k,j}, \\ z_{k,j} & \text{if } x_{k,j+1} = x_{k,j}. \end{cases}$
Philippe Toint (Namur)
April 2009
200 / 323
Convex constraints and interior-point methods
Barriers and interior points
Properties of the dual variables
Then $z_{k,j+1} > 0$ and
$[z_{k,j}]_i \le \kappa_{\rm uzi} \max\!\left[ \frac{1}{[x_{k,j}]_i},\, 1 \right].$
If, furthermore,
$\lim_{j\to\infty} \|s_{k,j}\| = 0$ when $\lim_{j\to\infty} \|g_{k,j}\| = 0$
then
$\lim_{j\to\infty} \big\| z_{k,j} - \mu_k X^{-1}_{k,j} e \big\| = 0 \quad\text{if}\quad \lim_{j\to\infty} \|g_{k,j}\| = 0.$
$\Rightarrow$ asymptotically exact barrier Hessian for fixed $\mu$
Philippe Toint (Namur)
April 2009
201 / 323
Convex constraints and interior-point methods
Barriers and interior points
Scaling of the inner iterations
In practice, scaling is crucial!
Ideally,
$\|h\|_{k,j} = \|h\|_{\nabla_{xx} m_{k,j}(x_{k,j})} = \sqrt{\big\langle h,\, \big[H_{k,j} + X^{-1}_{k,j} Z_{k,j}\big]\, h\big\rangle}$
Under the usual assumptions, $\|\cdot\|_{k,j}$ is uniformly equivalent to
the Euclidean norm for fixed $k$.
[Figure: with this scaling, the ellipsoidal trust region around $x_{k,j}$ adapts to the barrier curvature and the step to $x_{k,j} + s_{k,j}$ stays in $\mathrm{ri}\{C\}$.]
$\Rightarrow$ all usual convergence properties for fixed $k$
Philippe Toint (Namur)
April 2009
202 / 323
Convex constraints and interior-point methods
Barriers and interior points
Scaling of the outer iterations (1)
Scaled tests:
$\|\nabla_x f(x_{k+1}) - z_{k+1}\|_{[k+1]} \le \epsilon^D(\mu_k),$
$\|X_{k+1} Z_{k+1} - \mu_k I\|_2 \le \epsilon^C(\mu_k),$
$\lambda_{\min}\Big[ M^{-\frac12}_{k+1} \big(\nabla_{xx} f(x_{k+1}) + X^{-1}_{k+1} Z_{k+1}\big) M^{-\frac12}_{k+1} \Big] \ge -\epsilon^E(\mu_k),$
with
$M_{k+1} \stackrel{\rm def}{=} H_{k+1} + X^{-1}_{k+1} Z_{k+1}$
But this matrix is unbounded when $k \nearrow \infty$!
Philippe Toint (Namur)
April 2009
203 / 323
Convex constraints and interior-point methods
Barriers and interior points
Scaling of the outer iterations (2)
Fortunately,
Under the usual assumptions, the convergence properties are
preserved if
$\lim_{k\to\infty} \frac{\epsilon^D(\mu_k)}{\mu_k} = 0 \qquad\text{and}\qquad \lim_{k\to\infty} \frac{\epsilon^C(\mu_k)\,\mu_k}{\min_i\, [x_{k+1}]_i} = 0.$
Moreover
If exact derivatives are used, the $\epsilon(\mu_k)$ can be chosen to ensure a
componentwise near-quadratic rate of convergence.
This is quite remarkable!
Philippe Toint (Namur)
April 2009
204 / 323
Convex constraints and interior-point methods
Barriers and interior points
Barriers for general convex constraints
Now,
$\phi^{\rm log}(x, \mu) = f(x) - \mu\,\langle e, \log(c(x))\rangle$
The primal-dual model becomes
$m_{k,j}(x_{k,j} + s_{k,j}) = m^f_{k,j}(x_{k,j} + s_{k,j}) + m^b_{k,j}(x_{k,j} + s_{k,j}),$
with
$m^b_{k,j}(x_{k,j} + s_{k,j}) = -\mu_k\,\langle e, \log(c(x_{k,j}))\rangle - \mu_k\,\langle C^{-1}(x_{k,j})e,\; A(x_{k,j})s_{k,j}\rangle$
$\qquad + \tfrac12 \langle A(x_{k,j})s_{k,j},\; [C^{-1}(x_{k,j})Y_{k,j}]\,A(x_{k,j})s_{k,j}\rangle - \tfrac12 \sum_{i=1}^{m} [y_{k,j}]_i\, \langle s_{k,j},\, \nabla_{xx} c_i(x_{k,j})\, s_{k,j}\rangle$
Quite a mouthful. . . but otherwise everything is OK!
Philippe Toint (Namur)
April 2009
205 / 323
Convex constraints and interior-point methods
Barriers and interior points
The outer primal-dual algorithm for convex constraints
$\nabla_{xx} \ell(x_{k,j}, y_{k,j}) = \nabla_{xx} f(x_{k,j}) - \sum_{i=1}^{m} [y_{k,j}]_i\, \nabla_{xx} c_i(x_{k,j}), \qquad G_{k,j} \stackrel{\rm def}{=} A^T(x_{k,j})\, C^{-1}(x_{k,j})\, Y_{k,j}\, A(x_{k,j})$
Algorithm 4.8: Primal-dual algorithm for convex constraints
Step 0: Initialization. Given: $x_0$ with $c(x_0) > 0$, $y_0 > 0$, $\mu_0 > 0$, $\epsilon^C(\cdot)$, $\epsilon^D(\cdot)$ and $\epsilon^E(\cdot)$.
Set $k = 0$.
Step 1: Inner minimization. Choose $\kappa_k \in (0, 1)$. Approximately minimize
$\phi^{\rm log}(x, \mu_k) = f(x) - \mu_k\,\langle e, \log(c(x))\rangle$
from $x_k$. Stop as soon as $(x_{k,j}, y_{k,j}) = (x_{k+1}, y_{k+1})$ is found such that
$\|\nabla_x f(x_{k+1}) - A^T(x_{k+1})\,y_{k+1}\| \le \epsilon^D(\mu_k),$
$\|C(x_{k+1})\,Y_{k+1}\,e - \mu_k e\| \le \epsilon^C(\mu_k),$
$\lambda_{\min}[\nabla_{xx} \ell(x_{k+1}, y_{k+1}) + G_{k+1}] \ge -\epsilon^E(\mu_k)$
and
$(c(x_{k+1}), y_{k+1}) \ge 0.$
Step 2: Update the barrier parameter. Choose $\mu_{k+1} > 0$ such that
$\lim_{k\to\infty} \mu_k = 0$. Increment $k$ by one and return to Step 1.
Philippe Toint (Namur)
April 2009
206 / 323
Convex constraints and interior-point methods
Bibliography
Bibliography for lesson 4 (1)
1
J. V. Burke,
On the identification of active constraints,
SIAM Journal on Numerical Analysis, 25(5):1197-1211, 1988.
J. V. Burke,
On the identification of active constraints II: the nonconvex case,
SIAM Journal on Numerical Analysis, 27(4):1081-1102, 1990.
J. V. Burke and J. J. Moré,
Exposing Constraints,
SIAM Journal on Optimization, 4(3):573-595, 1994.
J. V. Burke, J. J. Moré and G. Toraldo,
Convergence properties of trust region methods for linear and convex constraints,
Mathematical Programming A, 47(3):305-336, 1990.
A. R. Conn, N. I. M. Gould, D. Orban and Ph. L. Toint,
A primal-dual trust-region algorithm for minimizing a non-convex function subject to bound and linear equality
constraints,
Mathematical Programming, 87(2):215-249, 2000.
N. I. M. Gould, D. Orban, A. Sartenaer and Ph. L. Toint,
Superlinear Convergence of Primal-Dual Interior Point Algorithms for Nonlinear Programming,
SIAM Journal on Optimization, 11(4):974-1002, 2001.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
Global convergence of a class of trust region algorithms for optimization with simple bounds,
SIAM Journal on Numerical Analysis, 25(182):433-460, 1988.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A),
Springer Verlag, Heidelberg, 1992.
A. R. Conn, N. I. M. Gould, A. Sartenaer and Ph. L. Toint,
Global convergence of a class of trust region algorithms for optimization using inexact projections on convex
constraints,
SIAM Journal on Optimization, 3(1):164-221, 1993.
Philippe Toint (Namur)
April 2009
207 / 323
Convex constraints and interior-point methods
Bibliography
Bibliography for lesson 4 (2)
10 A. V. Fiacco and G. P. McCormick,
The Sequential Unconstrained Minimization Technique for Nonlinear Programming: a Primal-Dual Method,
Management Science, 10(2):360-366, 1964.
11 A. V. Fiacco and G. P. McCormick,
Nonlinear Programming: Sequential Unconstrained Minimization Techniques,
Wiley and Sons, Chichester (UK), 1968.
12 Ph. L. Toint,
Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space,
IMA Journal of Numerical Analysis, 8(2):231-252, 1988.
Philippe Toint (Namur)
April 2009
208 / 323
The use of problem structure for large-scale applications
Lesson 5:
Sparsity, partial separability
and multilevel methods:
exploiting problem structure
Philippe Toint (Namur)
April 2009
209 / 323
The use of problem structure for large-scale applications
Outline
Sparsity and partial separability
Multilevel problems
Bibliography
Philippe Toint (Namur)
April 2009
210 / 323
The use of problem structure for large-scale applications
Sparsity
5.1: Sparsity and
partial separability
Philippe Toint (Namur)
April 2009
211 / 323
The use of problem structure for large-scale applications
Sparsity
Sparsity
A matrix is sparse when the proportion and/or distribution of
its zero entries allows its efficient numerical usage
An (oriented) graph is associated with every sparse (non)symmetric matrix
Philippe Toint (Namur)
April 2009
212 / 323
The use of problem structure for large-scale applications
Sparsity
Main benefits of sparsity
Sparsity and optimization: sparse Hessian (and) Jacobian matrices
very important time/space savings in solving Newton's equations
(unconstrained or KKT)
1
2
factorizations (reduced fill-in)
iterative methods (fast matrix-vector products)
sometimes important in approximations schemes
1
2
3
derivative-free methods (makes the number of function evaluations
linear in the number of variables)
finite-difference approximations
quasi Newton methods
a path for parallel computations
exploiting sparsity = an active algorithmic industry!
Philippe Toint (Namur)
April 2009
213 / 323
The use of problem structure for large-scale applications
Sparsity
The Curtis-Powell-Reid algorithm for estimating sparse Jacobians
Finite differences for a Jacobian column:
$J e_i \approx \frac{c(x + h e_i) - c(x)}{h}$
Question: How many finite differences for estimating a $5 \times 5$ Jacobian
with the structure shown below?
[Figure: a 5x5 sparsity pattern, highlighted column group by column group over successive slides; columns whose nonzeros lie in different rows can share a single difference.]
$J e_1, J e_4 \approx \frac{c(x + h e_1 + h e_4) - c(x)}{h}, \qquad J e_2, J e_3 \approx \frac{c(x + h e_2 + h e_3) - c(x)}{h}, \qquad J e_5 \approx \frac{c(x + h e_5) - c(x)}{h}$
Answer: 3 finite differences!
Curtis, Powell and Reid (1974), Steihaug et al.
Philippe Toint (Namur)
April 2009
214 / 323
The use of problem structure for large-scale applications
Sparsity
The CPR algorithm for estimating sparse Jacobians
Algorithm 5.1: CPR algorithm
Build the column groups.
Place the columns in as few groups as possible such that
two columns in the same group have their nonzero entries in
different rows
Estimate the finite differences.
1. Build a difference vector $h = \sum_{i \in \text{group}} h_i e_i$
2. Compute $v = c(x + h) - c(x)$
Reconstruct the Jacobian.
$J_{ij} \approx \frac{v_i}{h_j}$ for all $j$ in the group and all rows $i$ with a nonzero in column $j$
Philippe Toint (Namur)
April 2009
215 / 323
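A compact Python sketch of the CPR idea — a greedy first-fit grouping of the columns followed by one finite difference per group — tested on a tridiagonal Jacobian; the helper names and the test function are invented for the illustration, and the greedy grouping is only a heuristic, not necessarily optimal.

```python
import numpy as np

def cpr_groups(pattern):
    """Greedy grouping: a column joins the first group whose used rows do not
    intersect its own nonzero rows (first-fit colouring heuristic)."""
    n = pattern.shape[1]
    groups, rows_used = [], []
    for j in range(n):
        rows_j = set(np.nonzero(pattern[:, j])[0])
        for grp, used in zip(groups, rows_used):
            if not (rows_j & used):
                grp.append(j); used |= rows_j
                break
        else:
            groups.append([j]); rows_used.append(set(rows_j))
    return groups

def cpr_jacobian(c, x, pattern, h=1e-7):
    """Estimate a sparse Jacobian with one difference vector per column group."""
    m, n = pattern.shape
    J, c0 = np.zeros((m, n)), c(x)
    for group in cpr_groups(pattern):
        step = np.zeros(n); step[group] = h
        v = (c(x + step) - c0) / h
        for j in group:                      # each row of the group is owned by one column
            rows = np.nonzero(pattern[:, j])[0]
            J[rows, j] = v[rows]
    return J

# small test: tridiagonal Jacobian of c(x)_i = x_i^2 + x_{i-1} + x_{i+1}
def c(x):
    y = x**2
    y[1:] += x[:-1]; y[:-1] += x[1:]
    return y

n = 6
pattern = (np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= 1).astype(int)
x = np.linspace(1.0, 2.0, n)
print(len(cpr_groups(pattern)), np.round(cpr_jacobian(c, x, pattern), 4))
# 3 groups suffice for a tridiagonal pattern, whatever the dimension
```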
The use of problem structure for large-scale applications
Sparsity
A graph colouring interpretation
Consider the intersection graph for the columns:
2
Philippe Toint (Namur)
April 2009
216 / 323
The use of problem structure for large-scale applications
Sparsity
A graph colouring interpretation
Consider the intersection graph for the columns:
2
Philippe Toint (Namur)
April 2009
216 / 323
The use of problem structure for large-scale applications
Sparsity
A graph colouring interpretation
Consider the intersection graph for the columns:
2
Philippe Toint (Namur)
April 2009
216 / 323
The use of problem structure for large-scale applications
Sparsity
A graph colouring interpretation
Consider the intersection graph for the columns:
2
minimize the number of colours,
such that adjacent nodes have different colours
can build column groups using heuristic algorithms for graph colouring
Coleman and More, (1983)
Philippe Toint (Namur)
April 2009
216 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C?
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
Philippe Toint (Namur)
C
C
C
C
C
C
C
C
C
A
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (1)
Question: How many finite differences for estimating a 8 8 symmetric
Hessian with the structure:
B
B
B
B
B
B
B
B
B
@
C
C
C
C
C
C
C
C
C
A
Exploiting symmetry in CPR ( a direct method)
Powell and T (1979), Coleman and More (1984)
Philippe Toint (Namur)
April 2009
217 / 323
The use of problem structure for large-scale applications
Sparsity
Estimating sparse Hessians (2)
Question: Can we do better?
[Figure: the same 8x8 symmetric pattern; over successive slides CPR is applied to the lower triangular part only, the conflicts with the upper triangular part are highlighted, and the corresponding entries are recovered by substitution.]
Apply CPR on the lower triangular part of the Hessian
But what about the conflicts with the upper triangular part?
A more efficient substitution method. . .
Powell and T. (1979), Coleman and Moré (1984) for a graph interpretation
Philippe Toint (Namur)
April 2009
218 / 323
The use of problem structure for large-scale applications
Sparsity
Optimized version for PDE stencils
Example: the 5-point Laplacian operator in 2D
(non-symmetric and symmetric)
Philippe Toint (Namur)
April 2009
219 / 323
The use of problem structure for large-scale applications
Sparsity
Partial separability
A more geometric concept:
Griewank and T. (1982)
f(x) is partially separable iff

    f(x) = ∑_{i=1}^{p} f_i(U_i x)    where the matrices U_i are of low rank

if U_i = disjoint columns of the identity matrix ⟹ (totally) separable
common case: U_i = overlapping columns of the identity matrix ⟹

    f(x) = ∑_{i=1}^{p} f_i(x_{S_i})

Vocabulary:
element functions, element variables, internal variables u_i = U_i x
Philippe Toint (Namur)
April 2009
220 / 323
The use of problem structure for large-scale applications
Sparsity
Sources and examples of partially separable functions
Example 1:
    f(x_1, ..., x_5) = f_1(x_1, x_2) + f_2(x_2, x_3, x_4) + f_3(x_4, x_5)
Example 2:
    f(x_1, ..., x_5) = f_1(3x_1 + x_2) + f_2(2x_2 + x_3 − 2x_4, x_4 + 3x_5)
                            \_______/       \_____________/  \________/
                               u_1                u_2            u_3
Sources:
(nearly) all discretized problems
most problems in econometric modelling,
. . . and a lot more because. . .
Philippe Toint (Namur)
April 2009
221 / 323
The use of problem structure for large-scale applications
Sparsity
Properties of partially separable functions
If f(x) has a sparse Hessian matrix and is sufficiently smooth,
then it is partially separable
(but not conversely: ex: f(x_1, ..., x_n) = ∑_{i=1}^{n} f_i(x_i) + f_{n+1}(x_1 + ... + x_n))

If f(x) = ∑_{i=1}^{p} f_i(U_i x) = ∑_{i=1}^{p} f_i(u_i), then

    ∇_x f(x) = ∑_{i=1}^{p} U_iᵀ ∇ f_i(u_i)

    ∇_xx f(x) = ∑_{i=1}^{p} U_iᵀ ∇² f_i(u_i) U_i

(easy to compute, sparsity determined by the U_i)
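To illustrate (my own sketch, not from the course): with the U_i taken as plain index
selections, the gradient and the sparse Hessian are obtained by scattering the small element
gradients and Hessians into the full vector and matrix, exactly as in the two formulas above.
The element functions below are invented for the example.

```python
import numpy as np

# Each element f_i sees only its internal variables u = U_i x (here U_i is a plain
# selection of the listed indices) and returns value, gradient and Hessian in u.
def elem1(u):   # (u0 - u1)^2
    d = u[0] - u[1]
    return d*d, np.array([2*d, -2*d]), np.array([[2., -2.], [-2., 2.]])

def elem2(u):   # u0*u2 + u1^2
    return (u[0]*u[2] + u[1]**2,
            np.array([u[2], 2*u[1], u[0]]),
            np.array([[0., 0., 1.], [0., 2., 0.], [1., 0., 0.]]))

elements = [([0, 1], elem1), ([1, 2, 3], elem2)]

def assemble(x):
    """f = sum f_i(U_i x),  grad f = sum U_i^T g_i,  Hess f = sum U_i^T H_i U_i."""
    n = x.size
    f, g, H = 0.0, np.zeros(n), np.zeros((n, n))
    for idx, elem in elements:
        fi, gi, Hi = elem(x[idx])
        f += fi
        g[idx] += gi                    # scatter element gradient
        H[np.ix_(idx, idx)] += Hi       # scatter element Hessian block
    return f, g, H

f, g, H = assemble(np.array([1.0, 2.0, 0.5, -1.0]))
print(f); print(g); print(H)            # H is zero outside the element overlaps
```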
Philippe Toint (Namur)
April 2009
222 / 323
The use of problem structure for large-scale applications
Sparsity
The three-point Laplacian operator
On a regular geometric grid:

    ( 2 -1       )   ( 1       )   (  1 -1      )   (        )   (          )   (       )
    (-1  2 -1    ) = (         ) + ( -1  1      ) + (  1 -1  ) + (          ) + (       )
    (   -1  2 -1 )   (         )   (            )   ( -1  1  )   (   1  -1  )   (       )
    (      -1  2 )   (         )   (            )   (        )   (  -1   1  )   (     1 )

Note: Sum of rank-one submatrices (u_i = x_{i+1} − x_i)!
Philippe Toint (Namur)
April 2009
223 / 323
The use of problem structure for large-scale applications
Sparsity
Using the partially separable structure
Very useful for:
quasi-Newton Hessian matrix = sum of elementwise quasi-Newton
low-rank submatrices (partitioned updating),
elementwise models in DFO (number of function evaluations only
dependent on the maximum number of internal variables!),
optimally efficient finite-difference approximations,
(structured trust-regions),
expressing large-scale models.
LANCELOT based on an extension of this concept
Philippe Toint (Namur)
April 2009
224 / 323
The use of problem structure for large-scale applications
Sparsity
Exploitation of the computational tree
Idea: use the computational tree for f(x) for solving Newton's equations
use chain-rule at the top of the computational tree
multiplicative decompositions (and partially separable)
often available from the problem modelling itself
Substantial computational gains
unpublished (?) by T. Coleman (2008)
Philippe Toint (Namur)
April 2009
225 / 323
The use of problem structure for large-scale applications
Multilevel problems
5.3: Multilevel problems
Philippe Toint (Namur)
April 2009
226 / 323
The use of problem structure for large-scale applications
Multilevel problems
Multilevel Optimization: The Problem
    min_{x ∈ IRⁿ} f(x)

f : IRⁿ → IR nonlinear, C² and bounded below
No convexity assumption
Results from the discretization of some infinite-dimensional problem
on a relatively fine grid for instance (n large)
Iterative search of a first-order critical point x* (s.t. ∇f(x*) = 0)
Philippe Toint (Namur)
April 2009
227 / 323
The use of problem structure for large-scale applications
Multilevel problems
Hierarchy of problem descriptions
Assume now that a hierarchy of problem descriptions is available, linked by
known operators
Finest problem description
Restriction R
P Prolongation
Fine problem description
Restriction R
P Prolongation
...
Restriction R
P Prolongation
Coarse problem description
Restriction R
P Prolongation
Coarsest problem description
Philippe Toint (Namur)
April 2009
228 / 323
The use of problem structure for large-scale applications
Multilevel problems
Grid transfer operators
Restriction    R_i : IR^{n_i} → IR^{n_{i−1}}
Prolongation   P_i : IR^{n_{i−1}} → IR^{n_i}
with R_i = P_iᵀ
Philippe Toint (Namur)
April 2009
229 / 323
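As a small illustration (mine, for the usual 1-D setting): the prolongation below is linear
interpolation between nested grids and the restriction is taken as its transpose, as on the
slide.

```python
import numpy as np

def prolongation(n_coarse):
    """Linear interpolation from a coarse 1-D grid with n_coarse interior
    points to the fine grid with n_fine = 2*n_coarse + 1 interior points."""
    n_fine = 2 * n_coarse + 1
    P = np.zeros((n_fine, n_coarse))
    for j in range(n_coarse):
        i = 2 * j + 1                 # fine-grid point sitting on coarse point j
        P[i, j] = 1.0
        P[i - 1, j] += 0.5            # neighbours get half the coarse value
        P[i + 1, j] += 0.5
    return P

P = prolongation(3)                   # maps IR^3 -> IR^7
R = P.T                               # restriction taken as the transpose of P
x_coarse = np.array([1.0, 2.0, 0.0])
print(P @ x_coarse)                   # interpolated fine-grid vector
print(R @ (P @ x_coarse))             # back on the coarse grid
```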
The use of problem structure for large-scale applications
Multilevel problems
Sources for such problems
Parameter estimation in
discretized ODEs
discretized PDEs
Optimal control problems
Optimal surface design (shape optimization)
Data assimilation in weather forecast (different levels of physics in the
models)
Philippe Toint (Namur)
April 2009
230 / 323
The use of problem structure for large-scale applications
Multilevel problems
The minimum surface problem
    min_v ∫₀¹ ∫₀¹ ( 1 + (∂_x v)² + (∂_y v)² )^{1/2} dx dy

with the boundary conditions:

    v = f(x),  y = 0, 0 ≤ x ≤ 1        v = 0,  x = 0, 0 ≤ y ≤ 1
    v = f(x),  y = 1, 0 ≤ x ≤ 1        v = 0,  x = 1, 0 ≤ y ≤ 1

where f(x) = x(1 − x)

Discretization using a finite element basis
Philippe Toint (Namur)
April 2009
231 / 323
The use of problem structure for large-scale applications
Multilevel problems
The solution at different levels
[Figures: converged solutions at levels 1 to 6, on grids with
 n = 3² = 9, 7² = 49, 15² = 225, 31² = 961, 63² = 3969 and 127² = 16129 variables]
Philippe Toint (Namur)
April 2009
232 / 323
The use of problem structure for large-scale applications
Multilevel problems
The main issue
Hierarchy of problem descriptions + globalization technique
⟹ Efficiency & Robustness
Illustration within a trust-region framework
(Unconstrained case)
Philippe Toint (Namur)
April 2009
233 / 323
The use of problem structure for large-scale applications
Multilevel problems
Past and recent developments
Line-search
Fisher (1998), Frese-Bouman-Sauer (1999), Nash (2000)
Lewis-Nash (2000, 2002)
Oh-Milstein-Bouman-Webb (2003)
Wen-Goldfarb (2007, 2008)
Gratton-T (2007)
Trust-region
Gratton-Sartenaer-T (2006, 2008)
Gratton-Mouffe-T-Weber Mendonca (2009)
Gratton-Mouffe-Sartenaer-T-Tomanos (2009)
T-Tomanos-Weber Mendonca (2009)
Gross-Krause (2008)
Philippe Toint (Namur)
April 2009
234 / 323
The use of problem structure for large-scale applications
Multilevel problems
On the side of multigrid methods
Consider the linear system (discrete Poisson equation, for instance):
    Ax = b      with exact solution x*, current approximation x̃

where
    e = x* − x̃        (error)
    r = b − A x̃       (residual)

    Ae = r             (residual equation)
Expansion of e in Fourier modes shows high (oscillatory) and low (smooth)
frequency components:
[Figure: Fourier modes on the grid, a smooth mode (k = 3), an oscillatory mode (k = 24),
 and their superposition (k = 3 and k = 24)]
Philippe Toint (Namur)
April 2009
235 / 323
The use of problem structure for large-scale applications
Multilevel problems
Relaxation methods
Basic iterative methods:
correct the i-th component of x_k in the order i = 1, 2, ..., n
to annihilate the i-th component of r_k

Jacobi:
    [x_{k+1}]_i = (1/a_ii) ( [b]_i − ∑_{j≠i} a_ij [x_k]_j )

Gauss-Seidel:
    [x_{k+1}]_i = (1/a_ii) ( [b]_i − ∑_{j<i} a_ij [x_{k+1}]_j − ∑_{j>i} a_ij [x_k]_j )
Solve the equations of the linear system one by one
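A minimal sketch (not from the course) of a Gauss-Seidel sweep used as a smoother on the
1-D discrete Poisson system; the grid size and the oscillatory starting error are arbitrary
choices for the illustration.

```python
import numpy as np

def gauss_seidel_sweep(A, b, x):
    """One in-place Gauss-Seidel sweep: solve equation i for x_i, i = 1..n,
    using already-updated components for j < i and old ones for j > i."""
    n = len(b)
    for i in range(n):
        s = A[i, :i] @ x[:i] + A[i, i+1:] @ x[i+1:]
        x[i] = (b[i] - s) / A[i, i]
    return x

# 1-D discrete Poisson (three-point Laplacian), oscillatory initial error
n = 63
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n)                                 # exact solution is 0, so x is the error
x = np.sin(24*np.pi*np.arange(1, n+1)/(n+1))    # high-frequency mode k = 24
for sweep in range(10):
    x = gauss_seidel_sweep(A, b, x)
print(np.linalg.norm(x, np.inf))                # the oscillatory error is damped quickly
```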
Philippe Toint (Namur)
April 2009
236 / 323
The use of problem structure for large-scale applications
Multilevel problems
Smoothing effect
Very effective methods at smoothing, i.e., eliminating the
high-frequency (oscillatory) components of the error:
[Figures: error of the initial guess, error after 10 GS iterations, error after 100 GS iterations]
But they leave the low-frequency (smooth) components relatively
unchanged
Philippe Toint (Namur)
April 2009
237 / 323
The use of problem structure for large-scale applications
Multilevel problems
Multigrid in linear algebra
Assume now (two levels):
A fine grid (f) description:       Ae = r  ⟶  A^f e^f = r^f
A coarse grid (c) description:     A^c e^c = r^c
Linked by transfer operators:      A^c = R A^f P,   e^c = R e^f,   r^c = R r^f
Philippe Toint (Namur)
April 2009
238 / 323
The use of problem structure for large-scale applications
Multilevel problems
Coarse grid principle
Smooth error modes on a fine grid
look less smooth on a coarse grid
When relaxation begins to stall at the finer level:
Move to the coarser grid where the smooth error modes appear more
oscillatory
Apply a relaxation at the coarser level:
more efficient
substantially less expensive
Philippe Toint (Namur)
April 2009
239 / 323
The use of problem structure for large-scale applications
Multilevel problems
Two-grid correction scheme
Annihilate oscillatory error level by level:
Fine
smooth
Smooth fine
Smaller fine
R
Oscil. coarse
P
smooth
(recur)
smooth
Smooth coarse
Note: P and R are not orthogonal projectors!
A very efficient method for some linear systems
(when A(smooth modes) ≈ smooth modes)
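A compact sketch (my own) of one two-grid correction cycle on the same 1-D Poisson system:
pre-smooth, restrict the residual, solve the coarse residual equation with the Galerkin
operator R A P, prolongate the correction and post-smooth. The full-weighting scaling of R
and the number of smoothing sweeps are arbitrary choices.

```python
import numpy as np

def smooth(A, b, x, sweeps=3):
    for _ in range(sweeps):                      # Gauss-Seidel smoothing
        for i in range(len(b)):
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

def interpolation(nc):
    P = np.zeros((2*nc + 1, nc))                 # linear interpolation, 1-D grid
    for j in range(nc):
        P[2*j, j] = P[2*j + 2, j] = 0.5
        P[2*j + 1, j] = 1.0
    return P

def two_grid(A, b, x):
    """One two-grid correction cycle: pre-smooth, restrict the residual,
    solve the coarse residual equation A_c e_c = r_c, prolongate, post-smooth."""
    nc = (len(b) - 1) // 2
    P = interpolation(nc)
    R = 0.5 * P.T                                # full-weighting restriction
    x = smooth(A, b, x)
    r = b - A @ x
    Ac = R @ A @ P                               # Galerkin coarse operator
    x = x + P @ np.linalg.solve(Ac, R @ r)       # coarse grid solved exactly here
    return smooth(A, b, x)

n = 63
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n)
x = np.sin(3*np.pi*np.arange(1, n+1)/(n+1))      # smooth error mode k = 3
for cycle in range(3):
    x = two_grid(A, b, x)
print(np.linalg.norm(x, np.inf))                 # reduced far faster than by smoothing alone
```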
Philippe Toint (Namur)
April 2009
240 / 323
The use of problem structure for large-scale applications
Multilevel problems
Does it work?
[Figures: error after k = 0, 10 and 100 iterations, with smoothing on the fine grid only
 (top row) and with the two-grid correction scheme (bottom row)]
Philippe Toint (Namur)
April 2009
241 / 323
The use of problem structure for large-scale applications
Multilevel problems
V-cycle
[Diagram: the V-cycle, with smoothing at each level]
Philippe Toint (Namur)
April 2009
242 / 323
The use of problem structure for large-scale applications
Multilevel problems
W-cycle
[Diagram: the W-cycle, with smoothing at each level]
Philippe Toint (Namur)
April 2009
243 / 323
The use of problem structure for large-scale applications
Multilevel problems
Mesh Refinement
Solve the problem on the coarsest level
Good starting point for the next fine level
Do the same on each level
Good starting point for the finest level
Finally solve the problem on the finest level
Philippe Toint (Namur)
April 2009
244 / 323
The use of problem structure for large-scale applications
Multilevel problems
Full Multigrid Scheme
Combination of Mesh Refinement and V-cycles
Philippe Toint (Namur)
April 2009
245 / 323
The use of problem structure for large-scale applications
Multilevel problems
Return to optimization
Hierarchy of problem descriptions + trust-region technique
⟹ Efficiency & Robustness
⟹ Multilevel optimization method
Note: Multilevel Moré-Sorensen algorithm: (H_k + λI) s = −g_k
T-Tomanos-Weber Mendonca, 2009
Philippe Toint (Namur)
April 2009
246 / 323
The use of problem structure for large-scale applications
Multilevel problems
The framework
Assume that we have:
A hierarchy of problem descriptions of f :
    {f_i}_{i=0}^{r}    with    f_r(x) = f(x)

Transfer operators, for i = 1, ..., r:
    R_i : IR^{n_i} → IR^{n_{i−1}}      (the restriction)
    P_i : IR^{n_{i−1}} → IR^{n_i}      (the prolongation)
Terminology: a particular i is referred to as a level
Philippe Toint (Namur)
April 2009
247 / 323
The use of problem structure for large-scale applications
Multilevel problems
The idea
    min_{x ∈ IRⁿ} f_r(x) = f(x)

at x_k: minimize Taylor's model of f_r around x_k
        in the trust region of radius Δ_k

or (whenever suitable and desirable)

at x_k: compute ∇f_r(x_k) (possibly H_k)  ⟶  trial step s_k
        (Restriction R / P Prolongation)
        use f_{r−1} to construct a coarse local model of f_r
        and minimize it within the trust region of radius Δ_k
If more than two levels are available (r > 1), do this recursively
Philippe Toint (Namur)
April 2009
248 / 323
The use of problem structure for large-scale applications
Multilevel problems
Example of recursion with 5 levels (r = 4)
[Diagram: recursion from level 4 down to level 0 and back, through the restrictions
 R_4, ..., R_1 and the prolongations P_1, ..., P_4]

Notation:
i: level index (0 ≤ i ≤ r)
k: index of the current iteration within level i
Philippe Toint (Namur)
April 2009
249 / 323
The use of problem structure for large-scale applications
Multilevel problems
Construction of the coarse local models
If f_i ≠ 0 for i = 0, ..., r−1:
    Impose first-order coherence via a correction term:
        g_low = R g_up
    Impose second-order coherence(*) via two correction terms:
        g_low = R g_up    and    H_low = R H_up P
    (*) Not needed to derive first-order global convergence
If f_i = 0 for i = 0, ..., r−1:
    Galerkin model: restricted version of the quadratic model at the upper level
Philippe Toint (Namur)
April 2009
250 / 323
The use of problem structure for large-scale applications
Multilevel problems
Preserving the trust-region constraint (1)
[Figure: the lower-level trust region around x_low,k kept inside the upper-level trust
 region centred at x_low,0]

    Δ_low ≤ min [ Δ_low⁺ , Δ_up − ‖x_low,k − x_low,0‖ ]

Note: Motivation to switch to the ∞-norm
Gratton, Sartenaer, T (2008)
Philippe Toint (Namur)
April 2009
251 / 323
The use of problem structure for large-scale applications
Multilevel problems
Preserving the trust-region constraint (2)
In infinity norm:
    Δ_low ≤ min [ Δ_low⁺ , Δ_up − ‖x_low,k − x_low,0‖_∞ ]
Gratton, Mouffe, T, Weber Mendonca (2008)
Philippe Toint (Namur)
April 2009
252 / 323
The use of problem structure for large-scale applications
Multilevel problems
Use the coarse model whenever suitable
When ‖g_low‖ := ‖R g_up‖ ≥ κ ‖g_up‖ for some constant κ ∈ (0, 1)
(coarsening condition)
and
when ‖g_low‖ = ‖R g_up‖ > ε_low
and
when i > 0
Philippe Toint (Namur)
April 2009
253 / 323
The use of problem structure for large-scale applications
Multilevel problems
Use the coarse model whenever desirable
Taylor model (Taylor step)
Coarse model (recursive step)
smoothing
coarsening
&
Alternate for efficiency (multigrid)
Be as flexible as possible
Leave the choice even when the coarse model is suitable
Philippe Toint (Namur)
April 2009
254 / 323
The use of problem structure for large-scale applications
Multilevel problems
Recursive multilevel trust-region algorithm (RMTR)
At iteration k
(until convergence):
Choose either a Taylor or (if suitable) a coarse local model
(first-order coherent):
Taylor model: compute a Taylor step
Coarse local model: apply the algorithm recursively
Evaluate the change in the objective function
If achieved decrease ≥ (a fraction of the) predicted decrease, then
accept the trial point
possibly enlarge the trust region
else
keep the current point
shrink the trust region
Impose current trust region ⊆ upper-level trust region
Philippe Toint (Namur)
April 2009
255 / 323
The use of problem structure for large-scale applications
Multilevel problems
Global convergence
Based on the trust-region technology
Uses the sufficient decrease argument (imposed in Taylor iterations)
Plus the coarsening condition (‖R g_up‖ ≥ κ ‖g_up‖)
Main result:
    lim_{k→∞} ‖g_{r,k}‖ = 0
Gratton, Sartenaer, T (2008)
Philippe Toint (Namur)
April 2009
256 / 323
The use of problem structure for large-scale applications
In more details
Intermediate results
At iteration (i, k) we associate the set:
    R(i, k) := {(j, ℓ) | iteration (j, ℓ) occurs within iteration (i, k)}

[Diagram: the tree of lower-level iterations spawned by an iteration at level 4]
Philippe Toint (Namur)
April 2009
257 / 323
The use of problem structure for large-scale applications
In more details
Let
    V(i, k) := {(j, ℓ) ∈ R(i, k) | δm_{j,ℓ} ≥ κ ‖g_{i,k}‖ Δ_{j,ℓ}}    (sufficient decrease)

Then, at a non-critical point and if the trust region is small enough:
    V(i, k) = R(i, k)
Back to classical trust-region arguments
Philippe Toint (Namur)
April 2009
258 / 323
The use of problem structure for large-scale applications
In more details
Premature termination
For a recursive iteration (i, k):
A minimization sequence at level i 1 initiated at iteration (i, k)
denotes all successive iterations at level i 1
until a return is made to level i
[Diagram: recursion through the levels, with the minimization sequence at level 3
 initiated at iteration (4, k) highlighted]
Philippe Toint (Namur)
April 2009
259 / 323
The use of problem structure for large-scale applications
In more details
Properties of RMTR
Each minimization sequence contains at least one successful iteration
Premature termination in that case does not affect the convergence
results at the upper level
Which allows
Stop a minimization sequence after a preset number of successful
iterations
Use fixed lower-iterations patterns like the V or W cycles in multigrid
methods
Philippe Toint (Namur)
April 2009
260 / 323
The use of problem structure for large-scale applications
In more details
A practical RMTR algorithm: Taylor iterations
At the coarsest level
Solve using the exact Moré-Sorensen method
(small dimension)
At finer levels
Smooth using a smoothing technique from multigrid
(to reduce the high frequency residual/gradient components)
Philippe Toint (Namur)
April 2009
261 / 323
The use of problem structure for large-scale applications
In more details
SCM Smoothing
Adaptation of the Gauss-Seidel smoothing technique to optimization:
Sequential Coordinate Minimization (SCM smoothing)
Successive one-dimensional minimizations of the model
along the coordinate axes when the curvature is positive
Cost: 1 SCM smoothing cycle ≈ 1 matrix-vector product
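A sketch (mine) of one SCM smoothing cycle on a quadratic model m(s) = gᵀs + ½ sᵀHs: each
coordinate with positive curvature is minimized exactly, and keeping the model gradient up
to date column by column makes a full cycle cost roughly one matrix-vector product. The
trust-region and negative-curvature safeguards discussed on the following slides are omitted.

```python
import numpy as np

def scm_cycle(H, g, s):
    """One Sequential Coordinate Minimization cycle on m(s) = g.s + 0.5 s.H.s.
    The running model gradient is updated with one column of H per coordinate,
    so a full cycle costs about one matrix-vector product."""
    grad = g + H @ s
    for i in range(len(g)):
        if H[i, i] > 0:                  # only move along positive curvature
            alpha = -grad[i] / H[i, i]   # exact 1-D minimizer along e_i
            s[i] += alpha
            grad += alpha * H[:, i]      # cheap gradient update
    return s

n = 63
H = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # model Hessian (1-D Laplacian)
g = np.random.default_rng(0).standard_normal(n)
s = np.zeros(n)
for cycle in range(5):
    s = scm_cycle(H, g, s)
print(np.linalg.norm(g + H @ s))         # model gradient norm decreases with the cycles
```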
Philippe Toint (Namur)
April 2009
262 / 323
The use of problem structure for large-scale applications
In more details
Three issues
How to impose sufficient decrease in the model ?
How to impose the trust-region constraint ?
What to do if a negative curvature is encountered ?
Philippe Toint (Namur)
April 2009
263 / 323
The use of problem structure for large-scale applications
In more details
Start the first SCM smoothing cycle
by minimizing along the largest gradient component
(enough to ensure sufficient decrease)
Perform (at most) p SCM smoothing cycles
while inside the trust region (reasonable cost)
Terminate
when an approximate minimizer is found (Stop)
when the trust-region boundary is passed (Stop at the boundary)
when a direction of negative curvature is encountered
(move to the boundary and Stop)
Philippe Toint (Namur)
April 2009
264 / 323
The use of problem structure for large-scale applications
In more details
Convergence to weak minimizers
SCM smoothing limits its exploration of the model's curvature to the
coordinate axes ⟹ only guarantees asymptotic positive curvature:
along the coordinate axes at the finest level (i = r)
along the prolongation of the coordinate axes at levels i = 1, ..., r − 1
along the prolongation of the coarsest subspace (i = 0)
Weak minimizers
Gratton, Sartenaer, T (2006)
Philippe Toint (Namur)
April 2009
265 / 323
The use of problem structure for large-scale applications
Numerical results
Some numerical flavor
Gratton, Mouffe, Sartenaer, T, Tomanos (2009)
All on Finest (AF)
Standard Newton trust-region algorithm (TCG)
Applied at the finest level
Multilevel on Finest (MF)
Algorithm RMTR
Applied at the finest level
Mesh Refinement (MR)
Standard Newton trust-region algorithm (TCG)
Applied successively from coarsest to finest level()
Full Multilevel (FM)
Algorithm RMTR
Applied successively from coarsest to finest level()
()
Starting point at level i + 1 obtained by prolongating the solution at level i
Philippe Toint (Namur)
April 2009
266 / 323
The use of problem structure for large-scale applications
Numerical results
Test problem characteristics
Problem name   n_r        r   Type              Description
DNT            511        8   1-D, quadratic    Dirichlet-to-Neumann transfer problem
P2D            1.046.529  9   2-D, quadratic    Poisson model problem
P3D            250.047    5   3-D, quadratic    Poisson model problem
DEPT           1.046.529  9   2-D, quadratic    Elastic-plastic torsion problem
DPJB?          1.046.529  9   2-D, quadratic    Journal bearing problem
DODC           65.025     7   2-D, convex       Optimal design problem
MINS-SB        1.046.529  9   2-D, convex       Minimum surface problem
MINS-OB        65.025     7   2-D, convex       Minimum surface problem
MINS-DMSA      65.025     7   2-D, convex       Minimum surface problem
IGNISC         65.025     7   2-D, convex       Combustion problem
DSSC           1.046.529  9   2-D, convex       Combustion problem
BRATU          1.046.529  9   2-D, convex       Combustion problem
MINS-BC?       65.025     7   2-D, convex       Minimum surface problem
MEMBR?         393.984    9   2-D, convex       Membrane problem
NCCS           103.050    7   2-D, nonconvex    Optimal control problem
NCCO           103.050    7   2-D, nonconvex    Optimal control problem
MOREBV         1.046.529  9   2-D, nonconvex    Boundary value problem

?: with bound constraints
Philippe Toint (Namur)
April 2009
267 / 323
The use of problem structure for large-scale applications
Numerical results
Performance profiles (CPU time)
[Figure: CPU-time performance profiles for FM, MR, MF and AF]
Philippe Toint (Namur)
April 2009
268 / 323
The use of problem structure for large-scale applications
Numerical results
Zoom on efficiency (CPU time)
[Figure: zoom of the CPU-time performance profiles for FM, MR, MF and AF]
Philippe Toint (Namur)
April 2009
269 / 323
The use of problem structure for large-scale applications
Numerical results
CPU times
Problem name     AF        FM
DNT                 5.2       6.7
P2D              1122.8      26.0
P3D               626.1      28.8
DEPT             1364.4       8.6
DPJB             3600.0      83.6
DODC              894.8      36.0
MINS-SB          3600.0     153.9
MINS-OB          1445.6      27.5
MINS-DMSA        1196.8      18.2
IGNISC           2330.4     398.2
DSSC             3183.8      12.1
BRATU            2314.1      10.1
MINS-BC          2706.4     140.0
MEMBR            1082.0     154.0
NCCS             3600.0     331.9
NCCO             3600.0     224.2
MOREBV           3600.0      41.7

(times in seconds)
Philippe Toint (Namur)
April 2009
270 / 323
The use of problem structure for large-scale applications
Numerical results
A glimpse at the solution process
Philippe Toint (Namur)
April 2009
271 / 323
The use of problem structure for large-scale applications
Bibliography
Bibliography for lesson 5 (1)
1
T. F. Coleman and J. J. Moré,
Estimation of sparse Jacobian matrices and graph coloring problems,
SIAM Journal on Numerical Analysis, 20:187-209, 1983.
T. F. Coleman and J. J. Moré,
Estimation of sparse Hessian matrices and graph coloring problems,
Mathematical Programming, 28:243-270, 1984.
A. R. Conn, N. I. M. Gould and Ph. L. Toint,
LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A),
Springer Verlag, Springer Series in Computational Mathematics 17, Heidelberg, 1992.
A. Curtis, M. J. D. Powell and J. Reid,
On The Estimation of Sparse Jacobian Matrices,
IMA Journal, 13:117-119, 1974.
M. Fisher,
Minimization Algorithms for Variational Data Assimilation,
in Recent Developments in Numerical Methods for Atmospheric Modelling, European Center for Medium-Range
Weather Forecasts, Reading, UK, pp. 364-385, 1998.
T. Frese, Ch. Bouman and K. Sauer,
Multiscale Bayesian Methods for Discrete Tomography,
in Discrete Tomography: Foundations, Algorithms and Applications (G. Herman and A. Kuba, eds.), Birkhauser,
Boston, pp. 237-261, 1999.
D. Goldfarb and Ph. L. Toint,
Optimal Estimation of Jacobian and Hessian Matrices That Arise in Finite Difference Calculations,
Mathematics of Computation, 43(167):69-88, 1984.
S. Gratton, M. Mouffe, Ph. L. Toint and M. Weber-Mendonca,
A recursive trust-region method in infinity norm for bound-constrained nonlinear optimization,
IMA Journal of Numerical Analysis, 28(4):827-861, 2008.
S. Gratton, M. Mouffe, A. Sartenaer, Ph. L. Toint and D. Tomanos,
Numerical Experience with a Recursive Trust-Region Method for Multilevel Nonlinear Optimization,
Optimization Methods and Software, to appear, 2009.
Philippe Toint (Namur)
April 2009
272 / 323
The use of problem structure for large-scale applications
Bibliography
Bibliography for lesson 5 (2)
10 S. Gratton, A. Sartenaer and Ph. L. Toint,
Recursive Trust-Region Methods for Multiscale Nonlinear Optimization, SIAM Journal on Optimization,
19(1):414-444, 2008.
11 S. Gratton, A. Sartenaer and Ph. L. Toint,
Second-order convergence properties of trust-region methods using incomplete curvature information, with an
application to multigrid optimization,
Journal of Computational and Applied Mathematics, 24(6):676-692, 2006.
12 S. Gratton and Ph. L. Toint,
Multi-Secant Equations, Approximate Invariant Subspaces and Multigrid Optimization,
FUNDP, Namur, Report 07/11, 2007.
13 A. Griewank and Ph. L. Toint,
On the unconstrained optimization of partially separable functions,
in Nonlinear Optimization 1981 (M. J. D. Powell, ed.), Academic Press, pp. 302-312, 1982.
14 S. G. Nash,
A Multigrid Approach to Discretized Optimization Problems,
Optimization Methods and Software, 14:99-116, 2000.
15 M. Lewis and S. G. Nash,
Model problems for the multigrid optimization of systems governed by differential equations,
SIAM Journal on Scientific Computing, 26(6):1811-1837, 2005.
16 S. Oh, A. Milstein, Ch. Bouman and K. Webb,
Multigrid algorithms for optimization and inverse problems,
in Computational Imaging(Ch. Bouman and R. Stevenson, eds.), DDM, Proceedings of the SPIE, 5016:59-70, 2003.
17 M. J. D. Powell and Ph. L. Toint,
On The Estimation of Sparse Hessian Matrices,
SIAM Journal on Numerical Analysis, 16(6):1060-1074, 1979.
18 Ph. L. Toint, D. Tomanos and M. Weber-Mendonca,
A multilevel algorithm for solving the trust-region subproblem,
Optimization Methods and Software, 24(2):299-311, 2009.
19 Ch. Gross and R. Krause,
On The Convergence of Recursive Trust-Region Methods for Multiscale Nonlinear Optimization and Applications to
Nonlinear Mechanics,
University of Bonn, Germany, 2008.
Philippe Toint (Namur)
April 2009
273 / 323
The use of problem structure for large-scale applications
Bibliography
Bibliography for lesson 5 (3)
20 Z. Wen and D. Goldfarb,
A Linesearch Multigrid Method for Large-Scale Convex Optimization,
Department of Industrial Engineering and Operations Research, Columbia University, New York, July 2007.
Philippe Toint (Namur)
April 2009
274 / 323
Regularization methods and nonlinear step control
Lesson 6:
Cubic and quadratic
regularization methods:
a path towards
nonlinear step control
Philippe Toint (Namur)
April 2009
275 / 323
Regularization methods and nonlinear step control
Outline
Regularization for unconstrained problems
    1. cubic
    2. quadratic
Nonlinear step control
Cubic regularization for constrained problems
Conclusions
Bibliography
Philippe Toint (Namur)
April 2009
276 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Regularization techniques
for unconstrained Problems
Philippe Toint (Namur)
April 2009
277 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The problem
We return to the unconstrained nonlinear programming problem:
minimize f (x)
for x IRn and f : IRn IR smooth.
Important special case: the nonlinear least-squares problem
minimize f (x) = 12 kF (x)k2
for x IRn and F : IRn IRm smooth.
Philippe Toint (Namur)
April 2009
278 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Unconstrained optimization: a mature area?

    minimize_{x ∈ IRⁿ} f(x)    where f ∈ C¹ (maybe C²)
Currently two main competing (but similar) methodologies
Linesearch methods
compute a descent direction s_k from x_k
set x_{k+1} = x_k + α_k s_k to improve f
Trust-region methods
compute a step s_k from x_k to improve a model m_k of f
within the trust region ‖s_k‖ ≤ Δ_k
set x_{k+1} = x_k + s_k if m_k and f agree at x_k + s_k
otherwise set x_{k+1} = x_k and reduce the radius
Philippe Toint (Namur)
April 2009
279 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
A useful theoretical observation
Consider trust-region method where
model = true objective function
Then
model and objective always agree
trust-region radius goes to infinity
⟹ a linesearch method
Nice consequence:
A unique convergence theory!
(Shultz/Schnabel/Byrd, 1985, T., 1988)
Philippe Toint (Namur)
April 2009
280 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The keys to convergence theory for trust regions
The Cauchy condition:
    m_k(x_k) − m_k(x_k + s_k) ≥ κ_TR ‖g_k‖ min [ ‖g_k‖ / (1 + ‖H_k‖) , Δ_k ]
The bound on the stepsize:
    ‖s_k‖ ≤ Δ_k
And we derive:
Global convergence to first/second-order critical points
Is there anything more to say?
Philippe Toint (Namur)
April 2009
281 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Is there anything more to say?
Observe the following: if
f has gradient g and globally Lipschitz continuous Hessian H with
constant 2L
Taylor, Cauchy-Schwarz and Lipschitz imply
    f(x + s) = f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩
               + ∫₀¹ (1 − τ) ⟨s, [H(x + τs) − H(x)] s⟩ dτ
             ≤ f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ L ‖s‖₂³  =:  m(s)

⟹ reducing m from s = 0 improves f since m(0) = f(x).
Philippe Toint (Namur)
April 2009
282 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The cubic regularization
Change from
    min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩    s.t.  ‖s‖ ≤ Δ
to
    min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ σ ‖s‖³

σ is the (adaptive) regularization parameter
(ideas from Griewank, Weiser/Deuflhard/Erdmann, Nesterov/Polyak, Cartis/Gould/T)
Philippe Toint (Namur)
April 2009
283 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Cubic regularization highlights
    f(x + s) ≤ m(s) ≡ f(x) + sᵀg(x) + ½ sᵀH(x)s + ⅓ L ‖s‖₂³
Nesterov and Polyak minimize m globally
N.B. m may be non-convex!
efficient scheme to do so if H has sparse factors
global (ultimately rapid) convergence to a 2nd-order critical point of f
better worst-case function-evaluation complexity than previously
known
Obvious questions:
can we avoid the global Lipschitz requirement?
can we approximately minimize m and retain good worst-case
function-evaluation complexity?
does this work well in practice?
Philippe Toint (Namur)
April 2009
284 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Cubic overestimation
Assume
f C2
f , g and H at xk are fk , gk and Hk
symmetric approximation Bk to Hk
Bk and Hk bounded at points of interest
Use
cubic overestimating model at x_k:
    m_k(s) ≡ f_k + sᵀg_k + ½ sᵀB_k s + ⅓ σ_k ‖s‖₂³
σ_k is the iteration-dependent regularisation weight
easily generalized for regularisation in the M_k-norm ‖s‖_{M_k} = √(sᵀ M_k s),
where M_k is uniformly positive definite
Philippe Toint (Namur)
April 2009
285 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Adaptive Regularization with Cubic (ARC)
Algorithm 6.1: The ARC Algorithm
Step 0: Initialization: x_0 and σ_0 > 0 given. Set k = 0.
Step 1: Step computation: compute s_k for which m_k(s_k) ≤ m_k(s_k^C), where the
        Cauchy point is s_k^C = −α_k^C g_k  with  α_k^C = arg min_{α ≥ 0} m_k(−α g_k)
Step 2: Step acceptance: compute ρ_k = [f(x_k) − f(x_k + s_k)] / [f(x_k) − m_k(s_k)]
        and set x_{k+1} = x_k + s_k if ρ_k > 0.1, x_{k+1} = x_k otherwise
Step 3: Update the regularization parameter:
        σ_{k+1} ∈ (0, σ_k]             (= ½ σ_k)   if ρ_k > 0.9         [very successful]
        σ_{k+1} ∈ [σ_k, γ_1 σ_k]       (= σ_k)     if 0.1 ≤ ρ_k ≤ 0.9   [successful]
        σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k]   (= 2 σ_k)   otherwise            [unsuccessful]
Philippe Toint (Namur)
April 2009
286 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Local convergence theory for cubic regularization (1)
The Cauchy condition:
    m_k(x_k) − m_k(x_k + s_k) ≥ κ_CR ‖g_k‖ min [ ‖g_k‖ / (1 + ‖H_k‖) , √(‖g_k‖ / σ_k) ]
The bound on the stepsize:
    ‖s_k‖ ≤ 3 max [ ‖H_k‖ / σ_k , √(‖g_k‖ / σ_k) ]
(Cartis/Gould/T)
Philippe Toint (Namur)
April 2009
287 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Local convergence theory for cubic regularization (2)
And therefore. . .
    lim_{k→∞} ‖g_k‖ = 0
first-order global convergence
Under stronger assumptions can show that
if s_k minimizes m_k over a subspace with orthogonal basis Q_k,
    lim_{k→∞} Q_kᵀ H_k Q_k ⪰ 0
second-order global convergence
Philippe Toint (Namur)
April 2009
288 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Fast convergence
For fast asymptotic convergence ⟹ need to improve on the Cauchy point:
minimize over Krylov subspaces
g stopping-rule:  ‖∇_s m_k(s_k)‖ ≤ min(1, ‖g_k‖^{1/2}) ‖g_k‖
s stopping-rule:  ‖∇_s m_k(s_k)‖ ≤ min(1, ‖s_k‖) ‖g_k‖
If B_k satisfies the Dennis-Moré condition
    ‖(B_k − H_k)s_k‖ / ‖s_k‖ → 0    whenever ‖g_k‖ → 0
and x_k → x* with positive definite H(x*)
⟹ Q-superlinear convergence of x_k under the g- and s-rules
If additionally H(x) is locally Lipschitz around x* and
    ‖(B_k − H_k)s_k‖ = O(‖s_k‖²)
⟹ Q-quadratic convergence of x_k under the s-rule
Philippe Toint (Namur)
April 2009
289 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Function-evaluation complexity
How many function evaluations (iterations) are needed to ensure that
    ‖g_k‖ ≤ ε ?
So long as, for very successful iterations, σ_{k+1} ≥ γ_3 σ_k for some γ_3 < 1,
⟹ the basic ARC algorithm requires at most
    ⌈ κ_C / ε² ⌉  function evaluations
for some κ_C independent of ε                                  c.f. steepest descent
If H is globally Lipschitz, the s-rule is applied and additionally s_k is
the global (line) minimizer of m_k(α s_k) as a function of α,
⟹ the ARC algorithm requires at most
    ⌈ κ_S / ε^{3/2} ⌉  function evaluations
for some κ_S independent of ε                                  c.f. Nesterov & Polyak
Philippe Toint (Namur)
April 2009
290 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Minimizing the model
    m(s) ≡ f + sᵀg + ½ sᵀBs + ⅓ σ ‖s‖₂³
Derivatives (with λ = σ ‖s‖₂):
    ∇_s m(s) = g + Bs + λ s
    ∇_ss m(s) = B + λ I + λ (s/‖s‖)(s/‖s‖)ᵀ
Optimality: any global minimizer s* of m satisfies
    (B + λ* I) s* = −g,    λ* = σ ‖s*‖₂,    B + λ* I positive semi-definite
Philippe Toint (Namur)
April 2009
291 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The (adapted) secular equation
Require
    (B + λ I) s = −g    and    λ = σ ‖s‖₂
Define s(λ) by
    (B + λ I) s(λ) = −g
and find the scalar λ as the root of one of the (equivalent) secular equations
    ‖s(λ)‖₂ − λ/σ = 0    or    1/‖s(λ)‖₂ − σ/λ = 0    or    σ ‖s(λ)‖₂ − λ = 0
values and derivatives of s(λ) satisfy linear systems with the symmetric
positive definite matrix B + λ I
⟹ need to be able to factorize B + λ I
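A sketch (mine) of the resulting computation in the easy (non-hard) case: factorize B + λI,
solve for s(λ), and apply Newton's method to the 1/‖s‖ form of the secular equation; the
derivative needs only one extra solve with the same factorization. The starting value and
safeguard are simplistic choices made for the illustration.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def cubic_subproblem(B, g, sigma, tol=1e-10, maxit=100):
    """Solve min_s g.s + 0.5 s.B.s + sigma/3 |s|^3 (hard case ignored):
    find lambda with (B + lambda I)s = -g and lambda = sigma*|s|, by Newton
    on phi(lambda) = 1/|s(lambda)| - sigma/lambda."""
    lam = max(1e-4, -np.linalg.eigvalsh(B)[0] + 1e-4)   # lazy safeguard: make B + lam I SPD
    for _ in range(maxit):
        c = cho_factor(B + lam * np.eye(len(g)))
        s = cho_solve(c, -g)
        snorm = np.linalg.norm(s)
        phi = 1.0 / snorm - sigma / lam
        if abs(phi) <= tol:
            break
        w = cho_solve(c, s)                       # (B + lam I)^{-1} s, reusing the factors
        dphi = (s @ w) / snorm**3 + sigma / lam**2
        lam = max(lam - phi / dphi, 0.5 * lam)    # Newton step, kept positive
    return s, lam

B = np.array([[2.0, 0.5], [0.5, -1.0]])           # an indefinite example
g = np.array([1.0, 1.5])
s, lam = cubic_subproblem(B, g, 2.0)
print(s, lam, 2.0 * np.linalg.norm(s))            # sigma*|s| should equal lam at the solution
```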
Philippe Toint (Namur)
April 2009
292 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Plots of secular functions against
Example: g = (0.25 1)T , H = diag(1 1) and = 2
ks()k2
=0
Philippe Toint (Namur)
=0
ks()k2
=0
ks()k2
April 2009
293 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Large problems approximate solutions
Seek instead the global minimizer of m(s) in a j-dimensional (j ≪ n) subspace S ⊆ IRⁿ
g ∈ S ⟹ ARC algorithm globally convergent
Q orthogonal basis for S ⟹ s = Qu where
    u = arg min_{u ∈ IR^j}  f + uᵀ(Qᵀg) + ½ uᵀ(QᵀBQ)u + ⅓ σ ‖u‖₂³
⟹ use the secular equation to find u
if S is the Krylov space generated by {B^i g}_{i=0}^{j−1}
⟹ QᵀBQ = T, tridiagonal
⟹ can factor T + λI to solve the secular equation even if j is large
using the g- or s-stopping rule ⟹ fast asymptotic convergence for ARC
using the s-stopping rule ⟹ good function-evaluation complexity for ARC
Philippe Toint (Namur)
April 2009
294 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The main features of adaptive cubic regularization
And the result is. . .
longer steps on ill-conditioned problems
similar (very satisfactory) convergence analysis
best function-evaluation complexity for nonconvex problems
excellent performance and reliability
Philippe Toint (Namur)
April 2009
295 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Numerical experience: small problems using Matlab
Performance profile: iteration count on 131 CUTEr problems
[Figure: performance profile comparing ACO with the g stopping rule (3 failures),
 ACO with the s stopping rule (3 failures) and trust-region (8 failures)]
Philippe Toint (Namur)
April 2009
296 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The quadratic regularization for NLS (ARQ)
Consider the Gauss-Newton method for nonlinear least-squares problems.
Change from
    min_s  ½ ‖c(x)‖² + ⟨s, J(x)ᵀc(x)⟩ + ½ ⟨s, J(x)ᵀJ(x)s⟩    s.t.  ‖s‖ ≤ Δ
to
    min_s  ‖c(x) + J(x)s‖ + ½ σ ‖s‖²

σ is the (adaptive) regularization parameter
(idea by Nesterov)
Philippe Toint (Namur)
April 2009
297 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Quadratic regularization: reformulation
Note that
    min_s  ‖c(x) + J(x)s‖ + ½ σ ‖s‖²
    ⟺  min_{μ,s}  μ + ½ σ ‖s‖²    such that    ‖c(x) + J(x)s‖² = μ²
⟹ exact penalty function for the problem of minimizing ‖s‖ subject to
c(x) + J(x)s = 0.
Iterative techniques. . . as for the cubic case (Cartis, Gould,T.):
solve the problem in nested Krylov subspaces
Lanczos factorization of tridiagonal matrices
different scalar secular equation (solution by Newton's method)
Philippe Toint (Namur)
April 2009
298 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
The keys to convergence theory for quadratic regularization
The Cauchy condition:
    m(x_k) − m(x_k + s_k) ≥ κ_QR (‖J_kᵀc_k‖ / ‖c_k‖) min [ ‖J_kᵀc_k‖ / (1 + ‖J_kᵀJ_k‖) , ‖J_kᵀc_k‖ / (σ_k ‖c_k‖) ]
The bound on the stepsize:
    ‖s_k‖ ≤ 2 ‖J_kᵀc_k‖ / (σ_k ‖c_k‖)
Philippe Toint (Namur)
April 2009
299 / 323
Regularization methods and nonlinear step control
Regularization methods for unconstrained problems
Convergence theory for the quadratic regularization
Convergence results:
Global convergence to first-order critical points
Quadratic convergence to roots
Valid for
general values of m and n,
exact/approximate subproblem solution
(Bellavia/Cartis/Gould/Morini/T.)
Philippe Toint (Namur)
April 2009
300 / 323
Regularization methods and nonlinear step control
Nonlinear stepsize control
6.2: A unifying concept:
nonlinear stepsize control
Philippe Toint (Namur)
April 2009
301 / 323
Regularization methods and nonlinear step control
Nonlinear stepsize control
Towards a unified global convergence theory
Objectives:
recover a unified global convergence theory
possibly open the door for new algorithms
Idea:
cast all three methods into a generalized TR framework
allow this TR to be updated nonlinearly
Philippe Toint (Namur)
April 2009
302 / 323
Regularization methods and nonlinear step control
Nonlinear stepsize control
Towards a unified global convergence theory (2)
Given
3 continuous first-order criticality measures, say φ(x), π(x), ω(x)
an adaptive stepsize parameter α
define a generalized radius Δ(α, φ(x)) such that
    Δ(·, φ) is C¹, strictly increasing and concave,
    Δ(0, φ) = 0 for all φ,
    Δ(α, ·) is non-increasing,
    ...
Philippe Toint (Namur)
April 2009
303 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
6.3: Cubic regularization
for constrained problems
Philippe Toint (Namur)
April 2009
304 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
The constrained case
Can we apply regularization to the constrained case?
Consider the constrained nonlinear programming problem:
minimize f (x)
x F
for x IRn and f : IRn IR smooth, and where
F is convex.
Main ideas:
exploit (cheap) projections on convex sets
define using the generalized Cauchy point idea
prove global convergence + function-evaluation complexity
Philippe Toint (Namur)
April 2009
305 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Constrained step computation (1)
    min_s  f(x) + ⟨s, g(x)⟩ + ½ ⟨s, H(x)s⟩ + ⅓ σ ‖s‖³    subject to    x + s ∈ F

σ is the (adaptive) regularization parameter

Criticality measure: (as before)
    χ(x) := | min_{x+d ∈ F, ‖d‖≤1} ⟨∇_x f(x), d⟩ |
Philippe Toint (Namur)
April 2009
306 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
The generalized Cauchy point for ARC
Cauchy step: Goldstein-like piecewise linear search on m_k along the
gradient path projected onto F
Find
    x_k^GC := P_F[x_k − t_k^GC g_k] = x_k + s_k^GC    (t_k^GC > 0)
such that
    m_k(x_k^GC) ≤ f(x_k) + κ_ubs ⟨g_k, s_k^GC⟩        (below linear approximation)
and either
    m_k(x_k^GC) ≥ f(x_k) + κ_lbs ⟨g_k, s_k^GC⟩        (above linear approximation)
or
    ‖P_{T(x_k^GC)}[g_k]‖ ≤ κ_epp |⟨g_k, s_k^GC⟩|       (close to the path's end)

⟹ no trust-region condition!
Philippe Toint (Namur)
April 2009
307 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Searching for the ARC-GCP
[Figure: piecewise linear search for the generalized Cauchy point of a regularized cubic
 model m_k, subject to the bound s ≤ 1.5]
Philippe Toint (Namur)
April 2009
308 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Remember the same for a quadratic model?
[Figure: the corresponding piecewise linear search for a quadratic model, subject to the
 bound s ≤ 1.5 and a trust-region constraint]
Philippe Toint (Namur)
April 2009
309 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
A constrained regularized algorithm
Algorithm 6.2: ARC for Convex Constraints (COCARC)
Step 0: Initialization. x_0 ∈ F, σ_0 given. Compute f(x_0), set k = 0.
Step 1: Generalized Cauchy point. If x_k not critical, find the
generalized Cauchy point x_k^GC by piecewise linear search on the
regularized cubic model.
Step 2: Step calculation. Compute s_k and x_k⁺ = x_k + s_k ∈ F such
that m_k(x_k⁺) ≤ m_k(x_k^GC).
Step 3: Acceptance of the trial point. Compute f(x_k⁺) and ρ_k.
If ρ_k ≥ η_1, then x_{k+1} = x_k + s_k; otherwise x_{k+1} = x_k.
Step 4: Regularisation parameter update. Set
    σ_{k+1} ∈ (0, σ_k]              if ρ_k ≥ η_2,
    σ_{k+1} ∈ [σ_k, γ_1 σ_k]        if ρ_k ∈ [η_1, η_2),
    σ_{k+1} ∈ [γ_1 σ_k, γ_2 σ_k]    if ρ_k < η_1.
Philippe Toint (Namur)
April 2009
310 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Local convergence theory for COCARC
The Cauchy condition:
    m_k(x_k) − m_k(x_k + s_k) ≥ κ_CR χ_k min [ χ_k / (1 + ‖H_k‖) , √(χ_k / σ_k) , 1 ]
The bound on the stepsize:
    ‖s_k‖ ≤ 3 max [ ‖H_k‖ / σ_k , (χ_k / σ_k)^{1/2} , (χ_k / σ_k)^{1/3} ]
And therefore. . .
    lim_{k→∞} χ_k = 0
(Cartis/Gould/T)
Philippe Toint (Namur)
April 2009
311 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Function-Evaluation Complexity for COCARC (1)
But
What about function-evaluation complexity?
If, for very successful iterations, σ_{k+1} ≥ γ_3 σ_k for some γ_3 < 1,
the COCARC algorithm requires at most
    ⌈ κ_C / ε² ⌉  function evaluations
(for some κ_C independent of ε) to achieve χ_k ≤ ε
c.f. steepest descent
Do the nicer bounds for unconstrained optimization extend to the
constrained case?
Philippe Toint (Namur)
April 2009
312 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Function-evaluation complexity for COCARC (2)
As for unconstrained, impose a termination rule on the subproblem
solution:
Do not terminate solving min_{x_k + s ∈ F} m_k(x_k + s) before
    χ_k^m(x_k⁺) ≤ min(κ_stop, ‖s_k‖) χ_k
where
    χ_k^m(x) := | min_{x+d ∈ F, ‖d‖≤1} ⟨∇_x m_k(x), d⟩ |
c.f. the s-rule for unconstrained
Note: OK at local constrained model minimizers
Philippe Toint (Namur)
April 2009
313 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Walking through the pass...
2
1
x g
k
x+
min
2
feasible
3
7
A beyond the pass constrained problem with
m(x, y ) = x
Philippe Toint (Namur)
42
100
3
10
x2
1
10
y 3 + 13 [x 2 + y 2 ] 2
April 2009
314 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Walking through the pass...with a sherpa
[Figure: the same contours, now with a piecewise descent path from x_k to x_k⁺ through the
 intermediate points x_{k,a} and x_{k,c}]
A piecewise descent path from x_k to x_k⁺ on the same model m(x, y)
Philippe Toint (Namur)
April 2009
315 / 323
Regularization methods and nonlinear step control
Regularization techniques for constrained problems
Function-Evaluation Complexity for COCARC (2)
Assume also
    x_k → x_k⁺ in a bounded number of feasible descent substeps
    ‖H_k − ∇_xx f(x_k)‖ ≤ κ ‖s_k‖²
    ∇_xx f(·) is globally Lipschitz continuous
    {x_k} bounded
The COCARC algorithm requires at most
    ⌈ κ_C / ε^{3/2} ⌉  function evaluations
(for some κ_C independent of ε) to achieve χ_k ≤ ε
Caveat: cost of solving the subproblem                      c.f. unconstrained case!!!
Philippe Toint (Namur)
April 2009
316 / 323
Regularization methods and nonlinear step control
Conclusions
Conclusions for lesson 6
Much left to do. . . but very interesting
Unconstrained nonlinear stepsize control could lead to very untypical
methods. Example:
    φ_k = π_k = ω_k = √‖g_k‖,    Δ(α, φ) = ...
Meaningful numerical evaluation still needed for many of these
algorithms
Many issues regarding regularizations still unresolved
Philippe Toint (Namur)
April 2009
317 / 323
Regularization methods and nonlinear step control
Bibliography
Bibliography for lesson 6
R. H. Byrd, R. B. Schnabel and G. A. Shultz,
A trust region algorithm for nonlinearly constrained optimization,
SIAM Journal on Numerical Analysis, 24: 1152-1170, 1987.
Ph. L. Toint,
Global convergence of a class of trust region methods for nonconvex minimization in Hilbert space,
IMA Journal of Numerical Analysis, 8(2): 231-252, 1988.
A. Griewank,
The modification of Newton's method for unconstrained optimization by bounding cubic terms,
Department of Applied Mathematics and Theoretical Physics, University of Cambridge (UK), Report NA/12, 1981.
M. Weiser, P. Deuflhard and B. Erdmann,
Affine conjugate adaptive Newton methods for nonlinear elastomechanics,
Optimization Methods and Software, 22(3): 413-431, 2007.
Yu. Nesterov and B. T. Polyak,
Cubic regularization of Newton method and its global performance,
Mathematical Programming, 108(1): 177-205, 2006.
C. Cartis and N. I. M. Gould and Ph. L. Toint,
Adaptive cubic overestimation methods for unconstrained optimization,
FUNDP, Namur, Report 07/05, 2007.
C. Cartis, N. I. M. Gould and Ph. L. Toint,
Trust-region and other regularisation of linear least-squares problems,
BIT, to appear, 2009.
Yu. Nesterov,
Modified Gauss-Newton scheme with worst-case guarantees for global performance,
Optimization Methods and Software, 22(3): 469-483, 2007.
S. Bellavia, C. Cartis, N. I. M. Gould, B. Morini and Ph. L. Toint,
Convergence of a Regularized Euclidean Residual Algorithm for Nonlinear Least-Squares,
FUNDP, Namur, Report 08/11, 2008.
C. Cartis, N. I. M. Gould and Ph. L. Toint,
An adaptive cubic regularization algorithm for nonconvex optimization with convex constraints and its
function-evaluation complexity,
FUNDP, Namur, Report 08/05R, 2009.
Philippe Toint (Namur)
April 2009
318 / 323
Conclusions
Not covered in the course
non-smooth techniques
specifically convex problems
penalty functions
augmented Lagrangians
affine scaling methods
general sequential quadratic programming (SQP)
systems of nonlinear equations
...
Many thanks to you all for your patience!
Philippe Toint (Namur)
April 2009
319 / 323