TMA947 / MMG621 – Nonlinear optimization Lecture 5
TMA947 / MMG621 — Nonlinear optimization
Lecture 5 — Uncontrained optimization algorithms
Emil Gustavsson, Zuzana Nedělková
November 6, 2017
[Minor revision: Axel Ringh - Spetember, 2024]
Consider the unconstrained optimization problem to
minimize
n
f (x), (1)
x∈R
where f ∈ C 0 on Rn (f is continuous). Mostly, we assume that f ∈ C 1 holds (f is continuously
differentiable), sometimes even C 2 . The choice of the algorithm depends on the size of the prob-
lem, availability of ∇f (x) and ∇2 f (x), convexity of f and if the goal is to find a local or the global
minimum.
Most algorithms for unconstrained optimization problems are what we call line search type algo-
rithms.
Definition. Line search type algorithm
Step 0: Starting point x0 ∈ Rn . Let k := 0.
Step 1: Find search direction pk ∈ Rn
Step 2: Perform line search, i.e., find αk > 0 such that f (xk + αk pk ) < f (xk )
Step 3: Let xk+1 := xk + αk pk .
Step 4: If termination criteria is fulfilled then stop! Otherwise, let k := k + 1 and go to Step 1.
p 5
4
f
3
w 1
−1
x
−2
−5
−4
−3 −3
−2
−1
0
1 −4
2
3
4 −5
a c
5
Most algorithms we consider are inherently local, meaning that the search direction pk is only
based on the information at the current point xk , that is, f (xk ), ∇f (xk ), and ∇2 f (xk ).
Think of a near-sighted mountain climber. The climber is in a deep fog and can only check his or
her barometer for the height and feel the steepness of the slope under her feet.
1
TMA947 / MMG621 – Nonlinear optimization Lecture 5
Step 1: Search directions
Vector pk is a descent direction at xk if f (xk + αpk ) < f (xk ) for all α ∈ (0, δ] for some δ > 0.
Let f ∈ C 1 in some neighborhood of xk ∈ Rn , if ∇f (xk ) ̸= 0, then pk = −∇f (xk ) is a descent
direction for f at xk (follows from optimality conditions). This search step is called steepest descent
direction because it solves the problem to
minimize ∇f (xk )T p.
p∈Rn :∥p∥=1
Let Q ∈ Rn×n be an arbitrary symmetric, positive definite matrix. Then pk = −Q∇f (xk ) is a
descent direction for f at xk , because
∇f (xk )T pk = −∇f (xk )T Q∇f (xk ) < 0,
due to the positive definiteness of Q.
Examples:
– Steepest descent: Q = I,
– Newton’s method: Q = [∇2 f (xk )]−1 .
We will now derive Newton’s method. To do so, we need to assume that f ∈ C 2 . We also first
assume that ∇2 f (x) is positive definite. A second-order Taylor approximation is then:
1
f (xk + p) − f (xk ) ≈ ∇f (xk )T p + pT ∇2 f (xk )p =: φxk (p)
2
We now try to minimize this approximation by setting the gradient of φxk (p) to zero:
∇p φxk (p) = ∇f (xk ) + ∇2 f (xk )p = 0 ⇔ ∇2 f (xk )p = −∇f (xk )
Now by choosing the vector fulfilling this we obtain pk = −[∇2 f (xk )]−1 ∇f (xk ) as the search
direction. When n = 1, we get that pk = −f ′ (xk )/f ′′ (xk ).
When the Hessian ∇2 f (xk ) is positive definite this search direction is a descent direction. But
when ∇2 f (xk ) is negative definite (may be also non invertible), the search direction is an ascent
direction, meaning that Newton’s method does differentiate between minimization and maxi-
mization problem. The solution to this problem is to modify ∇2 f (xk ) by adding a diagonal ma-
trix γI such that (∇2 f (xk ) + γI) is positive definite (this can always be done, why?). This method
is called the Levenberg-Marquardt modification. We thus take as search direction
−1
pk = − ∇2 f (xk ) + γI
∇f (xk ).
Note that
– Steepest descent: γ = ∞,
– Newton’s method: γ = 0.
2
TMA947 / MMG621 – Nonlinear optimization Lecture 5
What happens when we can not compute ∇2 f (xk )? Try to approximate the Hessian in some way
choosing approximate matrix B k . From Taylor expansion for ∇f (xk ) we have that
∇2 f (xk )(xk − xk−1 ) ≈ ∇f (xk ) − ∇f (xk−1 )
so the approximate matrix B k has to fulfill
B k (xk − xk−1 ) = ∇f (xk ) − ∇f (xk−1 ).
Many different choices of B k exist, and they lead to what is called quasi-Newton methods.
To summarize:
Steepest descent: pk = −∇f (xk )
2
Netwon’s method: ∇ f (xk )pk = −∇f (xk )
2
Levenberg-Marquardt: (∇ f (xk ) + γI)pk = −∇f (xk )
Quasi-Newton: B k pk = −∇f (xk ).
Step 2: Line search
In each iteration one would like to solve
minimize φ(α) := f (xk + αpk ).
α≥0
The optimality conditions for the problem are
φ′ (α∗ ) ≥ 0,
α∗ φ′ (α∗ ) = 0,
α∗ ≥ 0.
These conditions state that if α∗ > 0, then φ′ (α∗ ) = 0, which implies that
∇f (xk + α∗ pk )T pk = 0,
meaning that the search direction pk is orthogonal to the gradient of f at xk + α∗ pk .
p 5
4
f
3
w 1
−1
x
−2
−5
−4
−3 −3
−2
−1
0
1 −4
2
3
4 −5
a c
5
3
TMA947 / MMG621 – Nonlinear optimization Lecture 5
However, solving the line search problem to optimality is unnecessary. The optimal solution to
the original problem lies elsewhere anyway. Examples of methods to choose step lengths αk
– Interpolation: Use f (xk ), ∇f (xk ), and ∇f (xk )T pk to approximate φ = f (xk + αpk ) quadrat-
ically. Then minimize this approximation of φ analytically.
– Newton’s method: Repeat improvements from a quadratic approximation: α = α−φ′ (α)/φ′′ (α)
– Golden section: Derivative-free method which shrinks an interval wherein a solution to
φ′ (α) = 0 lies.
We will often use what is denoted as the Armijo rule. The idea is to choose a step length α which
provides sufficient decrease in f . We have that
f (xk + αpk ) ≈ f (xk ) + α∇f (xk )T pk ,
for very small values of α > 0, meaning that we predict that the objective function will decrease
with α∇f (xk )T pk if we move a step length α in the direction of pk . Now this might be too
optimistic, and we will therefore accept the step length if the actual decrease is at least a fraction
µ (µ is small, typically µ ∈ [0.001, 0.01]) of the predicted decrease, i.e., we will accept α if
f (xk + αpk ) − f (xk ) ≤ µα∇f (xk )T pk ,
or equivalently, if
φ(α) − φ(0) ≤ µαφ′ (0).
We usually start with α = 1. If this is not fulfilled, then choose α := α/2.
r a
b c
Figure 1: The interval (R) accepted by the Armijo step length rule
4
TMA947 / MMG621 – Nonlinear optimization Lecture 5
Convergence
In order to state a convergence result for the algorithm, we make an additional assumption for
the search directions. We need the directions pk to fulfill
∇f (xk )T pk
− ≥ s1 , ∥pk ∥ ≥ s2 ||∇f (xk ||, and ∥pk ∥ ≤ M (2)
∥∇f (xk )∥ · ∥pk ∥
for some s1 , s2 , M > 0, where the first inequality makes the angle between pk and ∇f (xk ) stay
between 0 and π/2, but not too close to π/2. The second inequality makes sure that the only
case when pk can be zero is when the gradient is zero. These two conditions guarantee a certain
descent quality.
Theorem (convergence of unconstrained algorithm). Suppose f ∈ C 1 and for the starting point x0
the level set {x ∈ Rn | f (x) ≤ f (x0 )} is bounded. Consider the iterative algorithm described above.
Suppose that for all k, pk fulfills (2) and αk is chosen according to the Armijo rule. Then
a) the sequence {xk } is bounded,
b) the sequence {f (xk )} is descending and lower bounded, and
c) every limit point of {xk } is a stationary point.
Proof. See Theorem 11.4 in the book.
If we add the assumption that f is a convex function, then we can show that
optimum exists ⇐⇒ {xk } converges to an optimal solution.
Step 4: Termination criteria
We can not terminate the algorithm when ∇f (xk ) = 0, since this rarely happens. We need to have
some tolerance level. Three examples are
a) ∥∇f (xk )∥ ≤ ε1 (1 + |f (xk )|), where ε1 > 0 is small.
b) f (xk−1 ) − f (xk ) ≤ ε2 (1 + |f (xk )|), where ε2 > 0 is small.
c) ∥xk − xk−1 ∥ ≤ ε3 (1 + ∥xk ∥), where ε3 > 0 is small.
Can also use the max-norm ∥ · ∥∞ instead.
5
TMA947 / MMG621 – Nonlinear optimization Lecture 5
A note on trust region methods
Trust region methods use a quadratic approximation of the function around the current iterate
xk , avoid a line search but instead bound the length of the search direction. Let
1
φxk (p) := f (xk ) + ∇f (xk )T p + pT ∇2 f (xk )p.
2
Since this is a local approximation, we restrict our approximation to a trust region in the neighbor-
hood of xk , i.e., we trust the model in the region where ∥p∥ ≤ ∆k . We then solve the problem
to
minimize φxk (p),
subject to ∥p∥ ≤ ∆k .
and let the solution be pk . Then we update our iterate as xk+1 = xk + pk . We also update the trust
region parameter ∆k depending on the progress so far (actual reduction/predicted reduction).
The method is robust and possess strong convergence. More detailed information about trust
region methods can be found in the book on pages 301–302.
A note on black-box functions
In some cases the value of the objective function f (x) is given through some unknown simulation
procedure. This implies that we do not have a clear representation of the gradient of the objective
function. In some cases, we can perform numerical differentiation and approximate the partial
derivatives as, e.g.,
∂f (x) f (x + αei ) − f (x)
≈ ,
∂xi α
where ei = (0, . . . , 0, 1, 0, . . . , 0)T is the unit vector in Rn .
If the simulation is not accurate, we get a bas derivative information. We can use derivative-free
methods instead. These try to build a model fˆ of the objective function f from evaluating the
objective function at some specific test points and optimize the model fˆ instead of the function f .