KEMBAR78
Lecture 5 | PDF | Mathematical Optimization | Derivative
0% found this document useful (0 votes)
31 views6 pages

Lecture 5

Uploaded by

mralreda99
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
31 views6 pages

Lecture 5

Uploaded by

mralreda99
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

TMA947 / MMG621 – Nonlinear optimization Lecture 5

TMA947 / MMG621 — Nonlinear optimization

Lecture 5 — Uncontrained optimization algorithms


Emil Gustavsson, Zuzana Nedělková

November 6, 2017

[Minor revision: Axel Ringh - Spetember, 2024]

Consider the unconstrained optimization problem to


minimize
n
f (x), (1)
x∈R

where f ∈ C 0 on Rn (f is continuous). Mostly, we assume that f ∈ C 1 holds (f is continuously


differentiable), sometimes even C 2 . The choice of the algorithm depends on the size of the prob-
lem, availability of ∇f (x) and ∇2 f (x), convexity of f and if the goal is to find a local or the global
minimum.

Most algorithms for unconstrained optimization problems are what we call line search type algo-
rithms.
Definition. Line search type algorithm

Step 0: Starting point x0 ∈ Rn . Let k := 0.


Step 1: Find search direction pk ∈ Rn
Step 2: Perform line search, i.e., find αk > 0 such that f (xk + αk pk ) < f (xk )
Step 3: Let xk+1 := xk + αk pk .
Step 4: If termination criteria is fulfilled then stop! Otherwise, let k := k + 1 and go to Step 1.

p 5

4
f
3

w 1

−1
x
−2
−5
−4
−3 −3
−2
−1
0
1 −4
2
3
4 −5
a c
5

Most algorithms we consider are inherently local, meaning that the search direction pk is only
based on the information at the current point xk , that is, f (xk ), ∇f (xk ), and ∇2 f (xk ).

Think of a near-sighted mountain climber. The climber is in a deep fog and can only check his or
her barometer for the height and feel the steepness of the slope under her feet.

1
TMA947 / MMG621 – Nonlinear optimization Lecture 5

Step 1: Search directions

Vector pk is a descent direction at xk if f (xk + αpk ) < f (xk ) for all α ∈ (0, δ] for some δ > 0.

Let f ∈ C 1 in some neighborhood of xk ∈ Rn , if ∇f (xk ) ̸= 0, then pk = −∇f (xk ) is a descent


direction for f at xk (follows from optimality conditions). This search step is called steepest descent
direction because it solves the problem to

minimize ∇f (xk )T p.
p∈Rn :∥p∥=1

Let Q ∈ Rn×n be an arbitrary symmetric, positive definite matrix. Then pk = −Q∇f (xk ) is a
descent direction for f at xk , because

∇f (xk )T pk = −∇f (xk )T Q∇f (xk ) < 0,

due to the positive definiteness of Q.

Examples:

– Steepest descent: Q = I,
– Newton’s method: Q = [∇2 f (xk )]−1 .

We will now derive Newton’s method. To do so, we need to assume that f ∈ C 2 . We also first
assume that ∇2 f (x) is positive definite. A second-order Taylor approximation is then:
1
f (xk + p) − f (xk ) ≈ ∇f (xk )T p + pT ∇2 f (xk )p =: φxk (p)
2
We now try to minimize this approximation by setting the gradient of φxk (p) to zero:

∇p φxk (p) = ∇f (xk ) + ∇2 f (xk )p = 0 ⇔ ∇2 f (xk )p = −∇f (xk )

Now by choosing the vector fulfilling this we obtain pk = −[∇2 f (xk )]−1 ∇f (xk ) as the search
direction. When n = 1, we get that pk = −f ′ (xk )/f ′′ (xk ).

When the Hessian ∇2 f (xk ) is positive definite this search direction is a descent direction. But
when ∇2 f (xk ) is negative definite (may be also non invertible), the search direction is an ascent
direction, meaning that Newton’s method does differentiate between minimization and maxi-
mization problem. The solution to this problem is to modify ∇2 f (xk ) by adding a diagonal ma-
trix γI such that (∇2 f (xk ) + γI) is positive definite (this can always be done, why?). This method
is called the Levenberg-Marquardt modification. We thus take as search direction
−1
pk = − ∇2 f (xk ) + γI

∇f (xk ).

Note that

– Steepest descent: γ = ∞,
– Newton’s method: γ = 0.

2
TMA947 / MMG621 – Nonlinear optimization Lecture 5

What happens when we can not compute ∇2 f (xk )? Try to approximate the Hessian in some way
choosing approximate matrix B k . From Taylor expansion for ∇f (xk ) we have that

∇2 f (xk )(xk − xk−1 ) ≈ ∇f (xk ) − ∇f (xk−1 )

so the approximate matrix B k has to fulfill

B k (xk − xk−1 ) = ∇f (xk ) − ∇f (xk−1 ).

Many different choices of B k exist, and they lead to what is called quasi-Newton methods.

To summarize:

Steepest descent: pk = −∇f (xk )


2
Netwon’s method: ∇ f (xk )pk = −∇f (xk )
2
Levenberg-Marquardt: (∇ f (xk ) + γI)pk = −∇f (xk )
Quasi-Newton: B k pk = −∇f (xk ).

Step 2: Line search

In each iteration one would like to solve

minimize φ(α) := f (xk + αpk ).


α≥0

The optimality conditions for the problem are

φ′ (α∗ ) ≥ 0,
α∗ φ′ (α∗ ) = 0,
α∗ ≥ 0.

These conditions state that if α∗ > 0, then φ′ (α∗ ) = 0, which implies that

∇f (xk + α∗ pk )T pk = 0,

meaning that the search direction pk is orthogonal to the gradient of f at xk + α∗ pk .

p 5

4
f
3

w 1

−1
x
−2
−5
−4
−3 −3
−2
−1
0
1 −4
2
3
4 −5
a c
5

3
TMA947 / MMG621 – Nonlinear optimization Lecture 5

However, solving the line search problem to optimality is unnecessary. The optimal solution to
the original problem lies elsewhere anyway. Examples of methods to choose step lengths αk

– Interpolation: Use f (xk ), ∇f (xk ), and ∇f (xk )T pk to approximate φ = f (xk + αpk ) quadrat-
ically. Then minimize this approximation of φ analytically.
– Newton’s method: Repeat improvements from a quadratic approximation: α = α−φ′ (α)/φ′′ (α)
– Golden section: Derivative-free method which shrinks an interval wherein a solution to
φ′ (α) = 0 lies.

We will often use what is denoted as the Armijo rule. The idea is to choose a step length α which
provides sufficient decrease in f . We have that

f (xk + αpk ) ≈ f (xk ) + α∇f (xk )T pk ,

for very small values of α > 0, meaning that we predict that the objective function will decrease
with α∇f (xk )T pk if we move a step length α in the direction of pk . Now this might be too
optimistic, and we will therefore accept the step length if the actual decrease is at least a fraction
µ (µ is small, typically µ ∈ [0.001, 0.01]) of the predicted decrease, i.e., we will accept α if

f (xk + αpk ) − f (xk ) ≤ µα∇f (xk )T pk ,

or equivalently, if
φ(α) − φ(0) ≤ µαφ′ (0).
We usually start with α = 1. If this is not fulfilled, then choose α := α/2.

r a

b c

Figure 1: The interval (R) accepted by the Armijo step length rule

4
TMA947 / MMG621 – Nonlinear optimization Lecture 5

Convergence

In order to state a convergence result for the algorithm, we make an additional assumption for
the search directions. We need the directions pk to fulfill

∇f (xk )T pk
− ≥ s1 , ∥pk ∥ ≥ s2 ||∇f (xk ||, and ∥pk ∥ ≤ M (2)
∥∇f (xk )∥ · ∥pk ∥

for some s1 , s2 , M > 0, where the first inequality makes the angle between pk and ∇f (xk ) stay
between 0 and π/2, but not too close to π/2. The second inequality makes sure that the only
case when pk can be zero is when the gradient is zero. These two conditions guarantee a certain
descent quality.
Theorem (convergence of unconstrained algorithm). Suppose f ∈ C 1 and for the starting point x0
the level set {x ∈ Rn | f (x) ≤ f (x0 )} is bounded. Consider the iterative algorithm described above.
Suppose that for all k, pk fulfills (2) and αk is chosen according to the Armijo rule. Then

a) the sequence {xk } is bounded,

b) the sequence {f (xk )} is descending and lower bounded, and

c) every limit point of {xk } is a stationary point.

Proof. See Theorem 11.4 in the book.

If we add the assumption that f is a convex function, then we can show that

optimum exists ⇐⇒ {xk } converges to an optimal solution.

Step 4: Termination criteria

We can not terminate the algorithm when ∇f (xk ) = 0, since this rarely happens. We need to have
some tolerance level. Three examples are

a) ∥∇f (xk )∥ ≤ ε1 (1 + |f (xk )|), where ε1 > 0 is small.


b) f (xk−1 ) − f (xk ) ≤ ε2 (1 + |f (xk )|), where ε2 > 0 is small.
c) ∥xk − xk−1 ∥ ≤ ε3 (1 + ∥xk ∥), where ε3 > 0 is small.

Can also use the max-norm ∥ · ∥∞ instead.

5
TMA947 / MMG621 – Nonlinear optimization Lecture 5

A note on trust region methods

Trust region methods use a quadratic approximation of the function around the current iterate
xk , avoid a line search but instead bound the length of the search direction. Let

1
φxk (p) := f (xk ) + ∇f (xk )T p + pT ∇2 f (xk )p.
2
Since this is a local approximation, we restrict our approximation to a trust region in the neighbor-
hood of xk , i.e., we trust the model in the region where ∥p∥ ≤ ∆k . We then solve the problem
to

minimize φxk (p),


subject to ∥p∥ ≤ ∆k .

and let the solution be pk . Then we update our iterate as xk+1 = xk + pk . We also update the trust
region parameter ∆k depending on the progress so far (actual reduction/predicted reduction).
The method is robust and possess strong convergence. More detailed information about trust
region methods can be found in the book on pages 301–302.

A note on black-box functions

In some cases the value of the objective function f (x) is given through some unknown simulation
procedure. This implies that we do not have a clear representation of the gradient of the objective
function. In some cases, we can perform numerical differentiation and approximate the partial
derivatives as, e.g.,
∂f (x) f (x + αei ) − f (x)
≈ ,
∂xi α
where ei = (0, . . . , 0, 1, 0, . . . , 0)T is the unit vector in Rn .

If the simulation is not accurate, we get a bas derivative information. We can use derivative-free
methods instead. These try to build a model fˆ of the objective function f from evaluating the
objective function at some specific test points and optimize the model fˆ instead of the function f .

You might also like