School of Computer Science and Applied Mathematics
APPM 3017: Optimization III
Lecture 03 : Line Search Methods
Lecturer: Matthews Sejeso Date: August 2021
LEARNING OUTCOMES
By the completion of this lecture you should be able to:
1. Describe the stepsize selection of line search methods.
2. Describe convergence of gradient methods with dynamic stepsize selection.
Reference:
• Chapter 3, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.
3.1 Introduction
In unconstrained optimization, we minimize the objective function that depends on real variables, with no
restrictions on the values of these variables. The formulation is
min f (x), (3.1)
x
where x ∈ Rn is a real vector with n components and f : Rn → R is a continuously differentiable nonlinear
function. The solutions to the optimization problem (3.1) are almost always impossible to obtain directly
- with few exceptions. Hence, for the most part, we solve this problem with iterative algorithms. Most of
the interesting algorithms for this problem rely on an important idea, called iterative descent. The idea is
to start from the initial point x0 and successively generate the sequence {xk }, such that f is decreased at
each iteration. There are two main types of algorithms in unconstrained optimization - line search methods
and trust region methods; in this lecture, we focus on the former.
In line search method we first choose a search direction dk from the current point xk , using the information
about the function f . Then we determine the stepsize sk along the search direction, so that the new point
is
xk+1 = xk + sk dk (3.2)
The aim of different choices of sk is mainly to ensure that the algorithm defined by (3.2) results to be
globally convergent, possibly without deteriorating the rate of convergence. We have assumed a constant
stepsize up to this point, i.e. fixed stepsize sk = s, is used for all iterations. The constant stepsize rule is
straightforward. However, if the stepsize is too large, divergence will occur; while the stepsize is too small,
the convergence rate may be very slow. Thus, the constant stepsize rule is useful only for problems where
an appropriate constant stepsize value is known or can be determined fairly.
The best possible choice of the stepsize sk is the value that minimize the objective function f along the
search direction dk :
sk = arg min f (xk + sdk ), (3.3)
s
However, unless the function f has some particular structure, an exact minimization of (3.3) is computationally
expensive. Moreover, in general, the whole minimization algorithm does not gain any special advantage
3-1
from the knowledge of such optimal sk . Therefore, it is more convenient to use approximate methods,
i.e., computationally simple methods that guarantee particular convergence properties. We discuss how to
choose the stepsize sk in a dynamic manner by approximately solving the one-dimensional minimization
problem (3.3).
3.2 Gradient methods
We consider gradient related descent methods of the form
xk+1 = xk − sk Dk ∇f (xk ), (3.4)
where Dk is a positive definite symmetric matrix. Since dk = −Dk ∇f (xk ), the descent condition
∇f (xk )dk < 0 is written as
∇f (xk )Dk ∇f (xk ) > 0,
and it holds due to the positive definiteness of Dk . Here are some of choices of the matrix Dk , resulting in
methods that are widely used:
• Steepest descent method: The choice of Dk is the identity matrix, that is Dk = I. This is the
simplest choice but it often leads to slow convergence.
• Newton’s method The choice of Dk is the inverse Hessian of the objective function the the current
iterate, that is Dk = [∇2 f (x)]−1 . Generally, Newton’s methods converges fast as compare to steepest
descent methods.
We need to choose the stepsize by solving a one-dimensional minimization problem (3.3). An exact solution
is, in general, difficult to obtain. Most practical methods try some candidate s and select the one that
reduces the function value. The main aim is to assure that we get a sufficiently significant decrease in f
without taking a tiny step.
3.3 Stepsize selection
Once the search direction is computed, we have to decide how far to move along that direction. The distance
to move along the search direction dk is determined by the stepsize sk . We would like to choose sk to give
a substantial reduction of f . The ideal choice of sk would be the global minimizer of a one-dimensional
optimization problem:
s∗ = arg min f (xk + sdk ) (3.5)
s>0
By solving (3.5) exactly, we would drive the maximum benefit from the search direction dk − exact line
search. However, exact minimization may be computationally expensive, and it is usually unnecessary.
Most practical methods perform an inexact line search, i.e. they find an approximate solution of (3.5).
A typical inexact line search method tries out a sequence of candidate values for s, stopping to accept
one of these values when certain conditions are satisfied. We impose the following two conditions on the
stepsize: (1) sk has to guarantee a sufficient reduction of the objective function f , and (2) sk has to be
sufficiently distant from 0. The inexact line search methods find the interval of acceptable values for sk
such that these conditions are satisfied. Then a bisection or interpolation can be used to compute a good
stepsize within this interval.
3-2
3.3.1 Armijo Condition
A popular inexact line search condition stipulate that sk must give a sufficient decrease in the objective
function f , as measured by the following inequality:
f (xk + sk dk ) ≤ f (xk ) + c1 sk ∇f (xk )T dk , (Armijo Condition) (3.6)
for some constant c1 ∈ (0, 1). In other words, the reduction in f is proportional to both the step-length sk
and the directional derivative ∇f (xk )T dk . The inequality (3.6) is called the Armijo condition. Can you
spot any weakness of the Armijo condition?
3.3.2 Goldstein Conditions
The Armijo condition may not be enough by itself to ensure that the algorithm makes reasonable progress,
since it is satisfied for all sufficiently small values of s. To rule out unacceptably short steps, the stepsize
sk is required to satisfy the Goldstein conditions defined as:
Definition 3.1. Let c1 and c2 be two constants such that 0 < c1 < c2 < 1. The stepsize sk satisfy the
Goldstein conditions if:
(1) f (xk + sk dk ) ≤ f (xk ) + c1 sk ∇f (xk )T dk ,
(2) f (xk + sk dk ) ≥ f (xk ) + c2 sk ∇f (xk )T dk .
The first condition ensure sufficient reduction in f , while the second condition ensure sufficient distance
between xk and xk+1 . The weak point of Goldstein conditions is that they can exclude the optimal value
of the stepsize.
3.3.3 Wolfe Conditions
An alternative way of preventing the step length from being too short is to impose conditions on the
derivative of the objective function f . We introduce the curvature condition, which require sk to satisfy:
∇f (xk + sk dk )T dk ≥ c2 ∇f (xk )T dk , (3.7)
for some constant c2 ∈ (c1 , 1), where c1 in the constant from the Armijo condition (3.6). The right hand
side of (3.7) is simply the derivative of (3.5) with respect to s. The curvature condition ensure that the
slope of the univariate function f (xk + sdk ) at s = sk is greater than c2 times the initial slope at s = 0.
This makes sense because if the slope ∇f (xk + sdk )T dk is strongly negative, we have an indication that
we can reduce f significantly along the search direction dk . On the other hand, if ∇f (xk + sdk )T dk is
slightly negative or even positive, it is an indication that we cannot expect much more decrease of f along
the direction dk .
We couple the sufficient decrease condition (3.6) with curvature condition (3.7) to define the Wolfe
conditions:
Definition 3.2. Let c1 and c2 be two constants such that 0 < c1 < c2 < 1. The stepsize sk satisfy the
Wolfe conditions if:
(1) f (xk + sk dk ) ≤ f (xk ) + c1 sk ∇f (xk )T dk ,
(2) ∇f (xk + sk dk )T dk ≥ c2 ∇f (xk )T dk .
3-3
The stepsize may satisfy Wolfe’s conditions without being particularly close to a minimizer of (3.5). We
can modify the curvature condition to enforce sk lie in a small neighbourhood of the local minimizer of
(3.5). This lead to the definition of the Strong Wolfe conditions:
Definition 3.3. Let c1 and c2 be two constants such that 0 < c1 < c2 < 1. The step length sk satisfy the
Strong Wolfe conditions if:
(1) f (xk + sk dk ) ≤ f (xk ) + c1 sk ∇f (xk )T dk ,
(2) ∇f (xk + sk dk )T dk ≤ c2 ∇f (xk )T dk .
The second condition of strong Wolfe conditions does not allow the slope ∇f (xk + sdk )T dk to be positive.
Hence, this exclude points that are far from stationary points of (3.5).
3.3.4 Backtracking line search
We look at the implementation of the above conditions. The Armijo condition alone does not ensure
that the method will make reasonable progress along the given search direction. However, suppose the
line search method chooses its candidate steps appropriately by using a so-called backtracking approach.
In that case, we can dispense with the second condition of Goldstein and use the sufficient decrease
condition(Armijo condition) to terminate the line search procedure. In its most basic form, backtracking
proceeds as described in Algorithm 1
Algorithm 1 Backtracking line search
1: Choose s̄ > 0, reduction factor ρ ∈ (0, 1) and c1 ∈ (0, 1); then set s ← s̄.
2: while f (xk + sdk ) > f (xk ) + c1 s∇f (xk )T dk do
3: s ← ρs;
4: end while
5: return sk = s.
In this procedure, the initial stepsize s̄ is chosen to be 1 in Newton methods, but can have different values
in other algorithms such as steepest descent method. The parameter s̄ fixes the search for stepsize to lie
within the interval [0, s̄]. In practice c1 is chosen to be very small, for example c1 = 10−3 . An acceptable
step length will be found after a finite number of trails, because sk will eventually become small enough
that the sufficient decrease condition holds. Notice, other conditions may not be implemented because the
backtracking strategy avoids excessively small steps.
3.4 Convergence
We state convergence results of the gradient methods using the Armijo condition to select the stepsize.
Theorem 3.1. Let {xk } be a sequence of generated by a gradient method (3.4), and assume that dk satisfy
the descent condition and the stepsize sk is chosen using the Armijo condition. Then every limit point of
{xk } is a stationary point.
Refer to the references for a proof of this theorem. Similar theorems can be stated for Goldstein and Wolfe
conditions. The same conclusion of Theorem 3.1 holds if we use the exact minimization of f (xk + sdk )
as the stepsize.
3-4
3.5 Summary
In this lecture, we have discussed inexact line search methods. Along a given descent direction, we take a
step that reduces the function to some acceptable degree. The Armijo condition can measure the acceptable
reduction of the objective function. In addition to the Armijo condition, the Goldstein conditions also
avoid tiny stepsizes. However, Goldstein conditions may fail to bracket the optimal value of the step size.
Wolfe conditions resolve this problem. In practice, implementation of Armijo with backtracking is usually
enough.
PROBLEM SET III
1. Consider the problem of minimizing the function of two variables f (x, y) = 3x2 + y 4 .
(a) Apply one iteration of the steepest descent method with (1, −2) as the starting point and with
the stepsize chosen by the Armijo condition with s̄ = 1, c1 = 0.1, ρ = 0.5.
(b) Repeat (a) using s̄ = 1, c1 = 0.1, ρ = 0.1 instead. How does the cost of the new iterate compare
to that obtained in (a)? Comment on the tradeoffs involved in the choice of ρ.
(c) Apply one iteration of Newton’s method with the same starting point and stepsize rule as in (a).
How does the cost of the new iterate compare to that obtained in (a)? How about the amount
of work involved in finding the new iterate?
2. Show graphically that if 0 < c2 < c1 < 1, there may be no stepsizes that satisfy the Wolfe conditions.
3. Program the steepest descent and Newton methods using backtracking line search. Use them to
minimize the Rosenbrock function. Set the initial stepsize s̄ = 1 and print the stepsize used by each
method at each iteration. First try the initial point x0 = [1.2, 1.2]T and then the more difficult staring
point x0 = [−1.2, 1]T .
3-5