Graduate Texts in Mathematics 181

Editorial Board
S. Axler F. W. Gehring K.A. Ribet

Springer Science+Business Media, LLC


Graduate Texts in Mathematics

1 TAKEUTI/ZARING. Introduction to Axiomatic Set Theory. 2nd ed.
2 OXTOBY. Measure and Category. 2nd ed.
3 SCHAEFER. Topological Vector Spaces.
4 HILTON/STAMMBACH. A Course in Homological Algebra. 2nd ed.
5 MAC LANE. Categories for the Working Mathematician. 2nd ed.
6 HUGHES/PIPER. Projective Planes.
7 SERRE. A Course in Arithmetic.
8 TAKEUTI/ZARING. Axiomatic Set Theory.
9 HUMPHREYS. Introduction to Lie Algebras and Representation Theory.
10 COHEN. A Course in Simple Homotopy Theory.
11 CONWAY. Functions of One Complex Variable I. 2nd ed.
12 BEALS. Advanced Mathematical Analysis.
13 ANDERSON/FULLER. Rings and Categories of Modules. 2nd ed.
14 GOLUBITSKY/GUILLEMIN. Stable Mappings and Their Singularities.
15 BERBERIAN. Lectures in Functional Analysis and Operator Theory.
16 WINTER. The Structure of Fields.
17 ROSENBLATT. Random Processes. 2nd ed.
18 HALMOS. Measure Theory.
19 HALMOS. A Hilbert Space Problem Book. 2nd ed.
20 HUSEMOLLER. Fibre Bundles. 3rd ed.
21 HUMPHREYS. Linear Algebraic Groups.
22 BARNES/MACK. An Algebraic Introduction to Mathematical Logic.
23 GREUB. Linear Algebra. 4th ed.
24 HOLMES. Geometric Functional Analysis and Its Applications.
25 HEWITT/STROMBERG. Real and Abstract Analysis.
26 MANES. Algebraic Theories.
27 KELLEY. General Topology.
28 ZARISKI/SAMUEL. Commutative Algebra. Vol. I.
29 ZARISKI/SAMUEL. Commutative Algebra. Vol. II.
30 JACOBSON. Lectures in Abstract Algebra I. Basic Concepts.
31 JACOBSON. Lectures in Abstract Algebra II. Linear Algebra.
32 JACOBSON. Lectures in Abstract Algebra III. Theory of Fields and Galois Theory.
33 HIRSCH. Differential Topology.
34 SPITZER. Principles of Random Walk. 2nd ed.
35 ALEXANDER/WERMER. Several Complex Variables and Banach Algebras. 3rd ed.
36 KELLEY/NAMIOKA et al. Linear Topological Spaces.
37 MONK. Mathematical Logic.
38 GRAUERT/FRITZSCHE. Several Complex Variables.
39 ARVESON. An Invitation to C*-Algebras.
40 KEMENY/SNELL/KNAPP. Denumerable Markov Chains. 2nd ed.
41 APOSTOL. Modular Functions and Dirichlet Series in Number Theory. 2nd ed.
42 SERRE. Linear Representations of Finite Groups.
43 GILLMAN/JERISON. Rings of Continuous Functions.
44 KENDIG. Elementary Algebraic Geometry.
45 LOEVE. Probability Theory I. 4th ed.
46 LOEVE. Probability Theory II. 4th ed.
47 MOISE. Geometric Topology in Dimensions 2 and 3.
48 SACHS/WU. General Relativity for Mathematicians.
49 GRUENBERG/WEIR. Linear Geometry. 2nd ed.
50 EDWARDS. Fermat's Last Theorem.
51 KLINGENBERG. A Course in Differential Geometry.
52 HARTSHORNE. Algebraic Geometry.
53 MANIN. A Course in Mathematical Logic.
54 GRAVER/WATKINS. Combinatorics with Emphasis on the Theory of Graphs.
55 BROWN/PEARCY. Introduction to Operator Theory I: Elements of Functional Analysis.
56 MASSEY. Algebraic Topology: An Introduction.
57 CROWELL/FOX. Introduction to Knot Theory.
58 KOBLITZ. p-adic Numbers, p-adic Analysis, and Zeta-Functions. 2nd ed.
59 LANG. Cyclotomic Fields.
60 ARNOLD. Mathematical Methods in Classical Mechanics. 2nd ed.

continued after index


Rainer Kress

Numerical Analysis

With 11 Illustrations

Springer
Rainer Kress
Institut für Numerische und
Angewandte Mathematik
Universität Göttingen
D-37083 Göttingen
Germany

Editorial Board
S. Axler, Department of Mathematics, San Francisco State University, San Francisco, CA 94132, USA
F. W. Gehring, Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA
K.A. Ribet, Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720, USA

Mathematics Subject Classification (1991): 65-01

Library of Congress Cataloging-in-Publication Data


Kress, Rainer, 1941-
Numerical analysis / Rainer Kress.
p. cm. - (Graduate texts in mathematics ; 181)
Includes bibliographical references and index.
ISBN 978-1-4612-6833-8 ISBN 978-1-4612-0599-9 (eBook)
DOI 10.1007/978-1-4612-0599-9
1. Numerical analysis. I. Title. II. Series.
QA297.K725 1998
519.4-dc21 97-43748

Printed on acid-free paper.

© 1998 Springer Science+Business Media New York


Originally published by Springer-Verlag New York in 1998
Softcover reprint of the hardcover 1st edition 1998
All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC) except for brief excerpts in connection
with reviews or scholarly analysis. Use in connection with any form of information storage and
retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known
or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this publication, even if the
former are not especially identified, is not to be taken as a sign that such names, as understood by the
Trade Marks and Merchandise Marks Act, may accordingly be used freely by anyone.

Production managed by Karina Mikhli; manufacturing supervised by Jacqui Ashri.


Photocomposed copy prepared from the author's TeX file.

987654321

ISBN 978-1-4612-6833-8
Preface

No applied mathematician can be properly trained without some basic understanding
of numerical methods, i.e., numerical analysis. And no scientist
or engineer should be using a package program for numerical computations
without understanding the program's purpose and its limitations.
This book is an attempt to provide some of the required knowledge and
understanding. It is written in a spirit that considers numerical analysis
not merely as a tool for solving applied problems but also as a challenging
and rewarding part of mathematics. The main goal is to provide insight
into numerical analysis rather than merely to provide numerical recipes.
The book evolved from the courses on numerical analysis I have taught
since 1971 at the University of Göttingen and may be viewed as a successor
of an earlier version jointly written with Bruno Brosowski [10] in 1974. It
aims at presenting the basic ideas of numerical analysis in a style as concise
as possible. Its volume is scaled to a one-year course, i.e., a two-semester
course, addressing second-year students at a German university or advanced
undergraduate or first-year graduate students at an American university.
In order to make the book accessible not only to mathematicians but
also to scientists and engineers, I have planned it to be as self-contained as
possible. As prerequisites it requires only a solid foundation in differential
and integral calculus and in linear algebra as well as an enthusiasm to see
these fundamental and powerful tools in action for solving applied prob-
lems. A short presentation of some basic functional analysis is provided in
the book to the extent required for a modern presentation of numerical
analysis and a deeper understanding of the subject.
An introductory book of a few hundred pages cannot completely cover
all classical aspects of numerical analysis and all of the more recent devel-
opments. I am willing to admit that the choice of some of the topics in the
present volume is biased by my own preferences and that some important
subjects are omitted.
I was taught numerical analysis in the mid sixties by my thesis adviser,
Professor Erich Martensen, at the Technische Hochschule in Darmstadt.
Martensen's perspective on teaching mathematics in general and numeri-
cal analysis in particular had a great and long-lasting impact on my own
teaching. Therefore, this book is dedicated to Erich Martensen on the oc-
casion of his seventieth birthday.
I would like to thank Thomas Gerlach and Peter Otte for carefully read-
ing the book, for checking the solutions to the problems, and for a number
of suggestions for improvements. Special thanks are given to my friend
David Colton for reading over the book for correct use of the English lan-
guage. Part of the book was written while I was on sabbatical leave at the
Department of Mathematical Sciences at the University of Delaware and
the Department of Mathematics at the University of New South Wales. I
gratefully acknowledge the hospitality of these institutions. I also am grate-
ful to Springer-Verlag for being willing to take the economic risk of adding
yet another volume to the already huge number of existing introductions
to numerical analysis.

Göttingen, September 1997 Rainer Kress


Contents

1 Introduction

2 Linear Systems
2.1 Examples for Systems of Equations
2.2 Gaussian Elimination
2.3 LR Decomposition
2.4 QR Decomposition
Problems

3 Basic Functional Analysis
3.1 Normed Spaces
3.2 Scalar Products
3.3 Bounded Linear Operators
3.4 Matrix Norms
3.5 Completeness
3.6 The Banach Fixed Point Theorem
3.7 Best Approximation
Problems

4 Iterative Methods for Linear Systems
4.1 Jacobi and Gauss-Seidel Iterations
4.2 Relaxation Methods
4.3 Two-Grid Methods
Problems

5 Ill-Conditioned Linear Systems
5.1 Condition Number
5.2 Singular Value Decomposition
5.3 Tikhonov Regularization
Problems

6 Iterative Methods for Nonlinear Systems
6.1 Successive Approximations
6.2 Newton's Method
6.3 Zeros of Polynomials
6.4 Least Squares Problems
Problems

7 Matrix Eigenvalue Problems
7.1 Examples
7.2 Estimates for the Eigenvalues
7.3 The Jacobi Method
7.4 The QR Algorithm
7.5 Hessenberg Matrices
Problems

8 Interpolation
8.1 Polynomial Interpolation
8.2 Trigonometric Interpolation
8.3 Spline Interpolation
8.4 Bézier Polynomials
Problems

9 Numerical Integration
9.1 Interpolatory Quadratures
9.2 Convergence of Quadrature Formulae
9.3 Gaussian Quadrature Formulae
9.4 Quadrature of Periodic Functions
9.5 Romberg Integration
9.6 Improper Integrals
Problems

10 Initial Value Problems
10.1 The Picard-Lindelöf Theorem
10.2 Euler's Method
10.3 Single-Step Methods
10.4 Multistep Methods
Problems

11 Boundary Value Problems
11.1 Shooting Methods
11.2 Finite Difference Methods
11.3 The Riesz and Lax-Milgram Theorems
11.4 Weak Solutions
11.5 The Finite Element Method
Problems

12 Integral Equations
12.1 The Riesz Theory
12.2 Operator Approximations
12.3 Nyström's Method
12.4 The Collocation Method
12.5 Stability
Problems

References

Index
Glossary of Symbols

Sets and Spaces

$\mathbb{N}$    set of natural numbers
$\mathbb{Z}$    set of integers
$\mathbb{R}$    set of real numbers
$\mathbb{C}$    set of complex numbers
$|x|$    absolute value of a real or complex number $x$
$(a, b)$    open interval $(a, b) := \{x \in \mathbb{R} : a < x < b\}$
$[a, b]$    closed interval $[a, b] := \{x \in \mathbb{R} : a \le x \le b\}$
$\bar{x}$    conjugate of a complex number $x$
$\mathbb{R}^n$    n-dimensional real Euclidean space
$\mathbb{C}^n$    n-dimensional complex Euclidean space
$C[a, b]$    space of real- or complex-valued continuous functions on the interval $[a, b]$
$C^m[a, b]$    space of m-times continuously differentiable functions
$L^2[a, b]$    space of real- or complex-valued square-integrable functions
$\{a_1, \ldots, a_m\}$    set of m elements $a_1, \ldots, a_m$
$U \times V$    product $U \times V := \{(x, y) : x \in U, y \in V\}$ of two sets $U$ and $V$
$U \setminus V$    difference set $U \setminus V := \{x \in U : x \notin V\}$ for two sets $U$ and $V$
$\bar{U}$    closure of a set $U$
$F : X \to Y$    a mapping with domain $X$ and range in $Y$

Vectors and Matrices

$x = (x_1, \ldots, x_n)$    row vector in $\mathbb{R}^n$ or $\mathbb{C}^n$ with components $x_1, \ldots, x_n$
$x^T = (x_1, \ldots, x_n)^T$    the transpose of $x$, i.e., a column vector
$x^* = (\bar{x}_1, \ldots, \bar{x}_n)^T$    the adjoint of $x$
$A = (a_{jk})$    m x n matrix with elements $a_{jk}$
$A^T$    the transpose of $A$
$A^*$    the adjoint of $A$
$A^\dagger$    the pseudo-inverse of $A$
$A^{-1}$    the inverse of an n x n matrix $A$
$\det A$    the determinant of an n x n matrix $A$
$\operatorname{cond}(A)$    the condition number of an n x n matrix $A$
$\rho(A)$    the spectral radius of an n x n matrix $A$
$I$    the n x n identity matrix
$\operatorname{diag}(a_1, \ldots, a_n)$    diagonal matrix with diagonal elements $a_1, \ldots, a_n$

Norms

$\|\cdot\|$    norm on a linear space
$\|\cdot\|_1$    $\ell_1$ norm of a vector, $L^1$ norm of a function
$\|\cdot\|_2$    $\ell_2$ norm of a vector, $L^2$ norm of a function
$\|\cdot\|_\infty$    maximum norm of a vector or a function
$(\cdot\,, \cdot)$    scalar product on a linear space

Miscellaneous

$\in$    element inclusion
$\subset$    set inclusion
$\cup, \cap$    union and intersection of sets
$\emptyset$    empty set
$O(m)$    a quantity of order $m$
$\square$    end of proof
1
Introduction

Numerical analysis is concerned with the development and investigation of
constructive methods for the numerical solution of mathematical problems.
This objective differs from a pure-mathematical approach as illustrated by
the following three examples.
By the fundamental theorem of algebra, a polynomial of degree n has
n complex zeros. The various proofs of this result, in general, are noncon-
structive and give no procedure for the explicit computation of these zeros.
Numerical analysis provides constructive methods for the actual computa-
tion of the zeros of a polynomial.
The solution of a system of n linear equations for n unknowns can be
given explicitly by Cramer's rule. However, Cramer's rule is only of the-
oretical importance, since for actual computations it is completely useless
for linear systems with more than three unknowns. An important task
in numerical analysis consists in describing and developing more practical
methods for the solution of systems of linear equations.
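
To make the remark on Cramer's rule concrete, the following sketch compares rough multiplication counts. It rests on two assumptions that are not part of the text above: each of the n + 1 determinants in Cramer's rule is expanded by the Leibniz formula (n! products of n factors each), and Gaussian elimination is charged its usual leading-order cost of about n^3/3 multiplications. The function names are hypothetical.

```python
from math import factorial

def cramer_leibniz_multiplications(n):
    """Multiplications for Cramer's rule when each of the n + 1
    determinants is expanded by the Leibniz formula (an assumption)."""
    return (n + 1) * factorial(n) * (n - 1)

def elimination_multiplications(n):
    """Leading-order multiplication count n^3 / 3 for Gaussian elimination."""
    return n ** 3 // 3

# Print both counts side by side for a few system sizes.
for n in (3, 10, 20):
    print(n, cramer_leibniz_multiplications(n), elimination_multiplications(n))
```

Under these assumptions the count for Cramer's rule grows factorially, while the count for elimination grows only like the cube of n.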
By the Picard-Lindelöf theorem, the initial value problem for an ordinary
differential equation has a unique solution (under appropriate regularity assumptions).
Despite the fact that the existence proof in the Picard-Lindelöf
theorem actually is constructive through the use of successive iterations, in
applied mathematics there is need for more effective procedures to numerically
solve the initial value problem.
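
The successive iterations mentioned here can be tried out on a small example. The sketch below assumes the hypothetical test problem u' = u, u(0) = 1, which is not taken from the text, and stores each iterate as a list of exact polynomial coefficients. Each step replaces u by 1 plus the integral of u from 0 to x.

```python
from fractions import Fraction

def picard_step(coeffs):
    """One successive-approximation step for u' = u, u(0) = 1.

    coeffs[i] is the coefficient of x**i; the step returns the
    coefficients of 1 + integral_0^x u(t) dt.
    """
    # Integrating x**i gives x**(i + 1) / (i + 1).
    integrated = [Fraction(0)] + [c / (i + 1) for i, c in enumerate(coeffs)]
    integrated[0] += 1
    return integrated

u = [Fraction(1)]          # starting guess u_0(x) = 1
for _ in range(4):
    u = picard_step(u)
print(u)
```

For this particular problem the iterates turn out to be the Taylor polynomials of the exponential function, so many steps are needed for good accuracy away from the initial point.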
In general, we may say that for the basic problems in numerical analysis
existence and uniqueness of a solution are guaranteed through the results
of pure mathematics. The main topic of numerical analysis is to provide
efficient numerical methods for the actual computation of the solution. In
some cases these numerical methods are actually based on constructive
existence proofs.
By a constructive method we understand a procedure that for any pre-
scribed accuracy determines an approximate solution by a finite number
of computational steps. In general, the number of computational steps of
course will depend on the required accuracy. Only very few methods will
terminate with the exact solution after finitely many computational steps
as, for example, Gaussian elimination for solving a system of linear equa-
tions. In most cases, the numerical methods will only yield approximations
to the exact solution. As a typical example, the numerical evaluation of
a definite integral by the trapezoidal rule will, in general, provide only
an approximate value for the integral. In this context two main questions
arise, namely the question of estimating the error between the exact and
the approximate solution and the question of numerical stability.
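
As a small illustration of a method that yields only approximations, here is a sketch of the composite trapezoidal rule. The test integrand (the sine function on [0, pi], with exact integral 2) and the function name are assumptions made for this example.

```python
import math

def trapezoidal(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for k in range(1, n):
        total += f(a + k * h)
    return h * total

# The exact value of the integral of sin over [0, pi] is 2.
for n in (10, 100, 1000):
    approx = trapezoidal(math.sin, 0.0, math.pi, n)
    print(n, approx, abs(approx - 2.0))
```

The number of function evaluations grows with the requested accuracy, in line with the description of a constructive method given above.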
A numerical method is useful only if it is possible to decide on the accu-
racy of the approximate solution, i.e., if reliable estimates on the difference
between the exact and approximate solution can be given. Therefore, be-
sides the development and design of numerical schemes, a substantial part
of numerical analysis is concerned with the investigation and estimation of
the errors occurring in these schemes. Here one has to discriminate between
the approximation errors, i.e., the errors that arise through replacing the
original problem by an approximate problem, and the roundoff errors, i.e.,
the errors that occur through the fact that in the actual computation, in
general, real numbers are replaced by floating-point decimal numbers with
a fixed number of digits.
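
A roundoff error of this kind can be reproduced with Python's standard `decimal` module, which works with decimal floating-point numbers of a chosen precision. The precision of four digits is an arbitrary choice for this sketch.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 4                      # keep four significant decimal digits
    third = Decimal(1) / Decimal(3)   # 1/3 is rounded to four digits
    back = third * 3                  # multiplying back does not restore 1

print(third, back)
```

The difference between `back` and 1 is a pure roundoff error: no approximation of the underlying problem is involved, only the finite number of digits.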
As far as stability is concerned, one has to distinguish between properly
and improperly posed problems. A problem is called properly posed or
well-posed if the solution depends continuously on the data, i.e., if small
changes in the data cause only small changes in the solution. Otherwise, the
problem is called improperly posed or ill-posed. Numerical approximations
never can circumvent the improper posedness of a problem. However, it is
desirable to control the effects of the ill-posed nature of a problem by an
adequate choice of the numerical method. On the other hand, for properly
posed problems efforts have to be made not to destroy the well-posedness
by a poorly designed numerical approximation.
To the author's taste, the topic of stability and well-posedness is
more challenging from a mathematical perspective than the rather uninspiring
topic of roundoff errors. Therefore, in this book emphasis is given
to ill-posedness and the related issue of ill-conditioning, whereas the dis-
cussion of roundoff errors is given only cursory attention.
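
The following sketch illustrates ill-conditioning on a 2 x 2 linear system whose matrix is nearly singular. The matrix and the right-hand sides are made up for this example. A change of 0.0001 in one entry of the data moves the solution from about (1, 1) to about (0, 2).

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a 2 x 2 linear system by the explicit formula."""
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

# Same matrix, right-hand sides differing by 0.0001 in the second entry.
x = solve_2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0001)
x_perturbed = solve_2x2(1.0, 1.0, 1.0, 1.0001, 2.0, 2.0002)
print(x, x_perturbed)
```

Strictly speaking, a uniquely solvable finite linear system depends continuously on its data, so this example shows ill-conditioning rather than ill-posedness; the amplification factor is large but finite.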
The basic problems of numerical analysis are as old as mathematics it-
self, and for a number of problems there exist classical approaches such as
Newton's method for the solution of nonlinear equations, Gaussian elimi-
nation for the solution of systems of linear equations, Gauss-Seidel and
Jacobi iterations for linear systems, Lagrange interpolation for the approximation
of arbitrary functions by polynomials, Simpson's rule for numerical
integration, and Euler's method for the solution of initial value
problems. However, the main breakthrough of numerical methods is con-
nected with the advances in computer technology made within the last
four decades. Only the electronic computer allows one to perform exten-
sive numerical computations without error and within a reasonable amount
of time. Hence, progress in numerical analysis and computer science have
always been closely interrelated in recent history.
This book will introduce the reader to the following branches of numerical
analysis:
Solution of systems of linear and nonlinear equations,
Numerical solution of matrix eigenvalue problems,
Interpolation and numerical integration,
Numerical solution of initial and boundary value problems for differ-
ential equations,
Numerical solution of integral equations.
Of course, in an introductory exposition of only about three hundred pages
it is impossible to cover all of these areas exhaustively. Therefore, the reader
should not expect a comprehensive treatment of all existing numerical pro-
cedures. As already pointed out in the preface, our goal will be to guide
the reader toward the basic ideas and questions in each of the above top-
ics with an emphasis on the analysis and the understanding of numerical
methods rather than merely their description. In order to achieve this,
we will try to illustrate general principles by way of considering the main
and most important methods, and we will leave aside discussions of more
elaborate details of advanced methods and the consideration of lengthy
subtleties for exceptional cases. Given the rapid development of numerical
methods, a reasonable introduction to numerical analysis has to confine
itself to presenting a solid foundation by restricting the presentation to the
basic principles and procedures.
The book includes a chapter on the necessary basic functional-analytic
tools for the solid mathematical foundation of numerical analysis. These
are indispensable for any deeper study and understanding of numerical
methods, in particular for differential equations and integral equations.
The limit of space and the taste and restrictions in experience of the
author have caused the omission of some important topics such as linear
and nonlinear optimization, approximation theory, and parallel computing,
among others. On the other hand, with separate chapters on the solution
of ill-conditioned systems of linear equations and the numerical solution
of integral equations two topics are included that do not appear in most
introductions to numerical analysis. They are included because of their im-
portance and in order to indicate to the reader where the author's mathe-
matical research interests lie.
A study of numerical analysis remains incomplete without the numer-
ical experience of individually implementing the numerical algorithms. It
is very important to build up a familiarity with numerical methods by actually
seeing the numbers working. For example, one has to complement
the theoretical understanding of the method of successive approximations
by the experience of actually running the numerical schemes. After hav-
ing understood the basic principles of a numerical method, it is important
to develop the ability to actually implement the method numerically and
work with it. In this sense the reader is encouraged to test on the computer
numerically all of the algorithms presented in this book.
The organization of the book is as follows. The first part of the book,
Chapters 2 to 7, covers numerical linear algebra and is concerned with
the solution of systems of linear and nonlinear equations. The necessary
functional-analytic tools will be presented in Chapter 3. The second part
of the book, Chapters 8 to 12, covers numerical analysis and is concerned
with interpolation, numerical integration, and the numerical solution of dif-
ferential and integral equations. At the reader's convenience it is possible to
study most of the second part of the book before reading the first part, with
the exception of the chapter on functional analysis. Each chapter concludes
with a set of problems. These are intended as exercises and applications of
the material given in the chapter.
The references at the end of the book are intended as a possible guide to
some of the literature covering the topics of the individual chapters more
exhaustively. The list of references is not meant as a bibliography on the
vast number of introductions to numerical analysis competing with this
book. However, we explicitly encourage the reader to explore the libraries
and consult some of the other volumes on numerical analysis in order to
develop a broad perspective.
2
Linear Systems

The solution of systems of linear equations arises in various parts of mathematics
and is of central importance in numerical analysis. To illustrate the
significance of linear systems, we will start this chapter by providing some
examples of their occurrence as part of the numerical solution of differential
and integral equations. After seeing the examples, we will proceed with the
solution of systems of linear equations. In principle, we have to distinguish
between two groups of methods for the solution of linear systems:
1. In the so-called direct methods, or elimination methods, the exact solu-
tion, in principle, is determined through a finite number of arithmetic
operations (in real arithmetic leaving aside the influence of roundoff
errors).
2. In contrast to this, iterative methods generate a sequence of approx-
imations to the solution by repeating the application of the same
computational procedure at each step of the iteration. Usually, they
are applied for large systems with special structures that ensure con-
vergence of the successive approximations.
A key consideration for the selection of a solution method for a linear
system is its structure. In some problems, the matrix of the linear system
may be a full matrix, i.e., it has few zero entries. And in other problems,
the matrix may be very large and sparse, i.e., only a small fraction of the
entries are different from zero. Roughly speaking, direct methods are best
for full matrices, whereas iterative methods are best for very large and
sparse matrices.
We will begin our treatment of linear systems by presenting the best-known
and most widely used direct method, which is attributed to Gauss,
since it is based on considerations published by Gauss in 1801 in his Dis-
quisitiones Arithmeticae. The chapter concludes with a brief description of
elimination by orthonormal decomposition.
In this book, for an $m \times n$ matrix $A = (a_{jk})$, $j = 1, \ldots, m$, $k = 1, \ldots, n$,
with real or complex coefficients, $A^T$ shall always denote the transposed
matrix; i.e., $A^T$ is the $n \times m$ matrix with entries

$$(A^T)_{kj} = a_{jk}, \quad k = 1, \ldots, n, \quad j = 1, \ldots, m.$$

By $A^*$ we denote the adjoint of the matrix $A$; i.e., $A^* = \bar{A}^T$ is the transpose
of the matrix with complex conjugate entries. In particular, the transpose
and adjoint of a row vector are column vectors and vice versa.
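
The two definitions can be checked on a small example. The sketch below stores a matrix as a list of rows; the helper names and the sample matrix are hypothetical.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def adjoint(A):
    """Return the conjugate transpose of a matrix stored as a list of rows."""
    return [[entry.conjugate() for entry in col] for col in zip(*A)]

A = [[1 + 2j, 3, 0], [4j, 5, 6 - 1j]]  # a 2 x 3 complex matrix
print(transpose(A))
print(adjoint(A))
```

Both results are 3 x 2 matrices, and for a matrix with real entries the two functions return the same entries.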

2.1 Examples for Systems of Equations


Example 2.1 We consider the discretization of the boundary value prob-
lem for the ordinary differential equation

$$-u''(x) = f(x, u(x)), \quad x \in [0, 1], \tag{2.1}$$

with boundary condition

$$u(0) = u(1) = 0. \tag{2.2}$$

Here, $f : [0,1] \times \mathbb{R} \to \mathbb{R}$ is a given continuous function, and we are looking
for a twice continuously differentiable solution $u : [0,1] \to \mathbb{R}$. Boundary
value problems of this type occur, for example, in the mathematical treat-
ment of vibrations of a string or a rod and in the solution of heat conduction
problems. They often also arise in the solution of problems like the following
Example 2.2 after applying separation of variables. The theory of ordinary
differential equations (see [12]) provides conditions on the right-hand side f
of (2.1), ensuring existence and uniqueness of a solution u to the boundary
value problem (2.1)-(2.2) (for the case of linear differential equations see
also Chapter 11).
For the approximate solution we choose an equidistant subdivision of the
interval [0, 1] by setting

$$x_j = jh, \quad j = 0, \ldots, n + 1,$$

where the step size is given by $h = 1/(n + 1)$ with $n \in \mathbb{N}$. At the internal
grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient in the
differential equation (2.1) by the difference quotient

$$u''(x_j) \approx \frac{1}{h^2}\,[u(x_{j+1}) - 2u(x_j) + u(x_{j-1})]$$

to obtain the system of equations

$$-\frac{1}{h^2}\,[u_{j-1} - 2u_j + u_{j+1}] = f(x_j, u_j), \quad j = 1, \ldots, n,$$

for approximate values $u_j$ to the exact solution $u(x_j)$. This system has to
be complemented by the two boundary conditions $u_0 = u_{n+1} = 0$. For an
abbreviated notation we introduce the $n \times n$ matrix

$$A = \frac{1}{h^2}
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 2
\end{pmatrix}$$

and the vectors $U = (u_1, \ldots, u_n)^T$ and $F(U) = (f(x_1, u_1), \ldots, f(x_n, u_n))^T$.
Then our system of equations, including the boundary conditions, reads

$$AU = F(U). \tag{2.3}$$
For obvious reasons, the above matrix A is called a tridiagonal matrix, and
the vector F is diagonal; i.e., the jth component of F depends only on
the jth component of $U$. If (2.1) is a linear differential equation, i.e., if f
depends linearly on the second variable u, then the tridiagonal system of
equations (2.3) also is linear.
The following two questions will be addressed later in the book (see
Chapter 11):
1. Can we establish existence and uniqueness of a solution to the system
of equations (2.3) for sufficiently small step size h, provided that the
boundary value problem (2.1)-(2.2) itself is uniquely solvable?
2. How large is the error between the approximate solution Uj and the
exact solution u(Xj)? Do we have convergence of the approximate
solution towards the exact solution as $h \to 0$?
At this point we would like only to point out that the discretization of
boundary value problems for ordinary differential equations leads to sys-
tems of equations with a large number of unknowns, since we expect that
in order to achieve a reasonably accurate approximation we need to choose
the step size h sufficiently small. $\square$
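
The discretization of Example 2.1 can be tried out in the linear case where f does not depend on u. The sketch below is only an illustration under stated assumptions: the test problem f(x) = pi^2 sin(pi x), with exact solution u(x) = sin(pi x), is made up, and the helper `solve_tridiagonal` is a hypothetical elimination routine without pivoting, which is adequate here because the matrix A is diagonally dominant.

```python
import math

def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by elimination without pivoting.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused) and
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    for i in range(1, n):              # forward elimination
        factor = lower[i] / d[i - 1]
        d[i] -= factor * upper[i - 1]
        r[i] -= factor * r[i - 1]
    x = [0.0] * n
    x[-1] = r[-1] / d[-1]
    for i in range(n - 2, -1, -1):     # back substitution
        x[i] = (r[i] - upper[i] * x[i + 1]) / d[i]
    return x

def solve_bvp(f, n):
    """Approximate -u'' = f(x), u(0) = u(1) = 0 at n interior grid points."""
    h = 1.0 / (n + 1)
    xs = [j * h for j in range(1, n + 1)]
    off = [-1.0 / h**2] * n            # sub- and superdiagonal of A
    main = [2.0 / h**2] * n            # diagonal of A
    us = solve_tridiagonal(off, main, off, [f(x) for x in xs])
    return xs, us

# Hypothetical test problem: f(x) = pi^2 sin(pi x), exact solution sin(pi x).
xs, us = solve_bvp(lambda x: math.pi**2 * math.sin(math.pi * x), 50)
max_error = max(abs(u - math.sin(math.pi * x)) for x, u in zip(xs, us))
print(max_error)
```

Repeating the run with a larger n gives a first impression of the convergence question raised above.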

Example 2.2 We now consider the discretization of the boundary value
problem for the elliptic partial differential equation

$$-\Delta u(x) = f(x, u(x)), \quad x \in D, \tag{2.4}$$

with Dirichlet boundary condition

$$u(x) = 0, \quad x \in \partial D. \tag{2.5}$$

Here, $D \subset \mathbb{R}^2$ is a bounded domain, $\Delta$ denotes the Laplacian

$$\Delta u = \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2},$$

$f : D \times \mathbb{R} \to \mathbb{R}$ is a given continuous function, and we are looking for
a solution $u : \bar{D} \to \mathbb{R}$ that is continuous in $\bar{D}$ and twice continuously
differentiable in $D$. Boundary value problems of this type arise, for example,
in potential theory and in heat conduction problems. The theory of elliptic
partial differential equations (see [24]) provides conditions on the given
function $f$ that ensure existence and uniqueness of a solution $u$.
For describing a numerical approximation method we restrict ourselves
to the case of the square $D = (0,1) \times (0,1)$. We choose an equidistant
quadratic grid with grid points

$$x_{ij} = (ih, jh), \quad i, j = 0, \ldots, n + 1,$$

where the step size again is given by $h = 1/(n + 1)$ with $n \in \mathbb{N}$. Analogously
to the previous example, at the internal grid points $x_{ij}$, $i, j = 1, \ldots, n$, we
replace the Laplacian by the Laplace difference operator

$$\Delta u(x_{ij}) \approx \frac{1}{h^2}\,[u(x_{i+1,j}) + u(x_{i-1,j}) + u(x_{i,j+1}) + u(x_{i,j-1}) - 4u(x_{ij})].$$

Obviously, for each point $x_{ij}$, this difference operator has nonvanishing
weights only at the four neighboring points on the vertical and horizontal
line through $x_{ij}$. This observation also illustrates why the set of grid points
with nonvanishing weights is called the star associated with the Laplace
difference operator. Using this difference approximation leads to the system
of equations

$$\frac{1}{h^2}\,[4u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}] = f(x_{ij}, u_{ij}), \quad i, j = 1, \ldots, n,$$

for approximate values $u_{ij}$ to the exact solution $u(x_{ij})$. This system has to
be complemented by the boundary conditions

$$u_{0,j} = u_{n+1,j} = 0, \quad j = 0, \ldots, n + 1,$$

at the grid points on the vertical parts and

$$u_{i,0} = u_{i,n+1} = 0, \quad i = 1, \ldots, n,$$

at the grid points on the horizontal parts of the boundary $\partial D$. In order to
write this system in matrix form we rearrange the unknowns by ordering
them row by row with a single index running from 1 to $m$,
where $m = n^2$. Furthermore, we introduce an $m \times m$ matrix $A$ in the form
of an $n \times n$ block tridiagonal matrix

$$A = \frac{1}{h^2}
\begin{pmatrix}
B & -I & & & \\
-I & B & -I & & \\
 & \ddots & \ddots & \ddots & \\
 & & -I & B & -I \\
 & & & -I & B
\end{pmatrix}$$

where $I$ denotes the $n \times n$ identity matrix and $B$ is the $n \times n$ tridiagonal
matrix

$$B =
\begin{pmatrix}
4 & -1 & & & \\
-1 & 4 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 4 & -1 \\
 & & & -1 & 4
\end{pmatrix}.$$

After introducing the vectors $U$ and $F(U)$ analogously to Example 2.1, we
can rewrite the system of equations in the short form

$$AU = F(U), \tag{2.6}$$

which also includes the boundary conditions.
Again we postpone the questions of unique solvability of the system (2.6)
and the problem of convergence and error estimates for later parts of the
book (see Chapter 11). Here, we conclude the example with the observation
that the system has $n^2$ unknowns, where n will be fairly large if the step
size h is sufficiently small in order to achieve a reasonably accurate approximation
to the solution of the boundary value problem. These large systems
of equations arising in the discretization of partial differential equations
call for efficient solution methods. $\square$
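
To see the block tridiagonal structure and the sparsity of the matrix in Example 2.2, one can assemble it for a small grid. The sketch below is an illustration only; the function name is hypothetical, and the particular row-by-row numbering of the unknowns (u_ij stored at position (j - 1) n + i - 1, counting from zero) is an assumption of this sketch.

```python
def laplace_matrix(n):
    """Assemble the n^2 x n^2 matrix A of Example 2.2 as a list of rows.

    The unknown u_ij (i, j = 1, ..., n) is stored at position
    (j - 1) * n + (i - 1), i.e. the grid is ordered row by row.
    """
    h = 1.0 / (n + 1)
    m = n * n
    A = [[0.0] * m for _ in range(m)]
    for j in range(n):
        for i in range(n):
            row = j * n + i
            A[row][row] = 4.0 / h**2
            # Neighbours outside the grid carry the boundary value 0
            # and therefore contribute no matrix entry.
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ii, jj = i + di, j + dj
                if 0 <= ii < n and 0 <= jj < n:
                    A[row][jj * n + ii] = -1.0 / h**2
    return A

A = laplace_matrix(3)
nonzeros = sum(1 for row in A for entry in row if entry != 0.0)
print(len(A), nonzeros)
```

Each row has at most five nonzero entries regardless of n, so the fraction of nonzero entries shrinks rapidly as the grid is refined.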

Example 2.3 Consider the linear integral equation

$$\varphi(x) - \int_0^1 K(x, y)\varphi(y)\, dy = f(x), \quad x \in [0,1],$$

where $K : [0,1] \times [0,1] \to \mathbb{R}$ and $f : [0,1] \to \mathbb{R}$ are given continuous
functions and where we seek a continuous solution $\varphi : [0,1] \to \mathbb{R}$. Such integral
equations either arise directly in the solution of applied problems, or more
often they occur indirectly in the solution of boundary value problems for
differential equations. If the homogeneous form of this equation, i.e., the
integral equation with the right-hand side $f = 0$, admits only the trivial
solution $\varphi = 0$, then for each $f$ the inhomogeneous integral equation has a
unique solution $\varphi$ (see Chapter 12).

For the numerical approximation we replace the integral by the rectangular
sum

$$\int_0^1 K(x, y)\varphi(y)\, dy \approx \frac{1}{n} \sum_{k=1}^n K(x, x_k)\varphi(x_k)$$

with equidistant grid points $x_k = k/n$, $k = 1, \dots, n$. If we require the
approximated equation to be satisfied only at the grid points, we arrive at
the system of linear equations

$$\varphi_j - \frac{1}{n} \sum_{k=1}^n K(x_j, x_k)\varphi_k = f(x_j), \quad j = 1, \dots, n,$$

for approximate values $\varphi_j$ to the exact solution $\varphi(x_j)$. As in the preceding
examples, we postpone the question of unique solvability of the linear
system and the convergence and error analysis (see Chapter 12). □
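
The discretization of Example 2.3 can be tried out with a few lines of Python. The sketch below is an illustration under stated assumptions: the names `solve` and `discretize` are hypothetical, and the test kernel $K(x,y) = xy$ with right-hand side $f(x) = 2x/3$ is chosen here (it is not from the text) because the exact solution is then $\varphi(x) = x$.

```python
def solve(a, y):
    """Gaussian elimination with row interchanges (a is a list of rows)."""
    n = len(y)
    a = [row[:] for row in a]
    y = y[:]
    for m in range(n):
        # bring the entry of largest absolute value in column m to the diagonal
        p = max(range(m, n), key=lambda j: abs(a[j][m]))
        a[m], a[p] = a[p], a[m]
        y[m], y[p] = y[p], y[m]
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m, n):
                a[j][k] -= factor * a[m][k]
            y[j] -= factor * y[m]
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m] - sum(a[m][k] * x[k] for k in range(m + 1, n))
        x[m] = s / a[m][m]
    return x


def discretize(kernel, f, n):
    """Set up and solve phi_j - (1/n) sum_k K(x_j, x_k) phi_k = f(x_j)."""
    x = [k / n for k in range(1, n + 1)]          # grid points x_k = k/n
    a = [[(1.0 if j == k else 0.0) - kernel(x[j], x[k]) / n
          for k in range(n)] for j in range(n)]
    rhs = [f(xj) for xj in x]
    return x, solve(a, rhs)


# Assumed test problem: K(x, y) = x*y, f(x) = 2x/3, exact solution phi(x) = x.
grid, phi = discretize(lambda x, y: x * y, lambda x: 2.0 * x / 3.0, 50)
print("largest error on the grid:", max(abs(p - x) for p, x in zip(phi, grid)))
```

The matrix of this system is full, in contrast to the sparse matrix of the finite difference example.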

Example 2.4 In this last example we will briefly touch on the method of
least squares. Consider some (physical) quantity $u$ depending on time $t$ and
a parameter vector $a = (a_1, \dots, a_n)^T \in \mathbb{R}^n$ in terms of a known function

$$u(t) = f(t; a).$$

In order to determine the values of the parameter $a$ (representing some
physical constants), one can take $m$ measurements of $u$ at different times
$t_1, \dots, t_m$ and then try to find $a$ by solving the system of equations

$$u(t_j) = f(t_j; a), \quad j = 1, \dots, m.$$

If $m = n$, this system consists of $n$ equations for the $n$ unknowns $a_1, \dots, a_n$.
However, in general, the measurements will be contaminated by errors.
Therefore, usually one will take $m > n$ measurements and then will try to
determine $a$ by requiring the deviations

$$u(t_j) - f(t_j; a), \quad j = 1, \dots, m,$$

to be as small as possible. Usually the latter requirement is posed in the
least squares sense, i.e., the parameter $a$ is chosen such that

$$g(a) := \sum_{k=1}^m [u(t_k) - f(t_k; a)]^2$$

attains a minimal value. The necessary conditions for a minimum,

$$\frac{\partial g}{\partial a_j} = 0, \quad j = 1, \dots, n,$$

lead to the normal equations

$$\sum_{k=1}^m [u(t_k) - f(t_k; a)]\, \frac{\partial f(t_k; a)}{\partial a_j} = 0, \quad j = 1, \dots, n,$$

for the method of least squares. These constitute a system of $n$, in general,
nonlinear equations for the $n$ unknowns $a_1, \dots, a_n$. □
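
When $f$ depends linearly on the parameters, the normal equations become a linear system. As a small illustration, assume the model $f(t; a) = a_1 + a_2 t$ (a choice made here, not taken from the text). Setting the derivatives of $g$ with respect to $a_1$ and $a_2$ to zero gives a $2 \times 2$ system, which the hypothetical helper `fit_line` below solves directly.

```python
def fit_line(t, u):
    """Least squares fit of u ~ a1 + a2*t via the 2x2 normal equations.

    Normal equations for this model:
        m*a1      + sum(t)*a2   = sum(u)
        sum(t)*a1 + sum(t^2)*a2 = sum(t*u)
    """
    m = len(t)
    st = sum(t)
    stt = sum(x * x for x in t)
    su = sum(u)
    stu = sum(x * y for x, y in zip(t, u))
    det = m * stt - st * st          # nonzero if at least two times differ
    a1 = (su * stt - st * stu) / det
    a2 = (m * stu - st * su) / det
    return a1, a2


# Made-up measurements scattered around the line u = 1 + 2t.
times = [0.0, 1.0, 2.0, 3.0]
values = [1.1, 2.9, 5.2, 6.8]
print(fit_line(times, values))
```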

At this point, the reader should be convinced of the need for effective
methods for solving large systems of linear and nonlinear equations and be
willing to be introduced to such methods in the subsequent chapters. We
also wish to note that the discretization of differential equations leads to
sparse matrices, whereas for the least squares problem and the discretiza-
tion of integral equations one is faced with full matrices.

2.2 Gaussian Elimination


We proceed with describing the Gaussian elimination method for a system
of linear equations
Ax=y.
Here $A$ is a given $n \times n$ matrix $A = (a_{jk})$ with real (or complex) entries, $y$
a given right-hand side $y = (y_1, \dots, y_n)^T \in \mathbb{R}^n$ (or $\mathbb{C}^n$), and we are looking
for a solution vector $x = (x_1, \dots, x_n)^T \in \mathbb{R}^n$ (or $\mathbb{C}^n$). More explicitly, our
system of equations can be written in the form

$$\sum_{k=1}^n a_{jk} x_k = y_j, \quad j = 1, \dots, n;$$

that is,

$$\begin{aligned}
a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= y_1 \\
a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &= y_2 \\
&\;\;\vdots \\
a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n &= y_n.
\end{aligned}$$

Assuming that the reader is familiar with basic linear algebra, we recall the
following various ways of saying that the matrix $A$ is nonsingular:
1. The inverse matrix $A^{-1}$ exists.
2. For each $y$ the linear system $Ax = y$ has a unique solution.
3. The homogeneous system $Ax = 0$ has only the trivial solution.
4. The determinant of $A$ satisfies $\det A \neq 0$.
5. The rows (columns) of $A$ are linearly independent.
The very basic idea of the Gaussian elimination method is to use the first
equation to eliminate the first unknown from the last n - 1 equations, then
use the new second equation to eliminate the second unknown from the last
n - 2 equations, etc. This way, by n - 1 such eliminations the given linear

system is transformed into an equivalent linear system that is of triangular
form

$$\begin{aligned}
b_{11} x_1 + b_{12} x_2 + \cdots + b_{1n} x_n &= z_1 \\
b_{22} x_2 + \cdots + b_{2n} x_n &= z_2 \\
&\;\;\vdots \\
b_{n-1,n-1} x_{n-1} + b_{n-1,n} x_n &= z_{n-1} \\
b_{nn} x_n &= z_n.
\end{aligned}$$

Recall that two linear systems are called equivalent if every solution of one
is a solution of the other. The triangular system can be solved recursively
by first obtaining $x_n$ from the last equation, then obtaining $x_{n-1}$ from the
second to last equation, etc. This procedure is known as backward substitution.
Explicitly, it is described by $x_n = z_n / b_{nn}$ and

$$x_m = \frac{1}{b_{mm}} \Bigl( z_m - \sum_{k=m+1}^n b_{mk} x_k \Bigr), \quad m = n-1, n-2, \dots, 1.$$
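
The backward substitution formula translates almost literally into code. The following is a minimal sketch in plain Python with zero-based indices; the function name `back_substitution` and the sample system are illustrative choices, and the diagonal entries are assumed to be nonzero.

```python
def back_substitution(b, z):
    """Solve the upper triangular system b x = z, where b is a list of rows."""
    n = len(z)
    x = [0.0] * n
    for m in range(n - 1, -1, -1):        # m = n-1, ..., 0 (zero-based)
        s = z[m]
        for k in range(m + 1, n):
            s -= b[m][k] * x[k]           # subtract the already known terms
        x[m] = s / b[m][m]                # requires b[m][m] != 0
    return x


b = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
z = [1.0, 7.0, 8.0]
print(back_substitution(b, z))   # [1.0, 1.0, 2.0]
```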

We begin by considering a nonsingular matrix $A$. To eliminate the unknown
$x_1$, for $j = 2, \dots, n$ we multiply the first equation by $a_{j1}/a_{11}$ and
subtract the result from the $j$th equation. For this we have to require that
$a_{11} \neq 0$. Since we assume the matrix to be nonsingular, this can be achieved
by reordering the rows or the columns of the given system. This procedure
leads to a system of the form

$$\begin{aligned}
b_{11} x_1 + b_{12} x_2 + \cdots + b_{1n} x_n &= z_1 \\
a_{22}^{(2)} x_2 + \cdots + a_{2n}^{(2)} x_n &= y_2^{(2)} \\
&\;\;\vdots \\
a_{n2}^{(2)} x_2 + \cdots + a_{nn}^{(2)} x_n &= y_n^{(2)}
\end{aligned}$$

with the new coefficients given by

$$b_{1k} := a_{1k}^{(1)}, \quad k = 1, \dots, n,$$

$$a_{jk}^{(2)} := a_{jk}^{(1)} - \frac{a_{j1}^{(1)} a_{1k}^{(1)}}{a_{11}^{(1)}}, \quad j, k = 2, \dots, n,$$

and the new right-hand sides given by

$$z_1 := y_1^{(1)}, \qquad y_j^{(2)} := y_j^{(1)} - \frac{a_{j1}^{(1)} y_1^{(1)}}{a_{11}^{(1)}}, \quad j = 2, \dots, n.$$

Here, for the coefficients and right-hand sides of the original system we
have set $a_{jk}^{(1)} := a_{jk}$ and $y_j^{(1)} := y_j$.
Proceeding in this way, the given $n \times n$ system for the unknowns $x_1, \dots, x_n$
is equivalently transformed into an $(n-1) \times (n-1)$ system for the unknowns
$x_2, \dots, x_n$. Adding a multiple of one row of a matrix to another row does
not change the value of its determinant. Therefore, in the above elimination
the determinant of the system remains the same (with the exception
of a possible change of its sign if the order of rows or columns is changed).
Hence, the resulting $(n-1) \times (n-1)$ system for $x_2, \dots, x_n$ again has a
nonvanishing determinant, and we can apply precisely the same procedure
to eliminate the second unknown $x_2$ from the remaining $(n-1) \times (n-1)$
system.
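By repeating this process we complete the forward elimination, by which
the system of linear equations

$$\begin{aligned}
a_{11}^{(1)} x_1 + a_{12}^{(1)} x_2 + \cdots + a_{1n}^{(1)} x_n &= y_1^{(1)} \\
&\;\;\vdots \\
a_{n1}^{(1)} x_1 + a_{n2}^{(1)} x_2 + \cdots + a_{nn}^{(1)} x_n &= y_n^{(1)}
\end{aligned}$$

with a nonsingular matrix $A = (a_{jk}^{(1)})$ is equivalently transformed into a
triangular system

$$\begin{aligned}
b_{11} x_1 + b_{12} x_2 + \cdots + b_{1n} x_n &= z_1 \\
b_{22} x_2 + \cdots + b_{2n} x_n &= z_2 \\
&\;\;\vdots \\
b_{n-1,n-1} x_{n-1} + b_{n-1,n} x_n &= z_{n-1} \\
b_{nn} x_n &= z_n
\end{aligned}$$

by $n - 1$ recursive elimination steps of the form

$$a_{jk}^{(m+1)} := a_{jk}^{(m)} - \frac{a_{jm}^{(m)} a_{mk}^{(m)}}{a_{mm}^{(m)}}, \quad j, k = m+1, \dots, n,$$

$$y_j^{(m+1)} := y_j^{(m)} - \frac{a_{jm}^{(m)} y_m^{(m)}}{a_{mm}^{(m)}}, \quad j = m+1, \dots, n,$$

for $m = 1, \dots, n-1$. The coefficients and the right-hand sides of the final triangular system are
given by

$$b_{jk} := a_{jk}^{(j)}, \quad k = j, \dots, n, \quad j = 1, \dots, n,$$

and

$$z_j := y_j^{(j)}, \quad j = 1, \dots, n.$$
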

The condition $a_{mm}^{(m)} \neq 0$, which is necessary for performing the algorithm,
always can be achieved by a reordering of the rows or columns, since
otherwise the matrix $A$ would not be nonsingular.
We would like to compress the operations of one elimination step into
the following scheme

$$\begin{array}{|cc|}
\hline
a & b \\
c & d \\
\hline
\end{array}$$

where the rectangle illustrates the remaining part of the matrix and the
right-hand side for which the elimination has to be performed. Here, $a$
stands for the elimination element, or pivot element; the elements $b$ in
the elimination row remain unchanged; the elements $c$ of the elimination
column are replaced by zero (with the exception of the pivot element $a$);
and the remaining elements $d$ are changed according to the rule

$$d \to d - \frac{bc}{a}.$$
We note that in computer calculations, of course, the new values for the
coefficients of the matrix and the right-hand sides can be stored in the
locations held by the old values.
More explicitly, the entire Gaussian elimination can be written in the
following algorithmic form.
Algorithm 2.5 (Gaussian elimination)

1. Forward elimination:

For $m = 1, \dots, n-1$ do

   for $j = m+1, \dots, n$ do

      for $k = m+1, \dots, n$ do $a_{jk} := a_{jk} - \dfrac{a_{jm} a_{mk}}{a_{mm}}$

      $y_j := y_j - \dfrac{a_{jm} y_m}{a_{mm}}$

2. Backward substitution:

For $m = n, n-1, \dots, 1$ do

   $x_m := y_m$

   for $k = m+1, \dots, n$ do $x_m := x_m - a_{mk} x_k$

   $x_m := \dfrac{x_m}{a_{mm}}$
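
A direct transcription of Algorithm 2.5 into Python might look as follows. This is a sketch without any reordering of rows or columns, so it assumes that every pivot $a_{mm}$ encountered is nonzero; the name `gauss_solve` and the sample system are illustrative. The inputs are copied so that the caller's data are not overwritten, although the algorithm itself works in place.

```python
def gauss_solve(a, y):
    """Gaussian elimination without pivoting, following Algorithm 2.5."""
    n = len(y)
    a = [row[:] for row in a]     # work on copies
    y = y[:]
    # 1. Forward elimination
    for m in range(n - 1):
        if a[m][m] == 0.0:
            raise ZeroDivisionError("zero pivot: rows or columns must be reordered")
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[m][k]
            y[j] -= factor * y[m]
    # 2. Backward substitution
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m]
        for k in range(m + 1, n):
            s -= a[m][k] * x[k]
        x[m] = s / a[m][m]
    return x


a = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
y = [7.0, 19.0, 49.0]
print(gauss_solve(a, y))   # [1.0, 2.0, 3.0]
```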

If the matrix $A$ is singular and has rank $r$, the elimination procedure
will terminate after $r$ steps. The matrix of the remaining $(n-r) \times (n-r)$
system for the unknowns $x_{r+1}, \dots, x_n$ is the zero matrix, because otherwise
the rank of $A$ would be different from $r$. Hence, in this case the given linear
system is solvable if and only if the right-hand sides after $r$ elimination
steps satisfy

$$z_{r+1} = \cdots = z_n = 0.$$

The solutions can be found from the triangular system by arbitrarily choosing
$x_{r+1}, \dots, x_n$ and then recursively determining $x_r, \dots, x_1$. This way we
obtain the $(n-r)$-dimensional solution manifold.
In order to control the influence of roundoff errors we want to keep the
quotient $a_{jm}^{(m)} / a_{mm}^{(m)}$ small; i.e., we want to have a large pivot element $a_{mm}^{(m)}$.
Therefore, instead of only requiring $a_{mm}^{(m)} \neq 0$, in practice, either complete
pivoting or partial row or column pivoting is employed. For complete pivoting,
both the rows and the columns are reordered such that $a_{mm}^{(m)}$ has
maximal absolute value in the $(n-m+1) \times (n-m+1)$ matrix remaining
for the $m$th forward elimination step. In order to minimize the additional
computational cost caused by pivoting, for row (or column) pivoting the
rows (or columns) are reordered such that $a_{mm}^{(m)}$ has maximal absolute value
in the elimination column (or row), i.e., in the $m$th column (or row). Of
course, in the actual implementation of the Gaussian elimination algorithm
the reordering of rows and columns need not be done explicitly. Instead,
the interchange may be done only implicitly by leaving the pivot element
at its original location and keeping track of the interchange of rows and
columns through the associated permutation matrix.
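
The following sketch adds partial pivoting to the elimination: in step $m$ the rows are reordered so that the entry of largest absolute value in the elimination column becomes the pivot. For simplicity the rows are swapped explicitly rather than tracked through a permutation, which is a simplification made here; the name `gauss_solve_pivot` and the test system are illustrative.

```python
def gauss_solve_pivot(a, y):
    """Gaussian elimination with explicit row interchanges (partial pivoting)."""
    n = len(y)
    a = [row[:] for row in a]
    y = y[:]
    for m in range(n - 1):
        # choose the row with the largest |a[j][m]| among j = m, ..., n-1
        p = max(range(m, n), key=lambda j: abs(a[j][m]))
        if a[p][m] == 0.0:
            raise ValueError("matrix is singular")
        if p != m:
            a[m], a[p] = a[p], a[m]
            y[m], y[p] = y[p], y[m]
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[m][k]
            y[j] -= factor * y[m]
    if a[n - 1][n - 1] == 0.0:
        raise ValueError("matrix is singular")
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m] - sum(a[m][k] * x[k] for k in range(m + 1, n))
        x[m] = s / a[m][m]
    return x


# The leading entry is zero, so elimination without reordering would fail here.
a = [[0.0, 2.0, 1.0],
     [1.0, 1.0, 1.0],
     [2.0, 1.0, 3.0]]
y = [7.0, 6.0, 13.0]
print(gauss_solve_pivot(a, y))
```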
The following example illustrates that partial pivoting does not always
prevent loss of accuracy in the numerical computations.

Example 2.6 We consider the system

$$\begin{aligned}
x_1 + 200 x_2 &= 100 \\
x_1 + x_2 &= 1
\end{aligned}$$

with the exact solution $x_1 = 100/199 = 0.502\ldots$, $x_2 = 99/199 = 0.497\ldots$.
For the following computations we use two-decimal-digit floating-point
arithmetic. Column pivoting leads to $a_{11}$ as pivot element, and the elimination
yields

$$\begin{aligned}
x_1 + 200 x_2 &= 100 \\
-200 x_2 &= -99,
\end{aligned}$$

since $199 = 200$ in two-digit floating-point representation. From the second
equation we then have $x_2 = 0.50$ ($0.495 = 0.50$ in two decimal digits), and
from the first equation it finally follows that $x_1 = 0$.
However, if by complete pivoting we choose $a_{12}$ as pivot element, the
elimination leads to

$$\begin{aligned}
x_1 + 200 x_2 &= 100 \\
x_1 &= 0.5
\end{aligned}$$

($0.995 = 1.00$ in two decimal digits), and from this we get the solution
$x_1 = 0.5$, $x_2 = 0.5$ ($0.4975 = 0.50$ in two decimal digits), which is correct
to two decimal digits. □
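
The two computations of Example 2.6 can be replayed with Python's `decimal` module by rounding every intermediate result to two significant digits. This is a sketch under an assumption: the text does not specify a rounding rule, so round-half-up is used here. The function names `pivot_a11` and `pivot_a12` are illustrative.

```python
from decimal import Context, ROUND_HALF_UP

# Two-significant-digit decimal arithmetic (round-half-up is an assumption).
fl = Context(prec=2, rounding=ROUND_HALF_UP)


def num(value):
    """Round a literal to two significant digits."""
    return fl.create_decimal(str(value))


def pivot_a11():
    """Eliminate x1 from the second equation using a11 = 1 as pivot."""
    a11, a12, y1 = num(1), num(200), num(100)
    a21, a22, y2 = num(1), num(1), num(1)
    factor = fl.divide(a21, a11)
    a22_new = fl.subtract(a22, fl.multiply(factor, a12))   # -199 rounds to -200
    y2_new = fl.subtract(y2, fl.multiply(factor, y1))      # -99
    x2 = fl.divide(y2_new, a22_new)                        # 0.495 rounds to 0.50
    x1 = fl.divide(fl.subtract(y1, fl.multiply(a12, x2)), a11)
    return x1, x2


def pivot_a12():
    """Eliminate x2 from the second equation using a12 = 200 as pivot."""
    a11, a12, y1 = num(1), num(200), num(100)
    a21, a22, y2 = num(1), num(1), num(1)
    factor = fl.divide(a22, a12)
    a21_new = fl.subtract(a21, fl.multiply(factor, a11))   # 0.995 rounds to 1.0
    y2_new = fl.subtract(y2, fl.multiply(factor, y1))      # 0.5
    x1 = fl.divide(y2_new, a21_new)
    x2 = fl.divide(fl.subtract(y1, fl.multiply(a11, x1)), a12)
    return x1, x2


print("pivot a11:", pivot_a11())
print("pivot a12:", pivot_a12())
```

With the pivot $a_{11}$ the computed $x_1$ is zero, while the pivot $a_{12}$ gives both unknowns as $0.5$, in agreement with the example.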

Since complete pivoting is more costly than partial pivoting, in practical
computations one can try to overcome the disadvantages of partial pivoting
by scaling the matrix. This means that if $B = D_1 A D_2$, in order to obtain
the solution $x$ of $Ax = y$ we first solve $Bz = D_1 y$ for $z$ and then determine
$x$ from $x = D_2 z$. Here $D_1$ and $D_2$ are some diagonal matrices chosen such
that for the matrix $B$ the row and column sums of the absolute values are
approximately equal. A diagonal matrix $D = (d_{jk})$ is a matrix with the
off-diagonal elements equal to zero; i.e., $d_{jk} = 0$ for $j \neq k$. For a detailed
discussion of scaling we refer to [27]. Unfortunately, there is no known
general procedure for such scaling, i.e., for choosing the diagonal matrices
$D_1$ and $D_2$.
For an estimate of the computational cost of Gaussian elimination we
perform a count of the number of multiplications (divisions are counted as
multiplications). By $a_n$ we denote the
number of multiplications that are required for solving a triangular $n \times n$
system by back substitution. Obviously, for $a_n$ we have the recurrence
relation

$$a_n = a_{n-1} + n,$$

since we need $n$ multiplications to obtain $x_1$ from the first equation after
having already determined $x_2, \dots, x_n$. Hence, we have

$$a_n = \sum_{k=1}^n k = \frac{n(n+1)}{2},$$

since $a_1 = 1$. By $\beta_{n,r}$ we denote the number of multiplications needed
for the forward elimination simultaneously for $r$ different right-hand sides.
Here we have the recurrence relation

$$\beta_{n,r} = \beta_{n-1,r} + (n + r)(n - 1),$$

since the elimination of the unknown $x_1$ requires $n + r$ multiplications for
each row of the $n - 1$ rows. From this it follows that

$$\beta_{n,r} = \sum_{k=1}^n (k + r)(k - 1) = \frac{1}{3}\, n(n^2 - 1) + \frac{1}{2}\, r n (n - 1),$$

because $\beta_{1,r} = 0$. Adding $r a_n$ and $\beta_{n,r}$ we obtain the following result.

Theorem 2.7 Gaussian elimination for the simultaneous solution of an
$n \times n$ system for $r$ different right-hand sides requires a total of

$$\frac{n^3}{3} + r n^2 - \frac{n}{3}$$

multiplications.
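
The operation count can be checked by brute force. The sketch below walks through the loops of forward elimination and back substitution for $r$ right-hand sides and counts multiplications and divisions, without doing any arithmetic on matrix entries. It then compares the result with the closed form $n^3/3 + r n^2 - n/3$ that follows from solving the two recurrences above. The function names are illustrative.

```python
def count_multiplications(n, r=1):
    """Count multiplications and divisions for n unknowns, r right-hand sides."""
    count = 0
    # Forward elimination: step m treats rows j = m+1, ..., n.
    for m in range(1, n):
        for j in range(m + 1, n + 1):
            count += 1          # quotient a_jm / a_mm
            count += n - m      # updates of a_jk for k = m+1, ..., n
            count += r          # updates of the r right-hand sides
    # Back substitution, once per right-hand side.
    for _ in range(r):
        for m in range(n, 0, -1):
            count += n - m      # products a_mk * x_k for k = m+1, ..., n
            count += 1          # division by a_mm
    return count


def closed_form(n, r=1):
    """n^3/3 + r*n^2 - n/3, evaluated in integer arithmetic."""
    return (n**3 - n) // 3 + r * n**2


for n in (1, 2, 5, 10, 50):
    print(n, count_multiplications(n), closed_form(n))
```

For $n = 5$ and $r = 1$ both functions give 65.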

The computational cost, counting only the multiplications, in Gaussian
elimination is $n^3/3 + O(n^2)$. It is left to the reader to show that the number
of additions is also $n^3/3 + O(n^2)$ (see Problem 2.7). Doubling the number
of unknowns increases the computation time by a factor of eight. Assuming
$1\ \mu\text{sec} = 10^{-6}$ sec per addition and multiplication, i.e., on a computer with
one million floating point operations per second, the solution of a system
with $n = 10^3$ requires approximately ten minutes, and with $n = 10^4$ it
requires approximately six days. This illustrates dramatically that for the
solution of large linear systems iterative methods, which we will study in
Chapter 4, are better suited than direct methods. Row or column pivoting
leads to an additional cost proportional to $n^2$, whereas complete pivoting
adds costs proportional to $n^3$. For the latter reason, complete pivoting is
used only rarely in practical computations.
The Gaussian algorithm also allows the computation of the determinant
and the inverse of a matrix A. The determinant det A is simply given by the
product of the diagonal elements in the triangular matrix obtained through
the elimination procedure. If the determinant is computed using expansions
by submatrices, then the operational count is $n!$ multiplications, as compared
to $n^3/3$ for Gaussian elimination. This illustrates why Cramer's rule
for the solution of linear systems is only a theoretical mathematical tool
and not a tool for practical computations.
The inverse of a matrix is obtained by solving the linear system simul-
taneously for the n right-hand sides given by the columns of the identity
matrix, i.e., by solving the n systems

$$A x_i = e_i, \quad i = 1, \dots, n,$$

where $e_i$ is the $i$th column of the identity matrix. Then the $n$ solutions
$x_1, \dots, x_n$ will provide the columns of the inverse matrix $A^{-1}$. We would
like to stress that one does not want to solve a system $Ax = y$ by first
computing $A^{-1}$ and then evaluating $x = A^{-1} y$, since this generally leads
to considerably higher computational costs.
The Gauss-Jordan method is an elimination algorithm that in each step
eliminates the unknown both above and below the diagonal. The complete
elimination procedure transforms the system equivalently into a
diagonal system. The multiplication count shows a computational cost of
order $n^3/2 + O(n^2)$, i.e., an increase of 50 percent over Gaussian elimination.
Hence, the Gauss-Jordan method is rarely used in applications. For
details we refer to [26, 27].

2.3 LR Decomposition
In the sequel we will indicate how Gaussian elimination provides an LR
decomposition (or factorization) of a given matrix.
Definition 2.8 A factorization of a matrix A into a product
A=LR
of a lower (left) triangular matrix L and an upper (right) triangular matrix
R is called an LR decomposition of A.
A matrix A = (ajk) is called lower triangular or left triangular if ajk = 0
for j < k; it is called upper triangular or right triangular if ajk = 0 for
j > k. The product of two lower (upper) triangular matrices again is lower
(upper) triangular, lower (upper) triangular matrices with nonvanishing
diagonal elements are nonsingular, and the inverse matrix of a lower (upper)
triangular matrix again is lower (upper) triangular (see Problem 2.14).
Theorem 2.9 For a nonsingular matrix A, Gaussian elimination (without
reordering rows and columns) yields an LR decomposition.
Proof. In the first elimination step we multiply the first equation by $a_{j1}/a_{11}$
and subtract the result from the $j$th equation; i.e., the matrix $A_1 = A$ is
multiplied from the left by the lower triangular matrix

$$L_1 =
\begin{pmatrix}
1 & & & \\
-\dfrac{a_{21}}{a_{11}} & 1 & & \\
\vdots & & \ddots & \\
-\dfrac{a_{n1}}{a_{11}} & & & 1
\end{pmatrix}.$$

The resulting matrix $A_2 = L_1 A_1$ is of the form

$$A_2 = \begin{pmatrix} a_{11} & * \\ 0 & A_{n-1} \end{pmatrix},$$

where $A_{n-1}$ is an $(n-1) \times (n-1)$ matrix. In the second step the same
procedure is repeated for the $(n-1) \times (n-1)$ matrix $A_{n-1}$. The corresponding
$(n-1) \times (n-1)$ elimination matrix is completed as an $n \times n$ triangular
matrix $L_2$ by setting the diagonal element in the first row equal to one. In
this way, $n - 1$ elimination steps lead to

$$L_{n-1} \cdots L_1 A = R,$$

with nonsingular lower triangular matrices $L_1, \dots, L_{n-1}$ and an upper
triangular matrix $R$. From this we find

$$A = LR,$$

where $L$ denotes the inverse of the product $L_{n-1} \cdots L_1$. □
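
In code, the factor $L$ does not have to be formed by inverting a product of elimination matrices: it is enough to store each quotient $a_{jm}/a_{mm}$ used in step $m$ in position $(j, m)$ of a unit lower triangular matrix. The sketch below does this without reordering and therefore assumes that all pivots are nonzero; `lr_decomposition` and `matmul` are illustrative names.

```python
def lr_decomposition(a):
    """Return (l, r) with a = l r, l unit lower and r upper triangular."""
    n = len(a)
    r = [row[:] for row in a]
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for m in range(n - 1):
        for j in range(m + 1, n):
            l[j][m] = r[j][m] / r[m][m]      # elimination quotient
            for k in range(m, n):
                r[j][k] -= l[j][m] * r[m][k]
    return l, r


def matmul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a = [[2.0, 1.0, 1.0],
     [4.0, 3.0, 3.0],
     [8.0, 7.0, 9.0]]
l, r = lr_decomposition(a)
print(l)              # [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [4.0, 3.0, 1.0]]
print(r)              # [[2.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
print(matmul(l, r) == a)
```

Once $L$ and $R$ are available, each further right-hand side costs only one forward and one backward substitution.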
We wish to point out that not every nonsingular matrix allows an LR
decomposition. For example,

has no LR decomposition. However, since Gaussian elimination with row


reordering always works, for each nonsingular matrix $A$ there exists a
permutation matrix $P$ such that $PA$ has an LR decomposition (see Problem
2.16). A permutation matrix is a matrix of the form $P = (e_{p(1)}, \dots, e_{p(n)})$
where $e_1, \dots, e_n$ are the columns of the identity matrix and $p(1), \dots, p(n)$
is a permutation of $1, \dots, n$.
Recall that an $n \times n$ matrix $A$ is called symmetric if it has real coefficients
and $A = A^T$. A symmetric matrix $A$ is called positive definite if $x^T A x > 0$
for all $x \in \mathbb{R}^n$ with $x \neq 0$. Positive definite matrices have positive diagonal
elements (see Problem 2.10), and therefore a reordering of rows and columns
is not necessary for Gaussian elimination (for pivoting, the largest diagonal
element is chosen). It can be shown (see Problem 2.13) that symmetry and
positive definiteness are preserved throughout the elimination if diagonal
elements are taken as pivot elements. Therefore, for symmetric positive
definite matrices the LR decomposition is always possible. If $A = LR$, then
we have also $A = A^T = R^T L^T$, and from Problem 2.15 we can deduce that
$L$ can be normalized such that $A = LL^T$. Such a decomposition is used
in the Cholesky method for the solution of linear systems with symmetric
positive definite matrices. Because of symmetry, the computational cost
for the Cholesky method is $n^3/6 + O(n^2)$ multiplications and $n^3/6 + O(n^2)$
additions. For details we refer to [26, 27].
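
The text does not spell out the Cholesky algorithm. A common column-by-column formulation, given here as a sketch rather than as the method of the references, computes the entries of $L$ directly from $A = LL^T$. The function name `cholesky` and the sample matrix are illustrative.

```python
import math


def cholesky(a):
    """Return lower triangular l with a = l l^T for symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: l_jj^2 = a_jj - sum_k l_jk^2
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l


a = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
print(cholesky(a))   # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

Only the lower triangle of $A$ is read, which reflects the savings due to symmetry.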

2.4 QR Decomposition
We conclude this chapter by describing a second elimination method for
linear systems, which leads to a QR decomposition.

Definition 2.10 A factorization of a matrix A into a product

A=QR
of a unitary matrix Q and an upper (right) triangular matrix R is called a
QR decomposition of A.

We recall that a matrix Q is called unitary if

QQ* = Q*Q = I.
The product of two unitary matrices again is unitary.
In terms of the columns of the matrices $A = (a_1, \dots, a_n)$ and
$Q = (q_1, \dots, q_n)$ and the coefficients of $R = (r_{jk})$, the QR decomposition
$A = QR$ means that

$$a_k = \sum_{i=1}^k r_{ik} q_i, \quad k = 1, \dots, n. \tag{2.7}$$

Hence, the vectors $a_1, \dots, a_n$ of $\mathbb{C}^n$ have to be orthonormalized from the
left to the right into an orthonormal basis $q_1, \dots, q_n$. This, for example, can
be achieved by the Gram-Schmidt orthonormalization procedure (see
Theorem 3.18). However, since the Gram-Schmidt orthonormalization tends to
be numerically unstable, we describe the QR decomposition by Householder
matrices.
Definition 2.11 A matrix $H$ of the form

$$H = I - 2vv^*,$$

where $v$ is a column vector with $v^*v = 1$, i.e., a unit vector, is called a
Householder matrix.
Remark 2.12 Householder matrices are unitary and satisfy $H = H^*$.
Proof. We compute

$$H^* = I^* - 2(vv^*)^* = I - 2vv^* = H$$

and

$$HH^* = H^*H = (I - 2vv^*)(I - 2vv^*) = I - 4vv^* + 4vv^*vv^* = I,$$

where we use that $v^*v = 1$. □

Geometrically a Householder matrix corresponds to reflection across the
plane through the origin orthogonal to $v$. To see this we write

$$x = vv^*x + y$$

with the component $vv^*x$ of $x \in \mathbb{C}^n$ in the $v$-direction and a component $y$
orthogonal to $v$. Then we obtain

$$Hx = x - 2vv^*x = -vv^*x + y;$$

i.e., $Hx$ has the opposite component $-vv^*x$ in the $v$-direction and the
same component $y$ orthogonal to $v$. Because of this property, Householder
matrices are also called elementary reflection matrices.
We now describe the elimination of the unknown $x_1$ by multiplying $A$
from the left by a Householder matrix $H_1 = I - 2v_1v_1^*$. By $a_1$ we denote
the first column of $A$ and by $e_k$ the $k$th column of the identity matrix; in
particular, $e_1 = (1, 0, \dots, 0)^*$. Then the first column $b_1$ of the product $H_1A$
is given by

$$b_1 = H_1 A e_1 = H_1 a_1 = a_1 - 2v_1v_1^*a_1.$$

We would like to achieve that $b_1 = \sigma e_1$ with $\sigma \neq 0$. Hence, except for the
first row, $v_1$ must be a multiple of $a_1$. Therefore, we try

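$$u_1 = a_1 \mp \sigma e_1 \tag{2.8}$$

with

$$\sigma = \begin{cases}
\dfrac{a_{11}}{|a_{11}|} \sqrt{a_1^* a_1}, & a_{11} \neq 0, \\[2ex]
\sqrt{a_1^* a_1}, & a_{11} = 0.
\end{cases}$$

Then we have

$$u_1^* u_1 = 2 a_1^* a_1 \mp 2 |a_{11}| \sqrt{a_1^* a_1}$$

and

$$u_1^* a_1 = a_1^* a_1 \mp |a_{11}| \sqrt{a_1^* a_1} = \frac{1}{2}\, u_1^* u_1.$$

Without loss of generality we may assume that $\sqrt{a_1^* a_1} - |a_{11}| > 0$, since
otherwise we would have that $a_1 = a_{11} e_1$, i.e., that the first column already
has the required form. Therefore, if we finally choose

$$v_1 = \frac{u_1}{\sqrt{u_1^* u_1}},$$

then $v_1$ is a unit vector, and as requested we have

$$H_1 a_1 = a_1 - \frac{2 u_1 u_1^* a_1}{u_1^* u_1} = a_1 - u_1 = \pm \sigma e_1.$$

The remaining columns $b_k = H_1 A e_k$ are obtained from the columns $a_k$ of
$A$ by

$$b_k = a_k - 2 v_1 v_1^* a_k = a_k - \frac{2 u_1^* a_k}{u_1^* u_1}\, u_1, \quad k = 2, \dots, n.$$

From the two possible signs in (2.8) the positive sign yields the numerically
more stable variant.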
The same procedure is now repeated for the remaining $(n-1) \times (n-1)$
matrix. The corresponding $(n-1) \times (n-1)$ Householder matrix has to be
completed as an $n \times n$ Householder matrix. In general, if $A_k$ is an $n \times n$
matrix of the form

$$A_k = \begin{pmatrix} R_k & * \\ 0 & A_{n-k} \end{pmatrix}$$

with a $k \times k$ upper triangular matrix $R_k$ and an $(n-k) \times (n-k)$ matrix
$A_{n-k}$, we apply the Householder transformation described above with the
first column of $A_{n-k}$. With the corresponding $(n-k) \times (n-k)$ Householder
matrix $H_{n-k}$ the $n \times n$ matrix

$$H_k = \begin{pmatrix} I_k & 0 \\ 0 & H_{n-k} \end{pmatrix}$$

yields an $n \times n$ Householder matrix $H_k$ that leaves the first $k$ columns
in triangular form and, in addition, transforms the $(k+1)$st column into
triangular form. In this way, after at most $n - 1$ steps, we arrive at

$$H_{n-1} \cdots H_1 A = R$$

with Householder matrices $H_1, \dots, H_{n-1}$ and an upper triangular matrix
$R$. From this we obtain

$$A = QR$$

with the unitary matrix

$$Q = (H_{n-1} \cdots H_1)^* = H_1 \cdots H_{n-1}.$$
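
The Householder procedure can be sketched in plain Python for real square matrices; the complex case is omitted here for brevity. In each step the vector $u$ is formed from the current column by adding $\operatorname{sign}(x_1)\,\lVert x \rVert$ to its first entry, which avoids cancellation, and the reflection $I - 2uu^T/(u^Tu)$ is applied without ever forming the Householder matrix explicitly. The names `householder_qr` and `matmul` are illustrative.

```python
import math


def householder_qr(a):
    """Return (q, r) with a = q r, q orthogonal, r upper triangular (real a)."""
    n = len(a)
    r = [row[:] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        x = [r[i][k] for i in range(k, n)]          # column k from the diagonal down
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue                                # nothing to eliminate
        u = x[:]
        u[0] += math.copysign(norm_x, x[0])         # sign chosen to avoid cancellation
        uu = sum(t * t for t in u)
        # r <- H r, where H = I - 2 u u^T / (u^T u) acts on rows k, ..., n-1
        for j in range(n):
            s = 2.0 * sum(u[i] * r[k + i][j] for i in range(n - k)) / uu
            for i in range(n - k):
                r[k + i][j] -= s * u[i]
        # q <- q H, so that finally q = H_1 H_2 ... H_{n-1}
        for i in range(n):
            s = 2.0 * sum(q[i][k + j] * u[j] for j in range(n - k)) / uu
            for j in range(n - k):
                q[i][k + j] -= s * u[j]
    return q, r


def matmul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


a = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
q, r = householder_qr(a)
residual = max(abs(matmul(q, r)[i][j] - a[i][j]) for i in range(3) for j in range(3))
print("max |QR - A| =", residual)
```

Entries of $R$ below the diagonal are zero only up to roundoff in this sketch; a production code would set them to zero explicitly.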

We summarize our result in the following theorem.

Theorem 2.13 To each n x n matrix a QR decomposition can be obtained


through n - 1 Householder transformations.

The elimination by QR decomposition via Householder matrices can be
considered as an alternative to Gaussian elimination, since it does not need
pivoting. However, the operation count shows that $2n^3/3 + O(n^2)$
multiplications are required (see Problem 2.18), i.e., twice the cost of Gaussian
elimination, and the added expense of partial pivoting in Gaussian
elimination does not close this gap. Hence, QR decomposition is rarely used
for the solution of linear systems. But later in this book we will see that
QR decomposition is an essential part of one of the best algorithms for
numerically computing the eigenvalues of a matrix (see Section 7.4).

Problems
2.1 Solve the linear system

X3 = 10

by Gaussian elimination.

2.2 Write a computer program for the solution of a system of linear equations
by Gaussian elimination with partial pivoting and test it for various examples.
You will need this code as part of other numerical algorithms later in this book.

2.3 Describe pivoting in Gaussian elimination by using permutation matrices.

2.4 Let A and B be two n x n matrices. Show that if AB is nonsingular, then


A and B are nonsingular.
2.5 Let $A$, $B$, $C$, and $D$ be $n \times n$ matrices and let $A$ be nonsingular. Show that

$$\det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det A \, \det(D - CA^{-1}B).$$

2.6 Verify the summation formulas

$$\sum_{k=1}^n k = \frac{1}{2}\, n(n+1) \quad \text{and} \quad \sum_{k=1}^n k^2 = \frac{1}{6}\, n(n+1)(2n+1)$$

that were used in the proof of Theorem 2.7.

2.7 Prove the analogue of Theorem 2.7 for the number of additions in Gaussian
elimination.

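2.8 Show that tridiagonal matrices

$$A =
\begin{pmatrix}
a_1 & c_1 & & & \\
b_2 & a_2 & c_2 & & \\
 & \ddots & \ddots & \ddots & \\
 & & b_{n-1} & a_{n-1} & c_{n-1} \\
 & & & b_n & a_n
\end{pmatrix}$$

with the properties

$$|a_j| \ge |b_j| + |c_j|, \quad b_j c_j \neq 0, \quad j = 2, \dots, n-1,$$

and $|a_1| > |c_1| > 0$ and $|a_n| > |b_n| > 0$ are nonsingular.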
2.9 Show that Gaussian elimination for tridiagonal n x n matrices requires 4n
multiplications.

2.10 Show that the diagonal elements of a positive definite matrix are positive.

2.11 Prove that if $A = LL^T$ where $L$ is a real lower triangular nonsingular $n \times n$
matrix, then $A$ is symmetric and positive definite.
2.12 Show that

is not positive definite.


2.13 Show that for a symmetric positive definite matrix the symmetry and
positive definiteness are preserved in Gaussian elimination if diagonal elements are
taken as pivot elements, i.e., the submatrices $a_{jk}^{(m)}$, $j, k = m, \dots, n$, are symmetric
and positive definite.
2.14 Show that the product of two lower (upper) triangular matrices again is
lower (upper) triangular, that lower (upper) triangular matrices with nonvanish-
ing diagonal elements are nonsingular, and that the inverse matrix of a lower
(upper) triangular matrix again is lower (upper) triangular.
2.15 Let $A$ be a nonsingular matrix and suppose $A = L_1R_1 = L_2R_2$, where $L_1$
and $L_2$ are lower triangular matrices with diagonal elements equal to one and $R_1$
and $R_2$ are upper triangular matrices. Show that $L_1 = L_2$ and $R_1 = R_2$.
2.16 Show that for each nonsingular n x n matrix A there exists a permutation
matrix P such that P A has an LR decomposition.
2.17 Solve the linear system
XI + 6X2 2X3 =5

2xI + X2 2X3

2Xl + 2X2 + 6X3 10


by QR decomposition.
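2.18 Show that the solution of an $n \times n$ linear system by QR elimination with
Householder matrices requires $2n^3/3 + O(n^2)$ multiplications.
2.19 Let $A$ be a complex $n \times n$ matrix and $y \in \mathbb{C}^n$ and assume that $A$, $\operatorname{Re} A$,
and $\operatorname{Im} A$ are nonsingular. Show that the $n \times n$ complex linear system $Ax = y$ is
equivalent to the two $n \times n$ real systems

$$\{(\operatorname{Im} A)^{-1} \operatorname{Re} A + (\operatorname{Re} A)^{-1} \operatorname{Im} A\} \operatorname{Re} x = (\operatorname{Im} A)^{-1} \operatorname{Re} y + (\operatorname{Re} A)^{-1} \operatorname{Im} y,$$

$$\{(\operatorname{Im} A)^{-1} \operatorname{Re} A + (\operatorname{Re} A)^{-1} \operatorname{Im} A\} \operatorname{Im} x = (\operatorname{Im} A)^{-1} \operatorname{Im} y - (\operatorname{Re} A)^{-1} \operatorname{Re} y.$$

2.20 Use QR decomposition to prove Hadamard's inequality

$$|\det A|^2 \le \prod_{j=1}^n \sum_{k=1}^n |a_{jk}|^2$$

for the determinant of an $n \times n$ matrix $A = (a_{jk})$.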


3
Basic Functional Analysis

In the subsequent chapters we want to discuss iterative methods for the


solution of systems of linear and nonlinear equations. For this we will need
some fundamental concepts of functional analysis, which we will start to
develop now. We shall use these functional-analytic tools also in later parts
of this book in some of our convergence and error analysis for the approx-
imate solution of differential and integral equations.
We begin by introducing the notions of normed spaces and their ele-
mentary properties, where we assume that the reader is familiar with the
concept of linear spaces or vector spaces and their basic properties. Then
we proceed by considering scalar product spaces as special cases of normed
spaces.
We will continue with the discussion of linear and continuous operators
acting between normed spaces. Particular attention is given to linear oper-
ators between finite-dimensional spaces, i.e., to matrices and their various
norms. The main part of this chapter is Banach's fixed point theorem, also
known as the contraction mapping principle, which is one of the most im-
portant tools in numerical analysis and is the fundamental basis of our
investigations of iterative methods for linear and nonlinear systems. At the
end of the chapter we will introduce some of the basic concepts of approx-
imation theory, which will be useful later in other parts of this book.
For a broader and more detailed study we refer to [5, 34, 35, 39, 59] or
any other introductory book on functional analysis.

3.1 Normed Spaces


Definition 3.1 Let $X$ be a complex (or real) linear space (vector space).
A function $\|\cdot\| : X \to \mathbb{R}$ with the properties

(N1) $\|x\| \ge 0$, (positivity)

(N2) $\|x\| = 0$ if and only if $x = 0$, (definiteness)

(N3) $\|\alpha x\| = |\alpha| \, \|x\|$, (homogeneity)

(N4) $\|x + y\| \le \|x\| + \|y\|$, (triangle inequality)

for all $x, y \in X$ and all $\alpha \in \mathbb{C}$ (or $\mathbb{R}$) is called a norm on $X$. A linear
space $X$ equipped with a norm is called a normed space. For $X = \mathbb{R}^n$ or
$X = \mathbb{C}^n$ we will also call the norm a vector norm.

Example 3.2 Some examples of norms on $\mathbb{R}^n$ and $\mathbb{C}^n$ are given by

$$\|x\|_1 := \sum_{j=1}^n |x_j|, \qquad \|x\|_2 := \Bigl( \sum_{j=1}^n |x_j|^2 \Bigr)^{1/2}, \qquad \|x\|_\infty := \max_{j=1,\dots,n} |x_j|$$

for $x = (x_1, \dots, x_n)^T$. It is an easy exercise for the reader to verify that
the norm axioms (N1)-(N3) are satisfied. The triangle inequality for the
norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ follows immediately from the triangle inequality in
$\mathbb{R}$ or $\mathbb{C}$. The verification of the triangle inequality for the norm $\|\cdot\|_2$ is
postponed until Section 3.2. □

The norms in Example 3.2 are denoted the $\ell_1$, $\ell_2$, and $\ell_\infty$ norm, respectively.
For obvious reasons the $\ell_2$ norm is also called the Euclidean norm,
and the $\ell_\infty$ norm is called the maximum norm. The three norms are special
cases of the $\ell_p$ norm

$$\|x\|_p = \Bigl( \sum_{j=1}^n |x_j|^p \Bigr)^{1/p} \tag{3.1}$$

defined for any real number $p \ge 1$. The $\ell_\infty$ norm is the limiting case of
(3.1) as $p \to \infty$ (see Problem 3.1).
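
The limiting behaviour of the $\ell_p$ norm is easy to observe numerically. The sketch below evaluates (3.1) for a sample vector and increasing $p$ and prints the maximum norm for comparison; the function names `norm_p` and `norm_max` and the vector are illustrative choices.

```python
def norm_p(x, p):
    """The l_p norm (sum |x_j|^p)^(1/p) for a real number p >= 1."""
    return sum(abs(t) ** p for t in x) ** (1.0 / p)


def norm_max(x):
    """The maximum norm max |x_j|."""
    return max(abs(t) for t in x)


x = [3.0, -4.0, 1.0]
for p in (1, 2, 4, 16, 64):
    print(p, norm_p(x, p))
print("max norm:", norm_max(x))
```

For very large $p$ the powers $|x_j|^p$ may overflow or underflow in floating-point arithmetic; dividing the vector by its maximum norm before applying (3.1) avoids this.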
Remark 3.3 For each norm, the second triangle inequality

$$\bigl|\, \|x\| - \|y\| \,\bigr| \le \|x - y\|$$

holds for all $x, y \in X$.
Proof. From the triangle inequality we have

$$\|x\| = \|x - y + y\| \le \|x - y\| + \|y\|,$$

whence $\|x\| - \|y\| \le \|x - y\|$ follows. Analogously, by interchanging the
roles of $x$ and $y$ we have $\|y\| - \|x\| \le \|y - x\|$. □

For two elements x, y in a normed space Ilx - yll is called the distance
between x and y.

Definition 3.4 A sequence $(x_n)$ of elements in a normed space $X$ is called
convergent if there exists an element $x \in X$ such that

$$\lim_{n \to \infty} \|x_n - x\| = 0,$$

i.e., if for every $\varepsilon > 0$ there exists an integer $N(\varepsilon)$ such that $\|x_n - x\| < \varepsilon$
for all $n \ge N(\varepsilon)$. The element $x$ is called the limit of the sequence $(x_n)$,
and we write

$$\lim_{n \to \infty} x_n = x$$

or

$$x_n \to x, \quad n \to \infty.$$

A sequence that does not converge is called divergent.

Theorem 3.5 The limit of a convergent sequence is uniquely determined.

Proof. Assume that $x_n \to x$ and $x_n \to y$ for $n \to \infty$. Then from the triangle
inequality we obtain that

$$\|x - y\| = \|x - x_n + x_n - y\| \le \|x - x_n\| + \|x_n - y\| \to 0, \quad n \to \infty.$$

Therefore, $\|x - y\| = 0$ and $x = y$ by (N2). □

Definition 3.6 Two norms on a linear space are called equivalent if they
have the same convergent sequences.

Theorem 3.7 Two norms $\|\cdot\|_a$ and $\|\cdot\|_b$ on a linear space $X$ are equivalent
if and only if there exist positive numbers $c$ and $C$ such that

$$c \|x\|_a \le \|x\|_b \le C \|x\|_a$$

for all $x \in X$. The limits with respect to the two norms coincide.

Proof. Provided that the conditions are satisfied, from $\|x_n - x\|_a \to 0$,
$n \to \infty$, it follows that $\|x_n - x\|_b \to 0$, $n \to \infty$, and vice versa.
Conversely, let the two norms be equivalent and assume that there is
no $C > 0$ such that $\|x\|_b \le C \|x\|_a$ for all $x \in X$. Then there exists a
sequence $(x_n)$ with $\|x_n\|_a = 1$ and $\|x_n\|_b \ge n^2$. Now, the sequence $(y_n)$
with $y_n := x_n / n$ converges to zero with respect to $\|\cdot\|_a$, whereas with
respect to $\|\cdot\|_b$ it is divergent because of $\|y_n\|_b \ge n$. □

Theorem 3.8 On a finite-dimensional linear space all norms are equivalent.

Proof. In a linear space $X$ with finite dimension $n$ and basis $u_1, \dots, u_n$
every element can be expressed in the form

$$x = \sum_{j=1}^n \alpha_j u_j.$$

As in Example 3.2,

$$\|x\|_\infty := \max_{j=1,\dots,n} |\alpha_j| \tag{3.2}$$

defines a norm on $X$. Let $\|\cdot\|$ denote any other norm on $X$. Then, by the
triangle inequality we have

$$\|x\| \le \sum_{j=1}^n |\alpha_j| \, \|u_j\| \le C \|x\|_\infty$$

for all $x \in X$, where

$$C := \sum_{j=1}^n \|u_j\|.$$

Assume that there is no $c > 0$ such that $c \|x\|_\infty \le \|x\|$ for all $x \in X$.
Then there exists a sequence $(x_\nu)$ with $\|x_\nu\| = 1$ such that $\|x_\nu\|_\infty \ge \nu$.
Consider the sequence $(y_\nu)$ with $y_\nu := x_\nu / \|x_\nu\|_\infty$ and write

$$y_\nu = \sum_{j=1}^n \alpha_{j\nu} u_j.$$

Because of $\|y_\nu\|_\infty = 1$ each of the sequences $(\alpha_{j\nu})$, $j = 1, \dots, n$, is bounded
in $\mathbb{C}$. Hence, by the Bolzano-Weierstrass theorem we can select convergent
subsequences $\alpha_{j,\nu(\ell)} \to \alpha_j$, $\ell \to \infty$, for each $j = 1, \dots, n$. This now implies
$\|y_{\nu(\ell)} - y\|_\infty \to 0$, $\ell \to \infty$, where

$$y := \sum_{j=1}^n \alpha_j u_j,$$

and also $\|y_{\nu(\ell)} - y\| \le C \|y_{\nu(\ell)} - y\|_\infty \to 0$, $\ell \to \infty$. But on the other hand
we have $\|y_\nu\| = 1 / \|x_\nu\|_\infty \to 0$, $\nu \to \infty$. Therefore, $y = 0$, and consequently
$\|y_{\nu(\ell)}\|_\infty \to 0$, $\ell \to \infty$, which contradicts $\|y_\nu\|_\infty = 1$ for all $\nu$. □

The following definitions carry over some useful concepts from Euclidean space to general normed spaces.
Definition 3.9 A subset U of a normed space X is called closed if it contains all limits of convergent sequences of U. The closure $\overline{U}$ of a subset U of a normed space X is the set of all limits of convergent sequences of U. A subset U is called open if its complement $X \setminus U$ is closed. A set U is called dense in another set V if $V \subset \overline{U}$, i.e., if each element in V is the limit of a convergent sequence from U.

Obviously, a subset U is closed if and only if it coincides with its closure. For $x_0$ in X and $r > 0$ the set $B[x_0, r] := \{x \in X : \|x - x_0\| \le r\}$ is closed and is called the closed ball of radius r and center $x_0$. Correspondingly, the set $B(x_0, r) := \{x \in X : \|x - x_0\| < r\}$ is open and is called an open ball.

Definition 3.10 A subset U of a normed space X is called bounded if there exists a positive number C such that $\|x\| \le C$ for all $x \in U$.

Convergent sequences are bounded (see Problem 3.6).

Theorem 3.11 Any bounded sequence in a finite-dimensional normed space X contains a convergent subsequence.

Proof. Let $u_1, \ldots, u_n$ be a basis of X and let $(x_\nu)$ be a bounded sequence. Then writing
$$x_\nu = \sum_{j=1}^{n} \alpha_{j\nu} u_j$$
and using the norm (3.2), as in the proof of Theorem 3.8 we deduce that each of the sequences $(\alpha_{j\nu})$, $j = 1, \ldots, n$, is bounded in $\mathbb{C}$. Hence, by the Bolzano-Weierstrass theorem we can select convergent subsequences $\alpha_{j,\nu(\ell)} \to \alpha_j$, $\ell \to \infty$, for each $j = 1, \ldots, n$. This now implies
$$x_{\nu(\ell)} \to \sum_{j=1}^{n} \alpha_j u_j \in X, \quad \ell \to \infty,$$
and the proof is finished. □

3.2 Scalar Products


Definition 3.12 Let X be a complex (or real) linear space. Then a function $(\cdot\,,\cdot) : X \times X \to \mathbb{C}$ (or $\mathbb{R}$) with the properties

(H1) $(x, x) \ge 0$, (positivity)

(H2) $(x, x) = 0$ if and only if $x = 0$, (definiteness)

(H3) $(x, y) = \overline{(y, x)}$, (symmetry)

(H4) $(\alpha x + \beta y, z) = \alpha(x, z) + \beta(y, z)$, (linearity)

for all $x, y, z \in X$ and $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$) is called a scalar product, or an inner product, on X. (By the bar we denote the complex conjugate.) A linear space X equipped with a scalar product is called a pre-Hilbert space.

As a simple consequence of (H3) and (H4) we note the antilinearity

(H4') $(x, \alpha y + \beta z) = \bar{\alpha}(x, y) + \bar{\beta}(x, z)$.

Example 3.13 An example of a scalar product on $\mathbb{R}^n$ and $\mathbb{C}^n$ is given by
$$(x, y) := \sum_{j=1}^{n} x_j \bar{y}_j$$
for $x = (x_1, \ldots, x_n)^T$ and $y = (y_1, \ldots, y_n)^T$. (Note that $(x, y) = y^* x$.)
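
The Euclidean scalar product of Example 3.13 can be checked numerically. The following is a minimal sketch in plain Python; the function name `scalar_product` and the sample vectors are illustrative assumptions and do not come from the text.

```python
def scalar_product(x, y):
    """Euclidean scalar product (x, y) = sum_j x_j * conj(y_j)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(complex(a) * complex(b).conjugate() for a, b in zip(x, y))


# Sample vectors in C^2 (chosen only for illustration).
x = [1 + 2j, 3 - 1j]
y = [2 - 1j, 1j]

s_xy = scalar_product(x, y)   # (x, y)
s_yx = scalar_product(y, x)   # (y, x), the complex conjugate of (x, y) by (H3)
s_xx = scalar_product(x, x)   # real and nonnegative by (H1)
```

Note that the conjugate is taken in the second argument, which matches the linearity (H4) in the first argument.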
Theorem 3.14 For a scalar product we have the Cauchy-Schwarz inequality
$$|(x, y)|^2 \le (x, x)(y, y)$$
for all $x, y \in X$, with equality if and only if x and y are linearly dependent.
Proof. The inequality is trivial for $x = 0$. For $x \ne 0$ it follows from
$$(\alpha x + \beta y, \alpha x + \beta y) = |\alpha|^2 (x, x) + 2\,\mathrm{Re}\{\alpha\bar{\beta}(x, y)\} + |\beta|^2 (y, y) = (x, x)(y, y) - |(x, y)|^2,$$
where we have set $\alpha = -(x, x)^{-1/2}\,\overline{(x, y)}$ and $\beta = (x, x)^{1/2}$. Since $(\cdot\,,\cdot)$ is positive definite, this expression is nonnegative, and it is equal to zero if and only if $\alpha x + \beta y = 0$. In the latter case x and y are linearly dependent because $\beta \ne 0$. □

Theorem 3.15 A scalar product $(\cdot\,,\cdot)$ on a linear space X defines a norm by
$$\|x\| := (x, x)^{1/2}$$
for all $x \in X$; i.e., a pre-Hilbert space is always a normed space.
Proof. We leave it as an exercise for the reader to verify the norm axioms. The triangle inequality follows by
$$\|x + y\|^2 = (x + y, x + y) \le \|x\|^2 + 2\|x\|\,\|y\| + \|y\|^2 = (\|x\| + \|y\|)^2$$
from the Cauchy-Schwarz inequality. □

Note that we can rewrite the Cauchy-Schwarz inequality in the form
$$|(x, y)| \le \|x\|\,\|y\|.$$
The scalar product of Example 3.13 generates the Euclidean norm of Example 3.2, and therefore it is called the Euclidean scalar product. Theorem 3.15 includes the triangle inequality for the Euclidean norm that we postponed in Example 3.2.
The following definition generalizes the concept of orthogonality from Euclidean space to pre-Hilbert spaces.

Definition 3.16 Two elements x and y of a pre-Hilbert space X are called orthogonal if
$$(x, y) = 0.$$
Two subsets U and V of X are called orthogonal if each pair of elements $x \in U$ and $y \in V$ are orthogonal. For two orthogonal elements or subsets we write $x \perp y$ and $U \perp V$, respectively. A subset U of X is called an orthogonal system if $(x, y) = 0$ for all $x, y \in U$ with $x \ne y$. An orthogonal system U is called an orthonormal system if $\|x\| = 1$ for all $x \in U$.

Theorem 3.17 The elements of an orthonormal system are linearly independent.

Proof. From
$$\sum_{k=1}^{n} \alpha_k q_k = 0$$
for the orthonormal system $\{q_1, \ldots, q_n\}$, by taking the scalar product with $q_j$, we immediately have that $\alpha_j = 0$ for $j = 1, \ldots, n$. □

The Gram-Schmidt orthogonalization procedure as described in the following theorem provides a converse of Theorem 3.17. For a subset U of a linear space X we denote the set spanned by all linear combinations of elements of U by span{U}.

Theorem 3.18 Let $\{u_0, u_1, \ldots\}$ be a finite or countable number of linearly independent elements of a pre-Hilbert space. Then there exists a uniquely determined orthogonal system $\{q_0, q_1, \ldots\}$ of the form
$$q_n = u_n + r_n, \quad n = 0, 1, \ldots, \qquad (3.3)$$
with $r_0 = 0$ and $r_n \in \mathrm{span}\{u_0, \ldots, u_{n-1}\}$, $n = 1, 2, \ldots$, satisfying
$$\mathrm{span}\{u_0, \ldots, u_n\} = \mathrm{span}\{q_0, \ldots, q_n\}, \quad n = 0, 1, \ldots. \qquad (3.4)$$

Proof. Assume that we have constructed orthogonal elements of the form (3.3) with the property (3.4) up to $q_{n-1}$. By (3.4), the $\{q_0, \ldots, q_{n-1}\}$ are linearly independent, and therefore $\|q_k\| \ne 0$ for $k = 0, 1, \ldots, n - 1$. Hence,
$$q_n := u_n - \sum_{k=0}^{n-1} \frac{(u_n, q_k)}{(q_k, q_k)}\, q_k$$
is well-defined, and using the induction assumption, we obtain $(q_n, q_m) = 0$ for $m = 0, \ldots, n - 1$ and
$$\mathrm{span}\{u_0, \ldots, u_{n-1}, u_n\} = \mathrm{span}\{q_0, \ldots, q_{n-1}, u_n\} = \mathrm{span}\{q_0, \ldots, q_{n-1}, q_n\}.$$
Hence, the existence of $q_n$ is established.

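Assume that $\{q_0, q_1, \ldots\}$ and $\{\tilde{q}_0, \tilde{q}_1, \ldots\}$ are two orthogonal sets of elements with the required properties. Then clearly $q_0 = u_0 = \tilde{q}_0$. Assume that we have shown that equality holds up to $q_{n-1} = \tilde{q}_{n-1}$. Then, since $q_n - \tilde{q}_n \in \mathrm{span}\{u_0, \ldots, u_{n-1}\}$, we can represent $q_n - \tilde{q}_n$ as a linear combination of $q_0, \ldots, q_{n-1}$; i.e.,
$$q_n - \tilde{q}_n = \sum_{k=0}^{n-1} \alpha_k q_k.$$
Now the orthogonality yields
$$(q_n - \tilde{q}_n, q_n - \tilde{q}_n) = \sum_{k=0}^{n-1} \bar{\alpha}_k (q_n - \tilde{q}_n, q_k) = 0,$$
whence $q_n = \tilde{q}_n$. □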
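
The constructive part of the Gram-Schmidt procedure translates directly into a short program. The following is a minimal sketch for real vectors with the Euclidean scalar product; the names `inner` and `gram_schmidt` and the sample vectors are illustrative assumptions. As in Theorem 3.18, the resulting elements are orthogonal but not normalized.

```python
def inner(x, y):
    """Euclidean scalar product for real vectors."""
    return sum(a * b for a, b in zip(x, y))


def gram_schmidt(us):
    """Return orthogonal q_0, q_1, ... with q_n = u_n + r_n, where r_n lies
    in span{u_0, ..., u_{n-1}}. Assumes the inputs are linearly independent."""
    qs = []
    for u in us:
        q = list(u)
        for p in qs:
            # Subtract the component of u_n in the direction of q_k.
            coeff = inner(u, p) / inner(p, p)
            q = [qi - coeff * pi for qi, pi in zip(q, p)]
        qs.append(q)
    return qs


# Three linearly independent vectors in R^3 (illustrative data).
us = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
qs = gram_schmidt(us)
```

In floating-point arithmetic this classical form can lose orthogonality for nearly dependent inputs, so the sketch is meant to illustrate the recursion rather than to serve as a robust implementation.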

3.3 Bounded Linear Operators


By the symbol $A : X \to Y$ we will denote a mapping whose domain of definition is a set X and whose range is contained in a set Y; i.e., for every $x \in X$ the mapping A assigns a unique element $Ax \in Y$. The range is the set $A(X) := \{Ax : x \in X\}$ of all image elements. We will use the terms mapping, function, and operator synonymously. (We have already used this convention in Definitions 3.1 and 3.12.)
Definition 3.19 An operator A mapping a subset U of a normed space X into a normed space Y is called continuous at $x \in U$ if for every sequence $(x_n)$ from U with $\lim_{n\to\infty} x_n = x$ we have $\lim_{n\to\infty} Ax_n = Ax$. The function $A : U \to Y$ is called continuous if it is continuous for all $x \in U$.

An equivalent definition is the following: A function $A : U \subset X \to Y$ is continuous at $x \in U$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that $\|Ax - Ay\| < \varepsilon$ for all $y \in U$ with $\|x - y\| < \delta$. Here we have used the same symbol $\|\cdot\|$ for the norms on X and Y. Note that by the second triangle inequality of Remark 3.3 the norm is a continuous function.
Definition 3.20 An operator $A : X \to Y$ mapping a linear space X into a linear space Y is called linear if
$$A(\alpha x + \beta y) = \alpha Ax + \beta Ay$$
for all $x, y \in X$ and all $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$).

Theorem 3.21 A linear operator is continuous if it is continuous at one element.

Proof. Let $A : X \to Y$ be continuous at $x_0 \in X$. Then for every $x \in X$ and every sequence $(x_n)$ with $x_n \to x$, $n \to \infty$, we have
$$Ax_n = A(x_n - x + x_0) + A(x - x_0) \to A(x_0) + A(x - x_0) = A(x), \quad n \to \infty,$$
since $x_n - x + x_0 \to x_0$, $n \to \infty$. □

Definition 3.22 A linear operator $A : X \to Y$ from a normed space X into a normed space Y is called bounded if there exists a positive number C such that
$$\|Ax\| \le C\|x\|$$
for all $x \in X$. Each number C for which this inequality holds is called a bound for the operator A. (Again we have used the same symbol $\|\cdot\|$ for the norms on X and Y.)
Theorem 3.23 A linear operator $A : X \to Y$ is bounded if and only if
$$\|A\| := \sup_{\|x\|=1} \|Ax\| < \infty.$$
The number $\|A\|$ is the smallest bound for A and is called the norm of A.
Proof. Assume that A is bounded with the bound C. Then
$$\sup_{\|x\|=1} \|Ax\| \le C,$$
and, in particular, $\|A\|$ is less than or equal to any bound for A. Conversely, if $\|A\| < \infty$, then using the linearity of A and the homogeneity of the norm, we find that
$$\|Ax\| = \Bigl\|A\Bigl(\frac{x}{\|x\|}\Bigr)\Bigr\|\,\|x\| \le \|A\|\,\|x\|$$
for all $x \ne 0$. Therefore, A is bounded with the bound $\|A\|$. □
Theorem 3.24 A linear operator is continuous if and only if it is bounded.
Proof. Let $A : X \to Y$ be bounded and let $(x_n)$ be a sequence in X with $x_n \to 0$, $n \to \infty$. Then from $\|Ax_n\| \le C\|x_n\|$ it follows that $Ax_n \to 0$, $n \to \infty$. Thus, A is continuous at $x = 0$, and because of Theorem 3.21 it is continuous everywhere in X.
Conversely, let A be continuous and assume that there is no $C > 0$ such that $\|Ax\| \le C\|x\|$ for all $x \in X$. Then there exists a sequence $(x_n)$ in X with $\|x_n\| = 1$ and $\|Ax_n\| \ge n$. Consider the sequence $y_n := x_n/\|Ax_n\|$. Then $y_n \to 0$, $n \to \infty$, and since A is continuous, $Ay_n \to A(0) = 0$, $n \to \infty$. This is a contradiction to $\|Ay_n\| = 1$ for all n. Hence, A is bounded. □
Remark 3.25 Let X, Y, and Z be normed spaces and let $A : X \to Y$ and $B : Y \to Z$ be bounded linear operators. Then the product $BA : X \to Z$, defined by $(BA)x := B(Ax)$ for all $x \in X$, is a bounded linear operator with $\|BA\| \le \|A\|\,\|B\|$.
Proof. This follows from $\|(BA)x\| = \|B(Ax)\| \le \|B\|\,\|A\|\,\|x\|$. □

3.4 Matrix Norms


Theorem 3.26 Let $(a_{jk})$ be a real or complex $n \times n$ matrix. Then the linear operators $A : \mathbb{R}^n \to \mathbb{R}^n$ and $A : \mathbb{C}^n \to \mathbb{C}^n$, defined by
$$(Ax)_j := \sum_{k=1}^{n} a_{jk} x_k, \quad j = 1, \ldots, n,$$
are bounded with respect to each norm on $\mathbb{R}^n$ and $\mathbb{C}^n$. In particular, we have
$$\|A\|_1 = \max_{k=1,\ldots,n} \sum_{j=1}^{n} |a_{jk}|, \qquad (3.5)$$
$$\|A\|_\infty = \max_{j=1,\ldots,n} \sum_{k=1}^{n} |a_{jk}|, \qquad (3.6)$$
$$\|A\|_2 \le \Bigl(\sum_{j,k=1}^{n} |a_{jk}|^2\Bigr)^{1/2}. \qquad (3.7)$$
In this case the norms are also called matrix norms. (Note that in (3.5)-(3.7) both the domain and the range are given the same norm.)
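
The column-sum and row-sum formulas are easy to evaluate directly. The following is a minimal sketch in plain Python with matrices stored as lists of rows; the function names and the sample matrix are illustrative assumptions. The third function evaluates the square root of the sum of squared entries, which the proof of this theorem shows to be an upper bound for $\|A\|_2$.

```python
import math


def norm_1(a):
    """Maximum absolute column sum, the matrix norm induced by the 1-norm."""
    n = len(a)
    return max(sum(abs(a[j][k]) for j in range(n)) for k in range(n))


def norm_inf(a):
    """Maximum absolute row sum, the matrix norm induced by the maximum norm."""
    return max(sum(abs(v) for v in row) for row in a)


def two_norm_bound(a):
    """Square root of the sum of |a_jk|^2, an upper bound for the 2-norm."""
    return math.sqrt(sum(abs(v) ** 2 for row in a for v in row))


# Illustrative 2 x 2 matrix.
A = [[1, -2], [3, 4]]
n1 = norm_1(A)            # columns: 1 + 3 = 4 and 2 + 4 = 6
ninf = norm_inf(A)        # rows: 1 + 2 = 3 and 3 + 4 = 7
bound2 = two_norm_bound(A)
```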
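Proof. By Theorem 3.8 it suffices to prove boundedness of A with respect to one norm. For $\|\cdot\|_1$ we can estimate
$$\|Ax\|_1 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} x_k\Bigr| \le \sum_{k=1}^{n} |x_k| \sum_{j=1}^{n} |a_{jk}| \le \max_{k=1,\ldots,n} \sum_{j=1}^{n} |a_{jk}| \sum_{k=1}^{n} |x_k|.$$
Therefore, we have that
$$\|A\|_1 \le \max_{k=1,\ldots,n} \sum_{j=1}^{n} |a_{jk}|. \qquad (3.8)$$
Now choose i such that
$$\sum_{j=1}^{n} |a_{ji}| = \max_{k=1,\ldots,n} \sum_{j=1}^{n} |a_{jk}|$$
and choose $z \in \mathbb{R}^n$ with $z_i = 1$ and $z_k = 0$ for $k \ne i$. Then $\|z\|_1 = 1$ and
$$\|Az\|_1 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} z_k\Bigr| = \sum_{j=1}^{n} |a_{ji}| = \max_{k=1,\ldots,n} \sum_{j=1}^{n} |a_{jk}|.$$
Hence
$$\|A\|_1 = \sup_{\|x\|_1=1} \|Ax\|_1 \ge \|Az\|_1 = \max_{k=1,\ldots,n} \sum_{j=1}^{n} |a_{jk}|, \qquad (3.9)$$
and from (3.8) and (3.9) we obtain (3.5).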


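For $\|\cdot\|_\infty$ we can estimate
$$\|Ax\|_\infty = \max_{j=1,\ldots,n} |(Ax)_j| = \max_{j=1,\ldots,n} \Bigl|\sum_{k=1}^{n} a_{jk} x_k\Bigr| \le \max_{j=1,\ldots,n} \sum_{k=1}^{n} |a_{jk}|\,\|x\|_\infty.$$
Therefore, we have that
$$\|A\|_\infty \le \max_{j=1,\ldots,n} \sum_{k=1}^{n} |a_{jk}|. \qquad (3.10)$$
Now choose i such that
$$\sum_{k=1}^{n} |a_{ik}| = \max_{j=1,\ldots,n} \sum_{k=1}^{n} |a_{jk}|$$
and choose $z \in \mathbb{C}^n$ with $z_k = \bar{a}_{ik}/|a_{ik}|$ if $a_{ik} \ne 0$ and $z_k = 1$ if $a_{ik} = 0$. Then $\|z\|_\infty = 1$ and
$$\|Az\|_\infty = \max_{j=1,\ldots,n} |(Az)_j| = \max_{j=1,\ldots,n} \Bigl|\sum_{k=1}^{n} a_{jk} z_k\Bigr| \ge \sum_{k=1}^{n} |a_{ik}| = \max_{j=1,\ldots,n} \sum_{k=1}^{n} |a_{jk}|.$$
Hence
$$\|A\|_\infty = \sup_{\|x\|_\infty=1} \|Ax\|_\infty \ge \|Az\|_\infty \ge \max_{j=1,\ldots,n} \sum_{k=1}^{n} |a_{jk}|, \qquad (3.11)$$
and from (3.10) and (3.11) we obtain (3.6).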


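Finally, for $\|\cdot\|_2$, using the Cauchy-Schwarz inequality we can estimate
$$\|Ax\|_2^2 = \sum_{j=1}^{n} \Bigl|\sum_{k=1}^{n} a_{jk} x_k\Bigr|^2 \le \sum_{j=1}^{n} \Bigl(\sum_{k=1}^{n} |a_{jk}|^2\Bigr)\Bigl(\sum_{k=1}^{n} |x_k|^2\Bigr) = \sum_{j,k=1}^{n} |a_{jk}|^2\,\|x\|_2^2.$$
Therefore,
$$\|A\|_2^2 \le \sum_{j,k=1}^{n} |a_{jk}|^2,$$
and (3.7) is proven. In this inequality equality does not hold, in general, as can be seen by considering the identity matrix. □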

In order to derive a representation for $\|A\|_2$ we need to recall the definition and some basic facts about eigenvalues and eigenvectors of a matrix. A number $\lambda \in \mathbb{C}$ is called an eigenvalue of the matrix A if there exists a vector $x \in \mathbb{C}^n$ with $x \ne 0$ such that
$$Ax = \lambda x.$$
The vector x is called an eigenvector for the eigenvalue $\lambda$. Each $n \times n$ matrix has at least one and at most n eigenvalues, since the characteristic polynomial $\det(A - \lambda I)$ has at least one and at most n zeros. Eigenvectors for different eigenvalues are linearly independent (see Problem 3.12). The algebraic multiplicity of an eigenvalue of a matrix is its multiplicity as a zero of the characteristic polynomial; its geometric multiplicity is the number of linearly independent eigenvectors associated with the eigenvalue.
Theorem 3.27 To each matrix A there exists a unitary matrix Q such that $Q^*AQ$ is an upper triangular matrix.

Proof. Assume that it has been shown that for each $(n-1) \times (n-1)$ matrix $A_{n-1}$ there exists a unitary $(n-1) \times (n-1)$ matrix $Q_{n-1}$ such that $Q_{n-1}^* A_{n-1} Q_{n-1}$ is an upper triangular matrix. Let $\lambda$ be an eigenvalue of the $n \times n$ matrix $A_n$ with eigenvector u. We may assume that $(u, u) = 1$, where $(\cdot\,,\cdot)$ is the Euclidean scalar product. Using the Gram-Schmidt procedure of Theorem 3.18 we can construct an orthonormal basis of $\mathbb{C}^n$ of the form $u, v_2, \ldots, v_n$. Then we define a unitary $n \times n$ matrix by
$$U_n := (u, v_2, \ldots, v_n).$$
With the aid of $(u, v_j) = 0$, $j = 2, \ldots, n$, we see that
$$U_n^* A_n U_n = U_n^*(\lambda u, A_n v_2, \ldots, A_n v_n) = \begin{pmatrix} \lambda & * \\ 0 & A_{n-1} \end{pmatrix}$$
with some $(n-1) \times (n-1)$ matrix $A_{n-1}$. By the induction assumption there exists a unitary $(n-1) \times (n-1)$ matrix $Q_{n-1}$ such that $Q_{n-1}^* A_{n-1} Q_{n-1}$ is upper triangular. Then
$$Q_n := U_n \begin{pmatrix} 1 & 0 \\ 0 & Q_{n-1} \end{pmatrix}$$
defines a unitary $n \times n$ matrix, and $Q_n^* A_n Q_n$ is upper triangular. □

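Lemma 3.28 For an $n \times n$ matrix A and its adjoint $A^*$ we have that
$$(Ax, y) = (x, A^*y)$$
for all $x, y \in \mathbb{C}^n$, where $(\cdot\,,\cdot)$ denotes the Euclidean scalar product.

Proof. Simple calculations yield
$$(Ax, y) = \sum_{j=1}^{n} (Ax)_j \bar{y}_j = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} x_k \bar{y}_j = \sum_{k=1}^{n} \sum_{j=1}^{n} x_k \overline{a^*_{kj} y_j} = \sum_{k=1}^{n} x_k \overline{(A^*y)_k} = (x, A^*y),$$
where we have used that $a^*_{kj} = \bar{a}_{jk}$. □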

Theorem 3.29 The eigenvalues of a Hermitian $n \times n$ matrix are real, and the eigenvectors form an orthogonal basis in $\mathbb{C}^n$.

Proof. If A is Hermitian, i.e., if $A = A^*$, then the matrix $\tilde{A} := Q^*AQ$ from Theorem 3.27 is also Hermitian, since
$$\tilde{A}^* = (Q^*AQ)^* = Q^*A^*Q^{**} = Q^*AQ = \tilde{A}.$$
Therefore, in this case the upper triangular matrix $\tilde{A}$ must be diagonal; i.e.,
$$\tilde{A} = D := \mathrm{diag}(\lambda_1, \ldots, \lambda_n).$$
Since from $Q^*AQ = D$ it follows that $AQ = QD$, we can conclude that the columns of $Q = (u_1, \ldots, u_n)$ satisfy $Au_j = \lambda_j u_j$, $j = 1, \ldots, n$. Hence the eigenvectors of a Hermitian matrix form an orthogonal basis in $\mathbb{C}^n$. Because of
$$\lambda_j = (Au_j, u_j) = (u_j, Au_j) = \bar{\lambda}_j$$
the eigenvalues of Hermitian matrices are real. □

For a positive semidefinite matrix A, i.e., for a Hermitian matrix with the property
$$(Ax, x) \ge 0, \quad x \in \mathbb{C}^n,$$
all eigenvalues are real and nonnegative. Analogously, the eigenvalues of a positive definite matrix A, i.e., of a Hermitian matrix with the property
$$(Ax, x) > 0, \quad x \in \mathbb{C}^n, \ x \ne 0,$$
are positive.

Definition 3.30 The number
$$\rho(A) := \max\{|\lambda| : \lambda \text{ eigenvalue of } A\}$$
is called the spectral radius of A.
Theorem 3.31 For an $n \times n$ matrix A we have
$$\|A\|_2 = \sqrt{\rho(A^*A)}.$$
If A is Hermitian, then
$$\|A\|_2 = \rho(A).$$
Proof. From Lemma 3.28 we have that
$$\|Ax\|_2^2 = (Ax, Ax) = (x, A^*Ax)$$
for all $x \in \mathbb{C}^n$. Hence the Hermitian matrix $A^*A$ is positive semidefinite and therefore has n orthonormal eigenvectors
$$A^*Au_j = \mu_j^2 u_j, \quad j = 1, \ldots, n,$$
with real nonnegative eigenvalues. We use the orthonormal basis of eigenvectors and represent $x \in \mathbb{C}^n$ by
$$x = \sum_{j=1}^{n} \alpha_j u_j$$
and have
$$\|x\|_2^2 = \sum_{j=1}^{n} |\alpha_j|^2$$
and
$$\|Ax\|_2^2 = (Ax, Ax) = (x, A^*Ax) = \Bigl(\sum_{j=1}^{n} \alpha_j u_j, \sum_{j=1}^{n} \mu_j^2 \alpha_j u_j\Bigr) = \sum_{j=1}^{n} \mu_j^2 |\alpha_j|^2.$$
From this we obtain that
$$\|Ax\|_2^2 \le \rho(A^*A)\,\|x\|_2^2,$$
whence
$$\|A\|_2^2 \le \rho(A^*A)$$
follows. On the other hand, if we choose j such that $\mu_j^2 = \rho(A^*A)$, then we have that
$$\|A\|_2^2 = \Bigl[\sup_{\|x\|_2=1} \|Ax\|_2\Bigr]^2 \ge \|Au_j\|_2^2 = (u_j, A^*Au_j) = \mu_j^2 = \rho(A^*A).$$
This concludes the proof of $\|A\|_2 = \sqrt{\rho(A^*A)}$. If A is Hermitian, then $A^*A = A^2$, whence $\rho(A^*A) = \rho(A^2) = [\rho(A)]^2$ follows. □
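
For a real $2 \times 2$ matrix the representation of Theorem 3.31 can be evaluated in closed form, since the eigenvalues of the symmetric matrix $A^T A$ are the roots of a quadratic. The following is a minimal sketch under that restriction; the function names and the sampling check are illustrative assumptions, not part of the text.

```python
import math


def spectral_norm_2x2(a):
    """||A||_2 = sqrt(rho(A^T A)) for a real 2 x 2 matrix given as rows."""
    (a11, a12), (a21, a22) = a
    # Entries of the symmetric positive semidefinite matrix A^T A.
    g11 = a11 * a11 + a21 * a21
    g12 = a11 * a12 + a21 * a22
    g22 = a12 * a12 + a22 * a22
    # Largest root of t^2 - trace * t + det = 0.
    half_trace = 0.5 * (g11 + g22)
    det = g11 * g22 - g12 * g12
    disc = max(half_trace * half_trace - det, 0.0)
    return math.sqrt(half_trace + math.sqrt(disc))


def sampled_norm_2x2(a, samples=3600):
    """Crude lower estimate of sup ||Ax||_2 over sampled unit vectors x."""
    best = 0.0
    for i in range(samples):
        t = 2.0 * math.pi * i / samples
        x1, x2 = math.cos(t), math.sin(t)
        y1 = a[0][0] * x1 + a[0][1] * x2
        y2 = a[1][0] * x1 + a[1][1] * x2
        best = max(best, math.hypot(y1, y2))
    return best


A = [[1.0, -2.0], [3.0, 4.0]]
exact = spectral_norm_2x2(A)
approx = sampled_norm_2x2(A)
```

For the symmetric matrix with rows (2, 1) and (1, 2) the eigenvalues are 3 and 1, so the function returns the spectral radius 3, in line with the Hermitian case of the theorem.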

The following final theorem of this section is of basic importance for establishing a necessary and sufficient condition for the convergence of iterative methods for linear systems.
Theorem 3.32 For each norm on $\mathbb{C}^n$ and each $n \times n$ matrix A we have that
$$\rho(A) \le \|A\|.$$
Conversely, to each matrix A and each $\varepsilon > 0$ there exists a norm on $\mathbb{C}^n$ such that
$$\|A\| \le \rho(A) + \varepsilon.$$
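Proof. Let $\lambda$ be an eigenvalue of A with eigenvector u. We may assume that $\|u\| = 1$. Then the first part of the theorem follows from
$$\|A\| = \sup_{\|x\|=1} \|Ax\| \ge \|Au\| = \|\lambda u\| = |\lambda|.$$
For the second part, by Theorem 3.27 there exists a unitary matrix Q such that
$$B = Q^*AQ = \begin{pmatrix} b_{11} & b_{12} & b_{13} & \cdots & b_{1n} \\ & b_{22} & b_{23} & \cdots & b_{2n} \\ & & b_{33} & \cdots & b_{3n} \\ & & & \ddots & \vdots \\ & & & & b_{nn} \end{pmatrix}$$
is upper triangular. Because of $\det(\lambda I - A) = \det(\lambda I - B)$, the eigenvalues of A are given by $\lambda_j = b_{jj}$, $j = 1, \ldots, n$. We set
$$b := \max_{1 \le j \le k \le n} |b_{jk}|$$
and
$$\delta := \min\Bigl(1, \frac{\varepsilon}{(n-1)b}\Bigr)$$
and define the diagonal matrix
$$D := \mathrm{diag}(1, \delta, \delta^2, \ldots, \delta^{n-1})$$
with the inverse
$$D^{-1} = \mathrm{diag}(1, \delta^{-1}, \delta^{-2}, \ldots, \delta^{-n+1}).$$
Then for $C := D^{-1}BD$ we have that
$$C = \begin{pmatrix} b_{11} & \delta b_{12} & \delta^2 b_{13} & \cdots & \delta^{n-1} b_{1n} \\ & b_{22} & \delta b_{23} & \cdots & \delta^{n-2} b_{2n} \\ & & b_{33} & \cdots & \delta^{n-3} b_{3n} \\ & & & \ddots & \vdots \\ & & & & b_{nn} \end{pmatrix}.$$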

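Since $\delta \le 1$, by Theorem 3.26, we can estimate
$$\|C\|_\infty \le \max_{j=1,\ldots,n} |b_{jj}| + (n-1)\delta b \le \rho(A) + \varepsilon.$$
After setting $V := QD$ we define a norm on $\mathbb{C}^n$ by $\|x\| := \|V^{-1}x\|_\infty$. Using $C = V^{-1}AV$ we now obtain
$$\|Ax\| = \|V^{-1}Ax\|_\infty = \|CV^{-1}x\|_\infty \le \|C\|_\infty \|V^{-1}x\|_\infty = \|C\|_\infty \|x\|$$
for all $x \in \mathbb{C}^n$. Hence
$$\|A\| \le \|C\|_\infty \le \rho(A) + \varepsilon,$$
and the proof is finished. □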

3.5 Completeness
Definition 3.33 A sequence $(x_n)$ of elements in a normed space X is called a Cauchy sequence if for every $\varepsilon > 0$ there exists an integer $N(\varepsilon)$ such that
$$\|x_n - x_m\| < \varepsilon$$
for all $n, m \ge N(\varepsilon)$, i.e., if $\lim_{n,m\to\infty} \|x_n - x_m\| = 0$.
Theorem 3.34 Every convergent sequence is a Cauchy sequence.
Proof. Let $x_n \to x$, $n \to \infty$. Then, for $\varepsilon > 0$ there exists $N(\varepsilon) \in \mathbb{N}$ such that $\|x_n - x\| < \varepsilon/2$ for all $n \ge N(\varepsilon)$. Now the triangle inequality yields
$$\|x_n - x_m\| = \|x_n - x + x - x_m\| \le \|x_n - x\| + \|x - x_m\| < \varepsilon$$
for all $n, m \ge N(\varepsilon)$. □

The fact that the converse of Theorem 3.34 is not true in general gives
rise to the following definition.
Definition 3.35 A subset U of a normed space X is called complete if every Cauchy sequence of elements in U converges to an element in U. A normed space is called a Banach space if it is complete. A pre-Hilbert space is called a Hilbert space if it is complete.
The subset of rational numbers is not complete in $\mathbb{R}$. In order to give further examples, we introduce some infinite-dimensional normed spaces. The set $C[a, b]$ of continuous functions $f : [a, b] \to \mathbb{R}$ equipped with pointwise addition and scalar multiplication,
$$(f + g)(x) := f(x) + g(x), \quad (\alpha f)(x) := \alpha f(x),$$
obviously is a linear space. Since the monomials $x \mapsto x^n$, $n = 0, 1, \ldots$, are linearly independent (see Theorem 8.2), $C[a, b]$ has infinite dimension.

Example 3.36 The linear space $C[a, b]$ furnished with the maximum norm
$$\|f\|_\infty := \max_{x \in [a,b]} |f(x)|$$
is a Banach space.
Proof. The norm axioms (N1)-(N3) are trivially satisfied. The triangle inequality follows from
$$\|f + g\|_\infty = \max_{x \in [a,b]} |(f + g)(x)| = |(f + g)(x_0)| \le |f(x_0)| + |g(x_0)| \le \max_{x \in [a,b]} |f(x)| + \max_{x \in [a,b]} |g(x)| = \|f\|_\infty + \|g\|_\infty$$
for some $x_0 \in [a, b]$. Since the condition $\|f_n - f\|_\infty < \varepsilon$ is equivalent to $|f_n(x) - f(x)| < \varepsilon$ for all $x \in [a, b]$, convergence of a sequence of continuous functions in the maximum norm is equivalent to uniform convergence on $[a, b]$. Since the Cauchy criterion is sufficient for uniform convergence of a sequence of continuous functions to a continuous limit function, the space $C[a, b]$ is complete with respect to the maximum norm. □

Example 3.37 The linear space $C[a, b]$ equipped with the $L_1$ norm
$$\|f\|_1 := \int_a^b |f(x)|\,dx$$
is not complete.

Proof. The norm axioms are trivially satisfied. Without loss of generality we take $[a, b] = [0, 2]$ and choose
$$f_n(x) := \begin{cases} x^n, & 0 \le x \le 1, \\ 1, & 1 \le x \le 2. \end{cases}$$
Then for $m > n$ we have that
$$\|f_n - f_m\|_1 = \int_0^1 (x^n - x^m)\,dx \le \frac{1}{n+1} \to 0, \quad n \to \infty,$$
and therefore $(f_n)$ is a Cauchy sequence. Now we assume that $(f_n)$ converges with respect to the $L_1$ norm to a continuous function f; i.e.,
$$\|f_n - f\|_1 \to 0, \quad n \to \infty.$$
Then
$$\int_0^1 |f(x)|\,dx \le \int_0^1 |f(x) - x^n|\,dx + \int_0^1 x^n\,dx \le \|f - f_n\|_1 + \frac{1}{n+1} \to 0$$
for $n \to \infty$, whence $f(x) = 0$ follows for $0 \le x < 1$. Furthermore, we have
$$\int_1^2 |f(x) - 1|\,dx = \int_1^2 |f(x) - f_n(x)|\,dx \le \|f - f_n\|_1 \to 0, \quad n \to \infty.$$
This implies that $f(x) = 1$ for $1 \le x \le 2$, and we have a contradiction, since f is continuous.
However, we note that the space $L_1[a, b]$ of measurable and Lebesgue integrable real-valued functions is complete with respect to the $L_1$ norm (see [5, 51, 59]). □

Example 3.38 The linear space $C[a, b]$ equipped with the $L_2$ norm
$$\|f\|_2 := \Bigl(\int_a^b |f(x)|^2\,dx\Bigr)^{1/2}$$
is not complete.

Proof. The norm is generated by the scalar product
$$(f, g) := \int_a^b f(x)g(x)\,dx.$$
Considering the same sequence as in Example 3.37, it can be seen that $C[a, b]$ also is not complete with respect to the $L_2$ norm. Again note that the space $L_2[a, b]$ of measurable and Lebesgue square-integrable real-valued functions is complete with respect to the $L_2$ norm (see [5, 51, 59]). □

Theorem 3.39 Each finite-dimensional normed space is a Banach space.

Proof. Let X be finite-dimensional with basis $u_1, \ldots, u_n$ and assume that $(x_\nu)$ is a Cauchy sequence in X. We represent
$$x_\nu = \sum_{j=1}^{n} \alpha_{j\nu} u_j$$
and recall from Theorem 3.8 that there exists $C > 0$ such that
$$\max_{j=1,\ldots,n} |\alpha_{j\nu} - \alpha_{j\mu}| \le C\|x_\nu - x_\mu\|$$
for all $\nu, \mu \in \mathbb{N}$. Hence for $j = 1, \ldots, n$ the $(\alpha_{j\nu})$ are Cauchy sequences in $\mathbb{C}$. Therefore, there exist $\alpha_1, \ldots, \alpha_n$ such that $\alpha_{j\nu} \to \alpha_j$, $\nu \to \infty$, for $j = 1, \ldots, n$, since the Cauchy criterion is sufficient for convergence in $\mathbb{C}$. Then we have convergence,
$$x_\nu \to x := \sum_{j=1}^{n} \alpha_j u_j \in X, \quad \nu \to \infty,$$
and the proof is finished. □

Remark 3.40 Complete sets are closed, and each closed subset of a complete subset is complete.

Proof. This is trivial. □

3.6 The Banach Fixed Point Theorem


Definition 3.41 Let U be a subset of a normed space X. An operator $A : U \to X$ is called a contraction operator if there exists a constant $q \in [0, 1)$ such that
$$\|Ax - Ay\| \le q\|x - y\|$$
for all $x, y \in U$. Each constant q satisfying this inequality is called a contraction number of the operator A.
Frequently, we will call a contraction operator simply a contraction.
Remark 3.42 Each contraction operator is continuous.
Proof. This is trivial, since the convergence $\|x_n - x\| \to 0$, $n \to \infty$, implies that $\|Ax_n - Ax\| \le q\|x_n - x\| \to 0$, $n \to \infty$. □

An operator $A : U \to X$ is called Lipschitz continuous with Lipschitz constant L if there exists a positive constant L such that
$$\|Ax - Ay\| \le L\|x - y\|$$
for all $x, y \in U$. Thus, contraction operators are Lipschitz continuous operators with Lipschitz constant less than one.
Definition 3.43 An element x of a normed space X is called a fixed point of an operator $A : U \subset X \to X$ if
$$Ax = x.$$

Theorem 3.44 Each contraction operator has at most one fixed point.
Proof. Assume that x and y are two different fixed points of the contraction operator A. Then
$$0 \ne \|x - y\| = \|Ax - Ay\| \le q\|x - y\|,$$
whence $1 \le q$ follows. This is a contradiction to the fact that A is a contraction operator. □

Theorem 3.45 (Banach) Let U be a complete subset of a normed space X and let $A : U \to U$ be a contraction operator. Then A has a unique fixed point.

Proof. Starting from an arbitrary element $x_0 \in U$ we define a sequence $(x_n)$ in U by the recursion
$$x_{n+1} := Ax_n, \quad n = 0, 1, 2, \ldots.$$
Then we have
$$\|x_{n+1} - x_n\| = \|Ax_n - Ax_{n-1}\| \le q\|x_n - x_{n-1}\|,$$
and from this we deduce by induction that
$$\|x_{n+1} - x_n\| \le q^n \|x_1 - x_0\|, \quad n = 0, 1, 2, \ldots.$$
Hence, for $m > n$, by the triangle inequality and the geometric series it follows that
$$\|x_n - x_m\| \le \|x_n - x_{n+1}\| + \|x_{n+1} - x_{n+2}\| + \cdots + \|x_{m-1} - x_m\| \le (q^n + q^{n+1} + \cdots + q^{m-1})\|x_1 - x_0\| \le \frac{q^n}{1 - q}\|x_1 - x_0\|. \qquad (3.12)$$
Since $q^n \to 0$, $n \to \infty$, this implies that $(x_n)$ is a Cauchy sequence, and therefore because U is complete there exists an element $x \in U$ such that $x_n \to x$, $n \to \infty$. Finally, the continuity of A from Remark 3.42 yields
$$x = \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} Ax_n = Ax;$$
i.e., x is a fixed point of A. That this fixed point is unique we have already settled by Theorem 3.44. □

The main importance of Banach's fixed point theorem in numerical analysis originates from its constructive proof. Besides establishing existence of a fixed point by the method of successive approximations, it also provides an algorithm for obtaining numerical approximations. This algorithm is very easy to program because of its iterative nature. We explicitly state this in the following theorem.
Theorem 3.46 Let A be a contraction operator with contraction constant q mapping a complete subset U of a normed space X into itself. Then the successive approximations
$$x_{n+1} := Ax_n, \quad n = 0, 1, 2, \ldots,$$
with arbitrary $x_0 \in U$ converge to the unique fixed point x of A. We have the a priori error estimate
$$\|x_n - x\| \le \frac{q^n}{1 - q}\|x_1 - x_0\|$$
and the a posteriori error estimate
$$\|x_n - x\| \le \frac{q}{1 - q}\|x_n - x_{n-1}\|$$
for all $n \in \mathbb{N}$.
Proof. The a priori error estimate follows from (3.12) by passing to the limit $m \to \infty$. The a posteriori estimate follows from the a priori estimate applied with starting element $x_0 = x_{n-1}$. □
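
The successive approximations with the a posteriori estimate as stopping rule can be sketched in a few lines. The example below is an assumption chosen for illustration and is not taken from the text: the map $x \mapsto \cos x$ sends $[0, 1]$ into itself and, by the mean value theorem, is a contraction there with contraction number $q = \sin 1$. The name `fixed_point` and its parameters are likewise hypothetical.

```python
import math


def fixed_point(A, x0, q, tol, max_iter=10_000):
    """Iterate x_{n+1} = A(x_n) for a scalar contraction with number q < 1.

    Stops as soon as the a posteriori bound q/(1-q) * |x_n - x_{n-1}|
    is at most tol and returns (x_n, n, bound).
    """
    x_prev = x0
    for n in range(1, max_iter + 1):
        x = A(x_prev)
        bound = q / (1.0 - q) * abs(x - x_prev)
        if bound <= tol:
            return x, n, bound
        x_prev = x
    raise RuntimeError("no convergence within max_iter iterations")


# Illustrative contraction on [0, 1]: |cos'(t)| = |sin t| <= sin 1 < 1.
q = math.sin(1.0)
x, steps, bound = fixed_point(math.cos, 0.5, q, 1e-8)
```

The returned `bound` is a guaranteed error bound for the final iterate, provided that q really is a contraction number on a set the iteration never leaves.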

The a priori estimate is used in order to obtain upper bounds on the number of iteration steps that are necessary to achieve a desired accuracy. In order to guarantee that
$$\|x_n - x\| \le \varepsilon$$
for a given accuracy $\varepsilon$, by the a priori estimate we need
$$n \ge \frac{\ln \varepsilon'}{\ln q}$$
iterations, where $\varepsilon' = (1 - q)\varepsilon/\|x_1 - x_0\|$. The smaller the contraction constant q, the fewer iteration steps are required. The a posteriori estimate, which in general yields better estimates as compared with the a priori estimate, is used to check the accuracy during the computation and terminate the iterations when the required accuracy is reached.
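
The iteration count implied by the a priori estimate is a one-line computation. The following is a minimal sketch assuming $0 < q < 1$ and $\|x_1 - x_0\| > 0$; the function name `a_priori_steps` and its argument names are illustrative.

```python
import math


def a_priori_steps(q, eps, first_step):
    """Smallest n with q**n / (1 - q) * first_step <= eps.

    q is the contraction number (0 < q < 1), eps the desired accuracy,
    and first_step the norm of x_1 - x_0 (assumed positive).
    """
    eps_prime = (1.0 - q) * eps / first_step
    if eps_prime >= 1.0:
        return 0  # the starting element already meets the accuracy
    return math.ceil(math.log(eps_prime) / math.log(q))


n_steps = a_priori_steps(0.5, 1e-3, 1.0)
```

For q = 0.5, accuracy 1e-3, and a first step of norm 1 the bound asks for 11 iterations, since 0.5**11 / 0.5 is about 9.8e-4 while 0.5**10 / 0.5 is about 2.0e-3.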
The property
$$\|Ax - Ay\| < \|x - y\|$$
for all x, y with $x \ne y$, which is weaker than the contraction property, is not sufficient in general to ensure the existence of a fixed point, as illustrated in the following example (see also Problem 3.18).
Example 3.47 The function $f : [0, \infty) \to [0, \infty)$ given by
$$f(x) := x + \frac{1}{1 + x}$$
as a consequence of
$$f(x) - f(y) = \frac{x + y + xy}{1 + x + y + xy}\,(x - y)$$
fulfills the condition
$$|f(x) - f(y)| < |x - y|$$
for $x \ne y$. However, because of
$$f(x) - x = \frac{1}{1 + x} > 0$$
for all $x \ge 0$, it does not have a fixed point. □

We conclude this section by considering the special case of linear operators, i.e., by considering the Neumann series (see Problem 3.16).
Let $A : X \to Y$ be an operator mapping a set X into a set Y. If for each $y \in A(X)$ there is only one element $x \in X$ with $Ax = y$, then A is said to be injective and to have an inverse $A^{-1} : A(X) \to X$ defined by $A^{-1}y := x$. The inverse mapping satisfies $A^{-1}A = I$ on X and $AA^{-1} = I$ on $A(X)$, where I denotes the identity operator mapping each element into itself. If $A(X) = Y$, then the mapping is said to be surjective. The mapping is called bijective if it is injective and surjective, i.e., if the inverse mapping $A^{-1} : Y \to X$ exists.
Theorem 3.48 Let $B : X \to X$ be a bounded linear operator on a Banach space X with $\|B\| < 1$, and let $I : X \to X$ denote the identity operator. Then $I - B$ is bijective; i.e., for each $z \in X$ the equation
$$x - Bx = z$$
has a unique solution $x \in X$. The successive approximations
$$x_{n+1} := Bx_n + z, \quad n = 0, 1, 2, \ldots,$$
with arbitrary $x_0 \in X$ converge to this solution, and we have the a priori estimate
$$\|x_n - x\| \le \frac{\|B\|^n}{1 - \|B\|}\|x_1 - x_0\|$$
and the a posteriori estimate
$$\|x_n - x\| \le \frac{\|B\|}{1 - \|B\|}\|x_n - x_{n-1}\|$$
for all $n \in \mathbb{N}$. Furthermore, the inverse operator $(I - B)^{-1}$ is bounded by
$$\|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}.$$
Proof. For fixed, but arbitrary, $z \in X$ we define the operator $A : X \to X$ by
$$Ax := Bx + z, \quad x \in X.$$
Then we have
$$\|Ax - Ay\| = \|B(x - y)\| \le \|B\|\,\|x - y\|$$
for all $x, y \in X$; i.e., A is a contraction with contraction number $q = \|B\|$. Now the statements of the theorem can be deduced from Theorem 3.46. With the starting element $x_0 = z$ the successive approximations lead to
$$x_n = \sum_{k=0}^{n} B^k z$$
with the iterated operators $B^k : X \to X$ defined recursively by $B^0 := I$ and $B^k := BB^{k-1}$ for $k \in \mathbb{N}$. Hence, in view of Remark 3.25, we have
$$\|x_n\| \le \sum_{k=0}^{n} \|B^k z\| \le \sum_{k=0}^{n} \|B\|^k \|z\| \le \frac{\|z\|}{1 - \|B\|},$$
and therefore, since $x_n \to (I - B)^{-1}z$, $n \to \infty$, it follows that
$$\|(I - B)^{-1}z\| \le \frac{\|z\|}{1 - \|B\|}$$
for all $z \in X$. □
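
For a matrix B with $\|B\|_\infty < 1$ the successive approximations of Theorem 3.48 give a simple solver for $x - Bx = z$. The following is a minimal sketch in plain Python using the maximum norm and the row-sum matrix norm; the function names, the tolerance handling, and the sample data are illustrative assumptions.

```python
def mat_vec(b, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(bjk * xk for bjk, xk in zip(row, x)) for row in b]


def norm_inf_vec(x):
    return max(abs(v) for v in x)


def norm_inf_mat(b):
    """Row-sum norm, the matrix norm induced by the maximum norm."""
    return max(sum(abs(v) for v in row) for row in b)


def neumann_solve(b, z, tol, max_iter=10_000):
    """Solve x - Bx = z by x_{n+1} = B x_n + z, starting from x_0 = z.

    Stops when the a posteriori bound q/(1-q) * ||x_n - x_{n-1}|| <= tol,
    where q is the row-sum norm of B (required to be < 1).
    """
    q = norm_inf_mat(b)
    if q >= 1.0:
        raise ValueError("row-sum norm of B must be less than one")
    x_prev = list(z)
    for n in range(1, max_iter + 1):
        x = [bx + zj for bx, zj in zip(mat_vec(b, x_prev), z)]
        diff = norm_inf_vec([u - v for u, v in zip(x, x_prev)])
        if q / (1.0 - q) * diff <= tol:
            return x, n
        x_prev = x
    raise RuntimeError("no convergence within max_iter iterations")


B = [[0.2, 0.1], [0.3, 0.4]]   # row-sum norm 0.7
z = [1.0, 2.0]
x, steps = neumann_solve(B, z, 1e-10)
```

With the starting element z the iterates are the partial sums of the Neumann series, so the result also satisfies the bound $\|x\|_\infty \le \|z\|_\infty/(1 - \|B\|_\infty)$ from the theorem.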

3.7 Best Approximation


Definition 3.49 Let $U \subset X$ be a subset of a normed space X and let $w \in X$. An element $v \in U$ is called a best approximation to w with respect to U if
$$\|w - v\| = \inf_{u \in U} \|w - u\|,$$
i.e., if $v \in U$ has smallest distance from w.

Theorem 3.50 Let U be a finite-dimensional subspace of a normed space X. Then for every element in X there exists a best approximation with respect to U.

Proof. Let $w \in X$ and choose a minimizing sequence $(u_n)$ for w; i.e., $u_n \in U$ satisfies
$$\|w - u_n\| \to d := \inf_{u \in U} \|w - u\|, \quad n \to \infty.$$
Because of $\|u_n\| \le \|w - u_n\| + \|w\|$ the sequence $(u_n)$ is bounded. By Theorem 3.11 the sequence $(u_n)$ contains a convergent subsequence $(u_{n(\ell)})$ with limit $v \in U$. Then
$$\|w - v\| = \lim_{\ell\to\infty} \|w - u_{n(\ell)}\| = d$$
completes the proof. □

Theorem 3.51 Let U be a linear subspace of a pre-Hilbert space X. An element v is a best approximation to $w \in X$ with respect to U if and only if
$$(w - v, u) = 0 \qquad (3.13)$$
for all $u \in U$, i.e., if and only if $w - v \perp U$. To each $w \in X$ there exists at most one best approximation with respect to U.

Proof. We begin by noting the equality
$$\|w - u\|^2 = \|w - v\|^2 + 2\,\mathrm{Re}(w - v, v - u) + \|v - u\|^2, \qquad (3.14)$$
which is valid for all $u, v \in U$. From this, sufficiency of the condition (3.13) is obvious, since U is a linear subspace.
To establish the necessity we assume that v is a best approximation and $(w - v, u_0) \ne 0$ for some $u_0 \in U$. Then, since U is a linear subspace, we may assume that $(w - v, u_0) \in \mathbb{R}$. Choosing
$$u = v + \frac{(w - v, u_0)}{\|u_0\|^2}\,u_0,$$
from (3.14) we arrive at
$$\|w - u\|^2 = \|w - v\|^2 - \frac{(w - v, u_0)^2}{\|u_0\|^2} < \|w - v\|^2,$$
which contradicts the fact that v is a best approximation of w.
Finally, assume that $v_1$ and $v_2$ are best approximations. Then from (3.13) it follows that $(w - v_1, v_1 - v_2) = 0 = (w - v_2, v_1 - v_2)$. This implies $(v_1 - v_2, v_1 - v_2) = 0$, whence $v_1 = v_2$ follows. □

Theorem 3.52 Let U be a complete linear subspace of a pre-Hilbert space X. Then to each element $w \in X$ there exists a unique best approximation with respect to U. The operator $P : X \to U$ mapping $w \in X$ onto its best approximation is a bounded linear operator with the properties
$$P^2 = P \quad \text{and} \quad \|P\| = 1.$$
It is called the orthogonal projection from X onto U.

Proof. Choose a sequence $(u_n)$ with
$$\|w - u_n\|^2 \le d^2 + \frac{1}{n}, \quad n \in \mathbb{N}, \qquad (3.15)$$
where $d := \inf_{u \in U} \|w - u\|$. Then
$$\|(w - u_n) + (w - u_m)\|^2 + \|u_n - u_m\|^2 = 2\|w - u_n\|^2 + 2\|w - u_m\|^2 \le 4d^2 + \frac{2}{n} + \frac{2}{m}$$
for all $n, m \in \mathbb{N}$, and since $\frac{1}{2}(u_n + u_m) \in U$, it follows that
$$\|u_n - u_m\|^2 \le 4d^2 + \frac{2}{n} + \frac{2}{m} - 4\Bigl\|w - \frac{1}{2}(u_n + u_m)\Bigr\|^2 \le \frac{2}{n} + \frac{2}{m}.$$
Hence, $(u_n)$ is a Cauchy sequence, and since U is complete, there exists an element $v \in U$ such that $u_n \to v$, $n \to \infty$. Passing to the limit $n \to \infty$ in (3.15) shows that v is a best approximation of w with respect to U. Uniqueness of the best approximation follows from Theorem 3.51.
Trivially, we have $Pu = u$ for all $u \in U$, and this implies $P^2 = P$. From (3.13) it can be deduced that P is a linear operator and that
$$\|w\|^2 = \|Pw + (w - Pw)\|^2 = \|Pw\|^2 + \|w - Pw\|^2 \ge \|Pw\|^2$$
for all $w \in X$. Therefore, P is bounded with $\|P\| \le 1$. From Remark 3.25 and $P^2 = P$ it follows that $\|P\| \ge 1$, which concludes the proof. □

Corollary 3.53 Let U be a finite-dimensional linear subspace of a pre-Hilbert space X with basis $u_1, \ldots, u_n$. The linear combination
$$v = \sum_{k=1}^{n} \alpha_k u_k$$
is the best approximation to $w \in X$ with respect to U if and only if the coefficients $\alpha_1, \ldots, \alpha_n$ satisfy the normal equations
$$\sum_{k=1}^{n} \alpha_k (u_k, u_j) = (w, u_j), \quad j = 1, \ldots, n. \qquad (3.16)$$

Proof. The normal equations (3.16) obviously are equivalent to (3.13). □

The normal equations for the best approximation in pre-Hilbert spaces provide further examples of systems of linear equations. The solution becomes trivial if the basis $u_1, \ldots, u_n$ is orthonormal.
Corollary 3.54 Let U be a finite-dimensional linear subspace of a pre-Hilbert space X with orthonormal basis $u_1, \ldots, u_n$. Then the orthogonal projection operator is given by
$$Pw = \sum_{k=1}^{n} (w, u_k)\,u_k, \quad w \in X.$$

Proof. This is trivial from either the orthogonality condition of Theorem 3.51 or the normal equations of Corollary 3.53. □
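
The projection formula of Corollary 3.54 can be checked against the orthogonality condition (3.13). The following is a minimal sketch for real vectors with the Euclidean scalar product; the names `inner` and `project`, the two-dimensional subspace of $\mathbb{R}^3$, and the element w are illustrative assumptions.

```python
import math


def inner(x, y):
    """Euclidean scalar product for real vectors."""
    return sum(a * b for a, b in zip(x, y))


def project(w, basis):
    """Orthogonal projection P w = sum_k (w, u_k) u_k.

    The basis is assumed to be orthonormal; this is not checked.
    """
    pw = [0.0] * len(w)
    for u in basis:
        c = inner(w, u)
        pw = [p + c * ui for p, ui in zip(pw, u)]
    return pw


s = 1.0 / math.sqrt(2.0)
basis = [[1.0, 0.0, 0.0], [0.0, s, s]]   # orthonormal basis of a plane in R^3
w = [1.0, 2.0, 3.0]
pw = project(w, basis)
residual = [a - b for a, b in zip(w, pw)]  # w - P w, orthogonal to the plane
```

Applying `project` to `pw` again returns `pw` up to rounding, which reflects the property $P^2 = P$ of Theorem 3.52.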

Problems
3.1 Show that (3.1) defines a norm on $\mathbb{C}^n$ for $p \ge 1$ and that
$$\lim_{p\to\infty} \|x\|_p = \|x\|_\infty$$
for all $x \in \mathbb{C}^n$.

3.2 Indicate the closed balls $\{x \in \mathbb{R}^2 : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$. What properties do they have in common?

3.3 Show that (3.1) does not define a norm on $\mathbb{C}^n$ for $0 < p < 1$.
3.4 For the $\ell_1$ and $\ell_\infty$ norms on $\mathbb{C}^n$ show that $\|x\|_\infty \le \|x\|_1 \le n\|x\|_\infty$.
3.5 Let X and Y be normed spaces with norms $\|\cdot\|_X$ and $\|\cdot\|_Y$, respectively. Show that
$$\|(x, y)\| := \|x\|_X + \|y\|_Y,$$
$$\|(x, y)\| := (\|x\|_X^2 + \|y\|_Y^2)^{1/2},$$
$$\|(x, y)\| := \max(\|x\|_X, \|y\|_Y),$$
for $(x, y) \in X \times Y$ define norms on the product $X \times Y$.

3.6 Show that convergent sequences are bounded.

3.7 Let $(x_n)$ be a sequence of elements of a normed space $X$. The series

$$\sum_{k=1}^{\infty} x_k$$

is called convergent if the sequence $(S_n)$ of partial sums

$$S_n := \sum_{k=1}^{n} x_k$$

converges. The limit $S = \lim_{n \to \infty} S_n$ is called the sum of the series. Show that
in a Banach space $X$ the convergence of the series

$$\sum_{k=1}^{\infty} \|x_k\|$$

is a sufficient condition for the convergence of the series $\sum_{k=1}^{\infty} x_k$ and that

$$\left\|\sum_{k=1}^{\infty} x_k\right\| \le \sum_{k=1}^{\infty} \|x_k\|.$$

3.8 A norm $\|\cdot\|_a$ on a linear space $X$ is called stronger than a norm $\|\cdot\|_b$ if every
sequence converging with respect to the norm $\|\cdot\|_a$ also converges with respect
to the norm $\|\cdot\|_b$. Show that $\|\cdot\|_a$ is stronger than $\|\cdot\|_b$ if and only if there
exists a positive number $C$ such that $\|x\|_b \le C\|x\|_a$ for all $x \in X$. Show that on
$C[a, b]$ the maximum norm is stronger than the $L_2$ norm (and stronger than the
$L_1$ norm). Construct a counterexample to demonstrate that the maximum norm
and the $L_2$ norm (and the maximum norm and the $L_1$ norm) are not equivalent.

3.9 Show that in a normed space the operations of addition and multiplication
by a scalar are continuous functions. Show that in a pre-Hilbert space the scalar
product is a continuous function.

3.10 Show that a norm $\|\cdot\|$ on a linear space $X$ is generated by a scalar product
if and only if the parallelogram equality

$$\|x + y\|^2 + \|x - y\|^2 = 2\left(\|x\|^2 + \|y\|^2\right)$$

holds for all $x, y \in X$. Show that the $\ell_1$ and $\ell_\infty$ norms on $\mathbb{C}^n$ are not generated
by scalar products.

3.11 Let $A$ be a positive definite $n \times n$ matrix and denote by $(\cdot\,, \cdot)$ the Euclidean
scalar product on $\mathbb{C}^n$. Show that $(Ax, y)$ defines a scalar product on $\mathbb{C}^n$.

3.12 Show that eigenvectors of a matrix for different eigenvalues are linearly
independent.

3.13 Let $X$ and $Y$ be normed spaces and denote by $L(X, Y)$ the linear space
of all bounded linear operators $A : X \to Y$. Show that $L(X, Y)$ equipped with

$$\|A\| := \sup_{\|x\| = 1} \|Ax\|$$

again is a normed space and that $L(X, Y)$ is a Banach space if $Y$ is a Banach
space.

3.14 Let $A : X \to X$ denote an operator from a normed space $X$ into itself.
The iterated operators $A^n : X \to X$, $n = 0, 1, \ldots$, are defined recursively by
$A^0 = I$ and $A^n := AA^{n-1}$ for $n \in \mathbb{N}$. If $A$ is bounded and linear, show that
$\|A^n\| \le \|A\|^n$.

3.15 Show that for $n \times n$ matrices $A$ the series

$$\sum_{k=0}^{\infty} \frac{1}{k!}\, A^k$$

converges (with respect to any norm on $\mathbb{C}^n$), and denote the sum of the series by
$e^A$. Show that if $\lambda$ is an eigenvalue of $A$, then $e^\lambda$ is an eigenvalue of $e^A$.

3.16 Show that if $B : X \to X$ is a linear operator on a Banach space $X$ with
$\|B\| < 1$, then the Neumann series

$$\sum_{k=0}^{\infty} B^k = (I - B)^{-1}$$

converges in the Banach space $L(X, X)$.

3.17 Let $U$ be a complete subset of a normed space $X$ and let $A : U \to U$ be
a continuous operator, and assume that $A^m$ is a contraction for some $m \in \mathbb{N}$.
Show that $A$ has a unique fixed point and that the successive approximations
$x_{n+1} := Ax_n$, $n = 0, 1, \ldots$, with arbitrary $x_0 \in U$ converge to this fixed point.

3.18 A subset $U$ of a normed space $X$ is called sequentially compact if each
sequence from $U$ contains a convergent subsequence with limit in $U$. Let $U$ be a
complete and sequentially compact subset of a normed space $X$ and let $A : U \to U$
be an operator with the property

$$\|Ax - Ay\| < \|x - y\|$$

for all $x, y \in U$ with $x \ne y$. Show that $A$ has a unique fixed point and that the
successive approximations $x_{n+1} := Ax_n$, $n = 0, 1, \ldots$, with arbitrary $x_0 \in U$
converge to this fixed point.

3.19 Let $\{u_n : n \in \mathbb{N}\}$ be an orthonormal system in a pre-Hilbert space $X$.
Show that the following properties are equivalent:
(a) $\operatorname{span}\{u_n : n \in \mathbb{N}\}$ is dense in $X$.
(b) Each $\varphi \in X$ can be expanded in a Fourier series

$$\varphi = \sum_{n=1}^{\infty} (\varphi, u_n)\, u_n.$$

(c) For each $\varphi \in X$ we have Parseval's equality

$$\|\varphi\|^2 = \sum_{n=1}^{\infty} |(\varphi, u_n)|^2.$$

Show that properties (a)-(c) imply that
(d) $x = 0$ is the only element in $X$ with $(x, u_n) = 0$ for all $n \in \mathbb{N}$,
and that (a), (b), (c), and (d) are equivalent if $X$ is a Hilbert space.

3.20 Show that the best approximation to a function $f \in C[0, 2\pi]$ in the $L_2$
norm with respect to the space of trigonometric polynomials of degree at most $n$
is given by the partial sum

$$(P_n f)(x) = \frac{a_0}{2} + \sum_{k=1}^{n} a_k \cos kx + \sum_{k=1}^{n} b_k \sin kx, \quad x \in [0, 2\pi],$$

of the Fourier series of $f$ with the Fourier coefficients

$$a_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \cos kx \, dx, \qquad b_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \sin kx \, dx.$$

4
Iterative Methods for Linear Systems

This chapter is devoted to applying the analysis developed in the previous


chapter to the iterative solution of systems of linear equations. In particular,
we will discuss in detail the Jacobi and the Gauss-Seidel iterations, which
essentially go back to Gauss. In Supplementum Theoriae Combinationis
Observationum Erroribus Minime Obnoxia, published in 1822, Gauss used
a variant of the Gauss-Seidel method for the solution of the linear systems
arising through his least squares method, since they were too large for
elimination methods.
With the advent of computers the size of the linear systems that could
be solved grew enormously, leading to the requirement of speedup of the
convergence of the classical Jacobi and Gauss-Seidel iterations. In this
context, we will introduce the reader to the idea of relaxation methods,
including a typical example that illustrates the dramatic gain in the speed
of convergence by overrelaxation. We will conclude the section with the idea
of defect correction iteration and indicate its application to the very efficient
solution of the large linear systems arising from the discretization of linear
differential and integral equations by two-grid and multigrid methods.

4.1 Jacobi and Gauss-Seidel Iterations


We start by supplementing the sufficient condition of Theorem 3.48 for
convergence of the method of successive approximations by establishing a
necessary and sufficient condition for the finite-dimensional case.

Theorem 4.1 Let $B$ be an $n \times n$ matrix. Then the successive approximations

$$x_{\nu+1} := Bx_\nu + z, \quad \nu = 0, 1, 2, \ldots,$$

converge for each $z \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ if and only if

$$\rho(B) < 1$$

for the spectral radius of $B$.
Proof. If $\rho(B) < 1$, then by Theorem 3.32 there exists a norm $\|\cdot\|$ on $\mathbb{C}^n$
such that $\|B\| < 1$. Now convergence follows from Theorem 3.48 together
with the equivalence of all norms on $\mathbb{C}^n$ according to Theorem 3.8.
Conversely, suppose that convergence holds. If we assume that $\rho(B) \ge 1$,
then there exists an eigenvalue $\lambda$ of $B$ with $|\lambda| \ge 1$. Let $x$ denote an associated
eigenvector. Then the successive iterations for the right-hand side
$z = x$ and the starting element $x_0 = x$ lead to the divergent sequence

$$x_\nu = \left(\sum_{k=0}^{\nu} \lambda^k\right) x.$$

This is a contradiction. □
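
The role of the spectral radius in Theorem 4.1 can be seen in a small numerical experiment. The sketch below is an illustration only: the matrix `B`, the vector `z`, and the function name `iterate` are assumptions chosen for this example. The matrix has maximum row sum 1.5, so the sufficient condition $\|B\|_\infty < 1$ fails, but its spectral radius is 0.5 and the iteration still converges.

```python
def iterate(B, z, x0, steps):
    """Run x_{nu+1} = B x_nu + z for a fixed number of steps."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        x = [sum(B[j][k] * x[k] for k in range(n)) + z[j] for j in range(n)]
    return x


# Upper triangular B with double eigenvalue 0.5, hence rho(B) = 0.5 < 1,
# although the maximum row sum of B is 1.5 > 1.
B = [[0.5, 1.0], [0.0, 0.5]]
z = [1.0, 1.0]
x = iterate(B, z, [0.0, 0.0], 200)
# The fixed point solves (I - B) x = z, which gives x = (6, 2).
```

The limit does not depend on the starting vector, in line with the theorem.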

We note that Theorem 4.1 remains valid for bounded linear operators
$B : X \to X$ in infinite-dimensional Banach spaces with the definition of
the spectral radius appropriately modified. However, the proof requires a
different and deeper analysis.
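For the iterative solution of a system of linear equations of the form

$$Ax = y$$

we distinguish different methods by the way in which the original system
is transformed into an equivalent fixed-point form. We decompose $A$ by

$$A = D + A_L + A_R$$

into a diagonal matrix

$$D = \operatorname{diag}(a_{11}, \ldots, a_{nn}),$$

a proper lower (left) triangular matrix

$$A_L = \begin{pmatrix} 0 & & & & \\ a_{21} & 0 & & & \\ a_{31} & a_{32} & 0 & & \\ \vdots & \vdots & \ddots & \ddots & \\ a_{n1} & a_{n2} & \cdots & a_{n,n-1} & 0 \end{pmatrix},$$

and a proper upper (right) triangular matrix

$$A_R = \begin{pmatrix} 0 & a_{12} & \cdots & \cdots & a_{1n} \\ & 0 & a_{23} & \cdots & a_{2n} \\ & & \ddots & \ddots & \vdots \\ & & & 0 & a_{n-1,n} \\ & & & & 0 \end{pmatrix}.$$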

We assume that all the diagonal entries of $A$ are different from zero. Hence
the inverse $D^{-1}$ of $D$ exists.
In the method attributed to Jacobi, which is sometimes also called the
method of simultaneous displacements, the system $Ax = y$ is transformed
into the equivalent form

$$x = -D^{-1}(A_L + A_R)x + D^{-1}y,$$

and the latter is solved by successive approximations

$$x_{\nu+1} := -D^{-1}(A_L + A_R)x_\nu + D^{-1}y, \quad \nu = 0, 1, 2, \ldots,$$

with arbitrarily chosen starting element $x_0$. Written in components, one
step of the Jacobi iteration scheme reads

$$x_{\nu+1,j} = -\sum_{k=1,\,k \ne j}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n.$$
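
The component form of the Jacobi step translates almost literally into code. The following Python sketch is a minimal illustration under stated assumptions: matrices are plain nested lists, the function names are hypothetical, and the 2 x 2 test system is chosen only because its exact solution $(1, 2)$ is easy to check.

```python
def jacobi_step(A, y, x):
    """One Jacobi step: every new component uses only the old vector x."""
    n = len(A)
    return [
        (y[j] - sum(A[j][k] * x[k] for k in range(n) if k != j)) / A[j][j]
        for j in range(n)
    ]


def jacobi(A, y, x0, steps):
    """Apply a fixed number of Jacobi steps starting from x0."""
    x = list(x0)
    for _ in range(steps):
        x = jacobi_step(A, y, x)
    return x


A = [[4.0, 1.0], [2.0, 5.0]]  # strictly row-diagonally dominant
y = [6.0, 12.0]               # exact solution is (1, 2)
x = jacobi(A, y, [0.0, 0.0], 60)
```

Because a whole new vector is built before the old one is discarded, two vectors are held in memory at the same time.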

Theorem 4.2 Assume that the matrix $A = (a_{jk})$ satisfies

$$q_\infty := \max_{j=1,\ldots,n} \sum_{k=1,\,k \ne j}^{n} \left|\frac{a_{jk}}{a_{jj}}\right| < 1 \qquad (4.1)$$

or

$$q_1 := \max_{k=1,\ldots,n} \sum_{j=1,\,j \ne k}^{n} \left|\frac{a_{jk}}{a_{jj}}\right| < 1 \qquad (4.2)$$

or

$$q_2 := \left(\sum_{j,k=1,\,j \ne k}^{n} \left|\frac{a_{jk}}{a_{jj}}\right|^2\right)^{1/2} < 1. \qquad (4.3)$$

Then the Jacobi method, or method of simultaneous displacements,

$$x_{\nu+1,j} = -\sum_{k=1,\,k \ne j}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n, \quad \nu = 0, 1, 2, \ldots,$$

converges for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the unique solution of
$Ax = y$ (in any norm on $\mathbb{C}^n$). For $\mu = 1, 2, \infty$, if $q_\mu < 1$, we have the a
priori error estimate

$$\|x_\nu - x\|_\mu \le \frac{q_\mu^\nu}{1 - q_\mu}\, \|x_1 - x_0\|_\mu$$

and the a posteriori error estimate

$$\|x_\nu - x\|_\mu \le \frac{q_\mu}{1 - q_\mu}\, \|x_\nu - x_{\nu-1}\|_\mu$$

for all $\nu \in \mathbb{N}$.

Proof. The Jacobi matrix $-D^{-1}(A_L + A_R)$ has diagonal entries zero and
off-diagonal entries $-a_{jk}/a_{jj}$. Hence by Theorem 3.26 we have

$$\|-D^{-1}(A_L + A_R)\|_\infty = q_\infty,$$

$$\|-D^{-1}(A_L + A_R)\|_1 = q_1,$$

$$\|-D^{-1}(A_L + A_R)\|_2 \le q_2.$$

Now the assertion follows from Theorem 3.48. □
Note that the sufficient convergence conditions (4.1)-(4.3) are not equivalent.
Roughly speaking, each criterion ensures convergence if the diagonal
entries of $A$ are dominant. The condition (4.1) can also be written as

$$\sum_{k=1,\,k \ne j}^{n} |a_{jk}| < |a_{jj}|, \quad j = 1, \ldots, n; \qquad (4.4)$$

i.e., the matrix $A$ is required to be strictly row-diagonally dominant. From
(4.2) it can be deduced (see Problem 4.4) that if

$$\sum_{j=1,\,j \ne k}^{n} |a_{jk}| < |a_{kk}|, \quad k = 1, \ldots, n, \qquad (4.5)$$

i.e., if the matrix $A$ is strictly column-diagonally dominant, then the Jacobi
iterations converge.
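
To make the three criteria concrete, the following sketch evaluates $q_\infty$, $q_1$, and $q_2$ from (4.1)-(4.3) for a small matrix. The function name and the sample matrix are assumptions for illustration; the code only restates the definitions.

```python
import math


def jacobi_contraction_numbers(A):
    """Return (q_inf, q_1, q_2) from (4.1)-(4.3) for a square matrix A."""
    n = len(A)
    # r[j][k] = |a_jk / a_jj| off the diagonal, 0 on the diagonal
    r = [
        [abs(A[j][k] / A[j][j]) if k != j else 0.0 for k in range(n)]
        for j in range(n)
    ]
    q_inf = max(sum(row) for row in r)                           # row sums
    q_1 = max(sum(r[j][k] for j in range(n)) for k in range(n))  # column sums
    q_2 = math.sqrt(sum(v * v for row in r for v in row))
    return q_inf, q_1, q_2


A = [[4.0, 1.0], [2.0, 5.0]]
q_inf, q_1, q_2 = jacobi_contraction_numbers(A)
```

For this matrix all three numbers are below one, so each of the three conditions on its own guarantees convergence of the Jacobi iterations.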

For the Gauss-Seidel method, which is also known as the method of
successive displacements, we proceed differently and transform $Ax = y$ via

$$(D + A_L)x = -A_R x + y$$

into the equivalent form

$$x = -(D + A_L)^{-1}A_R x + (D + A_L)^{-1}y,$$

which is then solved by the successive approximations

$$x_{\nu+1} := -(D + A_L)^{-1}A_R x_\nu + (D + A_L)^{-1}y, \quad \nu = 0, 1, 2, \ldots,$$

with arbitrarily chosen starting element $x_0$. For the actual computations
we rewrite this as

$$(D + A_L)x_{\nu+1} = -A_R x_\nu + y, \quad \nu = 0, 1, 2, \ldots,$$

and solve the linear system for $x_{\nu+1}$ with the lower triangular matrix $D + A_L$
by forward substitution. This leads to the Gauss-Seidel iteration scheme
in the following explicit form:

$$x_{\nu+1,j} = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_{\nu+1,k} - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n.$$

Here and in the sequel empty sums have to be interpreted as zero.
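
A minimal Python sketch of the explicit Gauss-Seidel scheme follows. The function name, the nested-list matrix format, and the test system are assumptions for illustration. The single vector `x` is overwritten in place, so components with index below `j` already hold the new values when component `j` is updated, which is exactly the forward substitution described above.

```python
def gauss_seidel(A, y, x0, steps):
    """Gauss-Seidel iteration with in-place update of the current vector."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        for j in range(n):
            # x[k] is new for k < j and old for k > j
            s = sum(A[j][k] * x[k] for k in range(n) if k != j)
            x[j] = (y[j] - s) / A[j][j]
    return x


A = [[4.0, 1.0], [2.0, 5.0]]
y = [6.0, 12.0]  # exact solution is (1, 2)
x = gauss_seidel(A, y, [0.0, 0.0], 40)
```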


In the Jacobi iteration scheme all the components of the new approximation
vector $x_{\nu+1}$ are obtained by using only the components of the
previous approximation vector $x_\nu$, which explains why this method is also
called the method of simultaneous displacements. However, in the Gauss-Seidel
iterations each new component of $x_{\nu+1}$ is immediately used in the
computation of the next component; i.e., for computing the $j$th component
$x_{\nu+1,j}$, the values $x_{\nu+1,1}, x_{\nu+1,2}, \ldots, x_{\nu+1,j-1}$ are already used. This
is very convenient for computer calculations, since the new values can be
stored in the locations held by the old values, which reduces the storage
requirements.
Theorem 4.3 Assume that the matrix $A = (a_{jk})$ fulfills the Sassenfeld
criterion

$$p := \max_{j=1,\ldots,n} p_j < 1,$$

where the numbers $p_j$ are recursively defined by

$$p_1 := \sum_{k=2}^{n} \left|\frac{a_{1k}}{a_{11}}\right|, \qquad p_j := \sum_{k=1}^{j-1} \left|\frac{a_{jk}}{a_{jj}}\right| p_k + \sum_{k=j+1}^{n} \left|\frac{a_{jk}}{a_{jj}}\right|, \quad j = 2, \ldots, n.$$

Then the Gauss-Seidel method, or method of successive displacements,

$$x_{\nu+1,j} = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_{\nu+1,k} - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n, \quad \nu = 0, 1, 2, \ldots,$$

converges for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the unique solution of
$Ax = y$ (in any norm on $\mathbb{C}^n$). We have the a priori error estimate

$$\|x_\nu - x\|_\infty \le \frac{p^\nu}{1 - p}\, \|x_1 - x_0\|_\infty$$

and the a posteriori error estimate

$$\|x_\nu - x\|_\infty \le \frac{p}{1 - p}\, \|x_\nu - x_{\nu-1}\|_\infty$$

for all $\nu \in \mathbb{N}$.

Proof. Consider the equation

$$x = -(D + A_L)^{-1}A_R z$$

for $z \in \mathbb{C}^n$ with $\|z\|_\infty = 1$, that is,

$$x_j = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_k - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, z_k, \quad j = 1, \ldots, n.$$

By induction, this implies that $|x_j| \le p_j$ for $j = 1, \ldots, n$, and therefore
$\|x\|_\infty \le p$. Hence we have

$$\|(D + A_L)^{-1}A_R\|_\infty \le p,$$

and the assertion of the theorem follows from Theorem 3.48. □


Corollary 4.4 Assume that the matrix $A$ is strictly row-diagonally dominant.
Then the Gauss-Seidel iterations converge.

Example 4.5 The tridiagonal matrix

$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$

from Example 2.1 is not strictly row-diagonally dominant, but it satisfies
the Sassenfeld criterion.

Proof. Obviously, $q_\infty = 1$; i.e., (4.1) is not fulfilled. We have the recursion

$$p_j = \frac{1}{2}\, p_{j-1} + \frac{1}{2}, \quad j = 2, \ldots, n-1, \qquad p_n = \frac{1}{2}\, p_{n-1}.$$

From this, by induction, it follows that

$$p_j = 1 - \frac{1}{2^j}, \quad j = 1, \ldots, n-1, \qquad p_n = \frac{1}{2} - \frac{1}{2^n}.$$

Therefore,

$$p = 1 - \frac{1}{2^{n-1}} < 1,$$

and this implies convergence of the Gauss-Seidel iterations by Theorem
4.3. □
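
The recursion for the Sassenfeld numbers is easy to evaluate numerically. The sketch below is an illustration only (the helper names `tridiagonal` and `sassenfeld_numbers` are assumptions); it computes the $p_j$ for the matrix of Example 4.5 with $n = 5$ so that they can be compared with $1 - 2^{-j}$.

```python
def tridiagonal(n):
    """The n x n matrix with 2 on the diagonal and -1 next to it."""
    return [
        [2.0 if j == k else (-1.0 if abs(j - k) == 1 else 0.0) for k in range(n)]
        for j in range(n)
    ]


def sassenfeld_numbers(A):
    """Numbers p_1, ..., p_n of the Sassenfeld criterion (0-based list)."""
    n = len(A)
    p = []
    for j in range(n):
        s = sum(abs(A[j][k] / A[j][j]) * p[k] for k in range(j))
        s += sum(abs(A[j][k] / A[j][j]) for k in range(j + 1, n))
        p.append(s)
    return p


p = sassenfeld_numbers(tridiagonal(5))
contraction = max(p)  # the number p of Theorem 4.3
```

The maximum is attained for $j = n - 1$ and approaches one as $n$ grows, which is the slow convergence discussed after the example.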

Since the matrix $A$ is tridiagonal, the system $Ax = y$ can be solved
efficiently by elimination (see Problem 2.9). Nevertheless, this matrix provides
a very suitable example for the analysis of iterative methods for linear
systems arising in the discretization of ordinary and partial differential
equations. This is due to the fact that in more general cases, for example
for the linear system of Example 2.2, there are more technical details
to consider, which distract from the basic principles. However, these basic
principles do not depend on the dimension of the underlying differential
equation problem.
In Example 4.5, if $n$ is large, the contraction number $p$ will be close to
one, i.e., the convergence rate of the Gauss-Seidel iterations will be unsatisfactorily
slow. Before we indicate how the convergence can be accelerated,
we continue by discussing a weaker form of row-diagonal dominance.

Definition 4.6 An $n \times n$ matrix $A = (a_{jk})$ is called reducible if there exist
two nonempty sets $N, M \subset \{1, \ldots, n\}$ such that

$$N \cap M = \emptyset, \quad N \cup M = \{1, \ldots, n\},$$

and

$$a_{jk} = 0, \quad j \in N, \quad k \in M.$$

Otherwise the matrix is called irreducible.
A reducible matrix $A$, after a reordering of the rows and columns, can
be partitioned into a $2 \times 2$ block matrix of the form

$$A = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix}$$

(see Problem 4.5). Therefore, solving a linear system with the matrix $A$
can be reduced to solving two smaller linear systems with the matrices $A_{11}$
and $A_{22}$.

Theorem 4.7 Assume that the matrix $A = (a_{jk})$ is irreducible and weakly
row-diagonally dominant; i.e., $A$ is row-diagonally dominant,

$$\sum_{k=1,\,k \ne j}^{n} |a_{jk}| \le |a_{jj}|, \quad j = 1, \ldots, n, \qquad (4.6)$$

with strict inequality holding for at least one row $j$. Then the Jacobi iterations
converge for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the unique solution of
$Ax = y$ (in any norm on $\mathbb{C}^n$).
Proof. By (4.6) and Theorem 3.26 we have that $\|B\|_\infty \le 1$ for the Jacobi
matrix $B = -D^{-1}(A_L + A_R)$. Therefore, from Theorem 3.32 it follows that
$\rho(B) \le 1$ for the spectral radius.
Now assume that there exists an eigenvalue $\lambda$ of $B$ with $|\lambda| = 1$. For the
associated eigenvector $x$ we may assume that $\|x\|_\infty = 1$. Then from $\lambda x = Bx$
we obtain the inequality

$$|\lambda|\,|x_j| \le \sum_{k=1,\,k \ne j}^{n} \left|\frac{a_{jk}}{a_{jj}}\right| |x_k| \le \sum_{k=1,\,k \ne j}^{n} \left|\frac{a_{jk}}{a_{jj}}\right| \le 1, \quad j = 1, \ldots, n. \qquad (4.7)$$

Let $N := \{j : |x_j| = 1\}$. Since $\|x\|_\infty = 1$, we have that $N \ne \emptyset$. For $j \in N$
we have $|\lambda|\,|x_j| = 1$, and therefore equality holds in (4.7); i.e.,

$$\sum_{k=1,\,k \ne j}^{n} \left|\frac{a_{jk}}{a_{jj}}\right| = 1, \quad j \in N.$$

From this it follows that

$$M := \{1, \ldots, n\} \setminus N \ne \emptyset,$$

since $A$ is weakly row-diagonally dominant. Because $A$ is irreducible, there
exist $j_0 \in N$ and $k_0 \in M$ such that $a_{j_0 k_0} \ne 0$. Now by using

$$|a_{j_0 k_0}|\,|x_{k_0}| < |a_{j_0 k_0}|$$

we obtain the contradiction

$$1 = |\lambda|\,|x_{j_0}| \le \sum_{k=1,\,k \ne j_0}^{n} \left|\frac{a_{j_0 k}}{a_{j_0 j_0}}\right| |x_k| < \sum_{k=1,\,k \ne j_0}^{n} \left|\frac{a_{j_0 k}}{a_{j_0 j_0}}\right| = 1.$$

Therefore, we have $\rho(B) < 1$, and the statement of the theorem follows
from Theorem 4.1. □

We leave it to the reader as an exercise to show that the matrix A


from Example 4.5 is irreducible and weakly row-diagonally dominant (see
Problem 4.6), implying convergence of the Jacobi iterations.

4.2 Relaxation Methods


From combining the a priori error estimate of Theorem 3.48 with Theorem
4.1 we see that the spectral radius $\rho(B)$ of the iteration matrix $B$ may
be considered as a measure for the speed of convergence of the successive
approximations. Therefore, it is desirable to design the iterative scheme
such that $\rho(B)$ becomes small. This aim is the motivation of the relaxation
methods to be discussed in this section.
Each step of the Jacobi iterations can be written in the form

$$x_{\nu+1} = x_\nu + D^{-1}(y - Ax_\nu),$$

indicating how the new approximation $x_{\nu+1}$ is obtained by correcting the
previous approximation $x_\nu$. The basic idea of the relaxation methods is
to multiply the correction term by some weight factor. Note that if the
following relaxation iterations converge, then they converge to a solution
of $Ax = y$.
Definition 4.8 The iterative scheme

$$x_{\nu+1} := x_\nu + \omega D^{-1}(y - Ax_\nu), \quad \nu = 0, 1, 2, \ldots,$$

i.e., in components

$$x_{\nu+1,j} = x_{\nu,j} + \frac{\omega}{a_{jj}} \left[y_j - \sum_{k=1}^{n} a_{jk}\, x_{\nu,k}\right], \quad j = 1, \ldots, n,$$

is known as the Jacobi method with relaxation. The weight factor $\omega > 0$ is
called the relaxation parameter.
Theorem 4.9 Assume that the Jacobi matrix $B := -D^{-1}(A_L + A_R)$ has
real eigenvalues and spectral radius less than one. Then the spectral radius
of the iteration matrix

$$I - \omega D^{-1}A = (1 - \omega)I + \omega B$$

for the Jacobi method with relaxation becomes minimal for the relaxation
parameter

$$\omega_{\text{opt}} = \frac{2}{2 - \lambda_{\max} - \lambda_{\min}}$$

and has spectral radius

$$\rho(I - \omega_{\text{opt}} D^{-1}A) = \frac{\lambda_{\max} - \lambda_{\min}}{2 - \lambda_{\max} - \lambda_{\min}},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and the largest eigenvalue of $B$,
respectively. In the case $\lambda_{\min} \ne -\lambda_{\max}$ the convergence of the Jacobi method
with optimal relaxation parameter is faster than the convergence of the
Jacobi method without relaxation.
Proof. For $\omega > 0$ the equation $Bu = \lambda u$ is equivalent to

$$[(1 - \omega)I + \omega B]u = [1 - \omega + \omega\lambda]u.$$

Hence the eigenvalues $\lambda$ of $B$ correspond to the eigenvalues $1 - \omega + \omega\lambda$ of
$(1 - \omega)I + \omega B$. Therefore, the eigenvalues of $(1 - \omega)I + \omega B$ are real, and
the smallest eigenvalue of $(1 - \omega)I + \omega B$ is given by $1 - \omega + \omega\lambda_{\min}$ and the
largest by $1 - \omega + \omega\lambda_{\max}$. Obviously, the spectral radius becomes minimal
if the smallest and the largest eigenvalue are of opposite sign and have the
same absolute value, i.e., if

$$1 - \omega_{\text{opt}} + \omega_{\text{opt}}\lambda_{\min} = -1 + \omega_{\text{opt}} - \omega_{\text{opt}}\lambda_{\max}.$$

From this, elementary algebra yields the optimal parameter $\omega_{\text{opt}}$ and the
spectral radius $\rho(I - \omega_{\text{opt}} D^{-1}A)$ as stated in the theorem. □
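
The formulas of Theorem 4.9 can be checked numerically once the extreme eigenvalues of the Jacobi matrix are known. In the sketch below the eigenvalues $-0.2$ and $0.8$ are made-up sample values and the function names are hypothetical; the code only evaluates the expressions from the theorem and its proof.

```python
def optimal_jacobi_relaxation(lam_min, lam_max):
    """Optimal parameter and resulting spectral radius from Theorem 4.9."""
    omega = 2.0 / (2.0 - lam_max - lam_min)
    rho = (lam_max - lam_min) / (2.0 - lam_max - lam_min)
    return omega, rho


def relaxed_radius(omega, lam_min, lam_max):
    """Spectral radius of (1 - omega) I + omega B for real eigenvalues of B.

    The eigenvalues 1 - omega + omega * lam are extremal at lam_min and lam_max.
    """
    return max(
        abs(1.0 - omega + omega * lam_min),
        abs(1.0 - omega + omega * lam_max),
    )


lam_min, lam_max = -0.2, 0.8  # assumed extreme eigenvalues of the Jacobi matrix
omega_opt, rho_opt = optimal_jacobi_relaxation(lam_min, lam_max)
rho_plain = relaxed_radius(1.0, lam_min, lam_max)  # Jacobi without relaxation
```

With these sample values the relaxed iteration has a smaller spectral radius than the plain Jacobi method, as the theorem predicts when $\lambda_{\min} \ne -\lambda_{\max}$.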

For the Gauss-Seidel iterations, from $(D + A_L)x_{\nu+1} = -A_R x_\nu + y$ it
follows that

$$x_{\nu+1} = x_\nu + D^{-1}[y - A_L x_{\nu+1} - (D + A_R)x_\nu].$$

Hence, the corresponding relaxation method is defined as follows. Note
again that if the relaxation iterations converge, then they converge to a
solution of $Ax = y$.
Definition 4.10 The iterative scheme

$$x_{\nu+1} = x_\nu + \omega D^{-1}[y - A_L x_{\nu+1} - (D + A_R)x_\nu], \quad \nu = 0, 1, 2, \ldots,$$

i.e., in components

$$x_{\nu+1,j} = x_{\nu,j} + \frac{\omega}{a_{jj}} \left[y_j - \sum_{k=1}^{j-1} a_{jk}\, x_{\nu+1,k} - \sum_{k=j}^{n} a_{jk}\, x_{\nu,k}\right], \quad j = 1, \ldots, n,$$

is known as the Gauss-Seidel method with relaxation or as the successive
overrelaxation (SOR) method with relaxation coefficient $\omega > 0$.
From

$$(D + \omega A_L)x_{\nu+1} = \omega y + [(1 - \omega)D - \omega A_R]x_\nu$$

we obtain that the iteration matrix of the SOR method is given by

$$B(\omega) := (D + \omega A_L)^{-1}[(1 - \omega)D - \omega A_R].$$

Here, as opposed to the relaxation of the Jacobi method, the iteration
matrix depends nonlinearly on the relaxation parameter. This makes the
convergence analysis of the SOR method more complicated.
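
The component form of the SOR scheme can be sketched in a few lines of Python. The function name, the data layout, and the 3 x 3 test system are assumptions for illustration. As in the Gauss-Seidel method the vector is updated in place, so the bracket in the component formula is simply the current defect of row $j$.

```python
def sor(A, y, x0, omega, steps):
    """Successive overrelaxation with relaxation parameter omega."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        for j in range(n):
            # x[k] holds the new value for k < j and the old value for k >= j
            defect = y[j] - sum(A[j][k] * x[k] for k in range(n))
            x[j] += omega * defect / A[j][j]
    return x


A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
y = [1.0, 0.0, 1.0]  # exact solution is (1, 1, 1)
x = sor(A, y, [0.0, 0.0, 0.0], omega=1.2, steps=200)
```

For `omega = 1.0` the function reduces to the Gauss-Seidel method.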
Theorem 4.11 (Kahan) A necessary condition for the SOR method to
be convergent is that $0 < \omega < 2$.
Proof. Since the eigenvalues $\mu_1, \ldots, \mu_n$ of $B(\omega)$ are the zeros of the characteristic
polynomial, they satisfy

$$\prod_{j=1}^{n} \mu_j = \det B(\omega)$$

(where multiple eigenvalues are repeated according to their algebraic multiplicity).
From this, by the multiplication rules for determinants and since
$D + \omega A_L$ and $(1 - \omega)D - \omega A_R$ are triangular matrices, it follows that

$$\prod_{j=1}^{n} \mu_j = \det(D + \omega A_L)^{-1} \det[(1 - \omega)D - \omega A_R] = (1 - \omega)^n.$$

This now implies

$$\rho[B(\omega)] \ge |1 - \omega|,$$

and from Theorem 4.1 we conclude the necessity of $0 < \omega < 2$ for convergence. □

Theorem 4.12 (Ostrowski) If $A$ is Hermitian and positive definite, then
the SOR method converges for all $x_0 \in \mathbb{C}^n$, all $y \in \mathbb{C}^n$, and all $0 < \omega < 2$
to the unique solution of $Ax = y$.
Proof. Let $\mu$ be an eigenvalue of $B(\omega)$ with eigenvector $x$; i.e.,

$$[(1 - \omega)D - \omega A_R]x = \mu(D + \omega A_L)x.$$

With the aid of

$$(2 - \omega)D - \omega A - \omega(A_R - A_L) = 2[(1 - \omega)D - \omega A_R]$$

and

$$(2 - \omega)D + \omega A - \omega(A_R - A_L) = 2[D + \omega A_L]$$

we deduce that

$$[(2 - \omega)D - \omega A - \omega(A_R - A_L)]x = \mu[(2 - \omega)D + \omega A - \omega(A_R - A_L)]x.$$

Taking the Euclidean scalar product with $x$, it now follows that

$$\mu = \frac{(2 - \omega)d - \omega a + i\omega s}{(2 - \omega)d + \omega a + i\omega s},$$

where we have set

$$a := (Ax, x), \quad d := (Dx, x), \quad s := i(A_R x - A_L x, x).$$

Since $A$ is positive definite, we have $a > 0$ and $d > 0$, and since $A$ is
Hermitian, $s$ is real. From

$$|(2 - \omega)d - \omega a| < |(2 - \omega)d + \omega a|$$

for $0 < \omega < 2$ we now can conclude that $|\mu| < 1$ for $0 < \omega < 2$. Hence
convergence of the SOR method for $0 < \omega < 2$ follows from Theorem 4.1. □

The calculation of the optimal relaxation parameter, i.e., the parameter
minimizing the spectral radius, is difficult except in some simple cases.
Usually it is obtained only approximately by trial and error, based on trying
several values of $\omega$ and observing the effect on the speed of convergence.
However, the effort is well worth the time, since the resulting improvement
of the convergence can be considerable, as we will indicate by the
following analysis, which relates the convergence of the SOR method to
that of the Jacobi method for a certain class of matrices that occurs in the
discretization of boundary value problems.
Definition 4.13 A matrix $A = D + A_L + A_R$ with nonsingular diagonal
$D$ is called consistently ordered if the eigenvalues of

$$C(\alpha) := -\alpha D^{-1}A_L - \frac{1}{\alpha}\, D^{-1}A_R, \quad \alpha \in \mathbb{C} \setminus \{0\},$$

do not depend on $\alpha$.
The following remark ensures that the analysis we are going to develop
applies to the matrix of Example 2.1, i.e., of Example 4.5.
Remark 4.14 Tridiagonal matrices with nonzero diagonal elements are
consistently ordered.
Proof. After introducing the diagonal matrix

$$S(\alpha) := \operatorname{diag}(1, \alpha, \alpha^2, \ldots, \alpha^{n-1})$$

for tridiagonal matrices $A = D + A_L + A_R$, we have that

$$S(\alpha)C(1)S(\alpha)^{-1} = C(\alpha);$$

i.e., all matrices $C(\alpha)$ are similar, and therefore they have the same eigenvalues. □

Without going into detail, we wish to say that a much wider class of
matrices arising in the discretization of differential equations enjoys the
property of being consistently ordered in the sense of Definition 4.13. For
a more comprehensive study we refer to [61, 63, 66].
Theorem 4.15 (Young) Assume that $A$ is a consistently ordered matrix
and that the eigenvalues of the Jacobi matrix $-D^{-1}(A_L + A_R)$ are real
with spectral radius $\Lambda = \rho[-D^{-1}(A_L + A_R)] < 1$. Then the SOR method
converges for all $0 < \omega < 2$. The spectral radius of the SOR matrix $B(\omega)$
is minimal for

$$\omega_{\text{opt}} = \frac{2}{1 + \sqrt{1 - \Lambda^2}} \ge 1.$$

In this case we have

$$\rho[B(\omega_{\text{opt}})] = \frac{1 - \sqrt{1 - \Lambda^2}}{1 + \sqrt{1 - \Lambda^2}}.$$

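Proof. From

$$(I + \omega D^{-1}A_L)[\mu I - B(\omega)] = \mu(I + \omega D^{-1}A_L) - D^{-1}[(1 - \omega)D - \omega A_R]$$

$$= (\mu + \omega - 1)I + \sqrt{\mu}\,\omega \left(\sqrt{\mu}\, D^{-1}A_L + \frac{1}{\sqrt{\mu}}\, D^{-1}A_R\right)$$

and the fact that $I + \omega D^{-1}A_L$ is nonsingular it can be seen that $\mu \ne 0$ is
an eigenvalue of $B(\omega)$ if and only if

$$\lambda = \frac{\mu + \omega - 1}{\sqrt{\mu}\,\omega} \qquad (4.8)$$

is an eigenvalue of

$$-\sqrt{\mu}\, D^{-1}A_L - \frac{1}{\sqrt{\mu}}\, D^{-1}A_R.$$

Since $A$ is assumed to be consistently ordered, it follows that $\mu \ne 0$ is an
eigenvalue of $B(\omega)$ if and only if $\lambda$ is an eigenvalue of $-D^{-1}(A_L + A_R)$.
Solving the quadratic equation

$$\mu + \omega - 1 = \sqrt{\mu}\,\omega\lambda$$

yields

$$\mu = \left(\frac{\omega\lambda}{2} \pm \sqrt{\frac{\omega^2\lambda^2}{4} + 1 - \omega}\right)^2.$$

Setting $\alpha = -1$ in Definition 4.13, it is obvious that if $\lambda$ is an eigenvalue
of $-D^{-1}(A_L + A_R)$, then $-\lambda$ also is an eigenvalue of $-D^{-1}(A_L + A_R)$.
Therefore, since we are interested only in the spectral radius of $B(\omega)$, we
can confine our considerations to

$$\mu = \left(\frac{\omega|\lambda|}{2} + \sqrt{\frac{\omega^2\lambda^2}{4} + 1 - \omega}\right)^2.$$

Because of $|\lambda| < 1$, the quadratic equation

$$\omega^2\lambda^2 - 4\omega + 4 = 0$$

has two real solutions, and only one of them belongs to the interval $(0, 2)$,
namely

$$\omega_0(\lambda) = \frac{2}{1 + \sqrt{1 - \lambda^2}} \ge 1.$$

This implies that

$$\frac{\omega^2\lambda^2}{4} + 1 - \omega \;\begin{cases} \ge 0, & 0 < \omega \le \omega_0(\lambda), \\ < 0, & \omega_0(\lambda) < \omega < 2. \end{cases}$$

Therefore, we have

$$|\mu(\omega)| = \left(\frac{\omega|\lambda|}{2} + \sqrt{\frac{\omega^2\lambda^2}{4} + 1 - \omega}\right)^2, \quad 0 < \omega \le \omega_0(\lambda). \qquad (4.9)$$

For $\omega_0(\lambda) < \omega < 2$ the eigenvalues are complex, with

$$|\mu(\omega)| = \omega - 1, \quad \omega_0(\lambda) < \omega < 2. \qquad (4.10)$$

From the expressions (4.9) and (4.10) it can be seen that $|\mu(\omega)|$ is monotonically
nondecreasing with respect to $|\lambda|$. Hence

$$\rho[B(\omega)] = \begin{cases} \left(\dfrac{\omega\Lambda}{2} + \sqrt{\dfrac{\omega^2\Lambda^2}{4} + 1 - \omega}\right)^2, & 0 < \omega \le \omega_0(\Lambda), \\[2ex] \omega - 1, & \omega_0(\Lambda) < \omega < 2. \end{cases} \qquad (4.11)$$

The function

$$f(\omega) := \frac{\omega\Lambda}{2} + \sqrt{\frac{\omega^2\Lambda^2}{4} + 1 - \omega}$$

has the properties $f(0) = 1$ and

$$f'(\omega) = \frac{\Lambda}{2} + \frac{\omega\Lambda^2 - 2}{2\sqrt{\omega^2\Lambda^2 + 4 - 4\omega}} < 0.$$

The latter follows from

$$\Lambda^2(4 - 4\omega + \omega^2\Lambda^2) < 4 - 4\omega\Lambda^2 + \omega^2\Lambda^4 = (2 - \omega\Lambda^2)^2.$$

Therefore, the spectral radius described by (4.11) is strictly monotonically
decreasing for $0 < \omega < \omega_0$ and strictly monotonically increasing for
$\omega_0 < \omega < 2$ (see Figure 4.1). Since $\rho[B(0)] = \rho[B(2)] = 1$, we finally
obtain that $\rho[B(\omega)] < 1$ for all $0 < \omega < 2$ and that $\rho[B(\omega)]$ assumes its
minimum for $\omega = \omega_0(\Lambda)$ with value $\rho[B(\omega_0(\Lambda))] = \omega_0(\Lambda) - 1$. □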
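
The piecewise expression (4.11) is straightforward to evaluate. The following sketch is an illustration with hypothetical function names and the sample value $\Lambda = 0.9$; it tabulates nothing but the formulas of Theorem 4.15. The discriminant is clamped at zero because rounding can make it slightly negative at $\omega = \omega_0$.

```python
import math


def omega_opt(Lam):
    """Optimal SOR parameter for Jacobi spectral radius Lam < 1."""
    return 2.0 / (1.0 + math.sqrt(1.0 - Lam * Lam))


def sor_radius(omega, Lam):
    """Spectral radius of the SOR matrix according to (4.11)."""
    if omega <= omega_opt(Lam):
        disc = max(omega * omega * Lam * Lam / 4.0 + 1.0 - omega, 0.0)
        return (omega * Lam / 2.0 + math.sqrt(disc)) ** 2
    return omega - 1.0


Lam = 0.9                 # assumed spectral radius of the Jacobi matrix
w0 = omega_opt(Lam)
rho_gs = sor_radius(1.0, Lam)    # Gauss-Seidel: equals Lam ** 2
rho_best = sor_radius(w0, Lam)   # close to w0 - 1
```

Evaluating `sor_radius` on a grid of values in $(0, 2)$ reproduces the shape described in the proof: a decrease up to $\omega_0$ followed by the linear increase $\omega - 1$.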

Corollary 4.16 Under the assumptions of Theorem 4.15 the Gauss-Seidel
method converges twice as fast as the Jacobi method.
Proof. From (4.8) we observe that $\mu = \lambda^2$ for $\omega = 1$; i.e., we have

$$\rho[B(1)] = \{\rho[-D^{-1}(A_L + A_R)]\}^2$$

for the spectral radii of the Gauss-Seidel matrix $B(1)$ and the Jacobi matrix
$-D^{-1}(A_L + A_R)$. Now the statement follows from the observation that
by the a priori estimate of Theorem 3.48 the number $N$ of iterations required
for a desired accuracy is inversely proportional to the modulus of
the logarithm of the spectral radius; i.e.,

$$\frac{N(\text{Gauss-Seidel})}{N(\text{Jacobi})} \approx \frac{\ln \rho[-D^{-1}(A_L + A_R)]}{\ln \rho[B(1)]} = \frac{1}{2},$$

and this proves the assertion. □

FIGURE 4.1. Spectral radius for SOR (plot of $\rho[B(\omega)]$ against $\omega$, with the values $1$, $\omega_0$, and $2$ marked on the $\omega$-axis)

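Example 4.17 For the tridiagonal matrix $A$ from Example 4.5 we have

$$\frac{N(\text{SOR})}{N(\text{Jacobi})} \approx \frac{\pi}{4(n+1)}$$

for the optimal relaxation parameter.
Proof. Using the trigonometric addition theorem

$$\frac{1}{2} \sin\frac{\pi j(k-1)}{n+1} + \frac{1}{2} \sin\frac{\pi j(k+1)}{n+1} = \cos\frac{\pi j}{n+1}\, \sin\frac{\pi jk}{n+1},$$

it can be seen that the Jacobi matrix

$$\frac{1}{2} \begin{pmatrix} 0 & 1 & & & \\ 1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 0 & 1 \\ & & & 1 & 0 \end{pmatrix}$$

corresponding to Example 4.5 has the eigenvalues

$$\lambda_j = \cos\frac{\pi j}{n+1}, \quad j = 1, \ldots, n,$$

and associated eigenvectors $v_j$ with components

$$v_{j,k} = \sin\frac{\pi jk}{n+1}, \quad k = 1, \ldots, n, \quad j = 1, \ldots, n.$$

Hence,

$$\rho[-D^{-1}(A_L + A_R)] = \cos\frac{\pi}{n+1} \approx 1 - \frac{\pi^2}{2(n+1)^2}$$

and

$$-\ln \rho[-D^{-1}(A_L + A_R)] \approx \frac{\pi^2}{2(n+1)^2}.$$

From Theorem 4.15 we obtain

$$\omega_{\text{opt}} = \frac{2}{1 + \sin\dfrac{\pi}{n+1}}$$

and

$$\rho[B(\omega_{\text{opt}})] = \frac{1 - \sin\dfrac{\pi}{n+1}}{1 + \sin\dfrac{\pi}{n+1}} \approx 1 - \frac{2\pi}{n+1},$$

whence

$$-\ln \rho[B(\omega_{\text{opt}})] \approx \frac{2\pi}{n+1}$$

follows. This concludes the proof. □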

For example, for $n = 30$ the optimal SOR method is about forty times
as fast as the Jacobi method. Note that the gain in the speed
of convergence grows as $n$ increases. The fact that in Example 4.17,
and, more generally, in almost all linear systems arising in the discretization
of boundary value problems, the optimal relaxation parameter has the
property $\omega > 1$ explains why the method is known as the overrelaxation
method.
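
The factor of about forty can be reproduced directly from the spectral radii given in Example 4.17, without the asymptotic approximations. The short sketch below is only a numerical check; the function name is an assumption.

```python
import math


def iteration_ratio(n):
    """Ratio N(Jacobi) / N(SOR) for the model matrix of Example 4.5."""
    theta = math.pi / (n + 1)
    rho_jacobi = math.cos(theta)
    rho_sor = (1.0 - math.sin(theta)) / (1.0 + math.sin(theta))
    # iteration counts are inversely proportional to |ln rho|
    return math.log(rho_sor) / math.log(rho_jacobi)


ratio = iteration_ratio(30)
estimate = 4 * (30 + 1) / math.pi  # asymptotic value 4(n + 1) / pi
```

Both numbers lie between 39 and 40, in line with the statement above.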

4.3 Two-Grid Methods


Consider the linear system

$$Ax = y \qquad (4.12)$$

with a nonsingular matrix $A$, and assume that we already have an approximate
solution $x_0$ available with a residual, or defect,

$$r_0 := y - Ax_0,$$

for which, in general, $r_0 \ne 0$. Then we try to improve on the accuracy by
writing

$$x_1 = x_0 + \delta_0 \qquad (4.13)$$

with some correction term $\delta_0$. Substituting this into (4.12) we obtain that
$\delta_0$ has to satisfy the defect correction equation

$$A\delta_0 = r_0$$

in order that $x_1$ satisfy (4.12). We observe that the correction term $\delta_0$ will,
in general, be small compared to $x_0$, and therefore it is unnecessary to solve
the defect correction equation exactly. Hence we write

$$\delta_0 = A_{\text{approx}}^{-1}\, r_0,$$

where $A_{\text{approx}}^{-1}$ is some approximation for the inverse $A^{-1}$ of $A$. Substituting
this into (4.13) we obtain

$$x_1 = x_0 + A_{\text{approx}}^{-1}[y - Ax_0] \qquad (4.14)$$

as our new approximate solution to (4.12). This procedure is known as the
defect correction principle.
Repeating this process yields the defect correction iteration defined by

$$x_{\nu+1} := x_\nu + A_{\text{approx}}^{-1}[y - Ax_\nu], \quad \nu = 0, 1, 2, \ldots, \qquad (4.15)$$

for the solution of (4.12). By Theorem 4.1, the iteration (4.15) converges
to the unique solution $x$ of $A_{\text{approx}}^{-1}[y - Ax] = 0$, provided that the spectral
radius of the iteration matrix $I - A_{\text{approx}}^{-1}A$ is less than one. Since the unique
solution $x$ of $Ax = y$ trivially satisfies $A_{\text{approx}}^{-1}[y - Ax] = 0$, we then have
convergence of the scheme (4.15) to the unique solution of (4.12). For a
rapid convergence it is desirable that the spectral radius be close to zero,
which will be the case if $A_{\text{approx}}^{-1}$ is a reasonable approximation to $A^{-1}$. For
a more complete introduction to the defect correction principle we refer to
[56].
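
A minimal sketch of the defect correction iteration (4.15) is given below. Everything specific in it is an assumption for illustration: the function name, the 2 x 2 system, and the matrix `M`, which plays the role of the approximate inverse and is simply the exact inverse of `A` rounded to one decimal place.

```python
def defect_correction(A, M, y, x0, steps):
    """Iterate x <- x + M (y - A x), where M approximates the inverse of A."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        # current defect r = y - A x
        r = [y[j] - sum(A[j][k] * x[k] for k in range(n)) for j in range(n)]
        # approximate correction delta = M r
        x = [x[j] + sum(M[j][k] * r[k] for k in range(n)) for j in range(n)]
    return x


A = [[4.0, 1.0], [2.0, 5.0]]
M = [[0.3, -0.1], [-0.1, 0.2]]  # exact inverse is (1/18) [[5, -1], [-2, 4]]
y = [6.0, 12.0]                 # exact solution is (1, 2)
x = defect_correction(A, M, y, [0.0, 0.0], 40)
```

For this choice the iteration matrix $I - MA$ is upper triangular with diagonal entries $0$ and $0.1$, so its spectral radius is $0.1$ and each step gains roughly one decimal digit.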
Here we wish to indicate briefly two applications. Firstly, the defect correction
principle (4.14) can be used to improve on the accuracy of an
approximate solution $x_0$, obtained for example by Gaussian elimination.
Then, in principle, the computation of $x_0$ corresponds to some approximation
$x_0 = A_{\text{approx}}^{-1}\, y$ obtained from an LR decomposition. This means that
evaluating $\delta_0 = A_{\text{approx}}^{-1}\, r_0$ is achieved by applying again the same elimination
algorithm to the defect correction equation. This way, the defect
correction principle provides a simple tool to improve on the accuracy of a
solution to a linear system obtained by elimination.
Secondly, we would like to illustrate the more systematic use of the defect
correction principle for the development of multigrid methods as a powerful
tool for the fast iterative solution of linear systems arising in the discretization
of differential and integral equations. For the sake of simplicity we will
confine ourselves to the case of two-grid iterations.
The basic idea of two-grid methods is to use the defect correction principle
with the approximate inverse $A_{\text{approx}}^{-1}$ for the matrix $A_{\text{fine}}$ of a large linear
system corresponding to a fine approximation grid given simply by the
exact inverse of the matrix $A_{\text{coarse}}$ of a smaller linear system, corresponding
to a coarse approximation grid. Of course, a number of mathematical
problems arise in the design of such methods concerning the appropriate
relation between the fine and coarse grid and the transfer between the two
grids. We will outline some ideas on the structure of two-grid methods by
again considering the simple model problem from Example 2.1 as a typical
case.
Recall that the solution vector $u^{(h)} \in \mathbb{R}^n$ of the linear system

(4.16)

with the $n \times n$ tridiagonal matrix

$$A^{(h)} = \frac{1}{h^2} \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$

corresponds to approximate values $u_j^{(h)} \approx u(jh)$, $j = 1, \ldots, n$, for the
solution $u$ of the boundary value problem (2.1)-(2.2) at the internal grid
points. Since we want to make use of two different grids in our analysis, we
indicate the dependence on the mesh width

$$h = \frac{1}{n+1}$$

in the matrix $A^{(h)}$ and the solution $u^{(h)}$. We assume that $n$ is odd because
later we want to choose the coarser grid by doubling the mesh width.
We start from the Jacobi iteration with relaxation

(4.17)

as introduced in Definition 4.8. From our analysis in Example 4.17 we
deduce that $A^{(h)}$ has the $n$ eigenvalues

$$\mu_j = \frac{4}{h^2} \sin^2\frac{\pi jh}{2}, \quad j = 1, \ldots, n, \qquad (4.18)$$

and associated eigenvectors $v_j^{(h)}$ with components

$$v_{j,k}^{(h)} = \sin(\pi jkh), \quad k = 1, \ldots, n, \quad j = 1, \ldots, n. \qquad (4.19)$$

Note that by Theorem 3.29, the eigenvectors of the Hermitian matrix $A^{(h)}$
form an orthogonal basis for $\mathbb{R}^n$ (see Problem 4.18). The $v_j^{(h)}$, $j = 1, \ldots, n$,
are also eigenvectors of the Jacobi matrix $I - [D^{(h)}]^{-1}A^{(h)}$, with eigenvalues

$$\lambda_j = \cos(\pi jh), \quad j = 1, \ldots, n.$$


From Theorem 4.9 we observe that $\omega = 1$ is the optimal choice for the
Jacobi iteration with relaxation. However, it will turn out that in the context
of two-grid methods the damped, or underrelaxed, Jacobi method with
$0 < \omega < 1$ is more important. This is due to the following observation.
Since the $v_j^{(h)}$, $j = 1, \ldots, n$, provide a basis for $\mathbb{R}^n$, we can represent the
difference between the exact solution $U^{(h)}$ and the $\nu$th iterate $U_\nu$ in the
form

$$U^{(h)} - U_\nu = \sum_{j=1}^{n} \alpha_{j,\nu}\, v_j^{(h)}.$$

From the fact that

$$\bigl\{I - \omega\,[D^{(h)}]^{-1} A^{(h)}\bigr\}\, v_j^{(h)} = \Bigl\{1 - 2\omega \sin^2\frac{\pi j h}{2}\Bigr\}\, v_j^{(h)}$$

we derive the recurrence relation

$$\alpha_{j,\nu+1} = \Bigl\{1 - 2\omega \sin^2\frac{\pi j h}{2}\Bigr\}\, \alpha_{j,\nu}, \quad j = 1, \ldots, n,$$

for the coefficients $\alpha_{j,\nu}$. In particular, if we choose $\omega = 0.5$, we have that

$$\alpha_{j,\nu+1} = \cos^2\frac{\pi j h}{2}\; \alpha_{j,\nu}, \quad j = 1, \ldots, n. \tag{4.20}$$

From this we observe that even though convergence of the iterations (4.17)
becomes slower when we decrease $\omega$, for $\omega = 0.5$ the convergence restricted
to the subspace

$$W_n := \operatorname{span}\bigl\{v_{\frac{n+1}{2}}, \ldots, v_n\bigr\}$$

of high frequencies is dramatically accelerated, since in this case from (4.20)
we have that

$$|\alpha_{j,\nu+1}| \le \frac{1}{2}\,|\alpha_{j,\nu}|, \quad j = \frac{n+1}{2}, \ldots, n.$$

This fact can be expressed by saying that the damped Jacobi iteration is a
smoothing iteration. In the sequel we will consider only the damping factor
$\omega = 0.5$.
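The smoothing property can be checked numerically. The following minimal sketch (our own illustration; the function name `damped_jacobi_factors` and the choice $n = 15$ are assumptions, not part of the text) evaluates the factors $1 - 2\omega \sin^2(\pi j h/2)$ from the recurrence relation and compares low and high frequencies for $\omega = 0.5$.

```python
import math

def damped_jacobi_factors(n, omega=0.5):
    """Return the factors 1 - 2*omega*sin^2(pi*j*h/2), j = 1..n, with h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return [1.0 - 2.0 * omega * math.sin(math.pi * j * h / 2.0) ** 2
            for j in range(1, n + 1)]

n = 15                                  # n is assumed to be odd
factors = damped_jacobi_factors(n)
low = factors[: (n - 1) // 2]           # j = 1, ..., (n-1)/2
high = factors[(n - 1) // 2:]           # j = (n+1)/2, ..., n

# Low frequencies are barely damped, high frequencies at least by one-half.
print("largest low-frequency factor: ", max(abs(f) for f in low))
print("largest high-frequency factor:", max(abs(f) for f in high))
```

The low-frequency factors stay close to one, which is why the damped Jacobi iteration alone converges slowly and needs to be complemented by a correction on a coarser grid.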
The slow convergence with respect to low frequencies will now be taken
care of by the defect correction principle through incorporating a so-called
coarse grid correction on the grid with mesh width $2h$. For this we need
to transfer vectors corresponding to the fine grid to vectors corresponding
to the coarse grid and vice versa. The transfer from the fine grid
to the coarse grid requires a restriction and corresponds to a mapping
$R^{(h)} : \mathbb{R}^n \to \mathbb{R}^{\frac{n-1}{2}}$. Note that we only need to consider this mapping for
the interior grid points. Instead of choosing the restriction $(R^{(h)} y)_k = y_{2k}$,
$k = 1, \ldots, \frac{n-1}{2}$, for $y \in \mathbb{R}^n$ it turns out to be advantageous to also
incorporate information contained in the odd nodal points of the fine grid by
using the restriction

$$(R^{(h)} y)_k = \frac{1}{4}\bigl(y_{2k-1} + 2 y_{2k} + y_{2k+1}\bigr), \quad k = 1, \ldots, \frac{n-1}{2},$$

as illustrated in Figure 4.2.

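[Figure: fine-grid points 1, ..., 7 above coarse-grid points 1, 2, 3]
FIGURE 4.2. Restriction operator of the two-grid method for $n = 7$

The corresponding matrix is

$$R^{(h)} = \frac{1}{4}\begin{pmatrix} 1 & 2 & 1 & & & & & \\ & & 1 & 2 & 1 & & & \\ & & & & \ddots & \ddots & \ddots & \\ & & & & & 1 & 2 & 1 \end{pmatrix}.$$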

With the aid of elementary trigonometric manipulations one can establish
the relation

$$R^{(h)} v_j^{(h)} = c_j^2\, v_j^{(2h)}, \qquad R^{(h)} v_{n+1-j}^{(h)} = -s_j^2\, v_j^{(2h)}, \qquad j = 1, \ldots, \frac{n-1}{2}, \tag{4.21}$$

between the eigenvectors (4.19) for the fine and the coarse grid (see Problem
4.19). Here we have set

$$c_j = \cos\frac{j\pi h}{2}, \qquad s_j = \sin\frac{j\pi h}{2}, \qquad j = 1, \ldots, \frac{n-1}{2}.$$

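The transfer from the coarse grid to the fine grid is called prolongation
and corresponds to a mapping $P^{(h)} : \mathbb{R}^{\frac{n-1}{2}} \to \mathbb{R}^n$. The simplest choice for
$P^{(h)}$ is given by the piecewise linear interpolation (see Chapter 8)

$$(P^{(h)} y)_{2k} = y_k, \quad k = 1, \ldots, \frac{n-1}{2},$$

$$(P^{(h)} y)_{2k-1} = \frac{1}{2}\bigl(y_{k-1} + y_k\bigr), \quad k = 1, \ldots, \frac{n+1}{2},$$

for $y \in \mathbb{R}^{\frac{n-1}{2}}$, where we set $y_0 = y_{\frac{n+1}{2}} = 0$, as illustrated in Figure 4.3.
The corresponding matrix is given by $P^{(h)} = 2\,[R^{(h)}]^T$. Either by direct
computation or from (4.21) and the fact that the matrices $P^{(h)}$ and $2R^{(h)}$ are
adjoint one can establish that (see Problem 4.19)

$$P^{(h)} v_j^{(2h)} = c_j^2\, v_j^{(h)} - s_j^2\, v_{n+1-j}^{(h)}, \quad j = 1, \ldots, \frac{n-1}{2}. \tag{4.22}$$

[Figure: coarse-grid points 1, 2, 3 below fine-grid points 1, ..., 7]
FIGURE 4.3. Prolongation operator of the two-grid method for $n = 7$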

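Now we are in a position to use the $n \times n$ matrix $P^{(h)} [A^{(2h)}]^{-1} R^{(h)}$ as
the coarse-grid correction. Computing $P^{(h)} [A^{(2h)}]^{-1} R^{(h)} y$ corresponds to
first restricting the vector $y \in \mathbb{R}^n$ to $R^{(h)} y \in \mathbb{R}^{\frac{n-1}{2}}$, then solving the
$\frac{n-1}{2} \times \frac{n-1}{2}$ system $A^{(2h)} z = R^{(h)} y$ by an elimination method, and finally
prolonging the solution $z \in \mathbb{R}^{\frac{n-1}{2}}$ to $P^{(h)} z \in \mathbb{R}^n$. Combining this coarse-grid
correction with $N$ steps of the damped Jacobi iteration in the sense of
(4.14) now yields one step of the two-grid iteration scheme

$$U_{\nu+1} = \mathcal{I}_N(U_\nu, F^{(h)}) + P^{(h)} [A^{(2h)}]^{-1} R^{(h)} \bigl\{F^{(h)} - A^{(h)}\, \mathcal{I}_N(U_\nu, F^{(h)})\bigr\},$$

where $\mathcal{I}_N(U_\nu, F^{(h)})$ denotes the result of $N$ steps of the damped Jacobi
iterations (4.17) with starting element $U_\nu$. Obviously, the iteration matrix
corresponding to this two-grid method is given by

$$T_N = \bigl\{I - P^{(h)} [A^{(2h)}]^{-1} R^{(h)} A^{(h)}\bigr\} \Bigl\{I - \frac{1}{2}\,[D^{(h)}]^{-1} A^{(h)}\Bigr\}^N. \tag{4.23}$$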
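The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under our own assumptions, not the book's program: the helper names (`apply_A`, `restrict`, `prolong`, `solve_tridiag`, `two_grid_step`), the test solution, and $n = 15$ are all hypothetical choices, and the coarse system is solved by the standard elimination for tridiagonal matrices.

```python
import math

def apply_A(u, h):
    """Multiply by A^(h) = h^-2 * tridiag(-1, 2, -1)."""
    n = len(u)
    return [(2.0 * u[k]
             - (u[k - 1] if k > 0 else 0.0)
             - (u[k + 1] if k < n - 1 else 0.0)) / h**2 for k in range(n)]

def damped_jacobi(u, f, h, omega=0.5):
    """One step of the relaxed Jacobi iteration; here D^(h) = (2/h^2) I."""
    Au = apply_A(u, h)
    return [u[k] + omega * (h**2 / 2.0) * (f[k] - Au[k]) for k in range(len(u))]

def restrict(y):
    """(R y)_k = (y_{2k-1} + 2 y_{2k} + y_{2k+1}) / 4 in 1-based indices."""
    m = (len(y) - 1) // 2
    return [(y[2*k - 2] + 2.0 * y[2*k - 1] + y[2*k]) / 4.0 for k in range(1, m + 1)]

def prolong(z):
    """Piecewise linear interpolation from the coarse grid to the fine grid."""
    m = len(z)
    zz = [0.0] + list(z) + [0.0]              # zero boundary values
    y = [0.0] * (2 * m + 1)
    for k in range(1, m + 1):
        y[2*k - 1] = zz[k]                    # even fine nodes: copy
    for k in range(1, m + 2):
        y[2*k - 2] = 0.5 * (zz[k - 1] + zz[k])  # odd fine nodes: average
    return y

def solve_tridiag(r, h):
    """Solve h^-2 * tridiag(-1, 2, -1) z = r by elimination."""
    m = len(r)
    diag = [2.0] * m
    rhs = [h**2 * v for v in r]
    for k in range(1, m):
        w = -1.0 / diag[k - 1]
        diag[k] += w                          # 2 - 1/diag[k-1]
        rhs[k] -= w * rhs[k - 1]
    z = [0.0] * m
    z[-1] = rhs[-1] / diag[-1]
    for k in range(m - 2, -1, -1):
        z[k] = (rhs[k] + z[k + 1]) / diag[k]
    return z

def two_grid_step(u, f, h, N=1):
    """N damped Jacobi steps followed by one coarse-grid correction."""
    for _ in range(N):
        u = damped_jacobi(u, f, h)
    Au = apply_A(u, h)
    defect = [f[k] - Au[k] for k in range(len(u))]
    corr = prolong(solve_tridiag(restrict(defect), 2.0 * h))
    return [u[k] + corr[k] for k in range(len(u))]

n, h = 15, 1.0 / 16
exact = [math.sin(3 * math.pi * (k + 1) * h) + 0.1 * ((7 * k) % 5) for k in range(n)]
f = apply_A(exact, h)                         # right-hand side with known solution
u, errors = [0.0] * n, []
for _ in range(6):
    u = two_grid_step(u, f, h)
    errors.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(u, exact))))
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
print(ratios)
```

In this experiment the Euclidean error should shrink by a factor of at most about one-half per step, independently of $n$.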

For an investigation of the convergence for our two-grid iteration scheme
we need to determine the spectral radius of $T_N$. For simplicity we confine
ourselves to the case where $N = 1$; i.e., one step of the damped Jacobi
iteration on the fine grid alternates with a coarse-grid correction by elimination
on the coarse grid. We set $T_1 = T$.

Theorem 4.18 For the spectral radius of $T$ we have that $\rho(T) = 0.5$; i.e.,
the two-grid iterations converge.

Proof. We note that from (4.18) and (4.19), with $h$ replaced by $2h$, we have
that

$$A^{(2h)} v_j^{(2h)} = \frac{1}{h^2}\,\sin^2(\pi j h)\; v_j^{(2h)} = \frac{4}{h^2}\, s_j^2 c_j^2\; v_j^{(2h)},$$

whence

$$[A^{(2h)}]^{-1} v_j^{(2h)} = \frac{h^2}{4 s_j^2 c_j^2}\; v_j^{(2h)}, \quad j = 1, \ldots, \frac{n-1}{2},$$

follows. From this, using (4.20)-(4.22) and $R^{(h)} v_{\frac{n+1}{2}}^{(h)} = 0$, it can be derived
that

$$T v_j^{(h)} = s_j^2 c_j^2 \bigl(v_j^{(h)} + v_{n+1-j}^{(h)}\bigr), \qquad T v_{n+1-j}^{(h)} = s_j^2 c_j^2 \bigl(v_j^{(h)} + v_{n+1-j}^{(h)}\bigr) \tag{4.24}$$

for $j = 1, \ldots, \frac{n-1}{2}$ and

$$T v_{\frac{n+1}{2}}^{(h)} = \frac{1}{2}\, v_{\frac{n+1}{2}}^{(h)}. \tag{4.25}$$

Since the matrix

$$Q = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$

has the eigenvalues 0 and 2, from (4.24) and (4.25) it can be seen that the
matrix $T$ has the eigenvalues

$$2 s_j^2 c_j^2 = \frac{1}{2}\,\sin^2(\pi j h), \quad j = 1, \ldots, \frac{n+1}{2},$$

and the eigenvalue zero of multiplicity $\frac{n-1}{2}$. This implies the assertion on
the spectral radius of $T$. □

Theorem 4.18 shows that the two-grid method is a very fast iteration. As
compared to the classical Jacobi and Gauss-Seidel methods and also to the
SOR method with optimal relaxation parameter, it decreases the spectral
radius from a value close to one to one-half, which causes a substantial
increase in the speed of convergence. However, for practical computations
it has the disadvantage that in each step the solution of a system with half
the number of unknowns is required.
This drawback of the two-grid method is remedied by the multigrid
method. Whereas for the two-grid method as described above only two
grids are used, the multigrid method uses $M > 2$ different grids with mesh
widths $h_\mu = 2^\mu h$, $\mu = 1, \ldots, M$, obtained from the mesh width $h$ on the
finest grid. The multigrid method is defined recursively. The method for
$M + 1$ grids performs one or several steps of the damped Jacobi iteration
on the finest grid with mesh width $h$ and uses as approximate inverse for
the defect correction one or several steps of the multigrid iteration on the
$M$ grids with mesh widths $2h, 4h, \ldots, 2^M h$. To be more explicit, the three-grid
method uses one or several steps of the two-grid method as the defect
correction of the damped Jacobi iteration on the finest grid; the four-grid
method uses one or several steps of the three-grid method as the defect
correction; and so on. A description of further details of the multigrid method,
in particular a proof that the computational cost of one step of a multigrid
iteration is proportional to the cost of the Jacobi iterations on the finest
grid provided that the coarsest grid is coarse enough, is beyond the aim of
this introduction. For a comprehensive study we refer to [8, 26, 29, 63].

Problems
4.1 Consider the solution of the linear system

by the Jacobi method. Give an estimate on the number of iterations needed to
ensure that $\|x_\nu - x\|_\infty \le 10^{-3}$ if the iteration is started with $x_0 = (0, 0, 0)^T$.

4.2 Write a computer program for the Jacobi method, the Gauss-Seidel method,
and the SOR method and test it for various examples.

4.3 Show that a matrix $A$ has spectral radius $\rho(A) < 1$ if and only if it satisfies
$\lim_{\nu \to \infty} A^\nu = 0$.

4.4 Prove that the Jacobi method converges for strictly column-diagonally dom-
inant matrices (compare (4.5)).

4.5 Show that an n x n matrix A is reducible if and only if there exists an n x n


permutation matrix P such that

),
where $A_{11}$ is a $k \times k$ matrix and $A_{22}$ is an $(n-k) \times (n-k)$ matrix with $1 \le k \le n-1$.

4.6 Show that the matrix A from Example 4.5 is irreducible and weakly row-
diagonally dominant.

4.7 Let

$$A = \begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix}.$$

Show that for $1 \le 2a < 2$ the Gauss-Seidel method is convergent and the Jacobi
method is not.

4.8 For the matrix

A=(=i ~~ -!)
show that the Jacobi method is convergent and the Gauss-Seidel method is not.
4.9 For the matrix

A = (-~ ~ =~)
show that the Gauss-Seidel method is convergent and the Jacobi method is not.
4.10 Show that the matrix

$$A = \begin{pmatrix} 2 & 0 & -1 & -1 \\ 0 & 2 & -1 & -1 \\ -1 & -1 & 2 & 0 \\ -1 & -1 & 0 & 2 \end{pmatrix}$$

is irreducible and that the Jacobi method is not convergent.
4.11 Show that the iteration matrix of the Gauss-Seidel method has eigenvalue
zero.
4.12 Consider the variant of the Gauss-Seidel iteration where the components
are iterated from the nth component backward to the first component. What is
the iteration matrix of this method? Obtain a symmetric method by alternating
one step of the forward Gauss-Seidel method and one step of the backward
Gauss-Seidel method. What is the iteration matrix of this method?
4.13 Show that the Jacobi iteration converges for a matrix $A$ if and only if it
converges for the transposed matrix $A^T$.
4.14 Show that the matrix A of Example 2.2 is irreducible, positive definite,
and weakly row-diagonally dominant.
4.15 Compute the eigenvalues of the Jacobi iteration matrix for the matrix A
of Example 2.2.
4.16 Let $A = (a_{jk})$ be a nonnegative $n \times n$ matrix, i.e., $a_{jk} \ge 0$, $j, k = 1, \ldots, n$,
and let $\rho(A) < 1$. Show that $I - A$ is nonsingular and $(I - A)^{-1}$ is nonnegative.
4.17 Give a counterexample to show that the Jacobi method, in general, does
not converge for positive definite matrices (see Theorem 4.12).
4.18 Show by direct computations that the eigenvectors given by (4.19) are
orthogonal.
4.19 Prove the relations (4.21), (4.22), (4.24), and (4.25).

4.20 Show that

$$\rho(T_N) \le \max_{0 \le t \le \frac{1}{2}} \bigl[t(1-t)^N + (1-t)\,t^N\bigr]$$

for the two-grid iteration matrix with $N$ damped Jacobi iterations at each step.
5
Ill-Conditioned Linear Systems

For problems in mathematical physics Hadamard [31] postulated three
requirements: A solution should exist, the solution should be unique, and the
solution should depend continuously on the data. The third postulate is
motivated by the fact that in general, in applications the data will be
measured quantities and therefore always contaminated by errors. A problem
satisfying all three requirements is called well-posed. Otherwise, it is called
ill-posed. If $A : X \to Y$ is a bounded linear operator mapping a normed
space $X$ into a normed space $Y$, then the equation $Ax = y$ is well-posed
if $A$ is bijective and the inverse operator $A^{-1} : Y \to X$ is bounded (see
Theorem 3.24). Since the inverse of a linear operator again is linear, in
the case of finite-dimensional spaces $X$ and $Y$, by Theorem 3.26 bijectivity
of $A$ implies boundedness of the inverse operator. Hence, in the sense of
Hadamard, nonsingular linear systems are well-posed.
However, since one wants to make sure that small errors in the data
of a linear system will cause only small errors in the solution, there is an
additional need for a measure of the degree of well-posedness, or stability.
Such a measure is provided through the notion of the condition number,
which we will introduce in this chapter. This will enable us to distinguish
between well-conditioned and ill-conditioned linear systems. For the latter,
small errors in the data may cause large errors in the solution, and therefore
their numerical solution requires special care.
Hence, we will continue the chapter with a brief discussion of the singular
value cutoff and the Tikhonov regularization as efficient means to deal with
ill-conditioned linear systems. Our analysis will be based on the singular
value decomposition and will include the introduction of the pseudo-inverse,
or Moore-Penrose inverse. For an extension of these ideas to ill-posed linear
operator equations in infinite-dimensional spaces we refer to [14, 22, 28, 37,
39, 43].

5.1 Condition Number


We begin with an example of an ill-conditioned linear system arising through
a simple least squares problem.
Example 5.1 We consider the best approximation of a given continuous
function $f : [0,1] \to \mathbb{R}$ by a polynomial

$$p(x) = \sum_{k=0}^{n} \alpha_k x^k$$

of degree $n$ in the least squares sense, i.e., with respect to the $L^2$ norm.
Using the monomials $x \mapsto x^k$, $k = 0, 1, \ldots, n$, as a basis of the subspace
$P_n \subset C[0,1]$ of polynomials of degree less than or equal to $n$ (see Theorem
8.2), from Corollary 3.53 and the integrals

$$\int_0^1 x^j x^k \, dx = \frac{1}{j+k+1}$$

it follows that the coefficients $\alpha_0, \ldots, \alpha_n$ of the best approximation are
uniquely determined by the normal equations

$$\sum_{k=0}^{n} \frac{1}{j+k+1}\, \alpha_k = \int_0^1 f(x)\, x^j \, dx, \quad j = 0, \ldots, n. \tag{5.1}$$

In the special case

$$f(x) = \frac{1}{1+x}$$

we have the right-hand sides

$$r_j = \int_0^1 \frac{x^j}{1+x}\, dx, \quad j = 0, \ldots, n.$$

In particular, $r_0 = \ln 2$, and from the geometric sum

$$\sum_{i=1}^{j} (-1)^{i-1} x^{i-1} = \frac{1 - (-1)^j x^j}{1+x}, \quad j = 1, \ldots, n,$$

we deduce that

$$r_j = (-1)^j \Bigl\{\ln 2 + \sum_{i=1}^{j} \frac{(-1)^i}{i}\Bigr\}, \quad j = 1, \ldots, n.$$
•=1
Therefore, the solution of (5.1) is of the form

$$\alpha_j = \beta_j \ln 2 + \gamma_j, \quad j = 0, \ldots, n,$$

with rational numbers $\beta_j$ and $\gamma_j$. Table 5.1 gives the exact solution of the
linear system (5.1) obtained by Gaussian elimination carried out in terms of
rational numbers to compute the coefficients $\beta_j$ and $\gamma_j$ and then inserting
$\ln 2$ with ten-decimal-digit accuracy. The results indicate convergence of
the coefficients to the coefficients $\alpha_k = (-1)^k$ of the Taylor series for $f$.

TABLE 5.1. Exact solution of the linear system (5.1)

n    α_0      α_1      α_2      α_3      α_4      α_5      α_6

1    0.9314  -0.4766
2    0.9860  -0.8040   0.3274
3    0.9972  -0.9389   0.6645  -0.2247
4    0.9994  -0.9830   0.8630  -0.5334   0.1543
5    0.9999  -0.9956   0.9512  -0.7688   0.4191  -0.1059
6    0.9999  -0.9989   0.9843  -0.9011   0.6672  -0.3242   0.0727

However, if we take as right-hand sides the values obtained for $r_j$ by using
$\ln 2$ with five-decimal-digit accuracy, then Gaussian elimination yields the
results of Table 5.2.

TABLE 5.2. Numerical solution of the linear system (5.1)

n    α_0     α_1      α_2       α_3       α_4       α_5      α_6

1    0.93   -0.47
2    0.98   -0.80     0.32
3    0.99   -0.95     0.70     -0.24
4    1.00   -1.16     1.63     -1.69      0.72
5    1.06   -2.74    12.68    -31.16     33.87    -13.25
6    1.39  -16.58   151.09   -584.79   1071.93   -926.75   304.49

Despite the fact that the changes in the right-hand sides are less than
0.000005, we obtain drastic changes in the solution. Therefore, qualitatively
we may say that our linear system provides an example of an ill-conditioned
system. The matrix of this example is known as the Hilbert matrix. □
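The experiment of Example 5.1 can be repeated with a short program. The following sketch is our own illustration and rests on assumptions: the helper names `solve_exact` and `normal_equations` are hypothetical, and all arithmetic is carried out with exact rational numbers so that the only error is the rounding of $\ln 2$ in the right-hand sides.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Gaussian elimination in exact rational arithmetic (no rounding errors)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # find a nonzero pivot
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            M[r] = [a - factor * p_ for a, p_ in zip(M[r], M[c])]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):                          # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def normal_equations(n, ln2):
    """Matrix and right-hand sides of (5.1) for f(x) = 1/(1+x), given a value for ln 2."""
    A = [[Fraction(1, j + k + 1) for k in range(n + 1)] for j in range(n + 1)]
    r = []
    for j in range(n + 1):
        partial = sum(Fraction((-1) ** i, i) for i in range(1, j + 1))
        r.append((-1) ** j * (ln2 + partial))
    return A, r

n = 6
A, r = normal_equations(n, Fraction("0.6931471806"))   # ln 2 to ten decimal digits
coeffs_accurate = [float(c) for c in solve_exact(A, r)]
A, r = normal_equations(n, Fraction("0.69315"))        # ln 2 to five decimal digits
coeffs_rounded = [float(c) for c in solve_exact(A, r)]

print(["%.4f" % c for c in coeffs_accurate])
print(["%.2f" % c for c in coeffs_rounded])
```

Since the elimination itself is exact here, the large coefficients in the second row of output are caused entirely by the small change in the data, not by rounding errors during the elimination.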

For a quantitative analysis of the phenomenon illustrated by Example
5.1 we introduce the concept of the condition number.

Definition 5.2 Let $X$ and $Y$ be normed spaces and let $A : X \to Y$ be a
bounded linear operator with a bounded inverse $A^{-1} : Y \to X$. Then

$$\operatorname{cond}(A) := \|A\|\, \|A^{-1}\|$$

is called the condition number of $A$.

Clearly, $\operatorname{cond}(A)$ depends on the chosen norm. Because of (see Remark
3.25)

$$1 = \|I\| = \|A A^{-1}\| \le \|A\|\, \|A^{-1}\|$$

we always have $\operatorname{cond}(A) \ge 1$. Definition 5.2, in particular, includes the
condition number of a nonsingular $n \times n$ matrix $A$. Here, in the case where
both the domain and range are given the $\ell_p$ norm for $p = 1, 2, \infty$ we will
write $\operatorname{cond}_p(A)$.

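Theorem 5.3 Let $X$ and $Y$ be Banach spaces, let $A : X \to Y$ be a bounded
linear operator with a bounded inverse $A^{-1} : Y \to X$ and let $A^\delta : X \to Y$
be a bounded linear operator such that $\|A^{-1}\|\, \|A^\delta - A\| < 1$. Assume that
$x$ and $x^\delta$ are solutions of the equations

$$Ax = y \tag{5.2}$$

and

$$A^\delta x^\delta = y^\delta, \tag{5.3}$$

respectively. Then

$$\frac{\|x^\delta - x\|}{\|x\|} \le \frac{\operatorname{cond}(A)}{1 - \operatorname{cond}(A)\, \dfrac{\|A^\delta - A\|}{\|A\|}} \left\{ \frac{\|y^\delta - y\|}{\|y\|} + \frac{\|A^\delta - A\|}{\|A\|} \right\}.$$

Proof. Writing $A^\delta = A\,[I + A^{-1}(A^\delta - A)]$, by Theorem 3.48 we observe
that the inverse operator $[A^\delta]^{-1} = [I + A^{-1}(A^\delta - A)]^{-1} A^{-1}$ exists and is
bounded by

$$\|[A^\delta]^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\, \|A^\delta - A\|}. \tag{5.4}$$

From (5.2) and (5.3) we find that

$$A^\delta (x^\delta - x) = y^\delta - y - (A^\delta - A)\,x,$$

whence

$$x^\delta - x = [A^\delta]^{-1} \bigl\{ y^\delta - y - (A^\delta - A)\,x \bigr\}$$

follows. Now we can estimate

$$\|x^\delta - x\| \le \|[A^\delta]^{-1}\| \bigl\{ \|y^\delta - y\| + \|A^\delta - A\|\, \|x\| \bigr\}$$

and insert (5.4) to obtain

$$\frac{\|x^\delta - x\|}{\|x\|} \le \frac{\operatorname{cond}(A)}{1 - \|A^{-1}\|\, \|A^\delta - A\|} \left\{ \frac{\|y^\delta - y\|}{\|A\|\, \|x\|} + \frac{\|A^\delta - A\|}{\|A\|} \right\}.$$

From this the assertion follows with the aid of $\|A\|\, \|x\| \ge \|y\|$. □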

Theorem 5.3 shows that the condition number may serve as a measure of
stability for linear operator equations and, in particular, for linear systems.
A linear system with a small condition number is stable, whereas a large
condition number indicates instability. We call a linear system with a small
condition number well-conditioned. Otherwise, it is called ill-conditioned.
By Theorem 3.31, the condition number of a Hermitian matrix $A$ in the
Euclidean norm is given by

$$\operatorname{cond}_2(A) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ denote the eigenvalues of $A$ with largest and smallest
modulus, respectively. Table 5.3 is obtained by employing the QR algorithm
(see Section 7.4) for the computation of matrix eigenvalues. It illustrates
quantitatively the degree of instability, i.e., the ill-conditionedness of the
linear system from Example 5.1.

TABLE 5.3. Condition number for the linear system (5.1)

n        2            3            4            5            6
λ_max    1.27         1.41         1.50         1.57         1.62
λ_min    6.57·10^-2   2.69·10^-3   9.67·10^-5   3.29·10^-6   1.08·10^-7
cond_2   19.3         5.24·10^2    1.55·10^4    4.77·10^5    1.50·10^7
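The rapid growth of the condition number can also be observed without an eigenvalue solver. The following sketch is our own illustration under stated assumptions: it uses the maximum norm instead of the Euclidean norm of Table 5.3 (so the numbers differ from the table, although the growth is comparable), it inverts the Hilbert matrix exactly in rational arithmetic, and the helper names `hilbert`, `inverse_exact`, `norm_inf`, and `cond_inf` are hypothetical.

```python
from fractions import Fraction

def hilbert(n):
    """The n x n Hilbert matrix with entries 1/(j+k+1), j, k = 0, ..., n-1."""
    return [[Fraction(1, j + k + 1) for k in range(n)] for j in range(n)]

def inverse_exact(A):
    """Gauss-Jordan inversion in exact rational arithmetic."""
    n = len(A)
    M = [row[:] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # find a nonzero pivot
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [a / pivot for a in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                factor = M[r][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]

def norm_inf(A):
    """Maximum norm of a matrix: the largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def cond_inf(A):
    """Condition number with respect to the maximum norm, see Definition 5.2."""
    return norm_inf(A) * norm_inf(inverse_exact(A))

for n in range(2, 7):
    print(n, float(cond_inf(hilbert(n))))
```

For instance, for $n = 2$ the inverse is the integer matrix with rows $(4, -6)$ and $(-6, 12)$, so that the maximum norm condition number is $\frac{3}{2} \cdot 18 = 27$.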

5.2 Singular Value Decomposition


In the sequel we wish to introduce some of the basic concepts for the
approximate solution of ill-conditioned linear systems. Our approach will
be based on the singular value decomposition of a matrix $A$, which need
not be a square matrix.
For each $m \times n$ matrix $A$, representing an operator $A : \mathbb{C}^n \to \mathbb{C}^m$, the
$n \times n$ matrix $A^*A$ is Hermitian and positive semidefinite (see Problem 5.9).
Therefore, the eigenvalues of $A^*A$ are real and nonnegative (see Theorem
3.29). The nonnegative square roots of these eigenvalues are called the
singular values of $A$.
For the remainder of this chapter, by $(\cdot\,,\cdot)$ we denote the Euclidean
scalar product in $\mathbb{C}^n$. For an $m \times n$ matrix $A$ of rank $r$, the nullspace
$N(A) = \{x \in \mathbb{C}^n : Ax = 0\}$ has dimension $\dim N(A) = n - r$. We note
that $A^*Au = 0$ implies that

$$\|Au\|_2^2 = (Au, Au) = (u, A^*Au) = 0;$$

i.e., the nullspaces of $A$ and $A^*A$ coincide. Hence $\dim N(A^*A) = n - r$, and
therefore $A$ has exactly $r$ positive singular values $\mu$ (counted according to
their geometric multiplicity, i.e., according to the dimension of the nullspace
of $\mu^2 I - A^*A$).
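Theorem 5.4 Let $A$ be an $m \times n$ matrix of rank $r$. Then there exist
nonnegative numbers

$$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > \mu_{r+1} = \cdots = \mu_n = 0$$

and orthonormal vectors $u_1, \ldots, u_n \in \mathbb{C}^n$ and $v_1, \ldots, v_m \in \mathbb{C}^m$ such that

$$\begin{aligned} Au_j &= \mu_j v_j, \quad A^*v_j = \mu_j u_j, & j &= 1, \ldots, r, \\ Au_j &= 0, & j &= r+1, \ldots, n, \\ A^*v_j &= 0, & j &= r+1, \ldots, m. \end{aligned} \tag{5.5}$$

For each $x \in \mathbb{C}^n$ we have the singular value decomposition

$$Ax = \sum_{j=1}^{r} \mu_j (x, u_j)\, v_j. \tag{5.6}$$

Each system $(\mu_j, u_j, v_j)$ with these properties is called a singular system of
the matrix $A$.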
Proof. The Hermitian and positive semidefinite matrix $A^*A$ of rank $r$ has
$n$ orthonormal eigenvectors $u_1, \ldots, u_n$ with nonnegative eigenvalues

$$A^*Au_j = \mu_j^2 u_j, \quad j = 1, \ldots, n, \tag{5.7}$$

which we may assume to be ordered according to $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0$
and $\mu_{r+1} = \cdots = \mu_n = 0$. We define

$$v_j := \frac{1}{\mu_j}\, Au_j, \quad j = 1, \ldots, r.$$

Then, using (5.7) we have

$$(v_j, v_k) = \frac{1}{\mu_j \mu_k}\,(Au_j, Au_k) = \frac{1}{\mu_j \mu_k}\,(u_j, A^*Au_k) = \delta_{jk}, \quad j, k = 1, \ldots, r,$$

where $\delta_{jk} = 1$ for $k = j$, and $\delta_{jk} = 0$ for $k \ne j$. Further, we compute that
$A^*v_j = \mu_j u_j$, $j = 1, \ldots, r$, and hence the first line of (5.5) is proven. The
second line of (5.5) is a consequence of $N(A) = N(A^*A)$.
If $r < m$, by the Gram-Schmidt orthogonalization procedure from Theorem
3.18 we can extend $v_1, \ldots, v_r$ to an orthonormal basis $v_1, \ldots, v_m$ of
$\mathbb{C}^m$. Since $A^*$ has rank $r$, we have $\dim N(A^*) = m - r$. From this we can
conclude the third line of (5.5).
Since the $u_1, \ldots, u_n$ form an orthonormal basis of $\mathbb{C}^n$, we can represent

$$x = \sum_{j=1}^{n} (x, u_j)\, u_j,$$

and (5.6) follows by applying $A$ and observing (5.5). □
Clearly, we can rewrite the equations (5.5) in the form

$$A = V D U^*, \tag{5.8}$$

where $U = (u_1, \ldots, u_n)$ and $V = (v_1, \ldots, v_m)$ are unitary $n \times n$ and $m \times m$
matrices, respectively, and where $D$ is an $m \times n$ diagonal matrix with entries
$d_{jj} = \mu_j$ for $j = 1, \ldots, r$ and $d_{jk} = 0$ otherwise.
Theorem 5.5 Let $A$ be an $m \times n$ matrix of rank $r$ with singular system
$(\mu_j, u_j, v_j)$. The linear system

$$Ax = y \tag{5.9}$$

is solvable if and only if

$$(y, z) = 0 \tag{5.10}$$

for all $z \in \mathbb{C}^m$ with $A^*z = 0$. In this case a solution of (5.9) is given by

$$x_0 = \sum_{j=1}^{r} \frac{1}{\mu_j}\,(y, v_j)\, u_j. \tag{5.11}$$

Proof. Let $x$ be a solution of (5.9) and let $A^*z = 0$. Then

$$(y, z) = (Ax, z) = (x, A^*z) = 0.$$

This implies the necessity of condition (5.10) for the solvability of (5.9).
Conversely, assume that (5.10) is satisfied. In terms of the orthonormal
basis $v_1, \ldots, v_m$ of $\mathbb{C}^m$ condition (5.10) implies that

$$y = \sum_{j=1}^{r} (y, v_j)\, v_j, \tag{5.12}$$

since $A^*v_j = 0$ for $j = r+1, \ldots, m$. For the vector $x_0$ defined by (5.11) we
have that

$$Ax_0 = \sum_{j=1}^{r} (y, v_j)\, v_j.$$

In view of (5.12) this implies that $Ax_0 = y$, and the proof is complete. □
Since $N(A) = \operatorname{span}\{u_{r+1}, \ldots, u_n\}$, the vector $x_0$ defined by (5.11) has
the property

$$(x_0, x) = 0$$

for all $x \in N(A)$. In the case where equation (5.9) has more than one
solution, the general solution is obtained from (5.11) by adding an arbitrary
solution $x$ of the homogeneous equation $Ax = 0$. Then from

$$\|x_0 + x\|_2^2 = \|x_0\|_2^2 + 2 \operatorname{Re}(x_0, x) + \|x\|_2^2 = \|x_0\|_2^2 + \|x\|_2^2$$

we observe that (5.11) represents the uniquely determined solution of (5.9)
with minimal Euclidean norm.
In the case where equation (5.9) has no solution, we represent

$$y = \sum_{j=1}^{m} (y, v_j)\, v_j$$

in terms of the orthonormal basis $v_1, \ldots, v_m$. Let $x_0$ be given by (5.11) and
let $x \in \mathbb{C}^n$ be arbitrary. Then

$$(Ax - Ax_0, Ax_0 - y) = 0,$$

since $Ax - Ax_0 \in \operatorname{span}\{v_1, \ldots, v_r\}$ and $Ax_0 - y \in \operatorname{span}\{v_{r+1}, \ldots, v_m\}$. This
implies

$$\|Ax - y\|_2^2 = \|Ax - Ax_0\|_2^2 + \|Ax_0 - y\|_2^2,$$

whence (5.11) represents a least squares solution of (5.9) (see Example 2.4).
Again, it can be shown that (5.11) is the uniquely determined least squares
solution of (5.9) with minimal Euclidean norm (see Problem 5.11).
Hence, (5.11) defines a linear operator $A^\dagger : \mathbb{C}^m \to \mathbb{C}^n$ by

$$A^\dagger y := \sum_{j=1}^{r} \frac{1}{\mu_j}\,(y, v_j)\, u_j, \quad y \in \mathbb{C}^m, \tag{5.13}$$

which of course also allows a representation by an $n \times m$ matrix. Due to
the properties of $A^\dagger y$ as discussed above, this operator or matrix is known
as the pseudo-inverse or Moore-Penrose inverse of $A$ (see [7]). It was first
introduced by Moore in 1920 and independently rediscovered by Penrose
in 1955. For an alternative introduction of $A^\dagger$ see Problem 5.12.
By Theorem 3.31 the condition number of a nonsingular matrix with
respect to the Euclidean norm is given by the quotient of the largest and
smallest singular value. Theorem 5.5 demonstrates the influence of small
singular values on the condition of the matrix $A$. If for some $\delta \in \mathbb{C}$ we
perturb the right-hand side by setting $y^\delta = y + \delta v_j$, we obtain a perturbed
solution $x^\delta = x + \delta u_j / \mu_j$. Hence, the ratio $\|x^\delta - x\|_2 / \|y^\delta - y\|_2 = 1/\mu_j$
becomes large if $A$ possesses small singular values.
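As a small worked example of this amplification (our own illustration, with the matrix and the data chosen only for simplicity), let $0 < \varepsilon < 1$ and consider

$$A = \begin{pmatrix} 1 & 0 \\ 0 & \varepsilon \end{pmatrix}, \qquad \mu_1 = 1, \quad \mu_2 = \varepsilon, \quad u_j = v_j = e_j, \quad j = 1, 2,$$

where $e_1, e_2$ denote the unit vectors. For $y = (1, \varepsilon)^T$ the solution of $Ax = y$ is $x = (1, 1)^T$. Perturbing the data in the direction of $v_2$ gives

$$y^\delta = y + \delta e_2, \qquad x^\delta = x + \frac{\delta}{\varepsilon}\, e_2, \qquad \frac{\|x^\delta - x\|_2}{\|y^\delta - y\|_2} = \frac{1}{\varepsilon} = \operatorname{cond}_2(A),$$

so that, for instance, with $\varepsilon = 10^{-6}$ a data error of size $10^{-6}$ already changes the solution by an amount of size one.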
This observation suggests stabilizing an ill-conditioned linear system by
damping or filtering out the influence of the factor $1/\mu_j$ in the solution
formula (5.11). In the so-called spectral cutoff, the terms in (5.11)
corresponding to small singular values are simply neglected. Of course, this
requires some strategy on how to determine the number of terms being
summed up in (5.11). A very effective strategy is provided by the following
discrepancy principle. If the right-hand side $y$ of a linear system is known
only within an error level $\delta$ then it is quite natural to require $Ax = y$ to be
satisfied only up to the same accuracy $\delta$, since it does not make much sense
to try to satisfy the linear system more accurately than the right-hand side
is known. To describe the discrepancy principle more precisely, given an
erroneous right-hand side $y^\delta$ with known error level $\|y^\delta - y\|_2 \le \delta$, in the
spectral cutoff the solution $x = A^\dagger y$ of $Ax = y$ is approximated by

$$x_p := \sum_{j=1}^{p} \frac{1}{\mu_j}\,(y^\delta, v_j)\, u_j \tag{5.14}$$

for some $0 \le p \le r$. For the following theorem we have to assume that
$Ax = y$ is solvable.
Theorem 5.6 Let $A$ be an $m \times n$ matrix with singular system $(\mu_j, u_j, v_j)$
and let $y \in A(\mathbb{C}^n)$, $y^\delta \in \mathbb{C}^m$ satisfy

$$\|y^\delta - y\|_2 \le \delta \le \|y^\delta\|_2$$

for $\delta > 0$. Then there exists a smallest integer $p = p(\delta)$ such that

$$\|Ax_p - y^\delta\|_2 \le \delta. \tag{5.15}$$

This discrepancy principle for the spectral cutoff is regular in the sense that
if the error level $\delta$ tends to zero, then

$$x_p \to A^\dagger y, \quad \delta \to 0. \tag{5.16}$$

Proof. Consider the function $F : \{0, 1, \ldots, r\} \to \mathbb{R}$ defined by

$$F(p) := \|Ax_p - y^\delta\|_2^2 - \delta^2.$$

In terms of the singular system, we can write

$$F(p) = \sum_{j=p+1}^{m} |(y^\delta, v_j)|^2 - \delta^2. \tag{5.17}$$

Hence, $F$ is monotonically nonincreasing with $F(0) = \|y^\delta\|_2^2 - \delta^2 \ge 0$ and
$F(r) = -\delta^2 < 0$ if the rank $r$ of $A$ is equal to $m$. If $r < m$, then using
$(y, v_j) = 0$, $j = r+1, \ldots, m$ (see the proof of Theorem 5.5), we have

$$F(r) = \sum_{j=r+1}^{m} |(y^\delta - y, v_j)|^2 - \delta^2 \le \|y^\delta - y\|_2^2 - \delta^2 \le 0.$$
Therefore, there exists a smallest integer $p = p(\delta)$ such that $F(p) \le 0$. Note
that $p \le r$. In actual computations, this stopping parameter $p$ is determined
by terminating the sum (5.14) when the right-hand side of (5.17) becomes
smaller than or equal to zero for the first time.
In order to show the convergence (5.16), we note that $\|Ax_p - y^\delta\|_2 \le \delta$
implies

$$\|Ax_p - y\|_2 \le \|Ax_p - y^\delta\|_2 + \|y^\delta - y\|_2 \le 2\delta \to 0, \quad \delta \to 0,$$

i.e., $Ax_p \to y$, $\delta \to 0$. From this, since $A^\dagger A v = v$ for all $v \in \operatorname{span}\{u_1, \ldots, u_r\}$,
we finally can conclude that $x_p \to A^\dagger y$, $\delta \to 0$. □
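The stopping rule described in the proof can be sketched as follows. This is a minimal illustration under a simplifying assumption of ours: the matrix is diagonal with positive decreasing entries, so that a singular system is given by $\mu_j$ and the unit vectors and no eigenvalue computation is needed. The function name `spectral_cutoff` and the test data are hypothetical.

```python
import math

def spectral_cutoff(mu, y_delta, delta):
    """Spectral cutoff (5.14) with the discrepancy principle for A = diag(mu).

    For a diagonal matrix with positive decreasing entries, the scalar
    product (y, v_j) is simply the jth component of y.
    Returns the stopping index p and the approximation x_p.
    """
    m = len(mu)
    p = 0
    # Increase p until F(p) = sum_{j>p} |(y_delta, v_j)|^2 - delta^2 <= 0.
    while p < m and sum(c * c for c in y_delta[p:]) > delta**2:
        p += 1
    x = [y_delta[j] / mu[j] if j < p else 0.0 for j in range(m)]
    return p, x

mu = [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
x_true = [1.0] * 6
delta = 1e-3
noise = delta / math.sqrt(6)                      # data error of norm delta
y_delta = [m_ * x + (-1) ** j * noise
           for j, (m_, x) in enumerate(zip(mu, x_true))]

p, x_p = spectral_cutoff(mu, y_delta, delta)
x_naive = [c / m_ for c, m_ in zip(y_delta, mu)]  # all terms of (5.11)

print("stopping index p =", p)
print("cutoff solution: ", x_p)
print("naive solution:  ", x_naive)
```

In this example the cutoff discards the components belonging to the three smallest singular values; they are lost, but the remaining components are not contaminated by the amplified data error that dominates the naive solution.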

The spectral cutoff method requires the full solution of the eigenvalue
problem for the matrix $A^*A$, which we will describe in Chapter 7. As an
alternative, in the following section we shall describe the Tikhonov
regularization, which can be performed without explicitly knowing the singular
value decomposition.

5.3 Tikhonov Regularization


Tikhonov regularization as introduced independently by Phillips in 1962
and Tikhonov in 1963 is obtained from (5.11) by multiplying $1/\mu_j$ by the
damping factor

$$\frac{\mu_j^2}{\alpha + \mu_j^2},$$

where $\alpha$ is some positive regularization parameter.

Theorem 5.7 Let $A$ be an $m \times n$ matrix of rank $r$ with singular system
$(\mu_j, u_j, v_j)$ and let $\alpha > 0$. Then for each $y \in \mathbb{C}^m$ the linear system

$$\alpha x_\alpha + A^*A x_\alpha = A^*y \tag{5.18}$$

is uniquely solvable, and the solution is given by

$$x_\alpha = \sum_{j=1}^{r} \frac{\mu_j}{\alpha + \mu_j^2}\,(y, v_j)\, u_j. \tag{5.19}$$

Proof. For $\alpha > 0$ the matrix $\alpha I + A^*A$ is positive definite and therefore
nonsingular. Since

$$\alpha u_j + A^*A u_j = (\alpha + \mu_j^2)\, u_j,$$

a singular system for the matrix $\alpha I + A^*A$ is given by $(\alpha + \mu_j^2, u_j, u_j)$,
$j = 1, \ldots, n$. Now the assertion follows from Theorem 5.5 with the aid of
$(A^*y, u_j) = (y, Au_j)$ and using (5.5). □
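Since (5.18) is an ordinary linear system, Tikhonov regularization can be tried out without any singular value decomposition. The sketch below is our own illustration under stated assumptions: it applies (5.18) to the system of Example 5.1 with $\ln 2$ rounded to five decimal digits and the parameter $\alpha = 10^{-10}$, works in exact rational arithmetic, and uses the hypothetical helper names `solve_exact`, `normal_equations`, and `tikhonov`.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Gaussian elimination in exact rational arithmetic."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)   # find a nonzero pivot
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            M[r] = [a - factor * p_ for a, p_ in zip(M[r], M[c])]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):                          # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def normal_equations(n, ln2):
    """Matrix and right-hand sides of (5.1) for f(x) = 1/(1+x)."""
    A = [[Fraction(1, j + k + 1) for k in range(n + 1)] for j in range(n + 1)]
    r = [(-1) ** j * (ln2 + sum(Fraction((-1) ** i, i) for i in range(1, j + 1)))
         for j in range(n + 1)]
    return A, r

def tikhonov(A, y, alpha):
    """Solve (alpha*I + A^T A) x = A^T y; for a real matrix A* is the transpose."""
    m, n = len(A), len(A[0])
    At = [list(col) for col in zip(*A)]
    M = [[sum(At[i][k] * A[k][j] for k in range(m)) + (alpha if i == j else 0)
          for j in range(n)] for i in range(n)]
    rhs = [sum(At[i][k] * y[k] for k in range(m)) for i in range(n)]
    return solve_exact(M, rhs)

A, y_delta = normal_equations(6, Fraction("0.69315"))
plain = [float(c) for c in solve_exact(A, y_delta)]
regularized = [float(c) for c in tikhonov(A, y_delta, Fraction(1, 10**10))]

print(["%.2f" % c for c in plain])
print(["%.4f" % c for c in regularized])
```

The unregularized coefficients are of the order of a thousand, whereas the regularized coefficients remain of moderate size.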
Corollary 5.8 Under the assumptions of Theorem 5.7 we have convergence:

$$\lim_{\alpha \to 0}\, (\alpha I + A^*A)^{-1} A^* y = A^\dagger y.$$

Proof. This is obvious from (5.13) and (5.19). □


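Before we proceed with a discussion on how to choose the regularization
parameter $\alpha$, we give an interpretation of Tikhonov regularization as a
penalized least squares method.
Theorem 5.9 Let $A$ be an $m \times n$ matrix and let $\alpha > 0$. Then for each
$y \in \mathbb{C}^m$ there exists a unique $x_\alpha \in \mathbb{C}^n$ such that

$$\|Ax_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2 = \inf_{x \in \mathbb{C}^n} \bigl\{ \|Ax - y\|_2^2 + \alpha \|x\|_2^2 \bigr\}. \tag{5.20}$$

The minimizing vector $x_\alpha$ is given by the unique solution of the linear
system (5.18).

Proof. (Compare to the proof of Theorem 3.51.) We first note the relation

$$\begin{aligned} \|Ax - y\|_2^2 + \alpha \|x\|_2^2 ={}& \|Ax_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2 \\ &+ 2 \operatorname{Re}\,(x - x_\alpha,\; \alpha x_\alpha + A^*A x_\alpha - A^*y) \\ &+ \|A(x - x_\alpha)\|_2^2 + \alpha \|x - x_\alpha\|_2^2, \end{aligned} \tag{5.21}$$

which is valid for all $x, x_\alpha \in \mathbb{C}^n$. From this it is obvious that the solution
$x_\alpha$ of (5.18) satisfies (5.20).
Conversely, let $x_\alpha$ be a solution of (5.20) and assume that

$$\alpha x_\alpha + A^*A x_\alpha \ne A^*y.$$

Then, setting $z := \alpha x_\alpha + A^*A x_\alpha - A^*y$, for $x := x_\alpha - \varepsilon z$ with $\varepsilon \in \mathbb{R}$ from
(5.21) we have

$$\|Ax - y\|_2^2 + \alpha \|x\|_2^2 = \|Ax_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2 - 2\varepsilon a + \varepsilon^2 b,$$

where

$$a := \|z\|_2^2 \quad \text{and} \quad b := \|Az\|_2^2 + \alpha \|z\|_2^2$$

are both positive. By choosing $\varepsilon = a/b$ we obtain

$$\|Ax - y\|_2^2 + \alpha \|x\|_2^2 < \|Ax_\alpha - y\|_2^2 + \alpha \|x_\alpha\|_2^2,$$

which contradicts (5.20). □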

The interpretation of Tikhonov regularization through the above Theorem
5.9 indicates that it keeps the residual $\|Ax_\alpha - y\|_2^2$ small and stabilizes
by preventing $x_\alpha$ from becoming large through the penalty term $\alpha \|x_\alpha\|_2^2$.
From the proof of Theorem 5.7 we know that the eigenvalues of the
Hermitian matrix $\alpha I + A^*A$ are given by $\alpha + \mu_j^2$, $j = 1, \ldots, n$. Hence by
Theorem 3.27 we have that

$$\operatorname{cond}_2(\alpha I + A^*A) = \frac{\alpha + \mu_1^2}{\alpha + \mu_n^2} \le \frac{2\mu_1^2}{\alpha}, \quad 0 < \alpha \le \mu_1^2. \tag{5.22}$$

Therefore stability of the linear system (5.18) requires the regularization
parameter $\alpha$ to be fairly large. On the other hand, in order to keep the
system (5.18) reasonably close to the original system $Ax = y$, we expect
that $\alpha$ needs to be small. This observation is made more precise through the
following considerations on the error occurring in Tikhonov regularization.

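[Figure: the errors $E_{\text{total}}$, $E_{\text{approx}}$, and $E_{\text{data}}$ plotted against the regularization parameter $\alpha$]
FIGURE 5.1. Total error for Tikhonov regularization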

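Given an erroneous right-hand side $y^\delta$ with error level $\|y^\delta - y\|_2 \le \delta$, the
Tikhonov regularization approximates the solution $x = A^\dagger y$ of $Ax = y$ by
the solution $x_\alpha$ of the regularized linear system

$$\alpha x_\alpha + A^*A x_\alpha = A^* y^\delta. \tag{5.23}$$

Then, for the total error, writing

$$x_\alpha - x = (\alpha I + A^*A)^{-1} A^* (y^\delta - y) + (\alpha I + A^*A)^{-1} A^* y - A^\dagger y,$$

by the triangle inequality we have the estimate

$$\|x_\alpha - x\|_2 \le \|(\alpha I + A^*A)^{-1} A^*\|_2\, \delta + \|(\alpha I + A^*A)^{-1} A^* y - A^\dagger y\|_2.$$

This decomposition shows that the total error consists of two parts:

$$E_{\text{total}} \le E_{\text{data}} + E_{\text{approx}}.$$

The first term, with the aid of Theorem 3.31, can be estimated by

$$E_{\text{data}} = \|(\alpha I + A^*A)^{-1} A^*\|_2\, \delta \ge \frac{\mu_r}{\alpha + \mu_r^2}\, \delta.$$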
It reflects the influence of the incorrect data and, for fixed $\delta$, becomes large
as $\alpha \to 0$, if the smallest positive singular value $\mu_r$ is close to zero (see also
Problem 5.16). The second term,

$$E_{\text{approx}} = \|(\alpha I + A^*A)^{-1} A^* y - A^\dagger y\|_2,$$

describes the approximation error due to the replacement of $Ax = y$ by the
regularized equation (5.23), and by Corollary 5.8, it goes to zero as $\alpha \to 0$.
This error behavior is illustrated in Figure 5.1.
On one hand, in view of (5.22) the stability of the system requires a large
regularization parameter $\alpha$ to keep $E_{\text{data}}$ small, i.e., to keep the influence
of the data error $\|y^\delta - y\|_2$ small. On the other hand, keeping $E_{\text{approx}}$ small
asks for a small parameter $\alpha$.
Obviously, the choice of the parameter $\alpha$ has to be made through a
compromise between accuracy and stability. An efficient strategy to achieve
this is again provided by the discrepancy principle. In the following theorem
we need to assume that $Ax = y$ is solvable.
Theorem 5.10 Let $A$ be an $m \times n$ matrix and let $y \in A(\mathbb{C}^n)$, $y^\delta \in \mathbb{C}^m$
satisfy

$$\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$$

for $\delta > 0$. Then there exists a unique $\alpha = \alpha(\delta) > 0$ such that the unique
solution $x_\alpha$ of (5.23) satisfies

$$\|Ax_\alpha - y^\delta\|_2 = \delta. \tag{5.24}$$

This discrepancy principle for Tikhonov regularization is regular in the
sense that if the error level $\delta$ tends to zero, then

$$x_\alpha \to A^\dagger y, \quad \delta \to 0. \tag{5.25}$$

Proof. We have to show that the function $F : (0, \infty) \to \mathbb{R}$ defined by

$$F(\alpha) := \|Ax_\alpha - y^\delta\|_2^2 - \delta^2$$

has a unique zero. In terms of a singular system, from the representation
(5.19) we find that

$$F(\alpha) = \sum_{j=1}^{m} \frac{\alpha^2}{(\alpha + \mu_j^2)^2}\, |(y^\delta, v_j)|^2 - \delta^2.$$

Therefore, $F$ is continuous and strictly monotonically increasing with the
limits $F(\alpha) \to -\delta^2 < 0$, $\alpha \to 0$, and $F(\alpha) \to \|y^\delta\|_2^2 - \delta^2 > 0$, $\alpha \to \infty$.
Hence, $F$ has exactly one zero $\alpha = \alpha(\delta)$.
Note that the condition $\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$ implies that $y \ne 0$. Using
(5.23), (5.24), and the triangle inequality we can estimate

$$\|y^\delta\|_2 - \delta = \|y^\delta\|_2 - \|Ax_\alpha - y^\delta\|_2 \le \|Ax_\alpha\|_2$$

and

$$\alpha \|Ax_\alpha\|_2 = \|AA^*(y^\delta - Ax_\alpha)\|_2 \le \|AA^*\|_2\, \delta.$$

Combining these two inequalities and using $\|y^\delta\|_2 \ge \|y\|_2 - \delta$ yields

$$\alpha \le \frac{\|AA^*\|_2\, \delta}{\|y\|_2 - 2\delta}$$

for all sufficiently small $\delta$. This implies that $\alpha \to 0$, $\delta \to 0$. Now the convergence (5.25) follows from
the representations (5.13) for $A^\dagger y$ and (5.19) for $x_\alpha$ (with $y$ replaced by
$y^\delta$) and the fact that $\|y^\delta - y\|_2 \to 0$, $\delta \to 0$. □

In practice, of course, one does not need to determine the regularization
parameter satisfying (5.24) exactly. Usually the following strategy will be
sufficient: Choose some moderately sized $\alpha$ and then keep decreasing $\alpha$ by
a constant factor $\gamma$, say, $\gamma = 0.5$, until $F(\alpha)$ becomes negative.
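This strategy takes only a few lines of code. The following sketch is our own illustration under a simplifying assumption: the matrix is diagonal, $A = \operatorname{diag}(\mu_1, \ldots, \mu_n)$, so that the solution of (5.23) is available componentwise. The function names, the starting value $\alpha = 1$, and the test data are hypothetical.

```python
import math

def tikhonov_diag(mu, y_delta, alpha):
    """Solution of (alpha*I + A*A) x = A* y_delta for the diagonal matrix A = diag(mu)."""
    return [m * c / (alpha + m * m) for m, c in zip(mu, y_delta)]

def discrepancy(mu, y_delta, alpha, delta):
    """F(alpha) = ||A x_alpha - y_delta||^2 - delta^2."""
    x = tikhonov_diag(mu, y_delta, alpha)
    return sum((m * xi - c) ** 2 for m, xi, c in zip(mu, x, y_delta)) - delta**2

def choose_alpha(mu, y_delta, delta, alpha=1.0, gamma=0.5, max_steps=200):
    """Decrease alpha by the factor gamma until F(alpha) becomes negative."""
    for _ in range(max_steps):
        if discrepancy(mu, y_delta, alpha, delta) < 0:
            return alpha
        alpha *= gamma
    raise RuntimeError("no admissible alpha found within max_steps")

mu = [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
x_true = [1.0] * 6
delta = 1e-3
noise = delta / math.sqrt(6)                      # data error of norm delta
y_delta = [m * x + (-1) ** j * noise
           for j, (m, x) in enumerate(zip(mu, x_true))]

alpha = choose_alpha(mu, y_delta, delta)
x_alpha = tikhonov_diag(mu, y_delta, alpha)
print("alpha =", alpha)
print("regularized solution:", x_alpha)
```

The returned parameter is the first value in the sequence $1, \gamma, \gamma^2, \ldots$ for which $F(\alpha) < 0$, so it lies within the factor $\gamma$ of the exact zero $\alpha(\delta)$ from Theorem 5.10.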
In order to illustrate that Tikhonov regularization works, Table 5.4 gives
some numerical results for the linear system of Example 5.1 with the
erroneous right-hand side generated by using $\ln 2 \approx 0.69315$ and choosing the
regularizing parameter $\alpha = 10^{-10}$ (without attempting to use Theorem
5.10).

TABLE 5.4. Regularized solution of the linear system (5.1)

n    α_0      α_1      α_2      α_3      α_4      α_5      α_6
1    0.9315  -0.4767
2    0.9862  -0.8052   0.3285
3    0.9987  -0.9546   0.7021  -0.2491
4    1.0015  -1.0193   1.0154  -0.7605   0.2644
5    0.9992  -0.9659   0.7236  -0.1458  -0.2838   0.1735
6    0.9995  -0.9618   0.6564   0.0254  -0.2818  -0.1512   0.2166

Problems
5.1 For the condition number of linear operators show that

$$\operatorname{cond}(AB) \le \operatorname{cond}(A)\, \operatorname{cond}(B).$$


5.2 Let A be an n x n matrix and Q be a unitary n x n matrix. Show that

and

5.3 Determine $\operatorname{cond}_2(A)$ for the matrix $A$ of Example 2.1 and discuss its
behavior for large $n$.

5.4 Find the inverse of the matrix

~)
7
A=(~ 11
2

and find the condition numbers $\operatorname{cond}_p(A)$ for $p = 1, 2, \infty$.


5.5 Find the inverse of the matrix

A= CO
i')
1 4
10 5
1
4 5 10
0 -1 7

and find the condition numbers $\operatorname{cond}_p(A)$ for $p = 1, 2, \infty$.

5.6 Calculate $\operatorname{cond}_\infty(A)$ for the matrix

A= ( ~
Show that one can improve the condition of a matrix by scaling through
calculating $\operatorname{cond}_\infty(DA)$ where $D$ is the diagonal matrix

\[ D = \operatorname{diag}(1/3,\ 1/111,\ 1/10101). \]


5.7 Let $A = (a_{jk})$ be an $n \times n$ matrix satisfying

\[ \sum_{k=1}^{n} |a_{jk}| = 1, \quad j = 1, \ldots, n. \]

Show that
\[ \operatorname{cond}_\infty(A) \le \operatorname{cond}_\infty(DA) \]
for all $n \times n$ diagonal matrices $D$ (see Problem 5.6).

5.8 For a nonsingular matrix A show that

\[ \frac{1}{\operatorname{cond}(A)} = \frac{1}{\|A\|}\, \min\{\|B\| : A + B \text{ is singular}\}. \]

This indicates that if a nonsingular matrix has a large condition number, it is
close to a singular matrix.

5.9 Show that for an $m \times n$ matrix $A$ the $n \times n$ matrix $A^*A$ is Hermitian and
positive semidefinite.

5.10 Find the singular value decomposition of

o
A=(~ o
1
-1 ~).

5.11 Show that $A^\dagger y$ is the least squares solution of $Ax = y$ with minimal norm.
5.12 Show that the pseudo-inverse $A^\dagger$ is uniquely determined by the properties

Express the pseudo-inverse in terms of the decomposition (5.8).

5.13 For the pseudo-inverse show that $(A^\dagger)^\dagger = A$ and $(A^\dagger)^* = (A^*)^\dagger$.
5.14 Give an example to show that in general, $(AB)^\dagger \ne B^\dagger A^\dagger$.

5.15 What is the pseudo-inverse of $A : \mathbb{C}^n \to \mathbb{C}^m$ given by $Ax = (x, a)b$ with
$a \in \mathbb{C}^m$ and $b \in \mathbb{C}^n$?

5.16 For an $m \times n$ matrix show that

\[ \|(\alpha I + A^*A)^{-1}A^*\|_2 \le \frac{1}{2\sqrt{\alpha}} \]

for $\alpha > 0$.
5.17 Give an alternative proof of Theorem 5.9 by using the necessary and suf-
ficient conditions for the minimum of a function of n variables.

5.18 Let $X$ and $Y$ be finite-dimensional pre-Hilbert spaces and let $A : X \to Y$
be a linear operator. Show that there exists a uniquely determined linear operator
$A^* : Y \to X$ with the property

\[ (Ax, y)_Y = (x, A^*y)_X \]
for all $x \in X$ and $y \in Y$. Use this result to formulate and prove a generalization
of Theorem 5.9 for the minimization of

\[ \|Ax - y\|_Y^2 + \alpha \|x\|_X^2. \]


5.19 Show that
\[ (x, y) := \sum_{j=0}^{n} x_j \bar{y}_j + \sum_{j=1}^{n} (x_j - x_{j-1})(\bar{y}_j - \bar{y}_{j-1}) \]

defines a scalar product on $\mathbb{C}^{n+1}$. Discuss its use in Tikhonov regularization as
indicated in Problem 5.18, where in addition to large components of the solution
vector oscillations between consecutive components are also penalized.

5.20 Show that $A : C[0,1] \to C[0,1]$ defined by

\[ (Af)(x) := \int_0^x f(y)\, dy, \quad x \in [0,1], \]

is a bounded linear operator that does not have a bounded inverse; i.e., show
that differentiation is an ill-posed problem.
6
Iterative Methods for
Nonlinear Systems

In this chapter we will study the solution of systems of nonlinear equa-


tions. As opposed to linear equations, no explicit solution techniques are,
in general, available for nonlinear equations, and hence their solution com-
pletely relies on iterative methods. In the first section we shall begin with
the application of the Banach fixed point theorem for systems of nonlin-
ear equations with one or several variables. Given the fact that iterative
techniques have a long history in mathematics, the significance of Banach's
fixed point theorem originates from its unified approach, covering a wide
variety of different successive approximation methods.
In the second section, we will continue with the study of Newton's it-
eration method for finding zeros of functions of one or several variables.
This iteration scheme is attributed to Newton, since in 1669 he developed
a solution method for cubic equations by linearization that may be viewed
as a precursor of what is now known as Newton iteration. He also used this
method for approximately solving Kepler's equations for planetary motion.
In the concluding two sections of this chapter we will consider the appli-
cation of Newton's method for finding zeros of polynomials and its modifi-
cation into the more recently developed Levenberg-Marquardt scheme for
solving the least squares problem.
Given the vast number of iterative methods available for nonlinear equa-
tions, we will confine our presentation to describing the fundamental ideas
and will not aim at a complete treatment of the subject.

6.1 Successive Approximations


In this section, we will consider systems of $n$ nonlinear equations for $n$
unknowns of the form
\[ f(x) = x, \]
where $x = (x_1, \ldots, x_n)^T$ and $f(x) = (f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n))^T$.
We begin by studying the case of a single nonlinear equation with one
unknown. Obviously, in one dimension, solving $f(x) = x$ geometrically
corresponds to determining the intersection of the graph of the function $f$
with the straight line described by the function $x \mapsto x$.

Theorem 6.1 Let $D \subset \mathbb{R}$ be a closed interval and let $f : D \to D$ be a
continuously differentiable function with the property

\[ q := \sup_{x \in D} |f'(x)| < 1. \]

Then the equation $f(x) = x$ has a unique solution $x \in D$, and the successive
approximations
\[ x_{\nu+1} := f(x_\nu), \quad \nu = 0, 1, 2, \ldots, \]
with arbitrary $x_0 \in D$ converge to this solution. We have the a priori error
estimate
\[ |x_\nu - x| \le \frac{q^\nu}{1-q}\, |x_1 - x_0| \]
and the a posteriori error estimate
\[ |x_\nu - x| \le \frac{q}{1-q}\, |x_\nu - x_{\nu-1}| \]
for all $\nu \in \mathbb{N}$.

Proof. Equipped with the norm $\|\cdot\| = |\cdot|$ the space $\mathbb{R}$ is complete. By the
mean value theorem, for $x, y \in D$ with $x < y$, we have that

\[ f(x) - f(y) = f'(\xi)(x - y) \]

for some intermediate point $\xi \in (x, y)$. Hence

\[ |f(x) - f(y)| \le \sup_{\xi \in D} |f'(\xi)|\, |x - y| = q|x - y|, \]

which is also valid for $x, y \in D$ with $x \ge y$. Therefore, $f$ is a contraction,
and the assertion follows from the Banach fixed point Theorem 3.46. □
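
The a posteriori estimate of Theorem 6.1 can serve as a stopping rule. The following Python sketch iterates until the bound $\frac{q}{1-q}|x_\nu - x_{\nu-1}|$ falls below a tolerance. The choice $f(x) = \cos x$ on $[0,1]$ with $q = \sin 1$, the tolerance, and the function name are assumptions made for illustration only.

```python
import math

def successive_approximations(f, x0, q, tol=1e-10, max_iter=1000):
    """Iterate x_{v+1} = f(x_v) until q/(1-q) * |x_v - x_{v-1}| <= tol.

    q must be a contraction number of f on an interval that f maps into itself.
    Returns the last iterate, the number of steps, and the a posteriori bound.
    """
    x_prev = x0
    for v in range(1, max_iter + 1):
        x = f(x_prev)
        bound = q / (1.0 - q) * abs(x - x_prev)
        if bound <= tol:
            return x, v, bound
        x_prev = x
    raise RuntimeError("tolerance not reached")

q = math.sin(1.0)   # sup of |f'(x)| = |sin x| on [0, 1]
x, steps, bound = successive_approximations(math.cos, 1.0, q)
print(f"{x:.10f} after {steps} steps, guaranteed error <= {bound:.1e}")
```

Since the bound is guaranteed by the theorem, the true error of the returned iterate is at most the tolerance, although the iteration usually stops somewhat later than strictly necessary.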

Figure 6.1 illustrates graphically the successive approximations for
functions $f$ with positive and negative slope, respectively, of absolute value
less than one. Note that the sequence $(x_\nu)$ converges to the fixed point
monotonically if $f$ has positive slope and that it converges with values
alternating above and below the fixed point if $f$ has negative slope. In both
cases the slope of the function $f$ has absolute value less than one in a
neighborhood of the fixed point. From drawing a corresponding figure for
a function with a slope of absolute value greater than one it can be seen
that the corresponding iteration will move away from the fixed point (see
Problem 6.2).


FIGURE 6.1. Fixed point iteration

The following theorem states that for a fixed point $x$ with $|f'(x)| < 1$ we
can always find starting points $x_0$ ensuring convergence of the successive
approximations.
Theorem 6.2 Let $x$ be a fixed point of a continuously differentiable
function $f$ such that $|f'(x)| < 1$. Then the method of successive approximations
$x_{\nu+1} := f(x_\nu)$ is locally convergent; i.e., there exists a neighborhood $B$ of
the fixed point $x$ such that the successive approximations converge to $x$ for
all $x_0 \in B$.
Proof. Since $f'$ is continuous and $|f'(x)| < 1$, there exist constants $0 < q < 1$
and $\delta > 0$ such that $|f'(y)| \le q$ for all $y \in B := [x - \delta, x + \delta]$. Then we have
that
\[ |f(y) - x| = |f(y) - f(x)| \le q|y - x| \le |y - x| \le \delta \]
for all $y \in B$; i.e., $f$ maps $B$ into itself and is a contraction $f : B \to B$.
Now the statement of the theorem follows from Theorem 6.1. □

Theorem 6.2 expresses the fact that for a fixed point $x$ with $|f'(x)| < 1$
the sequence $x_{\nu+1} := f(x_\nu)$ converges if the starting point $x_0$ is sufficiently
close to x. In practical situations the problem of how to obtain such a good
initial guess is unresolved in general. Frequently, however, a good estimate
of the fixed point might be known a priori from the underlying application
or might be deduced from analytic observations.
The following examples illustrate that in some cases we also have global
convergence, where the successive approximations converge for each start-
ing point in the domain of definition of the function f.

Example 6.3 In order to describe a division by iteration, for $a > 0$ we
consider the function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) := 2x - ax^2$. The graph of
this function is a parabola with maximum value $1/a$ attained at $1/a$. By
solving the quadratic equation $f(x) = x$ it can be seen that $f$ has the fixed
points $x = 0$ and $x = 1/a$. Obviously, $f$ maps the open interval $(0, 2/a)$
into $(0, 1/a]$. Since $f'(x) = 2(1 - ax)$, we have $f'(0) = 2$ and $f'(1/a) = 0$.
From the property $x < f(x) < 1/a$, which is valid for $0 < x < 1/a$,
it follows that the sequence $x_{\nu+1} := 2x_\nu - ax_\nu^2$ is monotonically increasing
and bounded. Hence, the successive approximations converge to the fixed
point $x = 1/a$ for arbitrarily chosen $x_0 \in (0, 2/a)$. Figure 6.2 illustrates the
convergence. The numerical results are for $a = 2$ and two different starting
points, $x_0 = 0.3$ and $x_0 = 0.4$. □

$\nu$   $x_\nu$        $x_\nu$

0   0.30000000   0.40000000
1   0.42000000   0.48000000
2   0.48720000   0.49920000
3   0.49967232   0.49999872

FIGURE 6.2. Division by iteration
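
The numerical values listed for Example 6.3 can be reproduced with a few lines of Python. This is only a sketch; the function name `reciprocal` is an assumption, and the iteration uses nothing but multiplication and subtraction, as the example intends.

```python
def reciprocal(a, x0, steps):
    """Successive approximations x_{v+1} = 2 x_v - a x_v^2 for 1/a."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(2.0 * x - a * x * x)
    return xs

for x0 in (0.3, 0.4):
    print([f"{x:.8f}" for x in reciprocal(2.0, x0, 3)])
```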

Example 6.4 For computing the square root of a positive real number $a$
by an iterative method we consider the function $f : (0, \infty) \to (0, \infty)$ given
by
\[ f(x) := \frac{1}{2}\left(x + \frac{a}{x}\right). \]
By solving the quadratic equation $f(x) = x$ it can be seen that $f$ has
the fixed point $x = \sqrt{a}$. By the arithmetic geometric mean inequality we
have that $f(x) \ge \sqrt{a}$ for $x > 0$; i.e., $f$ maps the open interval $(0, \infty)$ into
$[\sqrt{a}, \infty)$, and therefore it maps the closed interval $[\sqrt{a}, \infty)$ into itself. From
\[ f'(x) = \frac{1}{2}\left(1 - \frac{a}{x^2}\right) \]
it follows that
\[ q := \sup_{\sqrt{a} \le x < \infty} |f'(x)| = \frac{1}{2}. \]
Hence $f : [\sqrt{a}, \infty) \to [\sqrt{a}, \infty)$ is a contraction. Therefore, by Theorem 6.1
the successive approximations

\[ x_{\nu+1} := \frac{1}{2}\left(x_\nu + \frac{a}{x_\nu}\right), \quad \nu = 0, 1, \ldots, \]

converge to the square root $\sqrt{a}$ for each $x_0 > 0$, and we have the a posteriori
error estimate
\[ |\sqrt{a} - x_\nu| \le |x_\nu - x_{\nu-1}|. \]
Figure 6.3 illustrates the convergence. The numerical results again are for
$a = 2$. □

$\nu$   $x_\nu$

0 5.00000000
1 2.70000000
2 1.72037037
3 1.44145537
4 1.41447098
5 1.41421359
6 1.41421356

FIGURE 6.3. Square root by iteration
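
A short Python sketch of the square root iteration of Example 6.4 for $a = 2$ and $x_0 = 5$, which also checks the a posteriori estimate $|\sqrt{a} - x_\nu| \le |x_\nu - x_{\nu-1}|$ numerically. The function name `heron` is an assumption, and `math.sqrt` is used only as a reference value.

```python
import math

def heron(a, x0, steps):
    """Successive approximations x_{v+1} = (x_v + a / x_v) / 2 for sqrt(a)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(0.5 * (xs[-1] + a / xs[-1]))
    return xs

xs = heron(2.0, 5.0, 6)
for v in range(1, len(xs)):
    error = abs(math.sqrt(2.0) - xs[v])
    bound = abs(xs[v] - xs[v - 1])      # a posteriori bound with q = 1/2
    print(v, f"{xs[v]:.8f}", error <= bound)
```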

In both of Examples 6.3 and 6.4 the numerical values exhibit a very
rapid convergence. This is due to the fact that because of f'(x) = 0 at the
fixed point, the contraction number is very small. We shall elaborate on
this observation later when we consider Newton's method.

TABLE 6.1. Iterations for Example 6.5

$\nu$   $x_\nu$          $\nu$   $x_\nu$

 0   1.00000000      7   0.72210243
 1   0.54030231
 2   0.85755322
 3   0.65428979     45   0.73908513
 4   0.79348036     46   0.73908514
 5   0.70136877     47   0.73908513
 6   0.76395968     48   0.73908513
Example 6.5 Consider the function $f : [0,1] \to [0,1]$ given by

\[ f(x) := \cos x. \]

Here we have
\[ q = \sup_{0 \le x \le 1} |f'(x)| = \sin 1 < 1, \]
and Theorem 6.1 implies that the successive approximations $x_{\nu+1} := \cos x_\nu$
converge to the unique solution $x$ of $\cos x = x$ for each $x_0 \in [0,1]$. Table
6.1 illustrates the convergence, which is notably slower than in the two
previous examples. □

By the following example we illustrate how to obtain a fixed point of


a function with derivative greater than one by working with the inverse
function.

Example 6.6 The function $h : (0, \infty) \to (-\infty, \infty)$ given by $h(x) := x + \ln x$
is strictly monotonically increasing with limits $\lim_{x \to 0} h(x) = -\infty$ and
$\lim_{x \to \infty} h(x) = \infty$. Therefore, the function $f(x) := -\ln x$ has a unique
fixed point $x$. Since this fixed point must satisfy $0 < x < 1$, the derivative

\[ |f'(x)| = \frac{1}{x} > 1 \]

implies that $f$ is not contracting in a neighborhood of the fixed point.
However, we can still design a convergent scheme because $x = -\ln x$ is
equivalent to $e^{-x} = x$. We consider the inverse function

\[ g(x) := e^{-x} \]

of $f$, which has derivative $|g'(x)| = e^{-x} < 1$ at the fixed point, so that we
can apply Theorem 6.2. Obviously, for each $0 < a < 1/e$ the exponential
function $g$ maps the interval $[a, 1]$ into itself. Since

\[ q = \sup_{a \le x \le 1} |g'(x)| = e^{-a} < 1, \]

by Theorem 6.1 it follows that for arbitrary $x_0 > 0$ the successive
approximations $x_{\nu+1} = e^{-x_\nu}$ converge to the unique solution of $x = e^{-x}$. □
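
The contrast between the two formulations in Example 6.6 can be observed numerically. In the following Python sketch (starting values and iteration counts are assumptions), iterating $g(x) = e^{-x}$ converges, whereas iterating $f(x) = -\ln x$ from a point close to the fixed point moves away from it.

```python
import math

# Convergent formulation: x_{v+1} = exp(-x_v).
x = 1.0
for _ in range(100):
    x = math.exp(-x)
print(f"fixed point of exp(-x): {x:.8f}")

# Non-contracting formulation: x_{v+1} = -ln(x_v), started near the fixed point.
y = 0.56
deviations = [abs(y - x)]
for _ in range(3):
    y = -math.log(y)
    deviations.append(abs(y - x))
print("distance to the fixed point under -ln:", [f"{d:.4f}" for d in deviations])
```

Only three steps of the second iteration are taken, since later iterates may leave the interval $(0, 1)$ where the fixed point lies.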

Now we will extend Theorem 6.1 to systems of nonlinear equations. A
subset $D$ of a linear space $X$ is called convex if

\[ \lambda x + (1 - \lambda) y \in D \]

for all $x, y \in D$ and all $\lambda \in (0, 1)$, i.e., if the straight line connecting $x$ and
$y$ is contained in $D$.
Theorem 6.7 Let $D \subset \mathbb{R}^n$ be open and convex and let $f : D \to \mathbb{R}^n$ be a
mapping
\[ f(x) = (f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n))^T, \]
where the $f_j : D \to \mathbb{R}$, $j = 1, \ldots, n$, are continuously differentiable
functions. By

\[ f'(x) = \left( \frac{\partial f_j}{\partial x_k}(x) \right)_{j,k=1,\ldots,n} \]

we denote the Jacobian matrix of $f$. Then we have the mean value theorem

\[ \|f(x) - f(y)\| \le \max_{0 \le \lambda \le 1} \|f'[\lambda x + (1 - \lambda) y]\|\, \|x - y\| \]

for all $x, y \in D$ (and all norms $\|\cdot\|$ on $\mathbb{R}^n$).

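Proof. Let $g : [0,1] \to \mathbb{R}^n$ be continuous. We will show that

\[ \left\| \int_0^1 g(\lambda)\, d\lambda \right\| \le \int_0^1 \|g(\lambda)\|\, d\lambda, \tag{6.1} \]

where the integral on the left-hand side has to be understood as the vector
of the integrals over the components of $g$. The function $\lambda \mapsto \|g(\lambda)\|$ is
continuous, since the norm is a continuous function. Therefore, the integral
on the right-hand side of (6.1) is well-defined. Consider the equidistant
subdivision $\lambda_i = i/m$, $i = 0, 1, \ldots, m$, for $m \in \mathbb{N}$. Then we have the
converging Riemann sums

\[ \sum_{i=1}^{m} \|g(\lambda_i)\| (\lambda_i - \lambda_{i-1}) \to \int_0^1 \|g(\lambda)\|\, d\lambda, \quad m \to \infty, \]
and
\[ \sum_{i=1}^{m} g(\lambda_i) (\lambda_i - \lambda_{i-1}) \to \int_0^1 g(\lambda)\, d\lambda, \quad m \to \infty. \]

From the second limit, by the continuity of the norm we conclude that

\[ \left\| \sum_{i=1}^{m} g(\lambda_i) (\lambda_i - \lambda_{i-1}) \right\| \to \left\| \int_0^1 g(\lambda)\, d\lambda \right\|, \quad m \to \infty. \]

Now (6.1) follows by passing to the limit $m \to \infty$ in the inequality

\[ \left\| \sum_{i=1}^{m} g(\lambda_i) (\lambda_i - \lambda_{i-1}) \right\| \le \sum_{i=1}^{m} \|g(\lambda_i)\| (\lambda_i - \lambda_{i-1}), \]

which is a consequence of the triangle inequality.

Since $D$ is convex, for all $x, y \in D$ we have that

\[ f_j(x) - f_j(y) = \int_0^1 \frac{d}{d\lambda} f_j[\lambda x + (1 - \lambda) y]\, d\lambda, \quad j = 1, \ldots, n. \]

By the chain rule we compute

\[ \frac{d}{d\lambda} f_j[\lambda x + (1 - \lambda) y] = \sum_{k=1}^{n} \frac{\partial f_j}{\partial x_k}[\lambda x + (1 - \lambda) y]\, (x_k - y_k), \]

and therefore
\[ f_j(x) - f_j(y) = \int_0^1 \sum_{k=1}^{n} \frac{\partial f_j}{\partial x_k}[\lambda x + (1 - \lambda) y]\, (x_k - y_k)\, d\lambda; \]

i.e., in vector form,

\[ f(x) - f(y) = \int_0^1 f'[\lambda x + (1 - \lambda) y]\, (x - y)\, d\lambda. \]

From this, with the aid of (6.1) and the continuity of $\lambda \mapsto f'[\lambda x + (1 - \lambda) y]$,
we obtain

\[ \|f(x) - f(y)\| \le \int_0^1 \|f'[\lambda x + (1 - \lambda) y]\|\, \|x - y\|\, d\lambda
   \le \max_{0 \le \lambda \le 1} \|f'[\lambda x + (1 - \lambda) y]\|\, \|x - y\|, \]

which ends the proof. □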


Theorem 6.8 Let $D \subset \mathbb{R}^n$ be closed and convex (with a nonempty
interior) and let $f : D \to D$ be a continuous mapping. Assume further that $f$
is continuously differentiable in the interior of $D$ and that its Jacobian can
be continuously extended to all of $D$ such that

\[ q := \sup_{x \in D} \|f'(x)\| < 1 \]

in some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then the equation $f(x) = x$ has a unique solution
$x \in D$, and the successive approximations

\[ x_{\nu+1} := f(x_\nu), \quad \nu = 0, 1, 2, \ldots, \]

converge for each $x_0 \in D$ to this fixed point. We have the a priori error
estimate
\[ \|x_\nu - x\| \le \frac{q^\nu}{1-q}\, \|x_1 - x_0\| \]
and the a posteriori error estimate
\[ \|x_\nu - x\| \le \frac{q}{1-q}\, \|x_\nu - x_{\nu-1}\| \]
for all $\nu \in \mathbb{N}$.

Proof. By the mean value Theorem 6.7 the mapping $f : D \to D$ is a
contraction. □

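By Theorem 3.26 we have that each of the conditions

\[ \sup_{x \in D} \max_{j=1,\ldots,n} \sum_{k=1}^{n} \left| \frac{\partial f_j}{\partial x_k}(x) \right| < 1, \]

\[ \sup_{x \in D} \max_{k=1,\ldots,n} \sum_{j=1}^{n} \left| \frac{\partial f_j}{\partial x_k}(x) \right| < 1, \]

\[ \sup_{x \in D} \left[ \sum_{j,k=1}^{n} \left| \frac{\partial f_j}{\partial x_k}(x) \right|^2 \right]^{1/2} < 1 \]

ensures convergence of the successive approximations in Theorem 6.8.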


The following local convergence theorem can be proven analogously to
Theorem 6.2.
Theorem 6.9 Let $x$ be a fixed point of a continuously differentiable
function $f$ such that $\|f'(x)\| < 1$ in some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then the method
of successive approximations $x_{\nu+1} := f(x_\nu)$ is locally convergent; i.e., there
exists a neighborhood $B$ of the fixed point $x$ such that the successive
approximations converge to $x$ for all starting elements $x_0 \in B$.
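Example 6.10 For the system
\[ \begin{aligned}
x_1 &= 0.5 \cos x_1 - 0.5 \sin x_2 \\
x_2 &= 0.5 \sin x_1 + 0.5 \cos x_2
\end{aligned} \]
we have
\[ f'(x) = \begin{pmatrix} -0.5 \sin x_1 & -0.5 \cos x_2 \\ 0.5 \cos x_1 & -0.5 \sin x_2 \end{pmatrix}, \]
and therefore $\|f'(x)\|_2 \le \sqrt{0.5}$ for all $x \in \mathbb{R}^2$. Hence Theorem 6.8 is
applicable. □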
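
For the system of Example 6.10 the successive approximations are easily carried out. The following Python sketch uses the starting point $(0, 0)$ and 200 steps, both of which are assumptions; with contraction number at most $\sqrt{0.5}$ this is far more than needed for double precision.

```python
import math

def f(x1, x2):
    """Right-hand side of the fixed point system of Example 6.10."""
    return (0.5 * math.cos(x1) - 0.5 * math.sin(x2),
            0.5 * math.sin(x1) + 0.5 * math.cos(x2))

x = (0.0, 0.0)
for _ in range(200):
    x = f(*x)

residual = max(abs(a - b) for a, b in zip(x, f(*x)))
print(x, residual)
```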
The reader will not be surprised to learn that for speeding up convergence
of the successive approximations, concepts developed for linear equations
like relaxation methods or multigrid methods can also be successfully em-
ployed in the nonlinear case. However, since we discussed these methods
in some detail in Sections 4.2 and 4.3 for linear equations, we shall refrain
from repeating the analysis for nonlinear equations.

6.2 Newton's Method


We now want to determine zeros of a function of $n$ variables; i.e., we want
to solve equations of the form
\[ f(x) = 0, \]
where $f : D \to \mathbb{R}^n$ is a continuously differentiable function defined on some
open subset $D \subset \mathbb{R}^n$.
We begin by considering a function of one variable. Let $x_0$ be an
approximation to a zero of the function $f$. In a neighborhood of $x_0$, by Taylor's
formula we have that

\[ f(x) \approx f(x_0) + f'(x_0)(x - x_0) =: g(x). \tag{6.2} \]

Therefore, we may consider the zero of the affine linear function $g$ as a
new approximation to the zero of $f$ and denote it by $x_1$. From the linear
equation
\[ f(x_0) + f'(x_0)(x_1 - x_0) = 0 \tag{6.3} \]
we immediately obtain
\[ x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}. \]

Geometrically, the affine linear function $g$ describes the tangent line to the
graph of the function $f$ at the point $x_0$.
This consideration can be extended to the case of more than one variable.
Given an approximation $x_0$ to a zero of $f$, by Taylor's formula we still have
the approximation (6.2), where now, as in the previous section,

\[ f'(x) = \left( \frac{\partial f_j}{\partial x_k}(x) \right)_{j,k=1,\ldots,n} \]

denotes the Jacobian matrix of $f$. Again we obtain a new approximation
$x_1$ for the solution of $f(x) = 0$ by solving the linearized equation (6.3), i.e.,
by
\[ x_1 = x_0 - [f'(x_0)]^{-1} f(x_0). \]

Geometrically, the function $g$ of (6.2) corresponds to the hyperplane
tangent to $f$ at the point $x_0$.
Iterating this procedure leads to Newton's method, as described in the
following definition. In the case of one variable, the geometric situation is
shown in Figure 6.4.

Definition 6.11 Let $D \subset \mathbb{R}^n$ be open and let $f : D \to \mathbb{R}^n$ be a
continuously differentiable function such that the Jacobian matrix $f'(x)$ is
nonsingular for all $x \in D$. Then Newton's method for the solution of the equation

\[ f(x) = 0 \]

is given by the iteration scheme

\[ x_{\nu+1} := x_\nu - [f'(x_\nu)]^{-1} f(x_\nu), \quad \nu = 0, 1, \ldots, \]

starting with some $x_0 \in D$.



FIGURE 6.4. Newton's method

We explicitly note that $x_{\nu+1}$ is obtained by solving the system of linear
equations

\[ f'(x_\nu)(x_\nu - x_{\nu+1}) = f(x_\nu) \]

for $x_\nu - x_{\nu+1}$; i.e., no matrix inversion is required.
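
A minimal Python sketch of the Newton iteration for a system of two equations, in which each step solves the linear system for the difference $x_\nu - x_{\nu+1}$ instead of forming an inverse matrix. The test system $x^2 + y^2 = 4$, $xy = 1$, the starting point, the tolerance, and the use of Cramer's rule for the 2 x 2 solve are all assumptions chosen for illustration; for larger $n$ one would use an elimination method.

```python
def newton_2d(F, J, x, tol=1e-12, max_iter=50):
    """Newton's method for two equations in two unknowns."""
    for _ in range(max_iter):
        f1, f2 = F(x)
        (a, b), (c, d) = J(x)
        det = a * d - b * c
        if det == 0.0:
            raise ZeroDivisionError("singular Jacobian")
        # Solve J(x) * delta = F(x) for delta = x_v - x_{v+1} (Cramer's rule).
        delta1 = (f1 * d - b * f2) / det
        delta2 = (a * f2 - c * f1) / det
        x = (x[0] - delta1, x[1] - delta2)
        if max(abs(delta1), abs(delta2)) <= tol:
            return x
    raise RuntimeError("no convergence")

# Hypothetical test system: x^2 + y^2 = 4, x y = 1.
F = lambda p: (p[0] ** 2 + p[1] ** 2 - 4.0, p[0] * p[1] - 1.0)
J = lambda p: ((2.0 * p[0], 2.0 * p[1]), (p[1], p[0]))

root = newton_2d(F, J, (2.0, 0.5))
print(root)
```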


Example 6.12 For the function
\[ f(x) := a - \frac{1}{x} \]

where $a > 0$, the Newton iteration is given by

\[ x_{\nu+1} := 2x_\nu - ax_\nu^2. \]

By Example 6.3 we have convergence for all $x_0 \in (0, 2/a)$. □


Example 6.13 For the function

\[ f(x) := x^2 - a \]

where $a > 0$, the Newton iteration is given by

\[ x_{\nu+1} := \frac{1}{2}\left(x_\nu + \frac{a}{x_\nu}\right). \]

By Example 6.4 we have convergence for all $x_0 \in (0, \infty)$. □
Of course, we cannot expect that Newton's method will always converge.
However, by the following analysis we can assure local convergence.
Theorem 6.14 Let $D \subset \mathbb{R}^n$ be open and convex and let $f : D \to \mathbb{R}^n$ be
continuously differentiable. Assume that for some norm $\|\cdot\|$ on $\mathbb{R}^n$ and
some $x_0 \in D$ the following conditions hold:

(a) $f$ satisfies
\[ \|f'(x) - f'(y)\| \le \gamma \|x - y\| \]
for all $x, y \in D$ and some constant $\gamma > 0$.
(b) The Jacobian matrix $f'(x)$ is nonsingular for all $x \in D$, and there
exists a constant $\beta > 0$ such that
\[ \|[f'(x)]^{-1}\| \le \beta, \quad x \in D. \]
(c) For the constants

\[ \alpha := \|[f'(x_0)]^{-1} f(x_0)\| \quad \text{and} \quad q := \alpha \beta \gamma \]

the inequality
\[ q < \frac{1}{2} \]
is satisfied.
(d) For $r := 2\alpha$ the closed ball $B[x_0, r] := \{x : \|x - x_0\| \le r\}$ is contained
in $D$.
Then $f$ has a unique zero $x^*$ in $B[x_0, r]$. Starting with $x_0$ the Newton
iteration
\[ x_{\nu+1} := x_\nu - [f'(x_\nu)]^{-1} f(x_\nu), \quad \nu = 0, 1, \ldots, \tag{6.4} \]
is well-defined. The sequence $(x_\nu)$ converges to the zero $x^*$ of $f$, and we
have the error estimate
\[ \|x_\nu - x^*\| \le 2\alpha q^{2^\nu - 1}, \quad \nu = 0, 1, \ldots. \]
Proof. 1. Let $x, y, z \in D$. From the proof of Theorem 6.7 we know that

\[ f(y) - f(x) = \int_0^1 f'[\lambda x + (1 - \lambda) y]\, (y - x)\, d\lambda. \]

Hence

\[ f(y) - f(x) - f'(z)(y - x) = \int_0^1 \{f'[\lambda x + (1 - \lambda) y] - f'(z)\}\, (y - x)\, d\lambda, \]

and estimating with the aid of (6.1) and condition (a) we find that

\[ \begin{aligned}
\|f(y) - f(x) - f'(z)(y - x)\|
 &\le \gamma \|y - x\| \int_0^1 \|\lambda (x - z) + (1 - \lambda)(y - z)\|\, d\lambda \\
 &\le \frac{\gamma}{2}\, \|y - x\|\, \{\|x - z\| + \|y - z\|\}.
\end{aligned} \]

Choosing $z = x$ shows that

\[ \|f(y) - f(x) - f'(x)(y - x)\| \le \frac{\gamma}{2}\, \|y - x\|^2 \tag{6.5} \]

for all $x, y \in D$, and choosing $z = x_0$ yields

\[ \|f(y) - f(x) - f'(x_0)(y - x)\| \le \gamma r \|y - x\| \tag{6.6} \]

for all $x, y \in B[x_0, r]$.


2. We proceed by proving through induction that

\[ \|x_\nu - x_0\| \le r \quad \text{and} \quad \|x_\nu - x_{\nu-1}\| \le \alpha q^{2^{\nu-1} - 1}, \quad \nu = 1, 2, \ldots. \tag{6.7} \]

This is valid for $\nu = 1$, since

\[ \|x_1 - x_0\| = \|[f'(x_0)]^{-1} f(x_0)\| = \alpha = \frac{r}{2} < r \]

as a consequence of conditions (c) and (d). Assume that the inequalities
(6.7) are proven up to some $\nu \ge 1$. Then by condition (b) and since
$x_\nu \in B[x_0, r] \subset D$, the element $x_{\nu+1}$ is well-defined. With the aid of
condition (b), the definition (6.4) applied to $x_\nu$, the estimate (6.5), the induction
assumption, and the definition of $q$ we can estimate

\[ \begin{aligned}
\|x_{\nu+1} - x_\nu\| &= \|[f'(x_\nu)]^{-1} f(x_\nu)\| \le \beta \|f(x_\nu)\| \\
 &= \beta \|f(x_\nu) - f(x_{\nu-1}) - f'(x_{\nu-1})(x_\nu - x_{\nu-1})\| \\
 &\le \frac{\beta\gamma}{2}\, \|x_\nu - x_{\nu-1}\|^2 \le \frac{\beta\gamma}{2} \left[ \alpha q^{2^{\nu-1} - 1} \right]^2
  = \frac{\alpha}{2}\, q^{2^\nu - 1} < \alpha q^{2^\nu - 1}.
\end{aligned} \]

From this, with the help of the triangle inequality, the induction
assumption, and condition (c), we obtain that

\[ \begin{aligned}
\|x_{\nu+1} - x_0\| &\le \|x_{\nu+1} - x_\nu\| + \cdots + \|x_1 - x_0\| \\
 &\le \alpha \left( 1 + q + q^3 + q^7 + \cdots + q^{2^\nu - 1} \right) \le \frac{\alpha}{1 - q} \le 2\alpha = r;
\end{aligned} \]

i.e., the inequalities (6.7) also hold for $\nu + 1$.
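3. For $\mu > 0$, using $q < 1/2$, we now can estimate

\[ \begin{aligned}
\|x_{\nu+\mu} - x_\nu\| &\le \|x_{\nu+\mu} - x_{\nu+\mu-1}\| + \cdots + \|x_{\nu+1} - x_\nu\| \\
 &\le \alpha q^{2^\nu - 1} \left( 1 + q^{2^\nu} + (q^{2^\nu})^3 + \cdots \right)
  \le \frac{\alpha q^{2^\nu - 1}}{1 - q^{2^\nu}} \le 2\alpha q^{2^\nu - 1}.
\end{aligned} \tag{6.8} \]

From this we observe that $(x_\nu)$ is a Cauchy sequence, since $q < 1/2$ and by
Theorem 3.39 the limit
\[ x^* = \lim_{\nu \to \infty} x_\nu \]

exists. Passing to the limit $\nu \to \infty$ in (6.7) we obtain $\|x^* - x_0\| \le r$, i.e.,
$x^* \in B[x_0, r]$, and passing to the limit $\mu \to \infty$ in (6.8) the error estimate
of the theorem follows.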
4. We now show that the limit $x^*$ is a zero of the function $f$. With the aid
of (6.4) and condition (a) we can estimate

\[ \begin{aligned}
\|f(x_\nu)\| &= \|f'(x_\nu)(x_{\nu+1} - x_\nu)\| \\
 &\le \|f'(x_\nu) - f'(x_0) + f'(x_0)\|\, \|x_{\nu+1} - x_\nu\| \\
 &\le \{\gamma \|x_\nu - x_0\| + \|f'(x_0)\|\}\, \|x_{\nu+1} - x_\nu\| \to 0, \quad \nu \to \infty.
\end{aligned} \]

Hence $f(x_\nu) \to 0$, $\nu \to \infty$, and the continuity of $f$ implies that indeed
$f(x^*) = 0$.
5. We conclude the proof by showing that $x^*$ is the only zero of $f$ in the
ball $B[x_0, r]$. For this we consider the function $g : B[x_0, r] \to \mathbb{R}^n$ defined
by
\[ g(x) := x - [f'(x_0)]^{-1} f(x). \]
From conditions (b) and (c) and the inequality (6.6), by writing
\[ g(x) - g(y) = [f'(x_0)]^{-1} \{f(y) - f(x) - f'(x_0)(y - x)\} \]
we deduce that

\[ \|g(x) - g(y)\| \le \beta\gamma r \|y - x\| \le 2q \|y - x\| \]

for all $x, y \in B[x_0, r]$; i.e., $g$ is a contraction. Therefore, by Theorem 3.44
the function $g$ has at most one fixed point in $B[x_0, r]$. Now uniqueness
of the zero of $f$ in $B[x_0, r]$ follows from the equivalence of the equations
$g(x) = x$ and $f(x) = 0$. □

Our main application of Theorem 6.14 consists in deriving the following
local convergence result for Newton's method.
Corollary 6.15 Let $D \subset \mathbb{R}^n$ be open and let $f : D \to \mathbb{R}^n$ be twice
continuously differentiable, and assume that $x^*$ is a zero of $f$ such that the
Jacobian $f'(x^*)$ is nonsingular. Then Newton's method is locally
convergent; i.e., there exists a neighborhood $B$ of the zero $x^*$ such that the Newton
iterations converge to $x^*$ for all $x_0 \in B$.
Proof. Since $f$ is twice continuously differentiable, by the mean value
Theorem 6.7 applied to the components of $f'$ there exists $\gamma > 0$ such that

\[ \|f'(x) - f'(y)\| \le \gamma \|x - y\| \]

for all $x, y$ in some closed ball $B[x^*, \rho]$ centered at $x^*$. We write

\[ f'(x) = f'(x^*) \{I + [f'(x^*)]^{-1} [f'(x) - f'(x^*)]\} \]

and deduce from the above estimate and Theorem 3.48 that the radius $\rho$
of $B[x^*, \rho]$ can be chosen such that $f'(x)$ is nonsingular on $B[x^*, \rho]$ and
$\|[f'(x)]^{-1}\| \le \beta$ for all $x \in B[x^*, \rho]$ and some constant $\beta > 0$.
Since $f$ is continuous, $f(x^*) = 0$ implies that there exists $\delta < \rho/2$ such
that
\[ \|f(x_0)\| < \min\left\{ \frac{\rho}{4\beta},\ \frac{1}{2\beta^2\gamma} \right\} \]

for all $\|x_0 - x^*\| < \delta$. Then, after setting $\alpha := \|[f'(x_0)]^{-1} f(x_0)\|$ we have
the inequalities

\[ q = \alpha\beta\gamma \le \beta^2 \gamma \|f(x_0)\| < \frac{1}{2} \]
and
\[ 2\alpha \le 2\beta \|f(x_0)\| < \frac{\rho}{2}. \]

Hence for the open and convex ball $B(x^*, \rho)$ and for each $x_0$ with
$\|x_0 - x^*\| < \delta$ the assumptions of Theorem 6.14 are satisfied. □

Corollary 6.16 Let $f : (a, b) \to \mathbb{R}$ be twice continuously differentiable
and assume that $x^*$ is a simple zero of $f$. Then Newton's method is locally
convergent.

Proof. For simple zeros we have $f'(x^*) \ne 0$. □


Example 6.17 For the function $f(x) := x - \cos x$ the Newton iteration
reads
\[ x_{\nu+1} := x_\nu - \frac{x_\nu - \cos x_\nu}{1 + \sin x_\nu} \]
and leads to the numerical values of Table 6.2. □

TABLE 6.2. Newton iterations for Example 6.17
TABLE 6.2. Newton iterations for Example 6.17

$\nu$   $x_\nu$

0 1.00000000
1 0.75036387
2 0.73911289
3 0.73908513
4 0.73908513

Example 6.18 For the function $f(x) := x - e^{-x}$ the Newton iteration
reads
\[ x_{\nu+1} := x_\nu - \frac{x_\nu - e^{-x_\nu}}{1 + e^{-x_\nu}} \]
and leads to the numerical values of Table 6.3. □

TABLE 6.3. Newton iterations for Example 6.18

$\nu$   $x_\nu$

0 1.00000000
1 0.53788284
2 0.56698699
3 0.56714329
4 0.56714329

In both examples we observe that the speed of convergence is consider-


ably improved as compared with the simple successive approximations of
Examples 6.5 and 6.6. For a general description of this more rapid conver-
gence of Newton's method we need the following definition.
Definition 6.19 A convergent sequence $(x_\nu)$ from a normed space with
limit $x$ is said to be convergent of order $p \ge 1$ if there exists a constant
$C > 0$ such that

\[ \|x_{\nu+1} - x\| \le C \|x_\nu - x\|^p, \quad \nu = 1, 2, \ldots. \]

Convergence of order one or two is also called linear or quadratic conver-


gence, respectively. We note that the convergence in Banach's fixed point
Theorem 3.45 is, in general, linear.
Theorem 6.20 Under the assumptions of Theorem 6.14 Newton's method
converges quadratically.

Proof. Using condition (b) of Theorem 6.14 and the inequality (6.5) we can
estimate
\[ \begin{aligned}
\|x^* - x_{\nu+1}\| &= \|x^* - x_\nu + [f'(x_\nu)]^{-1} f(x_\nu)\| \\
 &\le \|[f'(x_\nu)]^{-1}\|\, \|f(x^*) - f(x_\nu) - f'(x_\nu)(x^* - x_\nu)\| \\
 &\le \frac{\beta\gamma}{2}\, \|x^* - x_\nu\|^2,
\end{aligned} \]
since $f(x^*) = 0$. □
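
The quadratic error reduction can be observed numerically by monitoring the quotients $|x_{\nu+1} - x^*| / |x_\nu - x^*|^2$, which should stay bounded. The Python sketch below does this for $f(x) = x - e^{-x}$; the reference value for $x^*$ is obtained by simply running the Newton iteration much longer, which is an assumption of the experiment rather than part of the theory.

```python
import math

def f(x):
    return x - math.exp(-x)

def df(x):
    return 1.0 + math.exp(-x)

# Reference zero: run the Newton iteration well past convergence.
x_star = 1.0
for _ in range(50):
    x_star -= f(x_star) / df(x_star)

# Errors of the first iterates and the quotients e_{v+1} / e_v^2.
x, errors = 1.0, []
for _ in range(4):
    errors.append(abs(x - x_star))
    x -= f(x) / df(x)
ratios = [errors[v + 1] / errors[v] ** 2 for v in range(3)]
print(errors)
print(ratios)
```

In this experiment the quotients settle near $|f''(x^*)| / (2|f'(x^*)|)$, the constant one expects from a Taylor expansion for a simple zero.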

Roughly speaking, the quadratic convergence of Newton's method means


that the number of correct digits in the numerical approximation is doubled
in each iteration step, as observed in Examples 6.3, 6.4, 6.17, and 6.18.
Although by this property Newton's method is very attractive, it has to be
observed that one step of the Newton iteration for nonlinear systems can be
very costly both through the need for evaluating the entries of the Jacobian
f'(x v ) and through the cost of solving the linear system to arrive at the
new iteration Xv+l. Therefore, a great variety of modifications of Newton's
method have been developed that mitigate, in particular, the first difficulty.
These modified Newton methods, in general, are of the form

\[ x_{\nu+1} := x_\nu - A_\nu f(x_\nu), \quad \nu = 0, 1, \ldots; \]

i.e., the inverse $[f'(x_\nu)]^{-1}$ of the Jacobian is replaced by some
approximating matrix $A_\nu$. Here we will only briefly mention two classical and simple
possibilities for avoiding the evaluation of the Jacobian at each iteration
step.
In the simplified, or frozen, Newton method, for all steps the matrix $A_\nu$
is kept the same and chosen as the inverse of the Jacobian for the starting
point; i.e., the iteration scheme is

\[ x_{\nu+1} := x_\nu - [f'(x_0)]^{-1} f(x_\nu), \quad \nu = 0, 1, \ldots. \]

Geometrically, in the one-dimensional case this means that the tangent line
of $f$ at $x_\nu$ is replaced by the parallel to the tangent line of $f$ at $x_0$ passing
through $(x_\nu, f(x_\nu))$.
Theorem 6.21 Under the assumptions of Theorem 6.14 the simplified
Newton method converges linearly to the unique zero of $f$ in $B[x_0, r]$.
Proof. Recall that the function

\[ g(x) := x - [f'(x_0)]^{-1} f(x) \]

defined in the proof of Theorem 6.14 is a contraction. We show that $g$ maps
$B[x_0, r]$ into itself. For this we write

\[ x_0 - g(x) = [f'(x_0)]^{-1} \{f(x) - f(x_0) - f'(x_0)(x - x_0) + f(x_0)\}. \]

Then estimating with the help of conditions (b), (c) and (d) and the
inequality (6.5) we obtain

\[ \|g(x) - x_0\| \le \frac{\beta\gamma}{2}\, \|x - x_0\|^2 + \alpha \le 2\alpha^2 \beta\gamma + \alpha = (2q + 1)\alpha < 2\alpha = r \]

for all $x$ with $\|x - x_0\| \le r$. Now the statement of the theorem follows from
the Banach fixed point Theorem 3.46. □
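
To compare the simplified method with the full Newton iteration in one dimension, the following Python sketch counts the steps both need for $f(x) = x - \cos x$ with $x_0 = 1$. The test function, the stopping rule based on the step size, and the names are assumptions; the hypotheses of Theorem 6.14 are not verified here.

```python
import math

def iterate(step, x0, tol=1e-12, max_iter=100):
    """Apply x -> step(x) until two successive iterates differ by at most tol."""
    x = x0
    for count in range(1, max_iter + 1):
        x_new = step(x)
        if abs(x_new - x) <= tol:
            return x_new, count
        x = x_new
    raise RuntimeError("no convergence")

f = lambda x: x - math.cos(x)
df = lambda x: 1.0 + math.sin(x)

x0 = 1.0
d0 = df(x0)     # derivative frozen at the starting point
newton_root, newton_steps = iterate(lambda x: x - f(x) / df(x), x0)
frozen_root, frozen_steps = iterate(lambda x: x - f(x) / d0, x0)
print(newton_steps, frozen_steps)
```

Each step of the simplified method is cheaper, since no derivative is evaluated, but more steps are required for the same accuracy.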

In the secant method for a function of one variable the derivative $f'(x_\nu)$
is approximated by the difference quotient and the corresponding iterative
scheme is given by
\[ x_{\nu+1} := x_\nu - \frac{x_\nu - x_{\nu-1}}{f(x_\nu) - f(x_{\nu-1})}\, f(x_\nu), \quad \nu = 1, 2, \ldots. \tag{6.9} \]

Geometrically, this means that the tangent line at $x_\nu$ is replaced by the
secant line through the two points $x_\nu$ and $x_{\nu-1}$. Obviously, this method
needs two initial elements $x_0$ and $x_1$. Generalizations to functions in $\mathbb{R}^n$
are possible (see [47]).
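
A small Python sketch of the secant iteration (6.9) follows. The test function $f(x) = x - \cos x$, the two starting values, and the stopping rule are assumptions made for illustration; only one new function value is computed per step, and no derivative is needed.

```python
import math

def secant(f, x0, x1, tol=1e-12, max_iter=50):
    """Secant method started from the two initial elements x0 and x1."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("difference quotient is zero")
        x2 = x1 - (x1 - x0) / (f1 - f0) * f1
        if abs(x2 - x1) <= tol:
            return x2
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    raise RuntimeError("no convergence")

root = secant(lambda x: x - math.cos(x), 0.0, 1.0)
print(f"{root:.8f}")
```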
In general, for the simplified Newton method and for the secant method
we can expect only linear convergence. The idea underlying the more
sophisticated modified Newton methods is to choose the approximating
matrices $A_\nu$ in a manner leading to an improvement over linear convergence
without requiring the computational costs of the full Newton method. In
the so-called rank one methods suggested by Broyden in 1965, in each
iteration step the matrix $A_\nu$ is updated from the previous matrix $A_{\nu-1}$ by
adding only a matrix of rank one such that the resulting iteration scheme is
superlinearly convergent. Roughly speaking, the latter means that for the
sequence $x_\nu \to x$, $\nu \to \infty$, we have that

\[ \|x_{\nu+1} - x\| \le C_\nu \|x_\nu - x\|, \quad \nu = 1, 2, \ldots, \]

such that $C_\nu \to 0$, $\nu \to \infty$. For details we refer to the literature (see
[20,47]).

6.3 Zeros of Polynomials


In this section we shall apply Newton's method to the computation of the
zeros of polynomials. Finding the zeros of polynomials is a classical problem
in mathematics and numerical analysis despite the fact that it very seldom
occurs in applications. We first observe that Newton's method also works
for a complex function of a complex variable, allowing the computation of
complex zeros.
Consider the polynomial

\[ p(x) = a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n \]

with real or complex coefficients $a_0, a_1, \ldots, a_n$. For the application of
Newton's method, in each iteration step we need to compute the values of $p$ and
$p'$ at the point $x_\nu$. This can be effectively done by the Horner scheme. This
is based on writing the polynomial in the form of nested multiplications

\[ p(x) = (\cdots((a_0 x + a_1) x + a_2) x + \cdots + a_{n-1}) x + a_n, \]

which suggests the recursion

\[ b_m = b_{m-1} z + a_m, \quad m = 1, \ldots, n, \tag{6.10} \]

starting with $b_0 = a_0$. Performing these $n$ multiplications and additions,
we arrive at the value of the polynomial $p(z) = b_n$.
For the polynomial

\[ p_1(x) := b_0 x^{n-1} + b_1 x^{n-2} + b_2 x^{n-3} + \cdots + b_{n-2} x + b_{n-1}, \]
using (6.10) we compute

\[ p_1(x)(x - z) + b_n = \sum_{m=0}^{n-1} b_m x^{n-1-m} (x - z) + b_n = \sum_{m=0}^{n} a_m x^{n-m} = p(x). \]
This implies that for a zero $z$ the Horner scheme provides the coefficients
of the polynomial obtained by dividing $p$ by the linear factor $x - z$. In
addition, we have that

\[ p'(x) = p_1'(x)(x - z) + p_1(x), \tag{6.11} \]

and in particular,
\[ p'(z) = p_1(z). \]
Hence, applying the Horner recursion to the polynomial $p_1$ yields the value
of the derivative $p'(z)$. By repeating this process recursively, we can
determine all the derivatives of $p$ at the point $z$, since by induction, from (6.11)
we obtain that

\[ p^{(k)}(x) = p_1^{(k)}(x)(x - z) + k\, p_1^{(k-1)}(x), \]

whence

\[ p^{(k)}(z) = k\, p_1^{(k-1)}(z) \]

follows for $k = 1, \ldots, n$. Therefore, defining recursively polynomials $p_k$ of
degree $n - k$ by applying the Horner scheme to the preceding polynomial
$p_{k-1}$ leads to
\[ p^{(k)}(z) = k!\, p_k(z), \quad k = 1, \ldots, n. \]
We can summarize this in the following theorem.

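Theorem 6.22 Let

\[ p(x) = a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n \]

be a polynomial of degree $n$. For $z \in \mathbb{C}$ the complete Horner scheme

\[ \begin{array}{ccccccc}
  & a_0 & a_1 & a_2 & \cdots & a_{n-1} & a_n \\
z & b_0 & b_1 & b_2 & \cdots & b_{n-1} & b_n \\
z & b_0' & b_1' & b_2' & \cdots & b_{n-1}' & \\
z & b_0'' & b_1'' & b_2'' & \cdots\ b_{n-2}'' & & \\
\vdots & & & & & & \\
z & b_0^{(n-1)} & b_1^{(n-1)} & & & & \\
z & b_0^{(n)} & & & & &
\end{array} \]

contains the derivatives

\[ b_{n-k}^{(k)} = \frac{p^{(k)}(z)}{k!}, \quad k = 0, 1, \ldots, n, \]

of the polynomial $p$ at the point $z$. The scheme is recursively defined by
$b_m^{(-1)} := a_m$, $m = 0, \ldots, n$, and
\[ \begin{aligned}
b_0^{(k)} &:= b_0^{(k-1)}, \\
b_m^{(k)} &:= z\, b_{m-1}^{(k)} + b_m^{(k-1)}, \quad m = 1, \ldots, n - k,
\end{aligned} \]
for $k = 0, \ldots, n$.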
Example 6.23 For the polynomial $p(x) := x^3 - x^2 + 3x - 5$ the Horner
scheme
z 1 -1 3 -5
2 1 1 5 5
2 1 3 11
2 1 5
2 1

for z = 2 leads to p(2) = 5, p'(2) = 11, p"(2) = 10, p"'(2) = 6. o
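
The complete Horner scheme translates directly into a short routine. The following Python sketch (the function name is an assumption) performs one Horner pass per row, reads off $p^{(k)}(z)/k!$ as the last entry of the row, and continues with the remaining entries as coefficients of the next polynomial.

```python
from math import factorial

def complete_horner(a, z):
    """a = [a_0, ..., a_n]; return [p(z), p'(z), ..., p^(n)(z)]."""
    row = list(a)
    derivatives = []
    for k in range(len(a)):
        # One Horner pass: row[m] becomes z * row[m-1] + row[m].
        for m in range(1, len(row)):
            row[m] = row[m - 1] * z + row[m]
        derivatives.append(factorial(k) * row[-1])   # last entry is p^(k)(z) / k!
        row = row[:-1]                               # coefficients of the next polynomial
    return derivatives

print(complete_horner([1, -1, 3, -5], 2))   # [5, 11, 10, 6]
```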


We continue by outlining how to compute all the zeros of a polynomial
p of degree n with real coefficients. We first assume that p has only simple
real zeros and proceed as follows:
1. Either from analytic considerations or by plotting a graph of the polynomial we obtain a rough estimate of the location of the zeros $z_n < z_{n-1} < \cdots < z_2 < z_1$.
2. Starting with some $x_0 > z_1$, by Newton iteration we compute the largest zero $z_1$. The global convergence of Newton's method in this case follows from monotonicity arguments (see Problem 6.13).
3. By the Horner scheme we divide $p$ by the linear factor $x - z_1$ and carry out step two for the reduced polynomial to compute $z_2$. Repeating this procedure, we successively obtain approximations for all zeros.
4. In order to improve the accuracy, for all zeros Newton's method is applied to the full polynomial $p$ with the starting points of the iteration given by the approximations obtained in step three.
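
Steps two to four can be sketched as follows. This is a minimal illustration, not a robust root finder: the function names, the stopping tolerance, and the use of the bound $1 + \max_m |a_m / a_0|$ as a starting point to the right of all zeros are assumptions made for this sketch, and the polynomial is assumed to have only simple real zeros.

```python
def horner(a, z):
    """Return (coefficients of p divided by (x - z), value p(z))
    for a = [a_0, ..., a_n], highest power first."""
    b = [a[0]]
    for coeff in a[1:]:
        b.append(z * b[-1] + coeff)
    return b[:-1], b[-1]


def newton(a, x, tol=1e-12, max_iter=200):
    """Newton iteration for the polynomial with coefficients a."""
    for _ in range(max_iter):
        p1, px = horner(a, x)
        _, dpx = horner(p1, x)      # p'(x) = p_1(x), see (6.11)
        if dpx == 0:
            break
        step = px / dpx
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            break
    return x


def real_simple_zeros(a):
    """Zeros z_1 > z_2 > ... of a polynomial with simple real zeros."""
    # Assumed starting point to the right of all zeros.
    start = 1.0 + max(abs(c / a[0]) for c in a[1:])
    zeros, reduced = [], list(a)
    while len(reduced) > 1:
        z = newton(reduced, start)          # step 2 on the reduced polynomial
        zeros.append(z)
        reduced, _ = horner(reduced, z)     # step 3: divide by (x - z)
        start = z                           # remaining zeros lie to the left
    return [newton(a, z) for z in zeros]    # step 4: polish on the full p


# (x - 1)(x - 2)(x - 3) = x^3 - 6x^2 + 11x - 6
print(real_simple_zeros([1.0, -6.0, 11.0, -6.0]))
```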

Now we consider the case of multiple real zeros. If $z$ is a zero of order $m$, then we can write

$$p(x) = (x - z)^m q(x), \tag{6.12}$$

where the polynomial $q$ of degree $n - m$ has a value $q(z) \neq 0$. To see the effect of (6.12) on Newton's method we consider it as a fixed-point iteration $x_{\nu+1} := g(x_\nu)$ with $g$ defined by

$$g(x) := x - \frac{p(x)}{p'(x)}.$$

Using (6.12), by elementary differentiation we obtain

$$g'(z) = 1 - \frac{1}{m}.$$

Therefore, by Theorem 6.2, at a multiple zero Newton's method is locally convergent. Since $g'(z) \neq 0$ for $m \geq 2$, the convergence at a multiple zero is only linear. However, one can modify Newton's method for multiple zeros such that the quadratic convergence is preserved (see Problem 6.14).
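
For the reader who wants the differentiation step spelled out, the following short derivation is one way to arrive at the value of $g'(z)$. It uses only (6.12) and the definition of $g$; the auxiliary function $r$ is introduced here for convenience. From (6.12) we have

$$p'(x) = (x - z)^{m-1}\,[\,m\,q(x) + (x - z)\,q'(x)\,],$$

and therefore, for $x \neq z$ close to $z$,

$$g(x) = x - (x - z)\,r(x), \qquad r(x) := \frac{q(x)}{m\,q(x) + (x - z)\,q'(x)}.$$

Since $q(z) \neq 0$, the function $r$ is differentiable in a neighborhood of $z$ with $r(z) = 1/m$, and the product rule yields

$$g'(z) = 1 - r(z) = 1 - \frac{1}{m}.$$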
For finding complex zeros, in principle one can apply Newton's method in $\mathbb{C}$. For this one has to keep in mind that for polynomials with real coefficients, the starting values need to be complex, since otherwise Newton's method would produce only real approximations. For the conjugate complex zeros of a polynomial with real coefficients Bairstow's method avoids working in the complex plane by using the fact that for two conjugate zeros, the product of the linear factors $(x - z)(x - \bar{z})$ is a polynomial of degree two with real coefficients. The basic idea is to write the polynomial $p$ of degree $n$ in the form

$$p(x) = (x^2 - ux - v)\,q(x) + a(x - u) + b,$$

where $q$ is a polynomial of degree $n - 2$, and $a$ and $b$ are constants depending on $u, v \in \mathbb{R}$. The factor $x^2 - ux - v$ corresponds to two conjugate complex zeros of $p$ if the pair $u, v$ solves the nonlinear system $a(u, v) = 0$, $b(u, v) = 0$. The latter can be solved by Newton's method, and once the solution $u, v$ is known, the two zeros of $p$ are obtained by solving the quadratic equation $x^2 - ux - v = 0$.
We conclude this section with some consideration of the question of stability. In particular, we show that the zeros of polynomials can be quite sensitive to small changes in the coefficients even if all the zeros are simple and well separated from each other.
Let $p$ and $q$ be polynomials of degree $n$ and assume that $z_0$ is a simple zero of $p$. Consider the perturbed polynomial

$$p(\cdot\,, \varepsilon) := p + \varepsilon q,$$

where $\varepsilon$ is small. Using the theory of functions of a complex variable, it can be shown that in a neighborhood of $\varepsilon = 0$ the zero $z(\varepsilon)$ depends analytically on the parameter $\varepsilon$. The derivative $z'$ can be obtained by differentiating $p[z(\varepsilon), \varepsilon] = 0$ with respect to $\varepsilon$. This yields

$$\{p'[z(\varepsilon)] + \varepsilon q'[z(\varepsilon)]\}\, z'(\varepsilon) + q[z(\varepsilon)] = 0,$$

and setting $\varepsilon = 0$, it follows that

$$z'(0) = -\frac{q(z_0)}{p'(z_0)}.$$

Hence, for small $\varepsilon$ we have that

$$z(\varepsilon) \approx z_0 - \varepsilon\, \frac{q(z_0)}{p'(z_0)}. \tag{6.13}$$
Example 6.24 The polynomial

$$p(x) := (x - 1)(x - 2) \cdots (x - 10) = x^{10} - 55x^9 + \cdots + 10!$$

has the zeros $1, 2, \ldots, 10$, which are well separated from each other. We perturb the coefficient of $x^9$ by choosing $q(x) := 55x^9$. Since $p'(10) = 9!$, by (6.13), the zero $z_0 = 10$ of the polynomial $p$ is perturbed into

$$10 - \frac{55 \cdot 10^9}{9!}\, \varepsilon \approx 10 - 1.5 \cdot 10^5\, \varepsilon.$$

This illustrates that finding the zeros of $p$ is an ill-conditioned problem and that a reliable approximation of the zeros is impossible. □
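
The first-order estimate (6.13) for this example can be checked numerically. The following sketch is an illustration only: the value $\varepsilon = 10^{-8}$, the helper names, and the use of a fixed number of Newton steps started at $x = 10$ are assumptions made here.

```python
from math import factorial


def product_coefficients(n):
    """Integer coefficients of (x - 1)(x - 2)...(x - n), highest power first."""
    c = [1]
    for r in range(1, n + 1):
        # Multiply the current polynomial by (x - r).
        c = [hi - r * lo for hi, lo in zip(c + [0], [0] + c)]
    return c


def value_and_derivative(a, x):
    """Evaluate p(x) and p'(x) by a combined Horner recursion."""
    p, dp = a[0], 0.0
    for coeff in a[1:]:
        dp = dp * x + p
        p = p * x + coeff
    return p, dp


eps = 1e-8                                   # assumed size of the perturbation
a = [float(c) for c in product_coefficients(10)]
a[1] += eps * 55.0                           # p + eps * q with q(x) = 55 x^9

z = 10.0                                     # start at the unperturbed zero
for _ in range(50):
    p, dp = value_and_derivative(a, z)
    z -= p / dp

predicted = 10.0 - eps * 55.0 * 10**9 / factorial(9)   # estimate (6.13)
print("computed zero :", z)
print("estimate (6.13):", predicted)
```

A relative change of size $10^{-8}$ in one coefficient thus moves the zero in the third decimal place, which is the amplification by roughly $1.5 \cdot 10^5$ stated in the example.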

6.4 Least Squares Problems


Quite often the problem of solving a system of nonlinear equations may
be replaced by an equivalent problem of minimizing a function and vice
versa. We illustrate this by introducing the Levenberg-Marquardt method
as one of the most effective procedures for solving nonlinear least squares
problems.
Let $g : \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function and consider the problem of minimizing $g$. Let $x_0$ be an approximation for a local minimum of $g$. In a neighborhood of $x_0$, by Taylor's formula we may approximate

$$g(x) \approx g(x_0) + (x - x_0)^T \operatorname{grad} g(x_0) + \tfrac{1}{2}\,(x - x_0)^T g''(x_0)(x - x_0), \tag{6.14}$$

where

$$g''(x) = \left( \frac{\partial^2 g}{\partial x_j\, \partial x_k}(x) \right)_{j,k = 1, \ldots, n}$$

denotes the Hessian matrix of $g$. Minimizing the quadratic function on the right-hand side of (6.14) yields

$$x_1 = x_0 - [g''(x_0)]^{-1} \operatorname{grad} g(x_0) \tag{6.15}$$

as a new approximation for the minimum of $g$. We observe that (6.15) coincides with one Newton step for solving the necessary condition $\operatorname{grad} g(x) = 0$ for a local minimum.
However, if (6.14) is only a very poor approximation to $g$, then we expect the Newton step (6.15) not to be very effective. In this case it is more appropriate to use a so-called method of steepest descent; i.e., choose

$$x_1 = x_0 - \lambda M \operatorname{grad} g(x_0) \tag{6.16}$$

as a new approximation. Here $M$ is a positive definite matrix, and the step size $\lambda > 0$ is chosen such that $g(x_1) < g(x_0)$ is satisfied. This can be achieved, since by Taylor's formula we have that

$$g[x_0 - \lambda M \operatorname{grad} g(x_0)] \approx g(x_0) - \lambda\, [\operatorname{grad} g(x_0)]^T M \operatorname{grad} g(x_0)$$

and $M$ is assumed to be positive definite.


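After introducing the vector $y \in \mathbb{R}^n$ and the $n \times n$ matrix $A = (a_{jk})$ by

$$a_{jk}(x) := \frac{\partial^2 g}{\partial x_j\, \partial x_k}(x), \qquad y_j(x) := -\frac{\partial g}{\partial x_j}(x), \tag{6.17}$$

we can rewrite the Newton iteration (6.15) as the linear system

$$A(x_0)(x_1 - x_0) = y, \tag{6.18}$$

which we have to solve for the difference $x_1 - x_0$. Similarly, one step of the steepest descent (6.16) can be transformed into

$$x_1 - x_0 = \lambda M y. \tag{6.19}$$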

Now recall the least squares problem of Example 2.4. In a slight reformulation, this problem consists in minimizing the function

$$g(x) := \sum_{i=1}^{m} [f_i(x) - u_i]^2$$

over some domain $D$, where the $f_i : D \to \mathbb{R}$ are given functions and the $u_i \in \mathbb{R}$ are given constants for $i = 1, \ldots, m$. We compute the derivatives

$$\frac{\partial g}{\partial x_j}(x) = 2 \sum_{i=1}^{m} [f_i(x) - u_i]\, \frac{\partial f_i}{\partial x_j}(x)$$

and

$$\frac{\partial^2 g}{\partial x_j\, \partial x_k}(x) = 2 \sum_{i=1}^{m} \left\{ \frac{\partial f_i}{\partial x_j}(x)\, \frac{\partial f_i}{\partial x_k}(x) + [f_i(x) - u_i]\, \frac{\partial^2 f_i}{\partial x_j\, \partial x_k}(x) \right\}.$$

In this case the matrix $(a_{jk})$ contains second derivatives of the functions $f_i$. However, since these derivatives are multiplied by the factor $[f_i(x) - u_i]$, which will become small by minimizing $g$, it is justified to neglect this term. Note that if Newton's method converges, it always will converge to a zero, even if we do not use the exact Jacobian for the computation, provided that the approximate Jacobian at the limit is nonsingular. Hence, we simplify and replace (6.17) by

$$a_{jk}(x) := 2 \sum_{i=1}^{m} \frac{\partial f_i}{\partial x_j}(x)\, \frac{\partial f_i}{\partial x_k}(x) \tag{6.20}$$

and note that $a_{jj}(x) > 0$.
Now the Levenberg-Marquardt method combines (6.18) and (6.19) by first introducing the $n \times n$ matrix $\tilde{A} = (\tilde{a}_{jk})$ with entries

$$\tilde{a}_{jj} = (1 + \gamma)\, a_{jj}, \qquad \tilde{a}_{jk} = a_{jk}, \quad j \neq k,$$

where $\gamma$ is some positive parameter, and then replacing (6.18) and (6.19) by

$$\tilde{A}(x_0)(x_1 - x_0) = y. \tag{6.21}$$

Obviously, for large $\gamma$ the matrix $\tilde{A}$ will become diagonally dominant, and (6.21) will get close to the steepest descent, with

$$M = \operatorname{diag}\left( \frac{1}{a_{11}}, \ldots, \frac{1}{a_{nn}} \right)$$

and $\lambda = 1/\gamma$. For $\gamma \to 0$, on the other hand, (6.21) will turn into the Newton step (6.18). This ability to gradually vary between Newton's method and the steepest descent method is one of the basic features of the Levenberg-Marquardt method, which we describe as follows:
1. Choose an initial guess $x_0$, some moderately sized value for $\gamma$, and a factor $\alpha$, say $\gamma = 0.001$ and $\alpha = 10$.
2. Solve the linear system (6.21) to obtain $x_1$.
3. If $g(x_1) > g(x_0)$, then reject $x_1$ as a new approximation, replace $\gamma$ by $\alpha\gamma$, and go back and repeat step two.
4. If $g(x_1) < g(x_0)$, then accept $x_1$ as a new approximation, replace $x_0$ by $x_1$ and $\gamma$ by $\gamma/\alpha$, and go back to step two.
5. Terminate when the difference $|g(x_1) - g(x_0)|$ is smaller than some given tolerance.
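
The five steps can be sketched in plain Python as follows. This is an illustration under stated assumptions and not a production implementation: the function names, the small Gaussian elimination routine, the safeguard that stops when $\gamma$ exceeds $10^{12}$, and the test problem (fitting $x_1 e^{x_2 t}$ to exact data) are all choices made here. The matrix is built from (6.20) and the right-hand side from (6.17).

```python
import math


def solve(A, y):
    """Solve A d = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [A[i][:] + [y[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    d = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * d[k] for k in range(r + 1, n))
        d[r] = (M[r][n] - s) / M[r][r]
    return d


def levenberg_marquardt(f, jac, u, x0, gamma=1e-3, alpha=10.0,
                        tol=1e-14, max_iter=200):
    """Minimize g(x) = sum_i (f_i(x) - u_i)^2 following steps 1 to 5."""
    def g(x):
        return sum((fi - ui) ** 2 for fi, ui in zip(f(x), u))

    x, n = list(x0), len(x0)
    gx = g(x)
    for _ in range(max_iter):
        res = [fi - ui for fi, ui in zip(f(x), u)]
        J = jac(x)                                   # J[i][j] = d f_i / d x_j
        y = [-2.0 * sum(r * row[j] for r, row in zip(res, J))
             for j in range(n)]                      # y_j = -dg/dx_j
        A = [[2.0 * sum(row[j] * row[k] for row in J) for k in range(n)]
             for j in range(n)]                      # entries (6.20)
        for j in range(n):
            A[j][j] *= 1.0 + gamma                   # modified diagonal
        d = solve(A, y)                              # system (6.21)
        x1 = [xj + dj for xj, dj in zip(x, d)]
        g1 = g(x1)
        if g1 < gx:                                  # step 4: accept
            done = gx - g1 < tol                     # step 5
            x, gx = x1, g1
            gamma /= alpha
            if done:
                break
        else:                                        # step 3: reject
            gamma *= alpha
            if gamma > 1e12:                         # assumed safeguard
                break
    return x


# Hypothetical test problem: fit x[0] * exp(x[1] * t) to exact data.
t = [0.0, 0.5, 1.0, 1.5, 2.0]
u = [2.0 * math.exp(-ti) for ti in t]
model = lambda x: [x[0] * math.exp(x[1] * ti) for ti in t]
jacobian = lambda x: [[math.exp(x[1] * ti), x[0] * ti * math.exp(x[1] * ti)]
                      for ti in t]
fit = levenberg_marquardt(model, jacobian, u, [1.0, 0.0])
print(fit)
```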
For a detailed analysis of this method we refer to [44]. For a study of
nonlinear optimization methods and their relation to nonlinear systems we
refer to [20].

Problems
6.1 Prove Brouwer's fixed point theorem in $\mathbb{R}$; i.e., show that if $D \subset \mathbb{R}$ is a closed and bounded interval and if $f : D \to D$ is continuous, then $f$ has a (not necessarily unique) fixed point.

6.2 Draw figures illustrating monotone or alternating divergence of the successive iterations for a fixed point of a function of one variable.

6.3 Show how to solve the equation $\tan x = x$ by successive approximations.

6.4 Show that

v square roots

6.5 Let $D \subset \mathbb{R}$ be an open interval and let $f : D \to D$ be $m$ times continuously differentiable. Under the assumption that the sequence $x_{\nu+1} := f(x_\nu)$ converges to some $x$ in $D$ with $f'(x) = f''(x) = \cdots = f^{(m-1)}(x) = 0$, show that the convergence is of order $m$.

6.6 Let the sequence $(x_\nu)$ in $\mathbb{R}$ converge to $x$ such that $x_\nu \neq x$ for all $\nu \in \mathbb{N}$ and

$$x_{\nu+1} - x = (q + \varepsilon_\nu)(x_\nu - x), \quad \nu = 0, 1, \ldots,$$

where $|q| < 1$ and $\varepsilon_\nu \to 0$, $\nu \to \infty$. Show that

$$y_\nu := x_\nu - \frac{(x_{\nu+1} - x_\nu)^2}{x_{\nu+2} - 2x_{\nu+1} + x_\nu}$$

is well-defined for sufficiently large $\nu$ and that

$$\lim_{\nu \to \infty} \frac{y_\nu - x}{x_\nu - x} = 0;$$

i.e., the sequence $(y_\nu)$ converges to $x$ more rapidly than the sequence $(x_\nu)$. This method for speeding up the convergence of sequences is known as Aitken's $\delta^2$ method.

6.7 Let $D \subset \mathbb{R}$ be an open interval, let $f : D \to \mathbb{R}$ be twice continuously differentiable, and let $x$ be a fixed point of $f$ with $f'(x) \neq 1$. Show that Steffensen's method

$$x_{\nu+1} := \frac{x_\nu (x_\nu^2 + 3a)}{3x_\nu^2 + a}, \quad \nu = 0, 1, \ldots,$$

is a method of order three for computing the square root of a positive number $a$.

6.10 Prove an analogue of Corollary 6.16 for the secant method (6.9).

6.11 Give conditions for monotone convergence of Newton's method for a func-
tion of one variable.

6.12 Show that Newton's method for the function $f(x) := x^n - a$, $x > 0$, where $n > 1$ and $a > 0$, converges globally to $a^{1/n}$.

6.13 Assume that the polynomial $p$ with real coefficients has only real zeros and denote the largest zero by $z_1$. Show that for any initial point $x_0$ with $x_0 > z_1$ Newton's method converges to $z_1$.

6.14 Assume that $z$ is a zero of order $m$ of the polynomial $p$. Show that

$$x_{\nu+1} := x_\nu - m\, \frac{p(x_\nu)}{p'(x_\nu)}, \quad \nu = 0, 1, \ldots,$$

converges locally and quadratically to the zero $z$.

6.15 Show that for a nonsingular $n \times n$ matrix $A$ the sequence

$$A_{\nu+1} := A_\nu [2I - A A_\nu], \quad \nu = 0, 1, \ldots,$$

converges quadratically to the inverse $A^{-1}$, provided that $\|I - AA_0\| < 1$.

6.16 Write a computer program for finding $n$ simple zeros of a polynomial of degree $n$ with real coefficients. Use this code for the computation of the zeros of the Laguerre polynomial $L_4(x) = x^4 - 16x^3 + 72x^2 - 96x + 24$.

6.17 Show that for the function $f : (0, \infty) \to \mathbb{R}$ given by

$$f(x) := \frac{\ln 2}{\pi}\, \sin\left( 2\pi\, \frac{\ln x}{\ln 2} \right) + 1$$

the Newton iterations starting with $x_0 = 1$ converge and that the limit, however, is not a zero of $f$.

6.18 The eigenvalue problem $Ax = \lambda x$ for an $n \times n$ matrix $A$ is equivalent to the equation $f(z) = 0$, where $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n \times \mathbb{R}$ is defined by

Write down Newton's method for this equation.

6.19 Write a computer program for solving a least squares problem by the
Levenberg-Marquardt method.

6.20 The set of all points $\zeta \in \mathbb{C}$ for which the fixed point iteration $z_{\nu+1} := z_\nu^2 + \zeta$ starting with $z_0 = 0$ remains bounded is called the Mandelbrot set. Write a computer program for visualizing the Mandelbrot set.
7
Matrix Eigenvalue Problems

Many problems in science and engineering lead to eigenvalue problems for


matrices. These occur either directly or by discretization of eigenvalue prob-
lems for differential or integral operators. In the latter case the size of the
matrices will be rather large. It is the purpose of this chapter to intro-
duce some of the main ideas in matrix eigenvalue computations without
attempting to be comprehensive. For a more detailed study we refer to
[27,65].
For the numerical computation of matrix eigenvalues we have to distin-
guish between two groups of methods:
1. In the so-called direct methods the eigenvalues are obtained as zeros
of the characteristic polynomial.
2. In contrast, iterative methods approximate the eigenvalues through a
successive approximation procedure without using the characteristic
polynomial.
Since, as illustrated in Example 6.24, the computation of zeros of poly-
nomials of high degree tends in general to be ill-conditioned, in practice it-
erative methods are used almost exclusively. In this chapter we will discuss
the two most important methods of this class, namely the Jacobi method
and the QR algorithm. In the last section we will also briefly describe the
Hessenberg method as an example of a direct method.
A key factor in all eigenvalue computations is the fact that similarity transformations leave the eigenvalues of a matrix invariant; i.e., for a given matrix $A$ the matrices $A$ and $C^{-1}AC$ have the same eigenvalues for all nonsingular matrices $C$. This can be seen either from the equivalence of the equations

$$Ax = \lambda x \quad \text{and} \quad C^{-1}AC\,(C^{-1}x) = \lambda\, C^{-1}x$$

or from the multiplication theorem for determinants

$$\det(\lambda I - A) = \det[C^{-1}(\lambda I - A)C] = \det(\lambda I - C^{-1}AC);$$

i.e., similar matrices have the same characteristic polynomial. This invariance allows one to transform a given matrix $A$ by a similarity transformation into a matrix of simpler form with the same eigenvalues as $A$. In particular, the iterative methods successively construct sequences of similar matrices that converge to a diagonal matrix or an upper (or lower) triangular matrix from which the eigenvalues can be read off as the diagonal elements.

7.1 Examples
We begin by illustrating how the discretization of eigenvalue problems for
differential operators leads to eigenvalue problems for large matrices.

Example 7.1 The vibrations of a string are modeled by the so-called wave equation

$$\frac{\partial^2 w}{\partial x^2} = \frac{1}{c^2}\, \frac{\partial^2 w}{\partial t^2},$$

where $w = w(x, t)$ denotes the vertical elongation and $c$ is the speed of sound in the string. Assuming that the string is clamped at $x = 0$ and $x = 1$, the boundary conditions $w(0, t) = w(1, t) = 0$ must be satisfied for all times $t$. Obviously, the time-harmonic wave

$$w(x, t) = v(x)\, e^{i\omega t}$$

with frequency $\omega$ solves the wave equation, provided that the space-dependent part $v$ satisfies

$$-v'' = \lambda v \quad \text{on } [0, 1],$$

where $\lambda := \omega^2 / c^2$. The boundary conditions $w(0, t) = w(1, t) = 0$ are satisfied if $v$ satisfies the boundary conditions

$$v(0) = v(1) = 0.$$

Hence, introducing the linear space

$$U := \{v \in C[0, 1] : v \text{ is twice continuously differentiable},\ v(0) = v(1) = 0\}$$

and defining the differential operator $D : U \to C[0, 1]$ by $D : v \mapsto -v''$, we are led to the eigenvalue problem $Dv = \lambda v$. Elementary calculations show that the functions $v_m(x) = \sin m\pi x$ are eigenfunctions of $D$ with the eigenvalues $\lambda_m = m^2 \pi^2$ for $m = 1, 2, \ldots$. It can be shown that these are the only eigenvalues and eigenfunctions of $D$.
For discussing an approximate solution we consider the slightly more general differential equation

$$-v'' + pv = \lambda v \quad \text{on } [0, 1]$$

with boundary conditions $v(0) = v(1) = 0$, where $p \in C[0, 1]$ is a given positive function. We can proceed as in Example 2.1 and choose an equidistant mesh $x_j = jh$, $j = 0, \ldots, n + 1$, with step size $h = 1/(n + 1)$ and $n \in \mathbb{N}$. At the internal grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient by the difference quotient

$$v''(x_j) \approx \frac{1}{h^2}\, \{v(x_{j+1}) - 2v(x_j) + v(x_{j-1})\}$$

to obtain the system of equations

$$\frac{1}{h^2}\, \{-v_{j-1} + 2v_j - v_{j+1}\} + p_j v_j = \lambda v_j, \quad j = 1, \ldots, n,$$

for approximate values $v_j$ to the exact solution $v(x_j)$. Here, we have set $p_j := p(x_j)$ for $j = 0, \ldots, n + 1$. This system has to be complemented by the two boundary conditions $v_0 = v_{n+1} = 0$. For an abbreviated notation we introduce the $n \times n$ tridiagonal matrix

$$A = \frac{1}{h^2} \begin{pmatrix}
2 + h^2 p_1 & -1 & & & \\
-1 & 2 + h^2 p_2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 + h^2 p_{n-1} & -1 \\
 & & & -1 & 2 + h^2 p_n
\end{pmatrix}$$

and the vector $u = (v_1, \ldots, v_n)^T$. Then the above system of equations, including the boundary conditions, reads

$$Au = \lambda u;$$

i.e., the eigenvalue problem for the differential operator $D$ is approximated by the eigenvalue problem for the matrix $A$. □
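
As a small numerical illustration, the following sketch assembles the tridiagonal matrix for a given function $p$ and, for the special case $p = 0$, checks that the grid values of $\sin m\pi x$ form an eigenvector. The function name `discretized_operator`, the choice $n = 50$, and the closed form $(2 - 2\cos m\pi h)/h^2$ for the discrete eigenvalue (which follows from the addition theorem for the sine) are assumptions of this sketch and are not taken from the text.

```python
import math


def discretized_operator(p, n):
    """Tridiagonal matrix of the difference scheme for -v'' + p v on [0, 1]."""
    h = 1.0 / (n + 1)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A[j][j] = 2.0 / h**2 + p((j + 1) * h)   # grid point x_{j+1} = (j+1) h
        if j > 0:
            A[j][j - 1] = -1.0 / h**2
        if j < n - 1:
            A[j][j + 1] = -1.0 / h**2
    return A


n, m = 50, 1
h = 1.0 / (n + 1)
A = discretized_operator(lambda x: 0.0, n)

# Grid values of the eigenfunction sin(m pi x) of the differential operator.
u = [math.sin(m * math.pi * (j + 1) * h) for j in range(n)]
Au = [sum(A[i][k] * u[k] for k in range(n)) for i in range(n)]

lam_h = (2.0 - 2.0 * math.cos(m * math.pi * h)) / h**2   # discrete eigenvalue
residual = max(abs(Au[i] - lam_h * u[i]) for i in range(n))

print("discrete eigenvalue  :", lam_h)
print("continuous eigenvalue:", (m * math.pi) ** 2)
print("max residual         :", residual)
```

For this special case the discrete eigenvalue differs from $m^2\pi^2$ by a term of order $h^2$; the general convergence question is not addressed by this check.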

The important question as to how well the matrix eigenvalues approx-


imate the eigenvalues of the differential operator and whether we have
convergence of the eigenvalues as h -+ 0 is beyond the scope of this book
(see Problem 7.2). The example is meant only as an illustration of the fact
that eigenvalue problems for large matrices arise through the discretiza-
tion of eigenvalue problems for ordinary differential operators and also for
partial differential operators. In the same spirit, eigenvalue problems for
integral operators can be approximated by matrix eigenvalue problems, as
indicated in the following example.

Example 7.2 Consider the eigenvalue problem

$$\int_0^1 K(x, y)\, \varphi(y)\, dy = \lambda \varphi(x), \quad x \in [0, 1],$$

for a linear integral operator with continuous kernel $K$. For the numerical approximation we proceed as in Example 2.3 and approximate the integral by the rectangular rule with equidistant quadrature points $x_k = k/n$ for $k = 1, \ldots, n$. If we require the approximated equation to be satisfied only at the grid points, we arrive at the approximating system of equations

$$\frac{1}{n} \sum_{k=1}^{n} K(x_j, x_k)\, \varphi_k = \lambda \varphi_j, \quad j = 1, \ldots, n,$$

for approximate values $\varphi_j$ to the exact solution $\varphi(x_j)$. Hence, we approximate the eigenvalues of the integral operator by the eigenvalues of the matrix with entries $K(x_j, x_k)/n$. Of course, instead of the rectangular rule any other quadrature rule can be used. A discussion of the convergence of the matrix eigenvalues to the eigenvalues of the integral operator is again beyond the aim of this introduction. □

7.2 Estimates for the Eigenvalues


At this point we urge the reader to recall the basic facts about eigenvalues of matrices, in particular those that were presented in Section 3.4. In the sequel, by $(\cdot\,, \cdot)$ we denote the Euclidean scalar product in $\mathbb{C}^n$ and by $\|\cdot\|_2$ the corresponding Euclidean norm.
The eigenvalues of Hermitian matrices can be characterized by the following maximum principles. These can be used to get some rough estimates for the eigenvalues. Note that for the eigenvalues of Hermitian matrices the geometric and the algebraic multiplicity coincide (see Problem 7.4).
Theorem 7.3 (Rayleigh) Let $A$ be a Hermitian $n \times n$ matrix with eigenvalues

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$$

(where multiple eigenvalues occur according to their multiplicity) and corresponding orthonormal eigenvectors $x_1, x_2, \ldots, x_n$. Then

$$\lambda_j = \max_{\substack{x \in V_j \\ x \neq 0}} \frac{(Ax, x)}{(x, x)}, \quad j = 1, \ldots, n,$$

where the subspaces $V_1, \ldots, V_n$ are defined by $V_1 := \mathbb{C}^n$ and

$$V_j := \{x \in \mathbb{C}^n : (x, x_k) = 0,\ k = 1, \ldots, j - 1\}, \quad j = 2, \ldots, n.$$

Proof. Let $x \in V_j$ with $x \neq 0$. Then

$$x = \sum_{k=j}^{n} (x, x_k)\, x_k \quad \text{and} \quad \sum_{k=j}^{n} |(x, x_k)|^2 = (x, x).$$

Hence

$$Ax = \sum_{k=j}^{n} \lambda_k (x, x_k)\, x_k$$

and

$$(Ax, x) = \sum_{k=j}^{n} \lambda_k |(x, x_k)|^2 \leq \lambda_j \sum_{k=j}^{n} |(x, x_k)|^2 = \lambda_j (x, x).$$

This implies

$$\sup_{\substack{x \in V_j \\ x \neq 0}} \frac{(Ax, x)}{(x, x)} \leq \lambda_j,$$

and the statement follows from $(Ax_j, x_j) = \lambda_j$ and $x_j \in V_j$. □


This maximum principle can be used in a simple manner to obtain lower bounds for the largest eigenvalue of Hermitian matrices. For the matrix

$$A = \begin{pmatrix} 1 & 3 & 2 \\ 3 & 5 & 1 \\ 2 & 1 & 4 \end{pmatrix},$$

by using $x = (1, 1, 1)^T$ we find the estimate $\lambda_1 \geq 7.33$ as compared to the exact eigenvalue $\lambda_1 = 7.58\ldots$. Using $x = (1, 2, 1)^T$ leads to the estimate $\lambda_1 \geq 7.50$.
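
The two lower bounds can be reproduced with a few lines of Python. The sketch below is restricted to real symmetric matrices and real vectors (so no complex conjugation is needed); the function name `rayleigh_quotient` and the matrix entries, which are taken as the symmetric $3 \times 3$ matrix with rows $(1, 3, 2)$, $(3, 5, 1)$, $(2, 1, 4)$, are assumptions of this illustration.

```python
def rayleigh_quotient(A, x):
    """Return (Ax, x) / (x, x) for a real matrix A and a real vector x."""
    Ax = [sum(a * xk for a, xk in zip(row, x)) for row in A]
    return sum(xi * yi for xi, yi in zip(x, Ax)) / sum(xi * xi for xi in x)


A = [[1, 3, 2],
     [3, 5, 1],
     [2, 1, 4]]

# Each value is a lower bound for the largest eigenvalue by Theorem 7.3.
print(rayleigh_quotient(A, [1, 1, 1]))
print(rayleigh_quotient(A, [1, 2, 1]))
```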
Using Rayleigh's principle to obtain bounds for the smaller eigenvalues
requires the knowledge of the eigenvectors for the preceding larger eigen-
values. This problem is circumvented in the following minimum maximum
principle.

Theorem 7.4 (Courant) Let $A$ be a Hermitian $n \times n$ matrix with eigenvalues

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$$

(where multiple eigenvalues occur according to their multiplicity). Then

$$\lambda_j = \min_{U_j \in M_j}\ \max_{\substack{x \in U_j \\ x \neq 0}} \frac{(Ax, x)}{(x, x)}, \quad j = 1, \ldots, n,$$

where $M_j$ denotes the set of all subspaces $U_j \subset \mathbb{C}^n$ of dimension $n + 1 - j$.

Proof. First we note that because of

$$\sup_{\substack{x \in U_j \\ x \neq 0}} \frac{(Ax, x)}{(x, x)} = \sup_{\substack{x \in U_j \\ (x, x) = 1}} (Ax, x)$$

and the continuity of the function $x \mapsto (Ax, x)$, the supremum is attained; i.e., the maximum exists.
By $x_1, x_2, \ldots, x_n$ we denote orthonormal eigenvectors corresponding to the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. First, we show that for a given subspace $U_j$ of dimension $n + 1 - j$ there exists a vector $x \in U_j$ such that

$$(x, x_k) = 0, \quad k = j + 1, \ldots, n. \tag{7.1}$$

Let $z_1, \ldots, z_{n+1-j}$ be a basis of $U_j$. Then we can represent each $x \in U_j$ by

$$x = \sum_{i=1}^{n+1-j} \alpha_i z_i. \tag{7.2}$$

In order to guarantee (7.1), the $n + 1 - j$ coefficients $\alpha_1, \ldots, \alpha_{n+1-j}$ must satisfy the $n - j$ linear equations

$$\sum_{i=1}^{n+1-j} \alpha_i (z_i, x_k) = 0, \quad k = j + 1, \ldots, n.$$

This underdetermined system always has a nontrivial solution. For the corresponding $x$ given by (7.2) we have $x \neq 0$, and from

$$x = \sum_{k=1}^{j} (x, x_k)\, x_k$$

we obtain that

$$(Ax, x) = \sum_{k=1}^{j} \lambda_k |(x, x_k)|^2 \geq \lambda_j \sum_{k=1}^{j} |(x, x_k)|^2 = \lambda_j (x, x),$$

whence

$$\max_{\substack{x \in U_j \\ x \neq 0}} \frac{(Ax, x)}{(x, x)} \geq \lambda_j$$

follows.
On the other hand, for the subspace

$$V_j = \{x \in \mathbb{C}^n : (x, x_k) = 0,\ k = 1, \ldots, j - 1\}$$

of dimension $n + 1 - j$, by Theorem 7.3 we have the equality

$$\max_{\substack{x \in V_j \\ x \neq 0}} \frac{(Ax, x)}{(x, x)} = \lambda_j,$$

and the proof is finished. □

Corollary 7.5 Let $A$ and $B$ be two Hermitian $n \times n$ matrices with eigenvalues $\lambda_1(A) \geq \lambda_2(A) \geq \cdots \geq \lambda_n(A)$ and $\lambda_1(B) \geq \lambda_2(B) \geq \cdots \geq \lambda_n(B)$. Then

$$|\lambda_j(A) - \lambda_j(B)| \leq \|A - B\|, \quad j = 1, \ldots, n,$$

for any norm $\|\cdot\|$ on $\mathbb{C}^n$.

Proof. From the Cauchy-Schwarz inequality we have that

$$(Ax - Bx, x) \leq \|(A - B)x\|_2\, \|x\|_2 \leq \|A - B\|_2\, \|x\|_2^2$$

and hence

$$(Ax, x) \leq (Bx, x) + \|A - B\|_2\, \|x\|_2^2.$$

By the Courant minimum maximum principle of Theorem 7.4 this implies

$$\lambda_j(A) \leq \lambda_j(B) + \|A - B\|_2.$$

Interchanging the roles of $A$ and $B$, we also have that

$$\lambda_j(B) \leq \lambda_j(A) + \|A - B\|_2,$$

and therefore

$$|\lambda_j(A) - \lambda_j(B)| \leq \|A - B\|_2.$$

Now the statement follows from

$$\|A - B\|_2 = \rho(A - B) \leq \|A - B\|,$$

which is a consequence of Theorems 3.31 and 3.32. □
Corollary 7.6 For the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ of a Hermitian $n \times n$ matrix $A = (a_{jk})$ we have that

$$|\lambda_i - a'_{ii}|^2 \leq \sum_{\substack{j,k=1 \\ j \neq k}}^{n} |a_{jk}|^2, \quad i = 1, \ldots, n,$$

where the elements $a'_{11}, \ldots, a'_{nn}$ represent a permutation of the diagonal elements $a_{11}, \ldots, a_{nn}$ of $A$ such that $a'_{11} \geq a'_{22} \geq \cdots \geq a'_{nn}$.
Proof. Use $B = \operatorname{diag}(a_{jj})$ and $\|\cdot\| = \|\cdot\|_2$ in the preceding corollary. □

We conclude this section with an extension of the above results to general matrices that gives a rough estimate as to where in $\mathbb{C}$ the eigenvalues are located.

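Theorem 7.7 (Gerschgorin) Let $A = (a_{ik})$ be a complex $n \times n$ matrix and define the disks

$$G_i := \Big\{ \lambda \in \mathbb{C} : |\lambda - a_{ii}| \leq \sum_{\substack{k=1 \\ k \neq i}}^{n} |a_{ik}| \Big\}, \quad i = 1, \ldots, n,$$

and

$$G_j' := \Big\{ \lambda \in \mathbb{C} : |\lambda - a_{jj}| \leq \sum_{\substack{k=1 \\ k \neq j}}^{n} |a_{kj}| \Big\}, \quad j = 1, \ldots, n.$$

Then the eigenvalues $\lambda$ of $A$ satisfy

$$\lambda \in \bigcup_{i=1}^{n} G_i\ \cap\ \bigcup_{j=1}^{n} G_j'.$$

Proof. Assume that $Ax = \lambda x$ and $\|x\|_\infty = 1$, and for $x = (x_1, \ldots, x_n)^T$ choose $i$ such that $|x_i| = \|x\|_\infty = 1$. Then

$$|\lambda - a_{ii}| = |(\lambda - a_{ii})\, x_i| = \Big| \sum_{\substack{k=1 \\ k \neq i}}^{n} a_{ik} x_k \Big| \leq \sum_{\substack{k=1 \\ k \neq i}}^{n} |a_{ik}|,$$

and therefore

$$\lambda \in \bigcup_{i=1}^{n} G_i.$$

Since the eigenvalues of $A^*$ are the complex conjugates of the eigenvalues of $A$ (see Problem 7.3) we also have that

$$\lambda \in \bigcup_{j=1}^{n} G_j',$$

and the theorem is proven. □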
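
The disks of Theorem 7.7 are easy to compute. The following sketch returns the row disks and the column disks as pairs (center, radius) and tests whether a given number lies in a union of disks; the function names and the sample matrix are assumptions of this illustration.

```python
def gerschgorin_disks(A):
    """Return (row_disks, column_disks), each a list of (center, radius)."""
    n = len(A)
    rows = [(A[i][i], sum(abs(A[i][k]) for k in range(n) if k != i))
            for i in range(n)]
    cols = [(A[j][j], sum(abs(A[k][j]) for k in range(n) if k != j))
            for j in range(n)]
    return rows, cols


def in_union(lam, disks):
    """Check whether the (real or complex) number lam lies in some disk."""
    return any(abs(lam - center) <= radius for center, radius in disks)


A = [[1, 3, 2],
     [3, 5, 1],
     [2, 1, 4]]
rows, cols = gerschgorin_disks(A)
print(rows)                       # row disks as (center, radius)
print(in_union(7.5, rows))        # 7.5 lies in the disk around 5
print(in_union(10.0, rows))       # 10 lies outside all row disks
```

Since every eigenvalue lies in the union of the row disks, a number outside all of them, such as 10 for this sample matrix, cannot be an eigenvalue.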

7.3 The Jacobi Method


The method described in this section was discovered by Jacobi in 1846 and
can be used to iteratively compute all the eigenvalues and eigenvectors of
real symmetric matrices.
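
Lemma 7.8 The Frobenius norm

$$\|A\|_F := \Big( \sum_{j,k=1}^{n} |a_{jk}|^2 \Big)^{1/2}$$

of an $n \times n$ matrix $A = (a_{jk})$ is invariant with respect to unitary transformations.
Proof. The trace

$$\operatorname{tr} A := \sum_{j=1}^{n} a_{jj}$$

of a matrix $A$ is commutative; i.e., $\operatorname{tr} AB = \operatorname{tr} BA$. This follows from

$$\sum_{j=1}^{n} (AB)_{jj} = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} b_{kj} = \sum_{k=1}^{n} \sum_{j=1}^{n} b_{kj} a_{jk} = \sum_{k=1}^{n} (BA)_{kk}.$$

In particular, we have that

$$\operatorname{tr} AA^* = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk}\, \overline{a_{jk}} = \sum_{j=1}^{n} \sum_{k=1}^{n} |a_{jk}|^2.$$

Therefore, for each unitary matrix $Q$ it follows that

$$\|Q^* A Q\|_F^2 = \operatorname{tr}(Q^* A Q Q^* A^* Q) = \operatorname{tr}(Q^* A A^* Q) = \operatorname{tr}(A A^* Q Q^*) = \|A\|_F^2,$$

and the lemma is proven. □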
Corollary 7.9 The eigenvalues of an $n \times n$ matrix $A$ (counted repeatedly according to their algebraic multiplicity) satisfy Schur's inequality

$$\sum_{j=1}^{n} |\lambda_j|^2 \leq \|A\|_F^2.$$

Equality holds if and only if the matrix $A$ is normal, i.e., if $AA^* = A^*A$.

Proof. By Theorem 3.27 there exists a unitary matrix $Q$ such that $R := Q^* A Q$ is an upper triangular matrix. Hence

$$\|A\|_F^2 = \|R\|_F^2 = \sum_{j=1}^{n} |\lambda_j|^2 + \sum_{j=1}^{n} \sum_{k=j+1}^{n} |r_{jk}|^2, \tag{7.3}$$

since the diagonal elements of $R = (r_{jk})$ coincide with the eigenvalues of the similar matrices $R$ and $A$. Now Schur's inequality follows immediately from (7.3).
For the discussion of the case of equality, we first note that any unitary transformation of a normal matrix is again normal. This is a consequence of the identity

$$Q^* A Q (Q^* A Q)^* - (Q^* A Q)^* Q^* A Q = Q^* (A A^* - A^* A) Q.$$

If equality holds in Schur's inequality, then (7.3) implies that $R$ is a diagonal matrix. Hence $R$, and therefore $A$, is normal.
Conversely, if $A$ is normal, then the upper triangular matrix $R$ must also be normal. Now, from

$$(RR^*)_{jj} = \sum_{k=1}^{n} r_{jk}\, \overline{r_{jk}} = \sum_{k=j}^{n} |r_{jk}|^2$$

and

$$(R^*R)_{jj} = \sum_{k=1}^{n} \overline{r_{kj}}\, r_{kj} = \sum_{k=1}^{j} |r_{kj}|^2$$

we conclude that

$$\sum_{k=j}^{n} |r_{jk}|^2 = \sum_{k=1}^{j} |r_{kj}|^2, \quad j = 1, \ldots, n.$$

This implies $r_{jk} = 0$ for $j < k$, i.e., $R$ is a diagonal matrix, and from (7.3) we deduce that equality holds in Schur's inequality if $A$ is normal. □

For any $n \times n$ matrix $A = (a_{jk})$ we introduce the quantity

$$N(A) := \Big( \sum_{\substack{j,k=1 \\ j \neq k}}^{n} |a_{jk}|^2 \Big)^{1/2} \tag{7.4}$$

as a measure for the deviation of $A$ from a diagonal matrix.

Lemma 7.10 Normal matrices $A$ satisfy

$$\sum_{j=1}^{n} |\lambda_j|^2 = \sum_{j=1}^{n} |a_{jj}|^2 + [N(A)]^2.$$

Proof. This follows from Corollary 7.9. □

The main idea of the Jacobi method for real symmetric matrices is to successively reduce $N(A)$ by elementary plane rotation matrices such that in the limit the matrix becomes diagonal (with the eigenvalues as diagonal entries).

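Lemma 7.11 For each pair $j < k$ and each $\varphi \in \mathbb{R}$ the matrix

$$U = \begin{pmatrix}
1 & & & & & & \\
 & \ddots & & & & & \\
 & & \cos\varphi & & -\sin\varphi & & \\
 & & & \ddots & & & \\
 & & \sin\varphi & & \cos\varphi & & \\
 & & & & & \ddots & \\
 & & & & & & 1
\end{pmatrix},$$

which coincides with the identity matrix except for $u_{jj} = u_{kk} = \cos\varphi$ and $u_{kj} = -u_{jk} = \sin\varphi$ (and which describes a rotation in the $x_j x_k$-plane), is unitary.

Proof. This follows from

$$\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

and

$$\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$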
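Lemma 7.12 Let $A$ be a real symmetric matrix and let $U$ be the unitary matrix of Lemma 7.11. Then $B = U^* A U$ is also real and symmetric and has the entries

$$b_{jj} = a_{jj} \cos^2\varphi + a_{jk} \sin 2\varphi + a_{kk} \sin^2\varphi,$$

$$b_{kk} = a_{jj} \sin^2\varphi - a_{jk} \sin 2\varphi + a_{kk} \cos^2\varphi,$$

$$b_{jk} = b_{kj} = a_{jk} \cos 2\varphi + \tfrac{1}{2}\,(a_{kk} - a_{jj}) \sin 2\varphi,$$

$$b_{ij} = b_{ji} = a_{ij} \cos\varphi + a_{ik} \sin\varphi, \quad i \neq j, k,$$

$$b_{ik} = b_{ki} = -a_{ij} \sin\varphi + a_{ik} \cos\varphi, \quad i \neq j, k,$$

$$b_{il} = a_{il}, \quad i, l \neq j, k;$$

i.e., the matrix $B$ differs from $A$ only in the $j$th and $k$th rows and columns.

Proof. The matrix $B$ is real, since $A$ and $U$ are real, and it is symmetric, since the unitary transformation of a Hermitian matrix is again Hermitian. Elementary calculations show that

$$\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} a_{jj} & a_{jk} \\ a_{jk} & a_{kk} \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} b_{jj} & b_{jk} \\ b_{kj} & b_{kk} \end{pmatrix}$$

with $b_{jj}$, $b_{jk}$, $b_{kj}$, and $b_{kk}$ as stated in the lemma. For $i \neq j, k$ we have that

$$b_{ij} = \sum_{r,s=1}^{n} u_{si}\, a_{sr}\, u_{rj} = a_{ij} u_{jj} + a_{ik} u_{kj} = a_{ij} \cos\varphi + a_{ik} \sin\varphi$$

and

$$b_{ik} = \sum_{r,s=1}^{n} u_{si}\, a_{sr}\, u_{rk} = a_{ij} u_{jk} + a_{ik} u_{kk} = -a_{ij} \sin\varphi + a_{ik} \cos\varphi.$$

Finally, we have

$$b_{il} = \sum_{r,s=1}^{n} u_{si}\, a_{sr}\, u_{rl} = a_{il}$$

for $i, l \neq j, k$. □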
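Lemma 7.13 For

$$\tan 2\varphi = \frac{2a_{jk}}{a_{jj} - a_{kk}}, \quad a_{jj} \neq a_{kk}, \qquad \varphi = \frac{\pi}{4}, \quad a_{jj} = a_{kk},$$

the transformation of Lemma 7.12 annihilates the elements

$$b_{jk} = b_{kj} = 0$$

and reduces the off-diagonal elements according to

$$[N(B)]^2 = [N(A)]^2 - 2a_{jk}^2.$$

Proof. $b_{jk} = b_{kj} = 0$ follows immediately from Lemma 7.12. Applying Lemma 7.8 to the matrices

$$\begin{pmatrix} a_{jj} & a_{jk} \\ a_{jk} & a_{kk} \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} b_{jj} & 0 \\ 0 & b_{kk} \end{pmatrix}$$

yields

$$a_{jj}^2 + 2a_{jk}^2 + a_{kk}^2 = b_{jj}^2 + b_{kk}^2.$$

From this, with the aid of Lemmas 7.8 and 7.12 we find that

$$[N(B)]^2 = \|B\|_F^2 - \sum_{i=1}^{n} b_{ii}^2 = \|A\|_F^2 - \sum_{i=1}^{n} b_{ii}^2 = [N(A)]^2 + \sum_{i=1}^{n} (a_{ii}^2 - b_{ii}^2) = [N(A)]^2 - 2a_{jk}^2,$$

which completes the proof. □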

Note that the quantities required for the computation of the elements of the transformed matrix can be obtained by the trigonometric identities

$$\cos 2\varphi = \frac{1}{\sqrt{1 + \tan^2 2\varphi}}, \qquad \cos\varphi = \sqrt{\tfrac{1}{2}\,(1 + \cos 2\varphi)}, \qquad \sin\varphi = \pm\sqrt{\tfrac{1}{2}\,(1 - \cos 2\varphi)}.$$

The sign of the root in the expression for $\sin\varphi$ has to be chosen such that it coincides with the sign of $\tan 2\varphi$.
The classical Jacobi method generates a sequence $(A_\nu)$ of similar matrices by starting with the given matrix $A_0 := A$ and choosing the unitary transformation at the $\nu$th step according to Lemma 7.13 such that the nondiagonal element of $A_{\nu-1}$ with largest absolute value is annihilated. The elements annihilated in one step of the Jacobi iteration, in general, do not remain zero during subsequent steps. However, we can establish the following convergence result.

Theorem 7.14 The classical Jacobi method converges; i.e., the sequence $(A_\nu)$ converges to a diagonal matrix with the eigenvalues of $A$ as diagonal elements.

Proof. For one step of the Jacobi method, from

$$[N(A)]^2 \leq (n^2 - n) \max_{\substack{i,l = 1, \ldots, n \\ i \neq l}} a_{il}^2$$

we obtain that

$$a_{jk}^2 \geq \frac{[N(A)]^2}{n(n - 1)}$$

for the nondiagonal element $a_{jk}$ with largest modulus. Hence, from Lemma 7.13 we deduce that

$$[N(B)]^2 = [N(A)]^2 - 2a_{jk}^2 \leq q^2\, [N(A)]^2,$$

where

$$q := \left( 1 - \frac{2}{n(n - 1)} \right)^{1/2}.$$

For the sequence $(A_\nu)$ this implies that

$$N(A_\nu) \leq q^\nu N(A_0)$$

for all $\nu \in \mathbb{N}$, whence $N(A_\nu) \to 0$, $\nu \to \infty$, since $q < 1$. □

Note that for large $n$ the value of $q$ is close to one, indicating a slow convergence of the Jacobi method. Writing $A_\nu = (a_{jk,\nu})$, by Corollary 7.6 we have the a posteriori error estimate

$$|\lambda_j - a_{jj,\nu}| \leq N(A_\nu), \quad j = 1, \ldots, n,$$

after performing $\nu$ steps of the Jacobi method. Further error estimates can be derived from Gerschgorin's Theorem 7.7.
Approximations to the eigenvectors can be obtained by successively multiplying the unitary transformations of each step. We have $A_\nu = Q_\nu^* A Q_\nu$, where $Q_\nu = U_1 \cdots U_\nu$ is the product of the elementary unitary transformations for each step. From

$$A_\nu \approx D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$$

it follows that $AQ_\nu \approx Q_\nu D$. Hence the columns $Q_\nu = (u_1, \ldots, u_n)$ of $Q_\nu$ satisfy $Au_j \approx \lambda_j u_j$ for $j = 1, \ldots, n$; i.e., they provide approximations to the eigenvectors.
In each step, the classical Jacobi method requires the determination of the nondiagonal element with largest modulus. In order to reduce the computational costs, in the cyclic Jacobi method the nondiagonal elements are annihilated in the order

$$(1, 2), \ldots, (1, n), (2, 3), \ldots, (2, n), (3, 4), \ldots, (n - 1, n)$$

independent of their size. Convergence results can also be established for this variant (see [27]).
A further refinement is to choose a constant threshold and to annihilate in each cyclic sweep only those off-diagonal elements that are larger in absolute value than the threshold. Of course, the threshold needs to be lowered after each sweep, i.e., after performing a full cycle. For details we refer to [48, 65].
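Example 7.15 For the matrix

$$A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix}$$

the first six transformed matrices for the classical Jacobi method are given by

$$A_1 = \begin{pmatrix} 1.0000 & 0.0000 & -0.7071 \\ 0.0000 & 3.0000 & -0.7071 \\ -0.7071 & -0.7071 & 2.0000 \end{pmatrix}, \qquad
A_2 = \begin{pmatrix} 0.6340 & -0.3251 & 0.0000 \\ -0.3251 & 3.0000 & -0.6280 \\ 0.0000 & -0.6280 & 2.3660 \end{pmatrix},$$

$$A_3 = \begin{pmatrix} 0.6340 & -0.2768 & -0.1704 \\ -0.2768 & 3.3864 & 0.0000 \\ -0.1704 & 0.0000 & 1.9796 \end{pmatrix}, \qquad
A_4 = \begin{pmatrix} 0.6064 & 0.0000 & -0.1695 \\ 0.0000 & 3.4140 & 0.0169 \\ -0.1695 & 0.0169 & 1.9796 \end{pmatrix},$$

$$A_5 = \begin{pmatrix} 0.5858 & 0.0020 & 0.0000 \\ 0.0020 & 3.4140 & 0.0168 \\ 0.0000 & 0.0168 & 2.0002 \end{pmatrix}, \qquad
A_6 = \begin{pmatrix} 0.5858 & 0.0020 & -0.0000 \\ 0.0020 & 3.4142 & 0.0000 \\ -0.0000 & 0.0000 & 2.0000 \end{pmatrix}.$$

The exact eigenvalues of $A$ are $\lambda_1 = 2 + \sqrt{2}$, $\lambda_2 = 2$, $\lambda_3 = 2 - \sqrt{2}$. □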
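
A compact sketch of the classical Jacobi method for real symmetric matrices is given below. It applies the update formulas of Lemma 7.12 with the angle of Lemma 7.13; as an implementation choice made here (not taken from the text), the angle is obtained from `math.atan2`, which covers the cases $a_{jj} \neq a_{kk}$ and $a_{jj} = a_{kk}$ at once. The function name, the tolerance on $N(A_\nu)$, and the step limit are assumptions as well, and the matrix must have at least two rows.

```python
import math


def jacobi_eigenvalues(A, tol=1e-12, max_steps=1000):
    """Eigenvalues (ascending) of a real symmetric matrix, classical Jacobi."""
    n = len(A)
    A = [row[:] for row in A]                     # work on a copy
    for _ in range(max_steps):
        # N(A): square root of the sum of squares of off-diagonal elements.
        off = math.sqrt(sum(A[p][q] ** 2
                            for p in range(n) for q in range(n) if p != q))
        if off < tol:
            break
        # Off-diagonal element of largest absolute value.
        j, k = max(((p, q) for p in range(n) for q in range(p + 1, n)),
                   key=lambda pq: abs(A[pq[0]][pq[1]]))
        # Angle with tan(2 phi) = 2 a_jk / (a_jj - a_kk), see Lemma 7.13.
        phi = 0.5 * math.atan2(2.0 * A[j][k], A[j][j] - A[k][k])
        c, s = math.cos(phi), math.sin(phi)
        ajj, akk, ajk = A[j][j], A[k][k], A[j][k]
        # Rows and columns j and k according to Lemma 7.12.
        for i in range(n):
            if i != j and i != k:
                aij, aik = A[i][j], A[i][k]
                A[i][j] = A[j][i] = aij * c + aik * s
                A[i][k] = A[k][i] = -aij * s + aik * c
        A[j][j] = ajj * c * c + 2.0 * ajk * s * c + akk * s * s
        A[k][k] = ajj * s * s - 2.0 * ajk * s * c + akk * c * c
        A[j][k] = A[k][j] = 0.0
    return sorted(A[i][i] for i in range(n))


A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
print(jacobi_eigenvalues(A))
```

Because the rotation angle is only determined up to a multiple of $\pi/2$, the intermediate matrices of this sketch may differ from those listed in Example 7.15 in the signs of off-diagonal entries; the diagonal limits are the same eigenvalues.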

7.4 The QR Algorithm


The QR algorithm was suggested by Francis in 1961 and is an iterative method for computing all eigenvalues and eigenvectors for arbitrary complex matrices. In applications, it is the most commonly used method for eigenvalue computations. Our presentation of the QR algorithm follows [62].
For motivation we first consider the power method introduced by von
Mises in 1929 for finding the eigenvalue with largest modulus.
Definition 7.16 A matrix $A$ is called diagonalizable if there exists a nonsingular matrix $C$ such that $C^{-1}AC$ is a diagonal matrix; i.e., $A$ is similar to a diagonal matrix.
Theorem 7.17 An $n \times n$ matrix $A$ is diagonalizable if and only if it has $n$ linearly independent eigenvectors.
Proof. Assume that $C^{-1}AC = D$, where $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$, is diagonal. Then $De_j = \lambda_j e_j$, $j = 1, \ldots, n$, with the canonical orthonormal basis $e_1, \ldots, e_n$ of $\mathbb{C}^n$. This implies that the vectors $x_j := Ce_j$, $j = 1, \ldots, n$, are eigenvectors of $A$, since

$$Ax_j = ACe_j = CDe_j = \lambda_j Ce_j = \lambda_j x_j.$$

The vectors $x_1, \ldots, x_n$ are linearly independent because $C$ is nonsingular and the $e_1, \ldots, e_n$ are linearly independent.
Conversely, assume that $x_1, \ldots, x_n$ are $n$ linearly independent eigenvectors of $A$ for the eigenvalues $\lambda_1, \ldots, \lambda_n$. Then the matrix $C = (x_1, \ldots, x_n)$ formed by the eigenvectors as columns is nonsingular, and we have that

$$AC = (Ax_1, \ldots, Ax_n) = (\lambda_1 x_1, \ldots, \lambda_n x_n) = CD,$$

where $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$. Hence $C^{-1}AC = D$. □
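We order the eigenvalues of a diagonalizable $n \times n$ matrix $A$ according to their absolute values and assume that

$$|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|;$$

i.e., there is only one eigenvalue of maximal modulus. Starting from an arbitrary vector $v_0 \in \mathbb{C}^n$ we construct the sequence $(v_\nu)$ by the successive iterations $v_\nu := Av_{\nu-1}$. Note that in order to avoid numerical overflow or underflow we need to scale after each step. Since the $n$ linearly independent eigenvectors $x_1, \ldots, x_n$ of $A$ form a basis of $\mathbb{C}^n$, we can represent

$$v_0 = \sum_{k=1}^n \alpha_k x_k,$$

whence

$$A^\nu v_0 = \sum_{k=1}^n \alpha_k \lambda_k^\nu x_k$$

follows. Scaling after each step by the factor $1/\lambda_1$ leads to

$$\frac{1}{\lambda_1^\nu} A^\nu v_0 = \alpha_1 x_1 + \sum_{k=2}^n \alpha_k \left(\frac{\lambda_k}{\lambda_1}\right)^{\nu} x_k$$

and consequently

$$\left\| \frac{1}{\lambda_1^\nu} A^\nu v_0 - \alpha_1 x_1 \right\| = O\!\left( \left|\frac{\lambda_2}{\lambda_1}\right|^{\nu} \right)$$

and

$$\frac{1}{\lambda_1^\nu} A^\nu v_0 \to \alpha_1 x_1$$

as $\nu \to \infty$, provided that $\alpha_1 \ne 0$. Of course, in principle, $\lambda_1$ cannot be used as a scaling factor, since it is not known. However, this is irrelevant, since the eigenvector is determined only up to multiplication by a complex constant; i.e., only the direction of the eigenvector is relevant. In practical computations, the condition $\alpha_1 \ne 0$, i.e., $v_0 \notin \operatorname{span}\{x_2, \ldots, x_n\}$, will be automatically satisfied through roundoff errors.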
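
A minimal sketch of the power method for a real matrix is given below. It scales by the Euclidean norm in each step, which is an assumption made here since the text only requires some scaling, and it estimates the eigenvalue by a Rayleigh quotient. The function name `power_method` and the fixed step count are illustrative choices.

```python
import math

def power_method(a, v0, steps=200):
    """Iterate v <- A v / ||A v||_2 and return (eigenvalue estimate, direction)."""
    n = len(a)
    v = v0[:]
    lam = 0.0
    for _ in range(steps):
        w = [sum(a[i][k] * v[k] for k in range(n)) for i in range(n)]
        # Rayleigh quotient (A v, v) / (v, v) as eigenvalue estimate
        lam = sum(w[i] * v[i] for i in range(n)) / sum(x * x for x in v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]           # scaling avoids overflow and underflow
    return lam, v

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam, v = power_method(A, [1.0, 0.0, 0.0])
```

For the matrix of Example 7.15 the estimate approaches $2 + \sqrt{2}$, and the speed is governed by the ratio $|\lambda_2/\lambda_1|$.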
The fact that we need to find only the direction of the eigenvectors motivates us to interpret the power method as a successive iteration of subspaces. For

$$S := \operatorname{span}\{v_0\} \quad \text{and} \quad A^\nu S = \operatorname{span}\{A^\nu v_0\}$$

from the above we have that $A^\nu S \to \operatorname{span}\{x_1\}$, $\nu \to \infty$. More generally, we can choose any subspace $S$ of dimension $1 \le \dim S < n$ and iterate $A^\nu S = \{A^\nu v : v \in S\}$.
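Lemma 7.18 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues

$$|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$$

and corresponding eigenvectors $x_1, x_2, \ldots, x_n$. Assume that for some $m$ with $1 \le m < n$ we have that $|\lambda_m| > |\lambda_{m+1}|$ and define

$$T := \operatorname{span}\{x_1, \ldots, x_m\} \quad \text{and} \quad U := \operatorname{span}\{x_{m+1}, \ldots, x_n\}.$$

Further, assume that $S$ is a subspace of $\mathbb{C}^n$ with dimension $m$ satisfying

$$S \cap U = \{0\}.$$

Then the orthogonal projections $P_{A^\nu S}$ and $P_T$ of $\mathbb{C}^n$ onto $A^\nu S$ and $T$, respectively, satisfy

$$\|P_{A^\nu S} - P_T\|_2 \le M \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu}, \quad \nu \in \mathbb{N},$$

for some constant $M$; i.e., the subspaces $A^\nu S$ converge to $T$.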

Proof. 1. First, we show that we can choose a convenient basis for $S$. Let $y_1, \ldots, y_m$ denote a given basis of $S$. Then, for $j = 1, \ldots, m$, we can represent

$$y_j = \sum_{k=1}^m b_{jk} x_k + v_j, \tag{7.5}$$

where $v_j \in U$. We prove that the $m \times m$ matrix $B = (b_{jk})$ is nonsingular. To accomplish this, assume that $\alpha_1, \ldots, \alpha_m$ solve the homogeneous adjoint system

$$\sum_{j=1}^m b_{jk} \alpha_j = 0, \quad k = 1, \ldots, m.$$

Then from (7.5) it follows that

$$\sum_{j=1}^m \alpha_j y_j = \sum_{j=1}^m \alpha_j v_j,$$

and from this, with the aid of $S \cap U = \{0\}$ and the linear independence of the $y_j$, we conclude that $\alpha_1 = \cdots = \alpha_m = 0$. Hence, $B$ indeed is nonsingular. We denote the entries of the inverse of $B$ by $B^{-1} = (c_{jk})$. Then

$$z_j := \sum_{k=1}^m c_{jk} y_k, \quad j = 1, \ldots, m,$$

defines a new basis for $S$ of the form

$$z_j = x_j + u_j, \quad j = 1, \ldots, m,$$

where $u_j \in U$ for $j = 1, \ldots, m$. Because of

$$A^\nu z_j = \lambda_j^\nu x_j + A^\nu u_j, \quad j = 1, \ldots, m,$$

the linearly independent vectors

$$w_{j\nu} := \frac{A^\nu z_j}{\lambda_j^\nu} = x_j + \frac{A^\nu u_j}{\lambda_j^\nu}, \quad j = 1, \ldots, m,$$

form a basis of $A^\nu S$. Since we can represent any $u \in U$ in the form

$$u = \sum_{k=m+1}^n \alpha_k x_k,$$

from

$$A^\nu u = \sum_{k=m+1}^n \alpha_k \lambda_k^\nu x_k$$

we conclude that there exists a constant $L > 0$ such that

$$\|w_{j\nu} - x_j\|_2 \le L \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu}, \quad j = 1, \ldots, m, \tag{7.6}$$

for all $\nu \in \mathbb{N}$.


2. By Corollary 3.53, the orthogonal projection of an element $\eta \in \mathbb{C}^n$ onto the subspace $T$ is given by

$$P_T \eta = \sum_{k=1}^m \alpha_k x_k, \tag{7.7}$$

where the coefficients $\alpha_1, \ldots, \alpha_m$ solve the normal equations

$$\sum_{k=1}^m \alpha_k (x_k, x_j) = (\eta, x_j), \quad j = 1, \ldots, m. \tag{7.8}$$

Analogously, we have

$$P_{A^\nu S} \eta = \sum_{k=1}^m \beta_{k\nu} w_{k\nu} \tag{7.9}$$

and

$$\sum_{k=1}^m \beta_{k\nu} (w_{k\nu}, w_{j\nu}) = (\eta, w_{j\nu}), \quad j = 1, \ldots, m. \tag{7.10}$$

We denote the $m \times m$ matrices of the linear systems (7.8) and (7.10) by $X$ and $W_\nu$, respectively. Then, with the aid of the Cauchy-Schwarz inequality, (7.6) implies that

$$\|W_\nu - X\|_2 \le C_1 \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu}, \quad \nu \in \mathbb{N}, \tag{7.11}$$

for some constant $C_1$. We denote the right-hand sides of (7.8) and (7.10) by $a$ and $b_\nu$, respectively. Again from (7.6) and the Cauchy-Schwarz inequality we have that

$$\|b_\nu - a\|_2 \le C_2 \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu} \|\eta\|_2, \quad \nu \in \mathbb{N}, \tag{7.12}$$

for some constant $C_2$. Now, considering the linear system (7.10) as a perturbation of (7.8), from Theorem 5.3 we can conclude that

$$\|\beta_\nu - \alpha\|_2 \le C_3 \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu} \|\eta\|_2 \tag{7.13}$$

for the vectors $\alpha = (\alpha_1, \ldots, \alpha_m)^T$ and $\beta_\nu = (\beta_{1\nu}, \ldots, \beta_{m\nu})^T$ and some constant $C_3$. From (7.7) and (7.9), using (7.6), (7.13), and the triangle inequality, the assertion of the lemma follows. □

The subspace $T$ of Lemma 7.18 is invariant with respect to $A$; i.e., $A(T) = T$. Once an invariant subspace is known, the eigenvalue problem for the full matrix $A$ can be reduced to eigenvalue problems for two smaller matrices. Assume that

$$P = (P_1, P_2)$$

is a unitary matrix such that its first $m$ columns represented by the matrix $P_1$ form a basis of $T$. Then $P_2^* A P_1 = 0$, since $T$ is invariant with respect to $A$, and $P_2^* P_1 = 0$. Therefore, the unitary transformation yields

$$P^* A P = \begin{pmatrix} P_1^* A P_1 & P_1^* A P_2 \\ P_2^* A P_1 & P_2^* A P_2 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix};$$

i.e., the eigenvalue problem for $A$ is reduced to two smaller eigenvalue problems for the $m \times m$ matrix $A_{11}$ and the $(n - m) \times (n - m)$ matrix $A_{22}$.
The successive iterations of Lemma 7.18 yield only approximations $A^\nu S$ to the invariant subspace $T$. However, if

$$Q_\nu = (Q_{1\nu}, Q_{2\nu})$$

denotes a unitary matrix such that its first $m$ columns represented by the matrix $Q_{1\nu}$ form a basis of $A^\nu S$, then for

$$Q_\nu^* A Q_\nu = \begin{pmatrix} B_{11,\nu} & B_{12,\nu} \\ B_{21,\nu} & B_{22,\nu} \end{pmatrix}$$

we expect that $B_{21,\nu} \to 0$, $\nu \to \infty$. Before we can establish this result we need to investigate further the iteration of subspaces.
Choose a basis $y_1, \ldots, y_n$ of $\mathbb{C}^n$ and consider the subspaces

$$S_m := \operatorname{span}\{y_1, \ldots, y_m\}, \quad m = 1, \ldots, n - 1.$$

For a simultaneous iteration of all the subspaces $A^\nu S_m$ it clearly suffices to iterate the basis vectors $A^\nu y_1, \ldots, A^\nu y_n$. If the assumptions of Lemma 7.18 are satisfied for each $m = 1, \ldots, n - 1$, then

$$A^\nu S_m \to T_m := \operatorname{span}\{x_1, \ldots, x_m\}, \quad \nu \to \infty,$$

for $m = 1, \ldots, n - 1$. Hence we expect to be able to construct unitary matrices $Q_\nu$ such that $Q_\nu^* A Q_\nu \to R$, $\nu \to \infty$, where $R$ is an upper triangular matrix that is similar to $A$.
For the actual computation two difficulties arise. Firstly, the iterated vectors have to be scaled in order to avoid numerical overflow or underflow. Secondly, by Theorem 7.17, as $\nu \to \infty$ each of the $n$ sequences $(A^\nu y_1), \ldots, (A^\nu y_n)$ will converge to the subspace $\operatorname{span}\{x_1\}$ spanned by the eigenvector for the eigenvalue $\lambda_1$ with largest modulus. Hence, for large $\nu$ the vectors $A^\nu y_1, \ldots, A^\nu y_n$ will be almost collinear; i.e., the basis elements $A^\nu y_1, \ldots, A^\nu y_n$ are almost linearly dependent and therefore ill-conditioned for spanning the iterated subspaces.
Both these difficulties can be remedied by orthonormalizing the basis after each step. Assume that $q_{1\nu}, \ldots, q_{n\nu}$ are orthonormal vectors such that

$$A^\nu S_m = \operatorname{span}\{q_{1\nu}, \ldots, q_{m\nu}\}, \quad m = 1, \ldots, n - 1.$$

Then we compute $Aq_{1\nu}, \ldots, Aq_{n\nu}$ and orthonormalize these vectors from left to right to obtain the vectors $r_{1\nu}, \ldots, r_{n\nu}$. This procedure preserves the property

$$\operatorname{span}\{r_{1\nu}, \ldots, r_{m\nu}\} = \operatorname{span}\{Aq_{1\nu}, \ldots, Aq_{m\nu}\}$$

for $m = 1, \ldots, n - 1$.

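Theorem 7.19 Assume that $A$ is a diagonalizable $n \times n$ matrix with eigenvalues

$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$$

and corresponding eigenvectors $x_1, x_2, \ldots, x_n$, and set

$$T_m := \operatorname{span}\{x_1, \ldots, x_m\}, \quad U_m := \operatorname{span}\{x_{m+1}, \ldots, x_n\}$$

for $m = 1, \ldots, n - 1$. Let $q_{10}, \ldots, q_{n0}$ be an orthonormal basis of $\mathbb{C}^n$ and let the subspaces

$$S_m := \operatorname{span}\{q_{10}, \ldots, q_{m0}\}$$

satisfy

$$S_m \cap U_m = \{0\}, \quad m = 1, \ldots, n - 1.$$

Assume that for each $\nu \in \mathbb{N}$ we have constructed an orthonormal system $q_{1\nu}, \ldots, q_{n\nu}$ with the property

$$A^\nu S_m = \operatorname{span}\{q_{1\nu}, \ldots, q_{m\nu}\}, \quad m = 1, \ldots, n - 1, \tag{7.14}$$

and define $Q_\nu = (q_{1\nu}, \ldots, q_{n\nu})$. Then for the sequence of matrices $A_\nu = (a_{jk,\nu})$ given by

$$A_{\nu+1} := Q_\nu^* A Q_\nu \tag{7.15}$$

we have convergence:

$$\lim_{\nu \to \infty} a_{jk,\nu} = 0, \quad 1 \le k < j \le n,$$

and

$$\lim_{\nu \to \infty} a_{jj,\nu} = \lambda_j, \quad j = 1, \ldots, n.$$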
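Proof. 1. Without loss of generality we may assume that $\|x_j\|_2 = 1$ for $j = 1, \ldots, n$. From Lemma 7.18 it follows that

$$\|P_{A^\nu S_m} - P_{T_m}\|_2 \le M r^\nu, \quad m = 1, \ldots, n - 1, \quad \nu \in \mathbb{N}, \tag{7.16}$$

for some constant $M$ and

$$r := \max_{m = 1, \ldots, n-1} \left|\frac{\lambda_{m+1}}{\lambda_m}\right| < 1.$$

From this, for the projections

$$w_{m\nu} := P_{A^\nu S_m} x_m, \quad m = 1, \ldots, n - 1,$$

and $w_{n\nu} := x_n$, we conclude that

$$\|w_{m\nu} - x_m\|_2 \le M r^\nu, \quad m = 1, \ldots, n, \quad \nu \in \mathbb{N}. \tag{7.17}$$

For sufficiently large $\nu$ the vectors $w_{1\nu}, \ldots, w_{n\nu}$ are linearly independent, and we have that

$$\operatorname{span}\{w_{1\nu}, \ldots, w_{m\nu}\} = A^\nu S_m, \quad m = 1, \ldots, n - 1.$$

To prove this we assume to the contrary that the vectors $w_{1\nu}, \ldots, w_{n\nu}$ are not linearly independent for all sufficiently large $\nu$. Then there exists a sequence $\nu_\ell$ such that the vectors $w_{1\nu_\ell}, \ldots, w_{n\nu_\ell}$ are linearly dependent for each $\ell \in \mathbb{N}$. Hence there exist complex numbers $\alpha_{1\ell}, \ldots, \alpha_{n\ell}$ such that

$$\sum_{k=1}^n \alpha_{k\ell} w_{k\nu_\ell} = 0 \quad \text{and} \quad \sum_{k=1}^n |\alpha_{k\ell}|^2 = 1. \tag{7.18}$$

By the Bolzano-Weierstrass theorem, without loss of generality, we may assume that

$$\alpha_{k\ell} \to \alpha_k, \quad \ell \to \infty, \quad k = 1, \ldots, n.$$

Passing to the limit $\ell \to \infty$ in (7.18) with the aid of (7.17) now leads to

$$\sum_{k=1}^n \alpha_k x_k = 0 \quad \text{and} \quad \sum_{k=1}^n |\alpha_k|^2 = 1,$$

which contradicts the linear independence of the eigenvectors $x_1, \ldots, x_n$.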


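2. We orthonormalize by setting $\tilde p_1 := x_1$ and

$$\tilde p_m := x_m - P_{T_{m-1}} x_m, \quad m = 2, \ldots, n,$$

$$p_m := \frac{\tilde p_m}{\|\tilde p_m\|_2}, \quad m = 1, \ldots, n,$$

and, analogously, $\tilde v_{1\nu} := w_{1\nu}$ and

$$\tilde v_{m\nu} := w_{m\nu} - P_{A^\nu S_{m-1}} w_{m\nu}, \quad m = 2, \ldots, n,$$

$$v_{m\nu} := \frac{\tilde v_{m\nu}}{\|\tilde v_{m\nu}\|_2}, \quad m = 1, \ldots, n.$$

Then

$$\operatorname{span}\{p_1, \ldots, p_m\} = T_m, \quad m = 1, \ldots, n - 1,$$

and by repeating the above argument,

$$\operatorname{span}\{v_{1\nu}, \ldots, v_{m\nu}\} = A^\nu S_m, \quad m = 1, \ldots, n - 1, \tag{7.19}$$

for sufficiently large $\nu$. Writing

$$\tilde v_{m\nu} - \tilde p_m = w_{m\nu} - x_m - P_{A^\nu S_{m-1}}(w_{m\nu} - x_m) - (P_{A^\nu S_{m-1}} - P_{T_{m-1}}) x_m,$$

with the aid of (7.16) and (7.17) we obtain that

$$\|\tilde v_{m\nu} - \tilde p_m\|_2 \le 3 M r^\nu, \quad m = 1, \ldots, n, \quad \nu \in \mathbb{N}.$$

From this and the representation

$$v_{m\nu} - p_m = \frac{\|\tilde p_m\|_2 - \|\tilde v_{m\nu}\|_2}{\|\tilde p_m\|_2}\, v_{m\nu} + \frac{\tilde v_{m\nu} - \tilde p_m}{\|\tilde p_m\|_2}$$

it follows that

$$\|v_{m\nu} - p_m\|_2 \le C r^\nu, \quad m = 1, \ldots, n, \quad \nu \in \mathbb{N}, \tag{7.20}$$

for some constant $C$.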

3. From (7.14) and (7.19), by induction, we deduce the existence of phase factors $\varphi_{m\nu} \in \mathbb{C}$ with $|\varphi_{m\nu}| = 1$ such that

$$q_{m\nu} = \varphi_{m\nu} v_{m\nu}, \quad m = 1, \ldots, n.$$

Therefore, defining the diagonal matrices $D_\nu = \operatorname{diag}(\varphi_{1\nu}, \ldots, \varphi_{n\nu})$ and the unitary matrices $V_\nu = (v_{1\nu}, \ldots, v_{n\nu})$, we have the relation

$$V_\nu D_\nu = Q_\nu.$$

This implies that

$$A_{\nu+1} = Q_\nu^* A Q_\nu = D_\nu^* V_\nu^* A V_\nu D_\nu = D_\nu^* (V_\nu^* - P^*) A V_\nu D_\nu + D_\nu^* P^* A (V_\nu - P) D_\nu + D_\nu^* P^* A P D_\nu, \tag{7.21}$$

where $P = (p_1, \ldots, p_n)$. Because of (7.20) we have that

$$V_\nu - P \to 0, \quad \nu \to \infty.$$

Furthermore, $D_\nu^* P^* A P D_\nu$ is an upper triangular matrix with diagonal elements $\lambda_1, \ldots, \lambda_n$. Hence, the assertion of the theorem follows by passing to the limit $\nu \to \infty$ in (7.21). We note that for the elements above the diagonal we do not, in general, have convergence because of the occurrence of the phase factors. □

For the actual numerical implementation we have to describe the computation of $A_{\nu+1}$ according to (7.15). From page 20 we recall that orthonormalizing $n$ vectors $a_1, \ldots, a_n$ from left to right is equivalent to determining orthonormal vectors $q_1, \ldots, q_n$ and an upper triangular matrix $R = (r_{jk})$ such that

$$a_k = \sum_{i=1}^k r_{ik} q_i, \quad k = 1, \ldots, n.$$

For the matrices $A = (a_1, \ldots, a_n)$ and $Q = (q_1, \ldots, q_n)$ this corresponds to a QR decomposition

$$A = QR$$

as described in detail in Section 2.4. Now assume that $A_\nu = Q_{\nu-1}^* A Q_{\nu-1}$ has been determined according to (7.15). To generate $A_{\nu+1}$ from this, a QR decomposition of the matrix $A Q_{\nu-1}$ is required, since

$$A^\nu S_m = A A^{\nu-1} S_m = \operatorname{span}\{A q_{1,\nu-1}, \ldots, A q_{m,\nu-1}\}.$$

This is obtained from a QR decomposition

$$A_\nu = \tilde Q_\nu R_\nu \tag{7.22}$$

of $A_\nu$ by

$$A Q_{\nu-1} = Q_{\nu-1} A_\nu = Q_{\nu-1} \tilde Q_\nu R_\nu = Q_\nu R_\nu,$$

where $Q_\nu = Q_{\nu-1} \tilde Q_\nu$. From this we find that

$$A_{\nu+1} = Q_\nu^* A Q_\nu = \tilde Q_\nu^* A_\nu \tilde Q_\nu = R_\nu \tilde Q_\nu. \tag{7.23}$$

Hence the two equations (7.22) and (7.23) represent one step of the successive iterations of subspaces as described in Theorem 7.19.
Now the QR algorithm consists in performing these iterations starting from the canonical basis $e_1, \ldots, e_n$, which means that in the first step a QR decomposition is required for $A_1 = A = (Ae_1, \ldots, Ae_n)$.
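Theorem 7.20 (QR algorithm) Let $A$ be a diagonalizable matrix with eigenvalues

$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$$

and corresponding eigenvectors $x_1, x_2, \ldots, x_n$, and assume that

$$\operatorname{span}\{e_1, \ldots, e_m\} \cap \operatorname{span}\{x_{m+1}, \ldots, x_n\} = \{0\} \tag{7.24}$$

for $m = 1, \ldots, n - 1$. Starting with $A_1 = A$, construct a sequence $(A_\nu)$ by determining a QR decomposition

$$A_\nu = Q_\nu R_\nu$$

and setting

$$A_{\nu+1} := R_\nu Q_\nu$$

for $\nu = 1, 2, \ldots$. Then for $A_\nu = (a_{jk,\nu})$ we have convergence:

$$\lim_{\nu \to \infty} a_{jk,\nu} = 0, \quad 1 \le k < j \le n,$$

and

$$\lim_{\nu \to \infty} a_{jj,\nu} = \lambda_j, \quad j = 1, \ldots, n.$$

Proof. This is just a special case of Theorem 7.19. □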
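
The iteration of Theorem 7.20 can be sketched for a small real matrix as follows. This is only an illustration under stated assumptions: the QR decomposition is computed by Gram-Schmidt orthonormalization from left to right, which is adequate for a well-conditioned toy example but is not what one would use in practice, and the helper names `qr_gram_schmidt`, `matmul`, and `qr_algorithm` are hypothetical.

```python
import math

def qr_gram_schmidt(a):
    """QR decomposition A = QR of a nonsingular real matrix by Gram-Schmidt."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]    # columns of A
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):
        v = cols[k][:]
        for i, q in enumerate(q_cols):                        # remove components along q_1..q_{k-1}
            r[i][k] = sum(q[t] * v[t] for t in range(n))
            v = [v[t] - r[i][k] * q[t] for t in range(n)]
        r[k][k] = math.sqrt(sum(x * x for x in v))
        q_cols.append([x / r[k][k] for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]  # back to row-major storage
    return q, r

def matmul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def qr_algorithm(a, steps=100):
    """Repeat A_nu = Q R, A_{nu+1} = R Q and return the final iterate."""
    a_nu = [row[:] for row in a]
    for _ in range(steps):
        q, r = qr_gram_schmidt(a_nu)
        a_nu = matmul(r, q)
    return a_nu

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
T = qr_algorithm(A)
```

For this matrix the diagonal of `T` approaches the eigenvalues ordered by decreasing modulus, and the elements below the diagonal tend to zero.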
We proceed with a discussion of the assumption (7.24). Define the matrices $X := (x_1, \ldots, x_n)$ and $Y := X^{-1} = (y_{jk})$. Then the identity $I = XY$ means that

$$e_j = \sum_{k=1}^n x_k y_{kj}, \quad j = 1, \ldots, n.$$

For fixed $m = 1, \ldots, n - 1$ the property (7.24) holds if and only if

$$\sum_{j=1}^m \alpha_j e_j \in \operatorname{span}\{x_{m+1}, \ldots, x_n\}$$

implies that $\alpha_1 = \cdots = \alpha_m = 0$. This in turn is satisfied if and only if the homogeneous linear system

$$\sum_{j=1}^m y_{kj} \alpha_j = 0, \quad k = 1, \ldots, m,$$

admits only the trivial solution, since

$$\sum_{j=1}^m \alpha_j e_j = \sum_{k=1}^n x_k \sum_{j=1}^m y_{kj} \alpha_j.$$

Hence (7.24) holds if and only if for $m = 1, \ldots, n - 1$ the $m \times m$ submatrices $(y_{kj})$, $k, j = 1, \ldots, m$, are nonsingular. This means that for the
matrix $Y$, Gaussian elimination works without interchanging columns; i.e.,
the matrix Y has an LR decomposition. Since Gaussian elimination with
column pivoting always works, there exists a permutation matrix P such
that we have an LR decomposition PY = LR (see Problem 2.16). Hence
it is plausible that the assumption (7.24) is not very restrictive. Indeed,
it can be shown that convergence of the QR algorithm also holds when
(7.24) is not satisfied. However, in general, the eigenvalues on the diagonal
will not occur ordered according to their size (see [65]). Furthermore, it
can be shown that in the case of eigenvalues with the same modulus, the
QR algorithm still works in the sense of an appropriately modified version
of Theorem 7.20. For example, for two conjugate complex eigenvalues, the
upper triangular matrix will be distorted through a two-by-two block on
the diagonal. The blocks do not converge, but still the conjugate complex
eigenvalues can be obtained as eigenvalues of the individual two-by-two
blocks (see [65]).
In principle, the QR decomposition required in each step of the QR
algorithm can be done through the Gram-Schmidt procedure. However, in
practice, because of the ill-conditioning of the Gram-Schmidt procedure,
orthogonalizing by Householder transformations is preferable. For details
we refer back to Section 2.4.
The basic form of the QR algorithm as described above is not yet efficient enough for applications, since each iteration step requires $O(n^3)$ operations. The speed of convergence is determined by the location of the eigenvalues with respect to one another. The matrix $A - \sigma I$ has the eigenvalues $\lambda_j - \sigma$ for $j = 1, \ldots, n$. If we choose for $\sigma$ an approximate value of the eigenvalue $\lambda_n$ of smallest absolute value, then $\lambda_n - \sigma$ becomes small. This will speed up the convergence in the last row of the matrix, since

$$\frac{|\lambda_n - \sigma|}{|\lambda_{n-1} - \sigma|} \ll 1.$$

Having reduced the elements of the last row to almost zero, the last row and column of the matrix may be neglected. This means that the smallest eigenvalue is deflated by canceling the last row and column, and the same procedure can be applied to the remaining $(n - 1) \times (n - 1)$ matrix with the parameter $\sigma$ changed to be close to $\lambda_{n-1}$. This so-called shift and deflation strategy leads to a tremendous speeding up of the convergence. For details we refer to [27, 65].
The computational cost of one step of the QR algorithm is reduced when
the matrix has a large number of zero entries. For example, for tridiagonal
matrices all matrices generated in the QR algorithm remain tridiagonal. In
the following section we will consider so-called Hessenberg matrices, which
differ from upper triangular matrices only by a non-zero first subdiagonal. It
can be shown (see Problem 7.16) that the Hessenberg form is also invariant
with respect to the QR algorithm. Hence, for practical computations it is
convenient first to transform the matrix into Hessenberg form.
In general, comparing the computational costs, for symmetric matrices
the QR algorithm is superior to the Jacobi method. However, the actual
programming for the Jacobi method is very simple as compared with the
QR algorithm. Hence for small matrix size n the Jacobi method is still
attractive.

7.5 Hessenberg Matrices


Definition 7.21 An $n \times n$ matrix $B = (b_{jk})$ is called a Hessenberg matrix if $b_{jk} = 0$ for $1 \le k \le j - 2$, $j = 3, \ldots, n$; i.e., in the lower triangular part of a Hessenberg matrix only the elements of the first subdiagonal can be different from zero.
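We proceed by showing that each matrix $A$ can be transformed into Hessenberg form by unitary transformations using Householder matrices. We start with generating zeros in the first column by multiplying $A$ from the left by a Householder matrix $H_1$. We write

$$A = \begin{pmatrix} a_{11} & * \\ a_1 & \tilde A \end{pmatrix},$$

where $\tilde A$ is an $(n - 1) \times (n - 1)$ matrix and $a_1$ an $(n - 1)$ vector. Then considering a Householder matrix $H_1$ of the form

$$H_1 = \begin{pmatrix} 1 & 0 \\ 0 & \tilde H_1 \end{pmatrix},$$

where $\tilde H_1 = I - 2 v_1 v_1^*$ is an $(n - 1) \times (n - 1)$ Householder matrix, we have

$$A H_1^* = \begin{pmatrix} a_{11} & * \\ a_1 & \tilde A \tilde H_1^* \end{pmatrix}$$

and

$$H_1 A H_1^* = \begin{pmatrix} a_{11} & * \\ \tilde H_1 a_1 & \tilde H_1 \tilde A \tilde H_1^* \end{pmatrix}.$$

As shown in the proof of Theorem 2.13, choosing

$$v_1 = \frac{u_1}{\sqrt{u_1^* u_1}},$$

where

$$u_1 = a_1 \mp \sigma (1, 0, \ldots, 0)^T$$

and

$$\sigma = \begin{cases} \dfrac{a_{21}}{|a_{21}|} \sqrt{a_1^* a_1}, & a_{21} \ne 0, \\[2mm] \sqrt{a_1^* a_1}, & a_{21} = 0, \end{cases}$$

eliminates all elements of $a_1$ with the exception of the first component. Hence the first column of the transformed matrix is of the required form.
Now assume that $A_k$ is an $n \times n$ matrix of the form

$$A_k = \begin{pmatrix} B_k & * \\ 0 \;\; a_k & A_{n-k} \end{pmatrix},$$

where $B_k$ is a $k \times k$ Hessenberg matrix, $A_{n-k}$ an $(n - k) \times (n - k)$ matrix, $a_k$ an $(n - k)$ vector, and $0$ the $(n - k) \times (k - 1)$ zero matrix. Then for a Householder transformation of the form

$$H_k = \begin{pmatrix} I_k & 0 \\ 0 & \tilde H_{n-k} \end{pmatrix},$$

where $I_k$ denotes the $k \times k$ identity matrix and $\tilde H_{n-k}$ is an $(n - k) \times (n - k)$ Householder matrix, it follows that

$$A_k H_k^* = \begin{pmatrix} B_k & * \\ 0 \;\; a_k & A_{n-k} \tilde H_{n-k}^* \end{pmatrix}$$

and

$$H_k A_k H_k^* = \begin{pmatrix} B_k & * \\ 0 \;\; \tilde H_{n-k} a_k & \tilde H_{n-k} A_{n-k} \tilde H_{n-k}^* \end{pmatrix}.$$

Now, proceeding as above, we can choose $\tilde H_{n-k}$ such that all elements of $\tilde H_{n-k} a_k$ vanish with the exception of the first component. This procedure reduces a further column into Hessenberg form. We can summarize our analysis in the following theorem.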
Theorem 7.22 To each $n \times n$ matrix $A$ there exist $n - 2$ Householder matrices $H_1, \ldots, H_{n-2}$ such that for $Q = H_{n-2} \cdots H_1$ the matrix

$$B = Q A Q^*$$

is a Hessenberg matrix.
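
The reduction can be sketched for real matrices as follows. This is a minimal illustration and not the book's code: it applies the successive similarity transformations $H_k A_k H_k$ in place, and it fixes the sign in $u = a \mp \sigma e_1$ by the common rule that avoids cancellation, which is an assumption made here. The function name `hessenberg` is hypothetical.

```python
import math

def hessenberg(a):
    """Reduce a real square matrix to Hessenberg form by Householder transformations."""
    n = len(a)
    h = [row[:] for row in a]
    for k in range(n - 2):
        x = [h[i][k] for i in range(k + 1, n)]        # column k below the diagonal
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue                                   # column already has the required form
        sigma = norm_x if x[0] >= 0 else -norm_x
        u = x[:]
        u[0] += sigma                                  # sign chosen to avoid cancellation
        norm_u = math.sqrt(sum(t * t for t in u))
        v = [t / norm_u for t in u]
        m = len(v)
        for j in range(n):                             # left multiplication: rows k+1..n-1
            dot = sum(v[i] * h[k + 1 + i][j] for i in range(m))
            for i in range(m):
                h[k + 1 + i][j] -= 2.0 * v[i] * dot
        for i in range(n):                             # right multiplication: columns k+1..n-1
            dot = sum(h[i][k + 1 + j] * v[j] for j in range(m))
            for j in range(m):
                h[i][k + 1 + j] -= 2.0 * dot * v[j]
    return h

M = [[4.0, 1.0, 2.0, 3.0], [1.0, 3.0, 0.0, 1.0], [2.0, 0.0, 2.0, 1.0], [3.0, 1.0, 1.0, 5.0]]
H = hessenberg(M)
```

Since the sample matrix is symmetric and the transformation is a similarity by orthogonal matrices, the result is tridiagonal up to roundoff and has the same trace as the input.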
For a Hessenberg matrix the value of the characteristic polynomial and its derivative at a point $\lambda \in \mathbb{C}$ can be computed easily without computing the coefficients of the polynomial. These two quantities are required for employing Newton's method for approximating the eigenvalues as the zeros of the characteristic polynomial. We first consider the case of a symmetric Hessenberg matrix.
Example 7.23 Let

$$A = \begin{pmatrix} a_1 & c_2 & & & & \\ c_2 & a_2 & c_3 & & & \\ & c_3 & a_3 & c_4 & & \\ & & \ddots & \ddots & \ddots & \\ & & & c_{n-1} & a_{n-1} & c_n \\ & & & & c_n & a_n \end{pmatrix}$$

be a symmetric tridiagonal matrix. Denote by $A_k$ the $k \times k$ submatrix consisting of the first $k$ rows and columns of $A$, and let $p_k$ denote the characteristic polynomial of $A_k$. Then we have the recurrence relations

$$p_k(\lambda) = (a_k - \lambda) p_{k-1}(\lambda) - c_k^2 p_{k-2}(\lambda), \quad k = 2, \ldots, n, \tag{7.25}$$

and

$$p_k'(\lambda) = (a_k - \lambda) p_{k-1}'(\lambda) - c_k^2 p_{k-2}'(\lambda) - p_{k-1}(\lambda), \quad k = 2, \ldots, n, \tag{7.26}$$

starting with $p_0(\lambda) = 1$ and $p_1(\lambda) = a_1 - \lambda$.
Proof. The recursion (7.25) follows by expanding $\det(A_k - \lambda I)$ with respect to the last column, and (7.26) is obtained by differentiating (7.25). □
Example 7.24 The $n \times n$ tridiagonal matrix

$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$

has the eigenvalues

$$\lambda_j = 4 \sin^2 \frac{j\pi}{2(n + 1)}, \quad j = 1, \ldots, n$$

(see Example 4.17). Table 7.1 gives the results of the Newton iteration using (7.25) and (7.26) for computing the smallest eigenvalue $\lambda_{\min} = \lambda_1$ and the largest eigenvalue $\lambda_{\max} = \lambda_n$ for $n = 10$. The starting values are obtained from the Gerschgorin estimates $|\lambda - 2| \le 2$ following from Theorem 7.7. □

TABLE 7.1. Hessenberg method for Example 7.24

    λ_max         λ_min

    4.00000000    0.00000000
    3.95000000    0.05000000
    3.92542110    0.07457890
    3.91933549    0.08066451
    3.91898705    0.08101295
    3.91898595    0.08101405
    3.91898595    0.08101405
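
The Newton iteration behind Table 7.1 can be sketched as follows, using the recurrences (7.25) and (7.26). The function names, the list layout (`a[k-1]` holds $a_k$, `c[k-1]` holds $c_k$, and `c[0]` is unused), and the fixed number of steps are assumptions made for this illustration.

```python
import math

def char_poly_and_derivative(a, c, lam):
    """Evaluate p_n(lam) and p_n'(lam) by the recurrences (7.25) and (7.26)."""
    p_prev, p = 1.0, a[0] - lam        # p_0 and p_1
    dp_prev, dp = 0.0, -1.0            # p_0' and p_1'
    for k in range(1, len(a)):
        p_next = (a[k] - lam) * p - c[k] ** 2 * p_prev
        dp_next = (a[k] - lam) * dp - c[k] ** 2 * dp_prev - p
        p_prev, p = p, p_next
        dp_prev, dp = dp, dp_next
    return p, dp

def newton_eigenvalue(a, c, lam, steps=50):
    """Newton iteration lam <- lam - p(lam)/p'(lam) for the characteristic polynomial."""
    for _ in range(steps):
        p, dp = char_poly_and_derivative(a, c, lam)
        if dp == 0.0:
            break
        lam -= p / dp
    return lam

n = 10
a = [2.0] * n
c = [0.0] + [-1.0] * (n - 1)
lam_max = newton_eigenvalue(a, c, 4.0)
lam_min = newton_eigenvalue(a, c, 0.0)
```

With the starting values $4$ and $0$ the first Newton step changes the iterate by $0.05$ in each case, in agreement with the second row of the table.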

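We conclude this section by describing the computation of the quotient of the value of the characteristic polynomial $p(\lambda) = \det(B - \lambda I)$ and its derivative for a general Hessenberg matrix $B = (b_{jk})$. We assume that $b_{j,j-1} \ne 0$ for $j = 2, \ldots, n$; i.e., $B$ is irreducible (see Problem 7.15). For a given $\lambda$ we determine $\xi_1, \ldots, \xi_{n-1}$ and $\alpha = \alpha(\lambda)$ such that

$$\begin{aligned}
(b_{11} - \lambda)\xi_1 + b_{12}\xi_2 + \cdots + b_{1n}\xi_n &= \alpha, \\
b_{21}\xi_1 + (b_{22} - \lambda)\xi_2 + \cdots + b_{2n}\xi_n &= 0, \\
&\;\;\vdots \\
b_{n,n-1}\xi_{n-1} + (b_{nn} - \lambda)\xi_n &= 0,
\end{aligned}$$

and $\xi_n = 1$. This is an $n \times n$ upper triangular linear system for the $n$ unknowns $\alpha, \xi_1, \ldots, \xi_{n-1}$, and it can be solved by backward substitution. Setting

$$C = \begin{pmatrix} b_{11} - \lambda & b_{12} & \cdots & b_{1,n-1} & \alpha \\ b_{21} & b_{22} - \lambda & \cdots & b_{2,n-1} & 0 \\ & \ddots & \ddots & \vdots & \vdots \\ & & & b_{n,n-1} & 0 \end{pmatrix},$$

by Cramer's rule we have that

$$1 = \xi_n = \frac{\det C}{\det(B - \lambda I)} = \frac{(-1)^{n-1} b_{21} \cdots b_{n,n-1}\, \alpha}{\det(B - \lambda I)};$$

that is,

$$p(\lambda) = (-1)^{n-1} b_{21} \cdots b_{n,n-1}\, \alpha(\lambda).$$

Differentiating the last equation yields

$$p'(\lambda) = (-1)^{n-1} b_{21} \cdots b_{n,n-1}\, \alpha'(\lambda),$$

and therefore

$$\frac{p(\lambda)}{p'(\lambda)} = \frac{\alpha(\lambda)}{\alpha'(\lambda)}.$$

By differentiating the above linear system with respect to $\lambda$ we obtain the linear system

$$\begin{aligned}
(b_{11} - \lambda)\eta_1 + b_{12}\eta_2 + \cdots + b_{1,n-1}\eta_{n-1} &= \xi_1 + \beta, \\
b_{21}\eta_1 + (b_{22} - \lambda)\eta_2 + \cdots + b_{2,n-1}\eta_{n-1} &= \xi_2, \\
&\;\;\vdots \\
b_{n,n-1}\eta_{n-1} &= \xi_n
\end{aligned}$$

for the derivatives $\beta = \alpha'$, $\eta_1 = \xi_1', \ldots, \eta_{n-1} = \xi_{n-1}'$. This linear system again can be solved by backward substitution for the $n$ unknowns $\beta, \eta_1, \ldots, \eta_{n-1}$. Thus we have proven the following theorem.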
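Theorem 7.25 Let $B = (b_{jk})$ be an irreducible Hessenberg matrix and let $\lambda \in \mathbb{C}$. Starting from $\xi_n = 1$, $\eta_n = 0$, compute recursively

$$\xi_{n-k} = \frac{1}{b_{n-k+1,n-k}} \left\{ \lambda \xi_{n-k+1} - \sum_{j=n-k+1}^{n} b_{n-k+1,j}\, \xi_j \right\},$$

$$\eta_{n-k} = \frac{1}{b_{n-k+1,n-k}} \left\{ \xi_{n-k+1} + \lambda \eta_{n-k+1} - \sum_{j=n-k+1}^{n} b_{n-k+1,j}\, \eta_j \right\}$$

for $k = 1, \ldots, n - 1$ and

$$\alpha = -\lambda \xi_1 + \sum_{j=1}^{n} b_{1j} \xi_j,$$

$$\beta = -\xi_1 - \lambda \eta_1 + \sum_{j=1}^{n} b_{1j} \eta_j.$$

Then for the characteristic polynomial of $B$ we have

$$\frac{p(\lambda)}{p'(\lambda)} = \frac{\alpha}{\beta}.$$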
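
The recursion of Theorem 7.25 translates almost line by line into code. The following sketch assumes a real irreducible Hessenberg matrix stored as a list of rows with zero-based indices, so that `xi[i]` holds $\xi_{i+1}$; the function name `hessenberg_newton_quotient` is a hypothetical choice.

```python
def hessenberg_newton_quotient(b, lam):
    """Return p(lam)/p'(lam) for an irreducible Hessenberg matrix b via Theorem 7.25."""
    n = len(b)
    xi = [0.0] * n
    eta = [0.0] * n
    xi[n - 1] = 1.0                          # xi_n = 1, eta_n = 0
    for i in range(n - 2, -1, -1):           # backward substitution using row i+1
        row = b[i + 1]
        s_xi = sum(row[j] * xi[j] for j in range(i + 1, n))
        s_eta = sum(row[j] * eta[j] for j in range(i + 1, n))
        xi[i] = (lam * xi[i + 1] - s_xi) / row[i]
        eta[i] = (xi[i + 1] + lam * eta[i + 1] - s_eta) / row[i]
    alpha = -lam * xi[0] + sum(b[0][j] * xi[j] for j in range(n))
    beta = -xi[0] - lam * eta[0] + sum(b[0][j] * eta[j] for j in range(n))
    return alpha / beta

B = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 7.0, 8.0]]
q0 = hessenberg_newton_quotient(B, 0.0)
```

For this sample matrix one has $p(0) = \det B = 18$ and $p'(0) = -3$, so the quotient at $\lambda = 0$ equals $-6$. A Newton step for an eigenvalue is then `lam - hessenberg_newton_quotient(B, lam)`.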

Problems
7.1 For the eigenvalues (repeated according to their algebraic multiplicity) of an $n \times n$ matrix $A$ show that

$$\operatorname{tr} A = \sum_{j=1}^n \lambda_j \quad \text{and} \quad \det A = \prod_{j=1}^n \lambda_j.$$

7.2 For Example 7.1 show that in the case p = 0 the eigenvalues of the matrix
$A$ converge to the eigenvalues of the differential operator $D$ as $n \to \infty$.

7.3 Show that the eigenvalues of the adjoint matrix $A^*$ are the complex conjugates of the eigenvalues of the matrix $A$.

7.4 Show that for the eigenvalues of Hermitian matrices the geometric and the
algebraic multiplicities coincide.

7.5 Use Gerschgorin's Theorem 7.7 to determine the approximate location of


the eigenvalues of the matrix

To check the estimates, compute the eigenvalues by finding the zeros of the char-
acteristic polynomial.

7.6 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$, $B$ an $n \times n$ matrix, and $\lambda$ an eigenvalue of $A + B$. Show that

where $C$ is a nonsingular matrix such that $C^{-1}AC$ is diagonal and $p = 1, 2, \infty$.


7.7 Show that the Frobenius norm is indeed a norm on the linear space of
matrices.

7.8 Write a computer program for the Jacobi method and test it for various
examples.

7.9 Assume that $A$ is a real symmetric $n \times n$ matrix with eigenvalue $\lambda$ of multiplicity $n - 1$ and a further eigenvalue $\mu \ne \lambda$. Show that $A = \lambda I + (\mu - \lambda) x x^*$, where $x^* x = 1$, and that by at most $n - 1$ Jacobi transformations $A$ becomes diagonal.

7.10 Show convergence of the cyclic Jacobi method with threshold $[N(A)]^2/(2n^2)$.

7.11 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues $\lambda_1, \ldots, \lambda_n$ and eigenvectors $x_1, \ldots, x_n$, and assume that $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|$. Starting from $v_0 \in \mathbb{C}^n$ with $v_0 \notin \operatorname{span}\{x_2, \ldots, x_n\}$ show that the sequence

$$v_{\nu+1} := \frac{A v_\nu}{\|A v_\nu\|_2}, \quad \nu = 0, 1, 2, \ldots,$$

is well-defined and that the sequence of Rayleigh quotients

$$R_\nu := \frac{(A v_\nu, v_\nu)}{\|v_\nu\|_2^2}, \quad \nu = 0, 1, 2, \ldots,$$

satisfies the estimate

$$|R_\nu - \lambda_1| \le C r^\nu, \quad \nu = 0, 1, 2, \ldots,$$

for some constant $C > 0$ and $r := |\lambda_2/\lambda_1|$.


7.12 The matrix
2
1
0.5
has eigenvalue $\lambda = 4$ with eigenvector $x = (1, 1, 1)^T$. Construct a Householder
matrix $H$ such that

and determine the remaining eigenvalues.

7.13 Write a computer program for the QR algorithm and test it for various examples.

7.14 Verify the numerical results of Table 5.2 for the Hilbert matrix.

7.15 Show that Hessenberg matrices $B = (b_{jk})$ with $b_{j,j-1} \ne 0$ for $j = 2, \ldots, n$ are irreducible.

7.16 Show that the Hessenberg form of a matrix is preserved by the QR algorithm.

7.17 Show that the number of multiplications required for the transformation of a matrix into Hessenberg form via Householder transformations according to Theorem 7.22 is $5n^3/3 + O(n^2)$.

7.18 Write a computer program for transforming a matrix into Hessenberg form via Householder transformations according to Theorem 7.22.

7.19 Discuss Newton's method for the solution of $Ax = \lambda x$, $x^T x = 1$ in the neighborhood of a simple eigenvalue of a real symmetric matrix $A$.

7.20 Prove the inequality

$$\sum_{j=1}^n |\lambda_j|^2 \le \left( \|A\|_F^4 - \frac{1}{2} \|AA^* - A^*A\|_F^2 \right)^{1/2}$$

for the eigenvalues of an $n \times n$ matrix $A$ (see Corollary 7.9 and [41]).


8
Interpolation

Polynomials have attracted the attention of mathematicians for centuries


because of their many beautiful properties. For numerical purposes they
have the advantage that their computation reduces to additions and mul-
tiplications only. Therefore, it is quite natural to use polynomials for the
approximation of more complicated functions. A classical approach to spec-
ifying the coefficients of a polynomial of degree n is to prescribe that its
values at n + 1 distinct points coincide with those of the function to be
approximated. The development and investigation of such interpolation
polynomials has a long mathematical history, beginning with the use of
the method of interpolation to tabulate the logarithms, as proposed by
Briggs in the early seventeenth century.
It is the purpose of the first section, Section 8.1, of this chapter to intro-
duce the classical theory of polynomial interpolation, including discussions
on the effective numerical computation of interpolation polynomials and an
analysis of the resulting approximation error. The next section, Section 8.2,
describes the corresponding theory for the interpolation of periodic func-
tions by trigonometric polynomials. For a detailed study of the foundations
of classical interpolation theory we refer to [16].
In the last two sections, Sections 8.3 and 8.4, we proceed with a study
of interpolation by splines, i.e., piecewise polynomial interpolation, which
was developed within the last fifty years and has turned into a successful
tool in approximation theory and other parts of numerical analysis. For a
comprehensive study of spline functions we refer to [18, 53], and for their
use in computer-aided geometric design we refer to [23].
We would like to point out that interpolation is not only important as


a tool for the approximation of functions that are difficult to compute or
whose values are known only at discrete points. It also serves as an essential
ingredient for developing numerical integration rules and methods for the
approximate solution of differential and integral equations, as we shall see
in the following chapters.

8.1 Polynomial Interpolation


For $n \in \mathbb{N} \cup \{0\}$, we denote by $P_n$ the linear space of polynomials

$$p(x) = \sum_{k=0}^n a_k x^k$$

for a real (or complex) variable $x$ and with real (or complex) coefficients $a_0, \ldots, a_n$. A polynomial $p \in P_n$ is said to be of degree $n$ if $a_n \ne 0$. In this chapter, we consider $P_n$ as a subspace of the linear space $C[a, b]$ of continuous real- (or complex-) valued functions on the interval $[a, b]$, where $a < b$. For $m \in \mathbb{N}$ we denote by $C^m[a, b]$ the linear space of $m$ times continuously differentiable real- (or complex-) valued functions on $[a, b]$.
We recall the following basic uniqueness property of algebraic polynomi-
als as part of the fundamental theorem of algebra. Since we will use this
property frequently, it is appropriate to include a simple proof by induction.

Theorem 8.1 For $n \in \mathbb{N} \cup \{0\}$, each polynomial in $P_n$ that has more than $n$ (complex) zeros, where each zero is counted repeatedly according to its multiplicity, must vanish identically; i.e., all its coefficients must be equal to zero.

Proof. Obviously, the statement is true for $n = 0$. Assume that it has been proven for some $n \ge 0$. By using the binomial formula for $x^k = [(x - z) + z]^k$ we can rewrite the polynomial $p \in P_{n+1}$ in the form

$$p(x) = \sum_{k=1}^{n+1} b_k (x - z)^k + b_0$$

with the coefficients $b_0, b_1, \ldots, b_{n+1}$ depending on $a_0, a_1, \ldots, a_{n+1}$ and $z$. If $z$ is a zero of $p$, then we must have $b_0 = 0$, and this implies that $p(x) = (x - z) q(x)$ with $q \in P_n$. Obviously, $q$ has more than $n$ zeros, since $p$ has more than $n + 1$ zeros. Hence, by the induction assumption, $q$ must vanish identically, and this implies that $p$ vanishes identically. □

Theorem 8.2 The monomials $u_k(x) := x^k$, $k = 0, \ldots, n$, are linearly independent.

Proof. In order to prove this, assume that

$$\sum_{k=0}^n a_k u_k = 0;$$

that is,

$$\sum_{k=0}^n a_k x^k = 0, \quad x \in [a, b].$$

Then the polynomial with coefficients $a_0, a_1, \ldots, a_n$ has more than $n$ distinct zeros, and from Theorem 8.1 it follows that all the coefficients must be zero. □

The linear independence of the monomials Uo, ... , Un implies that they
form a basis for Pn and that Pn has dimension n + 1.
Theorem 8.3 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$ values $y_0, \ldots, y_n \in \mathbb{R}$, there exists a unique polynomial $p_n \in P_n$ with the property

$$p_n(x_j) = y_j, \quad j = 0, \ldots, n. \tag{8.1}$$

In the Lagrange representation, this interpolation polynomial is given by

$$p_n = \sum_{k=0}^n y_k \ell_k \tag{8.2}$$

with the Lagrange factors

$$\ell_k(x) = \prod_{\substack{i=0 \\ i \ne k}}^{n} \frac{x - x_i}{x_k - x_i}, \quad k = 0, \ldots, n.$$

Proof. We note that $\ell_k \in P_n$ for $k = 0, \ldots, n$ and that the equations

$$\ell_k(x_j) = \delta_{jk}, \quad j, k = 0, \ldots, n, \tag{8.3}$$

hold, where $\delta_{jk} = 1$ for $k = j$, and $\delta_{jk} = 0$ for $k \ne j$. It follows that $p_n$ given by (8.2) is in $P_n$, and it fulfills the required interpolation conditions $p_n(x_j) = y_j$, $j = 0, \ldots, n$.
To prove uniqueness of the interpolation polynomial we assume that $p_{n,1}, p_{n,2} \in P_n$ are two polynomials satisfying (8.1). Then the difference $p_n := p_{n,1} - p_{n,2}$ satisfies $p_n(x_j) = 0$, $j = 0, \ldots, n$; i.e., the polynomial $p_n \in P_n$ has $n + 1$ zeros and therefore by Theorem 8.1 must be identically zero. This implies that $p_{n,1} = p_{n,2}$. □
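
The Lagrange representation (8.2) can be evaluated directly from its definition. The following is a minimal sketch; the function name `lagrange_interpolate` and the sample data are chosen here for illustration.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the interpolation polynomial (8.2) at x using the Lagrange factors."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        factor = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                factor *= (x - xi) / (xk - xi)   # one term of the product defining l_k(x)
        total += yk * factor
    return total

xs = [0.0, 1.0, 3.0, 4.0]
ys = [0.0, 2.0, 8.0, 9.0]
value_at_2 = lagrange_interpolate(xs, ys, 2.0)
```

Each evaluation costs $O(n^2)$ operations in this form, and the interpolation conditions (8.1) can be checked by evaluating at the nodes.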

The representation (8.2), which was discovered by Lagrange in 1794, is


very convenient for theoretical investigations because of its simple struc-
ture. However, for practical computations it is suitable only for small n. For
n large the Lagrange factors become very large and highly oscillatory, which
causes ill-conditioning of the Lagrange interpolation polynomial. Already
in 1676, in his study of quadrature formulae (see Theorem 9.3), Newton
had obtained a representation of the interpolation polynomial that is more
practical for computational purposes. For its description we need to give
the following definition.

Definition 8.4 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$ values $y_0, \ldots, y_n \in \mathbb{R}$, the divided differences $D_j^k$ of order $k$ at the point $x_j$ are recursively defined by

$$D_j^0 := y_j, \quad j = 0, \ldots, n,$$

$$D_j^k := \frac{D_{j+1}^{k-1} - D_j^{k-1}}{x_{j+k} - x_j}, \quad j = 0, \ldots, n - k, \quad k = 1, \ldots, n.$$

We notice that the points $x_0, \ldots, x_n$ need not be in ascending order. It is convenient to arrange the divided differences according to the tableau

$$\begin{array}{c|cccc}
x_0 & y_0 = D_0^0 & & & \\
 & & D_0^1 & & \\
x_1 & y_1 = D_1^0 & & D_0^2 & \\
 & & D_1^1 & & D_0^3 \\
x_2 & y_2 = D_2^0 & & D_1^2 & \\
 & & D_2^1 & & \\
x_3 & y_3 = D_3^0 & & &
\end{array}$$

which we illustrate by the following example. Obviously, for the full tableau the computational cost is of order $O(n^2)$.

Example 8.5 For the points $x_0 = 0$, $x_1 = 1$, $x_2 = 3$, $x_3 = 4$ and the values $y_0 = 0$, $y_1 = 2$, $y_2 = 8$, $y_3 = 9$ the tableau of the divided differences is given by

$$\begin{array}{ccccc}
0 & 0 & & & \\
 & & 2 & & \\
1 & 2 & & 1/3 & \\
 & & 3 & & -1/4 \\
3 & 8 & & -2/3 & \\
 & & 1 & & \\
4 & 9 & & &
\end{array}$$

Each value $D_j^k$ in the $k$th column is obtained by taking the difference of the two neighboring values $D_{j+1}^{k-1}$ and $D_j^{k-1}$ in the preceding column and dividing it by the difference $x_{j+k} - x_j$ of the points $x_{j+k}$ and $x_j$. □
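The recursion of Definition 8.4 can be sketched in a few lines of Python. The function name `divided_differences` and the list-of-columns layout are illustrative choices, not part of the text; exact rational arithmetic is used so that the output can be compared directly with the tableau of Example 8.5.

```python
from fractions import Fraction

def divided_differences(xs, ys):
    """Return the columns of the tableau: table[k][j] holds D_j^k."""
    n = len(xs) - 1
    table = [[Fraction(y) for y in ys]]              # column k = 0: D_j^0 = y_j
    for k in range(1, n + 1):
        prev = table[k - 1]
        # D_j^k = (D_{j+1}^{k-1} - D_j^{k-1}) / (x_{j+k} - x_j)
        col = [(prev[j + 1] - prev[j]) / (Fraction(xs[j + k]) - Fraction(xs[j]))
               for j in range(n - k + 1)]
        table.append(col)
    return table

# Data of Example 8.5
table = divided_differences([0, 1, 3, 4], [0, 2, 8, 9])
for k, col in enumerate(table):
    print("order", k, [str(d) for d in col])
```

The two nested loops visit each tableau entry once, which reflects the $O(n^2)$ cost stated above.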

Lemma 8.6 The divided differences satisfy the relation

$$D_j^k = \sum_{m=j}^{j+k} y_m \prod_{\substack{i=j \\ i \neq m}}^{j+k} \frac{1}{x_m - x_i}, \quad j = 0, \ldots, n - k, \quad k = 1, \ldots, n. \tag{8.4}$$

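Proof. We proceed by induction with respect to the order $k$. Trivially, (8.4) holds for $k = 1$. We assume that (8.4) has been proven for order $k - 1$ for some $k \geq 2$. Then, using Definition 8.4, the induction assumption, and the identity

$$\frac{1}{x_{j+k} - x_j} \left\{ \frac{1}{x_m - x_{j+k}} - \frac{1}{x_m - x_j} \right\} = \frac{1}{(x_m - x_{j+k})(x_m - x_j)},$$

we obtain

$$\begin{aligned}
D_j^k &= \frac{1}{x_{j+k} - x_j} \left\{ \sum_{m=j+1}^{j+k} y_m \prod_{\substack{i=j+1 \\ i \neq m}}^{j+k} \frac{1}{x_m - x_i} - \sum_{m=j}^{j+k-1} y_m \prod_{\substack{i=j \\ i \neq m}}^{j+k-1} \frac{1}{x_m - x_i} \right\} \\
&= \frac{1}{x_{j+k} - x_j} \sum_{m=j+1}^{j+k-1} y_m \left\{ \frac{1}{x_m - x_{j+k}} - \frac{1}{x_m - x_j} \right\} \prod_{\substack{i=j+1 \\ i \neq m}}^{j+k-1} \frac{1}{x_m - x_i} \\
&\quad + y_{j+k} \prod_{i=j}^{j+k-1} \frac{1}{x_{j+k} - x_i} + y_j \prod_{i=j+1}^{j+k} \frac{1}{x_j - x_i}
= \sum_{m=j}^{j+k} y_m \prod_{\substack{i=j \\ i \neq m}}^{j+k} \frac{1}{x_m - x_i};
\end{aligned}$$

i.e., (8.4) also holds for order $k$. □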


Theorem 8.7 In the Newton representation, for $n \geq 1$ the uniquely determined interpolation polynomial $p_n$ of Theorem 8.3 is given by

$$p_n(x) = y_0 + \sum_{k=1}^{n} D_0^k \prod_{i=0}^{k-1} (x - x_i). \tag{8.5}$$

Proof. We denote the right-hand side of (8.5) by $\tilde p_n$ and establish $p_n = \tilde p_n$ by induction with respect to the degree $n$. For $n = 1$ the representation (8.5) is correct. We assume that (8.5) has been proven for degree $n - 1$ for some $n \geq 2$ and consider the difference $d_n := p_n - \tilde p_n$. Since

$$d_n(x) = p_n(x) - \tilde p_{n-1}(x) - D_0^n \prod_{i=0}^{n-1} (x - x_i),$$

as a consequence of Theorem 8.3 and Lemma 8.6 the coefficient of $x^n$ in the polynomial $d_n$ vanishes; i.e., $d_n \in P_{n-1}$. Using the induction assumption, we have that

$$\tilde p_{n-1}(x_j) = y_j = p_n(x_j), \quad j = 0, \ldots, n - 1,$$

and therefore

$$d_n(x_j) = 0, \quad j = 0, \ldots, n - 1.$$

Hence, by Theorem 8.1 it follows that $d_n = 0$, and therefore $p_n = \tilde p_n$. □

Example 8.8 The interpolation polynomial corresponding to Example 8.5 is given by

$$p_3(x) = 2x + \frac{1}{3}\, x(x - 1) - \frac{1}{4}\, x(x - 1)(x - 3). \qquad \square$$

Analogously to the Horner scheme (see (6.10)), the value of the Newton interpolation polynomial at a point $x$ can be obtained by nested multiplications according to

$$p_n(x) = \Bigl( \cdots \bigl( D_0^n (x - x_{n-1}) + D_0^{n-1} \bigr)(x - x_{n-2}) + \cdots + D_0^1 \Bigr)(x - x_0) + y_0$$

by $O(n)$ multiplications and additions. For an evaluation of the interpolation polynomial at a single point $x$ without explicitly computing the coefficients of the polynomial, the following Neville scheme is very practical. From the formal coincidence of the recursion (8.6) and Definition 8.4 for the divided differences, it is obvious that the computations for (8.6) can be arranged in a tableau analogous to the tableau for the divided differences.
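A minimal sketch of the nested multiplication for the Newton form (8.5), assuming the coefficients $y_0, D_0^1, \ldots, D_0^n$ have already been computed. The function name `newton_eval` and the argument order are illustrative; the data are those of Examples 8.5 and 8.8.

```python
def newton_eval(xs, coeffs, x):
    """Evaluate coeffs[0] + sum_k coeffs[k] * prod_{i<k} (x - xs[i]).

    coeffs = [y_0, D_0^1, ..., D_0^n]; the loop is the Horner-like nesting.
    """
    value = coeffs[-1]
    for k in range(len(coeffs) - 2, -1, -1):
        value = value * (x - xs[k]) + coeffs[k]
    return value

xs = [0.0, 1.0, 3.0, 4.0]
coeffs = [0.0, 2.0, 1.0 / 3.0, -0.25]      # top diagonal of the tableau in Example 8.5
values = [newton_eval(xs, coeffs, x) for x in xs]
print(values)                               # should reproduce the data 0, 2, 8, 9 up to rounding
```

Each pass through the loop costs one multiplication and two additions or subtractions, so the evaluation is $O(n)$.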

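Theorem 8.9 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$ values $y_0, \ldots, y_n \in \mathbb{R}$, the uniquely determined interpolation polynomials $p_i^k \in P_k$, $i = 0, \ldots, n - k$, $k = 0, \ldots, n$, with the interpolation property

$$p_i^k(x_j) = y_j, \quad j = i, \ldots, i + k,$$

satisfy the recursive relation

$$p_i^k(x) = \frac{(x - x_i)\, p_{i+1}^{k-1}(x) - (x - x_{i+k})\, p_i^{k-1}(x)}{x_{i+k} - x_i}, \quad k = 1, \ldots, n. \tag{8.6}$$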

Proof. We again proceed by induction with respect to the degree $k$. Obviously, the statement is true for $k = 1$. Assume that the assertion has been proven for degree $k - 1$ for some $k \geq 2$. Then the right-hand side of (8.6) describes a polynomial $p \in P_k$, and by the induction assumption we find that the interpolation conditions

$$p(x_j) = \frac{(x_j - x_i)\, y_j - (x_j - x_{i+k})\, y_j}{x_{i+k} - x_i} = y_j, \quad j = i + 1, \ldots, i + k - 1,$$

as well as $p(x_i) = y_i$ and $p(x_{i+k}) = y_{i+k}$ are fulfilled. □
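The Neville recursion of Theorem 8.9 can be sketched as follows. The function name `neville` and the in-place update of a single list are illustrative choices; the recursion itself is the one checked in the proof above.

```python
def neville(xs, ys, x):
    """Value at x of the polynomial interpolating the points (xs[j], ys[j])."""
    p = list(ys)                         # p[i] holds p_i^k(x), starting with k = 0
    n = len(xs) - 1
    for k in range(1, n + 1):
        for i in range(n - k + 1):
            # p[i + 1] still holds p_{i+1}^{k-1}(x) because i runs upward
            p[i] = ((x - xs[i]) * p[i + 1] - (x - xs[i + k]) * p[i]) / (xs[i + k] - xs[i])
    return p[0]

value = neville([0.0, 1.0, 3.0, 4.0], [0.0, 2.0, 8.0, 9.0], 2.0)
print(value)
```

For the data of Example 8.5 the result at $x = 2$ should agree with $p_3(2) = 31/6$ from Example 8.8, without any polynomial coefficients being formed.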
The main application of polynomial interpolation consists in the approximation of continuous functions $f : [a, b] \to \mathbb{R}$. In this case, given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$, by

$$L_n : C[a, b] \to P_n$$

we denote the interpolation operator that maps the function $f \in C[a, b]$ onto its uniquely determined interpolation polynomial $L_n f \in P_n$ with the property

$$(L_n f)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{8.7}$$

From the Lagrange representation (8.2) it can be seen that the operator $L_n$ is linear and bounded (see Problem 8.4). Moreover, since $L_n p = p$ for all $p \in P_n$, the interpolation operator is a projection; i.e., $L_n^2 = L_n$.
The interpolation polynomial $L_n f$ is used as an approximation for the function $f$, since in general, the polynomial $L_n f$ is better suited for computational purposes than the original function $f$. In the sequel we shall be concerned with estimating the approximation error $f - L_n f$.
Theorem 8.10 Let $f : [a, b] \to \mathbb{R}$ be $(n + 1)$-times continuously differentiable. Then the remainder $R_n f := f - L_n f$ for polynomial interpolation with $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ can be represented in the form

$$(R_n f)(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!} \prod_{j=0}^{n} (x - x_j), \quad x \in [a, b], \tag{8.8}$$

for some $\xi \in [a, b]$ depending on $x$.


Proof. Since (8.8) is trivially satisfied if $x$ coincides with one of the interpolation points $x_0, \ldots, x_n$, we need be concerned only with the case where $x$ does not coincide with one of the interpolation points. We define

$$q_{n+1}(x) := \prod_{j=0}^{n} (x - x_j)$$

and, keeping $x$ fixed, consider $g : [a, b] \to \mathbb{R}$ given by

$$g(y) := f(y) - (L_n f)(y) - q_{n+1}(y)\, \frac{f(x) - (L_n f)(x)}{q_{n+1}(x)}, \quad y \in [a, b].$$

By the assumption on $f$, the function $g$ is also $(n + 1)$-times continuously differentiable. Obviously, $g$ has at least $n + 2$ zeros, namely $x$ and $x_0, \ldots, x_n$. Then, by Rolle's theorem the derivative $g'$ has at least $n + 1$ zeros. Repeating the argument, by induction we deduce that the derivative $g^{(n+1)}$ has at least one zero in $[a, b]$, which we denote by $\xi$. For this zero we have that

$$0 = f^{(n+1)}(\xi) - (n + 1)!\, \frac{(R_n f)(x)}{q_{n+1}(x)},$$

and from this we obtain (8.8). □
The intermediate point $\xi$ in the error representation (8.8) is not, in general, known explicitly. Therefore, the interpolation error is estimated by the following corollary.
Corollary 8.11 Under the assumptions of Theorem 8.10 we have the error estimate

$$\|R_n f\|_\infty \leq \frac{1}{(n + 1)!}\, \|q_{n+1}\|_\infty\, \|f^{(n+1)}\|_\infty.$$
Example 8.12 The linear interpolation is given by

$$(L_1 f)(x) = \frac{1}{h} \bigl[ f(x_0)(x_1 - x) + f(x_1)(x - x_0) \bigr]$$

with the step width $h = x_1 - x_0$. For the polynomial $q_2(x) = (x - x_0)(x - x_1)$ we have that

$$\max_{x \in [x_0, x_1]} |q_2(x)| = \frac{h^2}{4}.$$

Therefore, by Corollary 8.11, the error occurring in linear interpolation of a twice continuously differentiable function $f$ can be estimated by

$$|(R_1 f)(x)| \leq \frac{h^2}{8} \max_{y \in [x_0, x_1]} |f''(y)|, \quad x \in [x_0, x_1]. \tag{8.9}$$

For example, the error in linear interpolation with step size $h = 0.01$ for the sine function is less than or equal to $h^2/8 = 0.0000125$. □
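As a quick numerical sanity check of the bound (8.9), the following sketch interpolates the sine function linearly on one interval of width $h = 0.01$ placed around $\pi/2$, where $|f''|$ is close to one. The interval location, the sampling density, and the helper name `linear_interp` are assumptions made only for this illustration.

```python
import math

def linear_interp(f, x0, x1, x):
    """(L_1 f)(x) on [x0, x1], as in Example 8.12."""
    h = x1 - x0
    return (f(x0) * (x1 - x) + f(x1) * (x - x0)) / h

h = 0.01
x0 = math.pi / 2 - h / 2                 # interval centered at pi/2
x1 = x0 + h
samples = [x0 + h * i / 1000 for i in range(1001)]
worst = max(abs(math.sin(x) - linear_interp(math.sin, x0, x1, x)) for x in samples)
bound = h * h / 8                        # right-hand side of (8.9) with max|f''| <= 1
print(worst, bound)
```

On this interval the observed error should come close to the bound without exceeding it, which indicates that the constant $1/8$ cannot be improved in general.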

By the following examples we want to introduce the question of whether the interpolation polynomials converge when the number $n + 1$ of interpolation points, and hence the degree $n$ of the interpolation polynomials, tends to infinity.
Example 8.13 Let $f(x) := \sin x$ and let $x_0, \ldots, x_n \in [0, \pi]$ be $n + 1$ distinct points. Since

$$\|q_{n+1}\|_\infty \leq \pi^{n+1}$$

and

$$\|f^{(n+1)}\|_\infty \leq 1,$$

by Corollary 8.11, we have the estimate

$$|(R_n f)(x)| \leq \frac{\pi^{n+1}}{(n + 1)!}, \quad x \in [0, \pi].$$

Hence the sequence $(L_n f)$ of interpolation polynomials converges to the interpolated function $f$ uniformly on $[0, \pi]$ as $n \to \infty$. □

Example 8.14 A first detailed example of the insufficiency of polynomial interpolation even for analytic functions was investigated by Runge in 1901. He considered the simple function

$$f(x) = \frac{1}{1 + 25x^2}$$

on the interval $[-1, 1]$ with equidistant interpolation points. He discovered that as the degree $n$ tends to infinity, the interpolation polynomials diverge for $0.726 \leq |x| \leq 1$, whereas the approximation works satisfactorily in the central portion of the interval (see Problem 8.6). Although $f$ is analytic in all of $\mathbb{R}$, its poles in the complex plane at $\pm i/5$ are responsible for this divergence. □
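Runge's observation can be reproduced with a short experiment. The sketch below evaluates the interpolation polynomial directly from the Lagrange form and records the largest error on a sampling grid; the helper names, the degrees tried, and the grid of 401 sample points are arbitrary choices for illustration.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the interpolation polynomial at x from the Lagrange representation."""
    total = 0.0
    for k, xk in enumerate(xs):
        factor = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                factor *= (x - xi) / (xk - xi)
        total += ys[k] * factor
    return total

def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)

def max_error(n, samples=401):
    """Largest sampled error for n + 1 equidistant interpolation points on [-1, 1]."""
    xs = [-1.0 + 2.0 * j / n for j in range(n + 1)]
    ys = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(runge(x) - lagrange_eval(xs, ys, x)) for x in grid)

errors = {n: max_error(n) for n in (5, 10, 20)}
print(errors)
```

The sampled maximum error should grow with $n$ instead of shrinking, with the largest deviations occurring near the ends of the interval.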

Example 8.15 Consider the continuous function

$$f(x) := \begin{cases} x \sin \dfrac{\pi}{x}, & x \in (0, 1], \\ 0, & x = 0. \end{cases}$$

With the interpolation points chosen as

$$x_j = \frac{1}{j + 1}, \quad j = 0, \ldots, n,$$

we have that $f(x_j) = 0$, $j = 0, \ldots, n$, and therefore $L_n f = 0$ for all $n$. Hence, in this case the sequence $(L_n f)$ converges only at the points $x_j$, $j \in \mathbb{N} \cup \{0\}$, to the interpolated function $f$. □

These three examples illustrate that for polynomial interpolation both convergence and divergence are possible. We complement the examples by stating the following two theorems without detailed proofs.
Theorem 8.16 (Marcinkiewicz) For each function $f \in C[a, b]$ there exists a sequence of interpolation points $(x_j^{(n)})$, $j = 0, \ldots, n$, $n = 0, 1, \ldots$, such that the sequence $(L_n f)$ of interpolation polynomials $L_n f \in P_n$ with $(L_n f)(x_j^{(n)}) = f(x_j^{(n)})$, $j = 0, \ldots, n$, converges to $f$ uniformly on $[a, b]$.
Proof. The proof relies on the Weierstrass approximation theorem and the Chebyshev alternation theorem. The Weierstrass approximation theorem (see [16]) ensures that for each $f \in C[a, b]$ there exists a sequence of polynomials $p_n \in P_n$ such that $\|p_n - f\|_\infty \to 0$ as $n \to \infty$. As a consequence of the Chebyshev alternation theorem from approximation theory (see [16]), for the uniquely determined best approximation $p_n$ to $f$ in the maximum norm with respect to $P_n$, the error $p_n - f$ has at least $n + 1$ zeros in $[a, b]$. Then taking the sequence of these zeros as the sequence of interpolation points implies the statement of the theorem. □

Theorem 8.17 (Faber) For each sequence of interpolation points $(x_j^{(n)})$ there exists a function $f \in C[a, b]$ such that the sequence $(L_n f)$ of interpolation polynomials $L_n f \in P_n$ does not converge to $f$ uniformly on $[a, b]$.

Proof. This is a consequence of the uniform boundedness principle, Theorem 12.7. It implies that from the convergence of the sequence $(L_n f)$ for all $f \in C[a, b]$ it follows that there must exist a constant $C > 0$ such that $\|L_n\|_\infty \leq C$ for all $n \in \mathbb{N}$. Then the statement of the theorem is obtained by showing that the interpolation operator $L_n$ satisfies $\|L_n\|_\infty \geq c \ln n$ for all $n \in \mathbb{N}$ and some $c > 0$ (see [16]). □

We conclude this section by briefly describing Hermite interpolation,


where in addition to the values of the polynomial, the values of its first
derivative at the interpolation points are also prescribed.

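Theorem 8.18 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $2n + 2$ values $y_0, \ldots, y_n \in \mathbb{R}$ and $y_0', \ldots, y_n' \in \mathbb{R}$, there exists a unique polynomial $p_{2n+1} \in P_{2n+1}$ with the property

$$p_{2n+1}(x_j) = y_j, \quad p_{2n+1}'(x_j) = y_j', \quad j = 0, \ldots, n. \tag{8.10}$$

This Hermite interpolation polynomial is given by

$$p_{2n+1} = \sum_{k=0}^{n} \bigl[ y_k H_k^0 + y_k' H_k^1 \bigr] \tag{8.11}$$

with the Hermite factors

$$H_k^0(x) := \bigl[ 1 - 2 \ell_k'(x_k)(x - x_k) \bigr] [\ell_k(x)]^2, \quad H_k^1(x) := (x - x_k)\, [\ell_k(x)]^2,$$

expressed in terms of the Lagrange factors $\ell_k$ from Theorem 8.3.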

Proof. Obviously, the polynomial $p_{2n+1}$ belongs to $P_{2n+1}$, since the Hermite factors have degree $2n + 1$. From (8.3), by elementary calculations it can be seen that (see Problem 8.7)

$$H_k^0(x_j) = (H_k^1)'(x_j) = \delta_{jk}, \quad (H_k^0)'(x_j) = H_k^1(x_j) = 0, \quad j, k = 0, \ldots, n. \tag{8.12}$$

From this it follows that the polynomial (8.11) satisfies the Hermite interpolation property (8.10).
To prove uniqueness of the Hermite interpolation polynomial we assume that $p_{2n+1,1}, p_{2n+1,2} \in P_{2n+1}$ are two polynomials having the interpolation property (8.10). Then the difference $p_{2n+1} := p_{2n+1,1} - p_{2n+1,2}$ satisfies

$$p_{2n+1}(x_j) = p_{2n+1}'(x_j) = 0, \quad j = 0, \ldots, n;$$

i.e., the polynomial $p_{2n+1} \in P_{2n+1}$ has $n + 1$ zeros of order two and therefore, by Theorem 8.1, must be identically equal to zero. This implies that $p_{2n+1,1} = p_{2n+1,2}$. □

The main application of Hermite interpolation consists in the approximation of a given function $f \in C^1[a, b]$ by interpolating its function values and the values of its derivative at $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$. By

$$H_n : C^1[a, b] \to P_{2n+1}$$

we denote the Hermite interpolation operator that maps continuously differentiable functions $f : [a, b] \to \mathbb{R}$ into the uniquely determined Hermite interpolation polynomial $H_n f \in P_{2n+1}$ with the property

$$(H_n f)(x_j) = f(x_j), \quad (H_n f)'(x_j) = f'(x_j), \quad j = 0, \ldots, n.$$

The following theorem can be proven analogously to Theorem 8.10 (see Problem 8.8).
Theorem 8.19 Let $f : [a, b] \to \mathbb{R}$ be $(2n + 2)$-times continuously differentiable. Then the remainder $R_n f := f - H_n f$ for Hermite interpolation with $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ can be represented in the form

$$(R_n f)(x) = \frac{f^{(2n+2)}(\xi)}{(2n + 2)!} \prod_{j=0}^{n} (x - x_j)^2, \quad x \in [a, b], \tag{8.13}$$

for some $\xi \in [a, b]$ depending on $x$.

8.2 Trigonometric Interpolation


In applications, quite frequently there occur periodic functions, i.e., functions with the property

$$f(t + T) = f(t), \quad t \in \mathbb{R},$$

for some $T > 0$. For example, functions defined on closed planar or spatial curves always may be viewed as periodic functions. Polynomial interpolation is not appropriate for periodic functions, since algebraic polynomials are not periodic. Therefore, we proceed by considering interpolation by trigonometric polynomials, which was first used independently by Clairaut (1759) and Lagrange (1762). Without loss of generality we assume that the period is equal to $T = 2\pi$.
Definition 8.20 For $n \in \mathbb{N}$ we denote by $T_n$ the linear space of trigonometric polynomials

$$q(t) = \sum_{k=0}^{n} a_k \cos kt + \sum_{k=1}^{n} b_k \sin kt$$

with real (or complex) coefficients $a_0, \ldots, a_n$ and $b_1, \ldots, b_n$. A trigonometric polynomial $q \in T_n$ is said to be of degree $n$ if $|a_n| + |b_n| > 0$.
From the addition theorems for the cosine and sine functions it follows that $q_1 q_2 \in T_{n_1 + n_2}$ if $q_1 \in T_{n_1}$ and $q_2 \in T_{n_2}$. This justifies speaking of trigonometric polynomials.
Theorem 8.21 A trigonometric polynomial in $T_n$ that has more than $2n$ distinct zeros in the periodicity interval $[0, 2\pi)$ must vanish identically; i.e., all its coefficients must be equal to zero.
Proof. We consider a trigonometric polynomial $q \in T_n$ of the form

$$q(t) = \frac{a_0}{2} + \sum_{k=1}^{n} \bigl[ a_k \cos kt + b_k \sin kt \bigr]. \tag{8.14}$$

Setting $b_0 = 0$,

$$\gamma_k := \frac{1}{2}\,(a_k - i b_k), \quad \gamma_{-k} := \frac{1}{2}\,(a_k + i b_k), \quad k = 0, \ldots, n, \tag{8.15}$$

and using Euler's formula

$$e^{it} = \cos t + i \sin t,$$

we can rewrite (8.14) in the complex form

$$q(t) = \sum_{k=-n}^{n} \gamma_k e^{ikt}. \tag{8.16}$$

Therefore, substituting $z = e^{it}$ and setting

$$p(z) := \sum_{k=-n}^{n} \gamma_k z^{n+k},$$

we have the relation

$$q(t) = z^{-n} p(z).$$

Now assume that the trigonometric polynomial $q \in T_n$ has more than $2n$ distinct zeros in the interval $[0, 2\pi)$. Then the algebraic polynomial $p \in P_{2n}$ has more than $2n$ distinct zeros lying on the unit circle in the complex plane, since the function $t \mapsto e^{it}$ maps $[0, 2\pi)$ bijectively onto the unit circle. By Theorem 8.1, the algebraic polynomial $p$ must be identically zero, and now (8.15) implies that also $q$ must be identically zero. □

Theorem 8.22 The cosine functions $c_k(t) := \cos kt$, $k = 0, 1, \ldots, n$, and the sine functions $s_k(t) := \sin kt$, $k = 1, \ldots, n$, are linearly independent in the function space $C[0, 2\pi]$.

Proof. To prove this, assume that

$$\sum_{k=0}^{n} a_k c_k + \sum_{k=1}^{n} b_k s_k = 0;$$

that is,

$$\sum_{k=0}^{n} a_k \cos kt + \sum_{k=1}^{n} b_k \sin kt = 0, \quad t \in [0, 2\pi].$$

Then the trigonometric polynomial with coefficients $a_0, \ldots, a_n$ and $b_1, \ldots, b_n$ has more than $2n$ distinct zeros in $[0, 2\pi)$, and from Theorem 8.21 it follows that all the coefficients must be zero. Note that this linear independence also can be deduced from Theorem 3.17. □

Theorem 8.22 implies that the cosines $c_k$, $k = 0, 1, \ldots, n$, and sines $s_k$, $k = 1, \ldots, n$, form a basis for $T_n$ and that $T_n$ has dimension $2n + 1$.

Theorem 8.23 Given $2n + 1$ distinct points $t_0, \ldots, t_{2n} \in [0, 2\pi)$ and $2n + 1$ values $y_0, \ldots, y_{2n} \in \mathbb{R}$, there exists a uniquely determined trigonometric polynomial $q_n \in T_n$ with the property

$$q_n(t_j) = y_j, \quad j = 0, \ldots, 2n. \tag{8.17}$$

In the Lagrange representation, this trigonometric interpolation polynomial is given by

$$q_n = \sum_{k=0}^{2n} y_k \ell_k \tag{8.18}$$

with the Lagrange factors

$$\ell_k(t) = \prod_{\substack{i=0 \\ i \neq k}}^{2n} \frac{\sin \dfrac{t - t_i}{2}}{\sin \dfrac{t_k - t_i}{2}}, \quad k = 0, \ldots, 2n.$$

Proof. The function $q_n$ belongs to $T_n$, since the Lagrange factors are trigonometric polynomials of degree $n$. The latter is a consequence of

$$\sin \frac{t - t_0}{2}\, \sin \frac{t - t_1}{2} = \frac{1}{2} \cos \frac{t_1 - t_0}{2} - \frac{1}{2} \cos \left( t - \frac{t_1 + t_0}{2} \right);$$

i.e., each of the functions $\ell_k$ is a product of $n$ trigonometric polynomials of degree one. As in Theorem 8.3, we have $\ell_k(t_j) = \delta_{jk}$ for $j, k = 0, \ldots, 2n$, which shows that $q_n$ indeed solves the trigonometric interpolation problem.
Uniqueness of the trigonometric interpolation polynomial follows analogously to the proof of Theorem 8.3 with the aid of Theorem 8.21. □

We now consider the important case of an equidistant subdivision

$$t_j = \frac{2\pi j}{2n + 1}, \quad j = 0, \ldots, 2n.$$

For this we first note the summation formula

$$\sum_{j=0}^{2n} e^{ikt_j} = \sum_{j=0}^{2n} e^{ijt_k} = \begin{cases} 2n + 1, & k = 0, \\ 0, & k = \pm 1, \ldots, \pm 2n, \end{cases} \tag{8.19}$$

which is a consequence of the fact that for $e^{it_k} \neq 1$ we have the geometric sum

$$\sum_{j=0}^{2n} e^{ijt_k} = \frac{1 - e^{i(2n+1)t_k}}{1 - e^{it_k}} = 0,$$

whereas for $e^{it_k} = 1$ each term in the sum is equal to one.


We now attempt to find the uniquely determined interpolation polynomial in the complex form

$$q_n(t) = \sum_{k=-n}^{n} \gamma_k e^{ikt}.$$

From the interpolation conditions

$$q_n(t_j) = y_j, \quad j = 0, \ldots, 2n,$$

we observe that solving the interpolation problem is equivalent to solving the system of linear equations

$$\sum_{k=-n}^{n} \gamma_k e^{ikt_j} = y_j, \quad j = 0, \ldots, 2n. \tag{8.20}$$

Assume that the coefficients $\gamma_k$ solve (8.20). Then, with the aid of (8.19), we obtain

$$\sum_{j=0}^{2n} y_j e^{-imt_j} = \sum_{k=-n}^{n} \gamma_k \sum_{j=0}^{2n} e^{i(k-m)t_j} = (2n + 1)\, \gamma_m;$$

i.e., any solution of (8.20) must be of the form

$$\gamma_k = \frac{1}{2n + 1} \sum_{j=0}^{2n} y_j e^{-ikt_j}, \quad k = -n, \ldots, n. \tag{8.21}$$

On the other hand, again with the aid of (8.19), for $\gamma_k$ given by (8.21) we have that

$$\sum_{k=-n}^{n} \gamma_k e^{ikt_j} = \frac{1}{2n + 1} \sum_{m=0}^{2n} y_m \sum_{k=-n}^{n} e^{ik(t_j - t_m)} = y_j, \quad j = 0, \ldots, 2n;$$

i.e., the linear system (8.20) has a unique solution, which is given by (8.21). From this, using the relation (8.15) between the real representation (8.14) and the complex representation (8.16) of trigonometric polynomials, we derive the following theorem.

Theorem 8.24 There exists a unique trigonometric polynomial

$$q_n(t) = \frac{a_0}{2} + \sum_{k=1}^{n} \bigl[ a_k \cos kt + b_k \sin kt \bigr]$$

satisfying the interpolation property

$$q_n\!\left( \frac{2\pi j}{2n + 1} \right) = y_j, \quad j = 0, \ldots, 2n.$$

Its coefficients are given by

$$a_k = \frac{2}{2n + 1} \sum_{j=0}^{2n} y_j \cos \frac{2\pi jk}{2n + 1}, \quad k = 0, \ldots, n,$$

$$b_k = \frac{2}{2n + 1} \sum_{j=0}^{2n} y_j \sin \frac{2\pi jk}{2n + 1}, \quad k = 1, \ldots, n.$$
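The coefficient formulas of Theorem 8.24 translate directly into code. The sketch below computes the coefficients by the plain sums, at $O(n^2)$ cost, and then checks the interpolation property at the nodes; the function names and the sample data are made up for illustration.

```python
import math

def trig_coefficients(ys):
    """Coefficients a_0..a_n and b_1..b_n for 2n + 1 equidistant values ys."""
    N = len(ys)                      # N = 2n + 1, assumed odd
    n = (N - 1) // 2
    a = [2.0 / N * sum(y * math.cos(2 * math.pi * j * k / N) for j, y in enumerate(ys))
         for k in range(n + 1)]
    b = [0.0] + [2.0 / N * sum(y * math.sin(2 * math.pi * j * k / N) for j, y in enumerate(ys))
                 for k in range(1, n + 1)]   # b[0] is a placeholder so that b[k] = b_k
    return a, b

def trig_eval(a, b, t):
    """q_n(t) = a_0 / 2 + sum_{k>=1} (a_k cos kt + b_k sin kt)."""
    return a[0] / 2 + sum(a[k] * math.cos(k * t) + b[k] * math.sin(k * t)
                          for k in range(1, len(a)))

ys = [1.0, -2.0, 0.5, 3.0, 4.0]                       # arbitrary sample values, n = 2
a, b = trig_coefficients(ys)
nodes = [2 * math.pi * j / len(ys) for j in range(len(ys))]
residual = max(abs(trig_eval(a, b, t) - y) for t, y in zip(nodes, ys))
print(residual)                                        # expected to be at rounding level
```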

For an equidistant subdivision with an even number $2n$ of interpolation points

$$t_j = \frac{\pi j}{n}, \quad j = 0, \ldots, 2n - 1,$$

we have only $2n$ conditions to determine an element of the $(2n + 1)$-dimensional space $T_n$. However, since the function $\sin nt$ obviously has its zeros at the interpolation points, we drop it from the interpolation polynomial. The proof of the following theorem is completely analogous to the proof of Theorem 8.24.

Theorem 8.25 There exists a unique trigonometric polynomial

$$q_n(t) = \frac{a_0}{2} + \sum_{k=1}^{n-1} \bigl[ a_k \cos kt + b_k \sin kt \bigr] + \frac{a_n}{2} \cos nt$$

satisfying the interpolation property

$$q_n\!\left( \frac{\pi j}{n} \right) = y_j, \quad j = 0, \ldots, 2n - 1.$$

Its coefficients are given by

$$a_k = \frac{1}{n} \sum_{j=0}^{2n-1} y_j \cos \frac{\pi jk}{n}, \quad k = 0, \ldots, n,$$

$$b_k = \frac{1}{n} \sum_{j=0}^{2n-1} y_j \sin \frac{\pi jk}{n}, \quad k = 1, \ldots, n - 1.$$
Obviously, the trigonometric interpolation polynomials of Theorems 8.24
and 8.25 may be viewed as discretized versions of the Fourier series, where
the integrals giving the coefficients of the Fourier series (see Problem 3.20)
are approximated by the rectangular quadrature rule at an equidistant grid
(see Corollary 9.27). Therefore, trigonometric interpolation on an equidis-
tant grid is also known as the discrete Fourier transform.
An effective numerical evaluation of trigonometric polynomials can be done analogously to the Horner scheme for algebraic polynomials. For the polynomial

$$p(z) = \sum_{k=0}^{n} c_k z^k$$

the recursion (6.10) of the Horner scheme has the form

$$b_{k-1} = b_k z + c_{k-1}, \quad k = n, \ldots, 1,$$

starting with $b_n = c_n$, and it delivers $p(z) = b_0$. Assuming that the coefficients $c_k$ are real, we substitute $z = e^{it}$ and separate into real and imaginary parts, $b_k = u_k + i v_k$, to obtain $u_n = c_n$, $v_n = 0$, and the recursion

$$u_{k-1} = u_k \cos t - v_k \sin t + c_{k-1}, \quad v_{k-1} = u_k \sin t + v_k \cos t,$$

for $k = n, \ldots, 1$. From this we find

$$u_0 = \sum_{k=0}^{n} c_k \cos kt, \quad v_0 = \sum_{k=1}^{n} c_k \sin kt;$$

i.e., the evaluation of a trigonometric polynomial at a point $t$ can be reduced to the evaluation of $\sin t$ and $\cos t$ and $O(n)$ additions and multiplications.
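The real recursion for $u_k$ and $v_k$ can be sketched as below, with the index running down from $k = n$ to $k = 1$. The function name `cos_sin_sums` and the test data are illustrative; the result is compared against the direct sums to check the recursion.

```python
import math

def cos_sin_sums(c, t):
    """Return (sum_k c[k] cos(kt), sum_k c[k] sin(kt)) using one cos and one sin call."""
    cos_t, sin_t = math.cos(t), math.sin(t)
    u, v = c[-1], 0.0                          # u_n = c_n, v_n = 0
    for k in range(len(c) - 1, 0, -1):         # k = n, ..., 1
        u, v = u * cos_t - v * sin_t + c[k - 1], u * sin_t + v * cos_t
    return u, v

c = [1.0, -2.0, 0.5, 3.0]
t = 0.7
u0, v0 = cos_sin_sums(c, t)
direct_u = sum(ck * math.cos(k * t) for k, ck in enumerate(c))
direct_v = sum(ck * math.sin(k * t) for k, ck in enumerate(c))
print(u0 - direct_u, v0 - direct_v)            # both differences should be at rounding level
```

The tuple assignment updates $u$ and $v$ simultaneously, which matters because both new values depend on the old pair.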
To compute all the coefficients $a_k$ and $b_k$ in Theorem 8.24 or 8.25 by this approach requires $O(n^2)$ additions and multiplications.
By the fast Fourier transform, which is attributed to Cooley and Tukey (1965) and which was known already to Gauss, the computational costs can be reduced even further. The main idea is to exploit the symmetries of $e^{2\pi i j/n}$ if $n$ is a power of two, say $n = 2^p$ with $p \in \mathbb{N}$. We briefly explain the fast Fourier transform algorithm for the evaluation of the discrete Fourier transform in the complex form

$$c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j e^{-2\pi i jk/n}, \quad k = 0, \ldots, n - 1. \tag{8.22}$$

Let $m := n/2 = 2^{p-1}$ and $w := e^{-2\pi i/n}$. Then $w^n = 1$, $w^m = -1$, and (8.22) reads

$$c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j w^{jk}, \quad k = 0, \ldots, n - 1.$$

Now, the basic idea of the Cooley-Tukey algorithm is to break this sum into two parts for $j$ even and $j$ odd; i.e.,

$$c_k = \frac{1}{2}\, \gamma_k + \frac{1}{2}\, w^k \delta_k, \quad k = 0, \ldots, n - 1,$$

where

$$\gamma_k := \frac{1}{m} \sum_{j=0}^{m-1} y_{2j}\, w^{2jk}, \quad \delta_k := \frac{1}{m} \sum_{j=0}^{m-1} y_{2j+1}\, w^{2jk}, \quad k = 0, \ldots, n - 1.$$

Since $w^2 = e^{-2\pi i/m}$, we have $\gamma_{k+m} = \gamma_k$ and $\delta_{k+m} = \delta_k$, and therefore

$$c_{k+m} = \frac{1}{2}\, \gamma_k - \frac{1}{2}\, w^k \delta_k, \quad k = 0, \ldots, m - 1.$$

Obviously, the $\gamma_k$, $\delta_k$, $k = 0, \ldots, m - 1$, represent a discrete Fourier transform of length $m = n/2$. Hence, the discrete Fourier transform of length $n$ is reduced to two discrete Fourier transforms of length $n/2$ followed by $n$ multiplications and $n$ additions. If this is done recursively, we arrive at the following operation count. Assuming that the $w^k$, $k = 1, \ldots, m - 1$, are precomputed, let $M_p$ denote the number of additions and multiplications needed for the Fourier transform of length $n = 2^p$. Then,

$$M_p = 2 M_{p-1} + 2^{p+1}$$

with $M_0 = 0$. From this, by induction, it follows that

$$M_p = p\, 2^{p+1} = 2n \log_2 n,$$

i.e., that the computational cost is reduced significantly from order $O(n^2)$ to order $O(n \log_2 n)$.
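The even/odd splitting can be written as a short recursive routine. This is a sketch only: it keeps the $1/n$ normalization of (8.22), recomputes the powers $w^k$ on each level instead of precomputing them, and assumes that the input length is a power of two. The names `fft_normalized` and `dft_direct` are illustrative.

```python
import cmath

def fft_normalized(y):
    """c_k = (1/n) sum_j y_j exp(-2 pi i j k / n) for len(y) a power of two."""
    n = len(y)
    if n == 1:
        return [complex(y[0])]
    m = n // 2
    gamma = fft_normalized(y[0::2])        # length-m transform of the even-indexed values
    delta = fft_normalized(y[1::2])        # length-m transform of the odd-indexed values
    c = [0j] * n
    for k in range(m):
        w_k = cmath.exp(-2j * cmath.pi * k / n)
        c[k] = 0.5 * gamma[k] + 0.5 * w_k * delta[k]
        c[k + m] = 0.5 * gamma[k] - 0.5 * w_k * delta[k]
    return c

def dft_direct(y):
    """The same transform by the defining sum, for comparison; O(n^2) operations."""
    n = len(y)
    return [sum(y[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

y = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 4.0]
difference = max(abs(p - q) for p, q in zip(fft_normalized(y), dft_direct(y)))
print(difference)                          # expected to be at rounding level
```

Practical implementations avoid the recursion and the list slicing by working in place, which is the purpose of the binary index representation.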
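The actual numerical implementation is based on writing the indices $k$ and $j$ in a binary representation

$$k = [k_0, \ldots, k_{p-1}] = \sum_{q=0}^{p-1} k_q 2^q, \quad j = [j_0, \ldots, j_{p-1}] = \sum_{q=0}^{p-1} j_q 2^q$$

with $k_q, j_q \in \{0, 1\}$ for $q = 0, \ldots, p - 1$. Then

$$e^{-2\pi i jk/n} = \prod_{r=0}^{p-1} e^{-\frac{2\pi i}{2^{p-r}}\, j_r [k_0, \ldots, k_{p-r-1}]},$$

since

$$e^{-2\pi i\, 2^{q+r}/2^p} = 1$$

for $q + r \geq p$. Inserting this into (8.22), we can split the long sum into $p$ nested short sums, and the Fourier transform becomes

$$c_k = \frac{1}{n} \sum_{j_0=0}^{1} e^{-\frac{2\pi i}{2^p}\, j_0 [k_0, \ldots, k_{p-1}]} \times \cdots \times \sum_{j_{p-1}=0}^{1} e^{-\frac{2\pi i}{2}\, j_{p-1} k_0}\; y_{[j_0, \ldots, j_{p-1}]}.$$

Define the intermediate sums

$$S^q_{[j_0, \ldots, j_{p-q-1}, k_{q-1}, \ldots, k_0]} := \sum_{j_{p-q}=0}^{1} e^{-\frac{2\pi i}{2^q}\, j_{p-q} [k_0, \ldots, k_{q-1}]} \times \cdots \times \sum_{j_{p-1}=0}^{1} e^{-\frac{2\pi i}{2}\, j_{p-1} k_0}\; y_{[j_0, \ldots, j_{p-1}]}$$

for $q = 1, \ldots, p$ and $j_0, \ldots, j_{p-q-1}, k_{q-1}, \ldots, k_0 \in \{0, 1\}$. Then clearly,

$$c_{[k_0, \ldots, k_{p-1}]} = \frac{1}{n}\, S^p_{[k_{p-1}, \ldots, k_0]}, \tag{8.23}$$

and setting

$$S^0_{[j_0, \ldots, j_{p-1}]} := y_{[j_0, \ldots, j_{p-1}]}$$

we have the recursive relation

$$S^q_{[j_0, \ldots, j_{p-q-1}, k_{q-1}, \ldots, k_0]} = S^{q-1}_{[j_0, \ldots, j_{p-q-1}, 0, k_{q-2}, \ldots, k_0]} + S^{q-1}_{[j_0, \ldots, j_{p-q-1}, 1, k_{q-2}, \ldots, k_0]}\; e^{-\frac{2\pi i}{2^q} [k_0, \ldots, k_{q-1}]}$$

for $q = 1, \ldots, p$. Each step of these $p$ recursions requires $n$ additions and $n$ multiplications. Hence, the total computational cost is indeed of order $O(n \log_2 n)$. For more details of the actual numerical implementation and, in particular, on how to effectively perform the so-called bit reversal in order to arrange the result (8.23) in the natural order, we refer to [45].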
The error analysis for trigonometric interpolation is more complicated than the error analysis of the previous section for polynomial interpolation. Denote by $L_n : C[0, 2\pi] \to T_n$ the trigonometric interpolation operator that maps the function $f$ onto its trigonometric interpolation polynomial $L_n f$. For equidistant grids, by Problems 8.12 and 8.13 we have convergence $\|L_n f - f\|_2 \to 0$, $n \to \infty$, for each continuous $2\pi$-periodic function $f$ and $\|L_n f - f\|_\infty \to 0$, $n \to \infty$, for each continuously differentiable $2\pi$-periodic function $f$. For a detailed error analysis we refer to [49].

8.3 Spline Interpolation


As we have seen in our considerations of the convergence of interpolation polynomials, increasing the number of interpolation points, i.e., increasing the degree of the polynomials, does not always lead to an improvement in the approximation. The spline interpolation that we will study in this section remedies this deficiency of interpolation by high-degree polynomials through a piecewise polynomial interpolation of low degree.
A frequently used method of this type is piecewise linear interpolation. Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of the interval $[a, b]$. Then a given function $f \in C[a, b]$ can be approximated by a continuous piecewise linear function by linear interpolation on each of the subintervals, i.e., according to Example 8.12, by

$$s_n(x) := \frac{1}{x_j - x_{j-1}} \bigl[ f(x_{j-1})(x_j - x) + f(x_j)(x - x_{j-1}) \bigr], \quad x \in [x_{j-1}, x_j], \quad j = 1, \ldots, n.$$

From the error estimate (8.9) for linear interpolation, we see that for piecewise linear interpolation we have uniform convergence $\|s_n - f\|_\infty \to 0$ for $n \to \infty$ on $[a, b]$, provided that $h := \max_{j=1,\ldots,n} |x_j - x_{j-1}| \to 0$ and $f \in C^2[a, b]$. The main advantage of this method is its simplicity and its stability with respect to errors in the interpolation values. However, since by (8.9) linear interpolation has an error only of order $O(h^2)$, for achieving a prescribed accuracy it usually requires a much finer discretization than some of the higher-order methods described below.

Definition 8.26 Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of the interval $[a, b]$ and $m \in \mathbb{N}$. A function $s : [a, b] \to \mathbb{R}$ is called a spline of degree $m$ with respect to this subdivision if $s$ is $(m - 1)$-times continuously differentiable on $[a, b]$ and if the restriction of $s$ to each subinterval

$$x_+^m := \begin{cases} x^m, & x \geq 0, \\ 0, & x < 0, \end{cases}$$

for $m \in \mathbb{N}$. The $m + n$ functions

$$u_k(x) := (x - x_0)^k, \quad k = 0, \ldots, m, \qquad v_k(x) := (x - x_k)_+^m, \quad k = 1, \ldots, n - 1, \tag{8.24}$$

are linearly independent. In order to see this, let

$$\sum_{k=0}^{m} \alpha_k u_k + \sum_{k=1}^{n-1} \beta_k v_k = 0.$$

Then, in particular,

$$\sum_{k=0}^{m} \alpha_k (x - x_0)^k = 0, \quad x \in [x_0, x_1],$$

whence $\alpha_k = 0$ for $k = 0, \ldots, m$. Then we have

$$\beta_1 (x - x_1)^m = 0, \quad x \in [x_1, x_2],$$

and therefore $\beta_1 = 0$. Repeating this argument inductively, it follows that $\beta_k = 0$, $k = 1, \ldots, n - 1$.
To complete the proof, we need to show that each $s \in S_m^n$ can be expressed as a linear combination of the functions (8.24). Given a spline $s \in S_m^n$, by induction we show that there exist constants $\alpha_0, \ldots, \alpha_m$ and $\beta_1, \ldots, \beta_{n-1}$ such that

$$s(x) = \sum_{k=0}^{m} \alpha_k (x - x_0)^k + \sum_{k=1}^{j-1} \beta_k (x - x_k)_+^m, \quad x \in [x_0, x_j], \tag{8.25}$$

for $j = 1, \ldots, n$. This is true for $j = 1$, since on $[x_0, x_1]$ the spline $s$ coincides with an element of $P_m$. Now assume that we have the representation (8.25) for some $j \geq 1$. Then the difference

$$p(x) := s(x) - \sum_{k=0}^{m} \alpha_k (x - x_0)^k - \sum_{k=1}^{j-1} \beta_k (x - x_k)_+^m$$

restricted to the interval $[x_j, x_{j+1}]$ is in $P_m$. Since the spline $s$ is in $C^{m-1}[a, b]$ and $p$ vanishes on $[x_0, x_j]$ we have that

$$p^{(i)}(x_j) = 0, \quad i = 0, \ldots, m - 1.$$

Hence $p(x) = \beta_j (x - x_j)_+^m$ on $[x_j, x_{j+1}]$ for some constant $\beta_j$, and because $(x - x_j)_+^m = 0$ on $[x_0, x_j]$, the representation (8.25) is proven for $j + 1$. □
Since the spline space $S_m^n$ has dimension $m + n$, the $n + 1$ interpolation conditions at the points $x_0, \ldots, x_n$ are not sufficient to determine uniquely a spline of degree greater than one. Therefore, we need to add additional requirements in the form of conditions at the two endpoints $x_0 = a$ and $x_n = b$ of the interval. Since we want to divide the number of these end conditions equally between both ends, we consider only odd degrees $m$.
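Lemma 8.28 Let $m = 2\ell - 1$ with $\ell \in \mathbb{N}$ and $\ell \geq 2$, and let $f \in C^\ell[a, b]$. Assume that the spline $s \in S_m^n$ interpolates $f$, i.e.,

$$s(x_j) = f(x_j), \quad j = 0, \ldots, n, \tag{8.26}$$

and that it satisfies the boundary conditions

$$s^{(j)}(a) = f^{(j)}(a), \quad s^{(j)}(b) = f^{(j)}(b), \quad j = 1, \ldots, \ell - 1. \tag{8.27}$$

Then

$$\int_a^b \bigl[ f^{(\ell)}(x) - s^{(\ell)}(x) \bigr]^2 dx = \int_a^b \bigl[ f^{(\ell)}(x) \bigr]^2 dx - \int_a^b \bigl[ s^{(\ell)}(x) \bigr]^2 dx. \tag{8.28}$$

Proof. We have that

$$\int_a^b \bigl[ f^{(\ell)}(x) - s^{(\ell)}(x) \bigr]^2 dx = \int_a^b \bigl[ f^{(\ell)}(x) \bigr]^2 dx - \int_a^b \bigl[ s^{(\ell)}(x) \bigr]^2 dx - 2R,$$

where

$$R := \int_a^b \bigl[ f^{(\ell)}(x) - s^{(\ell)}(x) \bigr] s^{(\ell)}(x)\, dx.$$

Since $f \in C^\ell[a, b]$ and $s \in C^{m-1}[a, b]$ has piecewise continuous derivatives of order $m$, by $\ell - 1$ repeated partial integrations and using the boundary conditions (8.27) we obtain that

$$R = (-1)^{\ell-1} \int_a^b \bigl[ f'(x) - s'(x) \bigr] s^{(m)}(x)\, dx.$$

A further partial integration and the interpolation conditions now yield

$$R = (-1)^{\ell-1} \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} \bigl[ f'(x) - s'(x) \bigr] s^{(m)}(x)\, dx = (-1)^{\ell-1} \sum_{j=1}^{n} \bigl[ f(x) - s(x) \bigr] s^{(m)}(x) \Big|_{x_{j-1}}^{x_j} = 0,$$

since $s^{(m+1)} = 0$. This completes the proof. □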

Lemma 8.29 Under the assumptions of Lemma 8.28 let $f = 0$. Then $s = 0$.
Proof. For $f = 0$, from (8.28) it follows that

$$\int_a^b \bigl[ s^{(\ell)}(x) \bigr]^2 dx = 0.$$

This implies that $s^{(\ell)} = 0$, and therefore $s \in P_{\ell-1}$ on $[a, b]$. Now the boundary conditions $s^{(j)}(a) = 0$, $j = 0, \ldots, \ell - 1$, yield $s = 0$. □

From the proof it can be seen that Lemmas 8.28 and 8.29 remain valid if the boundary conditions (8.27) are replaced by

$$s^{(\ell+j)}(a) = s^{(\ell+j)}(b) = 0, \quad j = 0, \ldots, \ell - 2,$$

or, provided that $f$ is periodic with period $b - a$, by the periodicity condition

$$s^{(j)}(a) = s^{(j)}(b), \quad j = 1, \ldots, \ell - 1.$$

Consequently, the following conclusions drawn from Lemma 8.29 are also true for these two end conditions. However, from a practical point of view only the latter modification is of relevance.
Theorem 8.30 Let $m = 2\ell - 1$ with $\ell \in \mathbb{N}$ and $\ell \geq 2$. Then, given $n + 1$ values $y_0, \ldots, y_n$ and $m - 1$ boundary data $a_1, \ldots, a_{\ell-1}$ and $b_1, \ldots, b_{\ell-1}$, there exists a unique spline $s \in S_m^n$ satisfying the interpolation conditions

$$s(x_j) = y_j, \quad j = 0, \ldots, n, \tag{8.29}$$

and the boundary conditions

$$s^{(j)}(a) = a_j, \quad s^{(j)}(b) = b_j, \quad j = 1, \ldots, \ell - 1. \tag{8.30}$$

Proof. Representing the spline in the form (8.25), i.e.,

$$s(x) = \sum_{k=0}^{m} \alpha_k u_k + \sum_{k=1}^{n-1} \beta_k v_k, \tag{8.31}$$

it follows that the interpolation conditions (8.29) and boundary conditions (8.30) are satisfied if and only if the $m + n$ coefficients $\alpha_0, \ldots, \alpha_m$ and $\beta_1, \ldots, \beta_{n-1}$ solve the system

$$\begin{aligned}
\sum_{k=0}^{m} \alpha_k u_k(x_j) + \sum_{k=1}^{n-1} \beta_k v_k(x_j) &= y_j, & j &= 0, \ldots, n, \\
\sum_{k=0}^{m} \alpha_k u_k^{(j)}(a) + \sum_{k=1}^{n-1} \beta_k v_k^{(j)}(a) &= a_j, & j &= 1, \ldots, \ell - 1, \\
\sum_{k=0}^{m} \alpha_k u_k^{(j)}(b) + \sum_{k=1}^{n-1} \beta_k v_k^{(j)}(b) &= b_j, & j &= 1, \ldots, \ell - 1,
\end{aligned} \tag{8.32}$$

of $m + n$ linear equations. By Lemma 8.29 the homogeneous form of the system (8.32) has only the trivial solution. Therefore, the inhomogeneous system (8.32) is uniquely solvable, and the proof is finished. □

In principle, for the actual computation of the interpolating spline, it is possible to use the linear system (8.32). However, as a consequence of the global nature of the basis functions (8.24), this system turns out to be ill-conditioned. Therefore, it is preferable to use the corresponding linear system derived from another set of basis functions known as basic splines, or simply B-splines. As opposed to the splines (8.24), the B-splines have local support, i.e., they differ from zero only within $m + 1$ neighboring subintervals.
For the sake of simplicity we confine our analysis of B-splines to the case of an equidistant subdivision of step length $h$. We set

$$B_0(x) := \begin{cases} 1, & |x| \leq 0.5, \\ 0, & |x| > 0.5, \end{cases}$$

and define recursively

$$B_{m+1}(x) := \int_{x - \frac{1}{2}}^{x + \frac{1}{2}} B_m(y)\, dy, \quad x \in \mathbb{R}, \quad m = 0, 1, \ldots. \tag{8.33}$$

Then, by induction, it can be seen that the $B_m$ are $(m - 1)$-times continuously differentiable and nonnegative, vanish outside the interval $[-m/2 - 1/2, m/2 + 1/2]$, and reduce to a polynomial of degree $m$ in each of the intervals $[i, i + 1]$ for $m$ odd and $[i - 1/2, i + 1/2]$ for $m$ even for $i$ an integer; i.e., the $B_m$ are splines of order $m$ on an integer grid if $m$ is odd and on a half integer grid if $m$ is even.

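Elementary integrations show that

$$B_1(x) = \begin{cases} 1 - |x|, & |x| \leq 1, \\ 0, & |x| \geq 1, \end{cases} \tag{8.34}$$

$$B_2(x) = \frac{1}{2} \begin{cases} 2 - (|x| - 0.5)^2 - (|x| + 0.5)^2, & |x| \leq 0.5, \\ (|x| - 1.5)^2, & 0.5 \leq |x| \leq 1.5, \\ 0, & |x| \geq 1.5, \end{cases} \tag{8.35}$$

$$B_3(x) = \frac{1}{6} \begin{cases} (2 - |x|)^3 - 4(1 - |x|)^3, & |x| \leq 1, \\ (2 - |x|)^3, & 1 \leq |x| \leq 2, \\ 0, & |x| \geq 2. \end{cases} \tag{8.36}$$

Graphs of these B-splines are given in Figure 8.1.

FIGURE 8.1. B-splines $B_1$, $B_2$, and $B_3$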
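The closed forms (8.34) to (8.36) can be checked numerically against the recursion (8.33). In the sketch below the integral is approximated by a midpoint rule; the helper `next_bspline`, the number of quadrature steps, and the test points are assumptions made for this illustration only.

```python
def b1(x):
    return max(0.0, 1.0 - abs(x))

def b2(x):
    ax = abs(x)
    if ax <= 0.5:
        return 0.5 * (2.0 - (ax - 0.5) ** 2 - (ax + 0.5) ** 2)
    if ax <= 1.5:
        return 0.5 * (ax - 1.5) ** 2
    return 0.0

def b3(x):
    ax = abs(x)
    if ax <= 1.0:
        return ((2.0 - ax) ** 3 - 4.0 * (1.0 - ax) ** 3) / 6.0
    if ax <= 2.0:
        return (2.0 - ax) ** 3 / 6.0
    return 0.0

def next_bspline(b, x, steps=2000):
    """Midpoint-rule approximation of the integral of b over [x - 1/2, x + 1/2]."""
    h = 1.0 / steps
    return h * sum(b(x - 0.5 + (i + 0.5) * h) for i in range(steps))

gap_2 = abs(next_bspline(b1, 0.3) - b2(0.3))     # recursion applied to B_1 versus (8.35)
gap_3 = abs(next_bspline(b2, 0.7) - b3(0.7))     # recursion applied to B_2 versus (8.36)
print(gap_2, gap_3, b3(0.0), b3(1.0))
```

Both gaps should be at the level of the quadrature error, and the values $B_3(0) = 2/3$ and $B_3(\pm 1) = 1/6$ are the ones that enter the interpolation conditions for cubic splines.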

Theorem 8.31 For $m \in \mathbb{N} \cup \{0\}$ the B-splines

$$B_m(\cdot - k), \quad k = 0, \ldots, m, \tag{8.37}$$

are linearly independent on the interval $I_m := \left[ \frac{m-1}{2}, \frac{m+1}{2} \right]$.

Proof. This is trivial for $m = 0$, and we assume that it has been proven for degree $m - 1$ for some $m \geq 1$. Let

$$\sum_{k=0}^{m} \alpha_k B_m(x - k) = 0, \quad x \in I_m. \tag{8.38}$$

Then, with the aid of (8.33), differentiating (8.38) yields

$$\sum_{k=0}^{m} \alpha_k \left[ B_{m-1}\!\left( x - k + \frac{1}{2} \right) - B_{m-1}\!\left( x - k - \frac{1}{2} \right) \right] = 0, \quad x \in I_m.$$

Observing that the supports of $B_{m-1}(\cdot + \frac{1}{2})$ and $B_{m-1}(\cdot - m - \frac{1}{2})$ do not intersect with $I_m$, we can rewrite this as

$$\sum_{k=1}^{m} (\alpha_k - \alpha_{k-1})\, B_{m-1}\!\left( x - k + \frac{1}{2} \right) = 0, \quad x \in I_m,$$

whence $\alpha_k = \alpha_{k-1}$ for $k = 1, \ldots, m$ follows by the induction assumption; i.e., $\alpha_k = \alpha$ for $k = 0, \ldots, m$. Now (8.38) reads

$$\alpha \sum_{k=0}^{m} B_m(x - k) = 0, \quad x \in I_m,$$

and integrating this equation over the interval $I_m$ leads to

$$\alpha \int_{-\frac{m+1}{2}}^{\frac{m+1}{2}} B_m(x)\, dx = 0.$$

This finally implies $\alpha = 0$, since the $B_m$ are nonnegative, and the proof is finished. □

Corollary 8.32 Let $x_k = a + hk$, k = 0, ..., n, be an equidistant subdivision
of the interval [a, b] of step size h = (b - a)/n with $n \ge 2$, and let
$m = 2\ell - 1$ with $\ell \in \mathbb{N}$. Then the B-splines

$$B_{m,k}(x) := B_m\!\left(\frac{x - a - hk}{h}\right), \quad x \in [a, b], \tag{8.39}$$

for $k = -\ell + 1, \ldots, n + \ell - 1$ form a basis for $S_m^n$.

Proof. The n + m splines (8.39) belong to $S_m^n$, and by the preceding Theorem
8.31 they can be shown to be linearly independent on [a, b]. Hence,
the statement follows from Theorem 8.27. □

The use of the B-splines as a basis opens up another possibility for the
computation of an interpolating spline. We only consider the case m = 3,
i.e., cubic splines. From (8.36) we note that

$$B_3(0) = \frac23, \quad B_3(\pm 1) = \frac16, \quad B_3'(0) = 0, \quad B_3'(\pm 1) = \mp\frac12.$$

Therefore, the cubic spline

$$s(x) = \sum_{k=-1}^{n+1} \alpha_k B_3\!\left(\frac{x - x_k}{h}\right), \quad x \in [a, b], \tag{8.40}$$

satisfies the interpolation conditions (8.29) and the boundary conditions
(8.30) if and only if the n + 3 coefficients $\alpha_{-1}, \ldots, \alpha_{n+1}$ satisfy the system

$$\frac16\,\alpha_{j-1} + \frac23\,\alpha_j + \frac16\,\alpha_{j+1} = y_j, \quad j = 0, \ldots, n, \tag{8.41}$$

which, together with the two equations arising from the boundary conditions,
consists of n + 3 linear equations. Since the matrix of this system is irreducible and
weakly row-diagonally dominant, the solution can be obtained by Jacobi
iteration (see Theorem 4.7).
We conclude this section with an analysis of the interpolation error for
cubic splines and note that the results can be extended to arbitrary odd
degree. We begin with a convergence result for arbitrary subdivisions under
a weak regularity assumption on the interpolated function.

Theorem 8.33 Let $f : [a, b] \to \mathbb{R}$ be twice continuously differentiable and
let $s \in S_3^n$ be the uniquely determined cubic spline satisfying the interpolation
and boundary conditions of Lemma 8.28. Then

$$\|f - s\|_\infty \le \frac{h^{3/2}}{2}\,\|f''\|_2 \quad\text{and}\quad \|f' - s'\|_\infty \le h^{1/2}\,\|f''\|_2,$$

where $h := \max_{j=1,\ldots,n} |x_j - x_{j-1}|$.

Proof. The error function r := f - s has n + 1 zeros $x_0, \ldots, x_n$. Hence, the
distance between two consecutive zeros of r is less than or equal to h. By
Rolle's theorem, the derivative r' has n zeros with distance less than or
equal to 2h. Choose $z \in [a, b]$ such that $|r'(z)| = \|r'\|_\infty$. Then the closest
zero $\zeta$ of r' has distance $|\zeta - z| \le h$, and by the Cauchy-Schwarz inequality
we can estimate

$$|r'(z)|^2 = \left| \int_\zeta^z r''(y)\,dy \right|^2 \le h \int_a^b |r''(y)|^2\,dy = h\,\|r''\|_2^2.$$

From this, using Lemma 8.28 we obtain $\|r'\|_\infty \le \sqrt{h}\,\|f''\|_2$.

Choose $x \in [a, b]$ such that $|r(x)| = \|r\|_\infty$. Then the closest zero $\xi$ of r
has distance $|\xi - x| \le h/2$, and we can estimate

$$|r(x)| = \left| \int_\xi^x r'(y)\,dy \right| \le \frac{h}{2}\,\|r'\|_\infty \le \frac{h^{3/2}}{2}\,\|f''\|_2,$$

which concludes the proof. □


If we assume more regularity on f, we can improve on the order of
convergence. For this we need to derive an estimate on the second derivative
of the interpolating spline. From (8.36) it follows that

$$B_3''(0) = -2 \quad\text{and}\quad B_3''(\pm 1) = 1.$$

Hence, the cubic spline (8.40) has second derivatives given by the difference
formula

$$s''(x_j) = \frac{1}{h^2}\,[\alpha_{j-1} - 2\alpha_j + \alpha_{j+1}], \quad j = 0, \ldots, n. \tag{8.42}$$

From this we deduce that

for j = 1, ... , n - 1,

and

From this and the linear system (8.41), for the special case of the interpolation
conditions (8.26) and the boundary conditions (8.27), it follows that
the n + 1 values of s'' at the grid points satisfy the system

$$\begin{aligned} 4 s''(x_0) + 2 s''(x_1) &= F_0, \\ s''(x_{j-1}) + 4 s''(x_j) + s''(x_{j+1}) &= F_j, \quad j = 1, \ldots, n - 1, \\ 2 s''(x_{n-1}) + 4 s''(x_n) &= F_n, \end{aligned} \tag{8.43}$$

of n + 1 linear equations with right-hand sides

$$\begin{aligned} F_0 &:= \frac{12}{h^2}\,[-f(x_0) + f(x_1) - h f'(x_0)], \\ F_j &:= \frac{6}{h^2}\,[f(x_{j-1}) - 2 f(x_j) + f(x_{j+1})], \quad j = 1, \ldots, n - 1, \\ F_n &:= \frac{12}{h^2}\,[f(x_{n-1}) - f(x_n) + h f'(x_n)]. \end{aligned}$$

From the system (8.43) we can conclude that

$$4\,|s''(x_j)| \le |F_j| + 2 \max_{k=0,\ldots,n} |s''(x_k)|, \quad j = 0, \ldots, n,$$

and therefore

$$\max_{j=0,\ldots,n} |s''(x_j)| \le \frac12 \max_{j=0,\ldots,n} |F_j|. \tag{8.44}$$

If f is twice continuously differentiable, by Taylor's formula we can estimate

$$\max\{|F_0|, |F_n|\} \le 6\,\|f''\|_\infty.$$

From Example 8.12, applied to the remainder in the linear interpolation of
$f(x_j)$ from $f(x_{j-1})$ and $f(x_{j+1})$, we obtain

$$|F_j| \le 6\,\|f''\|_\infty, \quad j = 1, \ldots, n - 1.$$

Hence, since s'' is piecewise linear, from (8.44) it follows that

$$\|s''\|_\infty \le 3\,\|f''\|_\infty. \tag{8.45}$$

Theorem 8.34 Let $f : [a, b] \to \mathbb{R}$ be four-times continuously differentiable
and let $s \in S_3^n$ be the uniquely determined cubic spline satisfying the
interpolation and boundary conditions of Lemma 8.28 for an equidistant
subdivision with step width h. Then

$$\|f - s\|_\infty \le \frac{h^4}{16}\,\|f^{(4)}\|_\infty.$$

Proof. By $L_1 : C[a, b] \to S_1^n$ we denote the interpolation operator mapping
$g \in C[a, b]$ onto its uniquely determined piecewise linear interpolation.
From Example 8.12 we obtain for the error r := f - s that

$$\|r\|_\infty = \|r - L_1 r\|_\infty \le \frac{h^2}{8}\,\|r''\|_\infty,$$

since trivially $L_1 r = 0$.

By integration, we choose a function w such that $w'' = L_1 f''$. Applying
the estimate (8.45) for the cubic spline s - w and using the estimate (8.9)
for the piecewise linear interpolation of f'', we obtain

$$\|f'' - s''\|_\infty \le \|f'' - L_1 f''\|_\infty + \|L_1 f'' - s''\|_\infty \le 4\,\|f'' - L_1 f''\|_\infty \le \frac{h^2}{2}\,\|f^{(4)}\|_\infty.$$

By piecing together the last two inequalities we obtain the assertion of the
theorem. □

8.4 Bezier Polynomials


In this section we want to introduce some of the basic ideas of computer-
aided geometric design. We will confine our presentation to planar (and
spatial) curves, i.e., to subsets $\Gamma \subset \mathbb{R}^m$, m = 2, 3, that can be described
by a continuous mapping $f : D \to \mathbb{R}^m$ of an interval $D \subset \mathbb{R}$ into $\mathbb{R}^m$.
For the purposes of computer-aided geometric design it is essential that
the geometric objects can be visualized and manipulated on the computer
very effectively and rapidly. This, in particular, makes it essential that
the parameters entering the representation of the curves have a geometric
meaning. The latter property, for example, is not fulfilled by polynomial
curves represented through the classical monomial basis.
Definition 8.35 For $n \in \mathbb{N} \cup \{0\}$, we denote by $P_n^m$ the linear space of
polynomials of the form

$$p(x) = \sum_{k=0}^{n} a_k x^k, \quad x \in \mathbb{R},$$

where $a_0, \ldots, a_n \in \mathbb{R}^m$. A polynomial $p \in P_n^m$ is said to be of degree n if
$a_n \ne 0$.
We proceed by introducing a basis for polynomials on an interval [a, b]
in IR with a < b that is better suited for the purposes of computer-aided
design than the monomial basis. For this we make use of the fact that by
the affine linear transformation
$$x \mapsto t(x) := \frac{x - a}{b - a} \tag{8.46}$$

the interval [a, b] can be mapped onto the interval [0, 1]. By the binomial
formula we have that

$$1 = [t + (1 - t)]^n = \sum_{k=0}^{n} \binom{n}{k} t^k (1 - t)^{n-k}.$$

The terms in this partition of unity are called Bernstein polynomials for
the interval [0, 1]. From these, the Bernstein polynomials for the interval
[a, b] are obtained via the transformation (8.46).

Definition 8.36 The Bernstein polynomials of degree n for the interval
[0, 1] are given by

$$B_k^n(t) := \binom{n}{k} t^k (1 - t)^{n-k}, \quad k = 0, \ldots, n. \tag{8.47}$$

Correspondingly, the polynomials

$$B_k^n(x; a, b) := B_k^n\!\left(\frac{x - a}{b - a}\right) = \frac{1}{(b - a)^n} \binom{n}{k} (x - a)^k (b - x)^{n-k}, \quad k = 0, \ldots, n,$$

are called Bernstein polynomials of degree n for the interval [a, b].
Some basic properties of Bernstein polynomials are described in the fol-
lowing theorem.
Theorem 8.37 The Bernstein polynomials are nonnegative on [0, 1] and
provide a partition of unity; i.e.,

$$B_k^n(t) \ge 0, \quad t \in [0, 1], \quad k = 0, \ldots, n, \tag{8.48}$$

and

$$\sum_{k=0}^{n} B_k^n(t) = 1, \quad t \in \mathbb{R}. \tag{8.49}$$

They satisfy the relations

$$B_k^n(t) = B_{n-k}^n(1 - t), \quad k = 0, \ldots, n, \tag{8.50}$$

and

$$B_0^n(t) = (1 - t)\,B_0^{n-1}(t), \quad B_n^n(t) = t\,B_{n-1}^{n-1}(t) \tag{8.51}$$

for all $t \in \mathbb{R}$ and $n \in \mathbb{N}$. The point t = 0 is a zero of $B_k^n$ of order k, and
t = 1 is a zero of order n - k. Each of the polynomials $B_k^n$ assumes its
maximum value on [0, 1] only at t = k/n. They satisfy the recursion relation

$$B_k^n(t) = t\,B_{k-1}^{n-1}(t) + (1 - t)\,B_k^{n-1}(t) \tag{8.52}$$

for $n \in \mathbb{N}$ and k = 1, ..., n - 1. The polynomials $B_0^n, \ldots, B_n^n$ form a basis
of $P_n$.
Proof. The first five properties are obvious. The statement on the maximum
of $B_k^n$ is a consequence of

$$\frac{d}{dt} B_k^n(t) = \binom{n}{k} t^{k-1} (1 - t)^{n-k-1} (k - nt), \quad k = 0, \ldots, n.$$

The recursion formula (8.52) follows from the definition (8.47) and the
recursion formula

$$\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}$$

for the binomial coefficients. In order to show that the n + 1 polynomials
$B_0^n, \ldots, B_n^n$ of degree n provide a basis of $P_n$, we prove that they are linearly
independent. Let

$$\sum_{k=0}^{n} b_k B_k^n(t) = 0, \quad t \in [0, 1].$$

Then

$$\sum_{k=0}^{n} b_k \frac{d^j}{dt^j} B_k^n(t) = 0, \quad t \in [0, 1],$$

and therefore

$$\sum_{k=0}^{j} b_k \frac{d^j}{dt^j} B_k^n(0) = 0, \quad j = 0, \ldots, n,$$

since t = 0 is a zero of $B_k^n$ of order k. From this, by induction we find that
$b_n = \cdots = b_0 = 0$. □
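The properties collected in Theorem 8.37 are easy to check numerically. The following is a minimal sketch, assuming scalar arguments and the definition (8.47); the function name `bernstein` is illustrative and not part of the text.

```python
from math import comb


def bernstein(n, k, t):
    """Bernstein polynomial B_k^n(t) on [0, 1] as defined in (8.47)."""
    return comb(n, k) * t**k * (1 - t) ** (n - k)


def check_properties(n, t, tol=1e-12):
    """Return True if (8.49), (8.50), and (8.52) hold at t up to rounding."""
    unity = abs(sum(bernstein(n, k, t) for k in range(n + 1)) - 1.0) < tol
    symmetry = all(
        abs(bernstein(n, k, t) - bernstein(n, n - k, 1 - t)) < tol
        for k in range(n + 1)
    )
    recursion = all(
        abs(
            bernstein(n, k, t)
            - (t * bernstein(n - 1, k - 1, t) + (1 - t) * bernstein(n - 1, k, t))
        )
        < tol
        for k in range(1, n)
    )
    return unity and symmetry and recursion
```

A call such as `check_properties(5, 0.3)` exercises the partition of unity, the symmetry relation, and the recursion at a single point; it is a spot check, not a proof.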

Definition 8.38 The coefficients $b_0, \ldots, b_n \in \mathbb{R}^m$ in the representation of
a polynomial $p \in P_n^m$ through the Bernstein basis

$$p(x) = \sum_{k=0}^{n} b_k B_k^n(x; a, b), \quad x \in [a, b], \tag{8.53}$$

are called control points, or Bezier points, of p. The polygon determined by
them is called the Bezier polygon.

We now want to indicate that the graph of the polynomial p is closely
related to the form of the Bezier polygon, and for this reason the graph
of p is often referred to as a Bezier curve. We first note that $p(a) = b_0$
and $p(b) = b_n$; i.e., the endpoints of the Bezier curve and of the Bezier
polygon coincide. Furthermore, from (8.49) it follows that the Bezier curve
is contained in the convex hull $\operatorname{con}\{b_0, \ldots, b_n\}$ of the Bezier points. The
convex hull

$$\operatorname{con}\{b_0, \ldots, b_n\} := \left\{ \sum_{k=0}^{n} \alpha_k b_k : \alpha_k \ge 0, \ \sum_{k=0}^{n} \alpha_k = 1 \right\}$$

is the smallest convex set containing the points $b_0, \ldots, b_n$ (see Problem
8.19).
For computing the derivatives of a Bezier curve we first note that

$$\frac{d}{dt}\left[ t^k (1 - t)^{n-k} \right] = k\,t^{k-1} (1 - t)^{n-k} - (n - k)\,t^k (1 - t)^{n-k-1}$$

implies that

$$\frac{d}{dt} B_k^n(t) = \begin{cases} -n\,B_0^{n-1}(t), & k = 0, \\ n\,[B_{k-1}^{n-1}(t) - B_k^{n-1}(t)], & k = 1, \ldots, n - 1, \\ n\,B_{n-1}^{n-1}(t), & k = n. \end{cases} \tag{8.54}$$
With this identity we are ready to establish the following theorem.
Theorem 8.39 Let

$$p(t) = \sum_{k=0}^{n} b_k B_k^n(t), \quad t \in [0, 1],$$

be a Bezier polynomial on [0, 1]. Then

$$p^{(j)}(t) = \frac{n!}{(n - j)!} \sum_{k=0}^{n-j} \Delta^j b_k\, B_k^{n-j}(t), \quad j = 1, \ldots, n,$$

with the forward differences $\Delta^j b_k$ recursively defined by

$$\Delta^0 b_k := b_k, \quad \Delta^j b_k := \Delta^{j-1} b_{k+1} - \Delta^{j-1} b_k, \quad j = 1, \ldots, n.$$

Proof. Obviously, the statement is true for j = 0. We assume that it has
been proven for some $0 \le j < n$. Then with the aid of (8.54) we obtain

$$\begin{aligned} p^{(j+1)}(t) &= \frac{n!}{(n - j - 1)!} \left\{ \sum_{k=1}^{n-j} \Delta^j b_k\, B_{k-1}^{n-j-1}(t) - \sum_{k=0}^{n-j-1} \Delta^j b_k\, B_k^{n-j-1}(t) \right\} \\ &= \frac{n!}{[n - (j + 1)]!} \sum_{k=0}^{n-(j+1)} \Delta^{j+1} b_k\, B_k^{n-(j+1)}(t), \end{aligned}$$

which establishes the assertion for j + 1. □


Corollary 8.40 The polynomial from Theorem 8.39 has the derivatives

$$p^{(j)}(0) = \frac{n!}{(n - j)!}\,\Delta^j b_0, \quad p^{(j)}(1) = \frac{n!}{(n - j)!}\,\Delta^j b_{n-j}$$

at the two endpoints.
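The derivative formula of Theorem 8.39 translates directly into a few lines of code. The sketch below assumes scalar control points for brevity (vector-valued control points would be handled componentwise); the names `forward_differences` and `bezier_derivative` are illustrative and not taken from the text.

```python
from math import comb, factorial


def bernstein(n, k, t):
    """Bernstein polynomial B_k^n(t) from (8.47)."""
    return comb(n, k) * t**k * (1 - t) ** (n - k)


def forward_differences(b, j):
    """Return the list of j-th forward differences of the coefficients b."""
    d = list(b)
    for _ in range(j):
        d = [d[k + 1] - d[k] for k in range(len(d) - 1)]
    return d


def bezier_derivative(b, j, t):
    """j-th derivative at t of p = sum_k b[k] * B_k^n, following Theorem 8.39.

    For j = 0 this simply evaluates p(t).
    """
    n = len(b) - 1
    d = forward_differences(b, j)
    factor = factorial(n) // factorial(n - j)
    return factor * sum(d[k] * bernstein(n - j, k, t) for k in range(n - j + 1))
```

At the endpoints only one Bernstein polynomial is nonzero, so `bezier_derivative(b, j, 0.0)` and `bezier_derivative(b, j, 1.0)` reduce to the expressions of Corollary 8.40.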


From Corollary 8.40 we note that $p^{(j)}(0)$ depends only on $b_0, \ldots, b_j$ and
that $p^{(j)}(1)$ depends only on $b_{n-j}, \ldots, b_n$. In particular, we have that

$$p'(0) = n\,(b_1 - b_0), \quad p'(1) = n\,(b_n - b_{n-1}); \tag{8.55}$$

i.e., at the two endpoints the Bezier curve has the same tangent lines as
the Bezier polygon. Through the affine transformation (8.46) these results
on the derivatives carry over to the general interval [a, b].

FIGURE 8.2. Bezier polynomials of degree two

Figure 8.2 illustrates by two Bezier polynomials of degree two in $\mathbb{R}^2$ how
the shape of the curve is influenced by the location of the control points $b_k$.
From (8.55) we also observe how to patch two Bezier polynomials of degree
two together smoothly such that the tangent lines at the joints coincide,
i.e., such that the two polynomials match up to a Bezier spline of degree
two. The Bezier polynomials have the same tangent lines at the joints if
the Bezier polygons do. This is illustrated by Figure 8.3.

FIGURE 8.3. Bezier spline of degree two

We will conclude this section by describing the de Casteljau algorithm
as a very stable and fast method for computing the function values p(t) of
a Bezier polynomial. Given a Bezier polynomial

$$p(t) = \sum_{k=0}^{n} b_k B_k^n(t), \quad t \in [0, 1],$$

we define the subpolynomials $b_i^k \in P_k^m$ by

$$b_i^k(t) := \sum_{j=0}^{k} b_{i+j} B_j^k(t) \tag{8.56}$$

for i = 0, ..., n - k and k = 0, ..., n. For polynomials on [a, b] we have
an analogous definition for the subpolynomials. The subpolynomial $b_i^k$ is
a polynomial of degree k and has the k + 1 control points $b_i, \ldots, b_{i+k}$.
In particular, we have that $b_0^n = p$. Analogous to the Neville scheme of
Theorem 8.9 we have the following recursion formula, which is the basis of
the de Casteljau algorithm.
Theorem 8.41 The subpolynomials $b_i^k$ of a Bezier polynomial p of degree
n satisfy the recursion formulae

$$b_i^k(t) = (1 - t)\,b_i^{k-1}(t) + t\,b_{i+1}^{k-1}(t) \tag{8.57}$$

for i = 0, ..., n - k and k = 1, ..., n.
Proof. We insert the recursion formulae (8.51) and (8.52) for the Bernstein
polynomials into the definition (8.56) for the subpolynomials and obtain

$$\begin{aligned} b_i^k(t) &= b_i B_0^k(t) + \sum_{j=1}^{k-1} b_{i+j} B_j^k(t) + b_{i+k} B_k^k(t) \\ &= \sum_{j=0}^{k-1} b_{i+j} (1 - t) B_j^{k-1}(t) + \sum_{j=1}^{k} b_{i+j}\, t\, B_{j-1}^{k-1}(t) \\ &= (1 - t)\,b_i^{k-1}(t) + t\,b_{i+1}^{k-1}(t), \end{aligned}$$

which establishes (8.57). □

Since $b_0^n(t) = p(t)$, starting the recursion with $b_k^0(t) = b_k$, from (8.57) we
can compute p(t) by successive convex combinations of the Bezier points
$b_0, \ldots, b_n$, which clearly is a numerically stable procedure. Since (8.57) is
similar in structure to the divided differences in Definition 8.4, the compu-
tations can be arranged in a tableau analogous to the one for the divided
differences.
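The recursion (8.57) can be sketched in a few lines. The following is an illustrative implementation only, assuming the control points are given as tuples of numbers and $0 \le t \le 1$; the function name `de_casteljau` is not taken from the text.

```python
def de_casteljau(points, t):
    """Evaluate the Bezier polynomial with the given control points at t.

    Each pass replaces the current column of the tableau, b_i^{k-1}(t), by the
    convex combinations b_i^k(t) = (1 - t) b_i^{k-1}(t) + t b_{i+1}^{k-1}(t).
    """
    level = [tuple(float(c) for c in p) for p in points]
    while len(level) > 1:
        level = [
            tuple((1 - t) * u + t * v for u, v in zip(p, q))
            for p, q in zip(level, level[1:])
        ]
    return level[0]
```

Only one column of the tableau is kept in memory at a time, so the evaluation needs on the order of n^2 convex combinations and storage proportional to n.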
From the coefficients of the de Casteljau tableau we can construct two
Bezier polynomials on the subintervals [0, t] and [t,l] that coincide with
the original Bezier polynomial on the full interval [0,1].
Theorem 8.42 The Bezier polynomials

$$p_1(x) := \sum_{k=0}^{n} b_0^k(t)\, B_k^n(x; 0, t) \quad\text{and}\quad p_2(x) := \sum_{k=0}^{n} b_k^{n-k}(t)\, B_k^n(x; t, 1)$$

with the coefficients $b_0^k$ and $b_k^{n-k}$ for k = 0, ..., n defined by the recursion
(8.57) satisfy

$$p(x) = p_1(x) = p_2(x), \quad x \in \mathbb{R},$$

for arbitrary 0 < t < 1.
Proof. Inserting the equivalent definition (8.56) of the subpolynomials and
reordering the summation, we find that

$$p_1(x) = \sum_{k=0}^{n} \sum_{j=0}^{k} b_j B_j^k(t)\, B_k^n(x; 0, t) = \sum_{j=0}^{n} b_j \sum_{k=j}^{n} B_j^k(t)\, B_k^n(x; 0, t).$$

Hence the proof will be concluded by showing that

$$\sum_{k=j}^{n} B_j^k(t)\, B_k^n(x; 0, t) = B_j^n(x), \quad x \in \mathbb{R}. \tag{8.58}$$

To establish this identity we make use of Definition 8.36 and obtain with
the aid of the binomial formula that

$$\begin{aligned} \sum_{k=j}^{n} B_j^k(t)\, B_k^n(x; 0, t) &= \sum_{k=j}^{n} \binom{k}{j} (1 - t)^{k-j}\, t^{j-n} \binom{n}{k} x^k (t - x)^{n-k} \\ &= \binom{n}{j} x^j (1 - x)^{n-j}. \end{aligned}$$

Hence (8.58) is valid, and consequently $p_1 = p$. The proof of $p_2 = p$ is
completely analogous, and it can also be obtained by a symmetry argument
from $p_1 = p$. □

A natural choice in the subdivision of Theorem 8.42 is to break the


interval in half by taking t = 1/2. Successively repeating the subdivision
leads to a sequence of Bezier polygons that converges rapidly enough to
the original Bezier curve to make this subdivision algorithm practical for
an effective visualization of the curve on a computer.
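The subdivision of Theorem 8.42 falls out of the de Casteljau tableau: the first entries of the columns give the control points $b_0^k(t)$ of $p_1$, and the last entries give the control points $b_k^{n-k}(t)$ of $p_2$. The following is a minimal sketch under the same assumptions as before (control points as tuples, 0 < t < 1); the names `de_casteljau` and `subdivide` are illustrative.

```python
def _combine(level, t):
    """One de Casteljau step (8.57) applied to a whole column of the tableau."""
    return [
        tuple((1 - t) * u + t * v for u, v in zip(p, q))
        for p, q in zip(level, level[1:])
    ]


def de_casteljau(points, t):
    """Evaluate the Bezier polynomial with the given control points at t."""
    level = [tuple(float(c) for c in p) for p in points]
    while len(level) > 1:
        level = _combine(level, t)
    return level[0]


def subdivide(points, t):
    """Return the control points of p_1 on [0, t] and of p_2 on [t, 1]."""
    level = [tuple(float(c) for c in p) for p in points]
    left = [level[0]]    # b_0^0, b_0^1, ..., b_0^n
    right = [level[-1]]  # b_n^0, b_{n-1}^1, ..., b_0^n (reversed below)
    while len(level) > 1:
        level = _combine(level, t)
        left.append(level[0])
        right.append(level[-1])
    right.reverse()
    return left, right
```

Applying `subdivide` repeatedly with t = 1/2 to each half produces the refined Bezier polygons mentioned above; plotting those polygons is one simple way to draw the curve.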

Problems
8.1 Let $u_1, \ldots, u_n \in C[a, b]$ be linearly independent and let $x_1, \ldots, x_n \in [a, b]$
be distinct. For given values $y_1, \ldots, y_n \in \mathbb{R}$ consider the interpolation problem
of finding a function $u \in U_n := \operatorname{span}\{u_1, \ldots, u_n\}$ with the property

$$u(x_j) = y_j, \quad j = 1, \ldots, n.$$

Show that the following three properties are equivalent:

(a) The interpolation problem is uniquely solvable for each given set of values
$y_1, \ldots, y_n \in \mathbb{R}$.
(b) Each function $u \in U_n$ with zeros $u(x_j) = 0$ for j = 1, ..., n vanishes identically.
(c) The n x n matrix with entries $u_k(x_j)$ for j, k = 1, ..., n is regular.

8.2 Consider the interpolation of $f(x) := x^4$ by a polynomial $p \in P_3$ with the
four interpolation points -1, 0, 1, 2. Discuss the behavior of the error p - f in the
interval [-1, 2].

8.3 Write a computer program for the Neville scheme of Theorem 8.9.

8.4 Show that the interpolation operator $L_n : C[a, b] \to P_n$ given by (8.7) is a
linear operator. Show that it is a bounded operator if both the domain and range
space are equipped with the maximum norm.

8.5 Let $x_0, \ldots, x_n \in \mathbb{R}$ be n + 1 distinct points. Show that the Vandermonde
matrix V with entries $x_j^k$ for j, k = 0, 1, ..., n has determinant

$$\det V = \prod_{0 \le j < k \le n} (x_k - x_j).$$

8.6 Verify numerically the findings of Runge described in Example 8.14.

8.7 Verify the relations (8.12) for the Hermite factors.

8.8 Prove Theorem 8.19, i.e., the representation of the remainder in Hermite
interpolation.

8.9 Given a twice continuously differentiable function $f : [a, b] \to \mathbb{R}$ and three
points $x_0, x_1, x_2 \in [a, b]$ with $x_0 \ne x_2$, show that there exists a unique polynomial
$p \in P_3$ for which

$$p(x_0) = f(x_0), \quad p'(x_1) = f'(x_1), \quad p''(x_1) = f''(x_1), \quad p(x_2) = f(x_2).$$

Find a representation of the polynomial and give a representation of the
remainder analogous to Theorem 8.10. (This is an example of Hermite-Birkhoff
interpolation.)

8.10 Inverse interpolation can be used to solve nonlinear equations f(x) = 0
approximately by interchanging the roles of interpolation points and interpolation
values. Find an approximation of the zero x = 1.5 for $f(x) = (4x + 1)^3 - 343$ from
the values of f at the four points x = 0, 1, 2, 3 by inverse cubic interpolation, i.e.,
by interpolating the inverse of f by a cubic polynomial with interpolation points
f(0), f(1), f(2), f(3) and interpolation values 0, 1, 2, 3. For the computation use
the Neville scheme. Are you satisfied with the accuracy of the result?

8.11 For the trigonometric interpolation from Theorem 8.24 with 2n + 1 equidistant
interpolation points show that the Lagrange factors are given by

where

$$F(t) := \frac{1}{2n + 1}\,\frac{\sin\left(n + \frac12\right) t}{\sin\frac{t}{2}}$$

for $t \ne 0, \pm 2\pi, \pm 4\pi, \ldots$. Prove that

8.12 For the trigonometric interpolation from Theorem 8.24 with 2n + 1 equidistant
interpolation points show that

$$\|L_n f - f\|_2 \to 0, \quad n \to \infty,$$

for each continuous $2\pi$-periodic function f.
Hint: With the aid of Problem 8.11, show that

for all $n \in \mathbb{N}$ and all continuous $2\pi$-periodic functions f and use the Weierstrass
approximation theorem for periodic functions.

8.13 For the trigonometric interpolation from Theorem 8.24 with 2n + 1 equidistant
interpolation points show that

$$\|L_n f - f\|_\infty \to 0, \quad n \to \infty,$$

for each continuously differentiable $2\pi$-periodic function f.
Hint: For the functions $f_k(t) := e^{ikt}$ show that

for n = 1, 2, ... and $k = 0, \pm 1, \pm 2, \ldots$, and use the fact that the Fourier series
for continuously differentiable functions is uniformly convergent.

8.14 Write a computer program for the fast Fourier transform.

8.15 Given n distinct points $z_1, \ldots, z_n \notin [a, b]$, n distinct points $x_1, \ldots, x_n$ in
[a, b], and n values $y_1, \ldots, y_n \in \mathbb{R}$, show that there exists a unique function of
the form

$$u(x) = \sum_{k=1}^{n} \frac{a_k}{x - z_k}$$

with real coefficients $a_1, \ldots, a_n$ such that

$$u(x_j) = y_j, \quad j = 1, \ldots, n.$$


8.16 Verify the relations (8.34)-(8.36) for B-splines.

8.17 Use the fact that the second derivative of a cubic spline is a piecewise linear
function to derive the linear system (8.43) without using the B-spline (8.36).
Hint: On each subinterval integrate the piecewise linear function for s" twice and
eliminate the integration constants through the interpolation conditions. Then
use the continuity of s' to obtain the linear system.

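8.18 For the Bernstein polynomials show that

$$\sum_{k=0}^{n} \frac{k}{n}\,B_k^n(t) = t, \quad t \in \mathbb{R},$$

and

$$\sum_{k=0}^{n} \frac{k^2}{n^2}\,B_k^n(t) = \frac{n - 1}{n}\,t^2 + \frac{t}{n}, \quad t \in \mathbb{R}.$$

8.19 Show that the convex hull

$$\operatorname{con}\{b_0, \ldots, b_n\} := \left\{ \sum_{k=0}^{n} \alpha_k b_k : \alpha_k \ge 0, \ \sum_{k=0}^{n} \alpha_k = 1 \right\}$$

of n + 1 points $b_0, \ldots, b_n \in \mathbb{R}^m$ is convex and that $\operatorname{con}\{b_0, \ldots, b_n\} \subset U$ for each
convex set U with $b_0, \ldots, b_n \in U$.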

8.20 Give the Bezier representation of the (cubic) Hermite factors of Theorem
8.18 for the case of two interpolation points. Draw the graphs of the Hermite
factors and their Bezier polygons.
9
Numerical Integration

Numerical integration formulae, or quadrature formulae, are methods for


the approximate evaluation of definite integrals. They are needed for the
computation of those integrals for which either the antiderivative of the in-
tegrand cannot be expressed in terms of elementary functions or for which
the integrand is available only at discrete points, for example from exper-
imental data. In addition and even more important, quadrature formulae
provide a basic and important tool for the numerical solution of differential
and integral equations, as we shall see in Chapters 10, 11, and 12.
The evaluation of planar areas bounded by curves is one of the oldest
problems in science. Attempts to measure the area bounded by circles,
ellipses, and parabolas were undertaken already by the Babylonians, Egyp-
tians, and Greeks. However, a systematic analysis only became possible
after the invention of calculus. Newton interpolated functions at equidis-
tant points and integrated the interpolating polynomial and thus invented
what now is known as the Newton-Cotes quadratures. Describing these in-
terpolatory quadrature formulae will be the subject of Sections 9.1 and 9.2.
Gauss was the first to notice that nonequidistant interpolation points lead,
in general, to better accuracy for the resulting approximations to the inte-
grals. In 1814 he presented a paper entitled "Methodus nova integralium
valores per approximationem inveniendi" introducing quadrature formulae
with the degree of accuracy considerably improved as compared with the
Newton-Cotes formulae. These Gaussian quadrature formulae will be the
subject of Section 9.3. The remaining part of this chapter is based on the
Euler-Maclaurin expansion, which was found and published independently
by Euler (1738) and Maclaurin (1737). We shall first employ the Euler-Maclaurin
expansion in our analysis of numerical integration of periodic
functions. We will then use it to develop Romberg integration as a typical
example for the use of the extrapolation method in order to increase the de-
gree of accuracy. And finally, for integrands with endpoint singularities we
will describe quadrature formulae that are based on a mesh that is graded
towards the endpoints, and we will analyze the error with the help of the
Euler-Maclaurin expansion.
For a comprehensive study of numerical integration methods including
multidimensional integration we refer to [9, 17, 21, 57].

9.1 Interpolatory Quadratures


The most common quadrature formulae approximate the definite integral

$$Q(f) := \int_a^b f(x)\,dx \tag{9.1}$$

of a continuous function f over the interval [a, b] with a < b by a weighted
sum

$$Q_n(f) := \sum_{k=0}^{n} a_k f(x_k) \tag{9.2}$$

with n + 1 distinct quadrature points $x_0, \ldots, x_n \in [a, b]$ and quadrature
weights $a_0, \ldots, a_n \in \mathbb{R}$. As one of the main applications of interpolation
as developed in the previous chapter, an important group of quadrature
formulae is obtained by integrating an interpolating polynomial instead of
the integrand f, i.e., by approximating

$$\int_a^b f(x)\,dx \approx \int_a^b (L_n f)(x)\,dx,$$

where $L_n : C[a, b] \to P_n$ denotes the polynomial interpolation operator
with interpolation points $x_0, \ldots, x_n$ introduced in Section 8.1 (see (8.7)).
Note that both the integral Q and the quadrature formula $Q_n$ represent
linear operators from C[a, b] into $\mathbb{R}$.
Theorem 9.1 The polynomial interpolatory quadrature of order n defined
by

$$Q_n(f) := \int_a^b (L_n f)(x)\,dx \tag{9.3}$$

is of the form (9.2) with the weights given by

$$a_k = \frac{1}{q_{n+1}'(x_k)} \int_a^b \frac{q_{n+1}(x)}{x - x_k}\,dx, \quad k = 0, \ldots, n, \tag{9.4}$$

where $q_{n+1}(x) := (x - x_0) \cdots (x - x_n)$.

Proof. From (8.2) we obtain

$$\int_a^b (L_n f)(x)\,dx = \sum_{k=0}^{n} a_k f(x_k)$$

with

$$a_k = \int_a^b \ell_k(x)\,dx = \int_a^b \prod_{\substack{j=0 \\ j \ne k}}^{n} \frac{x - x_j}{x_k - x_j}\,dx,$$

whence (9.4) follows by rewriting the product. □

The following theorem describes an equivalent definition of polynomial


interpolatory quadratures.
Theorem 9.2 Given n + 1 distinct quadrature points $x_0, \ldots, x_n \in [a, b]$,
the interpolatory quadrature (9.3) of order n is uniquely determined by its
property of integrating all polynomials $p \in P_n$ exactly, i.e., by the property

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b p(x)\,dx \tag{9.5}$$

for all $p \in P_n$.
Proof. From (9.3) and $L_n p = p$ for all $p \in P_n$ it follows that

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b (L_n p)(x)\,dx = \int_a^b p(x)\,dx;$$

i.e., the quadrature is exact for all $p \in P_n$. On the other hand, from (9.5)
we obtain

$$\sum_{k=0}^{n} a_k f(x_k) = \sum_{k=0}^{n} a_k (L_n f)(x_k) = \int_a^b (L_n f)(x)\,dx$$

for all $f \in C[a, b]$; i.e., the quadrature is an interpolatory quadrature. □

Theorem 9.3 The polynomial interpolatory quadrature of order n with
equidistant quadrature points

$$x_k = a + kh, \quad k = 0, \ldots, n,$$

and step width h = (b - a)/n is called the Newton-Cotes quadrature formula
of order n. Its weights are given by

$$a_k = h\,\frac{(-1)^{n-k}}{k!\,(n - k)!} \int_0^n \prod_{\substack{j=0 \\ j \ne k}}^{n} (z - j)\,dz, \quad k = 0, \ldots, n, \tag{9.6}$$

and have the symmetry property $a_k = a_{n-k}$, k = 0, ..., n.

Proof. The weights are obtained from (9.4) by substituting $x = x_0 + hz$
and observing that

$$q_{n+1}(x) = h^{n+1} \prod_{j=0}^{n} (z - j)$$

and

$$q_{n+1}'(x_k) = (-1)^{n-k}\,h^n\,k!\,(n - k)!.$$

The symmetry $a_k = a_{n-k}$ follows by substituting z = n - y. □


These quadrature formulae were first discovered by Newton and also
carry the name of Cotes because of his systematic account of Newton's
integration rules in 1711. The Newton-Cotes quadrature formula of order
n = 1 is known as the trapezoidal rule. Its weights can be obtained ei-
ther from evaluating (9.6) or more easily from the exactness conditions of
Theorem 9.2. For the interval [-1, 1], these conditions are given by

$$a_0 + a_1 = \int_{-1}^{1} dx = 2, \qquad -a_0 + a_1 = \int_{-1}^{1} x\,dx = 0,$$

and imply that $a_0 = a_1 = 1$. Hence, for a general interval the trapezoidal
rule has the form

$$\int_a^b f(x)\,dx \approx \frac{b - a}{2}\,[f(a) + f(b)] = \frac{h}{2}\,[f(x_0) + f(x_1)].$$

Geometrically speaking, the trapezoidal rule approximates the integral of
f by the integral of the straight line connecting the two points (a, f(a))
and (b, f(b)). Hence, the approximate value coincides with the area of the
trapezoid with the four corners (a, 0), (b, 0), (a, f(a)), and (b, f(b)).
The Newton-Cotes quadrature formula of order n = 2 was already known
to Kepler in 1612 and Cavalieri in 1639 and is called Simpson's rule, since
Simpson rediscovered it in 1743. Its weights are obtained from the exactness
conditions on the interval [-1, 1]

$$a_0 + a_1 + a_2 = \int_{-1}^{1} dx = 2, \qquad -a_0 + a_2 = \int_{-1}^{1} x\,dx = 0, \qquad a_0 + a_2 = \int_{-1}^{1} x^2\,dx = \frac23,$$

which imply that $a_0 = a_2 = 1/3$ and $a_1 = 4/3$. Hence, for a general interval,
Simpson's rule is given by

$$\int_a^b f(x)\,dx \approx \frac{b - a}{6} \left[ f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b) \right] = \frac{h}{3}\,[f(x_0) + 4 f(x_1) + f(x_2)].$$

Geometrically speaking, Simpson's rule approximates the integral of f by
the integral of the parabola through the three points (a, f(a)), ((a + b)/2, f((a + b)/2)),
and (b, f(b)).
Table 9.1 gives the weights of the first four Newton-Cotes formulae (with
the common factor h = (b - a)/n omitted).

TABLE 9.1. Weights of Newton-Cotes formulae

n   Weights a_k                           Name
1   1/2    1/2                            Trapezoidal rule
2   1/3    4/3    1/3                     Simpson's rule
3   3/8    9/8    9/8    3/8              Newton's three-eighths rule
4   14/45  64/45  24/45  64/45  14/45     Milne's rule

For n = 8 and for n ≥ 10 some of the weights of the Newton-Cotes formulae become
negative (see Problem 9.4). Since this might lead to negative approximations
for integrals with positive integrands, the higher-order Newton-Cotes
rules cannot be recommended for numerical purposes.
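The weights in Table 9.1 can be reproduced from formula (9.6) in exact rational arithmetic. The sketch below is illustrative only: the function name `newton_cotes_weights` is not from the text, and the returned weights omit the common factor h, as in Table 9.1.

```python
from fractions import Fraction
from math import factorial


def newton_cotes_weights(n):
    """Weights a_k / h of the Newton-Cotes formula of order n, from (9.6)."""
    weights = []
    for k in range(n + 1):
        # Coefficients (lowest degree first) of prod_{j != k} (z - j).
        coeffs = [Fraction(1)]
        for j in range(n + 1):
            if j == k:
                continue
            new = [Fraction(0)] * (len(coeffs) + 1)
            for i, c in enumerate(coeffs):
                new[i] -= j * c      # contribution of the factor -j
                new[i + 1] += c      # contribution of the factor z
            coeffs = new
        # Exact integral of the polynomial over [0, n].
        integral = sum(
            c * Fraction(n) ** (i + 1) / (i + 1) for i, c in enumerate(coeffs)
        )
        sign = -1 if (n - k) % 2 else 1
        weights.append(sign * integral / (factorial(k) * factorial(n - k)))
    return weights
```

For example, `newton_cotes_weights(2)` gives the Simpson weights 1/3, 4/3, 1/3, and inspecting `newton_cotes_weights(8)` shows negative entries.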
We will carry out the error analysis for the Newton-Cotes formulae only
for the two most important cases, n = 1 and n = 2, i.e., the trapezoidal
rule and Simpson's rule.
Theorem 9.4 Let $f : [a, b] \to \mathbb{R}$ be twice continuously differentiable.
Then the error for the trapezoidal rule can be represented in the form

$$\int_a^b f(x)\,dx - \frac{b - a}{2}\,[f(a) + f(b)] = -\frac{h^3}{12}\,f''(\xi) \tag{9.7}$$

with some $\xi \in [a, b]$ and h = b - a.

Proof. Let $L_1 f$ denote the linear interpolation of f at the interpolation
points $x_0 = a$ and $x_1 = b$. By construction of the trapezoidal rule we have
that the error

$$E_1(f) := \int_a^b f(x)\,dx - \frac{b - a}{2}\,[f(a) + f(b)]$$

is given by

$$E_1(f) = \int_a^b [f(x) - (L_1 f)(x)]\,dx = \int_a^b (x - a)(x - b)\,\frac{f(x) - (L_1 f)(x)}{(x - a)(x - b)}\,dx.$$

Since the first factor of the integrand is nonpositive on [a, b] and since
by l'Hopital's rule the second factor is continuous, from the mean value
theorem for integrals we obtain that

$$E_1(f) = \frac{f(z) - (L_1 f)(z)}{(z - a)(z - b)} \int_a^b (x - a)(x - b)\,dx$$

for some $z \in [a, b]$. From this, with the aid of the error representation for
linear interpolation from Theorem 8.10 and the integral

$$\int_a^b (x - a)(x - b)\,dx = -\frac{h^3}{6},$$

the assertion of the theorem follows. □

We explicitly note that (9.7) cannot be obtained by integrating the in-


terpolation error representation (8.8), since we do not know whether the
intermediate point ~ in (8.8) depends continuously on x.
By construction, Simpson's rule integrates polynomials of degree less
than or equal to two exactly. In addition, it also integrates polynomials of
degree three exactly. By linearity, to show this it suffices to prove it for one
polynomial of degree three. For the polynomial

both the integral and the value obtained from Simpson's rule are zero.
Hence, this polynomial of degree three is integrated exactly by Simpson's
rule.

Theorem 9.5 Let $f : [a, b] \to \mathbb{R}$ be four-times continuously differentiable.
Then the error for Simpson's rule can be represented in the form

$$\int_a^b f(x)\,dx - \frac{b - a}{6} \left[ f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b) \right] = -\frac{h^5}{90}\,f^{(4)}(\xi) \tag{9.8}$$

for some $\xi \in [a, b]$ and h = (b - a)/2.

Proof. Let $L_2 f$ denote the quadratic interpolation polynomial for f at the
interpolation points $x_0 = a$, $x_1 = (a + b)/2$, and $x_2 = b$. By construction
of Simpson's rule we have that the error

$$E_2(f) := \int_a^b f(x)\,dx - \frac{b - a}{6} \left[ f(a) + 4 f\!\left(\frac{a + b}{2}\right) + f(b) \right]$$

is given by

$$E_2(f) = \int_a^b [f(x) - (L_2 f)(x)]\,dx. \tag{9.9}$$

Consider the cubic polynomial

$$p(x) := (L_2 f)(x) + \frac{4}{(b - a)^2}\,[(L_2 f)'(x_1) - f'(x_1)]\,q_3(x), \tag{9.10}$$

where $q_3(x) = (x - x_0)(x - x_1)(x - x_2)$. Obviously, p has the interpolation
properties

$$p(x_k) = f(x_k), \quad k = 0, 1, 2, \quad\text{and}\quad p'(x_1) = f'(x_1).$$

Since $\int_a^b q_3(x)\,dx = 0$, from (9.9) and (9.10) we can conclude that

$$E_2(f) = \int_a^b [f(x) - p(x)]\,dx,$$

and consequently

$$E_2(f) = \int_a^b (x - x_0)(x - x_1)^2 (x - x_2)\,\frac{f(x) - p(x)}{(x - x_0)(x - x_1)^2 (x - x_2)}\,dx.$$

As in the proof of Theorem 9.4, the first factor of the integrand is nonpositive
on [a, b], and the second factor is continuous. Hence, by the mean
value theorem for integrals, we obtain that

$$E_2(f) = \frac{f(z) - p(z)}{(z - x_0)(z - x_1)^2 (z - x_2)} \int_a^b (x - x_0)(x - x_1)^2 (x - x_2)\,dx$$

for some $z \in [a, b]$. Analogous to Theorem 8.10, it can be shown that

$$f(z) - p(z) = \frac{f^{(4)}(\xi)}{4!}\,(z - x_0)(z - x_1)^2 (z - x_2)$$

for some $\xi \in [a, b]$. From this, with the aid of the integral

$$\int_a^b (x - x_0)(x - x_1)^2 (x - x_2)\,dx = -\frac{(b - a)^5}{120},$$

we conclude the statement of the theorem. □

Example 9.6 The approximation of

$$\ln 2 = \int_0^1 \frac{dx}{1 + x}$$

by the trapezoidal rule yields

$$\ln 2 \approx \frac12 \left[ 1 + \frac12 \right] = 0.75.$$

For f(x) := 1/(1 + x) we have

$$\frac{h^3}{12}\,\|f''\|_\infty = \frac16,$$

and hence, from Theorem 9.4, we obtain the estimate $|\ln 2 - 0.75| \le 0.167$
as compared to the true error $\ln 2 - 0.75 = -0.056\ldots$.
Simpson's rule yields

$$\ln 2 \approx \frac16 \left[ 1 + \frac{4}{1 + \frac12} + \frac12 \right] = \frac{25}{36} = 0.6944\ldots,$$

and from Theorem 9.5 and

$$\frac{h^5}{90}\,\|f^{(4)}\|_\infty = \frac{1}{120}$$

we find the estimate $|\ln 2 - 0.6944| \le 0.0084$ as compared to the true error
$\ln 2 - 25/36 = -0.0012\ldots$. □
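The numbers in Example 9.6 can be reproduced with a direct transcription of the two rules. This is a minimal sketch; the function names `trapezoidal` and `simpson` and the helper `f` are illustrative and not taken from the text.

```python
from math import log


def trapezoidal(f, a, b):
    """Trapezoidal rule on [a, b]."""
    return (b - a) / 2 * (f(a) + f(b))


def simpson(f, a, b):
    """Simpson's rule on [a, b]."""
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))


def f(x):
    """Integrand of Example 9.6; its integral over [0, 1] is ln 2."""
    return 1.0 / (1.0 + x)


# True errors in the sign convention of (9.7) and (9.8): integral minus rule.
trapezoidal_error = log(2) - trapezoidal(f, 0.0, 1.0)
simpson_error = log(2) - simpson(f, 0.0, 1.0)
```

Both errors are negative and lie within the bounds 1/6 and 1/120 obtained from Theorems 9.4 and 9.5.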

In order to increase the accuracy, instead of using higher order Newton-


Cotes rules it is more practical to use so-called composite rules. These
are obtained by subdividing the interval of integration and then applying a
fixed rule with low interpolation order to each of the subintervals. The most
frequently used quadrature rules of this type are the composite trapezoidal
rule and the composite Simpson's rule.
Let $x_k = a + kh$, k = 0, ..., n, be an equidistant subdivision with step
size h = (b - a)/n. Then the composite trapezoidal rule is given by

$$T_h(f) := h \left[ \tfrac12 f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \tfrac12 f(x_n) \right]$$

for $f \in C[a, b]$.
Theorem 9.7 Let $f : [a, b] \to \mathbb{R}$ be twice continuously differentiable. Then
the error for the composite trapezoidal rule is given by

$$\int_a^b f(x)\,dx - T_h(f) = -\frac{b - a}{12}\,h^2 f''(\xi)$$

for some $\xi \in [a, b]$.

Proof. By Theorem 9.4 we have that

$$\int_a^b f(x)\,dx - T_h(f) = -\frac{h^3}{12} \sum_{k=1}^{n} f''(\xi_k),$$

where $x_{k-1} \le \xi_k \le x_k$ for k = 1, ..., n. From

$$n \min_{x \in [a, b]} f''(x) \le \sum_{k=1}^{n} f''(\xi_k) \le n \max_{x \in [a, b]} f''(x)$$

and the continuity of f'' we conclude that there exists $\xi \in [a, b]$ such that

$$\sum_{k=1}^{n} f''(\xi_k) = n f''(\xi),$$

and the proof is finished. □


Let n be even. Then the composite Simpson's rule is given by

$$S_h(f) := \frac{h}{3}\,[f(x_0) + 4 f(x_1) + 2 f(x_2) + 4 f(x_3) + 2 f(x_4) + \cdots + 2 f(x_{n-2}) + 4 f(x_{n-1}) + f(x_n)]$$

for $f \in C[a, b]$. Its error can be represented and estimated as follows.
Theorem 9.8 Let $f : [a, b] \to \mathbb{R}$ be four-times continuously differentiable.
Then the error for the composite Simpson's rule is given by

$$\int_a^b f(x)\,dx - S_h(f) = -\frac{b - a}{180}\,h^4 f^{(4)}(\xi)$$

for some $\xi \in [a, b]$.

Proof. Using Theorem 9.5, the proof is analogous to the proof of Theorem
9.7. □

Table 9.2 gives the error between the exact value of the integral from Ex-
ample 9.6 and its numerical approximation by the composite trapezoidal
rule and the composite Simpson's rule. Clearly, if the number n of quadra-
ture points is doubled, i.e., if the step size h is halved, then the error for
the trapezoidal rule is reduced by the factor 1/4 and for Simpson's rule by
the factor 1/16, as predicted in Theorems 9.7 and 9.8.

TABLE 9.2. Trapezoidal and Simpson's rule for Example 9.6

n Trapezoidal rule Simpson's rule

1 -0.05685282
2 -0.01518615 -0.00129726
4 -0.00387663 -0.00010679
8 -0.00097467 -0.00000735
16 -0.00024402 -0.00000047
32 -0.00006103 -0.00000003
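
The convergence factors 1/4 and 1/16 can be checked with a short script. The following is a minimal sketch, assuming that the integral of Example 9.6 is $\int_0^1 dx/(1+x) = \ln 2$ (this is inferred from the value $\ln 2$ and the tabulated errors); the function names are illustrative only.

```python
import math

def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule T_h(f) with n subintervals, h = (b - a)/n."""
    h = (b - a) / n
    interior = sum(f(a + k * h) for k in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))

def composite_simpson(f, a, b, n):
    """Composite Simpson's rule S_h(f); n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    odd = sum(f(a + k * h) for k in range(1, n, 2))
    even = sum(f(a + k * h) for k in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))

# Assumed integrand of Example 9.6.
f = lambda x: 1.0 / (1.0 + x)
exact = math.log(2.0)

for n in (2, 4, 8, 16, 32):
    err_t = exact - composite_trapezoid(f, 0.0, 1.0, n)
    err_s = exact - composite_simpson(f, 0.0, 1.0, n)
    print(f"{n:3d} {err_t: .8f} {err_s: .8f}")
```

Dividing consecutive errors in each printed column gives ratios close to 4 and 16, in line with the $h^2$ and $h^4$ factors of Theorems 9.7 and 9.8.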

9.2 Convergence of Quadrature Formulae


Definition 9.9 A sequence $(Q_n)$ of quadrature formulae is called convergent if

$$Q_n(f) \to Q(f) = \int_a^b f(x)\,dx, \quad n \to \infty,$$

for all $f \in C[a, b]$.

Theorem 9.10 (Szegő) Let

$$Q_n(f) = \sum_{k=0}^{n} a_k^{(n)} f\bigl(x_k^{(n)}\bigr)$$

be a sequence of quadrature formulae that converges for all polynomials, i.e.,

$$\lim_{n\to\infty} Q_n(p) = Q(p) \tag{9.11}$$

for all polynomials $p$, and that is uniformly bounded, i.e., there exists a
constant $C > 0$ such that

$$\sum_{k=0}^{n} \bigl|a_k^{(n)}\bigr| \le C \tag{9.12}$$

for all $n \in \mathbb{N}$. Then the sequence $(Q_n)$ is convergent.

Proof. Let $f \in C[a, b]$ and $\varepsilon > 0$ be arbitrary. By the Weierstrass
approximation theorem (see [16]) there exists a polynomial $p$ such that

$$\|f - p\|_\infty \le \frac{\varepsilon}{2(C + b - a)}.$$

Then, since by (9.11) we have $Q_n(p) \to Q(p)$ as $n \to \infty$, there exists
$N(\varepsilon) \in \mathbb{N}$ such that

$$|Q_n(p) - Q(p)| \le \frac{\varepsilon}{2}$$

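for all $n \ge N(\varepsilon)$. Now with the aid of the triangle inequality and using
(9.12) we can estimate

$$|Q_n(f) - Q(f)| \le \sum_{k=0}^{n} \bigl|a_k^{(n)}\bigr|\,\bigl|f(x_k^{(n)}) - p(x_k^{(n)})\bigr| + |Q_n(p) - Q(p)| + \int_a^b |p(x) - f(x)|\,dx$$

$$\le \frac{C\varepsilon}{2(C + b - a)} + \frac{\varepsilon}{2} + \frac{(b - a)\varepsilon}{2(C + b - a)} = \varepsilon$$

for all $n \ge N(\varepsilon)$; i.e., $Q_n(f) \to Q(f)$ for $n \to \infty$. □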


A quadrature formula

$$Q_n(f) = \sum_{k=0}^{n} a_k f(x_k)$$

defines a bounded linear operator $Q_n : C[a, b] \to \mathbb{R}$ with the norm given by

$$\|Q_n\|_\infty = \sum_{k=0}^{n} |a_k|. \tag{9.13}$$

To prove this, we note the estimate

$$|Q_n f| \le \|f\|_\infty \sum_{k=0}^{n} |a_k|,$$

which implies that $Q_n$ is a bounded operator and that the operator norm
is less than or equal to the right-hand side of (9.13). Equality in (9.13)
follows by choosing $f$ to be a continuous piecewise linear function with
$\|f\|_\infty = 1$ and $f(x_k)a_k = |a_k|$ for $k = 0, \ldots, n$. From (9.13) and the
uniform boundedness principle, Theorem 12.7, it can be seen that the two
conditions of Theorem 9.10 are also necessary for convergence of a sequence
of quadrature formulae.
Corollary 9.11 (Steklov) Assume that the sequence $(Q_n)$ of quadrature
formulae converges for all polynomials and that all the weights are nonnegative.
Then the sequence $(Q_n)$ is convergent.
Proof. This follows from

$$\sum_{k=0}^{n} \bigl|a_k^{(n)}\bigr| = \sum_{k=0}^{n} a_k^{(n)} = Q_n(1) \to \int_a^b dx = b - a, \quad n \to \infty,$$

and the preceding Theorem 9.10. □



From Theorems 9.7 and 9.8 and Corollary 9.11 we observe that the com-
posite trapezoidal rule and the composite Simpson's rule are convergent.
On the other hand, using the fact that the conditions of Theorem 9.10 are
necessary for convergence, it can be shown that the Newton-Cotes quadra-
tures do not converge for all continuous functions (see Problem 9.5).
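
The failure of condition (9.12) for Newton-Cotes rules can be observed numerically. The sketch below computes the weights of the closed Newton-Cotes rule on $[0,1]$ exactly by solving the exactness conditions in rational arithmetic; the helper name `newton_cotes_weights` and the choice of the interval $[0,1]$ are assumptions made for illustration.

```python
from fractions import Fraction

def newton_cotes_weights(n):
    """Exact weights a_0..a_n of the closed Newton-Cotes rule on [0, 1].

    Solves sum_k a_k x_k^i = 1/(i+1), i = 0..n, for the nodes x_k = k/n
    by Gauss-Jordan elimination with rational numbers.
    """
    nodes = [Fraction(k, n) for k in range(n + 1)]
    size = n + 1
    # Row i of the augmented matrix holds the exactness condition for x^i.
    rows = [[x ** i for x in nodes] + [Fraction(1, i + 1)] for i in range(size)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [vr - factor * vc for vr, vc in zip(rows[r], rows[col])]
    return [rows[i][size] for i in range(size)]

for n in (2, 4, 8, 12, 16):
    w = newton_cotes_weights(n)
    # The weights always sum to 1; the sum of absolute values is the norm (9.13).
    print(n, min(w) < 0, float(sum(abs(a) for a in w)))
```

Once negative weights appear, the printed sums of absolute values exceed $b - a = 1$ and grow with $n$, which is the behavior excluded by (9.12).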

9.3 Gaussian Quadrature Formulae


Given arbitrary quadrature points $x_0, \ldots, x_n$ in $[a, b]$, the quadrature
weights $a_0, \ldots, a_n$ of a polynomial interpolatory quadrature are determined
such that all polynomials of degree less than or equal to $n$ are integrated
exactly. In this section we will examine the problem of whether the quadrature
points can be chosen in such a way that polynomials of degree less than
or equal to $2n + 1$ are also integrated exactly. Obviously, to achieve this
degree of exactness the quadrature points and the quadrature weights have
to satisfy the conditions

$$\sum_{k=0}^{n} a_k x_k^i = \int_a^b x^i\,dx, \quad i = 0, \ldots, 2n + 1.$$

We shall see that this system of $2n + 2$ nonlinear equations for the $2n + 2$
unknowns $x_0, \ldots, x_n \in [a, b]$ and $a_0, \ldots, a_n \in \mathbb{R}$ has a unique solution and
that for this solution the points $x_0, \ldots, x_n$ are distinct.
We shall proceed slightly more generally by considering quadrature formulae
for the integral

$$Q(f) := \int_a^b w(x) f(x)\,dx, \tag{9.14}$$

where $w$ denotes some weight function. We assume that $w : (a, b) \to \mathbb{R}$
is continuous and positive and that the integral $\int_a^b w(x)\,dx$ exists. Typical
examples are given by

$$w(x) = 1, \quad w(x) = \frac{1}{\sqrt{1 - x^2}}, \quad w(x) = \sqrt{1 - x^2},$$

where for the two latter cases the interval is assumed to be $[a, b] = [-1, 1]$.
Analogously to the case $w(x) = 1$, interpolatory quadrature rules for (9.14)
are obtained by replacing $f$ by its interpolation polynomial $L_n f$ and
then integrating exactly, i.e., by approximating $Q(f)$ by

$$Q_n(f) := \int_a^b w(x)(L_n f)(x)\,dx.$$

Note that the separation of a weight function $w$ for interpolatory quadrature
formulae has the advantage that in general, $w L_n f$ is a better approximation
to $wf$ than $L_n(wf)$ due to possible singularities of $w$ and its
derivatives at the endpoints of the interval.

Definition 9.12 A quadrature formula

$$\int_a^b w(x) f(x)\,dx \approx \sum_{k=0}^{n} a_k f(x_k)$$

with $n + 1$ distinct quadrature points is called a Gaussian quadrature formula
if it integrates all polynomials $p \in P_{2n+1}$ exactly, i.e., if

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b w(x) p(x)\,dx \tag{9.15}$$

for all $p \in P_{2n+1}$.


Lemma 9.13 Let $x_0, \ldots, x_n$ be the $n + 1$ distinct quadrature points of a
Gaussian quadrature formula. Then

$$\int_a^b w(x)\,q_{n+1}(x)\,q(x)\,dx = 0 \tag{9.16}$$

for $q_{n+1}(x) := (x - x_0) \cdots (x - x_n)$ and all $q \in P_n$.

Proof. Since $q_{n+1} q \in P_{2n+1}$ and $q_{n+1}(x_k) = 0$, we have that

$$\int_a^b w(x)\,q_{n+1}(x)\,q(x)\,dx = \sum_{k=0}^{n} a_k\,q_{n+1}(x_k)\,q(x_k) = 0$$

for all $q \in P_n$. □
Lemma 9.14 Let $x_0, \ldots, x_n$ be $n + 1$ distinct points satisfying the condition
(9.16). Then the corresponding polynomial interpolatory quadrature is a
Gaussian quadrature formula.

Proof. Let $L_n$ denote the polynomial interpolation operator for the interpolation
points $x_0, \ldots, x_n$. By construction, for the interpolatory quadrature
we have

$$\sum_{k=0}^{n} a_k f(x_k) = \int_a^b w(x)(L_n f)(x)\,dx \tag{9.17}$$

for all $f \in C[a, b]$. Each $p \in P_{2n+1}$ can be represented in the form

$$p = L_n p + q_{n+1} q$$

for some $q \in P_n$, since the polynomial $p - L_n p$ vanishes at the points
$x_0, \ldots, x_n$. Then from (9.16) and (9.17) we obtain that

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b w(x)(L_n p)(x)\,dx + \int_a^b w(x)\,q_{n+1}(x)\,q(x)\,dx = \int_a^b w(x) p(x)\,dx$$

for all $p \in P_{2n+1}$. □



Lemma 9.15 There exists a unique sequence $(q_n)$ of polynomials of the
form $q_0 = 1$ and

$$q_n(x) = x^n + r_{n-1}(x), \quad n = 1, 2, \ldots,$$

with $r_{n-1} \in P_{n-1}$ satisfying the orthogonality relation

$$\int_a^b w(x)\,q_n(x)\,q_m(x)\,dx = 0, \quad n \ne m, \tag{9.18}$$

and

$$P_n = \operatorname{span}\{q_0, \ldots, q_n\}, \quad n = 0, 1, \ldots. \tag{9.19}$$

Proof. This follows by the Gram-Schmidt orthogonalization procedure from
Theorem 3.18 applied to the linearly independent functions $u_n(x) := x^n$
for $n = 0, 1, \ldots$ and the scalar product

$$(f, g) := \int_a^b w(x) f(x) g(x)\,dx$$

for $f, g \in C[a, b]$. The positive definiteness of the scalar product is a
consequence of $w$ being positive in $(a, b)$. □

Lemma 9.16 Each of the orthogonal polynomials $q_n$ from Lemma 9.15
has $n$ simple zeros in $(a, b)$.

Proof. For $m = 0$, from (9.18) we have that

$$\int_a^b w(x)\,q_n(x)\,dx = 0$$

for $n > 0$. Hence, since $w$ is positive on $(a, b)$, the polynomial $q_n$ must
have at least one zero in $(a, b)$ where the sign of $q_n$ changes. Denote by
$x_1, \ldots, x_m$ the zeros of $q_n$ in $(a, b)$ where $q_n$ changes its sign. We assume
that $m < n$ and set $r_m(x) := (x - x_1) \cdots (x - x_m)$. Then $r_m \in P_{n-1}$ and
therefore

$$\int_a^b w(x)\,r_m(x)\,q_n(x)\,dx = 0.$$

However, this integral must be different from zero, since $r_m q_n$ does not
change its sign on $(a, b)$ and does not vanish identically. Hence, we have
arrived at a contradiction, and consequently $m = n$. □

Theorem 9.17 For each n = 0,1, ... there exists a unique Gaussian quad-
rature formula of order n. Its quadrature points are given by the zeros of
the orthogonal polynomial qn+l of degree n + 1.

Proof. This is a consequence of Lemmas 9.13-9.16. o



Theorem 9.18 The weights of the Gaussian quadrature formulae are all
positive.
Proof. Define

$$f_k(x) := \left[\frac{q_{n+1}(x)}{x - x_k}\right]^2, \quad k = 0, \ldots, n.$$

Then

$$a_k\,[q_{n+1}'(x_k)]^2 = \sum_{j=0}^{n} a_j f_k(x_j) = \int_a^b w(x) f_k(x)\,dx > 0,$$

since $f_k \in P_{2n}$, and the theorem is proven. □

Corollary 9.19 The sequence of Gaussian quadrature formulae is convergent.
Proof. For each polynomial $p$ we have

$$Q_n(p) = \int_a^b w(x) p(x)\,dx,$$

provided that $2n + 1$ is greater than or equal to the degree of $p$. From their
proofs it is obvious that Theorem 9.10 and its Corollary 9.11 remain valid
for the integral with the weight function $w$. Hence, the statement of the
corollary follows from Theorem 9.18. □
Theorem 9.20 Let $f \in C^{2n+2}[a, b]$. Then the error for the Gaussian
quadrature formula of order $n$ is given by

$$\int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} a_k f(x_k) = \frac{f^{(2n+2)}(\xi)}{(2n + 2)!} \int_a^b w(x)\,[q_{n+1}(x)]^2\,dx$$

for some $\xi \in [a, b]$.


Proof. Recall the Hermite interpolation polynomial $H_n f \in P_{2n+1}$ for $f$
from Theorem 8.18. Since $(H_n f)(x_k) = f(x_k)$, $k = 0, \ldots, n$, for the error

$$E_n(f) := \int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} a_k f(x_k)$$

we can write

$$E_n(f) = \int_a^b w(x)\,[f(x) - (H_n f)(x)]\,dx.$$

Then as in the proofs of Theorems 9.7 and 9.8, using the mean value theorem
we obtain

$$E_n(f) = \frac{f(z) - (H_n f)(z)}{[q_{n+1}(z)]^2} \int_a^b w(x)\,[q_{n+1}(x)]^2\,dx$$

for some $z \in [a, b]$. Now the proof is finished with the aid of the error
representation for Hermite interpolation from Theorem 8.19. □

Example 9.21 We consider the Gaussian quadrature formulae for the
weight function

$$w(x) = \frac{1}{\sqrt{1 - x^2}}, \quad x \in [-1, 1].$$

The Chebyshev polynomial $T_n$ of degree $n$ is defined by

$$T_n(x) := \cos(n \arccos x), \quad -1 \le x \le 1.$$

Obviously $T_0(x) = 1$ and $T_1(x) = x$. From the addition theorem for the
cosine function, $\cos(n + 1)t + \cos(n - 1)t = 2 \cos t \cos nt$, we can deduce the
recursion formula

$$T_{n+1}(x) + T_{n-1}(x) = 2x\,T_n(x), \quad n = 1, 2, \ldots.$$

Hence we have that $T_n \in P_n$ with leading term

$$T_n(x) = 2^{n-1} x^n + \cdots, \quad n = 1, 2, \ldots.$$

Substituting $x = \cos t$ we find that

$$\int_{-1}^{1} \frac{T_n(x)\,T_m(x)}{\sqrt{1 - x^2}}\,dx = \int_0^{\pi} \cos nt \cos mt\,dt = \begin{cases} \pi, & n = m = 0, \\ \dfrac{\pi}{2}, & n = m > 0, \\ 0, & n \ne m. \end{cases}$$

Hence, the orthogonal polynomials $q_n$ of Lemma 9.15 are given by
$q_n = 2^{1-n} T_n$. The zeros of $T_n$ and hence the quadrature points are given
by

$$x_k = \cos\left(\frac{2k + 1}{2n}\,\pi\right), \quad k = 0, \ldots, n - 1.$$

The weights can be most easily derived from the exactness conditions

$$\sum_{k=0}^{n-1} a_k T_m(x_k) = \int_{-1}^{1} \frac{T_m(x)}{\sqrt{1 - x^2}}\,dx, \quad m = 0, \ldots, n - 1,$$

for the interpolation quadrature, i.e., from

$$\sum_{k=0}^{n-1} a_k \cos\frac{(2k + 1)m}{2n}\,\pi = \begin{cases} \pi, & m = 0, \\ 0, & m = 1, \ldots, n - 1. \end{cases}$$

From our analysis of trigonometric interpolation, i.e., from (8.19), we see
that the unique solution of this linear system is given by

$$a_k = \frac{\pi}{n}, \quad k = 0, \ldots, n - 1.$$

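Hence, for $n = 1, 2, \ldots$ the Gauss-Chebyshev quadrature of order $n - 1$ is
given by

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx \approx \frac{\pi}{n} \sum_{k=0}^{n-1} f\left(\cos\frac{2k + 1}{2n}\,\pi\right).$$

From Theorem 9.20 we have the error representation

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx - \frac{\pi}{n} \sum_{k=0}^{n-1} f\left(\cos\frac{2k + 1}{2n}\,\pi\right) = \frac{\pi f^{(2n)}(\xi)}{2^{2n-1}\,(2n)!}$$

for some $\xi \in [-1, 1]$. □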
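
Because the points and weights are explicit, the Gauss-Chebyshev rule takes only a few lines to implement. The following sketch is illustrative; the function name `gauss_chebyshev` and the test integrands are assumptions and do not come from the text.

```python
import math

def gauss_chebyshev(f, n):
    """n-point Gauss-Chebyshev rule for the integral of f(x)/sqrt(1 - x^2) over [-1, 1]."""
    nodes = (math.cos((2 * k + 1) * math.pi / (2 * n)) for k in range(n))
    return math.pi / n * sum(f(x) for x in nodes)

# With n points the rule is exact for polynomials of degree <= 2n - 1;
# for f(x) = x^2 the weighted integral equals pi/2.
print(gauss_chebyshev(lambda x: x * x, 2), math.pi / 2)

# A smooth non-polynomial integrand: the approximations settle quickly as n grows.
for n in (2, 4, 8):
    print(n, gauss_chebyshev(math.exp, n))
```

The rapid settling for $f(x) = e^x$ reflects the factor $1/(2^{2n-1}(2n)!)$ in the error representation.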


Example 9.22 We now consider the weight function

$$w(x) = 1, \quad x \in [-1, 1].$$

The Legendre polynomial $L_n$ of degree $n$ is defined by

$$L_n(x) := \frac{1}{2^n n!}\,\frac{d^n}{dx^n}(x^2 - 1)^n.$$

Obviously, $L_n \in P_n$. If $m < n$, by repeated partial integration we see that

$$\int_{-1}^{1} x^m\,\frac{d^n}{dx^n}(x^2 - 1)^n\,dx = 0,$$

since $(x^2 - 1)^n$ has zeros of order $n$ at the endpoints $-1$ and $1$. Therefore,

$$\int_{-1}^{1} L_n(x)\,L_m(x)\,dx = 0, \quad n \ne m.$$

The zeros of the Legendre polynomials, and therefore the quadrature
points and weights of the corresponding Gauss-Legendre quadratures, cannot
be given explicitly by a simple expression. We consider only the cases
$n = 1$ and $n = 2$ and note that

$$q_0(x) = 1, \quad q_1(x) = x, \quad q_2(x) = x^2 - \frac{1}{3},$$

where the coefficient of $q_2$ can be determined from $\int_{-1}^{1} q_2(x)\,dx = 0$.
The quadrature point for the first Gauss-Legendre formula is $x_1 = 0$,
and the weight $a_1$ can be obtained from the exactness condition

$$a_1 = \int_{-1}^{1} dx = 2.$$

Hence the first Gauss-Legendre formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx 2 f(0) \tag{9.20}$$

with the error representation

$$\int_{-1}^{1} f(x)\,dx - 2 f(0) = \frac{1}{3}\,f''(\xi)$$

for some $\xi \in [-1, 1]$. The coefficient of the derivative on the right-hand
side follows most easily by inserting $f(x) = x^2$. For obvious reasons, this
Gauss-Legendre formula is also known as the midpoint rule.
The quadrature points for the second Gauss-Legendre formula are
$x_1 = -1/\sqrt{3}$ and $x_2 = 1/\sqrt{3}$. The weights can be obtained from the exactness
conditions

$$a_1 + a_2 = \int_{-1}^{1} dx = 2,$$

$$a_1 x_1 + a_2 x_2 = \int_{-1}^{1} x\,dx = 0,$$

and they have the values $a_1 = a_2 = 1$. Hence the second Gauss-Legendre
formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx f\left(-\frac{1}{\sqrt{3}}\right) + f\left(\frac{1}{\sqrt{3}}\right)$$

with the error representation

$$\int_{-1}^{1} f(x)\,dx - f\left(-\frac{1}{\sqrt{3}}\right) - f\left(\frac{1}{\sqrt{3}}\right) = \frac{1}{135}\,f^{(4)}(\xi)$$

for some $\xi \in [-1, 1]$. The coefficient on the right-hand side follows by
inserting $f(x) = x^4$. □

From the Gaussian quadrature formula

$$\int_{-1}^{1} g(z)\,dz \approx \sum_{k=0}^{n} a_k g(x_k)$$

of order $n$ for the interval $[-1, 1]$, by substituting

$$x = \frac{a + b}{2} + \frac{b - a}{2}\,z$$

and $f(x) = g(z)$ we obtain the Gaussian quadrature formula

$$\int_a^b f(x)\,dx \approx \frac{b - a}{2} \sum_{k=0}^{n} a_k f\left(\frac{a + b}{2} + \frac{b - a}{2}\,x_k\right)$$

for an arbitrary interval $[a, b]$. The error representation

$$\int_{-1}^{1} g(z)\,dz - \sum_{k=0}^{n} a_k g(x_k) = \frac{g^{(2n+2)}(\zeta)}{(2n + 2)!} \int_{-1}^{1} [q_{n+1}(z)]^2\,dz$$

with $\zeta \in [-1, 1]$ can be transformed accordingly. Subdividing the interval
$[a, b]$ into $m$ equidistant subintervals with step width $h = (b - a)/m$ and
then applying to each subinterval the Gaussian quadrature formula of order
$n$, we obtain the composite Gaussian quadrature

$$\int_a^b f(x)\,dx \approx \frac{h}{2} \sum_{j=0}^{m-1} \sum_{k=0}^{n} a_k f\left(a + jh + \frac{h}{2} + \frac{h}{2}\,x_k\right)$$
with an error of order $O(h^{2n})$. These composite Gaussian rules are used
quite frequently in practice. We illustrate their convergence behavior by
Table 9.3, which gives the error between the exact value of the integral
from Example 9.6 and its numerical approximation by composite Gaussian
quadrature of orders one and two. As predicted by our error analysis, if the
number $m$ of subintervals is doubled, i.e., if the step size $h$ is halved,
then the error for the Gaussian quadrature of orders one and two is reduced
roughly by the factor 1/4 and 1/16, respectively.

TABLE 9.3. Gaussian quadrature for Example 9.6

m n=1 n=2

1 0.02648051 0.00083949
2 0.00743289 0.00007054
4 0.00192729 0.00000489
8 0.00048663 0.00000031
16 0.00012197 0.00000002
32 0.00003051 0.00000000
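
The entries of Table 9.3 can be recomputed with the two Gauss-Legendre formulae of Example 9.22. This is a minimal sketch, assuming that the integral of Example 9.6 is $\int_0^1 dx/(1+x) = \ln 2$ and that the columns $n = 1$ and $n = 2$ refer to the one-point and two-point formulae; the names `GAUSS_LEGENDRE` and `composite_gauss` are illustrative.

```python
import math

# One- and two-point Gauss-Legendre rules on [-1, 1] (points, weights).
GAUSS_LEGENDRE = {
    1: ([0.0], [2.0]),
    2: ([-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)], [1.0, 1.0]),
}

def composite_gauss(f, a, b, m, points):
    """Apply the chosen Gauss-Legendre rule on m equal subintervals of [a, b]."""
    nodes, weights = GAUSS_LEGENDRE[points]
    h = (b - a) / m
    total = 0.0
    for j in range(m):
        mid = a + j * h + h / 2  # midpoint of the j-th subinterval
        total += sum(w * f(mid + h / 2 * x) for x, w in zip(nodes, weights))
    return h / 2 * total

# Assumed integrand of Example 9.6.
f = lambda x: 1.0 / (1.0 + x)
exact = math.log(2.0)

for m in (1, 2, 4, 8, 16, 32):
    e1 = exact - composite_gauss(f, 0.0, 1.0, m, 1)
    e2 = exact - composite_gauss(f, 0.0, 1.0, m, 2)
    print(f"{m:3d} {e1:.8f} {e2:.8f}")
```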

9.4 Quadrature of Periodic Functions


We proceed by deriving the Euler-Maclaurin expansion.
Definition 9.23 The Bernoulli polynomials $B_n$ of degree $n$ are defined
recursively by $B_0(x) := 1$ and

$$B_n' := B_{n-1}, \quad n \in \mathbb{N}, \tag{9.21}$$

with the normalization condition

$$\int_0^1 B_n(x)\,dx = 0, \quad n \in \mathbb{N}. \tag{9.22}$$

The rational numbers

$$b_n := n!\,B_n(0), \quad n = 0, 1, \ldots,$$

are called Bernoulli numbers.

The first Bernoulli polynomials are given by

$$B_0(x) = 1, \quad B_1(x) = x - \frac{1}{2}, \quad B_2(x) = \frac{1}{2}\,x^2 - \frac{1}{2}\,x + \frac{1}{12}.$$

We note that the normalization (9.22) is equivalent to

$$B_n(0) = B_n(1), \quad n = 2, 3, \ldots. \tag{9.23}$$
Lemma 9.24 The Bernoulli polynomials have the symmetry property

$$B_n(x) = (-1)^n B_n(1 - x), \quad x \in \mathbb{R}, \quad n = 0, 1, \ldots. \tag{9.24}$$

Proof. Obviously (9.24) holds for $n = 0$. Assume that (9.24) has been
proven for some $n \ge 0$. Then, integrating (9.24), we obtain

$$B_{n+1}(x) = (-1)^{n+1} B_{n+1}(1 - x) + \beta_{n+1}$$

for some constant $\beta_{n+1}$. The condition (9.22) implies that $\beta_{n+1} = 0$, and
therefore (9.24) is also valid for $n + 1$. □
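
The recursion (9.21) together with the normalization (9.22) translates directly into an algorithm. The following sketch computes the Bernoulli polynomials and numbers exactly with rational arithmetic; the coefficient-list representation (lowest degree first) and the function name are choices made here for illustration.

```python
from fractions import Fraction
from math import factorial

def bernoulli_polynomials(n_max):
    """Return coefficient lists (lowest degree first) of B_0, ..., B_{n_max}."""
    polys = [[Fraction(1)]]
    for _ in range(n_max):
        prev = polys[-1]
        # Antiderivative of B_{n-1} with zero constant term, so that B_n' = B_{n-1}.
        anti = [Fraction(0)] + [c / (i + 1) for i, c in enumerate(prev)]
        # Choose the constant term so that the integral of B_n over [0, 1] vanishes.
        anti[0] = -sum(c / (i + 1) for i, c in enumerate(anti))
        polys.append(anti)
    return polys

B = bernoulli_polynomials(6)
# Bernoulli numbers b_n = n! B_n(0); B_n(0) is the constant coefficient.
bernoulli_numbers = [factorial(n) * B[n][0] for n in range(len(B))]
print(B[2])
print(bernoulli_numbers)
```

The list `B[2]` holds the coefficients $1/12, -1/2, 1/2$ of $B_2$ as given above, and the odd Bernoulli numbers beyond $b_1$ come out as zero, in agreement with the symmetry property (9.24).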

Lemma 9.25 The Bernoulli polynomials $B_{2m+1}$, $m = 1, 2, \ldots$, of odd
degree have exactly three zeros in $[0, 1]$, and these zeros are at the points
0, 1/2, and 1. The Bernoulli polynomials $B_{2m}$, $m = 0, 1, \ldots$, of even degree
satisfy $B_{2m}(0) \ne 0$.
Proof. From (9.23) and (9.24) we conclude that $B_{2m+1}$ vanishes at the
points 0, 1/2, and 1. We prove by induction that these are the only zeros
of $B_{2m+1}$ in $[0, 1]$. This is true for $m = 1$, since $B_3$ is a polynomial of degree
three. Assume that we have proven that $B_{2m+1}$ has only the three zeros
0, 1/2, and 1 in $[0, 1]$, and assume that $B_{2m+3}$ has an additional zero $\alpha$ in
$[0, 1]$. Because of the symmetry (9.24) we may assume that $\alpha \in (0, 1/2)$.
Then, by Rolle's theorem, we conclude that $B_{2m+2}$ has at least one zero in
$(0, \alpha)$ and also at least one zero in $(\alpha, 1/2)$. Again by Rolle's theorem this
implies that $B_{2m+1}$ has a zero in $(0, 1/2)$, which contradicts the induction
assumption.
From the zeros of $B_{2m+1}$, by Rolle's theorem it follows that $B_{2m}$ has a
zero in $(0, 1/2)$. Assume that $B_{2m}(0) = 0$. Then, by Rolle's theorem, $B_{2m-1}$
has a zero in $(0, 1/2)$, which contradicts the first part of the lemma. □

By $\tilde{B}_n : \mathbb{R} \to \mathbb{R}$ we denote the periodic extension of the Bernoulli
polynomial $B_n$; i.e., $\tilde{B}_n$ has period 1 and $\tilde{B}_n(x) = B_n(x)$ for $0 \le x \le 1$.
The Fourier series of the periodic functions $\tilde{B}_n$ are given by

$$\tilde{B}_{2m}(x) = 2(-1)^{m-1} \sum_{k=1}^{\infty} \frac{\cos 2\pi k x}{(2\pi k)^{2m}} \tag{9.25}$$

and

$$\tilde{B}_{2m-1}(x) = 2(-1)^{m} \sum_{k=1}^{\infty} \frac{\sin 2\pi k x}{(2\pi k)^{2m-1}} \tag{9.26}$$

for $m = 1, 2, \ldots$. This follows from (9.21) and (9.22) and the elementary
Fourier expansion for the piecewise linear function $\tilde{B}_1$ (see Problem 9.13).
Let $x_k = a + kh$, $k = 0, \ldots, n$, be an equidistant subdivision of the
interval $[a, b]$ with step size $h = (b - a)/n$ and recall the definition of the
trapezoidal sum

$$T_h(f) := h\left[\tfrac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \tfrac{1}{2} f(x_n)\right]$$

for $f \in C[a, b]$.

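Theorem 9.26 Let $f : [a, b] \to \mathbb{R}$ be $m$ times continuously differentiable
for $m \ge 2$. Then we have the Euler-Maclaurin expansion

$$\int_a^b f(x)\,dx = T_h(f) - \sum_{j=1}^{[m/2]} \frac{b_{2j} h^{2j}}{(2j)!}\left[f^{(2j-1)}(b) - f^{(2j-1)}(a)\right] + (-1)^m h^m \int_a^b \tilde{B}_m\left(\frac{x - a}{h}\right) f^{(m)}(x)\,dx, \tag{9.27}$$

where $[\tau]$ denotes the largest integer smaller than or equal to $\tau$.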
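Proof. Let $g \in C^m[0, 1]$. Then, by $m - 1$ partial integrations and using
(9.23) we find that

$$\int_0^1 B_1(z)\,g'(z)\,dz = \sum_{j=2}^{m} (-1)^j B_j(0)\left[g^{(j-1)}(1) - g^{(j-1)}(0)\right] - (-1)^m \int_0^1 B_m(z)\,g^{(m)}(z)\,dz.$$

Combining this with the partial integration

$$\int_0^1 B_1(z)\,g'(z)\,dz = \frac{1}{2}\,[g(1) + g(0)] - \int_0^1 g(z)\,dz$$

and observing that the odd Bernoulli numbers vanish leads to

$$\int_0^1 g(z)\,dz = \frac{1}{2}\,[g(0) + g(1)] - \sum_{j=1}^{[m/2]} \frac{b_{2j}}{(2j)!}\left[g^{(2j-1)}(1) - g^{(2j-1)}(0)\right] + (-1)^m \int_0^1 B_m(z)\,g^{(m)}(z)\,dz.$$

Now we substitute $x = x_k + hz$ and $g(z) = f(x_k + hz)$ to obtain

$$\int_{x_k}^{x_{k+1}} f(x)\,dx = \frac{h}{2}\,[f(x_k) + f(x_{k+1})] - \sum_{j=1}^{[m/2]} \frac{b_{2j} h^{2j}}{(2j)!}\left[f^{(2j-1)}(x_{k+1}) - f^{(2j-1)}(x_k)\right] + (-1)^m h^m \int_{x_k}^{x_{k+1}} \tilde{B}_m\left(\frac{x - a}{h}\right) f^{(m)}(x)\,dx.$$

Finally, we sum the last equation for $k = 0, \ldots, n - 1$ to arrive at the
Euler-Maclaurin expansion (9.27). □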
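
The first term of the Euler-Maclaurin expansion can be used as an endpoint correction of the trapezoidal sum. Since $b_2 = 1/6$, the $j = 1$ term equals $\frac{h^2}{12}[f'(b) - f'(a)]$, and subtracting it should leave an error of order $h^4$. The sketch below checks this for the test function $f(x) = 1/(1+x)$ on $[0,1]$, which is a choice made here for illustration.

```python
import math

def trapezoid(f, a, b, n):
    """Trapezoidal sum T_h(f) with h = (b - a)/n."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))

def corrected_trapezoid(f, df, a, b, n):
    """Trapezoidal sum minus the j = 1 Euler-Maclaurin term (h^2/12)[f'(b) - f'(a)]."""
    h = (b - a) / n
    return trapezoid(f, a, b, n) - h * h / 12.0 * (df(b) - df(a))

f = lambda x: 1.0 / (1.0 + x)
df = lambda x: -1.0 / (1.0 + x) ** 2   # derivative of f
exact = math.log(2.0)

for n in (2, 4, 8, 16, 32):
    plain = exact - trapezoid(f, 0.0, 1.0, n)
    corrected = exact - corrected_trapezoid(f, df, 0.0, 1.0, n)
    print(f"{n:3d} {plain: .3e} {corrected: .3e}")
```

Halving $h$ reduces the uncorrected error by about 4 and the corrected error by about 16.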

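For $2\pi$-periodic continuous functions $f : \mathbb{R} \to \mathbb{R}$ the trapezoidal rule
coincides with the rectangular rule

$$\int_0^{2\pi} f(x)\,dx \approx \frac{2\pi}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right).$$

For its error

$$E_n(f) := \int_0^{2\pi} f(x)\,dx - \frac{2\pi}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right)$$

we have the following corollary of the Euler-Maclaurin expansion.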
Corollary 9.27 Let $f : \mathbb{R} \to \mathbb{R}$ be $(2m + 1)$-times continuously differentiable
and $2\pi$-periodic for $m \in \mathbb{N}$ and let $n \in \mathbb{N}$. Then for the error of the
rectangular rule we have

$$|E_n(f)| \le \frac{C}{n^{2m+1}} \int_0^{2\pi} |f^{(2m+1)}(x)|\,dx,$$

where

$$C := 2 \sum_{k=1}^{\infty} \frac{1}{k^{2m+1}}.$$

Proof. From Theorem 9.26 we have that

$$E_n(f) = -\left(\frac{2\pi}{n}\right)^{2m+1} \int_0^{2\pi} \tilde{B}_{2m+1}\left(\frac{nx}{2\pi}\right) f^{(2m+1)}(x)\,dx,$$

and the estimate follows from the inequality

$$|\tilde{B}_{2m+1}(x)| \le 2 \sum_{k=1}^{\infty} \frac{1}{(2\pi k)^{2m+1}}, \quad x \in \mathbb{R},$$

which is a consequence of (9.26). □

Corollary 9.27 illustrates why for periodic functions the simple rectangu-
lar rule is superior to any other quadrature rule (see Problem 9.12). Note
that the rectangular rule can also be obtained by integrating the trigono-
metric interpolation polynomials of Theorems 8.24 and 8.25.
In the following theorem we give an example of derivative-free error esti-
mates for numerical quadrature rules in the spirit of Davis [15]. They have
the advantage that they do not need the computation of higher derivatives
for the evaluation of the estimates. However, they require the integrand to
be analytic, and their proofs need complex analysis.
Theorem 9.28 Let $f : \mathbb{R} \to \mathbb{R}$ be analytic and $2\pi$-periodic. Then there
exists a strip $D = \mathbb{R} \times (-a, a) \subset \mathbb{C}$ with $a > 0$ such that $f$ can be extended
to a holomorphic and $2\pi$-periodic bounded function $f : D \to \mathbb{C}$. The error
for the rectangular rule can be estimated by

$$|E_n(f)| \le \frac{4\pi M}{e^{na} - 1},$$

where $M$ denotes a bound for the holomorphic function $f$ on $D$.
Proof. Since $f : \mathbb{R} \to \mathbb{R}$ is analytic, at each point $x \in \mathbb{R}$ the Taylor
expansion provides a holomorphic extension of $f$ into some open disk in the
complex plane with radius $r(x) > 0$ and center $x$. The extended function
again has period $2\pi$, since the coefficients of the Taylor series at $x$ and
at $x + 2\pi$ coincide for the $2\pi$-periodic function $f : \mathbb{R} \to \mathbb{R}$. The disks
corresponding to all points of the interval $[0, 2\pi]$ provide an open covering
of $[0, 2\pi]$. Since $[0, 2\pi]$ is compact, a finite number of these disks suffices to
cover $[0, 2\pi]$. Then we have an extension into a strip $D$ with finite width
$2a$ contained in the union of the finite number of disks. Without loss of
generality we may assume that $f$ is bounded on $D$.
From the residue theorem we have that

$$\int_{i\sigma}^{i\sigma + 2\pi} \cot\frac{nz}{2}\,f(z)\,dz - \int_{-i\sigma}^{-i\sigma + 2\pi} \cot\frac{nz}{2}\,f(z)\,dz = -\frac{4\pi i}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right)$$

for each $0 < \sigma < a$. This implies that

$$\operatorname{Re} \int_{i\sigma}^{i\sigma + 2\pi} i \cot\frac{nz}{2}\,f(z)\,dz = \frac{2\pi}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right),$$

since by the Schwarz reflection principle, $f$ enjoys the symmetry property
$f(\bar{z}) = \overline{f(z)}$. By Cauchy's integral theorem we have

$$\operatorname{Re} \int_{i\sigma}^{i\sigma + 2\pi} f(z)\,dz = \int_0^{2\pi} f(x)\,dx,$$

and combining the last two equations yields

$$E_n(f) = \operatorname{Re} \int_{i\sigma}^{i\sigma + 2\pi} \left(1 - i \cot\frac{nz}{2}\right) f(z)\,dz$$

for all $0 < \sigma < a$. Now the estimate follows from

$$\left|1 - i \cot\frac{nz}{2}\right| \le \frac{2}{e^{n\sigma} - 1}$$

for $\operatorname{Im} z = \sigma$ and then passing to the limit $\sigma \to a$. □
The estimate shows that for periodic analytic functions the rectangu-
lar rule is of exponential order; i.e., doubling the number of quadrature
points doubles the number of correct digits in the approximate value for
the integral.
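
The exponential convergence is easy to observe in practice. The sketch below applies the rectangular rule to the analytic $2\pi$-periodic function $f(x) = 1/(2 + \cos x)$, whose integral over $[0, 2\pi]$ is $2\pi/\sqrt{3}$; this test function is chosen here for illustration and does not appear in the text.

```python
import math

def rectangular_rule(f, n):
    """Rectangular rule (2*pi/n) * sum_{k=1}^{n} f(2*pi*k/n) on [0, 2*pi]."""
    h = 2 * math.pi / n
    return h * sum(f(k * h) for k in range(1, n + 1))

f = lambda x: 1.0 / (2.0 + math.cos(x))
exact = 2 * math.pi / math.sqrt(3.0)

for n in (4, 8, 16, 32):
    print(f"{n:3d} {exact - rectangular_rule(f, n): .3e}")
```

Each doubling of $n$ roughly squares the size of the error until rounding errors dominate, in contrast to the fixed reduction factors seen for the nonperiodic example in Table 9.2.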

9.5 Romberg Integration


We now proceed with describing the extrapolation method due to Richard-
son (1927). Its basic idea is to derive high-order approximation methods
from simple low-order methods. It can be applied to a variety of formulae in
numerical analysis, and its application to the Euler-Maclaurin expansion
was suggested by Romberg in 1955.
Recall the composite trapezoidal rule

$$T_h^1(f) := h\left[\frac{1}{2} f(a) + \sum_{k=1}^{n-1} f(a + kh) + \frac{1}{2} f(b)\right]$$

with step size $h = (b - a)/n$. If $f$ is four-times continuously differentiable,
by the Euler-Maclaurin expansion from Theorem 9.26 we have an error
representation of the form

$$\int_a^b f(x)\,dx = T_h^1(f) + \gamma_1 h^2 + O(h^4)$$

for some constant $\gamma_1$ depending on $f$ but not on $h$. Hence, for half the step
size, we have that

$$\int_a^b f(x)\,dx = T_{h/2}^1(f) + \gamma_1 \frac{h^2}{4} + O(h^4).$$

From these two equations we can eliminate the terms containing $h^2$; i.e.,
we multiply the first equation by $-1/3$ and the second equation by $4/3$ and
add both equations to obtain

$$\int_a^b f(x)\,dx = \frac{1}{3}\left[4\,T_{h/2}^1(f) - T_h^1(f)\right] + O(h^4).$$

Hence, the linear combination

$$T_h^2(f) := \frac{1}{3}\left[4\,T_{h/2}^1(f) - T_h^1(f)\right]$$

of the composite trapezoidal rule with step sizes $h$ and $h/2$ leads to a
quadrature formula with the improved error order $O(h^4)$. The quadrature
$T_h^2(f)$ coincides with the composite Simpson's rule for the step size $h/2$.
If $f$ is six-times continuously differentiable, by linearly combining the
Euler-Maclaurin formulae for the step sizes $h$ and $h/2$ we obtain an error
representation of the form

$$\int_a^b f(x)\,dx = T_h^2(f) + \gamma_2 h^4 + O(h^6)$$

for some constant $\gamma_2$ depending only on $f$. From this and the corresponding
formula

$$\int_a^b f(x)\,dx = T_{h/2}^2(f) + \gamma_2 \frac{h^4}{16} + O(h^6)$$

for step size $h/2$, by eliminating the terms containing $h^4$ we obtain the
quadrature formula

$$T_h^3(f) := \frac{1}{15}\left[16\,T_{h/2}^2(f) - T_h^2(f)\right]$$

with an error of order $O(h^6)$. Note that the actual numerical evaluation of
$T_h^3(f)$ requires the values for the composite trapezoidal rule for the step
sizes $h$, $h/2$, and $h/4$.
Obviously, this procedure can be repeated, and this leads to the sequence
of Romberg quadrature formulae. Let

$$T_k^1(f) := T_{h_k}^1(f), \quad k = 0, 1, 2, \ldots,$$

be the trapezoidal sums for the step sizes $h_k := h/2^k$. Then for $m = 1, 2, \ldots$
the Romberg quadratures are recursively defined by

$$T_k^{m+1}(f) := \frac{1}{4^m - 1}\left[4^m\,T_{k+1}^m(f) - T_k^m(f)\right], \quad k = 0, 1, \ldots. \tag{9.28}$$

For the error we have the following theorem.


Theorem 9.29 Let $f : [a, b] \to \mathbb{R}$ be $2m$-times continuously differentiable.
Then for the Romberg quadratures we have the error estimate

$$\left|\int_a^b f(x)\,dx - T_k^m(f)\right| \le C_m\,\|f^{(2m)}\|_\infty \left(\frac{h}{2^k}\right)^{2m}, \quad k = 0, 1, \ldots,$$

for some constant $C_m$ depending on $m$.
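Proof. By induction, we show that there exist constants $\gamma_{j,i}$ such that

$$\left|\int_a^b f(x)\,dx - T_k^i(f) + \sum_{j=i}^{m-1} \gamma_{j,i}\left[f^{(2j-1)}(b) - f^{(2j-1)}(a)\right]\left(\frac{h}{2^k}\right)^{2j}\right| \le \gamma_{m,i}\,\|f^{(2m)}\|_\infty \left(\frac{h}{2^k}\right)^{2m} \tag{9.29}$$

for $i = 1, \ldots, m$ and $k = 0, 1, \ldots$. Here the sum on the left-hand side is set
equal to zero for $i = m$. By the Euler-Maclaurin expansion this is true for
$i = 1$ with $\gamma_{j,1} = b_{2j}/(2j)!$ for $j = 1, \ldots, m - 1$ and

$$\gamma_{m,1} = (b - a) \sup_{x \in [0,1]} |B_{2m}(x)|.$$

As an abbreviation we set

$$F_j := f^{(2j-1)}(b) - f^{(2j-1)}(a), \quad j = 1, \ldots, m - 1.$$

Assume that (9.29) has been shown for some $1 \le i < m$. Then, using (9.28),
we obtain

$$\frac{4^i}{4^i - 1}\left[\int_a^b f(x)\,dx - T_{k+1}^i(f) + \sum_{j=i}^{m-1} \left(\frac{h}{2^{k+1}}\right)^{2j} \gamma_{j,i} F_j\right] - \frac{1}{4^i - 1}\left[\int_a^b f(x)\,dx - T_k^i(f) + \sum_{j=i}^{m-1} \left(\frac{h}{2^k}\right)^{2j} \gamma_{j,i} F_j\right]$$

$$= \int_a^b f(x)\,dx - T_k^{i+1}(f) + \sum_{j=i+1}^{m-1} \left(\frac{h}{2^k}\right)^{2j} \gamma_{j,i+1} F_j,$$

where

$$\gamma_{j,i+1} = \frac{4^{i-j} - 1}{4^i - 1}\,\gamma_{j,i}, \quad j = i + 1, \ldots, m - 1.$$

Now with the aid of the induction assumption we can estimate

$$\left|\int_a^b f(x)\,dx - T_k^{i+1}(f) + \sum_{j=i+1}^{m-1} \gamma_{j,i+1} F_j \left(\frac{h}{2^k}\right)^{2j}\right| \le \gamma_{m,i+1}\,\|f^{(2m)}\|_\infty \left(\frac{h}{2^k}\right)^{2m},$$

where

$$\gamma_{m,i+1} = \frac{4^{i-m} + 1}{4^i - 1}\,\gamma_{m,i},$$

and the proof is complete. □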

From Theorem 9.29 we conclude that the Romberg quadrature $T_k^m$ integrates
polynomials of degree less than or equal to $2m - 1$ exactly. For
$h = b - a$ the Romberg quadrature $T_0^m$ uses $2^{m-1} + 1$ equidistant integration
points. Therefore, $T_0^1$ coincides with the trapezoidal rule, $T_0^2$ with Simpson's
rule, and $T_0^3$ with Milne's rule. Similarly, $T_k^1$, $T_k^2$, and $T_k^3$ correspond
to the composite trapezoidal rule, the composite Simpson's rule, and the
composite Milne rule, respectively. For $m \ge 4$ the number of the quadrature
points in $T_0^m$ is greater than the degree of exactness. The Romberg formula
$T_0^4$ uses nine quadrature points, and this is the number of quadrature points
where the Newton-Cotes formulae start having negative weights.
Theorem 9.30 The quadrature weights of the Romberg formulae are positive.

Proof. We define recursively $Q_k^1 := 4\,T_{k+1}^1 - 2\,T_k^1$ and

$$Q_k^{m+1} := \frac{1}{4^m - 1}\left[2^{2m+1}\,T_{k+1}^m + 2\,T_k^m + 4^{m+1}\,Q_{k+1}^m\right] \tag{9.30}$$

for $k = 1, 2, \ldots$ and $m = 1, 2, \ldots$ and show by induction that

$$T_k^{m+1} = \frac{1}{4^m - 1}\left[T_k^m + Q_k^m\right]. \tag{9.31}$$

By the definition of $Q_k^1$ this is true for $m = 1$. We assume that (9.31) has
been proven for some $m \ge 1$. Then, using the recursive definitions of $T_k^m$
and $Q_k^m$ and the induction assumption, we derive

$$T_k^{m+1} + Q_k^{m+1} = \frac{4^{m+1}}{4^m - 1}\left[T_{k+1}^m + Q_{k+1}^m\right] - \frac{1}{4^m - 1}\left[4^m\,T_{k+1}^m - T_k^m\right]$$

$$= 4^{m+1}\,T_{k+1}^{m+1} - T_k^{m+1} = (4^{m+1} - 1)\,T_k^{m+2};$$

i.e., (9.31) also holds for $m + 1$. Now, from (9.30) and (9.31), by induction
with respect to $m$, it can be deduced that the weights of $T_k^m$ are positive
and that the weights of $Q_k^m$ are nonnegative. □

Corollary 9.31 For the Romberg quadratures we have convergence:

$$\lim_{m\to\infty} T_k^m(f) = \int_a^b f(x)\,dx \quad \text{and} \quad \lim_{k\to\infty} T_k^m(f) = \int_a^b f(x)\,dx$$

for all continuous functions $f$.

Proof. This follows from Theorems 9.29 and 9.30 and Corollary 9.11. □

For continuous functions, the trapezoidal sums converge as the step size
tends to zero. This motivates us to consider a polynomial in $h^2$ interpolating
the values $T_k^1(f), \ldots, T_{k+m}^1(f)$ at the interpolation points $h_k^2, \ldots, h_{k+m}^2$ and
evaluate it at $h = 0$.
Theorem 9.32 Denote by $L_k^m$ the uniquely determined polynomial in $h^2$
of degree less than or equal to $m$ with the interpolation property

$$L_k^m(h_j^2) = T_j^1(f), \quad j = k, \ldots, k + m.$$

Then the Romberg quadratures satisfy

$$T_k^{m+1}(f) = L_k^m(0). \tag{9.32}$$

Proof. Obviously, (9.32) is true for $m = 0$. Assume that it has been proven
for $m - 1$. Then, using the Neville scheme from Theorem 8.9, we obtain

$$L_k^m(0) = \frac{1}{h_k^2 - h_{k+m}^2}\left[h_k^2\,L_{k+1}^{m-1}(0) - h_{k+m}^2\,L_k^{m-1}(0)\right] = \frac{1}{4^m - 1}\left[4^m\,T_{k+1}^m(f) - T_k^m(f)\right] = T_k^{m+1}(f),$$

establishing (9.32) for $m$. □

This interpretation of the Romberg quadrature as an extrapolation method


in the sense of Richardson opens up the possibility of modifications using
other than equidistant step sizes.
Table 9.4 gives the error between the exact value of the integral from
Example 9.6 and its numerical approximation by the Romberg quadrature,
exhibiting its fast convergence according to the error estimates of Theorem
9.29. Clearly, the first two columns of Table 9.4 have to coincide with Table
9.2.

TABLE 9.4. Romberg quadratures for Example 9.6

k    T^1            T^2            T^3            T^4
1 -0.05685282
2 -0.01518615 -0.00129726
4 -0.00387663 -0.00010679 -0.00002742
8 -0.00097467 -0.00000735 -0.00000072 -0.00000030
16 -0.00024402 -0.00000047 -0.00000001 -0.00000000
32 -0.00006103 -0.00000003 -0.00000000 -0.00000000
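
The recursion (9.28) is straightforward to implement column by column. The sketch below is a minimal illustration, again assuming that Example 9.6 refers to $\int_0^1 dx/(1+x) = \ln 2$; the function name `romberg_columns` and the list layout are choices made here.

```python
import math

def romberg_columns(f, a, b, levels):
    """Return columns with columns[m-1][k] = T_k^m(f) for the basic step h = b - a."""
    h = b - a
    first = []
    for k in range(levels + 1):
        n = 2 ** k
        hk = h / n
        interior = sum(f(a + i * hk) for i in range(1, n))
        first.append(hk * (0.5 * f(a) + interior + 0.5 * f(b)))
    columns = [first]
    for m in range(1, levels + 1):
        prev = columns[-1]
        # Recursion (9.28): T_k^{m+1} = (4^m T_{k+1}^m - T_k^m) / (4^m - 1).
        columns.append([(4 ** m * prev[k + 1] - prev[k]) / (4 ** m - 1)
                        for k in range(len(prev) - 1)])
    return columns

# Assumed integrand of Example 9.6.
f = lambda x: 1.0 / (1.0 + x)
exact = math.log(2.0)
cols = romberg_columns(f, 0.0, 1.0, 5)
for m, col in enumerate(cols[:4], start=1):
    print(m, [f"{exact - t: .8f}" for t in col])
```

The entry $T_0^m$ is the first element of column $m$; it uses the trapezoidal sums with $1, 2, \ldots, 2^{m-1}$ subintervals and corresponds to the first entry of the $m$th column in Table 9.4.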

We finish this section with the corresponding Table 9.5 for the integral

$$\int_0^1 \sqrt{x}\,dx = \frac{2}{3} \tag{9.33}$$

of a function that is not differentiable in all of the integration interval. Not
surprisingly, the convergence is notably slower.

TABLE 9.5. Romberg quadratures for the integral (9.33)

k    T^1        T^2        T^3        T^4        T^5

1 0.166667
2 0.063113 0.028595
4 0.023384 0.010140 0.008910
8 0.008536 0.003587 0.003151 0.003059
16 0.003085 0.001268 0.001114 0.001082 0.001074
32 0.001108 0.000448 0.000394 0.000382 0.000380

9.6 Improper Integrals


We conclude this chapter with an example for the numerical integration of
improper integrals and describe a class of quadrature rules for the integral

$$\int_0^1 f(x)\,dx$$

where the integrand $f$ is sufficiently smooth in $(0, 1)$ but is allowed to have
singularities at the endpoints $x = 0$ and $x = 1$ such that $f$ is nonetheless
integrable.
Let the function $w : [0, 2\pi] \to [0, 1]$ be bijective, strictly monotonically
increasing, and infinitely differentiable. Then we can substitute $x = w(t)$
and consequently obtain

$$\int_0^1 f(x)\,dx = \int_0^{2\pi} g(t)\,dt,$$

where

$$g(t) := w'(t)\,f(w(t)), \quad 0 < t < 2\pi.$$

Now assume that the function $w$ has derivatives

$$w^{(j)}(0) = w^{(j)}(2\pi) = 0, \quad j = 1, \ldots, p - 1, \tag{9.34}$$

and

$$w^{(p)}(0) \ne 0, \quad w^{(p)}(2\pi) \ne 0 \tag{9.35}$$

for some $p \in \mathbb{N}$. Then we may expect that the function $g$ and some of
its derivatives up to a certain order vanish at $t = 0$ and $t = 2\pi$; i.e., $g$
can be considered as a sufficiently smooth $2\pi$-periodic function, and the
rectangular rule may be applied to the transformed integral. This yields
the quadrature formula

$$\int_0^1 f(x)\,dx \approx \sum_{k=1}^{n-1} a_k f(x_k) \tag{9.36}$$

with the quadrature points and weights given by

$$x_k = w\left(\frac{2\pi k}{n}\right), \quad a_k = \frac{2\pi}{n}\,w'\left(\frac{2\pi k}{n}\right), \quad k = 1, \ldots, n - 1.$$

In addition, it is natural to require the symmetry property

$$w(t) = 1 - w(2\pi - t), \quad t \in [0, 2\pi]. \tag{9.37}$$

Then the quadrature points and weights have the symmetry

$$x_{n-k} = 1 - x_k, \quad a_{n-k} = a_k, \quad k = 1, \ldots, n - 1,$$

and from the assumptions (9.34) and (9.35), by Taylor's formula, it follows
that they satisfy the inequalities

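$$c_0 \left(\frac{k}{n}\right)^p \le x_k, \; 1 - x_{n-k} \le c_1 \left(\frac{k}{n}\right)^p, \quad k = 1, \ldots, \left[\frac{n}{2}\right], \tag{9.38}$$

and

$$\frac{c_0}{n} \left(\frac{k}{n}\right)^{p-1} \le a_k, \; a_{n-k} \le \frac{c_1}{n} \left(\frac{k}{n}\right)^{p-1}, \quad k = 1, \ldots, \left[\frac{n}{2}\right], \tag{9.39}$$

for some constants $0 < c_0 < c_1$ depending on the function $w$. From (9.38) it
is obvious that the quadrature points are graded towards the two endpoints
$x = 0$ and $x = 1$ of the integration interval.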
For substitutions with the properties (9.34), (9.35), and (9.37), from the
Euler-Maclaurin expansion applied to the integral over $g$ we now will derive
an estimate for the remainder term

$$E_n(f) := \int_0^1 f(x)\,dx - \sum_{k=1}^{n-1} a_k f(x_k).$$

For $q \in \mathbb{N}$ and $0 < \alpha \le 1$ by $S^{q,\alpha}$ we denote the linear space of $q$-times
continuously differentiable functions $f : (0, 1) \to \mathbb{R}$ for which

$$\sup_{0 < x < 1}\,[x(1 - x)]^{j+1-\alpha}\,|f^{(j)}(x)| < \infty$$

for $j = 0, \ldots, q$. On $S^{q,\alpha}$ we define the norm

$$\|f\|_{q,\alpha} := \max_{j=0,\ldots,q}\,\sup_{0 < x < 1}\,[x(1 - x)]^{j+1-\alpha}\,|f^{(j)}(x)|.$$

Then, clearly

$$|f^{(j)}(x)| \le \|f\|_{q,\alpha}\,[x(1 - x)]^{\alpha-j-1}, \quad 0 < x < 1, \tag{9.40}$$

for $j = 0, \ldots, q$.

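Theorem 9.33 Let $p \in \mathbb{N}$ and assume that $w$ satisfies (9.34), (9.35), and
(9.37). Further, let $q \in \mathbb{N}$ and $f \in S^{2q+1,\alpha}$ with $0 < \alpha \le 1$ such that

$$2q + 1 < \alpha p \quad \text{and} \quad 2q + 2 \le p.$$

Then the error in the quadrature formula (9.36) can be estimated by

$$|E_n(f)| \le \frac{C}{n^{2q+1}}\,\|f\|_{2q+1,\alpha}$$

with some constant $C$ depending on $w$, $\alpha$, and $q$.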


Proof. For the derivatives of 9 we can write
r
g(r)(t) =L uj(t) jU)(w(t», r = 0, ... , 2q + 1.
j=O

Then from

g(r+l)(t) = t;
r [
uj(t) w'(t) jU+1)(W(t» -it
+ d U 'fU)(w(t»
:(t)]

we derive the recursion formulae


du(j(t)
dt
j = 0,
, duj(t) (9.41)
Uj_1 (t) w (t) + -----;jt , j = 1, ... ,r,

u~(t) w'(t), j = r + 1,
for the coefficients uj. In particular, we have

u(j(t) = w(r+l)(t) and u~(t) = [w'(tW+l. (9.42)

The functions uj satisfy

uj(t) = O([t(21l' - t»)yi, t(21l' - t) ~ 0, (9.43)

for r = 0, ... ,2q + 1 and j = 0, ... ,r, where


zj = p - 1 + jp - r.

For j = °and j = r this is obvious from the assumption on wand (9.42),


and for j = 1, ... , r - 1 it follows by induction from the recursion formulae
°
(9.41). Note that zj ~ because of the assumption p ~ 2q + 2.
Using (9.40) and the assumptions on w, we can estimate
220 9. Numerical Integration

for some constant C 1 , and with the aid of (9.43), we further obtain that

$$|u_j^r(t)\, f^{(j)}(w(t))| \le C_2 \|f\|_{2q+1,\alpha}\, [t(2\pi - t)]^{\alpha p - r - 1}, \quad 0 < t < 2\pi, \tag{9.44}$$

for some constant $C_2$ and $r = 0, \dots, 2q+1$ and $j = 0, \dots, r$. From this,
since $\alpha p > 2q + 1$, we observe that for $r = 0, \dots, 2q$ the derivatives $g^{(r)}$ can
be continuously extended from $(0, 2\pi)$ onto $[0, 2\pi]$ with values

$$g^{(r)}(0) = g^{(r)}(2\pi) = 0, \quad r = 0, \dots, 2q.$$

Furthermore, from (9.44) and the assumption $\alpha p > 2q + 1$ we see that the
integral of $g^{(2q+1)}$ over $[0, 2\pi]$ exists as an improper integral and

with some constant $C_3$ depending on $w$, $\alpha$, and $q$. Now the statement follows
from Corollary 9.27 of the Euler-Maclaurin expansion. Note that for
the Euler-Maclaurin expansion (9.27) to be valid it obviously suffices that
the integral of the error term exists as an improper integral. □

We proceed by describing a few examples for substitutions $w$ (see Problem 9.19).
In 1963 Korobov suggested the polynomial transformation

$$w_p(t) := \left[\int_0^{2\pi} [s(2\pi - s)]^{p-1}\,ds\right]^{-1} \int_0^t [s(2\pi - s)]^{p-1}\,ds. \tag{9.45}$$

The trigonometric transformation

$$w_p(t) := \left[\int_0^{2\pi} \sin^{p-1}\frac{s}{2}\,ds\right]^{-1} \int_0^t \sin^{p-1}\frac{s}{2}\,ds \tag{9.46}$$

with the special cases

was proposed by Sidi [54]. Substitutions of the form

$$w_p(t) := \frac{t^p}{t^p + (2\pi - t)^p} \tag{9.47}$$

were considered in [40]. As a rule of thumb, these substitutions should not
be used for $p$ too large, say $p > 10$, because this may lead to overgrading
and numerical difficulties with underflow. The substitutions

$$w(t) = \left[\int_0^{2\pi} \exp\left(-\frac{\pi}{s} - \frac{\pi}{2\pi - s}\right) ds\right]^{-1} \int_0^t \exp\left(-\frac{\pi}{s} - \frac{\pi}{2\pi - s}\right) ds$$

and

w(t) = exp (2)


_~
t
+ (2)t exp __1f_
21f -
with zeros of infinite order at the endpoints, which were suggested by Iri,
Moriguti, and Takesawa [58] and by Sag and Szekeres [52], respectively,
also suffer from this drawback.
As a numerical example we consider the improper integral

$$\int_0^1 \frac{dx}{\sqrt{x}} = 2. \tag{9.48}$$

Table 9.6 gives the error between the exact value and the numerical ap-
proximation obtained by using the substitution (9.47).

TABLE 9.6. Numerical quadrature for the integral (9.48)

  n        p=3           p=4           p=5           p=6
  8    0.07012542   -0.06064201   -0.22007377   -0.42795942
 16    0.02849925    0.00455233   -0.00438402   -0.01896018
 32    0.00992273    0.00129852    0.00011279   -0.00003394
 64    0.00347755    0.00032530    0.00002117   -0.00000019
128    0.00122386    0.00008137    0.00000382   -0.00000001
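
The idea behind this experiment can be sketched in a few lines of Python. The sketch below is an illustration under assumptions, not the program used for Table 9.6: it applies the trapezoidal rule on $[0, 2\pi]$ with $n$ subintervals to the transformed integrand $g(t) = w'(t) f(w(t))$, with $w$ taken from (9.47) and $w'$ obtained by the quotient rule. The function names are hypothetical, and since the exact node and weight convention of (9.36) is not restated here, the printed numbers need not coincide digit for digit with the table; only the qualitative behavior (faster decay of the error for larger $p$) should be expected.

```python
import math

def w(t, p):
    """Substitution (9.47): maps [0, 2*pi] monotonically onto [0, 1]."""
    s = 2.0 * math.pi - t
    return t**p / (t**p + s**p)

def w_prime(t, p):
    """Derivative of (9.47), computed by hand with the quotient rule."""
    s = 2.0 * math.pi - t
    return 2.0 * math.pi * p * (t * s) ** (p - 1) / (t**p + s**p) ** 2

def substituted_trapezoid(f, n, p):
    """Trapezoidal rule for g(t) = w'(t) f(w(t)) on [0, 2*pi].

    The two endpoint terms are omitted because g vanishes there.
    """
    h = 2.0 * math.pi / n
    return h * sum(w_prime(k * h, p) * f(w(k * h, p)) for k in range(1, n))

def f(x):
    # Integrand of (9.48); singular at x = 0, exact integral equals 2.
    return 1.0 / math.sqrt(x)

for n in (8, 16, 32, 64, 128):
    errors = [2.0 - substituted_trapezoid(f, n, p) for p in (3, 4, 5, 6)]
    print(f"{n:4d}", " ".join(f"{e: .8f}" for e in errors))
```

Note that the integrand is never evaluated at the singular point $x = 0$, since all nodes $w(2\pi k/n)$, $k = 1, \dots, n-1$, lie strictly inside $(0,1)$.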

Problems
9.1 Show that the error for the composite trapezoidal rule can be expressed in
the form

$$\int_a^b f(x)\,dx - T_h(f) = -\int_a^b K_T(x)\, f''(x)\,dx,$$

where the so-called Peano kernel $K_T$ is given by

$$K_T(x) = \frac{1}{2}\,(x - x_{k-1})(x_k - x), \quad x_{k-1} \le x \le x_k,$$

for $k = 1, \dots, n$. Use this error representation for an alternative proof of Theorem
9.7.

9.2 Show that the error for the composite Simpson's rule can be expressed in
the form

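where the Peano kernel $K_S$ is given by

$$K_S(x) := \begin{cases} \dfrac{h}{18}\,(x - x_{k-2})^3 - \dfrac{1}{24}\,(x - x_{k-2})^4, & x_{k-2} \le x \le x_{k-1}, \\[2mm] \dfrac{h}{18}\,(x_k - x)^3 - \dfrac{1}{24}\,(x_k - x)^4, & x_{k-1} \le x \le x_k, \end{cases}$$

for $k = 2, 4, \dots, n$. Use this error representation for an alternative proof of Theorem 9.8.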

9.3 For Newton's three-eighths rule prove the error representation

$$\int_a^b f(x)\,dx - \frac{b-a}{8}\,\big[f(a) + 3f(a+h) + 3f(b-h) + f(b)\big] = -\frac{3h^5}{80}\, f^{(4)}(\xi)$$

with some $\xi$ in $[a, b]$ and $h = (b - a)/3$.

9.4 Show that the weight a4 for the Newton-Cotes formula of order eight is
negative.

9.5 For the remainder En of the Newton-Cotes formula of order n on the in-
terval [-1,1), applied to the Chebyshev polynomial T n + l , show that

n 1
E (T ) _ (n+ 1)!4 + n E IN.
n n+1 - nn+2

From this conclude that if n odd, then

where
(n - 1)' 4 n + 1
Tn = 3 nn+2 --t 00, n --t 00.

Hint: Use Theorem 8.10 and show that

Io
n ( ) z
n+l
dz = -2 II ( )
0
z
n+2
dz

for n odd.

9.6 Compute the weights for the polynomial interpolatory quadratures with
equidistant quadrature points

$$x_k = a + (k+1)\,\frac{b-a}{n+2}, \quad k = 0, 1, \dots, n,$$

for $n = 0, 1, 2$ and obtain representations of the quadrature errors. These formulae
are called open Newton-Cotes quadratures, since the two endpoints $a$ and $b$ are
omitted.

9.7 For $n \in \mathbb{N}$, a quadrature formula of the form

$$\int_a^b f(x)\,dx \approx \frac{b-a}{n} \sum_{k=1}^{n} f(x_k)$$

with distinct quadrature points $x_1, \dots, x_n \in [a, b]$ and equal weights is called a
Chebyshev quadrature if it integrates polynomials in $P_n$ exactly. Find the Chebyshev
quadratures for $n = 1, 2, 3, 4$. (Chebyshev quadratures exist only for $n < 8$.)
9.8 Show that there exists no polynomial interpolatory quadrature of order n
that integrates polynomials of degree 2n + 2 exactly.
9.9 The Chebyshev polynomial of the second kind $U_n$ of degree $n$ is defined by

$$U_n(x) := \frac{\sin((n+1)\arccos x)}{\sin(\arccos x)}, \quad -1 \le x \le 1.$$

Show that $U_0(x) = 1$, $U_1(x) = 2x$, and

$$U_{n+1}(x) + U_{n-1}(x) = 2x\,U_n(x), \quad n = 1, 2, \dots.$$

Prove the orthogonality relation

$$\int_{-1}^{1} \sqrt{1 - x^2}\; U_n(x)\, U_m(x)\,dx = \frac{\pi}{2}\,\delta_{nm}.$$

9.10 Show that the quadrature points and quadrature weights for the Gauss-Chebyshev
quadrature of order $n - 1$ for the integral

$$\int_{-1}^{1} \sqrt{1 - x^2}\; f(x)\,dx$$

are given by

$$x_k = \cos\frac{(k+1)\pi}{n+1}$$

and

$$a_k = \frac{\pi}{n+1}\,\sin^2\frac{(k+1)\pi}{n+1}$$

for $k = 0, \dots, n-1$.

9.11 Find the quadrature weights $a_0, a_1, a_2, a_3$, and the (remaining) quadrature
points $x_1, x_2$ of a quadrature formula of the form

$$\int_{-1}^{1} f(x)\,dx \approx a_0 f(-1) + a_1 f(x_1) + a_2 f(x_2) + a_3 f(1)$$

that is exact for all polynomials in $P_5$. (This is an example of a Gauss-Lobatto
quadrature, i.e., a Gauss quadrature with two preassigned quadrature points.)
Find the quadrature weights $a_0, a_1, a_2$, and the (remaining) quadrature points
$x_1, x_2$ of a quadrature formula of the form

$$\int_{-1}^{1} f(x)\,dx \approx a_0 f(-1) + a_1 f(x_1) + a_2 f(x_2)$$

that is exact for all polynomials in $P_4$. (This is an example of a Gauss-Radau
quadrature, i.e., a Gauss quadrature with one preassigned quadrature point.)

9.12 By approximating the integral

$$\int_0^{2\pi} \frac{1}{5 - 4\cos x}\,dx = \frac{2\pi}{3}$$

by the rectangular rule and Simpson's rule convince yourself of the superiority of
the rectangular rule for periodic functions.

9.13 Verify the Fourier series (9.25) and (9.26) for the periodic Bernoulli poly-
nomials.

9.14 For the Bernoulli polynomials show that the series

is absolutely and locally uniformly convergent for all x E [0,1] and all t E (-1,1).

9.15 Derive a quadrature formula by integrating the interpolating cubic spline


from Theorem 8.30 and discuss its relation to the Euler-Maclaurin expansion.

9.16 Write a computer program for Romberg integration and test it for various
examples.

9.17 Calculate the weights of the Romberg quadratures Tf and Tt.


9.18 Show that the Richardson extrapolation for the midpoint rule (9.20) leads
to nonnegative quadrature weights.

9.19 Show that the functions (9.45), (9.46), and (9.47) are strictly monotonically
increasing, infinitely differentiable, and map $[0, 2\pi]$ onto $[0, 1]$ such that
(9.34), (9.35), and (9.37) are satisfied.

9.20 Write a computer program for the numerical quadrature (9.36) using the
substitution (9.47) and test it for various examples.
10
Initial Value Problems

Historically, the study of differential equations originated in the beginnings
of calculus with Newton and Leibniz in the seventeenth century and is
closely interwoven with the general development of mathematics. To a sub-
stantial degree, the central role of differential equations within mathematics
is due to the fact that many important problems in science and engineering
are modeled by differential equations.
This chapter will be devoted to an introduction to the basic numerical
approximation methods for initial value problems for ordinary differential
equations. For a more comprehensive study we refer to [13,33,42,46,55].
Analogous to the need for numerical quadrature formulas, numerical meth-
ods for the approximate solution of ordinary differential equations are nec-
essary, since in general, no explicit solutions of the differential equation
will be known, despite the fact that there exists a broad range of analyt-
ical solution methods for special classes of ordinary differential equations.
In addition, the functions and data involved in the differential equation
problem quite often will be available only at discrete points. However, we
would like to emphasize that despite the availability of numerical methods
the study of elementary analytical methods for the solution of ordinary
differential equations remains worthwhile, since it provides a first step into
gaining insight into the general structure of differential equations.
A solid foundation for numerical approximation methods for differen-
tial equations, including their convergence and error analysis, requires as
a prerequisite results on the existence and uniqueness of the solution to
the problem to be approximately solved. Therefore, in Section 10.1 we will
begin with proving the fundamental Picard-Lindelöf existence and uniqueness
theorem for initial value problems. In Section 10.2 we will describe
some variants of the simplest method for the numerical solution of initial
value problems, which was first used by Euler. These methods are special
cases of so-called single-step methods, for which we will give a convergence
and error analysis in Section 10.3. This section also includes a short dis-
cussion of the Runge-Kutta method as the most widely used single-step
method. The final section, Section 10.4, is concerned with the description
and analysis of multistep methods.
We wish to note explicitly that this chapter is also meant to serve as
an application of some of the material provided in Chapters 8 and 9 on
interpolation and numerical integration.

10.1 The Picard-Lindelof Theorem


Definition 10.1 Let $G \subset \mathbb{R}^2$ be a domain and $f : G \to \mathbb{R}$. A continuously
differentiable function $u : [a, b] \to \mathbb{R}$ is called a solution of the ordinary
differential equation of the first order

$$u' = f(x, u) \tag{10.1}$$

if $(x, u(x)) \in G$ and $u'(x) = f(x, u(x))$ for all $x \in [a, b]$.
Geometrically speaking, the differential equation (10.1) defines a field of
directions on G. Solving the differential equation means looking for func-
tions whose graphs match this field of directions.
Systems of ordinary differential equations can be included in the discussion
as follows. If $G \subset \mathbb{R}^{n+1}$ is a domain and $f : G \to \mathbb{R}^n$, then a
continuously differentiable function $u : [a, b] \to \mathbb{R}^n$ is called a solution of
the system of ordinary differential equations of the first order

$$u' = f(x, u)$$

if $(x, u(x)) \in G$ and $u'(x) = f(x, u(x))$ for all $x \in [a, b]$. More explicitly,
this system reads

$$u_p' = f_p(x, u_1, \dots, u_n), \quad p = 1, \dots, n,$$

for $u = (u_1, \dots, u_n)^T$ and $f = (f_1, \dots, f_n)^T$. Each ordinary differential
equation

$$u^{(n)} = f(x, u, u', \dots, u^{(n-1)})$$

of order $n$ is equivalent to the system

$$u_1' = u_2, \quad u_2' = u_3, \quad \dots, \quad u_{n-1}' = u_n, \quad u_n' = f(x, u_1, \dots, u_n)$$

via $u_1 = u$, $u_2 = u'$, $\dots$, $u_n = u^{(n-1)}$. Therefore, in principle, considering
only differential equations of the first order is no loss of generality.
From the wide field of applications we sketch only the following two
simple examples.

Example 10.2 By Newton's law, the differential equation of the second order

$$m u'' = f(t, u)$$

describes the motion of an object of mass $m$ subject to the external force
$f(t, u)$ depending on the location $u$ of the object and the time $t$. Given an
initial location $u_0$ and an initial velocity $u_0'$ at the initial time $t = 0$, one
wants to find the position $u(t)$ of the object for all times $t \ge 0$. □

Example 10.3 Let $p = p(t)$ describe the population of a species of animals
or plants at time $t$. If $r(t, p)$ denotes the growth rate given by the difference
between the birth and death rate depending on the time $t$ and the size
$p$ of the population, then an isolated population satisfies the differential
equation

$$\frac{dp}{dt} = r(t, p).$$

The simplest model $r(t, p) = ap$, where $a$ is a positive constant, leads to

$$\frac{dp}{dt} = ap$$

with the explicit solution $p(t) = p_0 e^{a(t - t_0)}$. Such an exponential growth is
realistic only if the population is not too large. The modified model

$$\frac{dp}{dt} = ap - bp^2$$

with positive constants $a$ and $b$ contains a correction term that slows down
the growth rate for large populations and is known as the Verhulst equation.
It was introduced by Verhulst in 1838 as a model for the growth of
the human population. In general, for a given growth rate $r$ one wants to
determine the development of the population $p(t)$ in time for a given initial
population $p_0$ at time $t = t_0$. □

Both examples are typical initial value problems: Find a solution of a
differential equation that attains a given initial value at a given initial
time. This notion is made more precise by the following definition.

Definition 10.4 The initial value problem for the ordinary differential
equation

$$u' = f(x, u) \tag{10.2}$$

consists in finding a continuously differentiable solution $u$ satisfying the
initial condition

$$u(x_0) = u_0 \tag{10.3}$$

for a given initial point $x_0$ and a given initial value $u_0$.

The existence and uniqueness of a solution to such an initial value problem
are settled through the following fundamental theorem.

Theorem 10.5 (Picard-Lindelöf) Let $G \subset \mathbb{R}^{n+1}$ be a domain and let
$f : G \to \mathbb{R}^n$ be a continuous function satisfying a Lipschitz condition

$$\|f(x, u) - f(x, v)\| \le L\,\|u - v\| \tag{10.4}$$

for all $(x, u), (x, v) \in G$ and some constant $L > 0$, which is called the
Lipschitz constant. Then for each initial data pair $(x_0, u_0) \in G$ there exists
an interval $[x_0 - a, x_0 + a]$ with $a > 0$ such that the initial value problem
(10.2)-(10.3) has a unique solution in this interval.
Proof. Firstly, we transform the initial value problem equivalently into the
Volterra integral equation

$$u(x) = u_0 + \int_{x_0}^{x} f(\xi, u(\xi))\,d\xi. \tag{10.5}$$

Clearly, if $u$ solves the initial value problem, then it follows by integrating
the differential equation and using the initial condition that $u$ also solves
the integral equation. Conversely, if $u$ is a continuous solution of the integral
equation, then by differentiating the integral equation it follows that $u$ is
continuously differentiable and satisfies the differential equation. Inserting
$x = x_0$ in (10.5) shows that the initial condition is fulfilled.
For solving the Volterra integral equation we now can employ Banach's
fixed point Theorem 3.45. Since $G$ is open, we can choose a bounded domain
$D$ such that $(x_0, u_0) \in D$ and $\overline{D} \subset G$. Denote by $M$ a bound on the
continuous function $f : D \to \mathbb{R}^n$; i.e.,

$$\|f(x, u)\| \le M, \quad (x, u) \in D.$$

Since $D$ is open, we can choose $a > 0$ such that the closed rectangle

$$B := \{(x, u) \in \mathbb{R}^{n+1} : |x - x_0| \le a,\ \|u - u_0\| \le Ma\}$$

is contained in $D$. Consider the Banach space $C[x_0 - a, x_0 + a]$ of continuous
functions $u : [x_0 - a, x_0 + a] \to \mathbb{R}^n$ furnished with the maximum norm

$$\|u\|_\infty := \max_{|x - x_0| \le a} \|u(x)\|$$

in terms of the chosen norm $\|\cdot\|$ on $\mathbb{R}^n$. Each solution $u$ of the integral
equation satisfies (see (6.1))

$$\|u(x) - u_0\| = \left\| \int_{x_0}^{x} f(\xi, u(\xi))\,d\xi \right\| \le Ma, \quad |x - x_0| \le a,$$

that is,

$$\|u - u_0\|_\infty \le Ma,$$

which implies that the solution remains within the rectangle $B$. We consider
the closed subset

$$U := \{u \in C[x_0 - a, x_0 + a] : \|u - u_0\|_\infty \le Ma\}$$

of the Banach space $C[x_0 - a, x_0 + a]$ and note that by Remark 3.40 the
set $U$ is complete. On $U$ we define an operator $A : U \to U$ by setting

$$(Au)(x) := u_0 + \int_{x_0}^{x} f(\xi, u(\xi))\,d\xi, \quad |x - x_0| \le a.$$

The operator $A$ indeed maps $U$ into itself, since the function $Au$ is continuous
and satisfies $\|Au - u_0\|_\infty \le Ma$. With the aid of the Lipschitz
condition (10.4) and using (6.1) we can estimate

$$\|(Au)(x) - (Av)(x)\| = \left\| \int_{x_0}^{x} [f(\xi, u(\xi)) - f(\xi, v(\xi))]\,d\xi \right\| \le L \left| \int_{x_0}^{x} \|u(\xi) - v(\xi)\|\,d\xi \right| \le La\,\|u - v\|_\infty$$

for all $|x - x_0| \le a$. Hence

$$\|Au - Av\|_\infty \le La\,\|u - v\|_\infty$$

for all $u, v \in U$. Now we choose $a$ such that $a < 1/L$. Then $A : U \to U$ is a
contraction operator, and the Banach fixed point theorem ensures a unique
fixed point of $A$, i.e., a unique solution of the integral equation (10.5) in
the interval $[x_0 - a, x_0 + a]$. □

Exploiting the fact that in Theorem 10.5 the width a of the interval
is determined by the Lipschitz constant L, which is independent of the
initial point (xo,uo), one can assure global existence of the solution; i.e.,
the solution to the initial value problem exists and is unique until it leaves
the domain G of definition for the differential equation.
Note that on a convex domain each function that is continuously dif-
ferentiable with respect to u satisfies a Lipschitz condition (see the mean
value Theorem 6.7).

Corollary 10.6 Under the assumptions of Theorem 10.5, the sequence
$(u_\nu)$ defined by $u_0(x) = u_0$ and

$$u_{\nu+1}(x) := u_0 + \int_{x_0}^{x} f(\xi, u_\nu(\xi))\,d\xi, \quad |x - x_0| \le a, \quad \nu = 0, 1, \dots, \tag{10.6}$$

converges as $\nu \to \infty$ uniformly on $[x_0 - a, x_0 + a]$ to the unique solution $u$
of the initial value problem. We have the a posteriori error estimate

$$\|u - u_\nu\|_\infty \le \frac{La}{1 - La}\, \|u_\nu - u_{\nu-1}\|_\infty, \quad \nu = 1, 2, \dots.$$

Proof. This follows from Theorem 3.46. □

Example 10.7 Consider the initial value problem

$$u' = x^2 + u^2, \quad u(0) = 0,$$

on $G = (-0.5, 0.5) \times (-0.5, 0.5)$. For $f(x, u) := x^2 + u^2$ we have

$$|f(x, u)| \le 0.5$$

on $G$. Hence for any $a < 0.5$ and $M = 0.5$ the rectangle $B$ from the proof
of Theorem 10.5 satisfies $B \subset G$. Furthermore, we can estimate

$$|f(x, u) - f(x, v)| = |u^2 - v^2| = |(u + v)(u - v)| \le |u - v|$$

for all $(x, u), (x, v) \in G$; i.e., $f$ satisfies a Lipschitz condition with Lipschitz
constant $L = 1$. Thus in this case the contraction number in the Picard-Lindelöf
theorem is given by $La < 0.5$.
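Here, the iteration (10.6) reads

$$u_{\nu+1}(x) = \int_0^x \left[\xi^2 + u_\nu^2(\xi)\right] d\xi.$$

Starting with $u_0(x) = 0$ we first compute

$$u_1(x) = \int_0^x \xi^2\,d\xi = \frac{x^3}{3},$$

and from Corollary 10.6 we have the error estimate

$$\|u - u_1\|_\infty \le \|u_1 - u_0\|_\infty = \frac{1}{24} = 0.041\ldots.$$

The second iteration yields

$$u_2(x) = \int_0^x \left[\xi^2 + \frac{\xi^6}{9}\right] d\xi = \frac{x^3}{3} + \frac{x^7}{63}$$

with the error estimate

$$\|u - u_2\|_\infty \le \|u_2 - u_1\|_\infty = \frac{1}{63 \cdot 2^7} = 0.00012\ldots,$$

and the third iteration yields

$$u_3(x) = \int_0^x \left[\xi^2 + \frac{\xi^6}{9} + \frac{2\xi^{10}}{189} + \frac{\xi^{14}}{3969}\right] d\xi = \frac{x^3}{3} + \frac{x^7}{63} + \frac{2x^{11}}{2079} + \frac{x^{15}}{59535}$$

with the error estimate

$$\|u - u_3\|_\infty \le \|u_3 - u_2\|_\infty = \frac{1}{2079 \cdot 2^{10}} + \frac{1}{59535 \cdot 2^{15}} = 0.00000047\ldots.$$

In this example three steps of the Picard-Lindelöf iteration give eight decimal
places of accuracy. However, the example is not typical, since in general,
the integrations required in each iteration step will not be available
explicitly as in the present case. □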
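
Because the right-hand side $x^2 + u^2$ is a polynomial, the iterates of this example can be reproduced exactly with rational arithmetic. The following Python sketch is only an illustration: it stores a polynomial as a list of `Fraction` coefficients (index equals power), and the helper names `poly_mul`, `poly_add`, `poly_integrate`, and `picard_step` are hypothetical names chosen here, not part of the text.

```python
from fractions import Fraction

def poly_mul(a, b):
    """Product of two polynomials given as coefficient lists (index = power)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_add(a, b):
    """Sum of two polynomials, padding the shorter list with zeros."""
    n = max(len(a), len(b))
    a = a + [Fraction(0)] * (n - len(a))
    b = b + [Fraction(0)] * (n - len(b))
    return [x + y for x, y in zip(a, b)]

def poly_integrate(a):
    """Antiderivative that vanishes at x = 0."""
    return [Fraction(0)] + [c / (k + 1) for k, c in enumerate(a)]

def picard_step(u):
    """One step of (10.6) for f(x, u) = x^2 + u^2 with x0 = 0 and u0 = 0."""
    x_squared = [Fraction(0), Fraction(0), Fraction(1)]
    return poly_integrate(poly_add(x_squared, poly_mul(u, u)))

iterates = [[Fraction(0)]]          # u_0(x) = 0
for _ in range(3):
    iterates.append(picard_step(iterates[-1]))

u3 = iterates[3]
for power, coeff in enumerate(u3):
    if coeff != 0:
        print(f"x^{power}: {coeff}")

# A posteriori bound ||u_3 - u_2|| on [-0.5, 0.5], attained at x = 0.5.
diff = poly_add(u3, [-c for c in iterates[2]])
bound = sum(abs(c) * Fraction(1, 2) ** k for k, c in enumerate(diff))
print(float(bound))
```

Working with exact fractions avoids any rounding in the coefficients, so the printed coefficients can be compared directly with the closed form of $u_3$ given above.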

10.2 Euler's Method


In the sequel we confine our presentation to the initial value problem for a
differential equation of the first order. The generalization to systems and
hence to equations of higher order is straightforward. We shall always
tacitly assume that the assumptions of the Picard-Lindelöf Theorem 10.5
are satisfied.
The following simple method for the numerical solution of the initial
value problem

$$u' = f(x, u), \quad u(x_0) = u_0, \tag{10.7}$$

was first used by Euler. Given a step size $h > 0$, it consists in replacing the
derivative $u' = f(x, u)$ throughout the interval $[x_0, x_0 + h]$ by the derivative
$u'(x_0) = f(x_0, u_0)$ at the initial point, i.e., geometrically speaking, by replacing
the solution by its tangent line at the initial point $x_0$. This leads to the
approximation

$$u_1 = u_0 + h f(x_0, u_0) \tag{10.8}$$

for the value $u(x_1)$ of the exact solution at the point $x_1 = x_0 + h$. Repeating
this procedure leads to the Euler method as described in the following
definition. For obvious reasons, this method is also known as the polygon
method, since it approximates the exact solution curve by a polygon.
Definition 10.8 The Euler method for the numerical solution of the initial
value problem (10.7) constructs approximations $u_j$ to the exact solution
$u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1, 2, \dots,$$

with step size $h$ by

$$u_{j+1} := u_j + h f(x_j, u_j), \quad j = 0, 1, \dots.$$

Example 10.9 Consider the initial value problem

$$u' = x^2 + u^2, \quad u(0) = 0,$$

from Example 10.7. Table 10.1 gives the difference between the exact solution
as computed by the Picard-Lindelöf iterations in Example 10.7 and
the approximate solution obtained by Euler's method for various step sizes
$h$. We observe a linear convergence as $h \to 0$. □

TABLE 10.1. Numerical example for the Euler method

  x     h = 0.1    h = 0.01   h = 0.001  h = 0.0001
 0.1   0.000333   0.000048   0.000005   0.000000
 0.2   0.001667   0.000197   0.000020   0.000002
 0.3   0.004003   0.000446   0.000045   0.000005
 0.4   0.007357   0.000798   0.000080   0.000008
 0.5   0.011769   0.001258   0.000127   0.000013
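
A minimal Python sketch of the Euler method of Definition 10.8 for this example follows. As an assumption of the sketch, the third Picard iterate $u_3$ from Example 10.7 is used as a stand-in for the exact solution; its a posteriori error bound of about $4.7 \cdot 10^{-7}$ is small compared with the Euler errors for $h = 0.1$. The function names `euler` and `reference` are hypothetical.

```python
def f(x, u):
    # Right-hand side of the initial value problem u' = x^2 + u^2.
    return x * x + u * u

def euler(f, x0, u0, h, steps):
    """Explicit Euler method; returns the grid points and the approximations."""
    xs, us = [x0], [u0]
    for j in range(steps):
        us.append(us[-1] + h * f(xs[-1], us[-1]))  # uses (x_j, u_j)
        xs.append(x0 + (j + 1) * h)                # then advance to x_{j+1}
    return xs, us

def reference(x):
    """Third Picard iterate u_3 of Example 10.7, used in place of the exact solution."""
    return x**3 / 3 + x**7 / 63 + 2 * x**11 / 2079 + x**15 / 59535

xs, us = euler(f, 0.0, 0.0, 0.1, 5)
errors = [reference(x) - u for x, u in zip(xs, us)]
for x, e in zip(xs[1:], errors[1:]):
    print(f"{x:.1f}  {e:.6f}")
```

Halving the step size and rerunning the loop is a quick way to see the roughly linear decay of the error in $h$.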

There are three different interpretations of the approximation formula of
Euler's method:
1. Replace the derivative by the difference quotient

$$\frac{u(x_1) - u(x_0)}{h} \approx u'(x_0) = f(x_0, u_0)$$

and solve for $u(x_1)$.
2. Integrate in the equivalent integral equation (10.5), i.e., in

$$u(x_1) = u_0 + \int_{x_0}^{x_1} f(\xi, u(\xi))\,d\xi,$$

approximately by the rectangular rule

$$\int_{x_0}^{x_1} f(\xi, u(\xi))\,d\xi \approx h f(x_0, u_0).$$

3. Use Taylor's formula

$$u(x_1) = u(x_0) + h u'(x_0) + \frac{h^2}{2}\, u''(x_0 + \theta h)$$

with $0 < \theta < 1$ and neglect the remainder term; i.e., approximate
$u(x_1) \approx u(x_0) + h u'(x_0)$.

Each of these three interpretations opens up possibilities for improvements
of Euler's method. For example, instead of the rectangular rule we
can use the more accurate trapezoidal rule

$$\int_{x_0}^{x_1} f(\xi, u(\xi))\,d\xi \approx \frac{h}{2}\, \big[f(x_0, u(x_0)) + f(x_1, u(x_1))\big],$$

which yields

$$u_1 = u_0 + \frac{h}{2}\, \big[f(x_0, u_0) + f(x_1, u_1)\big]. \tag{10.9}$$

Repeating this procedure leads to the following method.


Definition 10.10 The implicit Euler method for the numerical solution of
the initial value problem (10.7) constructs approximations $u_j$ to the exact
solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1, 2, \dots,$$

with step size $h$ by

$$u_{j+1} = u_j + \frac{h}{2}\, \big[f(x_j, u_j) + f(x_{j+1}, u_{j+1})\big], \quad j = 0, 1, \dots.$$

This method is called an implicit method, since determining $u_{j+1}$ requires
the solution of an equation that in general is nonlinear. In contrast, the
Euler method of Definition 10.8 is an explicit method, since it provides an
explicit expression for the computation of $u_{j+1}$.
Remark 10.11 The nonlinear equations of the implicit Euler method can
be solved by successive approximations, provided that the Lipschitz constant
$L$ for $f$ and the step size $h$ satisfy $Lh < 2$.

Proof. We have to solve equation (10.9) for $u_1$. Setting

$$g(u) := u_0 + \frac{h}{2}\, \big[f(x_0, u_0) + f(x_1, u)\big]$$

we can rewrite (10.9) as the fixed point equation $u_1 = g(u_1)$. The function
$g$ is a contraction, since

$$|g(u) - g(v)| = \frac{h}{2}\, |f(x_1, u) - f(x_1, v)| \le \frac{Lh}{2}\, |u - v|,$$

and therefore the assertion follows from Theorem 3.46. □
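
The successive approximations of Remark 10.11 can be sketched as follows. This is an illustrative sketch only: the stopping tolerance `tol`, the iteration cap `max_iter`, the explicit Euler starting value, and the function names are assumptions made here, and the third Picard iterate of Example 10.7 again serves as a reference solution.

```python
def f(x, u):
    return x * x + u * u

def implicit_euler_step(f, x, u, h, tol=1e-14, max_iter=100):
    """Solve v = u + h/2 [f(x, u) + f(x + h, v)] by fixed point iteration.

    The iteration contracts when L*h < 2, where L is a Lipschitz constant of f.
    """
    fx = f(x, u)
    v = u + h * fx                      # explicit Euler value as starting guess
    for _ in range(max_iter):
        v_new = u + 0.5 * h * (fx + f(x + h, v))
        if abs(v_new - v) <= tol:
            return v_new
        v = v_new
    return v

def implicit_euler(f, x0, u0, h, steps):
    us = [u0]
    for j in range(steps):
        us.append(implicit_euler_step(f, x0 + j * h, us[-1], h))
    return us

def reference(x):
    # Third Picard iterate u_3 of Example 10.7, used in place of the exact solution.
    return x**3 / 3 + x**7 / 63 + 2 * x**11 / 2079 + x**15 / 59535

us = implicit_euler(f, 0.0, 0.0, 0.1, 5)
print(reference(0.5) - us[5])
```

For this example $L = 1$ and $h = 0.1$, so the contraction factor $Lh/2$ is at most $0.05$ and only a handful of iterations are needed per step.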

Since the solution of the nonlinear equation (10.9) will deliver only an
approximation to the solution of the initial value problem, there is no need
to solve (10.9) with high accuracy. Using the approximate value from the
explicit Euler method as a starting point and carrying out only one itera-
tion, we arrive at the following method.

Definition 10.12 The predictor corrector method for the Euler method
for the numerical solution of the initial value problem (10.7), also known
as the improved Euler method or Heun method, constructs approximations
$u_j$ to the exact solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1, 2, \dots,$$

by

$$u_{j+1} := u_j + \frac{h}{2}\, \big[f(x_j, u_j) + f(x_{j+1}, u_j + h f(x_j, u_j))\big], \quad j = 0, 1, \dots.$$

Example 10.13 Consider again the initial value problem from Example
10.7. Table 10.2 gives the difference between the exact solution as computed
by the Picard-Lindelöf iterations and the approximate solution obtained by
the improved Euler method for various step sizes $h$. We observe quadratic
convergence as $h \to 0$. □

TABLE 10.2. Numerical example for the improved Euler method

  x      h = 0.1       h = 0.01      h = 0.001
 0.1   -0.00016667   -0.00000167   -0.00000002
 0.2   -0.00033326   -0.00000333   -0.00000003
 0.3   -0.00049955   -0.00000500   -0.00000005
 0.4   -0.00066530   -0.00000668   -0.00000007
 0.5   -0.00083027   -0.00000837   -0.00000009
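
The improved Euler method of Definition 10.12 differs from the explicit Euler method only in the increment per step, as the following Python sketch shows. The function names are hypothetical, and the third Picard iterate $u_3$ of Example 10.7 is again assumed as a stand-in for the exact solution.

```python
def f(x, u):
    return x * x + u * u

def heun(f, x0, u0, h, steps):
    """Improved Euler (Heun) method: Euler predictor, trapezoidal corrector."""
    x, u = x0, u0
    values = [u0]
    for j in range(steps):
        k1 = f(x, u)                    # slope at the left endpoint
        k2 = f(x + h, u + h * k1)       # slope at the predicted right endpoint
        u = u + 0.5 * h * (k1 + k2)
        x = x0 + (j + 1) * h
        values.append(u)
    return values

def reference(x):
    # Third Picard iterate u_3 of Example 10.7, used in place of the exact solution.
    return x**3 / 3 + x**7 / 63 + 2 * x**11 / 2079 + x**15 / 59535

values = heun(f, 0.0, 0.0, 0.1, 5)
for j in range(1, 6):
    print(f"{0.1 * j:.1f}  {reference(0.1 * j) - values[j]: .8f}")
```

Each step costs two evaluations of $f$ instead of one, which is the price paid for the higher order.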

In the following section we will show that the Euler method and the
improved Euler method are convergent with convergence order one and
two, respectively, as observed in the special cases of Examples 10.9 and
10.13.

10.3 Single-Step Methods


We generalize the Euler methods into more general single-step methods by
the following definition.
Definition 10.14 Single-step methods for the approximate solution of the
initial value problem

$$u' = f(x, u), \quad u(x_0) = u_0,$$

construct approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant
grid points

$$x_j := x_0 + jh, \quad j = 1, 2, \dots,$$

with step size $h$ by

$$u_{j+1} := u_j + h\,\varphi(x_j, u_j; h), \quad j = 0, 1, \dots,$$

where the function $\varphi : G \times (0, \infty) \to \mathbb{R}$ is given in terms of the right-hand
side $f : G \to \mathbb{R}$ of the differential equation.
Example 10.15 The Euler method and the improved Euler method are
single-step methods with

$$\varphi(x, u; h) = f(x, u) \tag{10.10}$$

and

$$\varphi(x, u; h) = \frac{1}{2}\, \big[f(x, u) + f(x + h, u + h f(x, u))\big], \tag{10.11}$$

respectively. □
The function $\varphi$ describes how the differential equation

$$u' = f(x, u)$$

is approximated by the difference equation

$$\frac{1}{h}\, [u(x + h) - u(x)] = \varphi(x, u; h).$$

From a reasonable approximation we expect that the exact solution to the
initial value problem approximately satisfies the difference equation. Hence,

$$\frac{1}{h}\, [u(x + h) - u(x)] - \varphi(x, u; h) \to 0, \quad h \to 0,$$

must be fulfilled for the exact solution $u$. We also expect that the order of
this convergence will influence the accuracy of the approximate solution.
These considerations are made more precise by the following definition.
Definition 10.16 For each $(x, u) \in G$ denote by $\eta = \eta(\xi)$ the unique
solution to the initial value problem

$$\eta' = f(\xi, \eta), \quad \eta(x) = u,$$

with initial data $(x, u)$. Then

$$\Delta(x, u; h) := \frac{1}{h}\, [\eta(x + h) - \eta(x)] - \varphi(x, u; h)$$

is called the local discretization error. The single-step method is called consistent
(with the initial value problem) if

$$\lim_{h \to 0} \Delta(x, u; h) = 0$$

uniformly for all $(x, u) \in G$, and it is said to have consistency order $p$ if

$$|\Delta(x, u; h)| \le K h^p$$

for all $(x, u) \in G$, all $h > 0$, and some constant $K$.

Without loss of generality, in the sequel we always will assume that $f$
(and later also derivatives of $f$) are uniformly continuous and bounded on
$G$. This can always be achieved by reducing $G$ to a smaller domain.
Theorem 10.17 A single-step method is consistent if and only if

$$\lim_{h \to 0} \varphi(x, u; h) = f(x, u)$$

uniformly for all $(x, u) \in G$.
Proof. Since we assume $f$ to be bounded, we have

$$\eta(x + t) - \eta(x) = \int_0^t \eta'(x + s)\,ds = \int_0^t f(x + s, \eta(x + s))\,ds \to 0, \quad t \to 0,$$

uniformly for all $(x, u) \in G$. Therefore, since we also assume that $f$ is
uniformly continuous, it follows that

$$\left| \frac{1}{h} \int_0^h [\eta'(x + t) - \eta'(x)]\,dt \right| \le \max_{0 \le t \le h} |\eta'(x + t) - \eta'(x)| = \max_{0 \le t \le h} |f(x + t, \eta(x + t)) - f(x, \eta(x))| \to 0, \quad h \to 0,$$

uniformly for all $(x, u) \in G$. From this we obtain that

$$\Delta(x, u; h) + \varphi(x, u; h) - f(x, u) = \frac{1}{h}\, [\eta(x + h) - \eta(x)] - \eta'(x) = \frac{1}{h} \int_0^h [\eta'(x + t) - \eta'(x)]\,dt \to 0, \quad h \to 0,$$

uniformly for all $(x, u) \in G$. This now implies that the two conditions
$\Delta \to 0$, $h \to 0$, and $\varphi \to f$, $h \to 0$, are equivalent. □

Theorem 10.18 The Euler method is consistent. If $f$ is continuously
differentiable in $G$, then the Euler method has consistency order one.
Proof. Consistency is a consequence of Theorem 10.17 and the fact that
$\varphi(x, u; h) = f(x, u)$ for Euler's method. If $f$ is continuously differentiable,
then from the differential equation $\eta' = f(\xi, \eta)$ it follows that $\eta$ is twice
continuously differentiable with

$$\eta'' = f_x(\xi, \eta) + f_u(\xi, \eta)\, f(\xi, \eta). \tag{10.12}$$

Therefore, Taylor's formula yields

$$|\Delta(x, u; h)| = \left| \frac{1}{h}\, [\eta(x + h) - \eta(x)] - \eta'(x) \right| = \frac{h}{2}\, |\eta''(x + \theta h)| \le K h$$

for some $0 < \theta < 1$ and some bound $K$ for the function $\frac{1}{2}(f_x + f_u f)$. □

Theorem 10.19 The improved Euler method is consistent. If $f$ is twice
continuously differentiable in $G$, then the improved Euler method has consistency
order two.

Proof. Consistency follows from Theorem 10.17 and

$$\varphi(x, u; h) = \frac{1}{2}\, \big[f(x, u) + f(x + h, u + h f(x, u))\big] \to f(x, u), \quad h \to 0.$$

If $f$ is twice continuously differentiable, then (10.12) implies that $\eta$ is three
times continuously differentiable with

$$\eta''' = f_{xx}(\xi, \eta) + 2 f_{xu}(\xi, \eta)\, f(\xi, \eta) + f_{uu}(\xi, \eta)\, f^2(\xi, \eta) + f_u(\xi, \eta)\, f_x(\xi, \eta) + f_u^2(\xi, \eta)\, f(\xi, \eta).$$

Hence Taylor's formula yields

$$\left| \eta(x + h) - \eta(x) - h\,\eta'(x) - \frac{h^2}{2}\, \eta''(x) \right| = \frac{h^3}{6}\, |\eta'''(x + \theta h)| \le K_1 h^3 \tag{10.13}$$

for some $0 < \theta < 1$ and a bound $K_1$ for $\frac{1}{6}(f_{xx} + 2 f_{xu} f + f_{uu} f^2 + f_u f_x + f_u^2 f)$.
From Taylor's formula for functions of two variables we have the estimate

$$|f(x + h, u + k) - f(x, u) - h f_x(x, u) - k f_u(x, u)| \le \frac{1}{2}\, K_2\, (|h| + |k|)^2$$

with a bound $K_2$ for the second derivatives $f_{xx}$, $f_{xu}$, and $f_{uu}$. From this,
setting $k = h f(x, u)$, in view of (10.12) we obtain

$$|f(x + h, u + h f(x, u)) - f(x, u) - h\,\eta''(x)| \le \frac{1}{2}\, K_2\, (1 + K_0)^2 h^2$$

with some bound $K_0$ for $f$, whence

$$\left| \varphi(x, u; h) - f(x, u) - \frac{h}{2}\, \eta''(x) \right| \le \frac{1}{4}\, K_2\, (1 + K_0)^2 h^2 \tag{10.14}$$

follows. Now combining (10.13) and (10.14), with the aid of the triangle
inequality and using the differential equation, we can establish consistency
order two. □

We proceed by investigating the convergence of single-step methods as
the step size $h$ tends to zero. This is done for the solution to the initial
value problem in a fixed interval $[a, b]$ with initial data at $x_0 = a$ and the
step size $h$ and the number $n$ of steps chosen such that $x_n = b$.
Definition 10.20 Assume that in the interval $[a, b]$ at the equidistant grid
points

$$x_j := x_0 + jh, \quad j = 0, 1, \dots, n,$$

with $x_0 = a$ and $x_n = b$, approximate values $u_j$ for the solution $u(x_j)$ to
the initial value problem

$$u' = f(x, u), \quad u(x_0) = u_0,$$

are obtained by a single-step method. Then

is called the global error, and

$$E = E(h) := \max_{j = 0, \dots, n} |e_j(h)|$$

is called the maximal global error. The single-step method is called convergent
if

$$\lim_{h \to 0} E(h) = 0,$$

and it is said to have convergence order $p$ if

$$E(h) \le H h^p$$

for all $h > 0$ and some constant $H$.

The following lemma is needed for our convergence analysis.

Lemma 10.21 Let $(\xi_j)$ be a sequence in $\mathbb{R}$ with the property

$$|\xi_{j+1}| \le (1 + A)\,|\xi_j| + B, \quad j = 0, 1, \dots,$$

for some constants $A > 0$ and $B \ge 0$. Then the estimate

$$|\xi_j| \le |\xi_0|\, e^{jA} + \frac{B}{A}\, \big(e^{jA} - 1\big), \quad j = 0, 1, \dots,$$

holds.

Proof. We prove this by induction. The estimate is true for $j = 0$. Assume
that it has been proven for some $j \ge 0$. Then, with the aid of the inequality
$1 + A < e^A$, which follows from the power series for the exponential function,
we obtain

$$|\xi_{j+1}| \le (1 + A)\,|\xi_0|\, e^{jA} + (1 + A)\, \frac{B}{A}\, \big(e^{jA} - 1\big) + B \le |\xi_0|\, e^{(j+1)A} + \frac{B}{A}\, \big(e^{(j+1)A} - 1\big);$$

i.e., the estimate also holds for $j + 1$. □

Theorem 10.22 Assume that the function $\varphi$ describing the single-step
method is continuous (also with respect to $h$) and satisfies a Lipschitz condition;
i.e.,

$$|\varphi(x, u; h) - \varphi(x, v; h)| \le M\,|u - v|$$

for all $(x, u), (x, v) \in G$, all (sufficiently small) $h$, and a Lipschitz constant
$M$. Then the single-step method is convergent if and only if it is consistent.

Proof. We first show that consistency implies convergence and assume that
the single-step method is consistent. For the difference of two consecutive
errors we compute

Hence
(10.15)
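where

$$c(h) := \max_{a \le x \le b} |\Delta(x, u(x); h)|$$

satisfies

$$c(h) \to 0, \quad h \to 0,$$

since we assume consistency. The inequality (10.15) implies that

$$|e_{j+1}| \le (1 + hM)\,|e_j| + h\,c(h), \quad j = 0, 1, \dots, n.$$

From this, applying Lemma 10.21 for $A = hM$ and $B = h\,c(h)$ and using
$e_0 = 0$, we obtain the estimate

$$|e_j| \le \frac{c(h)}{M}\, \big(e^{M(x_j - x_0)} - 1\big), \quad j = 0, 1, \dots, n. \tag{10.16}$$

This establishes the convergence

$$E(h) \le \frac{c(h)}{M}\, \big(e^{M(b - a)} - 1\big) \to 0, \quad h \to 0.$$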

We now show that convergence implies consistency and assume that the
single-step method is convergent; i.e., for $h \to 0$ the approximations

(10.17)

converge to the solution of

$$u'(x) = f(x, u), \quad u(x_0) = u_0,$$

for all initial data $(x_0, u_0) \in G$. We set

$$g(x, u) := \varphi(x, u; 0)$$

and observe that by Theorem 10.17 the single-step method is also consistent
with the initial value problem

$$u'(x) = g(x, u), \quad u(x_0) = u_0. \tag{10.18}$$

Since we have already shown that consistency implies convergence, the
approximations (10.17) also converge to the solution of (10.18); i.e., the
solutions of the two initial value problems coincide. Therefore, we have
$f(x_0, u_0) = g(x_0, u_0)$, and since this holds for all $(x_0, u_0) \in G$, from the
continuity of $\varphi$ we conclude uniform convergence:

$$\varphi(x, u; h) \to f(x, u), \quad h \to 0.$$

Now consistency follows from Theorem 10.17. □

Theorem 10.23 Assume that the single-step method satisfies the assumptions
of the previous Theorem 10.22 and that it has consistency order $p$;
i.e., $|\Delta(x, u; h)| \le K h^p$. Then

$$|e_j| \le \frac{K}{M}\, \big(e^{M(x_j - x_0)} - 1\big)\, h^p, \quad j = 0, 1, \dots, n;$$

i.e., the convergence also has order $p$.

Proof. This follows from (10.16) with the aid of $c(h) \le K h^p$. □


Corollary 10.24 The Euler method and the improved Euler method are
convergent. For continuously differentiable f the Euler method has conver-
gence order one. For twice continuously differentiable f the improved Euler
method has convergence order two.

Proof. By Theorems 10.18, 10.19, 10.22, and 10.23 it remains only to verify
the Lipschitz condition of the function $\varphi$ for the improved Euler method
given by (10.11). From the Lipschitz condition for $f$ we obtain

\[
\begin{aligned}
|\varphi(x, u; h) - \varphi(x, v; h)|
&\le \tfrac{1}{2}\,|f(x, u) - f(x, v)| + \tfrac{1}{2}\,|f(x + h, u + hf(x, u)) - f(x + h, v + hf(x, v))| \\
&\le \tfrac{L}{2}\,|u - v| + \tfrac{L}{2}\,\bigl|[u + hf(x, u)] - [v + hf(x, v)]\bigr|
\le L\left(1 + \tfrac{hL}{2}\right)|u - v|;
\end{aligned}
\]

i.e., $\varphi$ also satisfies a Lipschitz condition. □

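Single-step methods of higher order can be constructed as follows. For a
set of real numbers $s_\ell$, $\ell = 2, \ldots, m$, $c_{\ell i}$, $i = 1, \ldots, \ell - 1$, $\ell = 2, \ldots, m$, and
$a_\ell$, $\ell = 1, \ldots, m$, the quantities

\[ k_1 = f(x_j, u_j), \qquad k_\ell = f\Bigl(x_j + s_\ell h,\; u_j + h \sum_{i=1}^{\ell-1} c_{\ell i} k_i\Bigr), \quad \ell = 2, \ldots, m, \]

are computed recursively, and then the approximation is obtained by

\[ u_{j+1} = u_j + h \sum_{i=1}^{m} a_i k_i. \]

The Euler method is described by $m = 1$ and $a_1 = 1$ and the improved
Euler method by $m = 2$, $s_2 = 1$, $c_{21} = 1$, and $a_1 = a_2 = 1/2$. The basic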
goal in the design of higher-order methods is, for a given m, to determine
the coefficients such that the order of consistency and hence the order of
convergence becomes as large as possible. As an example, we shall consider
the Runge-Kutta method, which is the most widely used and most suc-
cessful single-step method. It was introduced by Runge in 1895 for a single
differential equation and extended to systems of differential equations by
Kutta in 1901.
Definition 10.25 The Runge-Kutta method for the numerical solution of
the initial value problem (10.7) constructs approximations $u_j$ to the exact
solution $u(x_j)$ at the equidistant grid points

\[ x_j := x_0 + jh, \quad j = 1, 2, \ldots, \]

with step size $h$ by using the above higher-order method with

\[
\begin{aligned}
k_1 &= f(x_j, u_j), \\
k_2 &= f\left(x_j + \tfrac{h}{2},\; u_j + \tfrac{h}{2}\, k_1\right), \\
k_3 &= f\left(x_j + \tfrac{h}{2},\; u_j + \tfrac{h}{2}\, k_2\right), \\
k_4 &= f(x_j + h,\; u_j + h k_3),
\end{aligned}
\]

and
\[ u_{j+1} = u_j + \frac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4). \]
For the differential equation $u' = f(x)$ the Runge-Kutta method coincides
with Simpson's rule for numerical integration.
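The scheme of Definition 10.25 can be sketched in a few lines of Python. This is only an illustration for a scalar equation; the function names `rk4_step`, `rk4_solve`, and `error_at_one` and the test problem $u' = -u$, $u(0) = 1$ are assumptions made here, not part of the text.

```python
import math

def rk4_step(f, x, u, h):
    """One step of the Runge-Kutta method of Definition 10.25."""
    k1 = f(x, u)
    k2 = f(x + h / 2, u + h / 2 * k1)
    k3 = f(x + h / 2, u + h / 2 * k2)
    k4 = f(x + h, u + h * k3)
    return u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4_solve(f, x0, u0, h, n):
    """Approximation u_n to u(x0 + n*h)."""
    x, u = x0, u0
    for j in range(n):
        u = rk4_step(f, x, u, h)
        x = x0 + (j + 1) * h
    return u

def error_at_one(n):
    """Error at x = 1 for the assumed test problem u' = -u, u(0) = 1."""
    approx = rk4_solve(lambda x, u: -u, 0.0, 1.0, 1.0 / n, n)
    return abs(approx - math.exp(-1.0))

# Halving the step size should reduce the error by roughly 2**4 = 16.
ratio = error_at_one(10) / error_at_one(20)

# For u' = f(x) a single step reduces to Simpson's rule: here the integral of x**2 over [0, 1].
simpson_value = rk4_step(lambda x, u: x**2, 0.0, 0.0, 1.0)
```

The quotient `ratio` gives an empirical indication of the convergence order four, and `simpson_value` illustrates the remark on Simpson's rule.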
Theorem 10.26 The Runge-Kutta method is consistent. If f is four-times
continuously differentiable, then it has consistency order four and hence
convergence order four.
Proof. The function $\varphi$ describing the Runge-Kutta method is given recursively
by

\[ \varphi = \tfrac{1}{6}\,(\varphi_1 + 2\varphi_2 + 2\varphi_3 + \varphi_4), \]

where
\[
\begin{aligned}
\varphi_1(x, u; h) &= f(x, u), \\
\varphi_2(x, u; h) &= f\left(x + \tfrac{h}{2},\; u + \tfrac{h}{2}\,\varphi_1(x, u; h)\right), \\
\varphi_3(x, u; h) &= f\left(x + \tfrac{h}{2},\; u + \tfrac{h}{2}\,\varphi_2(x, u; h)\right), \\
\varphi_4(x, u; h) &= f(x + h,\; u + h\varphi_3(x, u; h)).
\end{aligned}
\]
From this, consistency follows immediately by Theorem 10.17.
Analogously to the proof of Theorem 10.18 for the improved Euler method,
the consistency order four can be established by a Taylor expansion of
$\varphi(x, u; h)$ with respect to powers of $h$ up to order $h^4$ and expressing the
derivatives of $\eta$ on the right-hand side of
\[ \frac{1}{h}\,[\eta(x + h) - \eta(x)] = \eta'(x) + \frac{h}{2}\,\eta''(x) + \frac{h^2}{6}\,\eta'''(x) + \frac{h^3}{24}\,\eta''''(x) + O(h^4) \]
through $f$ and its derivatives by using the differential equation. We leave
the details as an exercise for the reader (see Problem 10.9). □

The error estimate in Theorem 10.23 is not practical in general, since the
constants M and K have to be determined from higher-order derivatives of
f. Therefore, in practice, the error is estimated by the following heuristic
consideration. For convergence order $p$, the error between the approximate
solution $\tilde u(x; h)$ at the point $x$, obtained with step size $h$, and the exact
solution $u(x)$ satisfies
\[ \tilde u(x; h) - u(x) \approx c h^p \]
for some constant $c$. Correspondingly, for step size $h/2$ we have that

\[ \tilde u\left(x; \tfrac{h}{2}\right) - u(x) \approx c \left(\tfrac{h}{2}\right)^p. \]

Subtracting these relations yields

\[ \tilde u(x; h) - \tilde u\left(x; \tfrac{h}{2}\right) \approx c \left(\tfrac{h}{2}\right)^p (2^p - 1). \]

Now the constant $c$ can be eliminated from the last two relations, with the
result that

\[ \tilde u\left(x; \tfrac{h}{2}\right) - u(x) \approx \frac{1}{2^p - 1} \left[\tilde u(x; h) - \tilde u\left(x; \tfrac{h}{2}\right)\right]. \tag{10.19} \]

Hence we may consider (10.19) as an estimate for the error occurring with
the smaller step size $h/2$. However, we need to keep in mind that (10.19)
does not provide an exact bound and might fail in particular situations.
Nevertheless, it can be used for controlling the step size during the course of
the numerical calculations in order to adjust the actual step size according
to the required accuracy.
Solving for $u(x)$ in (10.19) yields

\[ u(x) \approx \frac{2^p\, \tilde u\left(x; \tfrac{h}{2}\right) - \tilde u(x; h)}{2^p - 1}. \tag{10.20} \]

We leave it as an exercise for the reader to interpret (10.20) as a Richardson
extrapolation, which we explained in detail for the case of numerical
integration in Section 9.5.
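As a rough illustration of (10.19) and (10.20), the following Python sketch applies the step-halving estimate to the Euler method ($p = 1$). The helper `euler` and the test problem $u' = -u$, $u(0) = 1$, evaluated at $x = 1$, are assumptions chosen for this example only.

```python
import math

def euler(f, x0, u0, h, n):
    """Euler method: n steps of size h starting from (x0, u0)."""
    x, u = x0, u0
    for j in range(n):
        u = u + h * f(x, u)
        x = x0 + (j + 1) * h
    return u

f = lambda x, u: -u          # assumed test problem with exact solution exp(-x)
p = 1                        # convergence order of the Euler method

coarse = euler(f, 0.0, 1.0, 0.1, 10)    # approximation at x = 1 with step size h
fine = euler(f, 0.0, 1.0, 0.05, 20)     # approximation at x = 1 with step size h/2

estimate = (coarse - fine) / (2**p - 1)             # right-hand side of (10.19), as stated in the text
true_error = fine - math.exp(-1.0)                  # left-hand side of (10.19)
extrapolated = (2**p * fine - coarse) / (2**p - 1)  # formula (10.20)
```

In this example `estimate` and `true_error` agree to within a few percent, and `extrapolated` is closer to $e^{-1}$ than either Euler approximation. As the text stresses, such agreement is heuristic and not guaranteed.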

10.4 Multistep Methods


In the single-step methods each computed function value of f is used only
in one step. It is natural to try to design methods where each computed
function value of f is used in several steps. This leads to multistep methods,
as described in the following definition.
Definition 10.27 Multistep methods for the approximate solution of the
initial value problem
\[ u' = f(x, u), \quad u(x_0) = u_0, \]
construct approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant
grid points
\[ x_j := x_0 + jh, \quad j = 1, 2, \ldots, \]
with step size $h$ by

\[ u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = h\,\varphi(x_j, u_j, \ldots, u_{j+r-1}; h) \]

for $j = 0, 1, \ldots$. Here $\varphi$ is a function of $r + 2$ variables given in terms of
$f$, and $a_0, \ldots, a_{r-1}$ are constants.
To start such a multistep method involving $r$ steps, $r$ starting values
$u_0, u_1, \ldots, u_{r-1}$ are required. For example, these can be approximately
computed from the initial value $u_0$ by a single-step method such as the
Runge-Kutta method.
A particular class of multistep methods is obtained by approximating
the integral in

\[ u(x_{j+r}) - u(x_{j+r-k}) = \int_{x_{j+r-k}}^{x_{j+r}} f(\xi, u(\xi))\, d\xi \]
with $1 \le k \le r$ by an interpolatory quadrature, i.e., by

\[ \int_{x_{j+r-k}}^{x_{j+r}} p(\xi)\, d\xi, \]

where $p \in P_s$ with $0 \le s \le r$ is the uniquely determined polynomial with
the interpolation property

\[ p(x_{j+m}) = f(x_{j+m}, u_{j+m}), \quad m = 0, \ldots, s, \]

i.e., by setting

\[ u_{j+r} - u_{j+r-k} = \int_{x_{j+r-k}}^{x_{j+r}} p(\xi)\, d\xi. \tag{10.21} \]

Integrating the Lagrange representation (8.2) of the interpolation polynomial
shows that these multistep methods are of the form
\[ u_{j+r} - u_{j+r-k} = h \sum_{m=0}^{s} b_m f(x_{j+m}, u_{j+m}) \]

with coefficients $b_0, \ldots, b_s$ depending on $r$, $k$, and $s$.


From (10.21) we can generate a variety of methods by choosing the num-
ber of steps r, the number s + 1 of interpolation points, and the number
k of integration intervals appropriately. We briefly report on some of these
methods.
The Adams-Bashforth method, introduced by Adams and Bashforth in
1883, is obtained by taking $k = 1$ and $s = r - 1$. For $r = 1$ the interpolation
polynomial is a constant, and therefore

\[ u_{j+1} = u_j + h f(x_j, u_j). \tag{10.22} \]

For $r = 2$ the interpolation is linear and leads to

\[ u_{j+2} = u_{j+1} + \frac{h}{2}\,[3f(x_{j+1}, u_{j+1}) - f(x_j, u_j)] \tag{10.23} \]

(see Problem 10.12). The Adams-Bashforth method is explicit. Clearly,
(10.22) coincides with the Euler method from Definition 10.8.

The Adams-Moulton method, devised by Moulton during World War I,
is given by $k = 1$ and $s = r$. For $r = 1$ the interpolation is linear, whence

\[ u_{j+1} = u_j + \frac{h}{2}\,[f(x_{j+1}, u_{j+1}) + f(x_j, u_j)]. \tag{10.24} \]

For $r = 2$ the interpolation is quadratic, leading to

\[ u_{j+2} = u_{j+1} + \frac{h}{12}\,[5f(x_{j+2}, u_{j+2}) + 8f(x_{j+1}, u_{j+1}) - f(x_j, u_j)] \tag{10.25} \]

(see Problem 10.12). The Adams-Moulton method is implicit. One iteration
step for the solution of the nonlinear equation for $u_{j+r}$ starting with the
approximation given by the corresponding Adams-Bashforth method leads
to a predictor-corrector method. Clearly, (10.24) coincides with the implicit
Euler method from Definition 10.10.
The explicit method for $k = 2$ and $s = r - 1$ is known as the Nyström
method, and the implicit method for $k = 2$ and $s = r$ is called the
Milne-Thomson method (see Problem 10.14).
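The predictor-corrector idea mentioned above can be sketched as follows for $r = 2$, using (10.23) as predictor and one evaluation of (10.25) as corrector. The function name `adams_predictor_corrector`, the choice of one Runge-Kutta step for the starting value $u_1$, and the test problem $u' = -u$ are assumptions of this illustration.

```python
import math

def adams_predictor_corrector(f, x0, u0, h, n):
    """Two-step Adams-Bashforth predictor (10.23) with one Adams-Moulton correction (10.25)."""
    xs = [x0 + j * h for j in range(n + 1)]
    # One Runge-Kutta step supplies the second starting value u_1.
    k1 = f(x0, u0)
    k2 = f(x0 + h / 2, u0 + h / 2 * k1)
    k3 = f(x0 + h / 2, u0 + h / 2 * k2)
    k4 = f(x0 + h, u0 + h * k3)
    us = [u0, u0 + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)]
    for j in range(n - 1):
        f0 = f(xs[j], us[j])
        f1 = f(xs[j + 1], us[j + 1])
        # Predictor: explicit Adams-Bashforth value for u_{j+2}.
        predicted = us[j + 1] + h / 2 * (3 * f1 - f0)
        # Corrector: insert the predicted value into the implicit formula once.
        corrected = us[j + 1] + h / 12 * (5 * f(xs[j + 2], predicted) + 8 * f1 - f0)
        us.append(corrected)
    return xs, us

# Assumed test problem u' = -u, u(0) = 1 on [0, 1] with h = 0.05.
xs, us = adams_predictor_corrector(lambda x, u: -u, 0.0, 1.0, 0.05, 20)
error = abs(us[-1] - math.exp(-xs[-1]))
```

For clarity the sketch re-evaluates $f$ at old grid points; an implementation that stores these values would need fewer function evaluations per step.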

Definition 10.28 For each $(x, u) \in G$ denote by $\eta = \eta(\xi)$ the unique
solution to the initial value problem

\[ \eta' = f(\xi, \eta), \quad \eta(x) = u, \]

for the initial data $(x, u)$. Then

\[ \Delta(x, u; h) := \frac{1}{h}\Bigl[\eta(x + rh) + \sum_{m=0}^{r-1} a_m \eta(x + mh)\Bigr] - \varphi(x, \eta(x), \ldots, \eta(x + (r-1)h); h) \]

is called the local discretization error. The multistep method is called consistent
(with the initial value problem) if

\[ \lim_{h \to 0} \Delta(x, u; h) = 0 \]

uniformly for all $(x, u) \in G$, and it is said to have consistency order $p$ if

\[ |\Delta(x, u; h)| \le Kh^p \]

for all $(x, u) \in G$, all $h > 0$, and some constant $K$.

Theorem 10.29 If $f$ is $(s+1)$-times continuously differentiable, then the
multistep methods (10.21) are consistent of order $s + 1$.

Proof. By construction we have that

\[ \Delta(x, u; h) = \frac{1}{h} \int_{x+(r-k)h}^{x+rh} [f(\xi, \eta(\xi)) - p(\xi)]\, d\xi, \]

where $p$ denotes the polynomial satisfying the interpolation condition

\[ p(x + mh) = f(x + mh, \eta(x + mh)), \quad m = 0, \ldots, s. \]

By Theorem 8.10 on the remainder in polynomial interpolation, we can
estimate

\[ |f(\xi, \eta(\xi)) - p(\xi)| \le Kh^{s+1} \]

for all $\xi$ in the interval $x + (r-k)h \le \xi \le x + rh$ and some constant $K$
depending on $f$ and its derivatives up to order $s + 1$. □

Analyzing the convergence for multistep methods is more involved than
for single-step methods for the following two reasons. Firstly, the approximation
obtained by a multistep method is, of course, also influenced by
the errors
\[ e_j := u_j - u(x_j), \quad j = 0, \ldots, r - 1, \]
in the starting values. Hence we give the following definition.

Definition 10.30 The starting values $u_j$, $j = 0, \ldots, r - 1$, are called consistent
if
\[ \lim_{h \to 0}\, [u_j(h) - u(x_j)] = 0, \quad j = 0, \ldots, r - 1. \]
They are said to have consistency order $p$ if

\[ |u_j(h) - u(x_j)| \le K^* h^p, \quad j = 0, \ldots, r - 1, \]

for all $h > 0$ and some constant $K^*$.

To make sure that the consistency order of the starting values coincides
with the consistency order of the multistep method, the single-step method
for computing the starting values has to be chosen accordingly.
Secondly, multistep methods can be unstable, as illustrated by the fol-
lowing example.

Example 10.31 Let $p$ be the quadratic interpolation polynomial satisfying

\[ p(x_j) = u(x_j), \quad j = 0, 1, 2, \]

and approximate
\[ u'(x_0) \approx p'(x_0). \]
Using the fact that the approximation for the derivative is exact for polynomials
of degree less than or equal to two, simple calculations show that
(see Problem 10.15)

\[ p'(x_0) = \frac{1}{2h}\,[-u(x_2) + 4u(x_1) - 3u(x_0)]. \tag{10.26} \]


If $u$ is three times continuously differentiable, by Theorem 8.10 we have

\[ \left|\frac{u(x) - p(x)}{x - x_0}\right| \le \frac{1}{6}\,\|u'''\|_\infty\, |(x - x_1)(x - x_2)|, \]
and from this, passing to the limit $x \to x_0$, it follows that the error for the
derivative can be estimated by

\[ |u'(x_0) - p'(x_0)| \le \frac{h^2}{3}\,\|u'''\|_\infty. \tag{10.27} \]

By approximating
\[ p'(x_0) \approx u'(x_0) = f(x_0, u_0) \]
we derive a multistep method of the form

\[ u_{j+2} - 4u_{j+1} + 3u_j = -2hf(x_j, u_j), \quad j = 0, 1, \ldots. \tag{10.28} \]

From (10.27) it follows that (10.28) is consistent with order two if $f$ is twice
continuously differentiable.
Now we consider the initial value problem

\[ u' = -u, \quad u(0) = 1, \]

with the solution $u(x) = e^{-x}$. Here the multistep method (10.28) reads

\[ u_{j+2} - 4u_{j+1} + (3 - 2h)u_j = 0, \quad j = 0, 1, \ldots. \tag{10.29} \]

Table 10.3 gives the error $e_j = u_j - e^{-x_j}$ between the approximate and exact
solutions for the step sizes $h = 0.1$ and $h = 0.01$. For the starting values,
$u_0 = 1$ and $u_1 = e^{-h}$ have been used with ten-decimal-digits accuracy. The
last column gives the quotient $q_j := e_j/e_{j-1}$ of the error in two consecutive
steps.

TABLE 10.3. Numerical results for Example 10.31

            h = 0.1                            h = 0.01
  j    x_j       e_j      q_j        j    x_j        e_j      q_j
  4    0.4     0.0109    3.60        5    0.05     0.0000    3.23
  6    0.6     0.1123    3.15       10    0.10     0.0099    3.01
  8    0.8     1.0858    3.10       15    0.15     2.4456    3.01
 10    1.0    10.4143    3.10       20    0.20   604.1985    3.01

In order to explain the numerical failure indicated by the results in Table
10.3, we solve the difference equation (10.29) by looking for solutions of the
form
\[ u_j = a\lambda^j, \tag{10.30} \]
where $a$ and $\lambda$ are complex numbers. Substituting into (10.29) shows that
(10.30) solves (10.29) if and only if $\lambda$ is a solution of the so-called characteristic
equation
\[ \lambda^2 - 4\lambda + (3 - 2h) = 0. \]
This quadratic equation has two solutions, namely
\[ \lambda_{1,2} = 2 \mp \sqrt{1 + 2h}. \]
Therefore, the general solution of (10.29) is given by
\[ u_j = a\lambda_1^j + b\lambda_2^j. \]
The two constants $a$ and $b$ are determined by the conditions $u_0 = 1$ and
$u_1 = e^{-h}$ and have the values
\[ a = \frac{\lambda_2 - e^{-h}}{\lambda_2 - \lambda_1} = 1 + O(h^2) \]
and
\[ b = \frac{e^{-h} - \lambda_1}{\lambda_2 - \lambda_1}. \]
The term $a\lambda_1^j$ in the solution to the difference equation approximates the
solution $e^{-x_j} = e^{-jh}$ to the initial value problem, since
\[ a\lambda_1^j = [1 + O(h^2)]\,[1 - h + O(h^2)]^j \approx e^{-jh}. \]
However, the additional term $b\lambda_2^j$ grows exponentially, and the relation

\[ \frac{u_j - u(x_j)}{u_{j-1} - u(x_{j-1})} \approx \lambda_2 = 3 + h + O(h^2) \]
explains the last column of Table 10.3. □
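The instability in Example 10.31 is easy to reproduce. The Python sketch below iterates the difference equation (10.29) for $h = 0.1$; the function name `unstable_two_step` is an assumption, and the starting values are taken in double precision rather than with the ten decimal digits used for Table 10.3, so the computed figures may differ slightly in the last digits.

```python
import math

def unstable_two_step(h, n):
    """Iterate u_{j+2} = 4 u_{j+1} - (3 - 2h) u_j, i.e. (10.29), up to index n."""
    u = [1.0, math.exp(-h)]                 # starting values u_0 and u_1
    for j in range(n - 1):
        u.append(4.0 * u[j + 1] - (3.0 - 2.0 * h) * u[j])
    return u

h = 0.1
u = unstable_two_step(h, 10)
errors = [u[j] - math.exp(-j * h) for j in range(len(u))]

quotient = errors[10] / errors[9]           # q_10 in the notation of Table 10.3
lambda_2 = 2.0 + math.sqrt(1.0 + 2.0 * h)   # the larger zero of the characteristic equation
```

The error at $x = 1$ is of the order of ten although the exact solution is $e^{-1}$, and `quotient` is close to `lambda_2`, in line with the explanation given in the example.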
Roughly speaking, for multistep methods with $r \ge 2$, the (homogeneous)
difference equation of order r occurring in the multistep method has r lin-
early independent solutions, whereas the approximated differential equa-
tion has only one solution. Hence only one of the solutions to the difference
equation corresponds to the differential equation. Therefore, convergence
of the multistep method can be expected only when the additional solu-
tions to the difference equation remain bounded. Note that these additional
solutions will always be activated by errors in the starting values and by
round-off errors. For this reason we proceed by investigating the stability
of the difference equation.
Definition 10.32 The linear difference equation
\[ u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = 0, \quad j = 0, 1, \ldots, \tag{10.31} \]

with constant coefficients $a_0, \ldots, a_{r-1}$ is called stable if all its solutions are
bounded.

Theorem 10.33 The linear difference equation (10.31) is stable if and
only if it satisfies the root condition, i.e., if all the zeros $\lambda$ of the characteristic
polynomial
\[ p(\lambda) := \lambda^r + \sum_{m=0}^{r-1} a_m \lambda^m \tag{10.32} \]
have absolute value $|\lambda| \le 1$, and zeros satisfying $|\lambda| = 1$ are simple zeros.
Proof. We begin by noting that each solution to the difference equation
(10.31) is uniquely determined by its $r$ initial values $u_0, u_1, \ldots, u_{r-1}$. Obviously,
from these initial values the remaining terms $u_r, u_{r+1}, \ldots$ are recursively
determined by (10.31).
For convenience we set $a_r = 1$ and denote by $\Lambda$ the differential operator
given by $(\Lambda f)(\lambda) = \lambda f'(\lambda)$. Then for the sequence
\[ u_j = j^n \lambda^j, \quad j = 0, 1, \ldots, \tag{10.33} \]
we have that
\[ u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = \sum_{m=0}^{r} a_m (j + m)^n \lambda^{j+m} = \lambda^j \sum_{k=0}^{n} \binom{n}{k} j^k (\Lambda^{n-k} p)(\lambda). \]
From this it can be deduced that if $\lambda$ is a zero of the characteristic polynomial
$p$ of multiplicity $s$, then for $n = 0, 1, \ldots, s - 1$ the sequence (10.33)
solves the difference equation.
Now assume that $\lambda_1, \ldots, \lambda_k$ are the zeros of the characteristic polynomial
(10.32) and have multiplicities $s_1, \ldots, s_k$; i.e.,
\[ p(\lambda) = \prod_{l=1}^{k} (\lambda - \lambda_l)^{s_l}. \]
Then the general solution of the homogeneous difference equation (10.31)
is given by
\[ u_j = \sum_{l=1}^{k} \sum_{s=0}^{s_l - 1} \alpha_{ls}\, j^s \lambda_l^j \tag{10.34} \]
with $r$ arbitrary constants $\alpha_{ls}$. To establish this we need to show that the
coefficients $\alpha_{ls}$ can be chosen such that arbitrarily given initial conditions
\[ \sum_{l=1}^{k} \sum_{s=0}^{s_l - 1} \alpha_{ls}\, j^s \lambda_l^j = u_j, \quad j = 0, \ldots, r - 1, \tag{10.35} \]
are fulfilled. The homogeneous adjoint system to the system (10.35) reads
\[ \sum_{j=0}^{r-1} \beta_j\, j^s \lambda_l^j = 0, \quad s = 0, \ldots, s_l - 1, \quad l = 1, \ldots, k. \]

Assume that $\beta_j$, $j = 0, \ldots, r - 1$, is a solution. Then the polynomial

\[ q(\lambda) := \sum_{j=0}^{r-1} \beta_j \lambda^j \]

of degree $r - 1$ has the zeros $\lambda_l$ with multiplicity $s_l$ for $l = 1, \ldots, k$; i.e.,
the polynomial has $r$ zeros and therefore, by Theorem 8.1, must vanish
identically. This implies $\beta_0 = \cdots = \beta_{r-1} = 0$. Hence, for each given right-hand
side the system (10.35) has a unique solution.
Now from the form (10.34) of the general solution to the difference equation,
the equivalence of stability and the root condition is obvious. □
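For two-step equations the root condition of Theorem 10.33 can be checked directly from the quadratic formula. The following Python sketch does this for $p(\lambda) = \lambda^2 + a_1\lambda + a_0$; the function name `satisfies_root_condition` and the tolerance `tol` used to decide whether a zero lies on the unit circle or is a double zero are assumptions of this illustration, and such a tolerance is only a heuristic in floating-point arithmetic.

```python
import cmath

def satisfies_root_condition(a0, a1, tol=1e-12):
    """Root condition for the polynomial lambda**2 + a1*lambda + a0 (case r = 2)."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a0)
    roots = [(-a1 + disc) / 2.0, (-a1 - disc) / 2.0]
    # All zeros must lie in the closed unit disk.
    for lam in roots:
        if abs(lam) > 1.0 + tol:
            return False
    # Zeros on the unit circle must be simple.
    on_circle = [lam for lam in roots if abs(abs(lam) - 1.0) <= tol]
    if len(on_circle) == 2 and abs(roots[0] - roots[1]) <= tol:
        return False
    return True

example_10_31 = satisfies_root_condition(3.0, -4.0)     # lambda^2 - 4 lambda + 3, zeros 1 and 3
adams_two_step = satisfies_root_condition(0.0, -1.0)    # lambda^2 - lambda, zeros 0 and 1
double_zero = satisfies_root_condition(1.0, -2.0)       # (lambda - 1)^2, double zero at 1
```

The first polynomial belongs to the method (10.28) and violates the root condition, the second belongs to the two-step Adams methods and satisfies it, and the third fails because its zero on the unit circle is not simple.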

Besides the solution (10.34) of the homogeneous difference equation, we
also will need an explicit expression for the solution to the inhomogeneous
difference equation.

Lemma 10.34 For $k = 0, 1, \ldots, r - 1$, let $u_{j,k}$ denote the unique solutions
to the homogeneous difference equation (10.31) with initial values

\[ u_{j,k} = \delta_{j,k}, \quad j = 0, 1, \ldots, r - 1. \]

Then for a given right-hand side $c_r, c_{r+1}, \ldots$, the unique solution to the
inhomogeneous difference equation
\[ z_{j+r} + \sum_{m=0}^{r-1} a_m z_{j+m} = c_{j+r}, \quad j = 0, 1, \ldots, \tag{10.36} \]

with initial values $z_0, z_1, \ldots, z_{r-1}$ is given by

\[ z_{j+r} = \sum_{k=0}^{r-1} z_k u_{j+r,k} + \sum_{k=0}^{j} c_{k+r}\, u_{j+r-k-1,r-1}, \quad j = 0, 1, \ldots. \tag{10.37} \]
Proof. Setting $u_{m,r-1} = 0$ for $m = -1, -2, \ldots$, we can rewrite (10.37) in
the form
\[ z_j = \sum_{k=0}^{r-1} z_k u_{j,k} + w_j, \quad j = 0, 1, \ldots, \]
where
\[ w_j := \sum_{k=0}^{\infty} c_{k+r}\, u_{j-k-1,r-1}, \quad j = 0, 1, \ldots. \]

Obviously, $w_j = 0$ for $j = 0, \ldots, r - 1$, and therefore it remains to show
that $w_j$ satisfies the inhomogeneous difference equation (10.36).
As in the proof of Theorem 10.33 we set $a_r = 1$. Then, using $u_{m,r-1} = 0$
for $m < r - 1$, $u_{r-1,r-1} = 1$, and the homogeneous difference equation for
$u_{m,r-1}$, we compute

\[
\begin{aligned}
\sum_{m=0}^{r} a_m w_{j+m}
&= \sum_{m=0}^{r} a_m \sum_{k=0}^{\infty} c_{k+r}\, u_{j+m-k-1,r-1} \\
&= \sum_{m=0}^{r} a_m \sum_{k=0}^{j} c_{k+r}\, u_{j+m-k-1,r-1} \\
&= \sum_{k=0}^{j} c_{k+r} \sum_{m=0}^{r} a_m u_{j+m-k-1,r-1} = c_{j+r}.
\end{aligned}
\]

Now the proof is completed by noting that each solution to the inhomogeneous
difference equation (10.36) is uniquely determined by its $r$ initial
values $z_0, z_1, \ldots, z_{r-1}$. □
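The representation (10.37) can be checked numerically by comparing it with a direct evaluation of the recursion (10.36). In the Python sketch below the function names, the coefficients $a_0 = a_1 = -1/2$, the initial values, and the right-hand side $c_i = 1/(i+1)$ are all assumptions chosen for illustration.

```python
def solve_recurrence(a, init, c, n):
    """Iterate z_{j+r} = c(j+r) - sum_m a[m] z_{j+m} and return z_0, ..., z_n."""
    r = len(a)
    z = list(init)
    for j in range(n + 1 - r):
        z.append(c(j + r) - sum(a[m] * z[j + m] for m in range(r)))
    return z

def via_lemma(a, init, c, n):
    """Evaluate z_0, ..., z_n through formula (10.37)."""
    r = len(a)
    zero = lambda i: 0.0
    # U[k][j] = u_{j,k}: homogeneous solutions with initial values delta_{j,k}.
    U = [solve_recurrence(a, [1.0 if j == k else 0.0 for j in range(r)], zero, n)
         for k in range(r)]
    z = list(init)
    for j in range(n + 1 - r):
        homogeneous = sum(init[k] * U[k][j + r] for k in range(r))
        inhomogeneous = sum(c(k + r) * U[r - 1][j + r - k - 1] for k in range(j + 1))
        z.append(homogeneous + inhomogeneous)
    return z

a = [-0.5, -0.5]                  # a_0, a_1: characteristic zeros 1 and -1/2
init = [1.0, 2.0]                 # z_0, z_1
c = lambda i: 1.0 / (i + 1)       # right-hand side c_i

direct = solve_recurrence(a, init, c, 12)
lemma = via_lemma(a, init, c, 12)
max_diff = max(abs(p - q) for p, q in zip(direct, lemma))
```

Up to rounding errors the two sequences coincide, as Lemma 10.34 asserts.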

Definition 10.35 The multistep method of Definition 10.27 is called stable
if the associated difference equation

\[ u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} = 0 \]

is stable.
Single-step methods are always stable, since the associated difference
equation $u_{j+1} - u_j = 0$ clearly satisfies the root condition.
Remark 10.36 The multistep methods (10.21) are stable.
Proof. The corresponding characteristic polynomial $p(\lambda) := \lambda^r - \lambda^{r-k}$ fulfills
the root condition. □

For establishing convergence of multistep methods, we will need the following
extension of Lemma 10.21.
Lemma 10.37 Let $(\xi_j)$ be a sequence in $\mathbb{R}$ with the property

\[ |\xi_j| \le A \sum_{m=0}^{j-1} |\xi_m| + B, \quad j = 1, 2, \ldots, \]

for some constants $A > 0$ and $B \ge 0$. Then the estimate

\[ |\xi_j| \le (A|\xi_0| + B)\, e^{(j-1)A}, \quad j = 1, 2, \ldots, \]
holds.

Proof. We prove by induction that

\[ |\xi_j| \le (A|\xi_0| + B)(1 + A)^{j-1}, \quad j = 1, 2, \ldots. \tag{10.38} \]

Then the assertion follows by using the estimate $1 + A \le e^A$. The inequality
(10.38) is true for $j = 1$. Assume that it has been proven up to some $j \ge 1$.
Then we have

\[ |\xi_{j+1}| \le A \sum_{m=0}^{j} |\xi_m| + B \le (A|\xi_0| + B) + A \sum_{m=1}^{j} (A|\xi_0| + B)(1 + A)^{m-1} = (A|\xi_0| + B)(1 + A)^j; \]

i.e., the estimate is also true for $j + 1$. □
Theorem 10.38 Assume that the function $\varphi$ describing the multistep method
is continuous and satisfies a Lipschitz condition; i.e.,
\[ |\varphi(x, u_0, u_1, \ldots, u_{r-1}; h) - \varphi(x, v_0, v_1, \ldots, v_{r-1}; h)| \le M \sum_{m=0}^{r-1} |u_m - v_m| \]
for all $(x, u_0), \ldots, (x, u_{r-1}), (x, v_0), \ldots, (x, v_{r-1}) \in G$, all (sufficiently small)
$h$, and a Lipschitz constant $M$. Furthermore, assume that the multistep
method is consistent and stable and that the starting values are consistent.
Then the multistep method is convergent. If both the multistep method and
the starting values have consistency order p, then the convergence also is
of order p.

Proof. (Compare to the proof of Theorem 10.22.) For the errors

\[ e_j := u_j - u(x_j) \]

we obtain
\[ e_{j+r} + \sum_{m=0}^{r-1} a_m e_{j+m} = u_{j+r} + \sum_{m=0}^{r-1} a_m u_{j+m} - u(x_{j+r}) - \sum_{m=0}^{r-1} a_m u(x_{j+m}). \]

We rewrite this into the form

\[ e_{j+r} + \sum_{m=0}^{r-1} a_m e_{j+m} = h c_{j+r}, \quad j = 0, 1, \ldots, \tag{10.39} \]

where

\[ c_{j+r} := \varphi(x_j, u_j, \ldots, u_{j+r-1}; h) - \Delta(x_j, u(x_j); h) - \varphi(x_j, u(x_j), \ldots, u(x_{j+r-1}); h). \]

We can estimate the right-hand side by
\[ |c_{j+r}| \le M \sum_{m=0}^{r-1} |e_{j+m}| + c(h), \quad j = 0, 1, \ldots, \tag{10.40} \]

where
\[ c(h) = \max_{a \le x \le b} |\Delta(x, u(x); h)| \]

satisfies $c(h) \to 0$, $h \to 0$, since we assume consistency. By Lemma 10.34
we can express the solution of (10.39) in the form
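\[ e_{j+r} = \sum_{k=0}^{r-1} e_k u_{j+r,k} + h \sum_{k=0}^{j} c_{k+r}\, u_{j+r-k-1,r-1}, \quad j = 0, 1, \ldots. \]

From this, since we assume stability, we can estimate

\[ |e_{j+r}| \le N \Bigl\{ d(h) + h \sum_{k=0}^{j} |c_{k+r}| \Bigr\}, \quad j = 0, 1, \ldots, \]

for some constant $N$ and

\[ d(h) := \sum_{k=0}^{r-1} |e_k|. \]

We note that $d(h) \to 0$, $h \to 0$, since the starting values are assumed to be
consistent. Inserting (10.40) into the last inequality now yields

\[ |e_{j+r}| \le N \Bigl\{ d(h) + hM \sum_{k=0}^{j} \sum_{m=0}^{r-1} |e_{k+m}| + (j + 1) h c(h) \Bigr\}, \quad j = 0, 1, \ldots. \]

Because of
\[ \sum_{k=0}^{j} \sum_{m=0}^{r-1} |e_{k+m}| = \sum_{m=0}^{r-1} \sum_{k=m}^{m+j} |e_k| \le r \sum_{k=0}^{r+j-1} |e_k| = r \sum_{k=r}^{r+j-1} |e_k| + r d(h) \]

and $(j + 1)h \le x_{j+1} - x_0 \le 2(b - a)$ we obtain that $|e_r| \le C\gamma(h)$ and

\[ |e_{j+r}| \le Ch \sum_{k=r}^{r+j-1} |e_k| + C\gamma(h), \quad j = 1, 2, \ldots, \]

for some constant $C$ and $\gamma(h) := d(h) + c(h)$. Now Lemma 10.37 implies
that

\[ |e_{j+r}| \le C(1 + Ch)\,\gamma(h)\, e^{(j-1)Ch}, \quad j = 1, 2, \ldots, \]

whence
\[ E(h) \le C(1 + Ch)\,\gamma(h)\, e^{(b-a)C} \to 0, \quad h \to 0, \]

follows, since $\gamma(h) \to 0$, $h \to 0$. For consistency order $p$ we have that
$\gamma(h) = O(h^p)$; i.e., the convergence is also of order $p$. □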

The basic advantage of multistep methods results from the fact that for
arbitrary convergence order, in each step only one new evaluation of the
function f is required. In contrast, for single-step methods the number of
function evaluations required in each step is equal, in general, to the conver-
gence order. Therefore, multistep methods are much faster than single-step
methods. However, it should be noted that readjusting the step size dur-
ing the computation is more involved due to the need to recompute the
corresponding starting values for the new step size.

Problems
10.1 Find the exact solution of the initial value problem

\[ u' = -u^2, \quad u(0) = 1, \]

and compare it to the approximate solutions obtained by successive approximations
according to Corollary 10.6. Compute the third iterate $u_3$ and compare the
exact error $u - u_3$ to the a posteriori error estimate from Corollary 10.6.

10.2 Consider the initial value problem $u' = u$, $u(0) = 1$, and show that the
approximate solution from the Euler method is given by $u_j = (1 + h)^j$.

10.3 Find the exact solution of the initial value problem

\[ u' = \frac{2u}{x}, \quad u(1) = 1. \]
Determine an analytic expression for the approximate solution by Euler's method
and verify the convergence order one predicted by Theorem 10.18.

10.4 Show that Euler's method fails to approximate the solution $u(x) = \left(\tfrac{2}{3}x\right)^{3/2}$
of the initial value problem $u' = u^{1/3}$, $u(0) = 0$. Explain this failure.

10.5 Show that the differential equation $u' = ax$ with $a \in \mathbb{R}$ is solved exactly
by the improved Euler method.

10.6 Show that the single-step method

\[ u_{j+1} = u_j + hf\left(x_j + \tfrac{h}{2},\; u_j + \tfrac{h}{2}\, f(x_j, u_j)\right) \]

has consistency order two if $f$ is twice continuously differentiable. This method
is known as the modified Euler method.

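10.7 Show that the single-step method given by

\[
\begin{aligned}
k_1 &= f(x_j, u_j), \\
k_2 &= f\left(x_j + \tfrac{h}{3},\; u_j + \tfrac{h}{3}\, k_1\right), \\
k_3 &= f\left(x_j + \tfrac{2h}{3},\; u_j + \tfrac{2h}{3}\, k_2\right),
\end{aligned}
\]
and
\[ u_{j+1} = u_j + \frac{h}{4}\,(k_1 + 3k_3) \]
is consistent and has consistency order three if $f$ is three-times continuously
differentiable. This method is known as Heun's third-order method.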

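10.8 Show that the single-step method given by

\[
\begin{aligned}
k_1 &= f(x_j, u_j), \\
k_2 &= f\left(x_j + \tfrac{h}{2},\; u_j + \tfrac{h}{2}\, k_1\right), \\
k_3 &= f(x_j + h,\; u_j - hk_1 + 2hk_2),
\end{aligned}
\]
and
\[ u_{j+1} = u_j + \frac{h}{6}\,(k_1 + 4k_2 + k_3) \]
is consistent and has consistency order three if $f$ is three-times continuously
differentiable. This method is known as Kutta's third-order method.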

10.9 Show that the Runge-Kutta method (see Definition 10.25) has consistency
order four if $f$ is four-times continuously differentiable.

10.10 Write a computer program for the Runge-Kutta method and test it for
various examples.

10.11 The populations $p = p(t)$ and $q = q(t)$ of two interacting animal species
that have a predator-prey relationship are modeled by the system of the
Lotka-Volterra equations
\[ \frac{dp}{dt} = \alpha p + \beta pq, \qquad \frac{dq}{dt} = \gamma q + \delta pq \]
with constant coefficients $\alpha < 0$, $\beta > 0$, $\gamma > 0$, and $\delta < 0$, complemented by initial
conditions $p(0) = p_0$ and $q(0) = q_0$. (Explain the significance of the signs of the
constants for the model.) For the coefficients $\alpha = -1$, $\beta = 0.01$, $\gamma = 0.25$, and
$\delta = -0.01$, test the stability of the solutions by solving the initial value problem
numerically by the Runge-Kutta method for the four different initial conditions
$p_0 = 30 \pm 1$ and $q_0 = 80 \pm 1$. Visualize the numerical results by a phase diagram,
i.e., by the curve $\{(p(t), q(t)) : t \in [0, T]\}$ for sufficiently large $T > 0$.

10.12 Verify the coefficients in the Adams-Bashforth and Adams-Moulton methods
(10.22)-(10.25).

10.13 Determine the coefficients of the Adams-Bashforth and Adams-Moulton
methods for $r = 3$.

10.14 The multistep methods (10.21) for $k = 2$ and $s = r - 1$ and for $k = 2$
and $s = r$ are known as the Nyström method and the Milne-Thomson method,
respectively. Determine the coefficients of the Nyström method and the
Milne-Thomson method for $r = 1$ and $r = 2$.

10.15 Verify the coefficients in the difference formula (10.26).

10.16 Construct a two-step method of the form

that has consistency order two and discuss its stability.

10.17 Find the general solution of the difference equation

for $0 < a < 1$. Show that $\lim_{j \to \infty} u_j = 1/(1 - a)$.

10.18 Find an explicit expression for the Fibonacci numbers $a_j$, which are defined
by $a_0 = a_1 = 1$ and $a_{j+1} = a_j + a_{j-1}$ for $j \ge 1$. Is the root condition of
Theorem 10.33 satisfied?

10.19 Attempt to approximate the unique solution $u(x) = 2$ of the initial value
problem
\[ u' = xu(u - 2), \quad u(0) = 2, \]
numerically by any of the methods described in this chapter. Discuss the results
by relating them to the solution of the initial value problem with perturbed initial
condition $u(0) = 2 + a$ for small $a \in \mathbb{R}$.

10.20 Consider the approximate solution of the initial value problem

\[ u' + 100u = 100, \quad u(0) = 2, \]

by the Euler method. Explain why for an accurate approximation the step size
$h$ has to be chosen smaller than 0.02 despite the fact that the solution is
almost constant for $x$ not too small, say, for $x > 0.1$. (This differential equation
is an example of a so-called stiff equation, for which the numerical solution is
rather delicate.)
11
Boundary Value Problems

Whereas in initial value problems the solution is determined by conditions


imposed at one point only, boundary value problems for ordinary differ-
ential equations are problems in which the solution is required to satisfy
conditions at more than one point, usually at the two endpoints of the
interval in which the solution is to be found. Since an ordinary differential
equation of order n has, in principle, a general solution depending on n pa-
rameters, the total number of boundary conditions required to determine
a unique solution is n. For an introduction to some of the basic methods
for the numerical solution of such boundary value problems we shall con-
fine ourselves to the simplest boundary value problem, which is one for
an equation of the second order in which the solution is specified at two
distinct points. For more detailed studies we refer to [13, 36, 46].
As opposed to the fundamental Picard-Lindelöf existence and uniqueness
theorem for initial value problems, a detailed analysis of the existence and
uniqueness theory for nonlinear boundary value problems is more involved
and beyond the scope of this introduction. However, for linear boundary
value problems the theory is more elementary, and we shall include part of
it in our analysis.
For the numerical solution of boundary value problems for ordinary dif-
ferential equations three different groups of methods are available: shooting
methods, finite difference methods, and finite element methods. Whereas
shooting methods, which we briefly describe in Section 11.1 and which rely
on numerical methods for initial value problems, are restricted to ordinary
differential equations, the finite difference and finite element methods can
also be applied to boundary value problems for partial differential equations.
Therefore, our presentation of finite difference and finite element
methods for linear ordinary differential equations is also meant as a model
discussion for the more complicated and more important case of partial
differential equations.
Of course, in one chapter only a small part of the theory and the ap-
plications of finite difference and finite element methods can be covered.
Hence, we set ourselves the task to outline the basic ideas of these meth-
ods by considering only the simplest cases. For a solid foundation of the
finite element method, we felt it was necessary to include as its theoretical
basis a discussion of the Galerkin method for strictly coercive operators,
which appears in Section 11.3. This, in turn, made it necessary to present
the Lax-Milgram theorem on the existence of solutions for equations with
strictly coercive operators.

11.1 Shooting Methods


Consider the boundary value problem for the differential equation of the
second order
\[ u'' = f(x, u, u'), \quad a \le x \le b, \tag{11.1} \]
with boundary conditions

\[ u(a) = \alpha, \quad u(b) = \beta. \tag{11.2} \]

For the sake of simplicity we assume that the function $f$ is defined on
$[a, b] \times \mathbb{R}^2$.
Shooting methods attempt to employ the numerical methods described
in the previous chapter for initial value problems where, roughly speaking,
the initial conditions at $x = a$ are adjusted so that the solution satisfies the
required boundary conditions (11.2). For this, in addition to the boundary
value problem, we also consider the initial value problem

\[ u'' = f(x, u, u'), \quad u(a) = \alpha, \quad u'(a) = s, \tag{11.3} \]

with a real parameter $s$. Geometrically speaking, the parameter $s$ prescribes
the initial slope of the solution curve.
If we assume that $f$ is continuous and satisfies a Lipschitz condition
with respect to $u$ and $u'$, then by the Picard-Lindelöf Theorem 10.5, for
each $s \in \mathbb{R}$ there exists a unique solution $u(\cdot, s)$ of the initial value problem
(11.3). To arrive at a solution to the boundary value problem (11.1)-(11.2),
the parameter $s$ has to be chosen such that $u(b, s) = \beta$; i.e., we have to
solve the equation
\[ F(s) = 0, \]
where the function $F : \mathbb{R} \to \mathbb{R}$ is defined by
\[ F(s) := u(b, s) - \beta. \]
For each $s$ the value $F(s)$ can be computed approximately by one of the
numerical methods of Chapter 10 for the solution of initial value problems,
extended appropriately to the case of a second-order equation. Note that
for a nonlinear differential equation the equation $F(s) = 0$ is nonlinear.
For finding a zero of $F$ the Newton method of Section 6.2 can be employed.
For the computation of the derivative $F'(s)$, which is required for
Newton's method, we assume that the solution $u$ to the initial value problem
(11.3) depends in a continuously differentiable manner on the parameter
$s$. This can be assured by appropriate assumptions on $f$ (see [12]). We
set
\[ v := \frac{\partial u}{\partial s} \]
and differentiate the differential equation and the initial condition (11.3)
with respect to $s$ to obtain

\[ v''(x, s) = f_u(x, u(x, s), u'(x, s))\, v(x, s) + f_{u'}(x, u(x, s), u'(x, s))\, v'(x, s) \tag{11.4} \]

and
\[ v(a, s) = 0, \quad v'(a, s) = 1. \tag{11.5} \]
Since
\[ F'(s) = v(b, s), \]
computing the derivative of $F$ requires solving the additional linear initial
value problem (11.4)-(11.5) for $v$, where $u$ is known from solving (11.3).
Note that from a numerical approximation, $u$ is known only at grid points.
Summarizing, we obtain the following method.
Algorithm 11.1 The shooting method with Newton iterations consists of
the following steps:
1. Choose an initial slope $s \in \mathbb{R}$.
2. Solve numerically the initial value problem for

\[ u'' = f(x, u, u') \]

with initial conditions $u(a) = \alpha$, $u'(a) = s$ and the initial value problem for

\[ v'' = f_u(x, u, u')\, v + f_{u'}(x, u, u')\, v' \]

with initial conditions $v(a) = 0$, $v'(a) = 1$.
3. If $u(b) = \beta$ is satisfied within the required accuracy, then stop; otherwise,
replace $s$ by
\[ s - \frac{u(b) - \beta}{v(b)} \]
and go back to step 2.

Example 11.2 Consider the boundary value problem

\[ u'' = u^3, \quad u(1) = \sqrt{2}, \quad u(2) = \tfrac{1}{2}\sqrt{2}, \]
with the exact solution $u(x) = \sqrt{2}/x$. We solve numerically the associated
initial value problem
\[ u'' = u^3, \quad u(1) = \sqrt{2}, \quad u'(1) = s, \]
by the improved Euler method of Section 10.2 with step sizes $h = 0.1$,
$h = 0.01$, and $h = 0.001$. For this we transform the initial value problem
for the equation of second order into the initial value problem for the system
\[ u' = w, \quad w' = u^3, \quad u(1) = \sqrt{2}, \quad w(1) = s. \]
As starting value for the Newton iteration we choose $s = 0$. The exact initial
condition is $s = -\sqrt{2} = -1.414214$. The numerical results represented in
Table 11.1 illustrate the feasibility of the shooting method with Newton
iterations. □

TABLE 11.1. Numerical results for Example 11.2.

        h = 0.1                 h = 0.01                h = 0.001
      s        F(s)           s        F(s)           s        F(s)
   0.00000   3.61648       0.00000   3.84079       0.00000   3.84400
  -0.81116   1.10056      -0.74681   1.26284      -0.74584   1.26538
  -1.31684   0.15879      -1.28234   0.21124      -1.28180   0.21210
  -1.41553   0.00373      -1.40987   0.00678      -1.40980   0.00684
  -1.41796   0.00000      -1.41424   0.00000      -1.41420   0.00000
  -1.41796   0.00000      -1.41424   0.00000      -1.41421   0.00000
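The Newton shooting iteration for the boundary value problem $u'' = u^3$, $u(1) = \sqrt{2}$, $u(2) = \sqrt{2}/2$ can be sketched in Python as follows. Note the deviations from the example, all of which are assumptions made for this illustration: the initial value problems are integrated by the Runge-Kutta method of Definition 10.25 instead of the improved Euler method, the step size is fixed at $h = 0.01$, and the names `rk4_system`, `rhs`, `shoot`, and `newton_shooting` are hypothetical. The computed iterates therefore need not reproduce the entries of Table 11.1.

```python
import math

def rk4_system(rhs, x, y, h):
    """One Runge-Kutta step for a first-order system y' = rhs(x, y), y given as a list."""
    k1 = rhs(x, y)
    k2 = rhs(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = rhs(x + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = rhs(x + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def rhs(x, y):
    """System for u'' = u^3 together with the linearized equation v'' = 3 u^2 v."""
    u, w, v, z = y                      # w = u', z = v'
    return [w, u**3, z, 3 * u**2 * v]

def shoot(s, a=1.0, b=2.0, alpha=math.sqrt(2.0), n=100):
    """Return u(b, s) and v(b, s) = F'(s) for the initial slope s."""
    h = (b - a) / n
    y = [alpha, s, 0.0, 1.0]            # u(a) = alpha, u'(a) = s, v(a) = 0, v'(a) = 1
    for j in range(n):
        y = rk4_system(rhs, a + j * h, y, h)
    return y[0], y[2]

def newton_shooting(s=0.0, beta=math.sqrt(2.0) / 2.0, tol=1e-10, max_iter=20):
    """Newton iteration s <- s - F(s)/F'(s) with F(s) = u(b, s) - beta."""
    for _ in range(max_iter):
        u_b, v_b = shoot(s)
        F = u_b - beta
        if abs(F) < tol:
            break
        s -= F / v_b
    return s

slope = newton_shooting()               # approximates the exact initial slope -sqrt(2)
```

Since $f(x, u, u') = u^3$ does not depend on $u'$, the term with $f_{u'}$ drops out of the linearized equation.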

Numerical problems with ill-conditioning will arise in cases where small
changes in the initial data $s$ will cause large changes in the solution $u(\cdot, s)$.
This is illustrated by the following example.
Example 11.3 The linear boundary value problem

$$u'' - u' - 110u = 0, \quad u(0) = u(10) = 1,$$

has the unique solution

$$u(x) = \frac{1}{e^{110} - e^{-100}} \left\{ (e^{110} - 1)\,e^{-10x} + (1 - e^{-100})\,e^{11x} \right\}.$$

The unique solution to the associated initial value problem with initial
conditions $u(0) = 1$ and $u'(0) = s$ is given by

$$u(x) = \frac{11 - s}{21}\, e^{-10x} + \frac{10 + s}{21}\, e^{11x}.$$

Hence, in this case we have

$$F(s) = \frac{11 - s}{21}\, e^{-100} + \frac{10 + s}{21}\, e^{110} - 1.$$

From $F(s) = 0$ we deduce that the exact initial slope $s$ satisfies

$$-10 < s = -10 + 21\, \frac{e^{-110} - e^{-210}}{1 - e^{-210}}.$$

In a numerical computation with ten-decimal-digit accuracy the best
approximation $\tilde{s}$ to the exact zero $s$ we can expect is such that

$$-10 \le \tilde{s} \le -10 + 10^{-9}.$$

Within this interval of initial conditions we now have

$$u(10, -10) = e^{-100} \approx 0$$

and

$$u(10, -10 + 10^{-9}) = \frac{21 - 10^{-9}}{21}\, e^{-100} + \frac{10^{-9}}{21}\, e^{110} \approx 2.8 \cdot 10^{37};$$

i.e., small changes in $s$ will cause very large changes in the values of the
solution at the other endpoint. Hence, we cannot expect that this boundary
value problem can be numerically solved by the simple version of the
shooting method. □
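
The sensitivity described in Example 11.3 can be checked directly from the closed-form solution of the initial value problem. The following sketch simply evaluates u(10, s) at the two ends of an interval of width 10^-9 around s = -10; the function name `u_at_10` is a hypothetical helper introduced only for this illustration.

```python
import math


def u_at_10(s):
    """Value at x = 10 of the solution with u(0) = 1, u'(0) = s."""
    return (11.0 - s) / 21.0 * math.exp(-100.0) + (10.0 + s) / 21.0 * math.exp(110.0)


if __name__ == "__main__":
    print(u_at_10(-10.0))          # the decaying mode only, e^{-100}
    print(u_at_10(-10.0 + 1e-9))   # the growing mode e^{11x} dominates
```

A perturbation of size 10^-9 in the slope is amplified by the factor e^{110}/21, which is the reason single shooting fails here.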

This difficulty can be remedied by a multiple shooting method as follows.


The interval $[a, b]$ is subdivided into $n$ subintervals according to

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b.$$

Then for given vectors $u = (u_0, \ldots, u_{n-1})^T$ and
$s = (s_0, \ldots, s_{n-1})^T$ in $\mathbb{R}^n$ such that $u_0 = \alpha$, for
$j = 0, \ldots, n - 1$ consider the $n$ initial value problems for

$$u'' = f(x, u, u')$$

on the subintervals $[x_j, x_{j+1}]$ with initial conditions

$$u(x_j) = u_j, \quad u'(x_j) = s_j.$$

In order to obtain from this a solution to the differential equation on all
of the interval $[a, b]$, the solutions $u(\cdot\,, u_j, s_j)$ on the subintervals
$[x_j, x_{j+1}]$ have to coincide at the grid points $x_1, \ldots, x_{n-1}$
together with their first derivatives. Then the differential equation ensures
that the function is twice continuously differentiable on $[a, b]$. In addition,
the boundary condition $u(b) = \beta$ must be satisfied. Altogether we have the
following $2n - 1$ nonlinear equations for the $2n - 1$ unknowns
$u_1, \ldots, u_{n-1}$ and $s_0, \ldots, s_{n-1}$:

$$\begin{aligned}
u(x_{j+1}, u_j, s_j) - u_{j+1} &= 0, \quad j = 0, \ldots, n - 2, \\
u'(x_{j+1}, u_j, s_j) - s_{j+1} &= 0, \quad j = 0, \ldots, n - 2, \\
u(x_n, u_{n-1}, s_{n-1}) - \beta &= 0.
\end{aligned} \qquad (11.6)$$

For the solution of this system Newton's method can again be used. For
details we refer to [36, 50].

11.2 Finite Difference Methods


As already indicated in Example 2.1, the basic idea of finite difference
methods for the approximate solution of boundary value problems consists
in replacing the derivatives in the differential equations by difference quo-
tients. For the sake of simplicity, we confine our presentation to a linear
boundary value problem. Without loss of generality we need consider only
the homogeneous boundary condition, since inhomogeneous boundary con-
ditions can be dealt with by incorporating them into the right-hand side of
the differential equation (see Problem 11.3).
Theorem 11.4 Assume that $q, r \in C[a, b]$ and $q \ge 0$. Then the boundary
value problem for the linear differential equation

$$-u'' + qu = r \quad \text{on } [a, b] \qquad (11.7)$$

with homogeneous boundary conditions

$$u(a) = u(b) = 0 \qquad (11.8)$$

has a unique solution $u \in C^2[a, b]$.

Proof. Assume that $u_1$ and $u_2$ are two solutions to the boundary value
problem. Then the difference $u = u_1 - u_2$ solves the homogeneous boundary
value problem

$$-u'' + qu = 0, \quad u(a) = u(b) = 0.$$

By partial integration we obtain

$$\int_a^b \left( [u']^2 + q u^2 \right) dx = \int_a^b (-u'' + qu)\, u \, dx = 0.$$

This implies $u' = 0$ on $[a, b]$, since $q \ge 0$. Hence $u$ is constant on
$[a, b]$, and the boundary conditions finally yield $u = 0$ on $[a, b]$.
Therefore, the boundary value problem (11.7)-(11.8) has at most one solution.

The general solution of the linear differential equation (11.7) is given by

$$u = C_1 u_1 + C_2 u_2 + u^*, \qquad (11.9)$$

where $u_1, u_2$ denotes a fundamental system of two linearly independent
solutions to the homogeneous differential equation, $u^*$ is a solution to the
inhomogeneous differential equation, and $C_1$ and $C_2$ are arbitrary
constants. This can be seen with the help of the Picard-Lindelöf Theorem 10.1
(see Problem 11.4). The boundary condition (11.8) is satisfied, provided that
the constants $C_1$ and $C_2$ solve the linear system

$$C_1 u_1(a) + C_2 u_2(a) + u^*(a) = 0, \qquad C_1 u_1(b) + C_2 u_2(b) + u^*(b) = 0.$$

This system is uniquely solvable. Assume that $C_1$ and $C_2$ solve the
homogeneous system. Then $u = C_1 u_1 + C_2 u_2$ yields a solution to the
homogeneous boundary value problem. Hence $u = 0$, since we have already
established uniqueness for the boundary value problem. From this we conclude
that $C_1 = C_2 = 0$ because $u_1$ and $u_2$ are linearly independent, and the
existence proof is complete. □

For the numerical solution, proceeding as in Example 2.1, we choose an
equidistant grid

$$x_j = a + jh, \quad j = 0, \ldots, n + 1,$$

with the step size given by $h = (b - a)/(n + 1)$ and $n \in \mathbb{N}$. At the
internal grid points $x_j$, $j = 1, \ldots, n$, we replace the differential
quotient in the differential equation by the difference quotient

$$u''(x_j) \approx \frac{1}{h^2} \left[ u(x_{j+1}) - 2u(x_j) + u(x_{j-1}) \right]$$

to obtain the system of equations

$$-\frac{1}{h^2} \left[ u_{j-1} - (2 + h^2 q_j)\, u_j + u_{j+1} \right] = r_j, \quad j = 1, \ldots, n, \qquad (11.10)$$

for approximate values $u_j$ to the exact solution $u(x_j)$. Here we have set
$q_j := q(x_j)$ and $r_j := r(x_j)$. The system has to be complemented by the
two boundary conditions

$$u_0 = u_{n+1} = 0. \qquad (11.11)$$
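For an abbreviated notation we introduce the $n \times n$ tridiagonal matrix

$$A := \frac{1}{h^2}
\begin{pmatrix}
2 + h^2 q_1 & -1 & & & \\
-1 & 2 + h^2 q_2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 + h^2 q_{n-1} & -1 \\
 & & & -1 & 2 + h^2 q_n
\end{pmatrix}$$

and the vectors $U = (u_1, \ldots, u_n)^T$ and $R = (r_1, \ldots, r_n)^T$. Then our
system of equations, including the boundary conditions, reads

$$AU = R. \qquad (11.12)$$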
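
As a minimal sketch of how the system (11.10)-(11.12) might be set up, the following code stores only the diagonal and the constant off-diagonal entry of A and checks that the rows are weakly diagonally dominant, with strict dominance in at least one row. The function names and the storage layout are assumptions made for this illustration.

```python
def assemble_fd_system(q, r, a, b, n):
    """Return grid points, diagonal of A, off-diagonal entry of A, and R.

    A is the tridiagonal matrix of (11.10): diagonal (2 + h^2 q_j)/h^2,
    sub- and superdiagonal -1/h^2, with h = (b - a)/(n + 1).
    """
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(1, n + 1)]
    diag = [(2.0 + h * h * q(xj)) / (h * h) for xj in x]
    off = -1.0 / (h * h)
    rhs = [r(xj) for xj in x]
    return x, diag, off, rhs


def is_weakly_row_dominant(diag, off):
    """Check |a_jj| >= sum of |a_jk|, k != j, with '>' in at least one row."""
    n = len(diag)
    strict_somewhere = False
    for j in range(n):
        neighbours = (1 if j > 0 else 0) + (1 if j < n - 1 else 0)
        off_sum = neighbours * abs(off)
        if abs(diag[j]) < off_sum:
            return False
        if abs(diag[j]) > off_sum:
            strict_somewhere = True
    return strict_somewhere


if __name__ == "__main__":
    x, diag, off, rhs = assemble_fd_system(lambda t: 0.0, lambda t: 1.0, 0.0, 1.0, 5)
    print(diag, off, is_weakly_row_dominant(diag, off))
```

For q = 0 the interior rows satisfy the dominance condition with equality, and only the first and last rows are strictly dominant; this is the situation in which irreducibility is needed.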

The following two questions have to be answered:


1. Is the system (11.12) uniquely solvable?
2. How large is the error between the approximate solution $u_j$ and the
   exact solution $u(x_j)$? Do we have convergence of the approximate
   solution to the exact solution as $h \to 0$?

Theorem 11.5 For each $h > 0$ the difference equations (11.10)-(11.11)
have a unique solution.

Proof. The tridiagonal matrix $A$ is irreducible and weakly row-diagonally
dominant. Hence, by Theorem 4.7, the matrix $A$ is invertible, and the
Jacobi iterations converge. □

Recall that for speeding up the convergence of the Jacobi iterations we
can use relaxation methods or multigrid methods as discussed in Sections
4.2 and 4.3.
The error and convergence analysis is initiated by first establishing the
following two lemmas.

Lemma 11.6 Denote by $A$ the matrix of the finite difference method for
$q \ge 0$ and by $A_0$ the corresponding matrix for $q = 0$. Then

$$0 \le A^{-1} \le A_0^{-1};$$

i.e., all components of $A^{-1}$ are nonnegative and smaller than or equal to
the corresponding components of $A_0^{-1}$.

Proof. The columns of the inverse $A^{-1} = (a_1, \ldots, a_n)$ satisfy
$A a_j = e_j$ for $j = 1, \ldots, n$ with the canonical unit vectors
$e_1, \ldots, e_n$ in $\mathbb{R}^n$. The Jacobi iterations for the solution of
$Az = e_j$ starting with $z_0 = 0$ are given by

$$z_{\nu+1} = -D^{-1}(A_L + A_R)\, z_\nu + D^{-1} e_j, \quad \nu = 0, 1, \ldots,$$

with the usual splitting $A = D + A_L + A_R$ of $A$ into its diagonal, lower,
and upper triangular parts. Since the entries of $D^{-1}$ and of
$-D^{-1}(A_L + A_R)$ are all nonnegative, it follows that $A^{-1} \ge 0$.
Analogously, the iterations

$$z_{\nu+1} = -D_0^{-1}(A_L + A_R)\, z_\nu + D_0^{-1} e_j, \quad \nu = 0, 1, \ldots,$$

yield the columns of $A_0^{-1}$. Therefore, from $D_0^{-1} \ge D^{-1}$ we
conclude that $A_0^{-1} \ge A^{-1}$. □

Lemma 11.7 Assume that $u \in C^4[a, b]$. Then

$$\left| u''(x) - \frac{1}{h^2} \left[ u(x + h) - 2u(x) + u(x - h) \right] \right| \le \frac{h^2}{12}\, \|u^{(4)}\|_\infty$$

for all $x \in [a + h, b - h]$.

Proof. By Taylor's formula we have that

$$u(x \pm h) = u(x) \pm h u'(x) + \frac{h^2}{2}\, u''(x) \pm \frac{h^3}{6}\, u'''(x) + \frac{h^4}{24}\, u^{(4)}(x \pm \theta_\pm h)$$

for some $\theta_\pm \in (0, 1)$. Adding these two equations gives

$$u(x + h) - 2u(x) + u(x - h) = h^2 u''(x) + \frac{h^4}{24}\, u^{(4)}(x + \theta_+ h) + \frac{h^4}{24}\, u^{(4)}(x - \theta_- h),$$

whence the statement of the lemma follows. □
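
The bound of Lemma 11.7 is easy to observe numerically. The sketch below uses u = sin as an assumed test function, for which the fourth derivative has maximum norm 1, and compares the error of the second difference quotient with h^2/12; the helper name `second_difference` is hypothetical.

```python
import math


def second_difference(u, x, h):
    """Central difference quotient [u(x+h) - 2u(x) + u(x-h)] / h^2."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)


if __name__ == "__main__":
    x = 1.0
    for h in (0.1, 0.01, 0.001):
        error = abs(-math.sin(x) - second_difference(math.sin, x, h))
        print(h, error, h * h / 12.0)  # the error stays below h^2/12
```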


Theorem 11.8 Assume that the solution to the boundary value problem
(11.7)-(11.8) is four-times continuously differentiable. Then the error of
the finite difference approximation can be estimated by

$$|u(x_j) - u_j| \le \frac{h^2}{96}\, \|u^{(4)}\|_\infty (b - a)^2, \quad j = 1, \ldots, n.$$

Proof. By Lemma 11.7, for

$$z_j := u''(x_j) - \frac{1}{h^2} \left[ u(x_{j+1}) - 2u(x_j) + u(x_{j-1}) \right]$$

we have the estimate

$$|z_j| \le \frac{h^2}{12}\, \|u^{(4)}\|_\infty, \quad j = 1, \ldots, n. \qquad (11.13)$$

Since

$$-\frac{1}{h^2} \left[ u(x_{j+1}) - (2 + h^2 q_j)\, u(x_j) + u(x_{j-1}) \right] = -u''(x_j) + q_j u(x_j) + z_j = r_j + z_j,$$

the vector $\tilde{U} = (u(x_1), \ldots, u(x_n))^T$ given by the exact solution
solves the linear system

$$A \tilde{U} = R + Z,$$

where $Z = (z_1, \ldots, z_n)^T$. Therefore,

$$A(\tilde{U} - U) = Z,$$

and from this, using Lemma 11.6 and the estimate (11.13), we obtain

$$|u(x_j) - u_j| \le \|A^{-1} Z\|_\infty \le \frac{h^2}{12}\, \|u^{(4)}\|_\infty \|A_0^{-1} e\|_\infty, \quad j = 1, \ldots, n, \qquad (11.14)$$

where $e = (1, \ldots, 1)^T$. The boundary value problem

$$-u_0'' = 1, \quad u_0(a) = u_0(b) = 0,$$

has the solution

$$u_0(x) = \frac{1}{2}\,(x - a)(b - x).$$

Since $u_0^{(4)} = 0$, in this case, as a consequence of (11.14) the finite
difference approximation coincides with the exact solution; i.e.,
$e = A_0 U_0 = A_0 \tilde{U}_0$. Hence,

$$\|A_0^{-1} e\|_\infty \le \|u_0\|_\infty = \frac{1}{8}\,(b - a)^2.$$

Inserting this into (11.14) completes the proof. □

Theorem 11.8 confirms that as in the case of the initial value problems
in Chapter 10, the order of the local discretization error is inherited by
the global error. Note that the assumption in Theorem 11.8 on the dif-
ferentiability of the solution is satisfied if q and r are twice continuously
differentiable.
The error estimate in Theorem 11.8 is not practical in general, since it
requires a bound on the fourth derivative of the unknown exact solution.
Therefore, in practice, analogously to (10.19) the error is estimated from
the numerical results for step sizes $h$ and $h/2$. Similarly, as in (10.20), a
Richardson extrapolation can be employed to obtain a fourth-order
approximation.
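
The following sketch illustrates both remarks for an assumed test problem on [0, 1] with q = 1 and exact solution sin(pi x). It solves (11.10)-(11.11) with a standard tridiagonal elimination, compares the error with the bound of Theorem 11.8, and then combines the results for h and h/2. The formulas (u_{h/2} - u_h)/3 for the error estimate and (4u_{h/2} - u_h)/3 for the extrapolated value are the usual ones for a second-order method and are assumed here to correspond to (10.19) and (10.20); all function names are hypothetical.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def fd_solve(q, r, a, b, n):
    """Solve -u'' + q u = r, u(a) = u(b) = 0, by the scheme (11.10)-(11.11)."""
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(1, n + 1)]
    off = [-1.0 / (h * h)] * n
    diag = [(2.0 + h * h * q(xj)) / (h * h) for xj in x]
    return x, solve_tridiagonal(off, diag, off, [r(xj) for xj in x])


def richardson_demo(n=9):
    """Return errors for h, h/2, the extrapolated value, the bound, the estimate."""
    q = lambda t: 1.0
    r = lambda t: (math.pi ** 2 + 1.0) * math.sin(math.pi * t)
    exact = lambda t: math.sin(math.pi * t)
    x, u_h = fd_solve(q, r, 0.0, 1.0, n)
    _, u_fine = fd_solve(q, r, 0.0, 1.0, 2 * n + 1)
    u_half = [u_fine[2 * j + 1] for j in range(n)]  # fine values at coarse nodes
    err_h = max(abs(exact(t) - u) for t, u in zip(x, u_h))
    err_half = max(abs(exact(t) - u) for t, u in zip(x, u_half))
    estimate = max(abs(uf - uc) / 3.0 for uf, uc in zip(u_half, u_h))
    extrap = [(4.0 * uf - uc) / 3.0 for uf, uc in zip(u_half, u_h)]
    err_extrap = max(abs(exact(t) - u) for t, u in zip(x, extrap))
    h = 1.0 / (n + 1)
    bound = h * h / 96.0 * math.pi ** 4  # Theorem 11.8 with b - a = 1
    return err_h, err_half, err_extrap, bound, estimate


if __name__ == "__main__":
    print(richardson_demo(9))
```

In this experiment the computable estimate agrees closely with the true error for step size h/2, and the extrapolated values are considerably more accurate than either grid solution.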
Of course, the finite difference approximation can be extended to the
general linear ordinary differential equation of second order

$$-u'' + pu' + qu = r$$

by using the approximation

$$u'(x_j) \approx \frac{1}{2h} \left[ u(x_{j+1}) - u(x_{j-1}) \right] \qquad (11.15)$$

for the first derivative. This approximation again has an error of order
$O(h^2)$ (see Problem 11.9). Besides Richardson extrapolation, higher-order
approximations can be obtained by using higher-order difference
approximations for the derivatives such as

$$u''(x) \approx \frac{1}{12h^2} \left[ -u(x + 2h) + 16u(x + h) - 30u(x) + 16u(x - h) - u(x - 2h) \right], \qquad (11.16)$$

which is of order $O(h^4)$, provided that $u$ is six-times continuously
differentiable (see Problem 11.9).
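
The order of the five-point formula (11.16) can be observed by halving the step size: for a fourth-order approximation the error should drop by a factor of about 16. The sketch below uses u = sin at x = 1 as an assumed test case; the function name is hypothetical.

```python
import math


def fourth_order_second_difference(u, x, h):
    """Five-point difference quotient (11.16) for u''(x)."""
    return (-u(x + 2 * h) + 16.0 * u(x + h) - 30.0 * u(x)
            + 16.0 * u(x - h) - u(x - 2 * h)) / (12.0 * h * h)


def error_at(h, x=1.0):
    """Error of (11.16) for u = sin, whose second derivative is -sin."""
    return abs(-math.sin(x) - fourth_order_second_difference(math.sin, x, h))


if __name__ == "__main__":
    print(error_at(0.1), error_at(0.05), error_at(0.1) / error_at(0.05))
```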

We wish also to indicate briefly how the finite difference approximations
are applied to boundary value problems for partial differential equations.
For this we consider the boundary value problem for

$$-\Delta u + qu = r \quad \text{in } D \qquad (11.17)$$

in the unit square $D = (0, 1) \times (0, 1)$ with boundary condition

$$u = 0 \quad \text{on } \partial D. \qquad (11.18)$$

Here $\Delta$ denotes the Laplacian

$$\Delta u := \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2}.$$

Proceeding as in the proof of Theorem 11.4, by partial integration it can
be seen that under the assumption $q \ge 0$ this boundary value problem
has at most one solution. It is more involved and beyond the scope of this
book to establish that a solution exists under proper assumptions on the
functions $q$ and $r$. We refer to [24, 60] and also the remarks at the end of
Section 11.4.
As in Example 2.2, we choose an equidistant grid

$$x_{ij} = (ih, jh), \quad i, j = 0, \ldots, n + 1,$$

with step size $h = 1/(n + 1)$ and $n \in \mathbb{N}$. Then we approximate the
Laplacian at the internal grid points by

$$\Delta u(x_{ij}) \approx \frac{1}{h^2} \left\{ u(x_{i+1,j}) + u(x_{i-1,j}) + u(x_{i,j+1}) + u(x_{i,j-1}) - 4u(x_{ij}) \right\}$$

and obtain the system of equations

$$\frac{1}{h^2} \left[ (4 + h^2 q_{ij})\, u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} \right] = r_{ij}, \quad i, j = 1, \ldots, n, \qquad (11.19)$$

for approximate values $u_{ij}$ to the exact solution $u(x_{ij})$. Here we have
set $q_{ij} := q(x_{ij})$ and $r_{ij} := r(x_{ij})$. This system has to be
complemented by the boundary conditions

$$\begin{aligned}
u_{0,j} = u_{n+1,j} &= 0, \quad j = 0, \ldots, n + 1, \\
u_{i,0} = u_{i,n+1} &= 0, \quad i = 1, \ldots, n.
\end{aligned} \qquad (11.20)$$

We refrain from rewriting the system (11.19)-(11.20) in matrix notation
and refer back to Example 2.2. Analogously to Theorem 11.5, it can be seen
that the Jacobi iterations converge (and relaxation methods and multigrid
methods are applicable). Hence we have the following theorem.

Theorem 11.9 For each $h > 0$ the difference equations (11.19)-(11.20)
have a unique solution.
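
A minimal sketch of the Jacobi iteration for the five-point system (11.19)-(11.20) is given below. Each sweep solves the equation at the grid point (i, j) for u_ij using the neighbouring values of the previous sweep. The function name, the stopping rule based on the largest change per sweep, and the test case q = 0, r = 1 are assumptions made for illustration only.

```python
def jacobi_five_point(q, r, n, tol=1e-10, max_sweeps=20000):
    """Jacobi iteration for (11.19) on the unit square with zero boundary values.

    Returns the (n+2) x (n+2) array of grid values u[i][j] at x_ij = (ih, jh).
    """
    h = 1.0 / (n + 1)
    u = [[0.0] * (n + 2) for _ in range(n + 2)]
    for _ in range(max_sweeps):
        new = [row[:] for row in u]
        change = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                x1, x2 = i * h, j * h
                neighbours = u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                new[i][j] = (h * h * r(x1, x2) + neighbours) / (4.0 + h * h * q(x1, x2))
                change = max(change, abs(new[i][j] - u[i][j]))
        u = new
        if change < tol:
            break
    return u


if __name__ == "__main__":
    grid = jacobi_five_point(lambda x1, x2: 0.0, lambda x1, x2: 1.0, 9)
    print(max(max(row) for row in grid))  # discrete maximum for -Laplace(u) = 1
```

The Jacobi iteration is slow for fine grids; it is used here only because it matches the convergence statement above, not because it is the method of choice.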

From the proof of Lemma 11.6 it can be seen that its statement also holds
for the corresponding matrices of the system (11.19)-(11.20). Lemma 11.7
implies that

provided that $u \in C^4([0, 1] \times [0, 1])$. Then we can proceed as in the
proof of Theorem 11.8 to derive an error estimate. For this we need to have an
estimate on the solution of

$$-\Delta u_0 = 1 \quad \text{in } D, \qquad u_0 = 0 \quad \text{on } \partial D. \qquad (11.21)$$

Either from an explicit form of the solution obtained by separation of
variables or by writing

where $v_0$ is a harmonic function, i.e., a solution of $\Delta v_0 = 0$, and
employing the maximum-minimum principle for harmonic functions (see [39]), it
can be seen that $\|u_0\|_\infty \le 1/8$ (see Problem 11.10). Hence we can state
the following theorem.

Theorem 11.10 Assume that the solution to the boundary value problem
(11.17)-(11.18) is four-times continuously differentiable. Then the error of
the finite difference approximation can be estimated by

11.3 The Riesz and Lax-Milgram Theorems


To establish the foundation of finite element methods for boundary value
problems we need to extend our tools from functional analysis.

Theorem 11.11 (Riesz) Let $X$ be a Hilbert space. Then for each bounded
linear function $F : X \to \mathbb{C}$ there exists a unique element $f \in X$
such that

$$F(u) = (u, f) \qquad (11.22)$$

for all $u \in X$. The norms of the element $f$ and the linear function $F$
coincide; i.e.,

$$\|f\| = \|F\|. \qquad (11.23)$$

Proof. Uniqueness follows from the observation that because of the positive
definiteness of the scalar product, $f = 0$ is the only element representing
the zero function $F = 0$ in the sense of (11.22). For $F \ne 0$ choose
$w \in X$ with $F(w) \ne 0$. Since $F$ is continuous, the nullspace

$$N(F) = \{ u \in X : F(u) = 0 \}$$

can be seen to be a closed, and consequently, by Remark 3.40, a complete,
subspace of the Hilbert space $X$. By the approximation Theorem 3.52 there
exists the best approximation $v$ to $w$ with respect to $N(F)$. By Theorem
3.51 it satisfies $w - v \perp N(F)$. Then for $g := w - v$ we have that

$$(F(g)u - F(u)g, g) = 0, \quad u \in X,$$

since $F(g)u - F(u)g \in N(F)$ for all $u \in X$. Hence,

$$F(u) = \left( u, \frac{\overline{F(g)}\, g}{\|g\|^2} \right)$$

for all $u \in X$, which completes the proof of (11.22).
From (11.22) and the Cauchy-Schwarz inequality we have that

$$|F(u)| \le \|f\| \|u\|, \quad u \in X,$$

whence $\|F\| \le \|f\|$ follows. On the other hand, inserting $f$ into (11.22)
yields

$$\|f\|^2 = F(f) \le \|F\| \|f\|,$$

and therefore $\|f\| \le \|F\|$. This concludes the proof of the norm equality
(11.23). □

Definition 11.12 A linear operator $A : X \to X$ in a pre-Hilbert space $X$
is called strictly coercive if there exists a constant $c > 0$ such that

$$\operatorname{Re}(Au, u) \ge c \|u\|^2 \qquad (11.24)$$

for all $u \in X$.

Theorem 11.13 (Lax-Milgram) In a Hilbert space $X$ a bounded and
strictly coercive linear operator $A : X \to X$ has a bounded inverse
$A^{-1} : X \to X$.

Proof. Using the Cauchy-Schwarz inequality, we can estimate

$$\|Au\| \|u\| \ge \operatorname{Re}(Au, u) \ge c \|u\|^2.$$

Hence

$$\|Au\| \ge c \|u\| \qquad (11.25)$$

for all $u \in X$. From (11.25) we observe that $Au = 0$ implies $u = 0$; i.e.,
$A$ is injective.
Next we show that the range $A(X)$ is closed. Let $v$ be an element of the
closure $\overline{A(X)}$ and let $(v_n)$ be a sequence from $A(X)$ with
$v_n \to v$, $n \to \infty$. Then we can write $v_n = Au_n$ with some
$u_n \in X$, and from (11.25) we find that

$$c \|u_n - u_m\| \le \|v_n - v_m\|$$

for all $n, m \in \mathbb{N}$. Therefore, $(u_n)$ is a Cauchy sequence in $X$ and
converges: $u_n \to u$, $n \to \infty$, with some $u \in X$. Then $v = Au$, since
$A$ is continuous, and $A(X) = \overline{A(X)}$ is proven.
From Remark 3.40 we now have that $A(X)$ is complete. Let $w \in X$ be
arbitrary and denote by $v$ its best approximation with respect to $A(X)$,
which uniquely exists by Theorem 3.52. Then, by Theorem 3.51, we have
$(w - v, u) = 0$ for all $u \in A(X)$. In particular, $(w - v, A(w - v)) = 0$.
Hence, from (11.24) we see that $w = v \in A(X)$. Therefore, $A$ is surjective.
Finally, the boundedness of the inverse

$$\|A^{-1}\| \le \frac{1}{c} \qquad (11.26)$$

is a consequence of (11.25). □
Definition 11.14 Let $X$ be a complex (or real) linear space. Then a function
$S : X \times X \to \mathbb{C}$ (or $\mathbb{R}$) is called sesquilinear if it is
linear with respect to the first variable and antilinear with respect to the
second variable, i.e., if

$$S(\alpha u + \beta v, w) = \alpha S(u, w) + \beta S(v, w)$$

and

$$S(u, \alpha v + \beta w) = \bar{\alpha} S(u, v) + \bar{\beta} S(u, w)$$

for all $u, v, w \in X$ and $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$). A
sesquilinear function on a normed space $X$ is called bounded if

$$|S(u, v)| \le C \|u\| \|v\|$$

for all $u, v \in X$ and some positive constant $C$. It is called strictly
coercive if

$$\operatorname{Re} S(u, u) \ge c \|u\|^2$$

for all $u \in X$ and some positive constant $c$.

Note that for a real linear space, sesquilinear functions are bilinear
functions, i.e., linear with respect to both variables. Each bounded and
strictly coercive linear operator $A : X \to X$ in a pre-Hilbert space defines
a bounded and strictly coercive sesquilinear function by

$$S(u, v) := (u, Av), \quad u, v \in X.$$

The converse of this statement is described by the following theorem.


Theorem 11.15 Let $S$ be a bounded and strictly coercive sesquilinear
function on a Hilbert space $X$. Then there exists a uniquely determined
bounded and strictly coercive linear operator $A : X \to X$ such that

$$S(u, v) = (u, Av)$$

for all $u, v \in X$.

Proof. For each $v \in X$ the mapping $u \mapsto S(u, v)$ clearly defines a
bounded linear function on $X$, since $|S(u, v)| \le C \|u\| \|v\|$. By the
Riesz Theorem 11.11 we can write $S(u, v) = (u, f)$ for all $u \in X$ and some
$f \in X$. Therefore, setting $Av := f$ we define an operator $A : X \to X$ such
that $S(u, v) = (u, Av)$ for all $u, v \in X$.
To show that $A$ is linear we observe that

$$(u, \alpha Av + \beta Aw) = \bar{\alpha}(u, Av) + \bar{\beta}(u, Aw) = \bar{\alpha} S(u, v) + \bar{\beta} S(u, w) = S(u, \alpha v + \beta w) = (u, A[\alpha v + \beta w])$$

for all $u, v, w \in X$ and all $\alpha, \beta \in \mathbb{C}$. The boundedness of
$A$ follows from

$$\|Au\|^2 = (Au, Au) = S(Au, u) \le C \|Au\| \|u\|,$$

and the strict coercivity of $A$ is a consequence of the strict coercivity of
$S$.
To show uniqueness of the operator $A$ we suppose that there exist two
operators $A_1$ and $A_2$ with the property

$$S(u, v) = (u, A_1 v) = (u, A_2 v)$$

for all $u, v \in X$. Then we have $(u, A_1 v - A_2 v) = 0$ for all
$u, v \in X$, which implies $A_1 v = A_2 v$ for all $v \in X$ by setting
$u = A_1 v - A_2 v$. □

Corollary 11.16 Let $S$ be a bounded and strictly coercive sesquilinear
function and $F$ a bounded linear function on a Hilbert space $X$. Then there
exists a unique $u \in X$ such that

$$S(v, u) = F(v) \qquad (11.27)$$

for all $v \in X$.

Proof. By Theorem 11.15 there exists a uniquely determined bounded and
strictly coercive linear operator $A$ such that

$$S(v, u) = (v, Au)$$

for all $u, v \in X$, and by Theorem 11.11 there exists a uniquely determined
element $f$ such that

$$F(v) = (v, f)$$

for all $v \in X$. Hence, the equation (11.27) is equivalent to the equation

$$Au = f.$$

However, the latter equation is uniquely solvable as a consequence of the
Lax-Milgram Theorem 11.13. □

Since the coercivity constants for $A$ and $S$ coincide, from (11.23) and
(11.26) we conclude that

$$\|u\| \le \frac{1}{c}\, \|F\| \qquad (11.28)$$

for the unique solution $u$ of (11.27).


Let $A : X \to X$ be a bounded linear operator. Then, given $f \in X$,
solving the equation $Au = f$ obviously is equivalent to finding $u \in X$ such
that

$$(v, Au) = (v, f) \qquad (11.29)$$

for all $v \in X$. The Galerkin method, named after the Russian engineer
Galerkin, is based on this observation, and given a finite-dimensional
subspace $X_n \subset X$, it approximately solves (11.29) by an element
$u_n \in X_n$ such that

$$(v, Au_n) = (v, f) \qquad (11.30)$$

for all $v \in X_n$. By Theorems 3.51 and 3.52, the condition (11.30) is
equivalent to the fact that the best approximations to $Au_n$ and to $f$ with
respect to $X_n$ coincide; i.e.,

$$P_n A u_n = P_n f, \qquad (11.31)$$

where $P_n$ denotes the orthogonal projection operator from $X$ onto $X_n$.
The equivalence of (11.30) and (11.31) is the reason why the Galerkin
method belongs to the so-called projection methods; i.e., the equation to be
approximated is projected onto a finite-dimensional subspace.
To analyze the Galerkin method we introduce a finite-dimensional operator
$A_n : X_n \to X_n$ by $A_n := P_n A$. Then, by Theorem 3.51, we have

$$(A_n u, u) = (P_n Au, u) = (Au, u) + (P_n Au - Au, u) = (Au, u)$$

for all $u \in X_n$. Hence from the strict coercivity of $A$ we deduce that

$$\operatorname{Re}(A_n u, u) = \operatorname{Re}(Au, u) \ge c \|u\|^2$$

for all $u \in X_n$; i.e., $A_n : X_n \to X_n$ is strictly coercive with the same
coercivity constant $c$ as $A : X \to X$. This now can be employed to prove
the following theorem.

Theorem 11.17 For a bounded and strictly coercive linear operator $A$ the
Galerkin equations (11.30) have a unique solution. It satisfies the error
estimate

$$\|u_n - u\| \le M \inf_{v \in X_n} \|v - u\|, \qquad (11.32)$$

where $M$ is some constant depending on $A$ (and not on $X_n$).

Proof. Since $A_n : X_n \to X_n$ is strictly coercive with coercivity constant
$c$, by the Lax-Milgram Theorem 11.13 we conclude that $A_n$ is bijective;
i.e., the Galerkin equations (11.30) have a unique solution $u_n \in X_n$. The
estimate (11.26) applied to the operator $A_n$ implies that

$$\|A_n^{-1}\| \le \frac{1}{c}. \qquad (11.33)$$

For the error $u_n - u$ between the Galerkin approximation $u_n$ and the
exact solution $u$ we can write

$$u_n - u = (A_n^{-1} P_n A - I)\, u = (A_n^{-1} P_n A - I)(u - v)$$

for all $v \in X_n$, since, trivially, we have $A_n^{-1} P_n A v = v$ for
$v \in X_n$. By Theorem 3.52 we have $\|P_n\| = 1$, and therefore, using Remark
3.25 and (11.33) we can estimate

$$\|u_n - u\| \le \left( 1 + \frac{1}{c}\, \|A\| \right) \|u - v\|$$

for all $v \in X_n$, whence (11.32) follows. □

The error estimate of Theorem 11.17 is usually referred to as Céa's
lemma, since it was first obtained by Céa in 1964. It indicates that the
error in the Galerkin method is determined by how well the exact solution
can be approximated by elements of the subspace $X_n$.
By Corollary 11.16 the Galerkin method immediately carries over to the
solution of the sesquilinear equation (11.27) and consists in finding
$u_n \in X_n$ such that

$$S(v, u_n) = F(v) \qquad (11.34)$$

for all $v \in X_n$.
The practical solution of the Galerkin equations (11.30) reduces to the
solution of a system of linear equations. If $w_1, \ldots, w_n$ is a basis for
$X_n$ (without loss of generality we assume the dimension of $X_n$ to be $n$),
then for

$$u_n = \sum_{k=1}^{n} \alpha_k w_k$$

the Galerkin equations (11.30) are equivalent to the system of linear
equations

$$\sum_{k=1}^{n} \alpha_k (w_j, A w_k) = (w_j, f), \quad j = 1, \ldots, n. \qquad (11.35)$$
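
To make the linear system (11.35) concrete, here is a small sketch in the simplest possible setting. Everything specific in it is an assumption chosen for illustration: the Hilbert space is taken to be R^3 with the Euclidean scalar product, the operator is a 3 x 3 matrix whose symmetric part is diag(4, 3, 5) (so it is strictly coercive with c = 3), and the subspace is spanned by two hypothetical basis vectors. The code assembles the Galerkin matrix (w_j, A w_k), solves for the coefficients, and returns the Galerkin approximation.

```python
def dot(x, y):
    """Euclidean scalar product."""
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def solve_dense(M, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for i in range(k + 1, n):
            factor = aug[i][k] / aug[k][k]
            for j in range(k, n + 1):
                aug[i][j] -= factor * aug[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def galerkin(A, f, basis):
    """Solve sum_k alpha_k (w_j, A w_k) = (w_j, f) and return u_n."""
    G = [[dot(wj, matvec(A, wk)) for wk in basis] for wj in basis]
    rhs = [dot(wj, f) for wj in basis]
    alpha = solve_dense(G, rhs)
    return [sum(a * w[i] for a, w in zip(alpha, basis)) for i in range(len(f))]


# Assumed example data (not from the text).
A = [[4.0, 1.0, 0.0], [-1.0, 3.0, 1.0], [0.0, -1.0, 5.0]]
f = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

if __name__ == "__main__":
    u_n = galerkin(A, f, basis)
    residual = [a - b for a, b in zip(matvec(A, u_n), f)]
    print(u_n, [dot(w, residual) for w in basis])
```

The residual A u_n - f is not zero, but it is orthogonal to every basis vector, which is exactly the projection property (11.31).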

From this formulation it becomes obvious that the Galerkin method is
only a semidiscrete method, since setting up the linear system requires
the evaluation of scalar products and of the operator $A$ applied to the
basis elements. For a fully discrete method these computations, in general,
need further approximations of integrals for the scalar products and of
differential or integral operators. This also requires that the error analysis
be amended accordingly, since the error estimate of Theorem 11.17 covers
only the semidiscrete case.
Having outlined the basic ideas of the Galerkin method and its error
analysis within a few paragraphs, we want to point out clearly that the
power and the art of the application of the Galerkin method for the
approximate solution of differential and integral equations begins with the
proper choice of the approximating subspace $X_n$ and the appropriate basis
$w_1, \ldots, w_n$ therein, corresponding to the operator $A$ under
consideration. However, it is beyond our goal to enter into this important
topic in any detail aside from the short discussion in Section 11.5.

11.4 Weak Solutions


We return to the boundary value problem, and instead of (11.7)-(11.8) we
consider the slightly more general so-called Sturm-Liouville problem

-(pu')' + qu = r in [a, b] (11.36)

with homogeneous boundary conditions

u(a) = u(b) = O. (11.37)

Here we assume that $p \in C^1[a, b]$ and $q, r \in C[a, b]$ such that
$p(x) > 0$ and $q(x) \ge 0$ for all $x \in [a, b]$. Multiplying the differential
equation by $v$ and performing a partial integration, it follows that each
solution $u$ to (11.36)-(11.37) satisfies

$$S(v, u) = F(v) \qquad (11.38)$$

for all $v \in C^1[a, b]$ with $v(a) = v(b) = 0$, where we have set

$$S(u, v) := \int_a^b (p u' v' + q u v)\, dx \qquad (11.39)$$

and

$$F(v) := \int_a^b r v \, dx. \qquad (11.40)$$

Conversely, if $u \in C^2[a, b]$ satisfies (11.38), by partial integration we
obtain that

$$\int_a^b \left[ (p u')' - q u + r \right] v \, dx = 0 \qquad (11.41)$$

for all $v \in C^1[a, b]$ with $v(a) = v(b) = 0$. Now we set
$f := (pu')' - qu + r$ and assume that $f(x_0) \ne 0$ for some $x_0$ in $(a, b)$,
say $f(x_0) > 0$. Since $f$ is continuous, there exists an interval
$U \subset (a, b)$ such that $f$ is positive on $U$. Now we choose a nonnegative
function $v \ne 0$ from $C^1[a, b]$ which vanishes outside $U$. For this
function $v$ the integral in (11.41) must be positive. This is a contradiction,
and therefore $f$ must vanish identically; i.e., $u$ satisfies the differential
equation (11.36). Therefore, (11.38) provides an equivalent reformulation of
the boundary value problem.
From Example 3.38 we recall that the space of continuous functions is
not complete with respect to the $L^2$ scalar product. However, if we wish
to apply the analysis of the previous section and, in particular, Corollary
11.16, then we need a Hilbert space. For this, we introduce the Sobolev space
$H^1[a, b]$ based on the concept of weak derivatives. By $L^2[a, b]$ we denote
the space of measurable real-valued functions defined on the interval $[a, b]$
that are square-integrable in the sense of Lebesgue. We shall make use of the
fact that $L^2[a, b]$ is a Hilbert space with respect to the $L^2$ scalar
product. (More precisely, $L^2[a, b]$ is the linear space of equivalence classes
of functions coinciding almost everywhere.) Note that the space $C[a, b]$ of
continuous functions is dense in $L^2[a, b]$ (see [5, 51, 59]).

Definition 11.18 A function $u \in L^2[a, b]$ is said to have a weak derivative
$u' \in L^2[a, b]$ if

$$\int_a^b u v' \, dx = -\int_a^b u' v \, dx \qquad (11.42)$$

for all $v \in C^1[a, b]$ with $v(a) = v(b) = 0$.


By partial integration it follows that (11.42) is satisfied for functions
$u \in C^1[a, b]$. Hence, weak differentiability generalizes classical
differentiability.
From the denseness of $\{ v \in C^1[a, b] : v(a) = v(b) = 0 \}$ in $L^2[a, b]$,
or from the Fourier series for the odd extension of $u$, it can be seen that the
weak derivative, if it exists, is unique (see Problem 11.17). From the
denseness of $C[a, b]$ in $L^2[a, b]$, or from the Fourier series for the even
extension of $u$, it follows that each function with vanishing weak derivative
must be constant almost everywhere (see Problem 11.17). The latter, in
particular, implies

$$u(x) = \int_a^x u'(\xi)\, d\xi + c \qquad (11.43)$$

for almost all $x \in [a, b]$ and some constant $c$, since by Fubini's theorem

$$\int_a^b \left( \int_a^x u'(\xi)\, d\xi \right) v'(x)\, dx = \int_a^b u'(\xi) \left( \int_\xi^b v'(x)\, dx \right) d\xi = -\int_a^b u'(\xi)\, v(\xi)\, d\xi$$

for all $v \in C^1[a, b]$ with $v(a) = v(b) = 0$. Hence both sides of (11.43) have
the same weak derivative.

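Theorem 11.19 The linear space

$$H^1[a, b] := \{ u \in L^2[a, b] : u' \in L^2[a, b] \}$$

endowed with the scalar product

$$(u, v)_{H^1} := \int_a^b (u v + u' v')\, dx \qquad (11.44)$$

is a Hilbert space.

Proof. It is readily checked that $H^1[a, b]$ is a linear space and that (11.44)
defines a scalar product. Let $(u_n)$ denote an $H^1$ Cauchy sequence. Then
$(u_n)$ and $(u_n')$ are both $L^2$ Cauchy sequences. From the completeness of
$L^2[a, b]$ we obtain the existence of $u \in L^2[a, b]$ and $w \in L^2[a, b]$
such that $\|u_n - u\|_2 \to 0$ and $\|u_n' - w\|_2 \to 0$ as $n \to \infty$. Then
for all $v \in C^1[a, b]$ with $v(a) = v(b) = 0$ we can estimate

$$\left| \int_a^b (u v' + w v)\, dx \right| = \left| \int_a^b \{ (u - u_n) v' + (w - u_n') v \}\, dx \right| \le \|u - u_n\|_2 \|v'\|_2 + \|w - u_n'\|_2 \|v\|_2 \to 0, \quad n \to \infty.$$

Therefore, $u \in H^1[a, b]$ with $u' = w$, and $\|u - u_n\|_{H^1} \to 0$,
$n \to \infty$, which completes the proof. □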

Theorem 11.20 $C^1[a, b]$ is dense in $H^1[a, b]$.

Proof. Since $C[a, b]$ is dense in $L^2[a, b]$, for each $u \in H^1[a, b]$ and
$\varepsilon > 0$ there exists $w \in C[a, b]$ such that
$\|u' - w\|_2 < \varepsilon$. Then we define $v \in C^1[a, b]$ by

$$v(x) := u(a) + \int_a^x w(\xi)\, d\xi,$$

and using (11.43), we have

$$u(x) - v(x) = \int_a^x \{ u'(\xi) - w(\xi) \}\, d\xi.$$

By the Cauchy-Schwarz inequality this implies
$\|u - v\|_2 < (b - a)\varepsilon$, and the proof is complete. □

Theorem 11.21 $H^1[a, b]$ is contained in $C[a, b]$.

Proof. From (11.43) we have

$$u(x) - u(y) = \int_y^x u'(\xi)\, d\xi, \qquad (11.45)$$

whence by the Cauchy-Schwarz inequality,

$$|u(x) - u(y)| \le |x - y|^{1/2}\, \|u'\|_2$$

follows for all $x, y \in [a, b]$. Therefore, every function $u \in H^1[a, b]$
belongs to $C[a, b]$, or more precisely, it coincides almost everywhere with a
continuous function. □

By Theorem 11.21 we may consider $H^1[a, b]$ as a subspace of $C[a, b]$.
Choose $y \in [a, b]$ such that $|u(y)| = \min_{a \le x \le b} |u(x)|$. Then from

$$(b - a)\, |u(y)|^2 \le \int_a^b |u(x)|^2\, dx$$

and (11.45), by the Cauchy-Schwarz inequality we find that

$$\|u\|_\infty \le C \|u\|_{H^1}$$

for some constant $C$. The latter inequality means that the $H^1$ norm is
stronger than the maximum norm (in one space dimension!).

Theorem 11.22 The space

$$H_0^1[a, b] := \{ u \in H^1[a, b] : u(a) = u(b) = 0 \}$$

is a complete subspace of $H^1[a, b]$.

Proof. Since the $H^1$ norm is stronger than the maximum norm, each $H^1$
convergent sequence of elements of $H_0^1[a, b]$ has its limit in
$H_0^1[a, b]$. Therefore $H_0^1[a, b]$ is a closed subspace of $H^1[a, b]$, and
the statement follows from Remark 3.40. □

Definition 11.23 A function $u \in H_0^1[a, b]$ is called a weak solution to
the boundary value problem (11.36)-(11.37) if (11.38) is satisfied for all
$v \in H_0^1[a, b]$.

Theorem 11.24 Assume that $p > 0$ and $q \ge 0$. Then there exists a unique
weak solution to the boundary value problem (11.36)-(11.37).

Proof. The sesquilinear function
$S : H_0^1[a, b] \times H_0^1[a, b] \to \mathbb{R}$ is bounded, since

$$|S(u, v)| \le \max\{ \|p\|_\infty, \|q\|_\infty \}\, \|u\|_{H^1} \|v\|_{H^1}$$

by the Cauchy-Schwarz inequality. For $u \in H_0^1[a, b]$, from (11.45) and the
Cauchy-Schwarz inequality we obtain that

$$\|u\|_2^2 \le (b - a)^2\, \|u'\|_2^2.$$

Hence we can estimate

$$S(u, u) \ge \min_{a \le x \le b} p(x) \int_a^b |u'|^2\, dx \ge c\, \|u\|_{H^1}^2$$

for all $u \in H_0^1[a, b]$ and some positive constant $c$; i.e., $S$ is strictly
coercive. Finally, by the Cauchy-Schwarz inequality we have

$$|F(v)| \le \|r\|_2 \|v\|_2 \le \|r\|_2 \|v\|_{H^1};$$

i.e., the linear function $F : H_0^1[a, b] \to \mathbb{R}$ is bounded. Now the
statement of the theorem follows from Corollary 11.16. □

We note that from (11.28) and the previous inequality it follows that

$$\|u\|_{H^1} \le \frac{1}{c}\, \|r\|_2 \qquad (11.46)$$

for the weak solution $u$ to the boundary value problem (11.36)-(11.37).

Theorem 11.25 Each weak solution to the boundary value problem
(11.36)-(11.37) is also a classical solution; i.e., it is twice continuously
differentiable.

Proof. Define

$$f(x) := \int_a^x \left[ q(\xi) u(\xi) - r(\xi) \right] d\xi, \quad x \in [a, b].$$

Then $f \in C^1[a, b]$. From (11.38), by partial integration we obtain

$$\int_a^b [p u' - f]\, v' \, dx = 0$$

for all $v \in H_0^1[a, b]$. Now we set

$$c := \frac{1}{b - a} \int_a^b [p u' - f]\, d\xi$$

and

$$v_0(x) := \int_a^x \left[ p(\xi) u'(\xi) - f(\xi) - c \right] d\xi, \quad x \in [a, b].$$

Then $v_0 \in H_0^1[a, b]$ and

$$\int_a^b [p u' - f - c]^2\, dx = \int_a^b [p u' - f - c]\, v_0' \, dx = 0.$$

Hence

$$p u' = f + c,$$

and since $f$ and $p$ are in $C^1[a, b]$ with $p(x) > 0$ for all
$x \in [a, b]$, we can conclude that $u' \in C^1[a, b]$ and

$$(p u')' = f' = q u - r.$$

This completes the proof. □

Using the differential equation (11.36), from (11.46) we conclude that


there exists a constant C > 0, independent of r, such that
(11.47)
which we note for later use.
As compared to Theorem 11.4 we have not obtained any major extension
of the existence result. However, as pointed out already in the introduction,
we view this section as a model case for the more complicated situation of
partial differential equations. By partial integration it can be seen that
the boundary value problem (11.17)-(11.18) for the Laplace operator is
equivalent to finding a function $u \in C^2(D)$ satisfying $u = 0$ on
$\partial D$ and

$$\int_D \left( [\operatorname{grad} u]^T \operatorname{grad} v + q u v \right) dx = \int_D r v \, dx \qquad (11.48)$$

for all $v \in C^1(D)$ with $v = 0$ on $\partial D$. The analysis of weak
solutions of (11.48) follows the same pattern as for the ordinary differential
equation (11.36). However, the details are more heavily involved. In
particular, since for the multidimensional case the Sobolev space $H^1(D)$ no
longer is a subspace of the continuous functions, the formulation of the
boundary condition, i.e., the definition of the subspace $H_0^1(D)$, has to be
modified, and establishing that weak solutions are also classical solutions is
more complicated. For a comprehensive study of weak solutions to boundary
value problems for elliptic partial differential equations we refer to
[24, 60].

11.5 The Finite Element Method


The finite element method for the boundary value problem (11.36)-(11.37)
consists in the application of the Galerkin method (11.34) to the weak
280 11. Boundary Value Problems

formulation (11.38) by using spline spaces as approximating subspaces.


Then, for appropriate basis functions, the matrix S(Wj, Wk) will be sparse;
i.e., most of the matrix entries will be zero. Polynomials as approximating
subspaces are not suitable, since analogously to Example 5.1, they lead to
ill-conditioned linear systems with full matrices.
We consider the case of linear splines. For the equidistant grid
\[ x_j := a + jh, \quad j = 0, \ldots, n + 1, \]
with step size $h = (b - a)/(n + 1)$ and $n \in \mathbb{N}$ we choose for $X_n$ the space
of continuous piecewise linear functions; i.e., $X_n$ consists of the functions
$u \in C[a, b]$ that satisfy $u(a) = u(b) = 0$ and coincide on each subinterval
$[x_{j-1}, x_j]$ with a polynomial in $P_1$ for $j = 1, \ldots, n + 1$. The functions in this
spline space belong to $H_0^1[a, b]$ with piecewise constant weak derivatives.
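As basis elements in $X_n$ we take the so-called hat functions
\[ w_k(x) := \begin{cases} \dfrac{1}{h}\,(x - x_{k-1}), & x \in [x_{k-1}, x_k], \\[1ex] \dfrac{1}{h}\,(x_{k+1} - x), & x \in [x_k, x_{k+1}], \\[1ex] 0, & \text{otherwise}, \end{cases} \qquad k = 1, \ldots, n. \]
Each $u \in X_n$ can be represented in the form
\[ u(x) = \sum_{k=1}^{n} \alpha_k w_k(x), \]
where $\alpha_k = u(x_k)$, $k = 1, \ldots, n$. Obviously, we have
\[ S(w_j, w_k) = \int_a^b \{ p\, w_j' w_k' + q\, w_j w_k \} \, dx = 0 \]
if $(x_{j-1}, x_{j+1}) \cap (x_{k-1}, x_{k+1}) = \emptyset$, i.e., if $|j - k| \ge 2$. Therefore, the matrix
$S(w_j, w_k)$ is tridiagonal. We compute the matrix elements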
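\[ S(w_j, w_j) = \frac{1}{h^2} \int_{x_{j-1}}^{x_{j+1}} p(x)\, dx + \frac{1}{h^2} \int_{x_{j-1}}^{x_j} q(x)(x - x_{j-1})^2\, dx + \frac{1}{h^2} \int_{x_j}^{x_{j+1}} q(x)(x_{j+1} - x)^2\, dx \]
and
\[ S(w_j, w_{j+1}) = S(w_{j+1}, w_j) = -\frac{1}{h^2} \int_{x_j}^{x_{j+1}} p(x)\, dx + \frac{1}{h^2} \int_{x_j}^{x_{j+1}} q(x)(x_{j+1} - x)(x - x_j)\, dx, \]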

and the right-hand sides
\[ F(w_j) = \frac{1}{h} \left\{ \int_{x_{j-1}}^{x_j} r(x)(x - x_{j-1})\, dx + \int_{x_j}^{x_{j+1}} r(x)(x_{j+1} - x)\, dx \right\}. \]

These equations illustrate two general features of finite element methods.
Firstly, it is characteristic of the finite element method that the
coefficients are computed by the same formula for each subinterval, i.e., for
each of the finite elements into which the total interval is subdivided.
Secondly, as already mentioned earlier, the Galerkin method is only
semidiscrete. In order to make it fully discrete, a numerical quadrature
has to be applied. If we remain within our framework of approximations
and approximate $p$, $q$, and $r$ by linear splines, we obtain
\[ S(w_j, w_j) \approx \frac{1}{2h}\,(p_{j-1} + 2p_j + p_{j+1}) + \frac{h}{12}\,(q_{j-1} + 6q_j + q_{j+1}) \]
and
\[ S(w_j, w_{j+1}) \approx -\frac{1}{2h}\,(p_j + p_{j+1}) + \frac{h}{12}\,(q_j + q_{j+1}) \]
for the matrix elements, and
\[ F(w_j) \approx \frac{h}{6}\,(r_{j-1} + 4r_j + r_{j+1}) \]
for the right-hand sides. Here, as above, we have set $p_j = p(x_j)$, $q_j = q(x_j)$,
and $r_j = r(x_j)$. Similar to the linear system (11.10)-(11.11) for the finite
difference method, the tridiagonal linear system is irreducible and weakly
row-diagonally dominant. It also is accessible to convergence acceleration
of the Jacobi iterations by relaxation and multigrid methods.
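
The fully discrete scheme can be sketched in a few lines of Python. This is a minimal illustration, not the book's program: it assumes that (11.36)-(11.37) has the form $-(pu')' + qu = r$ with $u(a) = u(b) = 0$, which is what the bilinear form $S$ above corresponds to, and it uses the matrix entries and right-hand sides that result from integrating the linear spline interpolants of $p$, $q$, and $r$ exactly. The function name `fem_linear_splines` and the dense solve via NumPy are choices made here for brevity; a tridiagonal solver would be used in practice.

```python
import numpy as np

def fem_linear_splines(p, q, r, a, b, n):
    """Hat-function Galerkin method for -(p u')' + q u = r, u(a) = u(b) = 0.

    p, q, r are vectorized callables; they are replaced by their linear
    spline interpolants on the grid x_j = a + j h, j = 0, ..., n + 1.
    Returns the grid and the nodal values (boundary values included).
    """
    h = (b - a) / (n + 1)
    x = a + h * np.arange(n + 2)
    pv, qv, rv = p(x), q(x), r(x)

    j = np.arange(1, n + 1)                      # interior nodes
    diag = (pv[j - 1] + 2 * pv[j] + pv[j + 1]) / (2 * h) \
        + h * (qv[j - 1] + 6 * qv[j] + qv[j + 1]) / 12
    jo = np.arange(1, n)                         # couples node j with node j + 1
    off = -(pv[jo] + pv[jo + 1]) / (2 * h) + h * (qv[jo] + qv[jo + 1]) / 12
    rhs = h * (rv[j - 1] + 4 * rv[j] + rv[j + 1]) / 6

    # Tridiagonal Galerkin matrix, stored densely only to keep the sketch short.
    S = np.diag(diag) + np.diag(off, 1) + np.diag(off, -1)
    u = np.zeros(n + 2)
    u[1:-1] = np.linalg.solve(S, rhs)
    return x, u

# Illustrative test problem (an assumption, not from the text):
# -u'' + u = (pi^2 + 1) sin(pi x) on [0, 1] with exact solution sin(pi x).
x, u = fem_linear_splines(lambda t: np.ones_like(t),
                          lambda t: np.ones_like(t),
                          lambda t: (np.pi**2 + 1) * np.sin(np.pi * t),
                          0.0, 1.0, 50)
print("max nodal error:", np.max(np.abs(u - np.sin(np.pi * x))))
```

Halving $h$ in this sketch should reduce the nodal error by roughly a factor of four for smooth data.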
In order to derive an error estimate for the semidiscrete version of the
finite element method with linear splines from Theorem 11.17, we need an
estimate for the interpolation error for linear splines with respect to the
$H^1$ norm (see also Theorem 8.33).

Lemma 11.26 Let $f \in C^2[a, b]$. Then the remainder $R_1 f := f - L_1 f$
for the linear interpolation at the two endpoints $a$ and $b$ can be estimated
by
\[ \|R_1 f\|_{L^2} \le (b - a)^2\, \|f''\|_{L^2}, \qquad \|(R_1 f)'\|_{L^2} \le (b - a)\, \|f''\|_{L^2}. \tag{11.49} \]
Proof. For each function $g \in C^1[a, b]$ satisfying $g(a) = 0$, from
\[ g(x) = \int_a^x g'(\xi)\, d\xi \]
by using the Cauchy-Schwarz inequality we obtain
\[ |g(x)|^2 \le (b - a)\, \|g'\|_{L^2}^2, \quad x \in [a, b]. \]
From this, by integration we derive the Friedrich inequality
\[ \|g\|_{L^2} \le (b - a)\, \|g'\|_{L^2} \tag{11.50} \]
for functions $g \in C^1[a, b]$ with $g(a) = 0$ (or $g(b) = 0$). Using the
interpolation property $(R_1 f)(a) = (R_1 f)(b) = 0$ and $(L_1 f)'' = 0$, by partial
integration we obtain
\[ \int_a^b [(R_1 f)'(x)]^2\, dx = -\int_a^b (R_1 f)(x)\, (R_1 f)''(x)\, dx = -\int_a^b (R_1 f)(x)\, f''(x)\, dx. \]
From this, again applying the Cauchy-Schwarz inequality, we have
\[ \|(R_1 f)'\|_{L^2}^2 \le \|f''\|_{L^2}\, \|R_1 f\|_{L^2}, \]
whence (11.49) follows with the aid of Friedrich's inequality (11.50) for
$g = R_1 f$. □

Theorem 11.27 The error in the finite element approximation by linear
splines for the boundary value problem (11.36)-(11.37) can be estimated by
\[ \|u_n - u\|_{H^1} \le C\, \|u''\|_{L^2}\, h \tag{11.51} \]
for some positive constant $C$.

Proof. By summing up the squares of the inequalities (11.49), applied to each
of the subintervals of length $h$, for the interpolating linear spline $w_n \in X_n$
with $w_n(x_j) = u(x_j)$ for $j = 0, \ldots, n + 1$ we find that
\[ \|w_n' - u'\|_{L^2} \le \|u''\|_{L^2}\, h \]
and
\[ \|w_n - u\|_{L^2} \le \|u''\|_{L^2}\, h^2, \]
whence
\[ \inf_{v \in X_n} \|v - u\|_{H^1} \le \|w_n - u\|_{H^1} \le \sqrt{1 + h^2}\; \|u''\|_{L^2}\, h \]
follows. Now (11.51) is a consequence of the error estimate for the Galerkin
method of Theorem 11.17. □

By the following trick, which was independently developed by Aubin
(1967) and Nitsche (1968), we can improve the error estimate in the $L^2$
norm to the order $O(h^2)$ that we expect for approximations using linear
splines.

Theorem 11.28 The error in the finite element approximation by linear
splines for the boundary value problem (11.36)-(11.37) can be estimated by
\[ \|u_n - u\|_{L^2} \le C\, \|u''\|_{L^2}\, h^2 \]
with some positive constant $C$.
Proof. Denote by $z_n$ the weak solution to the boundary value problem with
the right-hand side $u - u_n$; i.e.,
\[ S(v, z_n) = (v, u - u_n)_{L^2} \]
for all $v \in H_0^1[a, b]$. In particular, inserting $v = u - u_n$, it follows that
\[ \|u - u_n\|_{L^2}^2 = S(u - u_n, z_n). \tag{11.52} \]
Since $S(v, u) = F(v)$ and $S(v, u_n) = F(v)$ for all $v \in X_n$, using the
symmetry of $S$ we have
\[ S(u - u_n, v) = 0 \]
for all $v \in X_n$. Inserting the Galerkin approximation to $z_n$, which we denote
by $\tilde z_n$, into the last equation and subtracting from (11.52), we obtain
\[ \|u - u_n\|_{L^2}^2 = S(u - u_n, z_n - \tilde z_n). \tag{11.53} \]
Since $S$ is bounded, from (11.53) and (11.51), applied to $u - u_n$ and $z_n - \tilde z_n$,
we can conclude that
\[ \|u - u_n\|_{L^2}^2 \le C_1\, \|u''\|_{L^2}\, \|z_n''\|_{L^2}\, h^2 \]
for some constant $C_1$. However, from (11.47) we also have that
\[ \|z_n''\|_{L^2} \le C_2\, \|u - u_n\|_{L^2} \]
for some constant $C_2$. Now the assertion of the theorem follows from the
last two inequalities. □

We refrain from describing both the extension of this analysis to higher-


order splines such as cubic splines (see Problem 11.19) and the extension
to partial differential equations. For the latter we refer to [4, 11].

Problems
11.1 Consider multiple shooting for the boundary value problem
\[ u'' + u = 0, \quad u(a) = u(b) = 0, \]
with $n$ equidistant subintervals. Show that the corresponding linear system (11.6)
is uniquely solvable, provided that $(b - a)/\pi \notin \mathbb{N}$.

11.2 Write a computer program for multiple shooting using the Newton method
and the Runge-Kutta method and test it for various examples.

11.3 Show that the boundary value problem for the differential equation
\[ u'' = f(x, u, u'), \quad a \le x \le b, \]
with inhomogeneous boundary conditions $u(a) = \alpha$ and $u(b) = \beta$ can be
equivalently transformed into a boundary value problem with homogeneous
boundary conditions.

11.4 Show that the general solution of the linear differential equation (11.7) is
given by (11.9).

11.5 For $p \in C^1[a, b]$ and $q \in C[a, b]$ show that the boundary value problem
\[ u'' + pu' + qu = r \ \text{ in } [a, b], \quad u(a) = u(b) = 0, \]
is solvable for each right-hand side $r \in C[a, b]$ if and only if the boundary value
problem
\[ u'' + pu' + qu = 0 \ \text{ in } [a, b], \quad u(a) = u(b) = 0, \]
admits only the trivial solution $u = 0$.

11.6 Find the solution of the boundary value problem
\[ u''(x) + u(x) = e^x, \quad u(0) = u(1) = 0. \]

11.7 Write a computer program for the finite difference method (11.10)-(11.11)
and test it for various examples.

11.8 Find the explicit solution for the finite difference approximation
(11.10)-(11.11) for the boundary value problem
\[ u'' - u = -2 \ \text{ in } [0, 1], \quad u(0) = u(1) = 0, \]
and verify the convergence result of Theorem 11.8.

11.9 Show that the error in the finite difference approximation (11.15) is of
order $O(h^2)$ and that the error in the approximation (11.16) is of order $O(h^4)$.

11.10 Prove the estimate $\|u_0\|_\infty \le 1/8$ for the solution to the boundary value
problem (11.21).

11.11 In the space $C[a, b]$ with scalar product

(u,v):= l b
u(x)v(x)dx,

define a functional $F : C[a, b] \to \mathbb{C}$ by

F(u):= l b
u(x)dx.

Show that $F$ is linear and bounded. Is there an $f \in C[a, b]$ such that $F(u) = (u, f)$
for all $u \in C[a, b]$? Does your answer agree with the Riesz Theorem 11.11?

11.12 In the pre-Hilbert space of Problem 11.11, for a fixed $x \in [a, b]$ consider
the point evaluation functional $F : C[a, b] \to \mathbb{C}$ defined by
\[ F(u) := u(x). \]
Is $F$ linear and bounded?

11.13 Let $X$ and $Y$ be Hilbert spaces and let $A : X \to Y$ be a bounded linear
operator. Show that there exists a uniquely determined bounded linear operator
$A^* : Y \to X$ such that
\[ (Au, v) = (u, A^* v) \]
for all $u \in X$ and $v \in Y$. The operator $A^*$ is called the adjoint operator of $A$.
Show that $\|A\| = \|A^*\|$.

11.14 Let $A : X \to X$ be a bounded, self-adjoint, and positive operator in
a Hilbert space $X$; i.e., $(Au, v) = (u, Av)$ for all $u, v \in X$ and $(Au, u) > 0$
for all $u \ne 0$. Choose $w_0 \in X$ and define $w_j = A w_{j-1}$ for $j = 1, \ldots, n - 1$.
Show that the Galerkin equations for $Au = f$ with respect to the subspaces
$X_n = \operatorname{span}\{w_0, \ldots, w_{n-1}\}$ are uniquely solvable for each $n \in \mathbb{N}$. Moreover, if $f$
is in the closure of $\operatorname{span}\{A^j w_0 : j = 0, 1, \ldots\}$, then the Galerkin approximation
$u_n$ converges to the solution of $Au = f$.
Show that in the special case $w_0 = f$ the approximations $u_n$ can be computed
iteratively by the formulae $u_0 = 0$, $p_0 = f$, and
\[ \begin{aligned} u_{n+1} &= u_n - \alpha_n p_n, \\ p_n &= r_n + \beta_{n-1} p_{n-1}, \\ r_n &= r_{n-1} - \alpha_{n-1} A p_{n-1}, \\ \alpha_{n-1} &= (r_{n-1}, p_{n-1}) / (A p_{n-1}, p_{n-1}), \\ \beta_{n-1} &= -(r_n, A p_{n-1}) / (A p_{n-1}, p_{n-1}). \end{aligned} \]
Here $r_n$ is the residual $r_n = A u_n - f$. This is the conjugate gradient method of
Hestenes and Stiefel.

11.15 Let $A : X \to X$ be a bounded, self-adjoint, and positive operator in a
Hilbert space $X$; i.e., $(Au, v) = (u, Av)$ for all $u, v \in X$ and $(Au, u) > 0$ for all
$u \ne 0$. Show that solving the equation $Au = f$ is equivalent to minimizing the
so-called energy functional
\[ E(v) := (v, Av) - 2 \operatorname{Re}(v, f) \]
on $X$. Show that the Galerkin approximation with respect to a subspace $X_n$ is
equivalent to minimizing $E$ on $X_n$. This method is known as the Rayleigh-Ritz
method.

11.16 Show that under the assumptions of Problem 11.15 for the Galerkin
equations the SOR method of Section 4.2 converges for $0 < \omega < 2$.

11.17 Show that the weak derivative, if it exists, is unique and that each func-
tion with vanishing weak derivative must be constant almost everywhere.

11.18 Write a computer program for the finite element method with linear
splines and test it for various examples. Compare the numerical results with
those for the finite difference method.

11.19 Let B-1,Bo,B1, ... ,Bn,Bn+1,Bn+2 denote the cubic B-splines for the
equidistant grid Xj := a+ jh, j = 0, ... ,n+ 1, with step size h = (b-a)/n. Show
that
Uo := Bo - 4B-1, U1 := B 1 - B_ 1,

Un .- B n - B n +2, Un+1 := B n - 4Bn+2


is a basis for 5:; n HJ[a,b], i.e., for the space of cubic splines vanishing at the
endpoints.
Using this basis, set up the Galerkin equations for the Sturm-Liouville problem
analogous to the case of linear splines treated in Section 11.5.

11.20 Formulate and prove analogues of Theorems 11.27 and 11.28 for the finite
element approximation using cubic splines as in Problem 11.19.
12
Integral Equations

The topic of the last chapter of this book is linear integral equations, of
which
\[ \int_a^b K(x, y)\varphi(y)\, dy = f(x), \quad x \in [a, b], \]
and
\[ \varphi(x) - \int_a^b K(x, y)\varphi(y)\, dy = f(x), \quad x \in [a, b], \]
are typical examples. In these equations the function $\varphi$ is the unknown,
and the so-called kernel $K$ and the right-hand side $f$ are given functions.
The above equations are called Fredholm integral equations of the first and
second kind, respectively. Since both the theory and the numerical approx-
imations for integral equations of the first kind are far more complicated
than for integral equations of the second kind, we will confine our presen-
tation to the latter case.
Integral equations provide an important tool for solving boundary value
problems for both ordinary and partial differential equations (see Problem
12.1 and [39]). Their historical development is closely related to the solution
of boundary value problems in potential theory in the last decades of the
nineteenth century. Progress in the theory of integral equations also had a
great impact on the development of functional analysis.
Omitting the proofs, we will present the main results of the Riesz theory
for compact operators as the foundation of the existence theory for integral
equations of the second kind. Then we will develop the fundamental ideas
of the Nyström method and the collocation method as the two most
important approaches for the numerical solution of these integral equations.
This is done in a general framework of operator equations and their
approximate solution, which makes the analysis more widely applicable. For
a comprehensive study of both the theory and the numerical solution of
linear integral equations we refer to [39].

12.1 The Riesz Theory


This section is devoted to a summary of some of the basic facts of the theory
of Fredholm integral equations of the second kind. The integral equations
formulated above carry the name of Fredholm, since in 1902 Fredholm
established an existence theory for integral equations of the second kind
with continuous kernels, which is now known as the Fredholm alternative.
For the purpose of this introduction to the numerical solution of integral
equations it suffices to consider only the first and most important part of
this alternative, which states that the inhomogeneous equation
\[ \varphi(x) - \int_a^b K(x, y)\varphi(y)\, dy = f(x), \quad x \in [a, b], \tag{12.1} \]
with continuous kernel $K$ has a unique solution $\varphi \in C[a, b]$ for each
right-hand side $f \in C[a, b]$ if and only if the homogeneous integral equation
\[ \varphi(x) - \int_a^b K(x, y)\varphi(y)\, dy = 0, \quad x \in [a, b], \tag{12.2} \]
has only the trivial solution. The importance of this result originates from
the fact that it reduces the difficult problem of establishing existence of
a solution to the inhomogeneous integral equation to the simpler problem
of showing that the homogeneous integral equation allows only the trivial
solution $\varphi = 0$, and it extends the corresponding statement for systems
of linear equations to the case of integral equations. Actually, Fredholm
derived his results by interpreting integral equations as a limiting case of
linear systems by considering the integral as a limit of Riemann sums and
passing to the limit in Cramer's rule for the solution of linear systems. For
the solution of integral equations with continuous kernels, Fredholm's ap-
proach is still the most elegant and shortest. However, since it is restricted
to the case of continuous kernels, it is more convenient to consider the
above equations as a special case of operator equations of the second kind
with a compact operator, as presented by Riesz in 1918.
Definition 12.1 A linear operator $A : X \to Y$ from a normed space $X$
into a normed space $Y$ is called compact if for each bounded sequence $(\varphi_n)$
in $X$ the sequence $(A\varphi_n)$ contains a convergent subsequence in $Y$, i.e., if
each sequence from the set $\{A\varphi : \varphi \in X, \|\varphi\| \le 1\}$ contains a convergent
subsequence.

Without developing the concept of compactness in normed spaces in any
detail, we note that this definition is equivalent to requiring that the set
$\{A\varphi : \varphi \in X, \|\varphi\| \le 1\}$ be relatively sequentially compact.
Compact operators are bounded, linear combinations of compact operators
are compact, and products of two bounded operators are compact if
one of them is compact (see Problem 12.2). From the Bolzano-Weierstrass
theorem it can be seen that bounded operators $A : X \to X$ with
finite-dimensional range $A(X) := \{A\varphi : \varphi \in X\}$ are compact. Furthermore, the
identity operator $I : X \to X$, defined by $I : \varphi \mapsto \varphi$ for all $\varphi \in X$, is
compact if and only if the space $X$ is finite-dimensional. This actually justifies
the distinction between the equations $A\varphi = f$ and $\varphi - A\varphi = f$ as equations
of the first and second kind, since $A$ and $I - A$ have different properties in
infinite-dimensional spaces if $A$ is compact. A proof of these facts and of
the following important theorem can be found in most introductory books
on functional analysis, for example in [39].
The fundamental result of the Riesz theory is described by the following
theorem, which extends Fredholm's result on the equivalence of injectivity
and surjectivity to the case of operator equations of the second kind with
a compact operator.

Theorem 12.2 Let $A : X \to X$ be a compact operator in a normed space
$X$. Then $I - A$ is surjective if and only if it is injective. If the inverse
operator $(I - A)^{-1} : X \to X$ exists, it is bounded.
In order to verify that Fredholm's existence analysis for integral equations
with continuous kernels $K : [a, b] \times [a, b] \to \mathbb{R}$ can be viewed as a special
case of Theorem 12.2, we have to establish that the linear integral operator
$A : C[a, b] \to C[a, b]$, defined by
\[ (A\varphi)(x) := \int_a^b K(x, y)\varphi(y)\, dy, \quad x \in [a, b], \tag{12.3} \]
is compact. For this we need the following theorem due to Arzelà-Ascoli,
which again is proven in most introductions to functional analysis.

Theorem 12.3 (Arzelà-Ascoli) Each sequence from a subset $U \subset C[a, b]$
contains a uniformly convergent subsequence; i.e., $U$ is relatively
sequentially compact, if and only if it is bounded and equicontinuous, i.e., if there
exists a constant $C$ such that
\[ |\varphi(x)| \le C \]
for all $x \in [a, b]$ and all $\varphi \in U$, and for every $\varepsilon > 0$ there exists $\delta > 0$ such
that
\[ |\varphi(x) - \varphi(y)| < \varepsilon \]
for all $x, y \in [a, b]$ with $|x - y| < \delta$ and all $\varphi \in U$.

Theorem 12.4 The integral operator (12.3) with continuous kernel is a
compact operator on $C[a, b]$.

Proof. For all $\varphi \in C[a, b]$ with $\|\varphi\|_\infty \le 1$ and all $x \in [a, b]$, we have that
\[ |(A\varphi)(x)| \le (b - a) \max_{x, y \in [a, b]} |K(x, y)|; \]
i.e., the set $U := \{A\varphi : \varphi \in C[a, b], \|\varphi\|_\infty \le 1\} \subset C[a, b]$ is bounded. Since
$K$ is uniformly continuous on the square $[a, b] \times [a, b]$, for every $\varepsilon > 0$ there
exists $\delta > 0$ such that
\[ |K(x, z) - K(y, z)| < \frac{\varepsilon}{b - a} \]
for all $x, y, z \in [a, b]$ with $|x - y| < \delta$. Then
\[ |(A\varphi)(x) - (A\varphi)(y)| = \left| \int_a^b [K(x, z) - K(y, z)]\, \varphi(z)\, dz \right| < \varepsilon \]
for all $x, y \in [a, b]$ with $|x - y| < \delta$ and all $\varphi \in C[a, b]$ with $\|\varphi\|_\infty \le 1$; i.e.,
$U$ is equicontinuous. Hence $A$ is compact by the Arzelà-Ascoli Theorem
12.3. □

In our analysis we also will need an explicit expression for the norm of
the integral operator A.

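Theorem 12.5 The norm of the integral operator $A : C[a, b] \to C[a, b]$
with continuous kernel $K$ is given by
\[ \|A\|_\infty = \max_{a \le x \le b} \int_a^b |K(x, y)|\, dy. \tag{12.4} \]

Proof. For each $\varphi \in C[a, b]$ with $\|\varphi\|_\infty \le 1$ we have
\[ |(A\varphi)(x)| \le \int_a^b |K(x, y)|\, dy, \quad x \in [a, b], \]
and thus
\[ \|A\|_\infty = \sup_{\|\varphi\|_\infty \le 1} \|A\varphi\|_\infty \le \max_{a \le x \le b} \int_a^b |K(x, y)|\, dy. \]
Since $K$ is continuous, there exists $x_0 \in [a, b]$ such that
\[ \int_a^b |K(x_0, y)|\, dy = \max_{a \le x \le b} \int_a^b |K(x, y)|\, dy. \]
For $\varepsilon > 0$ choose $\psi \in C[a, b]$ by setting
\[ \psi(y) := \frac{K(x_0, y)}{|K(x_0, y)| + \varepsilon}, \quad y \in [a, b]. \]
Then $\|\psi\|_\infty \le 1$ and
\[ \|A\psi\|_\infty \ge |(A\psi)(x_0)| = \int_a^b \frac{[K(x_0, y)]^2}{|K(x_0, y)| + \varepsilon}\, dy \ge \int_a^b \frac{[K(x_0, y)]^2 - \varepsilon^2}{|K(x_0, y)| + \varepsilon}\, dy = \int_a^b |K(x_0, y)|\, dy - \varepsilon (b - a). \]
Hence
\[ \|A\|_\infty = \sup_{\|\varphi\|_\infty \le 1} \|A\varphi\|_\infty \ge \|A\psi\|_\infty \ge \int_a^b |K(x_0, y)|\, dy - \varepsilon (b - a), \]
and since this holds for all $\varepsilon > 0$, we have
\[ \|A\|_\infty \ge \int_a^b |K(x_0, y)|\, dy = \max_{a \le x \le b} \int_a^b |K(x, y)|\, dy. \]
This concludes the proof. □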
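
As a numerical illustration of (12.4), the following sketch approximates the maximum of the row integrals for a sample kernel. The kernel $K(x, y) = \frac{1}{2}(x + 1)e^{-xy}$ on $[0, 1]$, the helper name `integral_operator_norm`, and the use of Gauss-Legendre quadrature on a uniform grid of $x$ values are assumptions made here for illustration; the quadrature is only accurate when $|K(x, \cdot)|$ is smooth, as it is for this positive kernel.

```python
import numpy as np

def integral_operator_norm(kernel, a, b, n_x=201, n_quad=40):
    """Approximate max over x in [a, b] of the integral of |K(x, y)| dy."""
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    y = 0.5 * (b - a) * nodes + 0.5 * (a + b)   # map nodes from [-1, 1] to [a, b]
    w = 0.5 * (b - a) * weights
    x = np.linspace(a, b, n_x)                  # sample points for the maximum
    row_integrals = np.abs(kernel(x[:, None], y[None, :])) @ w
    return row_integrals.max()

# Sample kernel (an assumption chosen for illustration).
K = lambda x, y: 0.5 * (x + 1.0) * np.exp(-x * y)
norm_A = integral_operator_norm(K, 0.0, 1.0)
print(norm_A)
```

For this kernel the row integral equals $(x + 1)(1 - e^{-x})/(2x)$, which increases on $(0, 1]$, so the computed value should agree with $1 - e^{-1} \approx 0.632$; in particular the norm is less than one.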

It also can be shown that the integral operator remains compact if the
kernel $K$ is merely weakly singular (see [39]). A kernel $K$ is said to be
weakly singular if it is defined and continuous for all $x, y \in [a, b]$, $x \ne y$,
and there exist positive constants $M$ and $\alpha \in (0, 1]$ such that
\[ |K(x, y)| \le M |x - y|^{\alpha - 1} \]
for all $x, y \in [a, b]$, $x \ne y$.

12.2 Operator Approximations


The fundamental concept for approximately solving an operator equation
\[ \varphi - A\varphi = f \]
of the second kind is to replace it by an equation
\[ \varphi_n - A_n \varphi_n = f_n \]
with approximating sequences $A_n \to A$ and $f_n \to f$ as $n \to \infty$. For
computational purposes, the approximating equations will be chosen such that
they can be reduced to solving a system of linear equations. In this section
we will provide a convergence and error analysis for such approximation
schemes. In particular, we will derive convergence results and error
estimates for the cases where we have either norm or pointwise convergence of
the sequence $A_n \to A$, $n \to \infty$.

Theorem 12.6 Let $A : X \to X$ be a compact linear operator on a Banach
space $X$ such that $I - A$ is injective. Assume that the sequence $A_n : X \to X$
of bounded linear operators is norm convergent, i.e., $\|A_n - A\| \to 0$, $n \to \infty$.
Then for sufficiently large $n$ the inverse operators $(I - A_n)^{-1} : X \to X$
exist and are uniformly bounded. For the solutions of the equations
\[ \varphi - A\varphi = f \quad \text{and} \quad \varphi_n - A_n \varphi_n = f_n \]
we have an error estimate
\[ \|\varphi_n - \varphi\| \le C \{ \|(A_n - A)\varphi\| + \|f_n - f\| \} \tag{12.5} \]
for some constant $C$.

Proof. By the Riesz Theorem 12.2, the inverse $(I - A)^{-1} : X \to X$ exists
and is bounded. Since $\|A_n - A\| \to 0$, $n \to \infty$, by Remark 3.25 we have
$\|(I - A)^{-1}(A_n - A)\| \le q < 1$ for sufficiently large $n$. For these $n$, by the
Neumann series Theorem 3.48, the inverse operators of
\[ I - (I - A)^{-1}(A_n - A) \]
exist and are uniformly bounded by
\[ \| [I - (I - A)^{-1}(A_n - A)]^{-1} \| \le \frac{1}{1 - q}. \]
But then $[I - (I - A)^{-1}(A_n - A)]^{-1}(I - A)^{-1}$ are the inverse operators of
$I - A_n$, since $I - A_n = (I - A)[I - (I - A)^{-1}(A_n - A)]$, and they are
uniformly bounded.
The error estimate follows from
\[ (I - A_n)(\varphi_n - \varphi) = f_n - f + (A_n - A)\varphi \]
by the uniform boundedness of the inverse operators $(I - A_n)^{-1}$. □

In order to develop a similar analysis for the case where the sequence
$(A_n)$ is merely pointwise convergent, i.e., $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi$, we
will have to bridge the gap between norm and pointwise convergence. This
goal will be achieved through the concept of collectively compact operator
sequences and the following uniform boundedness principle.
Theorem 12.7 Let the sequence $A_n : X \to Y$ of bounded linear operators
mapping a Banach space $X$ into a normed space $Y$ be pointwise bounded;
i.e., for each $\varphi \in X$ there exists a positive number $C_\varphi$ depending on $\varphi$
such that $\|A_n \varphi\| \le C_\varphi$ for all $n \in \mathbb{N}$. Then the sequence $(A_n)$ is uniformly
bounded; i.e., there exists some constant $C$ such that $\|A_n\| \le C$ for all
$n \in \mathbb{N}$.
Proof. In the first step, by an indirect proof we establish that positive
constants $M$ and $\rho$ and an element $\psi \in X$ can be chosen such that
\[ \|A_n \varphi\| \le M \tag{12.6} \]
for all $\varphi \in X$ with $\|\varphi - \psi\| \le \rho$ and all $n \in \mathbb{N}$. Assume that this is not
possible. Then, by induction, we construct sequences $(n_k)$ in $\mathbb{N}$, $(\rho_k)$ in $\mathbb{R}$,
and $(\varphi_k)$ in $X$ such that
\[ \|A_{n_k} \varphi\| \ge k \]
for $k = 0, 1, 2, \ldots$ and $\varphi$ with $\|\varphi - \varphi_k\| \le \rho_k$ and
\[ 0 < \rho_k \le \frac{1}{2}\, \rho_{k-1}, \qquad \|\varphi_k - \varphi_{k-1}\| \le \frac{1}{2}\, \rho_{k-1} \]
for $k = 1, 2, \ldots$.
We initiate the induction by setting $n_0 = 1$, $\rho_0 = 1$, and $\varphi_0 = 0$. Assume
that $n_k \in \mathbb{N}$, $\rho_k > 0$, and $\varphi_k \in X$ are given. Then there exist $n_{k+1} \in \mathbb{N}$
and $\varphi_{k+1} \in X$ satisfying $\|\varphi_{k+1} - \varphi_k\| \le \rho_k / 2$ and $\|A_{n_{k+1}} \varphi_{k+1}\| \ge k + 2$.
Otherwise, we would have $\|A_n \varphi\| \le k + 2$ for all $\varphi \in X$ with $\|\varphi - \varphi_k\| \le \rho_k / 2$
and all $n \in \mathbb{N}$, and this contradicts our assumption. Set
\[ \rho_{k+1} := \min\left( \frac{\rho_k}{2}, \frac{1}{\|A_{n_{k+1}}\|} \right) \le \frac{\rho_k}{2}. \]
Then for all $\varphi \in X$ with $\|\varphi - \varphi_{k+1}\| \le \rho_{k+1}$, by the triangle inequality we
have
\[ \|A_{n_{k+1}} \varphi\| \ge \|A_{n_{k+1}} \varphi_{k+1}\| - \|A_{n_{k+1}} (\varphi - \varphi_{k+1})\| \ge k + 1, \]
since $\|A_{n_{k+1}} (\varphi - \varphi_{k+1})\| \le \|A_{n_{k+1}}\|\, \rho_{k+1} \le 1$.
For $j > k$, using the geometric series we have
\[ \|\varphi_k - \varphi_j\| \le \|\varphi_k - \varphi_{k+1}\| + \cdots + \|\varphi_{j-1} - \varphi_j\| \le \frac{1}{2}\, \rho_k + \cdots + \frac{1}{2}\, \rho_{j-1} \le \rho_k. \]
Therefore, $(\varphi_k)$ is a Cauchy sequence and converges to some element $\varphi$ in
the Banach space $X$. From $\|\varphi_k - \varphi_j\| \le \rho_k$ for all $j \ge k$ by passing to the
limit $j \to \infty$ we see that $\|\varphi_k - \varphi\| \le \rho_k$ for all $k \in \mathbb{N}$. Therefore, we have
$\|A_{n_k} \varphi\| \ge k$ for all $k \in \mathbb{N}$, which is a contradiction to the boundedness of
the sequence $(A_n \varphi)$.
Now, in the second step, from the validity of (12.6) we deduce for each
$\varphi \in X$ with $\|\varphi\| \le 1$ and for all $n \in \mathbb{N}$ the estimate
\[ \|A_n \varphi\| = \frac{1}{\rho}\, \|A_n(\rho \varphi + \psi) - A_n \psi\| \le \frac{2M}{\rho}. \]
This completes the proof. □

Following Anselone [2], we introduce the concept of collectively compact


operator sequences.

Definition 12.8 A sequence $A_n : X \to Y$ of linear operators from a
normed space $X$ into a normed space $Y$ is called collectively compact if
each sequence from the set $\{A_n \varphi : \varphi \in X, \|\varphi\| \le 1, n \in \mathbb{N}\}$ contains a
convergent subsequence.

Each operator An from a collectively compact sequence is compact.

Lemma 12.9 Let $X$ be a Banach space, let $A_n : X \to X$ be a collectively
compact sequence, and let $B_n : X \to X$ be a pointwise convergent sequence
with limit operator $B : X \to X$. Then
\[ \|(B_n - B) A_n\| \to 0, \quad n \to \infty. \tag{12.7} \]

Proof. Assume that (12.7) is not valid. Then there exist $\varepsilon_0 > 0$, a sequence
$(n_k)$ in $\mathbb{N}$ with $n_k \to \infty$, $k \to \infty$, and a sequence $(\varphi_k)$ in $X$ with $\|\varphi_k\| \le 1$
such that
\[ \|(B_{n_k} - B) A_{n_k} \varphi_k\| \ge \varepsilon_0, \quad k = 0, 1, \ldots. \tag{12.8} \]
Since the sequence $(A_n)$ is collectively compact, there exists a subsequence
such that
\[ A_{n_{k(j)}} \varphi_{k(j)} \to \psi \in X, \quad j \to \infty. \tag{12.9} \]
Then we can estimate with the aid of the triangle inequality and Remark
3.25 to obtain
\[ \|(B_{n_{k(j)}} - B) A_{n_{k(j)}} \varphi_{k(j)}\| \le \|(B_{n_{k(j)}} - B)\psi\| + \|B_{n_{k(j)}} - B\|\, \|A_{n_{k(j)}} \varphi_{k(j)} - \psi\|. \tag{12.10} \]
The first term on the right-hand side of (12.10) tends to zero as $j \to \infty$,
since the operator sequence $(B_n)$ is pointwise convergent. The second term
tends to zero as $j \to \infty$, since the operator sequence $(B_n)$ is uniformly
bounded by Theorem 12.7 and since we have the convergence (12.9).
Therefore, passing to the limit $j \to \infty$ in (12.10) yields a contradiction to (12.8),
and the proof is complete. □

Theorem 12.10 Let $A : X \to X$ be a compact linear operator on a Banach
space $X$ such that $I - A$ is injective, and assume that the sequence
$A_n : X \to X$ of linear operators is collectively compact and pointwise
convergent; i.e., $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in X$. Then for sufficiently
large $n$ the inverse operators $(I - A_n)^{-1} : X \to X$ exist and are uniformly
bounded. For the solutions of the equations
\[ \varphi - A\varphi = f \quad \text{and} \quad \varphi_n - A_n \varphi_n = f_n \]
we have an error estimate
\[ \|\varphi_n - \varphi\| \le C \{ \|(A_n - A)\varphi\| + \|f_n - f\| \} \tag{12.11} \]
for some constant $C$.

Proof. By the Riesz Theorem 12.2, the inverse $(I - A)^{-1} : X \to X$ exists
and is bounded. The identity
\[ (I - A)^{-1} = I + (I - A)^{-1} A \]
suggests
\[ M_n := I + (I - A)^{-1} A_n \]
as an approximate inverse for $I - A_n$. Elementary calculations yield
\[ M_n (I - A_n) = I - S_n, \tag{12.12} \]
where
\[ S_n := (I - A)^{-1} (A_n - A) A_n. \]
From Lemma 12.9 we conclude that $\|S_n\| \to 0$, $n \to \infty$. Hence for
sufficiently large $n$ we have $\|S_n\| \le q < 1$. For these $n$, by the Neumann series
Theorem 3.48, the inverse operators $(I - S_n)^{-1}$ exist and are uniformly
bounded by
\[ \|(I - S_n)^{-1}\| \le \frac{1}{1 - q}. \]
Now (12.12) implies first that $I - A_n$ is injective, and therefore, since $A_n$ is
compact, by Theorem 12.2 the inverse $(I - A_n)^{-1}$ exists. Then (12.12) also
yields $(I - A_n)^{-1} = (I - S_n)^{-1} M_n$, whence uniform boundedness follows,
since the operators $M_n$ are uniformly bounded by Theorem 12.7. The error
estimate (12.11) is proven as in Theorem 12.6. □

Note that both error estimates (12.5) and (12.11) show that the accuracy
of the approximate solution essentially depends on how well $A_n \varphi$
approximates $A\varphi$ for the exact solution $\varphi$.

12.3 Nyström's Method


Recalling Chapter 9, we choose a convergent sequence
\[ Q_n(g) = \sum_{k=0}^{n} a_k^{(n)} g(x_k^{(n)}) \]
of quadrature formulae for the integral
\[ Q(g) = \int_a^b g(x)\, dx \]
with quadrature points $x_0^{(n)}, \ldots, x_n^{(n)} \in [a, b]$ and real quadrature weights
$a_0^{(n)}, \ldots, a_n^{(n)}$. For convenience we write $x_0, \ldots, x_n$ instead of $x_0^{(n)}, \ldots, x_n^{(n)}$,
and $a_0, \ldots, a_n$ instead of $a_0^{(n)}, \ldots, a_n^{(n)}$. We approximate the integral
operator
\[ (A\varphi)(x) = \int_a^b K(x, y)\varphi(y)\, dy, \quad x \in [a, b], \]
with continuous kernel $K$ by a sequence of numerical integration operators
\[ (A_n \varphi)(x) := \sum_{k=0}^{n} a_k K(x, x_k)\varphi(x_k), \quad x \in [a, b]; \]
i.e., we apply the quadrature formulae for $g = K(x, \cdot)\varphi$. Then the solution
to the integral equation of the second kind
\[ \varphi - A\varphi = f \]
is approximated by the solution of
\[ \varphi_n - A_n \varphi_n = f, \]
which reduces to solving a finite-dimensional linear system.

Theorem 12.11 Let $\varphi_n$ be a solution of
\[ \varphi_n(x) - \sum_{k=0}^{n} a_k K(x, x_k)\varphi_n(x_k) = f(x), \quad x \in [a, b]. \tag{12.13} \]
Then the values $\varphi_j^{(n)} := \varphi_n(x_j)$, $j = 0, \ldots, n$, at the quadrature points
satisfy the linear system
\[ \varphi_j^{(n)} - \sum_{k=0}^{n} a_k K(x_j, x_k)\varphi_k^{(n)} = f(x_j), \quad j = 0, \ldots, n. \tag{12.14} \]
Conversely, let $\varphi_j^{(n)}$, $j = 0, \ldots, n$, be a solution of the system (12.14). Then
the function $\varphi_n$ defined by
\[ \varphi_n(x) := f(x) + \sum_{k=0}^{n} a_k K(x, x_k)\varphi_k^{(n)}, \quad x \in [a, b], \tag{12.15} \]
solves equation (12.13).

Proof. The first statement is trivial. For a solution $\varphi_j^{(n)}$, $j = 0, \ldots, n$, of
the system (12.14) the function $\varphi_n$ defined by (12.15) has values
\[ \varphi_n(x_j) = f(x_j) + \sum_{k=0}^{n} a_k K(x_j, x_k)\varphi_k^{(n)} = \varphi_j^{(n)}, \quad j = 0, \ldots, n. \]
Inserting this into (12.15), we see that $\varphi_n$ satisfies (12.13). □

The formula (12.15) may be viewed as a natural interpolation of the
values $\varphi_j^{(n)}$, $j = 0, \ldots, n$, at the quadrature points to obtain the approximating
function $\varphi_n$. It was introduced by Nyström in 1930.
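
The two steps of Theorem 12.11, solving the linear system (12.14) and then interpolating by (12.15), can be sketched as follows. The function name `nystrom_solve`, the test problem with kernel $K(x, y) = xy$ on $[0, 1]$ and right-hand side $f(x) = 2x/3$ (exact solution $\varphi(x) = x$), and the choice of a five-point Gauss-Legendre rule are all assumptions made for illustration.

```python
import numpy as np

def nystrom_solve(kernel, f, nodes, weights):
    """Solve the system (12.14) and return the Nystrom interpolant (12.15)."""
    Kmat = kernel(nodes[:, None], nodes[None, :])
    # Column k of the kernel matrix is scaled by the weight a_k.
    system = np.eye(len(nodes)) - Kmat * weights[None, :]
    phi_nodes = np.linalg.solve(system, f(nodes))

    def phi_n(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return f(x) + kernel(x[:, None], nodes[None, :]) @ (weights * phi_nodes)

    return phi_nodes, phi_n

# Illustrative problem (not from the text): K(x, y) = x*y, f(x) = 2x/3 on [0, 1].
t, w = np.polynomial.legendre.leggauss(5)
nodes, weights = 0.5 * (t + 1.0), 0.5 * w       # map the rule from [-1, 1] to [0, 1]
phi_nodes, phi_n = nystrom_solve(lambda x, y: x * y,
                                 lambda x: 2.0 * x / 3.0,
                                 nodes, weights)
print(phi_n([0.0, 0.3, 1.0]))
```

Because the quadrature rule integrates $y^2$ exactly, the interpolant in this example should coincide with $\varphi(x) = x$ up to rounding, also at points that are not quadrature nodes.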
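For convenience we note the following analogue of Theorem 12.5.
Theorem 12.12 The norm of the quadrature operators $A_n$ is given by
\[ \|A_n\|_\infty = \max_{a \le x \le b} \sum_{k=0}^{n} |a_k K(x, x_k)|. \tag{12.16} \]

Proof. For each $\varphi \in C[a, b]$ with $\|\varphi\|_\infty \le 1$ we have
\[ \|A_n \varphi\|_\infty \le \max_{a \le x \le b} \sum_{k=0}^{n} |a_k K(x, x_k)|, \]
and therefore $\|A_n\|_\infty$ is smaller than or equal to the right-hand side of
(12.16). Let $z \in [a, b]$ be such that
\[ \sum_{k=0}^{n} |a_k K(z, x_k)| = \max_{a \le x \le b} \sum_{k=0}^{n} |a_k K(x, x_k)| \]
and choose $\psi \in C[a, b]$ with $\|\psi\|_\infty = 1$ and
\[ a_k K(z, x_k)\psi(x_k) = |a_k K(z, x_k)|, \quad k = 0, \ldots, n. \]
Then
\[ \|A_n\|_\infty \ge \|A_n \psi\|_\infty \ge |(A_n \psi)(z)| = \sum_{k=0}^{n} |a_k K(z, x_k)|, \]
and (12.16) is proven. □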

The error analysis will be based on the following theorem.



Theorem 12.13 Assume the quadrature formulae $(Q_n)$ to be convergent.
Then the sequence $(A_n)$ is collectively compact and pointwise convergent
(i.e., $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in C[a, b]$) but not norm convergent.

Proof. Since the quadrature formulae $(Q_n)$ are assumed to be convergent,
by (9.13) and the uniform boundedness principle Theorem 12.7 there exists
a constant $C$ such that the weights satisfy
\[ \sum_{k=0}^{n} |a_k^{(n)}| \le C \]
for all $n \in \mathbb{N}$ (see Theorem 9.10). Then we can estimate
\[ \|A_n \varphi\|_\infty \le C \max_{a \le x, y \le b} |K(x, y)|\, \|\varphi\|_\infty \tag{12.17} \]
and
\[ |(A_n \varphi)(x_1) - (A_n \varphi)(x_2)| \le C \max_{a \le y \le b} |K(x_1, y) - K(x_2, y)|\, \|\varphi\|_\infty \tag{12.18} \]
for all $x_1, x_2 \in [a, b]$. From (12.17) and (12.18) we see that
\[ \{A_n \varphi : \varphi \in C[a, b], \|\varphi\|_\infty \le 1, n \in \mathbb{N}\} \]
is bounded and equicontinuous because the kernel $K$ is uniformly
continuous on $[a, b] \times [a, b]$. Therefore, by the Arzelà-Ascoli Theorem 12.3 the
sequence $(A_n)$ is collectively compact.
Since the quadrature is convergent, for fixed $\varphi \in C[a, b]$ the sequence
$(A_n \varphi)$ is pointwise convergent; i.e., $(A_n \varphi)(x) \to (A\varphi)(x)$, $n \to \infty$, for all
$x \in [a, b]$. As a consequence of (12.18), the sequence $(A_n \varphi)$ is
equicontinuous. Hence it is uniformly convergent: $\|A_n \varphi - A\varphi\|_\infty \to 0$, $n \to \infty$. That
is, we have pointwise convergence: $A_n \varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in C[a, b]$
(see Problem 12.7).
For $\varepsilon > 0$ choose a function $\psi_\varepsilon \in C[a, b]$ with $0 \le \psi_\varepsilon(x) \le 1$ for all
$x \in [a, b]$ such that $\psi_\varepsilon(x) = 1$ if $\min_{j=0,\ldots,n} |x - x_j| \ge \varepsilon$ and $\psi_\varepsilon(x_j) = 0$,
$j = 0, \ldots, n$. Then
\[ \|A(\varphi\psi_\varepsilon) - A\varphi\|_\infty \le \max_{x, y \in [a, b]} |K(x, y)| \int_a^b \{1 - \psi_\varepsilon(y)\}\, dy \to 0, \quad \varepsilon \to 0, \]
for all $\varphi \in C[a, b]$ with $\|\varphi\|_\infty = 1$. Using this result and the fact that
$A_n(\varphi\psi_\varepsilon) = 0$, we derive
\[ \|A - A_n\|_\infty = \sup_{\|\varphi\|_\infty = 1} \|(A - A_n)\varphi\|_\infty \ge \sup_{\|\varphi\|_\infty = 1} \sup_{\varepsilon > 0} \|(A - A_n)(\varphi\psi_\varepsilon)\|_\infty = \sup_{\|\varphi\|_\infty = 1} \sup_{\varepsilon > 0} \|A(\varphi\psi_\varepsilon)\|_\infty \ge \sup_{\|\varphi\|_\infty = 1} \|A\varphi\|_\infty = \|A\|_\infty, \]
whence we see that the sequence $(A_n)$ cannot be norm convergent. □

Theorem 12.13 enables us to apply the approximation theory of Theorem
12.10. For the discussion of the error based on the estimate (12.11) we need
the norm $\|A\varphi - A_n \varphi\|_\infty$. It can be expressed in terms of the error for the
corresponding numerical quadrature by
\[ \|A\varphi - A_n \varphi\|_\infty = \max_{a \le x \le b} \left| \int_a^b K(x, y)\varphi(y)\, dy - \sum_{k=0}^{n} a_k K(x, x_k)\varphi(x_k) \right| \]
and requires a uniform estimate for the error of the quadrature applied
to the integration of $K(x, \cdot)\varphi$. Therefore, from the error estimate (12.11),
it follows that under suitable regularity assumptions on the kernel $K$ and
the exact solution $\varphi$, the convergence order of the underlying quadrature
formulae carries over to the convergence order of the approximate solutions
to the integral equation. We illustrate this by the case of the trapezoidal
rule. Under the assumption $\varphi \in C^2[a, b]$ and $K \in C^2([a, b] \times [a, b])$, by
Theorem 9.7, we can estimate
\[ \|A\varphi - A_n \varphi\|_\infty \le \frac{1}{12}\, h^2 (b - a) \max_{a \le x, y \le b} \left| \frac{\partial^2}{\partial y^2} [K(x, y)\varphi(y)] \right|. \]

Example 12.14 Consider the integral equation
\[ \varphi(x) - \frac{1}{2} \int_0^1 (x + 1) e^{-xy} \varphi(y)\, dy = e^{-x} - \frac{1}{2} + \frac{1}{2}\, e^{-(x+1)}, \quad 0 \le x \le 1, \tag{12.19} \]
with exact solution $\varphi(x) = e^{-x}$. For its kernel we have
\[ \max_{0 \le x \le 1} \int_0^1 \frac{1}{2}\, (x + 1) e^{-xy}\, dy = \sup_{0 < x \le 1} \frac{x + 1}{2x}\, (1 - e^{-x}) < 1. \]
Therefore, by the Neumann series Theorem 3.48 and the operator norm
(12.4), equation (12.19) is uniquely solvable.
We use the (composite) trapezoidal rule for approximately solving the
integral equation (12.19) by the Nyström method. Table 12.1 gives the
difference between the exact and approximate solutions and clearly shows
the expected convergence rate $O(h^2)$.

TABLE 12.1. Numerical solution of (12.19) by the trapezoidal rule

   n     x = 0      x = 0.25   x = 0.5    x = 0.75   x = 1
   4     0.007146   0.008878   0.010816   0.013007   0.015479
   8     0.001788   0.002224   0.002711   0.003261   0.003882
  16     0.000447   0.000556   0.000678   0.000816   0.000971
  32     0.000112   0.000139   0.000170   0.000204   0.000243


We now use the (composite) Simpson's rule for the integral equation
(12.19). The numerical results in Table 12.2 show the convergence order
$O(h^4)$, which we expect from the error estimate (12.11) and the convergence
order for Simpson's rule from Theorem 9.8. □

TABLE 12.2. Numerical solution of (12.19) by Simpson's rule

n x=O x = 0.25 x = 0.5 x = 0.75 x=1

4 0.00006652 0.00008311 0.00010905 0.00015046 0.00021416


8 0.00000422 0.00000527 0.00000692 0.00000956 0.00001366
16 0.00000026 0.00000033 0.00000043 0.00000060 0.00000086

After comparing Tables 12.1 and 12.2, we wish to emphasize the major
advantage of Nystrom's method over other methods like the collocation
method, which we will discuss in the next section. The matrix and the
right-hand side of the linear system (12.14) are obtained by just evaluating
the kernel K and the given function f at the quadrature points. Therefore,
without any further computational effort we can improve considerably on
the approximations by choosing a more accurate numerical quadrature for-
mula.
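
The remark above can be made concrete with a minimal sketch, assuming NumPy and an equidistant grid with n subintervals on [0, 1]. The function names (`kernel`, `rhs`, `trapezoid_weights`, `simpson_weights`, `nystrom`, `max_error`) are illustrative and not taken from the text. Only the weight vector changes between the two quadrature rules; the kernel and right-hand side of (12.19) are evaluated at the same points.

```python
import numpy as np

def kernel(x, y):
    # Kernel of equation (12.19), including the factor 1/2.
    return 0.5 * (x + 1.0) * np.exp(-x * y)

def rhs(x):
    # Right-hand side of (12.19); the exact solution is exp(-x).
    return np.exp(-x) - 0.5 + 0.5 * np.exp(-(x + 1.0))

def trapezoid_weights(n):
    # Composite trapezoidal rule on [0, 1] with n subintervals.
    h = 1.0 / n
    w = np.full(n + 1, h)
    w[0] = w[-1] = h / 2.0
    return w

def simpson_weights(n):
    # Composite Simpson's rule on [0, 1]; n must be even.
    if n % 2 != 0:
        raise ValueError("Simpson's rule needs an even number of subintervals")
    h = 1.0 / n
    w = np.full(n + 1, 2.0 * h / 3.0)
    w[1::2] = 4.0 * h / 3.0
    w[0] = w[-1] = h / 3.0
    return w

def nystrom(n, weights):
    # Solve phi_j - sum_k a_k K(x_j, x_k) phi_k = f(x_j) at the quadrature points.
    x = np.linspace(0.0, 1.0, n + 1)
    a = weights(n)
    matrix = np.eye(n + 1) - kernel(x[:, None], x[None, :]) * a[None, :]
    return x, np.linalg.solve(matrix, rhs(x))

def max_error(n, weights):
    # Maximum error at the quadrature points against the exact solution.
    x, phi = nystrom(n, weights)
    return np.max(np.abs(phi - np.exp(-x)))

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, max_error(n, trapezoid_weights), max_error(n, simpson_weights))
```

Halving h should reduce the trapezoidal error by a factor of about 4 and the Simpson error by a factor of about 16, in line with Tables 12.1 and 12.2.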
In the next example we consider an integral equation with a periodic
kernel and a periodic solution.

Example 12.15 Consider the integral equation

$$\varphi(t) + \frac{ab}{\pi} \int_0^{2\pi} \frac{\varphi(\tau)\,d\tau}{a^2 + b^2 - (a^2 - b^2)\cos(t+\tau)} = f(t), \tag{12.20}$$

where $a \ge b > 0$. This integral equation arises from the solution of the
Dirichlet problem for the Laplace equation in an ellipse with semiaxes $a$
and $b$ (see [39]). Any solution $\varphi$ to the homogeneous form of equation
(12.20) clearly must be a $2\pi$-periodic analytic function, since the kernel is
a $2\pi$-periodic analytic function with respect to the variable $t$. Hence, we
can expand $\varphi$ into a uniformly convergent Fourier series

$$\varphi(t) = \sum_{n=0}^{\infty} \alpha_n \cos nt + \sum_{n=1}^{\infty} \beta_n \sin nt.$$

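Inserting this into the homogeneous integral equation and using the
integrals (see Problem 12.10)

$$\frac{ab}{\pi} \int_0^{2\pi} \frac{e^{in\tau}\,d\tau}{a^2 + b^2 - (a^2 - b^2)\cos(t+\tau)} = \left(\frac{a-b}{a+b}\right)^{n} e^{-int} \tag{12.21}$$

for $n = 0,1,2,\ldots$, it follows that

$$\alpha_n \left[ 1 + \left(\frac{a-b}{a+b}\right)^{n} \right] = \beta_n \left[ 1 - \left(\frac{a-b}{a+b}\right)^{n} \right] = 0$$

for $n = 0,1,2,\ldots$ (where we set $\beta_0 = 0$). Hence, $\alpha_n = \beta_n = 0$ for
$n = 0,1,2,\ldots$, and therefore
$\varphi = 0$. Now the Riesz Theorem 12.2 implies that the integral equation
(12.20) is uniquely solvable for each right-hand side $f$.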
We want to solve (12.20) numerically in the case where the unique
solution is given by

$$\varphi(t) = e^{\cos t}\cos(\sin t), \quad 0 \le t \le 2\pi.$$

Using the integrals (12.21), it can be seen that the right-hand side becomes

$$f(t) = \varphi(t) + e^{c\cos t}\cos(c\sin t), \quad 0 \le t \le 2\pi,$$

where $c = (a-b)/(a+b)$.


Since we are dealing with periodic analytic functions, we use the rectan-
gular rule. From Theorem 9.28 we expect an exponentially decreasing error
behavior, which is exhibited by the numerical results in Table 12.3 giving
the difference between the exact and approximate solutions. Doubling the
number of quadrature points doubles the number of correct digits in the
approximate solution.

TABLE 12.3. Nyström method for equation (12.20)

                   n    t = 0         t = π/2       t = π

 a = 1, b = 0.5    4   -0.15350443    0.01354412   -0.00636277
                   8   -0.00281745    0.00009601   -0.00004247
                  16   -0.00000044    0.00000001   -0.00000001

 a = 1, b = 0.2    4   -0.69224130   -0.06117951   -0.06216587
                   8   -0.15017166   -0.00971695   -0.01174302
                  16   -0.00602633   -0.00036043   -0.00045498
                  32   -0.00000919   -0.00000055   -0.00000069

The actual size of the error, i.e., the constant factor in the exponential
decay, depends on the parameters a and b, which describe the location of
the singularities of the integrands in the complex plane; i.e., they determine
the width of the strip of the complex plane into which the kernel can be
extended as a holomorphic function.
Note that for periodic analytic functions the rectangular rule generally
yields better approximations than Simpson's rule (see Problem 9.12). □
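
The exponential decay in Example 12.15 can be observed with a short sketch, assuming NumPy. It uses the rectangular rule with `num_points` equidistant points $t_j = 2\pi j/\texttt{num\_points}$ and equal weights; treating the column n of Table 12.3 as this total number of points is an assumption made here, and the function name `solve_ellipse_nystrom` is illustrative.

```python
import numpy as np

def solve_ellipse_nystrom(num_points, a, b):
    # Rectangular rule on [0, 2*pi): equidistant points, equal weights.
    t = 2.0 * np.pi * np.arange(num_points) / num_points
    weight = 2.0 * np.pi / num_points
    c = (a - b) / (a + b)

    # Exact solution and right-hand side as stated in Example 12.15.
    exact = np.exp(np.cos(t)) * np.cos(np.sin(t))
    f = exact + np.exp(c * np.cos(t)) * np.cos(c * np.sin(t))

    # Kernel of (12.20), including the factor a*b/pi, at all pairs (t_j, t_k).
    T, Tau = np.meshgrid(t, t, indexing="ij")
    kern = (a * b / np.pi) / (a**2 + b**2 - (a**2 - b**2) * np.cos(T + Tau))

    # Note the plus sign in (12.20): phi + A phi = f.
    phi = np.linalg.solve(np.eye(num_points) + weight * kern, f)
    return t, phi, exact

def max_error(num_points, a, b):
    _, phi, exact = solve_ellipse_nystrom(num_points, a, b)
    return np.max(np.abs(phi - exact))

if __name__ == "__main__":
    for b in (0.5, 0.2):
        for m in (4, 8, 16, 32):
            print(b, m, max_error(m, 1.0, b))
```

The error should fall much faster for b = 0.5 than for b = 0.2, reflecting the dependence on a and b discussed above.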

We confine ourselves to these few examples for the application of the


Nystrom method. For a greater variety the reader is referred to [1, 3, 6, 19,
25,30,39,49].
With the aid of appropriately chosen quadrature formulae, which take
care of the singularity by a weighted product rule, the Nystrom method
can also be successfully applied to weakly singular integral equations of the
second kind (see [39]).

12.4 The Collocation Method


The collocation method for approximately solving an equation of the second
kind

$$\varphi - A\varphi = f \tag{12.22}$$

consists in seeking an approximate solution from a finite-dimensional
subspace by requiring that the equation (12.22) be satisfied at only a finite
number of so-called collocation points. Assume that $A : C[a,b] \to C[a,b]$
is a bounded linear operator and let $X_n = \operatorname{span}\{u_0^{(n)}, \ldots, u_n^{(n)}\} \subset C[a,b]$
denote a sequence of subspaces with $\dim X_n = n+1$. Choose $n+1$ points
$a \le x_0^{(n)} < \cdots < x_n^{(n)} \le b$ such that the interpolation at these grid
points with respect to the subspace $X_n$ is uniquely solvable. Typical
examples for the choice of $X_n$ are polynomials, trigonometric polynomials,
and splines (see also Problem 8.1). For convenience we will again write
$x_0, \ldots, x_n$ instead of $x_0^{(n)}, \ldots, x_n^{(n)}$, and $u_0, \ldots, u_n$ instead of
$u_0^{(n)}, \ldots, u_n^{(n)}$.
By $L_n : C[a,b] \to X_n$ we denote the operator that maps the function
$f \in C[a,b]$ into its uniquely determined interpolating function $L_n f \in X_n$
with the property

$$(L_n f)(x_j) = f(x_j), \quad j = 0, \ldots, n.$$

Representing $L_n$ in terms of the Lagrange basis, i.e., in terms of the uniquely
determined functions $\ell_0, \ldots, \ell_n \in X_n$ with the interpolation property

$$\ell_k(x_j) = \delta_{jk}, \quad j,k = 0, \ldots, n,$$

in the form

$$L_n f = \sum_{k=0}^{n} f(x_k)\,\ell_k, \tag{12.23}$$

it can be seen that the operator $L_n : C[a,b] \to X_n$ is linear and bounded
(with respect to the maximum norm). Moreover, since $L_n f = f$ for all
$f \in X_n$, the interpolation operator is a projection operator; i.e., $L_n^2 = L_n$
(see p. 157 and Problem 8.4).
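
The collocation method approximates the solution of (12.22) by an
element $\varphi_n \in X_n$ satisfying

$$\varphi_n(x_j) - (A\varphi_n)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{12.24}$$

We express $\varphi_n$ as a linear combination

$$\varphi_n = \sum_{k=0}^{n} \gamma_k u_k$$

and immediately see that equation (12.24) is equivalent to the linear system

$$\sum_{k=0}^{n} \gamma_k \{u_k(x_j) - (Au_k)(x_j)\} = f(x_j), \quad j = 0, \ldots, n, \tag{12.25}$$

for the coefficients $\gamma_0, \ldots, \gamma_n$. If we use the Lagrange basis for $X_n$ and write

$$\varphi_n = \sum_{k=0}^{n} \gamma_k \ell_k,$$

then of course $\gamma_j = \varphi_n(x_j)$, $j = 0, \ldots, n$, and the system (12.25) becomes

$$\gamma_j - \sum_{k=0}^{n} \gamma_k (A\ell_k)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{12.26}$$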
k=O

From the systems (12.25) and (12.26) it is obvious that the collocation
method is only semidiscrete, since in general, additional approximations
are needed in order to compute the matrix entries $(Au_k)(x_j)$ or $(A\ell_k)(x_j)$.
The collocation method can be interpreted as a projection method; i.e.,
since the interpolating function is uniquely determined by its values at the
interpolation points, equation (12.24) is equivalent to

$$\varphi_n - L_n A \varphi_n = L_n f. \tag{12.27}$$

This equation can be considered as an equation in the whole space $C[a,b]$
because any solution $\varphi_n = L_n A\varphi_n + L_n f$ automatically belongs to $X_n$.
Hence, our general error and convergence results for operator equations of
the second kind can be applied to the collocation method.

Theorem 12.16 Let $A : C[a,b] \to C[a,b]$ be a compact linear operator
such that $I - A$ is injective, and assume that the interpolation operators
$L_n : C[a,b] \to X_n$ satisfy $\|L_n A - A\|_\infty \to 0$, $n \to \infty$. Then, for
sufficiently large $n$, the approximate equation (12.27) is uniquely solvable for all
$f \in C[a,b]$, and we have the error estimate

$$\|\varphi_n - \varphi\|_\infty \le C\,\|L_n\varphi - \varphi\|_\infty \tag{12.28}$$

for some positive constant $C$ depending on $A$.

Proof. From Theorem 12.6 applied to $A_n = L_n A$, we conclude that for
all sufficiently large $n$ the inverse operators $(I - L_n A)^{-1}$ exist and are
uniformly bounded. To verify the error bound, we apply the interpolation
operator $L_n$ to (12.22) and get

$$\varphi - L_n A\varphi = L_n f + \varphi - L_n\varphi.$$

Subtracting this from (12.27) we find

$$(I - L_n A)(\varphi_n - \varphi) = L_n\varphi - \varphi,$$

whence the estimate (12.28) follows. □


Corollary 12.17 Let $A : C[a,b] \to C[a,b]$ be a compact linear operator
such that $I - A$ is injective, and assume that the interpolation operators
$L_n : C[a,b] \to X_n$ are pointwise convergent; i.e., $L_n\varphi \to \varphi$, $n \to \infty$, for all
$\varphi \in C[a,b]$. Then, for sufficiently large $n$, the approximate equation (12.27)
is uniquely solvable for all $f \in C[a,b]$, and the estimate (12.28) holds.

Proof. By Lemma 12.9 the pointwise convergence of the interpolation
operators $L_n$ and the compactness of $A$ imply that $\|L_n A - A\|_\infty \to 0$, $n \to \infty$.
Now the statement follows from the preceding theorem. □

We note that the collocation method may of course also be applied in
function spaces other than the space $C[a,b]$.
We proceed by considering the collocation method for integral equations
of the second kind

$$\varphi(x) - \int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b], \tag{12.29}$$

with continuous kernel $K$. Using the interpolation operator, in this case we
can rewrite the collocation equation (12.26) in the form

$$\varphi_n(x) - \int_a^b [L_n K(\cdot,y)](x)\,\varphi_n(y)\,dy = (L_n f)(x), \quad x \in [a,b], \tag{12.30}$$

and the systems (12.25) and (12.26) become

$$\sum_{k=0}^{n} \gamma_k \left\{ u_k(x_j) - \int_a^b K(x_j,y)\,u_k(y)\,dy \right\} = f(x_j), \quad j = 0, \ldots, n, \tag{12.31}$$

and

$$\gamma_j - \sum_{k=0}^{n} \gamma_k \int_a^b K(x_j,y)\,\ell_k(y)\,dy = f(x_j), \quad j = 0, \ldots, n, \tag{12.32}$$

respectively. There exists a broad variety of collocation methods
corresponding to various choices for the subspaces $X_n$, for the basis functions
$u_0, \ldots, u_n$, and for the collocation points $x_0, \ldots, x_n$. We briefly discuss two
possibilities, based on linear splines and on trigonometric polynomials.
First we consider piecewise linear interpolation. Let $x_j = a + jh$,
$j = 0, \ldots, n$, denote an equidistant subdivision with step size $h = (b-a)/n$
and let $X_n$ be the space of continuous functions on $[a,b]$ whose restrictions
on each of the subintervals $[x_{j-1}, x_j]$, $j = 1, \ldots, n$, coincide with a linear
function. As in Section 11.5, the Lagrange basis is given by

$$\ell_k(x) = \begin{cases} \dfrac{1}{h}\,(x - x_{k-1}), & x \in [x_{k-1}, x_k],\ k \ge 1, \\[1ex] \dfrac{1}{h}\,(x_{k+1} - x), & x \in [x_k, x_{k+1}],\ k \le n-1, \\[1ex] 0, & \text{otherwise}, \end{cases}$$

for $k = 0, \ldots, n$. Since for piecewise linear interpolation we have that

$$\|L_n f\|_\infty \le \max_{j=0,\ldots,n} |f(x_j)| \le \|f\|_\infty,$$

with equality holding if $f$ is constant, we observe that $\|L_n\|_\infty = 1$ for the
corresponding interpolation operator $L_n$. Here, we have pointwise
convergence $L_n\varphi \to \varphi$, $n \to \infty$. This can be seen from the error estimate (8.9)
and the Weierstrass approximation theorem, analogous to the proof of the
Szegő Theorem 9.10. Therefore, in this case Corollary 12.17 applies, and
we can state the following result.

Theorem 12.18 The collocation method with linear splines converges for
integral equations of the second kind with continuous kernels.

Provided that the exact solution of the integral equation is twice
continuously differentiable, then from the error estimate (8.9) for linear
interpolation and Corollary 12.17 we derive an error estimate of the form

$$\|\varphi_n - \varphi\|_\infty \le C h^2 \|\varphi''\|_\infty$$

for the linear spline collocation approximate solution $\varphi_n$. Here, $C$ denotes
some constant depending on the kernel $K$.
In most practical problems the evaluation of the matrix entries
in (12.32) will require a numerical quadrature for integrals of the form
$\int_a^b K(x_j,y)\,\ell_k(y)\,dy$. To be consistent with our approximations, we replace
$K(x_j,\cdot)$ by its piecewise linear interpolation; i.e., we approximate

$$\int_a^b K(x_j,y)\,\ell_k(y)\,dy \approx \sum_{i=0}^{n} K(x_j,x_i) \int_a^b \ell_i(y)\,\ell_k(y)\,dy$$

for $j,k = 0, \ldots, n$. Straightforward calculations yield the tridiagonal matrix

$$W = \frac{h}{6} \begin{pmatrix} 2 & 1 & & & \\ 1 & 4 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 4 & 1 \\ & & & 1 & 2 \end{pmatrix}$$

for the weights $w_{ik} = \int_a^b \ell_i(y)\,\ell_k(y)\,dy$.
We now investigate the influence of these approximations on the error
analysis. We interpret the solution of the system (12.32) with the
approximate values for the coefficients as the solution $\tilde\varphi_n$ of an additional
approximate equation

$$\tilde\varphi_n - A_n\tilde\varphi_n = L_n f, \tag{12.33}$$
namely of the collocation equation

with

$$K_n(x,y) := \sum_{i=0}^{n} K(x,x_i)\,\ell_i(y);$$

i.e., $K_n(x,y) = [L_n K(x,\cdot)](y)$ interpolates $K$ with respect to the second
variable. We assume that the kernel $K$ is twice continuously differentiable
on $[a,b] \times [a,b]$. Then, using the error estimate (8.9), we have

$$|K(x,y) - K_n(x,y)| \le \frac{1}{8}\, h^2 \left\| \frac{\partial^2 K}{\partial y^2} \right\|_\infty$$

for all $a \le x,y \le b$. Writing

and using the fact that for the piecewise linear spline interpolation we have
$\|L_n\|_\infty = 1$, from (8.9) we obtain

for all $a \le x,y \le b$. Hence, in view of (12.4), for the integral operator $A_n$
with kernel $K_n$ we have $\|A_n - A\|_\infty = O(h^2)$. When $f$ is twice continuously
differentiable, we also have $\|L_n f - f\|_\infty = O(h^2)$. Therefore, from Theorem
12.6 we can now conclude that the approximate equation (12.33) is uniquely
solvable for sufficiently large $n$ and that for the approximate solution we
have an error estimate $\|\tilde\varphi_n - \varphi\|_\infty = O(h^2)$. Therefore, the fully discrete
approximation still is of order $O(h^2)$.

Example 12.19 Consider the integral equation (12.19) of Example 12.14.
Table 12.4 gives the error between the exact solution and the fully discrete
collocation approximation with linear splines. It clearly exhibits the error
behavior $O(h^2)$. □

TABLE 12.4. Numerical results for spline collocation

  n    x = 0      x = 0.25   x = 0.5    x = 0.75   x = 1

  4    0.004808   0.005430   0.006178   0.007128   0.008331
  8    0.001199   0.001354   0.001541   0.001778   0.002078
 16    0.000300   0.000338   0.000385   0.000444   0.000519
 32    0.000075   0.000085   0.000096   0.000111   0.000130
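
The fully discrete linear spline collocation for (12.19) can be sketched as follows, assuming NumPy and the equidistant grid $x_j = jh$ on [0, 1]. The names `spline_weight_matrix`, `spline_collocation` and `max_error` are illustrative. The integrals $\int K(x_j,y)\ell_k(y)\,dy$ in (12.32) are replaced by $\sum_i K(x_j,x_i)\,w_{ik}$ with the tridiagonal weight matrix $W$, so the system matrix is $I - KW$.

```python
import numpy as np

def spline_weight_matrix(n):
    # W = (h/6) * tridiag(1, 4, 1) with corner entries 2, i.e. the
    # integrals of products of neighbouring hat functions on [0, 1].
    h = 1.0 / n
    W = np.zeros((n + 1, n + 1))
    idx = np.arange(n + 1)
    W[idx, idx] = 4.0
    W[0, 0] = W[n, n] = 2.0
    W[idx[:-1], idx[1:]] = 1.0
    W[idx[1:], idx[:-1]] = 1.0
    return (h / 6.0) * W

def spline_collocation(n):
    # Fully discrete version of (12.32) for the integral equation (12.19).
    x = np.linspace(0.0, 1.0, n + 1)
    K = 0.5 * (x[:, None] + 1.0) * np.exp(-x[:, None] * x[None, :])
    f = np.exp(-x) - 0.5 + 0.5 * np.exp(-(x + 1.0))
    # (K @ W)[j, k] = sum_i K(x_j, x_i) * w_ik approximates the matrix entry.
    gamma = np.linalg.solve(np.eye(n + 1) - K @ spline_weight_matrix(n), f)
    return x, gamma  # gamma[j] approximates phi(x_j)

def max_error(n):
    x, gamma = spline_collocation(n)
    return np.max(np.abs(gamma - np.exp(-x)))

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, max_error(n))
```

Doubling n should reduce the error by a factor of about 4, consistent with the $O(h^2)$ behavior reported in Table 12.4.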

We note that in principle, a collocation method with error $O(h^4)$ can
be obtained from cubic spline interpolation (see Theorem 8.34). However,
the numerical implementation is much more involved. This again illustrates
that the Nyström method is more practical, since there it is quite easy to
change the order from $O(h^2)$ to $O(h^4)$ by simply replacing the weights of
the trapezoidal rule by those of Simpson's rule.
We proceed by discussing the collocation method based on trigonometric
interpolation with equidistant knots $t_j = j\pi/n$, $j = 0, \ldots, 2n-1$. First,
we establish a convergence result for the trigonometric interpolation of
differentiable functions (see Problem 8.12).

Lemma 12.20 Let $f \in C^1[0,2\pi]$. Then for the remainder in trigonometric
interpolation we have

$$\|L_n f - f\|_\infty \le c_n \|f'\|_2, \tag{12.34}$$

where $c_n \to 0$, $n \to \infty$.

Proof. Consider the trigonometric monomials $f_m(t) = e^{imt}$ and write
$m = (2k+1)n + q$ with $k \in \mathbb{Z}$ and $0 \le q < 2n$. Since $f_m(t_j) = f_{q-n}(t_j)$ for
$j = 0, \ldots, 2n-1$, the trigonometric interpolation polynomials for $f_m$ and
$f_{q-n}$ coincide. Therefore, we have

$$\|L_n f_m - f_m\|_\infty \le 2$$

for all $|m| \ge n$. Since $f$ is continuously differentiable, we can expand it into
a uniformly convergent Fourier series (see Problem 12.14)

$$f = \sum_{m=-\infty}^{\infty} a_m f_m.$$

From the relation

$$\int_0^{2\pi} f'(t)\,e^{-imt}\,dt = im \int_0^{2\pi} f(t)\,e^{-imt}\,dt = 2\pi i m\, a_m$$

for the Fourier coefficients it follows that

Using this identity and the Cauchy-Schwarz inequality, we derive

This implies (12.34). □

Now, consider an integral equation of the second kind with $2\pi$-periodic
continuously differentiable kernel $K$ and right-hand side $f$. The
corresponding integral operator $A$ maps $C[0,2\pi]$ into $C^1[0,2\pi]$ and satisfies
$\|(A\varphi)'\|_2 \le M \|\varphi\|_\infty$, where the constant $M$ is a fixed multiple of
$\|\partial K/\partial t\|_\infty$. Therefore, making use of (12.34), we find

$$\|L_n A\varphi - A\varphi\|_\infty \le c_n \|(A\varphi)'\|_2 \le c_n M \|\varphi\|_\infty$$

for all $\varphi \in C[0,2\pi]$. Hence, $\|L_n A - A\|_\infty \le c_n M \to 0$, $n \to \infty$, and Theorem
12.16 can be applied to obtain the following result.

Theorem 12.21 The collocation method with trigonometric polynomials
converges for integral equations of the second kind with continuously
differentiable periodic kernels and right-hand sides.
One possibility for the implementation of the collocation method is to
use the trigonometric monomials as basis functions. Then the integrals
$\int_0^{2\pi} K(t_j,\tau)\,e^{ik\tau}\,d\tau$ have to be evaluated numerically. Replacing the kernel
by its trigonometric interpolation leads to the quadrature formula

for $j = 0, \ldots, 2n-1$. Using fast Fourier transform techniques (see Section
8.2) these quadratures can be carried out very rapidly. A second, even
more efficient, possibility is to use the Lagrange basis

$$\ell_k(t) = \frac{1}{2n} \left\{ 1 + 2\sum_{m=1}^{n-1} \cos m(t - t_k) + \cos n(t - t_k) \right\} \tag{12.35}$$

for $k = 0, \ldots, 2n-1$, which can be derived from Theorem 8.25 (see Problem
12.13).
For the evaluation of the matrix coefficients $\int_0^{2\pi} K(t_j,\tau)\,\ell_k(\tau)\,d\tau$ we
proceed analogously to the preceding case of linear splines. We approximate
these integrals by replacing $K(t_j,\cdot)$ by its trigonometric interpolation
polynomial, i.e., we approximate

$$\int_0^{2\pi} K(t_j,\tau)\,\ell_k(\tau)\,d\tau \approx \sum_{m=0}^{2n-1} K(t_j,t_m) \int_0^{2\pi} \ell_m(\tau)\,\ell_k(\tau)\,d\tau$$

for $j,k = 0, \ldots, 2n-1$. Using (12.35), elementary integrations yield (see
Problem 12.13)

$$\int_0^{2\pi} \ell_m(\tau)\,\ell_k(\tau)\,d\tau = \frac{\pi}{n}\,\delta_{mk} - (-1)^{m-k}\,\frac{\pi}{4n^2} \tag{12.36}$$

for $m,k = 0, \ldots, 2n-1$. Note that despite the global nature of the
trigonometric interpolation and its Lagrange basis, due to the simple structure of
the weights (12.36) in the quadrature rule, the computation of the matrix
elements is not too costly. The only additional computational effort besides
the kernel evaluation is the computation of the row sums

$$\sum_{m=0}^{2n-1} (-1)^m K(t_j,t_m)$$

for $j = 0, \ldots, 2n-1$. We omit the analysis of the additional error in the
fully discrete method caused by the numerical quadrature.

Example 12.22 For the integral equation (12.20) from Example 12.15,
Table 12.5 gives the error between the exact solution and the collocation
approximation.

TABLE 12.5. Collocation method for equation (12.20)

                   n    t = 0         t = π/2       t = π

 a = 1, b = 0.5    4   -0.10752855   -0.03243176    0.03961310
                   8   -0.00231537    0.00059809    0.00045961
                  16   -0.00000044    0.00000002   -0.00000000

 a = 1, b = 0.2    4   -0.56984945   -0.18357135    0.06022598
                   8   -0.14414257   -0.00368787   -0.00571394
                  16   -0.00602543   -0.00035953   -0.00045408
                  32   -0.00000919   -0.00000055   -0.00000069

Again we have exponential convergence, as is to be expected from the
estimate (12.28) and the error analysis for the trigonometric interpolation
for analytic functions [38]. □

In general, the fully discrete implementation of the collocation method


as described by our two examples can be used in all situations where the re-
quired numerical quadratures for the matrix elements can be carried out in
closed form for the chosen approximating subspace and collocation points.
In all these cases, of course, the quadrature formulae that are required for
the related Nystrom method will also be available. Because the approxima-
tion order for both methods usually will be the same, Nystrom's method
is preferable, since it requires the least computational effort for evaluat-
ing the matrix elements. However, the situation changes in cases where no
straightforward quadrature rules for the application of Nystrom's method
are available.
Again, for a greater variety of collocation methods the reader is referred
to [1, 3, 6, 19, 25, 30, 39, 49].

12.5 Stability
For finite-dimensional approximations of a given operator equation we have
to distinguish three condition numbers, namely, the condition numbers of
the original operator and of the approximating operator as mappings in
the underlying normed spaces, and the condition number of the linear sys-
tem for the actual numerical solution. This latter system we can influence,
for example in the collocation method by the choice of the basis for the
approximating subspaces.
Consider an equation of the second kind $\varphi - A\varphi = f$ in a Banach space
$X$ and approximating equations $\varphi_n - A_n\varphi_n = f_n$ under the assumptions of
Theorem 12.6, i.e., norm convergence, or of Theorem 12.10, i.e., collective
compactness and pointwise convergence. Then, recalling Definition 5.2 of
the condition number, from Theorems 12.6 and 12.10 it follows that the
condition numbers $\operatorname{cond}(I - A_n)$ are uniformly bounded. Hence, for the
condition of the approximating scheme, we mainly have to be concerned
with the condition of the linear system for the actual computation of the
solution of $\varphi_n - A_n\varphi_n = f_n$.
For the discussion of the condition number for the Nyström method we
recall the linear system (12.14) and denote by $\tilde A_n$ the matrix with the
entries $a_k K(x_j,x_k)$. We introduce operators $R_n : C[a,b] \to \mathbb{R}^{n+1}$ by

$$R_n : f \mapsto (f(x_0), \ldots, f(x_n))^T, \quad f \in C[a,b],$$

and $M_n : \mathbb{R}^{n+1} \to C[a,b]$, where $M_n\Phi$ is the piecewise linear interpolation
with $(M_n\Phi)(x_j) = \Phi_j$, $j = 0, \ldots, n$, for $\Phi = (\Phi_0, \ldots, \Phi_n)^T$. (If $a < x_0$, we
set $(M_n\Phi)(x) = \Phi_0$ for $a \le x \le x_0$; and if $x_n < b$, we set $(M_n\Phi)(x) = \Phi_n$
for $x_n \le x \le b$.) Then clearly, $\|R_n\|_\infty = \|M_n\|_\infty = 1$.
From Theorem 12.11 we conclude that

$$(I - \tilde A_n) = R_n (I - A_n) M_n$$

and

$$(I - \tilde A_n)^{-1} = R_n (I - A_n)^{-1} M_n.$$

From these relations we immediately obtain the following theorem.

Theorem 12.23 For the Nystrom method the condition numbers for the
linear system are uniformly bounded.

This theorem states that the Nystrom method essentially preserves the
stability of the original integral equation.
For the collocation method, we introduce the matrices $E_n$ with entries
$u_k(x_j)$ and $\tilde A_n$ with entries $(Au_k)(x_j)$. Since $X_n = \operatorname{span}\{u_0, \ldots, u_n\}$ is
assumed to be such that the interpolation problem with respect to the
collocation points $x_0, \ldots, x_n$ is uniquely solvable, the matrix $E_n$ is invertible
(see Problem 8.1). In addition, let the operator $W_n : \mathbb{R}^{n+1} \to C[a,b]$ be
defined by

$$W_n : \gamma \mapsto \sum_{k=0}^{n} \gamma_k u_k$$

for $\gamma = (\gamma_0, \ldots, \gamma_n)^T$ and recall the operators $R_n$ and $M_n$ from above.
Then we have

$$W_n = L_n M_n E_n.$$

From (12.25) we can conclude that

$$(E_n - \tilde A_n) = R_n (I - L_n A) W_n$$

and

$$(E_n - \tilde A_n)^{-1} = E_n^{-1} R_n (I - L_n A)^{-1} L_n M_n.$$
From these three relations, and the fact that by Theorems 12.7 and 12.16
the sequence of operators (I - LnA)-l L n is uniformly bounded, we obtain
the following theorem.

Theorem 12.24 Under the assumptions of Theorem 12.16, for the collo-
cation method the condition number of the linear system satisfies

for all sufficiently large n and some constant C.



This theorem suggests that the basis functions must be chosen with
caution. For a poor choice, like monomials, the condition number of $E_n$ can
grow quite rapidly. However, for the Lagrange basis, i.e., for the linear
system (12.26), $E_n$ becomes the identity matrix with condition number one.
In addition, $\|L_n\|$ enters in the estimate on the condition number of the
linear system, and for example, for polynomial or trigonometric polynomial
interpolation we have $\|L_n\| \to \infty$, $n \to \infty$ (see Theorem 8.16).
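
A small numerical illustration of this remark, assuming NumPy, equidistant collocation points on [0, 1], and the monomial basis $u_k(x) = x^k$ (so that $E_n$ is a Vandermonde matrix). The function name `basis_condition_numbers` and the choice of the maximum norm are illustrative assumptions, not part of the text.

```python
import numpy as np

def basis_condition_numbers(n):
    # Collocation points x_0, ..., x_n, equidistant on [0, 1] (assumed).
    x = np.linspace(0.0, 1.0, n + 1)
    # Monomial basis: E_n has entries u_k(x_j) = x_j**k (Vandermonde matrix).
    E_monomial = np.vander(x, increasing=True)
    # Lagrange basis: u_k(x_j) = delta_jk, so E_n is the identity matrix.
    E_lagrange = np.eye(n + 1)
    return (np.linalg.cond(E_monomial, np.inf),
            np.linalg.cond(E_lagrange, np.inf))

if __name__ == "__main__":
    for n in (4, 8, 12):
        cond_monomial, cond_lagrange = basis_condition_numbers(n)
        print(n, cond_monomial, cond_lagrange)
```

The monomial condition number grows by several orders of magnitude as n increases, while the Lagrange basis keeps it at one.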
In the context of stability we will conclude this chapter with a few re-
marks on integral equations of the first kind.

Theorem 12.25 Let $X$ and $Y$ be normed spaces and let $A : X \to Y$ be a
compact linear operator. Then $A$ has a bounded inverse if and only if $X$ is
finite-dimensional.

Proof. Assume that $A$ has a bounded inverse $A^{-1} : Y \to X$. Then we have
$A^{-1}A = I$, and therefore the identity operator must be compact, since the
product of a bounded and a compact operator is compact (see Problem
12.2). However, the identity operator on $X$ is compact if and only if $X$ has
finite dimension. □

Theorem 12.25 implies that integral equations of the first kind with con-
tinuous (or weakly singular) kernels are improperly posed problems in the
sense of Hadamard, as described in Chapter 5.
Of course, the ill-posed nature of an equation has consequences for its
numerical treatment. The fact that an operator does not have a bounded
inverse means that the condition numbers of its finite-dimensional approx-
imations grow with the quality of the approximation. Hence, a careless dis-
cretization of ill-posed problems leads to a numerical behavior that at first
glance seems to be paradoxical. Namely, increasing the degree of discretiza-
tion, i.e., increasing the accuracy of the approximation for the operator, will
cause the approximate solution to the equation to become less and less re-
liable. Therefore, straightforward application of the methods described in
this chapter to integral equations of the first kind with continuous kernels
will generate numerical nonsense.
To make this remark more vivid, we consider the approximate solution
of an integral equation of the first kind

$$\int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],$$

by the analogue of the linear system (12.14) for the Nyström method, i.e.,
by

$$\sum_{k=0}^{n} a_k K(x_j,x_k)\,\varphi_k^{(n)} = f(x_j), \quad j = 0, \ldots, n.$$

The equation of the first kind

$$\int_0^1 (x+1)e^{-xy}\varphi(y)\,dy = 1 - e^{-(x+1)}, \quad 0 \le x \le 1, \tag{12.37}$$

has the unique solution $\varphi(x) = e^{-x}$ (see Problem 12.20). Table 12.6 gives
the difference between the exact solution and the solution obtained by the
quadrature method using the (composite) trapezoidal rule.

TABLE 12.6. Numerical solution of (12.37)

  n     x = 0       x = 0.5     x = 1

  4     0.4057      0.3705      0.1704
  8    -4.5989     14.6094     -4.4770
 16    -8.5957      2.2626   -153.4805
 32     3.8965    -32.2907     22.5570
 64   -88.6474     -6.4484   -182.6745

We observe that the approximation is completely useless and that in


agreement with the above remarks, the quality of the approximation de-
creases when the accuracy of the quadrature is increased. (Of course, the
actual numerical values for the solution of the ill-conditioned linear system
of this example will depend on the actual computer and the code for solving
the linear system that is used.)
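
The growth of the condition number behind Table 12.6 can be checked with a short sketch, assuming NumPy and the composite trapezoidal rule on an equidistant grid in [0, 1]. The names `first_kind_matrix` and `condition_numbers` are illustrative; the printed values will depend on the machine and the linear algebra library, so none are quoted here.

```python
import numpy as np

def first_kind_matrix(n):
    # Matrix with entries a_k * K(x_j, x_k) for K(x, y) = (x + 1) * exp(-x*y),
    # using the composite trapezoidal weights a_k on [0, 1].
    x = np.linspace(0.0, 1.0, n + 1)
    h = 1.0 / n
    a = np.full(n + 1, h)
    a[0] = a[-1] = h / 2.0
    return (x[:, None] + 1.0) * np.exp(-x[:, None] * x[None, :]) * a[None, :]

def condition_numbers(ns=(4, 8, 16)):
    # Spectral condition number of the discretized first-kind operator.
    return {n: np.linalg.cond(first_kind_matrix(n)) for n in ns}

if __name__ == "__main__":
    for n, cond in condition_numbers().items():
        print(n, cond)
```

Refining the grid makes the matrix a better approximation of the compact operator and therefore closer to singular, which is why the approximate solutions deteriorate.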
Hence, the numerical solution of integral equations of the first kind with
continuous kernels requires regularization methods such as Tikhonov reg-
ularization or singular value cutoff, which we discussed in Chapter 5 for
the finite-dimensional case. These regularization techniques now, of course,
need to be analyzed in an appropriate function space setting. We recall the
corresponding references to [14, 22, 28, 37, 39, 43] from Chapter 5 for the
foundation of regularization methods in Hilbert spaces.

Problems
12.1 Show that the boundary value problem for the differential equation

$$-u'' + qu = r \quad \text{in } [0,1]$$

with boundary conditions $u(0) = u(1) = 0$ is equivalent to finding a continuous
solution of the integral equation of the second kind

$$u(x) + \int_0^1 G(x,y)\,q(y)\,u(y)\,dy = \int_0^1 G(x,y)\,r(y)\,dy, \quad x \in [0,1],$$

where

$$G(x,y) := \begin{cases} (1-x)\,y, & 0 \le y \le x \le 1, \\ (1-y)\,x, & 0 \le x \le y \le 1, \end{cases}$$

is the so-called Green's function of the boundary value problem.

12.2 Show that linear combinations of compact linear operators are compact
and that the product of two bounded linear operators is compact if one of the
factors is compact.

12.3 Show that the integral operator with continuous kernel is a compact
operator from $L^2[a,b]$ into $L^2[a,b]$.

12.4 Show that the Volterra integral equation of the second kind

$$\varphi(x) - \int_a^x K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],$$

with continuous kernel $K$ has a unique continuous solution $\varphi$ for each continuous
right-hand side $f$.
Hint: Show that the homogeneous equation allows only the trivial solution and
use Theorem 12.2.

12.5 Solve the Volterra integral equation

by successive approximations.

12.6 Show that a sequence $A_n : X \to Y$ of compact linear operators mapping
a normed space $X$ into a normed space $Y$ is collectively compact if and only if
for each bounded sequence $(\varphi_n)$ in $X$ the sequence $(A_n\varphi_n)$ contains a convergent
subsequence.

12.7 Show that a sequence $(\varphi_n)$ of functions $\varphi_n : [a,b] \to \mathbb{R}$ that is
equicontinuous and converges pointwise on $[a,b]$ to some function $\varphi : [a,b] \to \mathbb{R}$ converges
uniformly on $[a,b]$.

12.8 Prove the Banach-Steinhaus theorem: Let $A : X \to Y$ be a bounded linear
operator and let $A_n : X \to Y$ be a sequence of bounded linear operators from a
Banach space $X$ into a normed space $Y$. For pointwise convergence $A_n\varphi \to A\varphi$,
$n \to \infty$, for all $\varphi \in X$ it is necessary and sufficient that $\|A_n\| \le C$ for all $n \in \mathbb{N}$
with some constant $C$ and that $A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in U$, where $U$ is
some dense subset of $X$ (compare Theorem 9.10).

12.9 For the integral operator $A$ and the numerical integration operators $A_n$ using
the (composite) trapezoidal rule, derive bounds on $\|(A_n - A)A\|_\infty$ and
$\|(A_n - A)A_n\|_\infty$. Relate the results to Lemma 12.9.

12.10 Verify the integrals (12.21).

12.11 Write a computer program for the Nyström method allowing the use of
different quadrature formulae and test it for various examples.

12.12 Use the quadrature formula (9.36) with the substitution (9.47) in a
Nystrom method for the integral equation (12.19). Compare the numerical re-
sults with those obtained from the trapezoidal and Simpson's rule.

12.13 Verify the Lagrange basis (12.35) and the integrals (12.36).

12.14 Show that the Fourier series of a continuously differentiable periodic func-
tion is uniformly convergent.

12.15 In the degenerate kernel approximation the integral equation of the
second kind with continuous kernel $K$ is approximated by the solutions of

$$\varphi_n(x) - \int_a^b K_n(x,y)\varphi_n(y)\,dy = f(x), \quad x \in [a,b],$$

with an approximate kernel $K_n$ of the form

$$K_n(x,y) = \sum_{j=0}^{n} a_j(x)\,b_j(y).$$

Show how the solution of the approximate equation can be reduced to solving
a system of linear equations. Give an error and convergence analysis based on
Theorem 12.6.

12.16 Use the results of Problem 12.15 to prove Theorem 12.2 for the case of
an integral equation of the second kind with continuous kernel.

12.17 Construct degenerate kernels via interpolation of the kernel K with re-
spect to the first variable and relate this particular degenerate kernel method to
the collocation method (see Problem 12.15).

12.18 The idea of two-grid and multigrid iterations can also be applied to
integral equations of the second kind. For its theoretical foundation assume
the sequence of operators $A_n : X \to X$ to be either norm convergent (i.e.,
$\|A_n - A\| \to 0$, $n \to \infty$) or collectively compact and pointwise convergent (i.e.,
$A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in X$). Show that the defect correction iteration

$$\varphi_{n,\nu+1} := (I - A_{n-1})^{-1}\{(A_n - A_{n-1})\varphi_{n,\nu} + f_n\}, \quad \nu = 0,1,2,\ldots,$$

using the preceding coarser level converges, provided that $n$ is sufficiently large.
Show that the defect correction iteration

$$\varphi_{n,\nu+1} := (I - A_0)^{-1}\{(A_n - A_0)\varphi_{n,\nu} + f_n\}, \quad \nu = 0,1,2,\ldots,$$

using the coarsest level converges, provided that the approximation $A_0$ is
sufficiently close to $A$.

12.19 Consider the two-grid iteration

$$\varphi_{n,\nu+1} := (I - A_m)^{-1}\{(A_n - A_m)\varphi_{n,\nu} + f_n\}, \quad \nu = 0,1,2,\ldots,$$

with $m = n-1$ or $m = 0$ for the Nyström method, i.e., for the numerical
quadrature operators

$$(A_n\varphi)(x) = \sum_{k=1}^{z_n} a_k^{(n)} K(x, x_k^{(n)})\,\varphi(x_k^{(n)}), \quad x \in [0,1],$$

with $z_n$ quadrature points. Show that each iteration step requires the following
computations. First

$$g_{n,\nu} := f_n + (A_n - A_m)\varphi_{n,\nu}$$

has to be evaluated at the $z_m$ quadrature points $x_j^{(m)}$, $j = 1, \ldots, z_m$, on the level
$m$ and at the $z_n$ quadrature points $x_j^{(n)}$, $j = 1, \ldots, z_n$, on the level $n$ by setting
$x = x_j^{(m)}$ and $x = x_j^{(n)}$, respectively, in

$$g_{n,\nu}(x) = f_n(x) + \sum_{k=1}^{z_n} a_k^{(n)} K(x, x_k^{(n)})\,\varphi_{n,\nu}(x_k^{(n)}) - \sum_{k=1}^{z_m} a_k^{(m)} K(x, x_k^{(m)})\,\varphi_{n,\nu}(x_k^{(m)}).$$

Then one has to solve the linear system

$$\varphi_{n,\nu+1}(x_j^{(m)}) - \sum_{k=1}^{z_m} a_k^{(m)} K(x_j^{(m)}, x_k^{(m)})\,\varphi_{n,\nu+1}(x_k^{(m)}) = g_{n,\nu}(x_j^{(m)}), \quad j = 1, \ldots, z_m,$$

for the values $\varphi_{n,\nu+1}(x_j^{(m)})$ at the $z_m$ quadrature points $x_j^{(m)}$. Finally, the values
at the $z_n$ quadrature points $x_j^{(n)}$, $j = 1, \ldots, z_n$, are obtained from the Nyström
interpolation

$$\varphi_{n,\nu+1}(x_j^{(n)}) = \sum_{k=1}^{z_m} a_k^{(m)} K(x_j^{(n)}, x_k^{(m)})\,\varphi_{n,\nu+1}(x_k^{(m)}) + g_{n,\nu}(x_j^{(n)}), \quad j = 1, \ldots, z_n.$$

Make an operation count for one step of the defect correction iteration. Set up
the corresponding equations for the collocation method.

12.20 Show that the integral equation (12.37) has a unique solution.
References

[1] Anderssen, R.S., de Hoog, F.R., and Lukas, M.A. The Application
and Numerical Solution of Integral Equations. Sijthoff and Noordhoff,
Alphen aan den Rijn 1980.

[2] Anselone, P.M. Collectively Compact Operator Approximation The-


ory and Applications to Integral Equations. Prentice-Hall, Englewood
Cliffs 1971.

[3] Atkinson, K.E. A Survey of Numerical Methods for the Solution of


Fredholm Integral Equations of the Second Kind. SIAM, Philadelphia
1976.

[4] Aubin, J.P. Approximation of Elliptic Boundary Value Problems.


John Wiley & Sons, New York 1972.

[5] Aubin, J.P. Applied Functional Analysis. John Wiley & Sons, New
York 1979.

[6] Baker, C.T.H. The Numerical Treatment of Integral Equations.


Clarendon Press, Oxford 1977.

[7] Ben-Israel, A. and Greville, T.N.E. Generalized Inverses: Theory and


Applications. John Wiley & Sons, New York 1974.

[8] Brandt, A. Multigrid adaptive solutions to boundary value problems,
Math. Comp. 31, 333-390 (1977).

[9] Brass, H. Quadraturverfahren. Vandenhoeck und Ruprecht, Göttingen
1979.

[10] Brosowski, B. and Kress, R. Einführung in die Numerische
Mathematik. Bibliographisches Institut, Mannheim 1975.

[11] Ciarlet, P.S. The Finite Element Method for Elliptic Problems. North
Holland, Amsterdam 1978.

[12] Coddington, E.A. and Levinson, N. Theory of Ordinary Differential


Equations. McGraw-Hill, New York 1955.

[13] Collatz, L. The Numerical Treatment of Differential Equations. 3rd


edition. Springer-Verlag, Berlin 1966.

[14] Colton, D. and Kress, R. Inverse Acoustic and Electromagnetic Scat-


tering Theory. 2nd edition. Springer-Verlag, Berlin 1998.

[15] Davis, P.J. On the numerical integration of periodic analytic func-


tions. In: Symposium on Numerical Approximation (R. Langer, ed.).
The University of Wisconsin Press, Madison, 45-59 (1959).

[16] Davis, P.J. Interpolation and Approximation. Blaisdell Publishing


Company, Waltham 1963.

[17] Davis, P.J. and Rabinowitz, P. Methods of Numerical Integration. 2nd


edition. Academic Press, San Diego 1984.

[18] De Boor, C. A Practical Guide to Splines. Springer-Verlag, New York


1978.

[19] Delves, L.M. and Mohamed, J.L. Computational Methods for Integral
Equations. Cambridge University Press, Cambridge 1985.

[20] Dennis, J.E. and Schnabel, R.B. Numerical Methods for Uncon-
strained Optimization and Nonlinear Equations. Prentice-Hall, En-
glewood Cliffs 1983.
[21] Engels, H. Numerical Quadrature and Cubature. Academic Press,
New York 1980.
[22] Engl, H.W., Hanke, M., and Neubauer, A. Regularization of Inverse
Problems. Kluwer Academic Publishers, Dordrecht 1996.

[23] Farin, G. Curves and Surfaces for Computer Aided Geometric Design.
A Practical Guide. 2nd edition. Academic Press, Boston 1990.

[24] Gilbarg, D. and Trudinger, N.S. Elliptic Partial Differential Equa-


tions of Second Order. Springer-Verlag, Berlin 1977.
References 319

[25] Golberg, M.A. and Chen, C.S. Discrete Projection Methods for Inte-
gral Equations. Computational Mechanics Publications, Southamp-
ton 1997.
[26] Golub, G. and Ortega, J.M. Scientific Computing. Academic Press,
Boston 1993.
[27] Golub, G. and van Loan, C. Matrix Computations. John Hopkins
University Press, Baltimore 1989.

[28] Groetsch, C.W. The Theory of Tikhonov Regularization for Fredholm


Equations of the First Kind. Pitman, Boston 1984.

[29] Hackbusch, W. Multi-Grid Methods and Applications. Springer-


Verlag, Berlin 1985.

[30] Hackbusch, W. Integral Equations: Theory and Numerical Treatment.


Birkhauser-Verlag, Basel 1995.
[31] Hadamard, J. Lectures on Cauchy's Problem in Linear Partial Dif-
ferential Equations. Yale University Press, New Haven 1923.

[32] Hairer, E., N0rsett, S.P., and Wanner, G. Solving Ordinary Differen-
tial Equations. Nonstiff Problems. Springer-Verlag, Berlin 1987.

[33] Henrici, P. Discrete Variable Methods in Ordinary Differential Equa-


tions. John Wiley & Sons, New York 1962.

[34] Heuser, H. Funktionalanalysis. 2. Auflage. Teubner, Stuttgart 1986.

[35] Kantorovic, L.V. and Akilov, G.P. Functional Analysis in Normed


Spaces. Pergamon Press, Oxford 1964.

[36] Keller, H.B. Numerical Methods for Two-Point Boundary Value


Problems. Blaisdell Publishing Company, Waltham 1968.

[37] Kirsch, A. An Introduction to the Mathematical Theory of Inverse


Problems. Springer-Verlag, New York 1996.

[38] Kress, REin ableitungsfreies Restglied fur die trigonometrische In-


terpolation periodischer analytischer Funktionen. Numer. Math. 16,
389-396 (1971).

[39] Kress, R Linear Integral Equations. Springer-Verlag, Berlin 1989.

[40] Kress, R A Nystrom method for boundary integral equations in do-


mains with corners. Numer. Math. 58, 145-161 (1990).
[41] Kress, R, de Vries, H.L., and Wegmann, R On nonnormal matrices.
Linear Algebra and its Appl. 8, 109-120 (1974).
320 References

[42] Lambert, J.D. Numerical Methods for Ordinary Differential Equa-


tions. The Initial Value Problem. John Wiley & Sons, Chichester
1993.

[43] Louis, A.K. Inverse und schlecht gestellte Probleme. Teubner, Stutt-
gart 1989.

[44] More, J.J. The Levenberg-Marquardt algorithm, implementation and


theory. In: Numerical Analysis (Watson, ed.). Springer-Verlag Lec-
ture Notes in Mathematics 630, Berlin, 105-116 (1977).

[45] Nussbaumer, H.J. Fast Fourier Transform and Convolution Algo-


rithms. Springer-Verlag, Berlin 1982.

[46] Ortega, J.M. and Poole, W.G. An Introduction to Numerical Methods


for Differential Equations. Pitman, Boston 1981.

[47] Ortega, J.M. and Rheinboldt, W.C. Iterative Solution of Nonlinear


Equations in Several Variables. Academic Press, New York 1970.

[48] Parlett, B.N. The Symmetric Eigenvalue Problem. Prentice-Hall, En-


glewood Cliffs 1980.

[49] Prossdorf, S. and Silbermann, B. Numerical Analysis for Integral


and Related Operator Equations. Akademie-Verlag, Berlin 1991, and
Birkhauser-Verlag, Basel 1991.

[50] Roberts, S.M. and Shipman, J.S. Two-Point Boundary Value Prob-
lems: Shooting Methods. Elsevier, New York 1972.

[51] Rudin, W. Functional Analysis. McGraw-Hill, New York 1973.

[52] Sag, T.W. and Szegeres, G. Numerical evaluation of high-dimensional


integrals. Math. Compo 18, 245-253 (1964).

[53] Schumaker, L.L. Spline Functions: Basic Theory. John Wiley & Sons,
Chichester 1981.

[54] Sidi, A. A new variable transformation for numerical integration. In:


Numerical Integration IV. (Brass, Hammerlin, eds.) International
Series of Numerical Mathematics. Birkhauser-Verlag Basel 112, 359-
373 (1993).

[55] Stetter, H.J. Analysis of Discretization Methods for Ordinary Differ-


ential Equations. Springer-Verlag, Berlin 1973.

[56] Stetter, H. J. The defect correction principle and discretization meth-


ods, Numer. Math. 29, 425-443 (1978).
References 321

[57] Stroud, A.H. Approximate Calculation of Multiple Integrals. Prentice-


Hall, Englewood Cliffs 1971.

[58] Takahasi, H. and Mori, M. Quadrature formulas obtained by variable


transformation. Numer. Math. 21, 206-219 (1973).
[59] Taylor, A.E. Introduction to Functional Analysis. John Wiley & Sons,
New York 1967.
[60] Treves, F. Basic Linear Partial Differential Equations. Academic
Press, New York 1975.
[61] Varga, R. Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs
1962.

[62] Watkins, D.S. Understanding the QR-Algorithm, SIAM Review 24,


(1982).

[63] Weissinger, J. Spiirlich besetzte Gleichungssysteme. Bibliographisches


Institut, Mannheim 1990.
[64] Wesseling, P. An Introduction to Multigrid Methods. John Wiley &
Sons, Chichester 1992.
[65] Wilkinson, J .H. The Algebraic Eigenvalue Problem. Clarendon Press,
Oxford 1965.
[66] Young, D. Iterative Solution of Large Linear Systems. Academic
Press, New York 1971.