STEP 3
ERRORS 3
Error propagation and generation
It has been noted that a number is to be represented by a finite
number of digits, and hence often by an approximation. It is to be
expected that the result of any arithmetic procedure (any algorithm)
involving a set of numbers will have an implicit error relating to the
error of the original numbers. One says that the initial errors
propagate through the computation. In addition, errors may be
generated at each step in an algorithm, and one speaks of the total
cumulative error at any step as the accumulated error.
Since one aims to produce results within some chosen limit of error, it is
useful to consider error propagation. Roughly speaking, based on
experience, the propagated error depends on the mathematical
algorithm chosen, whereas the generated error is more sensitive to
the actual ordering of the computational steps. It is possible to be more
precise, as described below.
Absolute error
The absolute error is the absolute difference between the exact
number x and the approximate number X, i.e.,

A = |x - X|.

A number correct to n decimal places has

A ≤ 0.5 × 10^-n.

We expect that the absolute error involved in any approximate number
is no more than five units at the first neglected digit.
Relative error
This error is the ratio of the absolute error to the absolute value of
the exact number, i.e.,

r = A/|x| = |x - X|/|x|, x ≠ 0.

In practice, r is estimated by |x - X|/|X|, with the upper bound

r ≤ A/(|X| - A).

(Note that the upper bound follows from the triangle inequality; thus
|x| = |X - (X - x)| ≥ |X| - A.)

A decimal number correct to n significant digits has

r ≤ 0.5 × 10^(1-n), approximately.
Error propagation
Consider two numbers X1 and X2, approximating the exact numbers x1
and x2, with absolute errors A1 = |x1 - X1| and A2 = |x2 - X2|.
Under the operations of addition or subtraction,

|(x1 ± x2) - (X1 ± X2)| ≤ A1 + A2.

The magnitude of the propagated error is therefore not more than the
sum of the initial absolute errors; of course, it may be zero.
Under the operation of multiplication:

r(X1 X2) ≈ r(X1) + r(X2).

The maximum relative error propagated is approximately the sum of
the initial relative errors. The same result is obtained when the
operation is division.
Error generation
Often (for example, in a computer) an operation ∘ is approximated by
an operation ∘*, say. Consequently, x ∘ y is represented by X ∘* Y.
Indeed, one has

|x ∘ y - X ∘* Y| ≤ |x ∘ y - X ∘ Y| + |X ∘ Y - X ∘* Y|,

so that the accumulated error does not exceed the sum of the
propagated and generated errors. Examples may be found in STEP 4.
Example
Evaluate (as accurately as possible) the expressions:
1. 3.45 + 4.87 - 5.16
2. 3.55 × 2.73
There are two methods which the student may consider: The first is to
invoke the concepts of absolute and relative error as defined above.
Thus, the result for 1. is 3.16 ± 0.015, since the maximum absolute
error is 0.005 + 0.005 + 0.005 = 0.015. We conclude that the answer is
3 (to 1S), for the number certainly lies between 3.145 and 3.175. In 2.,
the product 9.6915 is subject to the maximum relative error

r ≈ 0.005/3.55 + 0.005/2.73,

whence the maximum (absolute) error ≈ (2.73 + 3.55) × 0.005 ≈ 0.03, so
that the answer is 9.7.
A second approach is interval arithmetic. Thus, the approximate
number 3.45 represents a number in the interval (3.445, 3.455), etc.
Consequently, the result for 1. lies in the interval bounded below by

3.445 + 4.865 - 5.165 = 3.145

and above by

3.455 + 4.875 - 5.155 = 3.175.

Similarly, in 2., the result lies in the interval bounded below by

3.545 × 2.725 = 9.660125

and above by

3.555 × 2.735 = 9.722925,

whence once again the approximate numbers 3 and 9.7 correctly
represent the respective results to 1. and 2.
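Readers with access to a computer can mechanize this interval check. The
following sketch (in Python; the helper functions are ours, not from any
standard library) propagates lower and upper bounds through both
calculations; binary floating point may show tiny representation
artifacts in the printed bounds.

def interval(x, err=0.005):
    # An approximate number X, correct to 2D, represents some exact
    # value in [X - 0.005, X + 0.005].
    return (x - err, x + err)

def add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def mul(a, b):
    # General rule: take the extremes over all four endpoint products.
    p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(p), max(p))

print(sub(add(interval(3.45), interval(4.87)), interval(5.16)))
# -> about (3.145, 3.175): only "3" is certain (1S)
print(mul(interval(3.55), interval(2.73)))
# -> about (9.660125, 9.722925): 9.7 is correct to 2S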
Checkpoint
1. What distinguishes propagated and generated errors?
2. How is the propagated error determined for the operations of
addition (subtraction) and multiplication (division)?
EXERCISES
Evaluate the following operations as accurately as possible, assuming
all values to the number of digits given:
1. 8.24 + 5.33.
2. 124.53 - 124.52.
3. 4.27 × 3.13.
4. 9.48 × 0.513 - 6.72.
5. 0.25 × 2.84/0.64.
6. 1.73 - 2.16 + 0.08 + 1.00 - 2.23 - 0.97 + 3.02.
STEP 4
ERRORS 4
Interval arithmetic
In STEP 2, floating point representation was introduced as a convenient
way of dealing with large or small numbers. Since most scientific
computations involve such numbers, many students will be familiar
with floating point arithmetic and will appreciate the way in which it
facilitates calculations involving multiplication or division.
In order to investigate the implications of finite number
representation, one must examine the way in which arithmetic is
carried out with floating point numbers. The following specifications
apply to most computers which round, and are easily adapted to those
which chop. For the sake of simplicity in the examples, we will use a
three-digit decimal mantissa normalized to lie in the range [1, 10)
(most computers use binary representation and the mantissa is
commonly normalized to lie in the range [1/2, 1)). Note that up to six
digits are used for intermediate results, but the final result of each
operation is a normalized three-digit decimal floating point number.
Addition and subtraction
Mantissae are added or subtracted (after shifting the mantissa and
increasing the exponent of the smaller number, if necessary, to make
the exponents agree); the final normalized result is obtained by
rounding (after shifting the mantissa and adjusting the exponent, if
necessary). Thus:
3.12 × 10^1 + 4.26 × 10^1 = 7.38 × 10^1,
2.77 × 10^2 + 7.55 × 10^2 = 10.32 × 10^2 → 1.03 × 10^3,
6.18 × 10^1 + 1.84 × 10^-1 = 6.18 × 10^1 + 0.0184 × 10^1 = 6.1984 × 10^1 → 6.20 × 10^1,
3.65 × 10^-1 - 2.78 × 10^-1 = 0.87 × 10^-1 → 8.70 × 10^-2.
Multiplication
The exponents are added and the mantissae are multiplied; the final
result is obtained by rounding (after shifting the mantissa right and
increasing the exponent by 1, if necessary). Thus:
(4.27 × 10^1) × (3.68 × 10^1) = 15.7136 × 10^2 → 1.57 × 10^3,
(2.73 × 10^2) × (-3.64 × 10^-2) = -9.9372 × 10^0 → -9.94 × 10^0.
Division
The exponents are subtracted and the mantissae are divided; the
final result is obtained by rounding (after shifting the mantissa
left and reducing the exponent by 1, if necessary). Thus:
(5.43 × 10^1) / (4.55 × 10^2) = 1.19340... × 10^-1 → 1.19 × 10^-1,
(-2.75 × 10^2) / (9.87 × 10^-2) = -0.278622... × 10^4 → -2.79 × 10^3.
Expressions
The order of evaluation is determined in a standard way and the
result of each operation is a normalized floating point number.
Thus:
(6.18 × 10^1 + 1.84 × 10^-1) / ((4.27 × 10^1) × (3.68 × 10^1))
→ (6.20 × 10^1) / (1.57 × 10^3) = 3.94904... × 10^-2 → 3.95 × 10^-2.
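The rounding arithmetic of this Step is easily imitated. A minimal
Python sketch, assuming round-to-nearest on a three-digit decimal
mantissa (Python's round() resolves exact ties to even, which does not
affect these examples; printed values may show small binary
representation artifacts):

import math

def fl3(x, digits=3):
    # Round x to a normalized decimal mantissa of the given length.
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x)))   # exponent of the leading digit
    return round(x / 10**e, digits - 1) * 10**e

def fadd(x, y): return fl3(x + y)        # rounded floating point addition
def fmul(x, y): return fl3(x * y)        # rounded multiplication
def fdiv(x, y): return fl3(x / y)        # rounded division

print(fadd(2.77e2, 7.55e2))   # 1030.0, i.e. 1.03 x 10^3
print(fmul(4.27e1, 3.68e1))   # 1570.0, i.e. 1.57 x 10^3
print(fdiv(fadd(6.18e1, 1.84e-1), fmul(4.27e1, 3.68e1)))
# 0.0395, i.e. 3.95 x 10^-2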
Generated error
Note that all the above examples (except the subtraction and the
first addition) involve generated errors which are relatively large
due to the small number of digits in the mantissae. Thus the
generated error in
2.77 × 10^2 + 7.55 × 10^2 = 10.32 × 10^2 → 1.03 × 10^3
is 0.002 × 10^3. Since the propagated error in this example may be
as large as 0.01 × 10^2 (assuming the operands are correct to 3S),
one can use the result given in Error propagation to deduce that
the accumulated error cannot exceed 0.002 × 10^3 + 0.01 × 10^2 =
0.003 × 10^3.
Consequences
The peculiarities of floating point arithmetic lead to some
unexpected and unfortunate consequences, including the
following:
1. Addition or subtraction of a small (but nonzero) number
may have no effect, for example
5.18 × 10^2 + 4.37 × 10^-1 = 5.18 × 10^2 + 0.00437 × 10^2 = 5.18437 × 10^2
→ 5.18 × 10^2,
whence the additive identity is not unique.
2. Frequently, the result of a × (1/a) is not 1; for example, if a =
3.00 × 10^0, then
1/a is 3.33 × 10^-1
and
a × (1/a) is 9.99 × 10^-1,
whence the multiplicative inverse may not exist.
3. The result of (a + b) + c is not always the same as the
result of a + (b + c); for example, if
a = 6.31x101, b = 4.24x100, c = 2.47x10-1,
then
(a + b) + c = (6.31 × 10^1 + 0.424 × 10^1) + 2.47 × 10^-1 → 6.73 × 10^1 +
0.0247 × 10^1 → 6.75 × 10^1,
whereas
a + (b + c) = 6.31 × 10^1 + (4.24 × 10^0 + 2.47 × 10^-1) → 6.31 × 10^1 +
4.49 × 10^0 = 6.31 × 10^1 + 0.449 × 10^1 → 6.76 × 10^1,
whence the associative law for addition does not
always apply.
Examples involving adding many numbers of varying size
indicate that adding in order of increasing magnitude is
preferable to adding in the reverse order.
4. Subtracting a number from another nearly equal number
may result in loss of significance or cancellation
errors. In order to illustrate this loss of accuracy, suppose
that we evaluate f(x) = 1 - cos x for x=0.05, using three-
digit, decimal, normalized, floating point arithmetic with
rounding. Then
1 - cos(0.05) = 1 - 0.99875 → 1.00 × 10^0 - 0.999 × 10^0 → 1.00 × 10^-3.
Although the value of 1 is exact and cos(0.05) is correct to
3S, when expressed as a three-digit floating point number,
their computed difference is correct to only 1S! (The two
zeros after the decimal point in 1.00 × 10^-3 pad the number.)
The approximation 0.999 ≈ cos(0.05) has a relative error of
about 2.5 × 10^-4. By comparison, the relative error of the
approximation 1.00 × 10^-3 ≈ 1 - cos(0.05) is about 0.2, i.e., it is
much larger. Thus,
subtraction of two nearly equal numbers should be avoided
whenever possible.
In the case of f(x)= 1 - cos x, one can avoid loss of
significant digits by writing

1 - cos x = 2 sin^2(x/2).

This last formula is more suitable for calculations when x is
close to 0. It can be verified that the more accurate
approximation of 1.25 × 10^-3 is obtained for 1 - cos(0.05)
when three-digit floating point arithmetic is used, as the
sketch below confirms.
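A rough Python sketch of this cancellation example, using the same
three-digit simulation as in the sketch earlier in this Step:

import math

def fl3(x):
    # Keep a 3-digit normalized decimal mantissa.
    if x == 0.0:
        return 0.0
    e = math.floor(math.log10(abs(x)))
    return round(x / 10**e, 2) * 10**e

x = 0.05
naive = fl3(fl3(1.0) - fl3(math.cos(x)))   # 1.00 - 0.999 -> 1.00 x 10^-3
s = fl3(math.sin(x / 2))                   # 2.50 x 10^-2
better = fl3(2.0 * fl3(s * s))             # 2 sin^2(x/2) -> 1.25 x 10^-3
print(naive, better, 1.0 - math.cos(x))    # exact value is 1.2497... x 10^-3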
Checkpoint
1. Why is it sometimes necessary to shift the mantissa and
adjust the exponent of a floating point number?
2. Does floating point arithmetic obey the usual laws of
arithmetic?
3. Why should the subtraction of two nearly equal
numbers be avoided?
EXERCISES
Evaluate the following expressions, using three-digit
decimal normalized floating point arithmetic with rounding:
1. 6.19 × 10^2 + 5.82 × 10^2,
2. 6.19 × 10^2 + 3.61 × 10^1,
3. 6.19 × 10^2 - 5.82 × 10^2,
4. 6.19 × 10^2 - 3.61 × 10^1,
5. (3.60 × 10^3) × (1.01 × 10^-1),
6. (-7.50 × 10^-1) × (-4.44 × 10^1),
7. (6.45 × 10^2) / (5.16 × 10^-1),
8. (-2.86 × 10^-2) / (3.29 × 10^3).
9. Estimate the accumulated errors in the results of
Exercise 1, assuming that all values are correct to 3S.
10. Evaluate the following expressions, using four-digit
decimal normalized floating point arithmetic with
rounding, then recalculate them, carrying all decimal
places, and estimate the propagated error.
a) Given a = 6.842 × 10^-1, b = 5.685 × 10^1, c = 5.641 × 10^1,
find a(b - c) and ab - ac.
b) Given a = 9.812 × 10^1, b = 4.631 × 10^-1, c = 8.340 × 10^-1,
find (a + b) + c and a + (b + c).
c) Use four-digit decimal normalized floating point
arithmetic with rounding to calculate f(x) = tan x - sin x
for x = 0.1. Since
tan x - sin x = tan x (1 - cos x) = tan x (2 sin^2(x/2)),
f(x) may be written as f(x) = 2 tan x sin^2(x/2).
Repeat the calculation using this alternative
expression. Which of the two values is more
accurate?
STEP 5
ERRORS 5
Approximation to functions
An important procedure in Analysis is to represent a given function as
an infinite series of terms involving simpler or otherwise more
appropriate functions. Thus, if f is the given function, it may be
represented as the series expansion

f(x) = a_0 φ_0(x) + a_1 φ_1(x) + a_2 φ_2(x) + ···,

involving the set of functions {φ_i}. Mathematicians have spent a lot of
effort in discussing the convergence of series, i.e., on defining
conditions for which the partial sum

f_n(x) = a_0 φ_0(x) + a_1 φ_1(x) + ··· + a_n φ_n(x)

approximates the function value f(x) ever more closely as n increases.
In Numerical Analysis, we are primarily concerned with such
convergent series; computation of the sequence of partial sums is an
approximation process in which the truncation error may be made as
small as we please by taking a sufficient number of terms into account.
1. Taylor series
The most important expansion to represent a function is the
Taylor series. If f is suitably smooth in the neighbourhood of
some chosen point x0,

f(x0 + h) = f(x0) + h f'(x0) + (h^2/2!) f''(x0) + ··· + (h^n/n!) f^(n)(x0) + R_n,

where h = x - x0 denotes the displacement from x0 to the point x in
the neighbourhood, and the remainder term is

R_n = (h^(n+1)/(n+1)!) f^(n+1)(ξ)

for some point ξ between x0 and x. (This is known as the
Lagrange form of the remainder; its derivation may be found
in Section 8.7 of Thomas and Finney (1992), cited in the
Bibliography.) Note that the ξ in this expression for R_n may be
written as

ξ = x0 + θh for some θ with 0 < θ < 1.
The Taylor expansion converges for x within some range
including the point x0, a range which lies within the
neighbourhood of x0 mentioned above. Within this range of
convergence, the truncation error due to discarding terms
after the h^n term (equal to the value of R_n at the point x) can be
made smaller in magnitude than any positive constant by
choosing n sufficiently large. In other words, by using Rn to decide
how many terms are needed, one may evaluate a function at any
point in the range of convergence as accurately as the
accumulation of round- off error permits.
From the point of view of the numerical analyst, it is most
important that the convergence be fast enough. For example, if
we consider f(x) = sin x, we have

f'(x) = cos x, f''(x) = -sin x, f'''(x) = -cos x, f''''(x) = sin x, etc.,

and the expansion (about x0 = 0) for n = 2k - 1 is given by

sin x = x - x^3/3! + x^5/5! - ··· + (-1)^(k-1) x^(2k-1)/(2k-1)! + R_(2k-1),

with

R_(2k-1) = (-1)^k (x^(2k+1)/(2k+1)!) cos θx, 0 < θ < 1.
Note that this expansion has only odd-powered terms; although
the polynomial approximation is of degree (2k - 1), it has only k
terms. Moreover, the absence of even-powered terms means that
the same polynomial approximation is obtained with n = 2k, and
hence R_(2k-1) = R_2k; the remainder term R_(2k-1) given above is actually
the expression for R_2k. Since then

|R_(2k-1)| = |R_2k| ≤ |x|^(2k+1)/(2k+1)!,

if 5D accuracy is required, it follows that we need only take k = 2
at x = 0.1, and k = 4 at x = 1 (since 9! = 362 880). On the other
hand, the expansion for the natural (base e) logarithm,

ln(1 + x) = x - x^2/2 + x^3/3 - ··· + (-1)^(n-1) x^n/n + R_n,

where

|R_n| ≤ x^(n+1)/(n+1) for 0 ≤ x ≤ 1,

is less suitable. Although only n = 4 terms are needed to give 5D
accuracy at x = 0.1, n = 13 is required for 5D accuracy at x = 0.5,
and n = 19 gives just 1D accuracy at x = 1!
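These term counts are easily verified numerically. A short Python
sketch that increases the number of terms until the remainder bounds
quoted above fall below 0.5 × 10^-5 (the 5D requirement):

import math

def sin_terms(x, tol=0.5e-5):
    # Smallest k with |x|^(2k+1)/(2k+1)! < tol (bound on R_2k for sin).
    k = 1
    while abs(x)**(2*k + 1) / math.factorial(2*k + 1) >= tol:
        k += 1
    return k

def log_terms(x, tol=0.5e-5):
    # Smallest n with x^(n+1)/(n+1) < tol (bound on R_n for ln(1+x)).
    n = 1
    while x**(n + 1) / (n + 1) >= tol:
        n += 1
    return n

print(sin_terms(0.1), sin_terms(1.0))   # 2 4
print(log_terms(0.1), log_terms(0.5))   # 4 13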
Moreover, we remark that the Taylor series is not only used
extensively to represent functions numerically, but also to
analyse the errors involved in various algorithms (for example,
STEPS 8, 9, 10, 30 and 31).
Polynomial approximation
The Taylor series provides a simple method of polynomial
approximation (of chosen degree n),

f(x0 + h) ≈ f(x0) + h f'(x0) + (h^2/2!) f''(x0) + ··· + (h^n/n!) f^(n)(x0),

which is basic to the later discussion of various elementary
numerical procedures. Because f is often complicated, one may
prefer to execute operations such as differentiation and
integration of a polynomial approximation. Interpolation
formulae in STEPS 22 and 23 may also be used to construct
polynomial approximations.
2. Other series expansions
There are many other series expansions, such as the Fourier
series (in terms of sines and cosines), or those involving various
orthogonal functions (Legendre polynomials, Chebyshev
polynomials, Bessel functions, etc.). From the numerical
standpoint, truncated Fourier series and Chebyshev polynomial
series have proven to be most useful. Fourier series are
appropriate in treating functions with natural periodicities,
while Chebyshev series provide the most rapid convergence
of all known polynomial based approximations.
Occasionally, it is possible to represent a function adequately
(from the numerical standpoint) by truncating a series which
does not converge in the mathematical sense. For example,
solutions are sometimes obtained in the form of asymptotic
series with leading terms which provide sufficiently accurate
numerical results.
While we confine our attention in this book to truncated Taylor
series, the interested reader should be aware that such
alternative expansions exist (see, for example, Burden and Faires
(1993)).
3. Recursive procedures
While a truncated series with few terms may be a practical way
to compute values of a function, there is a number of arithmetic
operations involved, and it may be preferable to employ recursive
procedures, when available, which reduce the arithmetic
required. For example, the values of the polynomial

P(x) = a_0 + a_1 x + a_2 x^2 + ··· + a_n x^n

and its derivative

P'(x) = a_1 + 2 a_2 x + ··· + n a_n x^(n-1)

for x = t may be generated recursively by Horner's scheme:

b_n = a_n,    b_k = a_k + t b_(k+1),    k = n-1, n-2, ..., 0;
c_n = b_n,    c_k = b_k + t c_(k+1),    k = n-1, n-2, ..., 1.

Thus, for successive values of k, one has P(t) = b_0 and P'(t) = c_1.
The technique just described is also known as nested
multiplication. (Perhaps the student may want to suggest a
recursive procedure for higher derivatives of P.)
Finally, it should be noted that it is common to generate
members of a set of orthogonal functions recursively.
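A possible Python rendering of Horner's scheme; it generates the b's
and c's in a single pass (taking the coefficients highest degree first
is a choice of this sketch):

def horner(coeffs, t):
    # coeffs = [a_n, ..., a_1, a_0]; returns (P(t), P'(t)).
    b = coeffs[0]          # b_n = a_n
    c = 0.0                # becomes c_n = b_n on the first pass
    for a in coeffs[1:]:
        c = b + t * c      # c_k = b_k + t c_(k+1)
        b = a + t * b      # b_k = a_k + t b_(k+1)
    return b, c            # b_0 = P(t), c_1 = P'(t)

# Exercise 5 below: P(x) = x^3 - 2x^2 + 2x + 3 at x = 3.1.
print(horner([1.0, -2.0, 2.0, 3.0], 3.1))
# (19.771, 18.43), up to float representation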
Checkpoint
1. How do numerical analysts use the remainder term Rn in Taylor
series?
2. Why is the rate of convergence so important from the
numerical standpoint?
3. From the numerical standpoint, is it essential for a series
representation to converge in the mathematical sense?
EXERCISES
1. Find the Taylor series expansions about x = 0 for each of the
functions:
1. cos x.
2. 1/(1 - x).
3. e^x.
Determine also for each series a general remainder term.
2. For each of the functions in 1. evaluate f(0.5), using a calculator
and the first four terms of the Taylor expansions.
3. Use the remainder term found in 1.3 to find the value of n
required in the Taylor series for f(x) = e^x about x = 0 to give 5D
accuracy for all x between 0 and 1.
4. Truncate the Taylor series found in 1.3 to give linear, quadratic,
and cubic polynomial approximations for f(x) = e^x in the
neighbourhood of x = 0. Use the remainder term to estimate (to
the nearest 0.1) the range over which each polynomial
approximation yields results correct to 2D.
5. Evaluate P(3.1) and P'(3.1), where P(x) = x^3 - 2x^2 + 2x + 3, using the
technique of nested multiplication.
6. Evaluate P(2.6) and P'(2.6), where P(x) = 2x^4 - x^3 + 3x^2 + 5, using
nested multiplication. Check your values using a calculator.
STEP 6
NON-LINEAR EQUATIONS 1
Non-linear algebraic and transcendental equations
The first non-linear equation encountered in algebra courses is usually
the quadratic equation

ax^2 + bx + c = 0, a ≠ 0,

and all students will be familiar with the formula for its roots:

x = (-b ± √(b^2 - 4ac)) / (2a).
The formula for the roots of a general cubic is somewhat more
complicated and that for a general quartic usually takes several
pages to describe! We are spared further effort by a theorem which
states that there is no such formula for general polynomials of
degree higher than four. Accordingly, except in special cases (for
example, when factorization is easy), we prefer in practice to use a
numerical method to solve polynomial equations of degree higher than
two.
Another class of non-linear equations consists of those which involve
transcendental functions, such as the exponential, logarithmic, and
trigonometric functions. Useful analytic solutions of such equations
are rare, so that we are usually forced to use numerical methods.
1. A transcendental equation
We shall use a simple mathematical problem to show that
transcendental equations do arise quite naturally. Consider the
height of a liquid in a cylindrical tank of radius r and horizontal
axis, when the tank is a quarter full (see Figure 2). Denote the
height of the liquid by h (DB in the diagram). The condition to be
satisfied is that the area of the segment ABC should be 1/4 of the
area of the circle. This task reduces to

2(S - T) = (1/4) π r^2,

where S = (1/2) r^2 θ is the area of the sector OAB, T = (1/2) r^2 sin θ cos θ
the area of the triangle OAD, and θ denotes the angle AOB. Hence

2θ - sin 2θ = π/2.

When we have solved this transcendental equation for θ,
we obtain h from

h = r(1 - cos θ).
FIGURE 2.
Cylindrical tank (cross-section).
2. Locating roots
Let it be required to find some or all of the roots of the non-linear
equation f(x) = 0. Before we use a numerical method (STEPS 7, 8,
9 and 10), we should have some idea about the number,
nature and approximate location of the roots. The usual approach
involves the construction of graphs and perhaps a table of
values of the function f, in order to confirm the information
obtained from the graph.
We will now illustrate this approach by a few
examples.
a)
If we do not have a calculator or computer
available to immediately plot the graph of
f(x) = sin x - x + 0.5,
we can separate f into two parts, sketch two curves on a single
set of axes, and find out whether they intersect. Thus we sketch
y = sin x and y = x - 0.5. Since |sin x| ≤ 1, we are only interested in
the interval -0.5 ≤ x ≤ 1.5 (outside which |x - 0.5| > 1). Thus we
deduce from Fig. 3 that the equation has only one real root, near
x = 1.5, and tabulate as follows:

x       1.4      1.49      1.50
f(x)    0.0854   0.0067   -0.0025

We now know that the root lies between 1.49 and 1.50, and we
can use a numerical method to obtain a more accurate answer, as
is discussed in later Steps.
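A tabulation of this kind is trivially mechanized; a Python sketch:

import math

def f(x):
    return math.sin(x) - x + 0.5

for x in (1.4, 1.49, 1.50):
    print(f"{x:4.2f}  {f(x):+.4f}")
# 1.40 +0.0854, 1.49 +0.0067, 1.50 -0.0025: the sign change
# confirms a root between 1.49 and 1.50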
b)
Again, consider f(x) = e^(-x/5) - x(x - 2)(x - 3) = 0; we sketch the two
curves y = e^(-x/5) and y = x(x - 2)(x - 3).
In order to sketch the second curve, we use the three obvious
zeros at x = 0, 2, and 3, as well as the knowledge that x(x - 2)(x -
3) is negative for x < 0 and 2 < x < 3, but positive and increasing
steadily for x > 3. We deduce from the graph (Fig. 4) that there
are three real roots, near x = 0.2, 1.8, and 3.1, and tabulate as
follows (with f(x) = e^(-x/5) - x(x - 2)(x - 3)):

x       0.15      0.2       1.6       1.8      3.1       3.2
f(x)    0.1796   -0.0472   -0.1699    0.2657   0.1969   -0.2407
We conclude that the roots lie between 0.15 and 0.2, 1.6 and 1.8,
and 3.1 and 3.2, respectively. Note that the values in the table
were calculated to an accuracy of at least 5S. For example,
working to 5S accuracy, we have f(0.15) = 0.97045 - 0.79088 = 0.17957,
which is then rounded to 0.1796. Thus the entry in the table for
f(0.15) is 0.1796 and not 0.1795, as one might expect from
calculating 0.9704 - 0.7909.
Checkpoint
1. Why are numerical methods used in solving nonlinear
equations?
2. How does a transcendental equation differ from an
algebraic equation?
3. What kind of information is used when sketching curves
for the location of roots?
EXERCISES
1. Locate the roots of the equation x + cos x = 0.
2. Use curve sketching to roughly locate all the roots of the
equations:
a) x + 2 cos x = 0.
b) x + e^x = 0.
c) x(x - 1) - e^x = 0.
d) x(x - 1) - sin x = 0.
STEP 7
NONLINEAR EQUATIONS 2
The bisection method
The bisection method, suitable for implementation on a computer (cf.
the pseudo-code), allows one to find the roots of the equation f(x) = 0,
based on the following theorem:
Theorem: If f is continuous for x between a and b and if f (a) and f(b)
have opposite signs, then there exists at least one real root of f (x) = 0
between a and b.
1. Procedure: Suppose that a continuous function f is negative at x
= a and positive at
x = b, so that there is at least one real root between a and b. (As
a rule, a and b may be found from a graph of f.) If we calculate f
((a + b)/2), which is the function value at the point of bisection of
the interval a < x < b, there are three possibilities:
1. f ((a + b)/2) = 0, in which case (a + b)/2 is the root;
2. f ((a + b)/2) <0, in which case the root lies between (a +
b)/2 and b;
3. f ((a + b)/2) > 0, in which case the root lies between a and
(a + b)/2.
Presuming there is just one root, in Case 1 the process is
terminated. In either Case 2 or Case 3, the process of bisection
of the interval containing the root can be repeated until the root
is obtained to the desired accuracy. In Figure 5, the successive
points of bisection are denoted by x1 , x2, and x3.
2. Effectiveness: The bisection method is almost certain to give a
root, provided the conditions of the above theorem hold. It can
only fail if the accumulated error in the calculation of f at a
bisection point gives it a small negative value when actually it
should have a small positive value (or vice versa); the interval
subsequently chosen would then be wrong. This can be
overcome by working to sufficient accuracy, and this almost-
assured convergence is not true of many other methods of
finding roots.
One drawback of the bisection method is that it applies only to
roots of f about which f (x) changes sign. In particular, double
roots can be overlooked; one should be careful to examine f(x) in
any range where it is small, so that repeated roots about which f
(x) does not change sign are otherwise evaluated (for example,
see Steps 9 and 10). Of course, such a close examination also
avoids another nearby root being overlooked.
Finally, note that bisection is rather slow; after n iterations the
interval containing the root is of length (b - a)/2n. However,
provided values of f can be generated readily, as when a
computer is used, the rather large number of iterations which
can be involved in the application of bisection, is of relatively
little consequence.
3. Example
Solve 3xe^x = 1 to three decimal places by the bisection method.
Consider f(x) = 3x - e^(-x), which changes sign in the interval 0.25 <
x < 0.27; one tabulates (working to 4D) as follows:
(Ascertain graphically that there is just one root!)
Denote the lower and upper endpoints of the interval bracketing
the root at the n-th iteration by an and bn, respectively (with a1 =
0.25 and b1 = 0.27). Then the approximation to the root at the n-
th iteration is given by xn = (an + bn)/2. Since the root is either in
[an, xn] or [xn, bn] and both intervals are of length (bn - an)/2, we
see that xn will be accurate to three decimal places when (bn -
an)/2 < 5 × 10^-4. Proceeding to bisection:
(Note that the values in the table are displayed to only 4D.)
Hence the root accurate to three decimal places is 0.258.
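The procedure translates directly into a short program. A Python
sketch, with the stopping rule (b_n - a_n)/2 < 5 × 10^-4 used above:

import math

def bisect(f, a, b, tol=5e-4):
    # Assumes f(a) and f(b) have opposite signs.
    fa = f(a)
    while (b - a) / 2.0 >= tol:
        x = (a + b) / 2.0
        fx = f(x)
        if fx == 0.0:
            return x                 # Case 1: x is the root
        if fa * fx < 0.0:
            b = x                    # root lies in [a, x]
        else:
            a, fa = x, fx            # root lies in [x, b]
    return (a + b) / 2.0

print(round(bisect(lambda x: 3.0*x - math.exp(-x), 0.25, 0.27), 3))
# 0.258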
Checkpoint
1. When may the bisection method be used to find a root of the
equation f(x) = 0?
2. What are the three possible choices after a bisection value is
calculated?
3. What is the maximum error after n iterations of the bisection
method?
EXERCISES
1. Use the bisection method to find the root of the equation
x + cos x = 0,
correct to two decimal places (2D).
2. Use the bisection method to find to 3D the positive root of the
equation
x - 0.2 sin x - 0.5 = 0.
3. Each equation in Exercises 2(a)-2(c) of Step 6 has only one root.
For each equation use the bisection method to find the root
correct to 2D.
STEP 8
NONLINEAR EQUATIONS 3
Method of false position
As mentioned in the Prologue, the method of false position dates back
to the ancient Egyptians. It remains an effective alternative to the
bisection method for solving the equation f(x) = 0 for a real root
between a and b, given that f (x) is continuous and f (a) and f(b) have
opposite signs. The algorithm is suitable for automatic computation
(cf. the pseudo-code).
1. PROCEDURE
The curve y = f(x) is not generally a straight line. However, one may
join the points (a, f(a)) and (b, f(b)) by the straight line

y - f(a) = [(f(b) - f(a))/(b - a)] (x - a).

This straight line cuts the x-axis at (X, 0), where

0 - f(a) = [(f(b) - f(a))/(b - a)] (X - a),

so that

X = a - f(a)(b - a)/(f(b) - f(a)) = (a f(b) - b f(a))/(f(b) - f(a)).

Suppose that f(a) is negative and f(b) is positive. As in the bisection
method, there are three possibilities:
1. f(X) = 0, in which case X is the root;
2. f(X) < 0, in which case the root lies between X and b;
3. f(X) > 0, in which case the root lies between X and a.
Again, in Case 1, the process is terminated; in either Case 2 or Case 3,
the process can be repeated until the root is obtained to the desired
accuracy. In Fig. 6, the successive points where the straight lines cut
the axis are denoted by x1, x2, x3.
2. EFFECTIVENESS AND THE SECANT METHOD
Like the bisection method, the method of false position has
almost assured convergence, and it may converge to a root faster.
However, it may happen that most or all the calculated values of X are
on the same side of the root, in which case convergence may be slow
(Fig. 7). This is avoided in the secant method, which resembles the
method of false position except that no attempt is made to ensure that
the root is enclosed; starting with two approximations to the root (x0,
x1), further approximations x2, x3, ... are computed from

x_(n+1) = x_n - f(x_n)(x_n - x_(n-1))/(f(x_n) - f(x_(n-1))), n = 1, 2, ....

Convergence is no longer assured, but the process is simpler (the sign
of f(x_(n+1)) is not tested) and often converges faster.
With respect to speed of convergence of the secant method, one has at
the (n+1)-th step, writing ξ for the root and e_n = x_n - ξ for the error:

e_(n+1) = e_n - f(x_n)(e_n - e_(n-1))/(f(x_n) - f(x_(n-1))).

Hence, expanding f(x_n) = f(ξ + e_n) and f(x_(n-1)) = f(ξ + e_(n-1)) in
terms of the Taylor series,

e_(n+1) ≈ (f''(ξ)/2f'(ξ)) e_n e_(n-1),

where we have used the fact that f(ξ) = 0. Thus we see that e_(n+1) is
proportional to e_n e_(n-1), which may be expressed in mathematical
notation as

|e_(n+1)| ∝ |e_n| |e_(n-1)|.

We seek k such that |e_(n+1)| ∝ |e_n|^k; then also |e_n| ∝ |e_(n-1)|^k,
and consistency requires k^2 = k + 1, whence k = (1 + √5)/2 ≈ 1.618.
Hence the speed of convergence is faster than linear (k = 1), but slower
than quadratic (k = 2). This rate of convergence is sometimes referred
to as superlinear convergence.
3. EXAMPLE
Use the method of false position to solve 3xe^x = 1, i.e., f(x) = 3x -
e^(-x) = 0, taking a = 0.25 and b = 0.27 as in STEP 7. Then

x1 = (a f(b) - b f(a))/(f(b) - f(a)) = 0.257637,

and

f(x1) = f(0.257637) = 3 × 0.257637 - e^(-0.257637) = 0.772911 - 0.772875 =
0.000036.

The student may verify that doing one more iteration of the method of
false position yields an estimate x2 = 0.257628, for which the function
value is less than 5 × 10^-6 in magnitude. Since x1 and x2 agree to 4D,
we conclude that the root is 0.2576, correct to 4D.
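Both methods are easily programmed. A Python sketch of the method of
false position, together with the secant variant (which, as noted,
tests no signs, and would fail with a zero division if f(x_n) =
f(x_(n-1))):

import math

def false_position(f, a, b, tol=5e-6):
    # Assumes f(a) and f(b) have opposite signs; stops when |f(X)| < tol.
    fa, fb = f(a), f(b)
    while True:
        x = (a * fb - b * fa) / (fb - fa)   # chord's x-intercept
        fx = f(x)
        if abs(fx) < tol:
            return x
        if fa * fx < 0.0:
            b, fb = x, fx                   # root between a and X
        else:
            a, fa = x, fx                   # root between X and b

def secant(f, x0, x1, tol=5e-6):
    while abs(f(x1)) >= tol:
        x0, x1 = x1, x1 - f(x1) * (x1 - x0) / (f(x1) - f(x0))
    return x1

f = lambda x: 3.0*x - math.exp(-x)
print(false_position(f, 0.25, 0.27), secant(f, 0.25, 0.27))
# both about 0.25763, i.e. 0.2576 to 4D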
Checkpoint
1. When may the method of false position be used to find a root
of the equation f(x) = 0?
2. On what geometric construction is the method of false
position based?
EXERCISES
1. Use the method of false position to find the smallest root of the
equation f(x) = 2 sin x + x - 2 = 0, stopping when |f(X)| is suitably small.
2. Compare the results obtained when you use
a) the bisection method,
b) the method of false position, and
c) the secant method
with starting values 0.7 and 0.9 to solve the equation
3 sin x = x + 1/x.
3. Use the method of false position to find the root of the
equation
4. Each equation in Exercises 2(a)-2(c) of STEP 6 has only one
root. Use the method of false position to find each root,
stopping when |f(X)| is suitably small.
STEP 9
NONLINEAR EQUATIONS 4
The method of simple iteration
The method of simple iteration involves writing the equation f(x) = 0 in
the form x = φ(x), suitable for the construction of a sequence of
approximations to some root in a repetitive fashion.
1. Procedure
The iteration procedure is as follows: In some way, we obtain a rough
approximation x0 of the desired root, which may then be substituted
into the right-hand side to give a new approximation, x1 = φ(x0). The
new approximation is again substituted into the right-hand side to give
a further approximation x2 = φ(x1), etc., until (hopefully) a
sufficiently accurate approximation to the root is obtained. This
repetitive process, based on x_(n+1) = φ(x_n), is called simple
iteration; provided that |x_(n+1) - x_n| decreases as n increases, the
process tends to the limit ξ = φ(ξ), where ξ denotes the root.
2. Example
Use simple iteration to find the root of the
equation
3xex = 1
to an accuracy of 4D.
One first writes
x = e^(-x)/3 = φ(x).
Assuming x0 = 1, successive iterations yield
x1 = 0.12263, x2 = 0.29486,
x3 = 0.24821, x4 = 0.26007,
x5 = 0.25700, x6 = 0.25779,
x7 = 0.25759, x8 = 0.25764.
Thus, we see that after eight iterations the root is 0.2576 to 4D. A
graphical interpretation of the first three iterations is shown in Fig. 8.
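A few lines of Python reproduce the table of iterates:

import math

x = 1.0                          # x0
for n in range(1, 9):
    x = math.exp(-x) / 3.0       # x_(n+1) = phi(x_n)
    print(n, round(x, 5))
# 1 0.12263, 2 0.29486, ..., 8 0.25764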
3. Convergence
Whether an iteration procedure converges quickly, or indeed at all,
depends on the choice of the function φ(x) as well as the starting value
x0. For example, the equation x^2 = 3 has two real roots, ±√3 ≈ ±1.732.
It can be given the form
x = 3/x = φ(x),
which suggests the iteration
x_(n+1) = 3/x_n.
However, if the starting value x0 = 1 is used, the successive iterations
yield

x1 = 3/1 = 3, x2 = 3/3 = 1, x3 = 3, x4 = 1, ...,

so that the process oscillates indefinitely and does not converge.
We can examine the convergence of the iteration process
x_(n+1) = φ(x_n) to ξ = φ(ξ)
with the aid of the Taylor series

x_(k+1) - ξ = φ(x_k) - φ(ξ) = φ'(ζ_k)(x_k - ξ),

where ζ_k is a point between the root ξ and the approximation x_k.
We have

x_1 - ξ = φ'(ζ_0)(x_0 - ξ),
x_2 - ξ = φ'(ζ_1)(x_1 - ξ),
···
x_(n+1) - ξ = φ'(ζ_n)(x_n - ξ).

Multiplying the n + 1 rows together and cancelling the common factors
(x_1 - ξ), (x_2 - ξ), ···, (x_n - ξ) leaves

x_(n+1) - ξ = φ'(ζ_0) φ'(ζ_1) ··· φ'(ζ_n) (x_0 - ξ),

whence

|x_(n+1) - ξ| ≤ M^(n+1) |x_0 - ξ|, where M = max |φ'| near the root,

so that the absolute error |x_(n+1) - ξ| can be made as small as we
please by sufficient iteration if |φ'| < 1 in the neighbourhood of the
root.
Note that φ(x) = 3/x has derivative φ'(x) = -3/x^2, so that
|φ'(x)| = 3/x^2 > 1 for |x| < √3.
Checkpoint
1. Assuming x0 = 1, show by simple iteration that one root of the
equation 2x - 1 - 2 sin x = 0 is 1.4973.
2. Use simple iteration to find (to 4D) the root of the equation x + cos
x = 0.
STEP 10
NONLINEAR EQUATIONS 5
The Newton-Raphson iterative method
The Newton-Raphson method is suitable for implementation on a
computer (cf. the pseudo-code). It is a process for the determination of
a real root of an equation f(x) = 0, given just one point close to the
desired root. It can be viewed as a limiting case of the secant method
(STEP 8) or as a special case of simple iteration (STEP 9).
1. Procedure
Let x0 denote the known approximate value of the root ξ of f(x) = 0 and
h the difference between the true value and the approximate value, i.e.,

h = ξ - x0.

The second degree, terminated Taylor expansion (STEP 5) about x0
is

f(ξ) = f(x0 + h) = f(x0) + h f'(x0) + (h^2/2!) f''(ζ) = 0,

where ζ lies between x0 and ξ. Ignoring the remainder term, we write

0 ≈ f(x0) + h f'(x0),

whence

h ≈ -f(x0)/f'(x0),

and, consequently,

x1 = x0 - f(x0)/f'(x0)

should be a better estimate of the root than x0. Even better
approximations may be obtained by repetition (iteration) of the
process, which then becomes

x_(n+1) = x_n - f(x_n)/f'(x_n).

Note that if f is a polynomial, you can use the recursive procedure of
STEP 5 to compute f(x_n) and f'(x_n).
The geometrical interpretation is that each iteration provides the point
at which the tangent at the original point cuts the x-axis (Figure 9).
Thus the equation of the tangent at (x0, f(x0)) is

y - f(x0) = f'(x0)(x - x0),

so that (x1, 0) corresponds to

-f(x0) = f'(x0)(x1 - x0),

whence x1 = x0 - f(x0)/f'(x0).
2. Example
We will use the Newton-Raphson method to find the positive root of
the equation sin x = x^2, correct to 3D.
It will be convenient to use the method of false position to obtain
an initial approximation. Tabulation of f(x) = sin x - x^2 yields
f(0.75) = 0.1191 and f(1) = -0.1585.
With numbers displayed to 4D, we see that there is a root in the
interval 0.75 < x < 1, at approximately

x0 = (0.75 × (-0.1585) - 1 × 0.1191)/(-0.1585 - 0.1191) = 0.8573.

Next, we will use the Newton-Raphson method:

f'(x) = cos x - 2x,

and

f(x0) = 0.0211, f'(x0) = -1.0601,

yielding

h = -f(x0)/f'(x0) = 0.0199.

Consequently, a better approximation is x1 = 0.8573 + 0.0199 = 0.8772.
Repeating this step, we obtain

f(x1) = -0.0005, f'(x1) = -1.1151,

and

h = -0.0005/1.1151 ≈ -0.0004,

so that x2 = 0.8772 - 0.0004 = 0.8768.
Since f(x2) = 0.0000, we conclude that the root is 0.877 to 3D.
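A Python sketch of the iteration, applied to f(x) = sin x - x^2 with
the starting value obtained above:

import math

def newton(f, fprime, x, tol=5e-5, itmax=20):
    # Newton-Raphson: repeat x <- x - f(x)/f'(x) until the step is small.
    for _ in range(itmax):
        h = -f(x) / fprime(x)
        x += h
        if abs(h) < tol:
            return x
    raise RuntimeError("no convergence after itmax iterations")

root = newton(lambda x: math.sin(x) - x*x,
              lambda x: math.cos(x) - 2.0*x,
              0.8573)
print(round(root, 3))   # 0.877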
3. Convergence
If we write

φ(x) = x - f(x)/f'(x),

the Newton-Raphson iteration expression

x_(n+1) = x_n - f(x_n)/f'(x_n)

may be rewritten x_(n+1) = φ(x_n). We have observed (STEP 9) that, in
general, the iteration method converges when |φ'(x)| < 1 near the root.
In the case of Newton-Raphson, we have

φ'(x) = f(x) f''(x)/[f'(x)]^2,

so that the criterion for convergence is

|f(x) f''(x)| < [f'(x)]^2,
i.e., convergence is not as assured as, say, for the bisection method.
4. Rate of convergence
The second degree terminated Taylor expansion about x_n is

f(ξ) = f(x_n + e_n) = f(x_n) + e_n f'(x_n) + (e_n^2/2!) f''(ζ_n) = 0,

where e_n = ξ - x_n is the error at the n-th iteration and ζ_n lies
between x_n and ξ. Since f(ξ) = 0, we find

f(x_n)/f'(x_n) = -e_n - (e_n^2/2) f''(ζ_n)/f'(x_n).

But, by the Newton-Raphson formula, x_(n+1) = x_n - f(x_n)/f'(x_n),
whence the error at the (n + 1)-th iteration is

e_(n+1) = ξ - x_(n+1) = e_n + f(x_n)/f'(x_n) ≈ -(f''(ξ)/2f'(ξ)) e_n^2,

provided e_n is sufficiently small.
This result states that the error at the (n + 1)-th iteration is
proportional to the square of the error at the n-th iteration; hence, if
the factor |f''(ξ)/2f'(ξ)| is of order unity or less, an answer correct
to one decimal place at one iteration should be accurate to two places
at the next iteration, four at the next, eight at the next, etc. This
quadratic (second-order) convergence outstrips the rate of convergence
of the methods of bisection and false position!
For computer programs that are used only occasionally, it may be wise
to prefer the methods of bisection or false position, since their
convergence is virtually assured. However, for hand calculations or for
computer routines in constant use, the Newton-Raphson method is
usually preferred.
5. The square root
One application of the Newton-Raphson method is in the
computation of square roots. Computing √a is equivalent to finding the
positive root of x^2 = a, i.e., of

f(x) = x^2 - a = 0.

Since f'(x) = 2x, we have the Newton-Raphson iteration formula

x_(n+1) = x_n - (x_n^2 - a)/(2x_n) = (x_n + a/x_n)/2,

a formula known to the ancient Greeks. Thus, if a = 16 and x0 = 5, we
find to 4D

x1 = (5 + 3.2)/2 = 4.1, x2 = (4.1 + 3.9024)/2 = 4.0012, x3 = (4.0012 +
3.9988)/2 = 4.0000.
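In Python, three iterations of this formula reproduce the values just
quoted:

a, x = 16.0, 5.0        # compute sqrt(16) from the initial guess 5
for _ in range(3):
    x = (x + a / x) / 2.0
    print(round(x, 4))  # 4.1, 4.0012, 4.0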
Checkpoint
1. What is the geometrical interpretation of the Newton-
Raphson iterative procedure?
2. What is the convergence criterion for the Newton-Raphson
method?
3. What major advantage has the Newton-Raphson method over
some other methods?
EXERCISES
1. Use the Newton-Raphson method to find to 4S the (positive) root of
3xe^x = 1.
Note that tables of natural (Naperian) logarithms are more
readily available than tables of the exponential function, so that one
might prefer to solve the equivalent equation f(x) = ln 3x + x = ln 3 +
ln x + x = 0.
2. Derive the Newton-Raphson iteration formula
x_(n+1) = x_n - (x_n^k - a)/(k x_n^(k-1))
for finding the k-th root of a.
3. Compute the square root of 10 to 5 significant digits from an initial
guess.
4. Use the Newton-Raphson method to find to 4D the root of the
equation
x + cos x = 0.
5. Use the Newton-Raphson method to find to 4D the root of each
equation in Exercises 1-3.
STEP 11
SYSTEMS OF LINEAR EQUATIONS 1
Solution by elimination
Many physical systems can be modelled by a set of linear equations
which describe relationships between system variables. In simple
cases, there are two or three variables; in complex systems (for
example, in a linear model of the economy of a country) there may be
several hundred variables. Linear systems also arise in connection with
many problems of numerical analysis. Examples of these are the
solution of partial differential equations by finite difference methods,
statistical regression analysis, and the solution of eigenvalue problems
(see, for example, Gerald and Wheatley ( 1994)). A brief introduction to
this last topic is given in STEP 17.
Hence there arises a need for rapid and accurate methods for solving
systems of linear equations. The student will already be familiar
with solving by elimination systems of equations with two or three
variables. This Step presents a formal description of the Gauss
elimination method for n-variable systems and discusses certain
errors which might arise in their solutions. Partial pivoting, a
technique which enhances the accuracy of this method, is discussed
in the next Step.
1. Notation and definitions
We first consider an example in three variables:
a set of three linear equations in the three variables or unknowns x, y,
z. During solution of such a system, we determine a set of values for x,
y and z which satisfies each of the equations. In other words, if values
(X, Y, Z) satisfy all equations simultaneously, then they constitute a
solution of the system.
Consider now the general system of n equations in n variables:
Obviously, the dots indicate similar terms in the variables and the
remaining (n - 3) equations.
In this notation, the variables are denoted by x1, x2, ..., xn; they are
sometimes referred to as x_i, i = 1, 2, ..., n. The coefficients of the
variables may be detached and written in a coefficient matrix:
The notation aij will be used to denote the coefficient of xj in the i-th
equation. Note that aij occurs in the i-th row and j-th column of the
matrix.
The numbers on the right-hand side of the equations are called
constants; they may be written in a column vector:
The coefficient matrix may be combined with the constant vector to
form the augmented matrix:
As a rule, one works in the elimination method directly with the
augmented matrix.
2. The existence of solutions
For any particular system of n linear equations, there may be a single
solution (X1, X2, . . . Xn), or no solution, or an infinity of solutions. In the
theory of linear algebra, theorems are given and conditions stated
which allow to make a decision regarding the category into which a
given system falls. We shall not treat the question of existence of
solutions in this book, but for the benefit of students, familiar with
matrices and determinants, we state the theorem:
Theorem: A linear system of n equations in n variables with coefficient
matrix A and non-zero constant vector b has a unique solution if and
only if the determinant of A is not zero.
If b = 0 , the system has the trivial solution x = 0 . It has no other
solution unless the determinant of A is zero, when it has an infinite
number of solutions.
If the determinant of A is non-zero, there exists an n × n matrix,
called the inverse of A (denoted by A^-1), such that the matrix product
of A^-1 and A is equal to the n × n identity or unit matrix I. The
elements of the identity matrix are 1 on the main diagonal and 0
elsewhere. Its algebraic properties include Ix = x for any n × 1 vector
x, and IM = MI = M for any n × n matrix M. For example, the 3 × 3
identity matrix is

    | 1 0 0 |
I = | 0 1 0 |
    | 0 0 1 |
Multiplication of the equation Ax = b from the left by the inverse
matrix A^-1 yields A^-1 A x = A^-1 b, whence the unique solution is
x = A^-1 b (since A^-1 A = I and Ix = x). Thus, in principle, a linear
system with a unique solution may be solved by first evaluating A^-1 and
then A^-1 b. This approach is discussed in more detail in the optional
STEP 14. The Gauss
elimination method is a more general and efficient direct procedure for
solving systems of linear equations.
3. Gauss elimination method
In Gauss elimination, the given system of equations is transformed into
an equivalent system which has upper triangular form; this form is
readily solved by a process called back-substitution. We shall
demonstrate the process by solving the example of Section 1.
Transformation into upper triangular form:
First stage: Eliminate x from equations R2 and R3 using equation R1.
Second stage: Eliminate y from R3' using R2':
The system, now in upper triangular form, has the coefficient matrix:
Solution by back-substitution
The system in upper triangular form is easily solved by obtaining z
from R3'', then y from R2'', and finally x from R1''. This procedure is
called back-substitution. Thus:
4. The transformation operations
During transformation of a system to upper triangular form, one or
more of the following elementary operations are used at every step:
1. Multiplication of an equation by a constant;
2. Subtraction from one equation some multiple of another
equation;
3. Interchange of two equations.
Mathematically speaking, it should be clear to the student that
performing elementary operations on a system of linear equations
leads to equivalent systems with the same solutions. This statement
requires proof which may be found as a theorem in books on linear
algebra (cf., for example, Anton (1993)). It forms the basis of all
elimination methods for solving systems of linear equations.
5. General treatment of the elimination process
We will now describe the application of the elimination process to a
general n × n linear system, written in general notation, which is
suitable for implementation on a computer (cf. the pseudo-code). Before
considering the general system, the process will be described by
means of a system of three equations. We begin with the augmented
matrix, and display in the column (headed by m) the multipliers
required for the transformations.
First stage: Eliminate the coefficients a21 and a31, using row R1:
Second stage: Eliminate the coefficient a32 using row R2:
The matrix is now in the form which permits back- substitution. The
full system of equations at this stage, equivalent to the original system,
is
Hence follows the solution by back-substitution:
We now display the process for the general n × n system, omitting the
primes (') for convenience. Recall that the original augmented matrix
was:
First stage: Eliminate the coefficients a21, a31, ..., an1 by calculating
the multipliers

m_i1 = a_i1 / a_11, i = 2, 3, ..., n,

and then

a_ij ← a_ij - m_i1 a_1j, b_i ← b_i - m_i1 b_1, i, j = 2, 3, ..., n.

This leads to the modified augmented system, with zeros below a_11 in
the first column.
Second stage: Eliminate the coefficients a32, a42, ..., an2 by
calculating the multipliers

m_i2 = a_i2 / a_22, i = 3, 4, ..., n,

and then

a_ij ← a_ij - m_i2 a_2j, b_i ← b_i - m_i2 b_2, i, j = 3, 4, ..., n,

whence the second column has zeros below a_22.
We continue to eliminate unknowns, going on to columns 3, 4, ..., so
that by the beginning of the k-th stage we have the augmented matrix
with zeros below the diagonal in the first k - 1 columns.
k-th stage: Eliminate a_(k+1,k), a_(k+2,k), ..., a_(n,k) by calculating
the multipliers

m_ik = a_ik / a_kk, i = k+1, k+2, ..., n,

and then

a_ij ← a_ij - m_ik a_kj, b_i ← b_i - m_ik b_k, i, j = k+1, k+2, ..., n.

Continuing in this way, we obtain after n - 1 stages the augmented
matrix in which the original coefficient matrix has been transformed
into upper triangular form.
We now back-substitute: We find x_n = b_n / a_nn, and then

x_i = (b_i - a_(i,i+1) x_(i+1) - ··· - a_(in) x_n) / a_ii, i = n-1, n-2, ..., 1.
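For readers who wish to experiment, here is a plain Python sketch of the
elimination and back-substitution just described (no pivoting; see STEP
12 for that refinement). Applied in full precision to the example of
Section 6 below, it shows what the 3S hand computation approximates:

def gauss_solve(A, b):
    # Gauss elimination with back-substitution; A and b are modified.
    n = len(A)
    for k in range(n - 1):                 # k-th elimination stage
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]          # multiplier m_ik
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # back-substitution
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

A = [[0.34, -0.58, 0.94], [0.27, 0.42, 0.134], [0.20, -0.51, 0.54]]
b = [2.0, 1.5, 0.8]
print(gauss_solve(A, b))   # about [0.949, 1.999, 3.018]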
Notes
1. The diagonal elements a_kk, used at the k-th stage of the successive
elimination, are referred to as pivot elements.
2. In order to proceed from one stage to the next, the pivot elements
must be non- zero, since they are used as divisors in the multipliers
and in the final solution. If at any stage a pivot element vanishes,
rearrange the remaining rows of the matrix, in order to obtain a non-
zero pivot; if this is not possible, then the system of linear equations
has no solution.
3. If a pivot element is small compared with the elements in its
column which have to be eliminated, the corresponding multipliers
used at that stage will be larger than one in magnitude. The use of
large multipliers in elimination and back-substitution leads to
magnification of round-off errors; this can be avoided by using the
partial pivoting described in STEP 12.
4. During the elimination method, an extra check column,
containing the sums of the rows, should be computed at each stage (cf.
the following example). Its elements are treated in exactly the same
way as the equation coefficients. After each stage is completed, the
new row sums should equal the new check column elements (within
round-off error).
6. A numerical example with check column
Consider the system of equations:
0.34x1 - 0.58x2 + 0.94x3 = 2.0
0.27x1 + 0.42x2 + 0.134x3 = 1.5
0.20x1 - 0.51x2 + 0.54x3 = 0.8
The manipulations leading to the solution are set out in tabular form
below. For the purpose of illustration, the calculations have been
executed in three-digit (3S) floating point arithmetic. For example, at
the first stage, the multiplier 0.794 arises as follows:

m_21 = 0.27/0.34 = 0.794118... → 0.794,

while the value -0.0900 arises from the sequence of operations

b_2 ← 1.5 - 0.794 × 2.0 → 1.5 - 1.59 = -0.0900,

each intermediate result being rounded to 3S. Working with so few
significant digits leads to errors in the solution, as is
shown below by an examination of the so-called residuals.
Next, we back-substitute:
As a check, we sum the original three equations to obtain
0.81x1 - 0.67x2 + 1.61x3 = 4.3 (with 3S coefficients). Inserting the
computed solution into the left-hand side yields 4.3049, which is
acceptably close to 4.3.
In order to assess the accuracy of the solution, we insert the solution
into the left-hand sides of each of the original equations and compare
the results with the right-hand side constants. For the present example,
we find:
which yields the residuals
It seems reasonable to conclude that, if the residuals are small, the
solution is a good one. As a rule, this is true. However, at times, small
residuals do not indicate that a solution is acceptable. This aspect is
discussed under the heading ill-conditioning in Step 12.
Checkpoint
1. What types of operations are permissible during the
transformation of the augmented matrix?
2. What is the final form of the coefficient matrix, before
back-substitution begins?
3. What are pivot elements? Why must small pivot elements be
avoided if possible?
EXERCISES
Solve the following systems by Gauss elimination:
a)
x1 + x2 - x3 = 0,
2x1 - x2 + x3 = 6,
3x1 + 2x2 - 4x3 = -4.
b)
5.6x + 3.8y + 1.2z = 1.4,
3.1x + 7.1y - 4.7z = 5.1,
1.4x - 3.4y + 8.3z = 2.4.
c)
2x + 6y + 4z = 5,
6x + 19y + 12z = 6,
2x + 8y + 14z = 7.
d)
1.3x + 4.6y + 3.1z = -1,
5.6x + 5.8y + 7.9z = 2,
4.2x + 3.2y + 4.5z = -3.
STEP 12
SYSTEMS OF LINEAR EQUATIONS 2
Errors and ill- conditioning
For any system of linear equations, the question concerning the errors
in a solution obtained by a numerical method is not readily answered. A
general discussion of the problems it raises is beyond the scope of this
book. However, some sources of errors will be indicated below.
1. Errors in the coefficients and constants
In many practical cases, the coefficients of the variables, and also the
constants on the right-hand sides of the equations are obtained from
observations of experiments or from other numerical calculations. They
will have errors; and therefore, once the solution of a system has been
found, it too will contain errors. In order to show how this kind of error
is carried through calculations, we shall solve a simple example in two
variables, assuming that the constants have errors at most as large as
±0.01. Consider the system:

2x - y = 0,
x + y = 3,

where each right-hand side constant is in error by at most ±0.01.
Solving this system by Gauss elimination (with multiplier 1/2) and
back-substitution yields

(3/2)y = 3 ± 0.015,

whence (3/2)y lies between 2.985 and 3.015, i.e., y lies between 1.990
and 2.010. From the first equation, we now obtain

x = (y ± 0.01)/2,

so that x lies between 0.99 and 1.01.
If the coefficients and constants of the system were exact, its exact
solution would be x = 1, y = 2. However, since the constants are not
known exactly, it does not make sense to talk of an exact solution; all
one can say is that 0.99 ≤ x ≤ 1.01 and 1.99 ≤ y ≤ 2.01.
In this example, the error in the solution is of the same order as that in
the constants. Yet, in general, the errors in the solutions are greater
than those in the constants.
2. Round- off errors and numbers of operations
Numerical methods for solving systems of linear equations involve
large numbers of arithmetic operations. For example, the Gauss
elimination of STEP 11, according to Atkinson (1993), involves (n^3 + 3n^2
- n)/3 multiplications/divisions and (2n^3 + 3n^2 - 5n)/6
additions/subtractions in the case of a system with n unknowns.
Since round-off errors are propagated at each step of an algorithm, the
growth of round-off errors can be such that, when n is large, a solution
differs greatly from the true one.
3. Partial pivoting
In Gauss elimination, the buildup of round-off errors may be reduced by
rearranging the equations so that the use of large multipliers in the
elimination operations is avoided. The corresponding procedure is
known as partial pivoting (or pivotal condensation). The general
rule is: At each elimination stage, rearrange the rows of
the augmented matrix so that the new pivot element is larger in
absolute value than (or equal to) any element beneath it in its column.
Use of this rule ensures that the magnitudes of the multipliers
employed at each stage are less than or equal to unity. A simple
example in three-digit floating point arithmetic follows:
In the following tabular solution, the pivot elements are in bold print.
(Note that the magnitude of all the multipliers is less than unity.)
If no pivoting is done, it may be verified that using three-digit floating
point arithmetic yields the solution z = 2.93, y = 2.00 and x = 1.30. Since
the true solution to 3S is z = 2.93, y = 1.96, and x=1.36, the solution
obtained by partial pivoting is better than the one obtained without
pivoting.
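The elimination sketch of STEP 11 is easily extended with this
row-interchange rule. A Python sketch follows; the 2 × 2 system used at
the end is a hypothetical illustration of a dangerously small pivot, not
an example from the text:

def gauss_solve_pivot(A, b):
    # Gauss elimination with partial pivoting; A and b are modified.
    n = len(A)
    for k in range(n - 1):
        # Pivot: bring the row with the largest |a_ik| (i >= k) to row k.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if p != k:
            A[k], A[p] = A[p], A[k]
            b[k], b[p] = b[p], b[k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]      # now |m| <= 1
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            b[i] -= m * b[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Hypothetical example: without the interchange, the tiny pivot 1e-4
# would produce a multiplier of 10^4 and magnify round-off.
print(gauss_solve_pivot([[1e-4, 1.0], [1.0, 1.0]], [1.0, 2.0]))
# about [1.0001, 0.9999]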
4. Ill-conditioning
Certain systems of linear equations are such that their solutions are
very sensitive to small changes (and therefore to errors) in their
coefficients and constants. We give an example below in which 1 %
changes in two coefficients change the solution by a factor of 10 or
more. Such systems are said to be ill-conditioned. If a system is ill-
conditioned, a solution obtained by a numerical method may differ
greatly from the exact solution, even though great care is taken to
keep round-off and other errors very small.
As an example, consider the system of equations:
which has the exact solution x =1, y = 2. Changing coefficients of the
second equation by 1% and the constant of the first equation by 5%
yields the system:
It is easily verified that the exact solution of this system is x = 11, y =
-18.2. This solution differs greatly from the solution of the first system.
Both these systems are said to be ill-conditioned.
If a system is ill-conditioned, then the usual procedure of checking a
numerical solution by calculation of the residuals may not be valid. In
order to see why this is so, suppose we have an approximation X to the
true solution x. The vector of residuals r is then given by r = b - AX
= A( x - X) . Thus e = x - X satisfies the linear system Ae = r. In
general, r will be a vector with small components. However, in an ill-
conditioned system, even if the components of r are small so that it
is `close' to 0, the solution of the linear system Ae = r could differ
greatly from the solution of the system Ae = 0 , namely 0. It then
follows that X may be a poor approximation to x despite the residuals
in r being small.
Obtaining accurate solutions to ill-conditioned linear systems can be
difficult, and many tests have been proposed for determining whether
or not a system is ill-conditioned. A simple introduction to this topic is
given in the optional STEP 16.
Checkpoint
1. Describe the types of error which may affect the solution of a
system of linear equations.
2. How can partial pivoting contribute to a reduction of errors?
3. Is it true to say that an ill-conditioned system does not have
an exact solution?
EXERCISES
1. Find the range of solutions for the following system, assuming that
the maximum errors in the constants are as shown:
2. Solve the following systems by Gauss elimination:
a)
b)
c)
3. Use 4D normalized floating point arithmetic to solve the
following system with and without partial pivoting. Compare your
answers with the exact answer: x = 1.000 × 10^0, y = 5.000 × 10^-1:
4. Show that for a linear system of three unknowns, Gauss
elimination requires three divisions, eight multiplications, and eight
subtractions for the triangularization, and a further three divisions,
three multiplications and three additions/subtractions for the
back-substitution.
5. Derive the general formulae in Section 2 for the numbers of required
arithmetic operations.
6. Study the ill-conditioning example of Section 4 in the following
ways:
a) Plot the lines of the first system on graph paper; then describe ill-
conditioning in geometrical terms when only two unknowns are
involved.
b) Insert the solution of the first system into the left-hand side of the
second system. Does x = 1, y = 2 'look like' a good solution to the
second system? Comment!
c) Insert the solution of the second system into the left-hand side of
the first system. Comment!
d) The system

10x1 + 7x2 + 8x3 + 7x4 = 32,
7x1 + 5x2 + 6x3 + 5x4 = 23,
8x1 + 6x2 + 10x3 + 9x4 = 33,
7x1 + 5x2 + 9x3 + 10x4 = 31

is an example of ill-conditioning, due to T. S. Wilson. Insert the
'solution' (6.0, -7.2, 2.9, -0.1) into the left-hand side. Would you
claim this solution to be a good one? Next insert the solution
(1.0, 1.0, 1.0, 1.0). Comment on the dangers of laying claims!
STEP 13
SYSTEMS OF LINEAR EQUATIONS 3
The Gauss- Seidel iterative method
The methods used in the previous Steps for solving systems of linear
equations are termed direct methods. If a direct method is used, and
round-off and other errors do not arise, an exact solution is reached
after a finite number of arithmetic operations. In general, of course,
round-off errors do arise; and when large systems are being solved by
direct methods, the growing errors can become so large as to render
the results obtained quite unacceptable.
1. Iterative methods
Iterative methods provide an alternative approach. Recall that an
iterative method starts with an approximate solution and uses it by
means of a recurrence formula to provide another approximate
solution; by repeated application of the formula, a sequence of
solutions is obtained which (under suitable conditions) converges to
the exact solution. Iterative methods have the advantages of simplicity
of operation and ease of implementation on computers, and they are
relatively insensitive to propagation of errors; they would be used in
preference to direct methods for solving linear systems involving
several hundred variables, particularly, if many of the coefficients were
zero. Systems of over 100 000 variables have been successfully solved
on computers by iterative methods, whereas systems of 10 000 or more
variables are difficult or impossible to solve by direct methods.
2. The Gauss-Seidel method
This text will only present one iterative method for linear equations,
due to Gauss and improved by Seidel. We shall use this method to
solve the system

10x1 + 2x2 + x3 = 13,
2x1 + 10x2 + x3 = 13,
2x1 + x2 + 10x3 = 13.

It is suitable for implementation on computers (cf. the pseudo-code).
The first step is to solve the first equation for x1, the second for x2,
and the third for x3, when the system becomes:

x1 = 1.3 - 0.2x2 - 0.1x3,     (1)
x2 = 1.3 - 0.2x1 - 0.1x3,     (2)
x3 = 1.3 - 0.2x1 - 0.1x2.     (3)
An initial solution is now assumed; we shall use x1 = 0, x2 = 0 and
x3 = 0. Inserting these values into the right-hand side of Equation (1)
yields x1 = 1.3. This value for x1 is used immediately together with the
remainder of the initial solution (i.e., x2 = 0 and x3 = 0) in the
right-hand side of Equation (2) and yields x2 = 1.3 - 0.2 × 1.3 - 0 =
1.04. Finally, the values x1 = 1.3 and x2 = 1.04 are inserted into
Equation (3) to yield x3 = 0.936. This
second approximate solution (1.3, 1.04, 0.936) completes the first
iteration.
Beginning with this second approximation, we repeat the process to
obtain a third approximation, etc. Under certain conditions relating to
the coefficients of the system, this sequence will converge to the exact
solution.
We can set up recurrence relations which show clearly how the
iterative process proceeds. Denoting the k-th and (k+1)-th
approximations by (x1^(k), x2^(k), x3^(k)) and (x1^(k+1), x2^(k+1),
x3^(k+1)), respectively, we find

x1^(k+1) = 1.3 - 0.2 x2^(k) - 0.1 x3^(k),         (1)'
x2^(k+1) = 1.3 - 0.2 x1^(k+1) - 0.1 x3^(k),       (2)'
x3^(k+1) = 1.3 - 0.2 x1^(k+1) - 0.1 x2^(k+1).     (3)'

We begin with the starting vector x^(0) = (x1^(0), x2^(0), x3^(0)), all
components of which are 0, and then apply these relations repeatedly in
the order (1)', (2)' and (3)'. Note that, when we insert values for x1,
x2 and x3 into the right-hand sides, we always use the most recent
estimates found for each unknown.
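In Python, the recurrence relations (1)', (2)' and (3)' take only a few
lines:

x1 = x2 = x3 = 0.0                        # starting vector x(0)
for k in range(6):
    x1 = 1.3 - 0.2*x2 - 0.1*x3            # (1)'
    x2 = 1.3 - 0.2*x1 - 0.1*x3            # (2)'
    x3 = 1.3 - 0.2*x1 - 0.1*x2            # (3)'
    print(k + 1, round(x1, 5), round(x2, 5), round(x3, 5))
# 1st iteration: (1.3, 1.04, 0.936); the iterates approach (1, 1, 1)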
3. Convergence
The sequence of solutions produced by the iterative process for the
above numerical example is shown in the table:
The student should check that the exact solution for this system is
(1, 1, 1). It is seen that the Gauss-Seidel solutions are rapidly
approaching these values; in other words, the method is converging.
Naturally, in practice, the exact solution is unknown. It is customary to
end the iterative procedure as soon as the differences between the
x^(k+1) and x^(k) values are suitably small. One stopping rule is to end
the iteration when, for example,

S_(k+1) = |x1^(k+1) - x1^(k)| + |x2^(k+1) - x2^(k)| + |x3^(k+1) - x3^(k)|

becomes less than a prescribed small number (usually chosen
according to the accuracy of the machine on which the calculations are
carried out).
The question of convergence with a given system of equations is
crucial. As in the above example, the Gauss-Seidel method may
quickly lead to a solution very close to the exact one; on the other
hand, it may converge too slowly to be of practical use, or it may
produce a sequence which diverges from the exact solution. The reader
is referred to more advanced texts (such as Conte and de Boor ( 1980))
for treatments of this question.
In order to improve the chance (and rate) of convergence, the system
of equations should be rearranged before applying the iterative
method, so that, as far as possible, each leading-diagonal coefficient is
larger (in absolute value) than any other in its row.
Checkpoint
1. What is the essential difference between a direct method and
an iterative method?
2. Give some advantages of the use of iterative methods
compared with that of direct methods.
3. How can you improve the chance of success with the Gauss-
Seidel method?
EXERCISES
1. For the example treated above, compute the value of S3, the
quantity used in the suggested stopping rule after the third
iteration.
2. Use the Gauss-Seidel method to solve the following systems to
5D accuracy (remembering to rearrange the equations if
appropriate). Compute the value of Sk (to 6D) after each
iteration.
a)
x - y + z = -7,
20x + 3y - 2z = 51,
2x + 8y + 4z = 25.
Remember to rearrange! Compute the value of Sk to 5D after
each iteration.
b)
10x - y = 1,
-x + 10y - z = 1,
-y + 10z - w = 1,
-z + 10w = 1.
STEP 14
SYSTEMS OF LINEAR EQUATIONS 4*
Matrix inversion*
The general system of linear equations in n variables (STEP 11, Section
1) can be written in matrix form Ax = b, where a vector x is sought
which satisfies this equation. We will now use the inverse matrix A^-1 to
find this vector.
1. The inverse matrix
In STEP 11, we observed that, if the determinant of A is non-zero, then A
has an inverse matrix A^-1, and the solution of the linear system can be
written as x = A^-1 b, i.e., the solution of the system of linear equations
can be obtained by first finding the inverse of the coefficient matrix A,
and then forming the product A^-1 b.
However, as a rule, this approach is not adopted in practice. The
problem of finding the inverse matrix is itself a numerical task, which
generally requires for its solution many more operations (and therefore
involves more round-off and other errors) than any of the methods
described in the preceding Steps. If, however, the inverse matrix is
required for some additional reason, it is sensible to compute the
inverse first. For example, the inverse may contain theoretical or
statistical information, or be of use in some other formula or calculation.
2. Method for inverting a matrix
There are many numerical methods for finding the inverse of a matrix.
We shall describe one which uses Gauss elimination and back-
substitution procedures. It is simple to apply and is computationally
efficient. We shall illustrate the method by application to a 2 x 2 matrix
and a 3 x 3 matrix; it should then be clear to the reader, how the
method may be extended for use with n x n matrices.
As the 2 x 2 example, consider
and seek the inverse matrix
such that
This is equivalent to solving the two systems
Thus:
a. Form the augmented matrix
b. Apply elementary row operations to the augmented matrix such
that A is transformed into an upper triangular matrix (STEP 11,
Section 5):
c. Solve the two systems:
using back-substitution. Note how the systems have been
constructed, using the upper triangular form of A and the columns
of the transformed right-hand (identity) half of the augmented
matrix. From the first system, 3v1 = -2, v1 = -2/3, and 2u1 + v1 = 1,
whence 2u1 = 1 + 2/3, u1 = 5/6. From the second system, 3v2 = 1,
v2 = 1/3, and 2u2 + v2 = 0, whence 2u2 = -1/3, u2 = -1/6. Thus the
required inverse matrix is:
d. Check: AA^-1 should be equal to I. Multiplication yields:
so that A^-1 is correct.
In this simple example, one can work with fractions, so that no round-
off errors occur and the resulting inverse matrix is exact. In general,
during hand calculations, the final result should be checked by
computing AA^-1, which should be approximately equal to the identity
matrix I. As the 3 x 3 example, consider
In order to demonstrate the effects of errors, compute A^-1 to 3S. The
results of the calculations in tabular form are:
As an example of back-substitution, the upper triangular form of A,
taken with the second column of the transformed identity, yields the
second column of A^-1. Thus:
yields w2 = -20.0, v2 = 38.3 and u2 = -34.0, determined in this order.
Check by multiplication that AA^-1 is
which is approximately equal to I. The noticeable inaccuracy is due to
carrying out the calculation of the elements of A^-1 to 3S only.
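The augment-eliminate-back-substitute recipe of this section is easily coded. Below is a minimal Python sketch, assuming that no row interchanges (pivoting) are needed; the 2 x 2 matrix used has entries consistent with the back-substitution arithmetic above, but it is offered as an illustration rather than as the text's own.

import numpy as np

def invert(A):
    # Invert A by Gauss elimination on the augmented matrix [A | I],
    # followed by back-substitution for each column of the inverse.
    # Assumes no pivoting is required.
    n = A.shape[0]
    aug = np.hstack([A.astype(float), np.eye(n)])   # step (a): [A | I]
    for k in range(n - 1):                          # step (b): triangularize
        for i in range(k + 1, n):
            m = aug[i, k] / aug[k, k]               # multiplier
            aug[i, k:] -= m * aug[k, k:]
    inv = np.zeros((n, n))
    for j in range(n):                              # step (c): back-substitute
        for i in range(n - 1, -1, -1):
            s = aug[i, n + j] - aug[i, i+1:n] @ inv[i+1:, j]
            inv[i, j] = s / aug[i, i]
    return inv

A = np.array([[2.0, 1.0],
              [4.0, 5.0]])
Ainv = invert(A)
print(Ainv)          # [[ 5/6, -1/6], [-2/3, 1/3]]
print(A @ Ainv)      # step (d): check, approximately the identity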
3. Solution of linear systems using the inverse matrix
As previously noted, the unique solution of a linear system Ax = b is x
= A^-1 b, provided the coefficient matrix A has an inverse A^-1. We shall
illustrate this by using the inverse A^-1 obtained in Section 2 to compute
the solution of the linear system:
The coefficient matrix is
We can use A^-1, calculated in the preceding section, as follows:
Thus, we arrive at the solution x = -13, y = 16 and z = -2.5 to 2S. We
check the solution by adding the three equations:
Inserting this solution in the left-hand side yields to 2S:
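In code, once the inverse is available, each new right-hand side costs only a matrix-vector product. A small sketch follows; it reuses the 2 x 2 matrix and inverse from Section 2, and the right-hand sides are hypothetical.

import numpy as np

A = np.array([[2.0, 1.0],
              [4.0, 5.0]])
Ainv = np.array([[5/6, -1/6],
                 [-2/3, 1/3]])           # inverse found in Section 2

for b in (np.array([3.0, 9.0]), np.array([1.0, 1.0])):
    x = Ainv @ b                         # x = A^-1 b
    print(x, A @ x)                      # A x reproduces b as a check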
Checkpoint
1. In the method for finding the inverse of A, what is the final form
of A after the elementary row operations have been carried out?
2. Is the solution of the system Mx = d given by x = d M^-1 or by
x = M^-1 d (or neither)?
3. Give a condition for a matrix not to have an inverse.
EXERCISES
1. Find the inverses of the following matrices, using elimination and
back-substitution:
1. 2. 3.
2. Solve the systems of equations (each for two right-hand side
vectors):
a.
b.
c.
STEP 15
SYSTEMS OF LINEAR EQUATIONS 5*
Use of LU decomposition*
We have shown in STEP 11 how to solve a linear system Ax = b by
Gauss elimination, applied to the augmented matrix. In the
preceding Step, we extended the elimination process to calculate
the inverse A^-1 of the coefficient matrix A, assuming that it exists.
Another general approach to solving Ax = b is known as the method of
LU decomposition, which provides new insights into matrix algebra
and has many theoretical and practical uses. It yields efficient
computer algorithms for handling practical problems.
The symbols L and U denote lower and upper triangular
matrices, respectively. Examples of lower triangular
matrices are
Note that in such a matrix all elements above the leading diagonal are
zero. Examples of upper triangular matrices are:
where all elements below the leading diagonal are zero. The product of
L1 and U1 is
1. Procedure
Suppose we have to solve a linear system Ax = b and that we can
express the coefficient matrix A in the form of the so-called LU
decomposition A = LU. Then we may solve the linear system as
follows:
Stage 1:
Write Ax = LUx = b.
Stage 2:
Set y = Ux, so that Ax = Ly = b. Use forward substitution with Ly =
b to find y1, y2, ..., yn in that order, i.e., assume that the augmented
matrix for the system Ly = b is:
Then forward substitution yields y1 = b1/l11, and, subsequently,
Note that the value of yi depends on the values y1, y2, ..., yi-1, which
have already been calculated.
Stage 3:
Finally, use back-substitution with Ux = y to find xn, ..., x1 in that
order.
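The three stages are short in code. Below is a minimal Python sketch, assuming L and U (with A = LU) are already known; the matrices and right-hand side shown are hypothetical.

import numpy as np

def solve_lu(L, U, b):
    # Stage 2: forward substitution for L y = b.
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    # Stage 3: back-substitution for U x = y.
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

L = np.array([[1.0, 0.0],
              [3.0, 1.0]])
U = np.array([[2.0, 1.0],
              [0.0, 4.0]])
b = np.array([3.0, 13.0])
print(solve_lu(L, U, b))    # solves (LU) x = b; here x = (1, 1)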
Later on, we shall outline a general method for finding LU
decompositions of square matrices. The following example
demonstrates the procedure, involving the matrix A = L1 U1 above. If we
wish to solve Ax = b for a number of different vectors b, then this
method is more efficient than Gauss elimination: once we have
found an LU decomposition of A, we need only forward- and
back-substitute to solve the system for any b.
Example
We shall solve the system
Stage 1:
An LU decomposition of the system is
Stage 2:
Set y = U1 x and then solve the system L1 y = b, i.e.,
Using forward substitution, we obtain:
Stage 3:
Solve
Back-substitution yields:
Thus, the solution of Ax = b is:
which you may check, using the original equations. We turn now to the
problem of finding an LU decomposition of a given square matrix
A.
2. Realizing an LU decomposition
For an LU decomposition of a given n x n matrix A, we seek a lower
triangular matrix L and an upper triangular matrix U (both of order n x
n) such that A = LU. The matrix U may be taken to be the upper
triangular matrix resulting from Gauss elimination without partial
pivoting (Sections 3 and 5 of STEP 11), and the matrix L may be taken
to be the lower triangular matrix which has diagonal elements 1 and
which has as its (i, k) element the multiplier m_ik. This multiplier is
calculated at the k-th stage of Gauss elimination and is required to
transform the current value of a_ik into 0. In the notation of STEP 11,
these multipliers were given by m_ik = a_ik/a_kk, i = k+1, k+2, ..., n.
An example will help to clarify this procedure. From STEP 11, we recall
that Gauss elimination, applied to the system
yielded the upper triangular matrix:
Also, we saw that in the first stage we calculated the multipliers m21 =
a21/a11 = 1/1 = 1 and m31 = a31/a11 = 2/1 = 2, while, in the second stage, we
calculated the multiplier m32 = a32/a22 = -3/1 = -3. Thus
It is readily verified that LU equals the coefficient matrix of the original
system:
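Recording the multipliers during elimination is all that is needed to build L. A Python sketch without pivoting follows; the 3 x 3 matrix shown is hypothetical, chosen so that the multipliers come out as 1, 2 and -3, matching those quoted above.

import numpy as np

def lu_by_elimination(A):
    # Gauss elimination without pivoting; each multiplier m_ik becomes
    # the (i, k) element of the unit lower triangular matrix L.
    U = A.astype(float).copy()
    n = U.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]      # m_ik = a_ik / a_kk
            U[i, k:] -= L[i, k] * U[k, k:]
    return L, U

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [2.0, -1.0, 1.0]])
L, U = lu_by_elimination(A)
print(L)                        # subdiagonal entries 1, 2, -3
print(np.allclose(L @ U, A))    # True: A = LU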
Another technique which may be used to find an LU decomposition
of an n x n matrix is direct decomposition. In order to illustrate
this process, let it be required to find an LU decomposition for the 3 x 3
coefficient matrix of the system above. Then the required L and U are
of the form
Note that the total number of unknowns in L and U is 12, whereas there
are only 9 elements in the 3 x 3 coefficient matrix A. To ensure that L
and U are unique, we need to impose 12 - 9 = 3 extra conditions on the
elements of these two triangular matrices. (In the general n x n case, n
extra conditions are required.) One common choice is to require all the
diagonal elements of L to be 1; the resulting method is known as
Doolittle's method. Another choice is to make the diagonal elements of
U equal to 1; this is Crout's method. Since Doolittle's method would
reproduce the LU decomposition for A already given above, we shall
use Crout's method to illustrate the direct decomposition procedure.
We then require that
Multiplication of L and U yields:
It is clear that this construction by Crout's method yields triangular
matrices L and U for which A = LU.
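Crout's construction can also be written out directly in code, computing a column of L and then a row of U at each stage. A minimal sketch under the same no-pivoting assumption, reusing the hypothetical matrix from the earlier sketch:

import numpy as np

def crout(A):
    # Crout decomposition: A = LU with unit diagonal in U.
    n = A.shape[0]
    L = np.zeros((n, n))
    U = np.eye(n)
    for j in range(n):
        for i in range(j, n):            # column j of L
            L[i, j] = A[i, j] - L[i, :j] @ U[:j, j]
        for i in range(j + 1, n):        # row j of U (diagonal already 1)
            U[j, i] = (A[j, i] - L[j, :j] @ U[:j, i]) / L[j, j]
    return L, U

A = np.array([[1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [2.0, -1.0, 1.0]])
L, U = crout(A)
print(np.allclose(L @ U, A))             # True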
Checkpoint
1. What constitutes an LU decomposition of a matrix A?
2. How is a decomposition A = LU used to solve a linear system
Ax = b?
3. How may an LU decomposition be obtained by Gauss
elimination?
EXERCISES
1. Find an LU decomposition of the matrix
where
2. Solve each of the following systems (Exercises 1 and 3 of STEP 11)
by first finding an LU decomposition of the coefficient matrix and
then using forward and backward substitutions.
a.
b.
STEP 16
SYSTEMS OF LINEAR EQUATIONS 6*
Tests for ill-conditioning*
We recall from Section 4 of STEP 12 that the solutions of ill-conditioned
systems of linear equations are very sensitive to small changes in their
coefficients and constants. We will now show how one may test a
system for ill-conditioning.
1. Norms
One of the most common tests for ill-conditioning of a linear system
involves the norm of the coefficient matrix. In order to define this
quantity, we must consider first the concept of the norm of a vector
or matrix, which in some way assesses the size of their elements.
Let x and y be vectors. Then a vector norm ||x|| is a real number with
the properties:
1. ||x|| >= 0, and ||x|| = 0 if and only if x is a vector all components
of which are zero;
2. ||cx|| = |c| ||x|| for any real number c;
3. ||x + y|| <= ||x|| + ||y|| (triangle inequality).
There are many possible ways to choose a vector norm with these
properties. One vector norm which is probably familiar to the student is
the Euclidean or 2-norm. Thus, if x is an n x 1 vector, then the 2-norm
is denoted and defined by
||x||_2 = (x1^2 + x2^2 + ... + xn^2)^(1/2).
As an example, if x is the 5 x 1 vector with the components x1 = 1, x2 =
-3, x3 = 4, x4 = -6, x5 = 2, then
||x||_2 = (1 + 9 + 16 + 36 + 4)^(1/2) = sqrt(66), which is about 8.12.
Another possible choice of norm, which is more suitable for our
purposes here, is the infinity norm, defined by
||x||_inf = max_i |x_i|.
Thus, we find ||x||_inf = 6 for the vector in the previous example. It is
easily verified that ||.||_inf has the three properties above. For ||.||_2,
the first two properties are readily verified; the triangle inequality is
somewhat more difficult and requires the use of the so-called Cauchy-
Schwarz inequality (see, for example, Cheney and Kincaid (1994)).
The defining properties of a matrix norm are similar, except that there
is an additional property. Let A and M be matrices. Then a
matrix norm ||A|| is a real number with the properties:
1. ||A|| >= 0, and ||A|| = 0 if and only if A is a matrix with all elements zero;
2. ||cA|| = |c| ||A|| for any real number c;
3. ||A + M|| <= ||A|| + ||M||;
4. ||AM|| <= ||A|| ||M||.
As for vector norms, there are many ways of choosing matrix norms
with the four properties above, but we will consider only the infinity
norm. If A is an n x n matrix, then the infinity norm is defined by
||A||_inf = max_i ( |a_i1| + |a_i2| + ... + |a_in| ).
By this definition, this norm is the maximum of the sums obtained by
adding the absolute values of the elements in each row, whence it is
commonly referred to as the maximum row sum norm. As an example, let
Then
so that
A useful property interrelating the matrix and vector infinity norms is
||Ax||_inf <= ||A||_inf ||x||_inf,
which follows from Property 4 of the matrix norms above.
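Both infinity norms are one-liners in code. A brief sketch, with the vector taken from the example above and a hypothetical matrix:

import numpy as np

def vec_inf_norm(x):
    # Infinity norm of a vector: largest absolute component.
    return np.max(np.abs(x))

def mat_inf_norm(A):
    # Infinity norm of a matrix: maximum row sum of absolute values.
    return np.max(np.sum(np.abs(A), axis=1))

x = np.array([1, -3, 4, -6, 2])
print(vec_inf_norm(x))               # 6
print(np.sqrt(np.sum(x**2)))         # 2-norm: sqrt(66), about 8.12

A = np.array([[2.0, -7.0],
              [1.0, 5.0]])
print(mat_inf_norm(A))               # 9 = |2| + |-7|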
2. Ill-conditioning tests
We will now find out whether or not a system is ill-conditioned by
using the condition number of the coefficient matrix. If A is an n x
n matrix and A^-1 is its inverse, then the condition number of A is
defined by
K(A) = ||A|| ||A^-1||.
It is bounded below by 1, since ||I|| = 1 and
1 = ||I|| = ||A A^-1|| <= ||A|| ||A^-1|| = K(A),
where we have used the matrix norm Property 4 given in the preceding
section. Large values of the condition number usually indicate ill-
conditioning. As a justification for this last statement, we state and
prove the theorem:
Theorem: Suppose x satisfies the linear system Ax = b, and x~ satisfies
the linear system Ax~ = b~. Then
||x - x~|| / ||x|| <= K(A) ||b - b~|| / ||b||.
Proof: We have A(x - x~) = b - b~. Since
x - x~ = A^-1 (b - b~),
we see that
||x - x~|| <= ||A^-1|| ||b - b~||.
However, since b = Ax, we have ||b|| <= ||A|| ||x||, or
1/||x|| <= ||A|| / ||b||.
It then follows that
||x - x~|| / ||x|| <= ||A|| ||A^-1|| ||b - b~|| / ||b||,
from which the result above follows.
It is seen from the theorem that, even if the difference between b and
b~ is small, the change in the solution, as measured by the `relative
error' ||x - x~||/||x||, may be large when the condition number is large.
It follows that a large condition number is an indication of possible ill-
conditioning of a system. A similar theorem for the case when there
are small changes to the coefficients of A may be found in more
advanced texts such as Atkinson (1993). Such a theorem also shows
that a large condition number is an indicator of ill-conditioning.
There arises the question: how large does the condition number have
to be for ill-conditioning to become a problem? Roughly speaking, if the
condition number is 10^m and the machine in use to solve a linear
system has kD accuracy, then the solution of the linear system will be
accurate to about k - m decimal digits.
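A sketch of the condition-number test in Python (numpy's matrix norm with ord=np.inf is the maximum row sum); the nearly singular 2 x 2 matrix is hypothetical:

import numpy as np

def condition_number(A):
    # K(A) = ||A|| * ||A^-1|| in the infinity norm.
    return (np.linalg.norm(A, ord=np.inf)
            * np.linalg.norm(np.linalg.inv(A), ord=np.inf))

A = np.array([[1.00, 2.00],
              [1.01, 2.0201]])       # rows almost proportional
print(condition_number(A))           # very large: the system is ill-conditioned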
In Step 12, we had the coefficient matrix
which was associated with an ill-conditioned system. Then
and
This suggests that a numerical solution would not be very accurate, if
only 2D accuracy were used in the calculations. Indeed, if the
components of A were rounded to two decimal digits, the two rows of A
would be identical. Then the determinant of A would be zero, and it
follows from the theorem in Step 11 that this system would not have a
unique solution.
We recall that the definition of the condition number requires A^-1, the
computation of which is expensive. Moreover, even if the inverse were
calculated, this approximation might not be very accurate if the
system is ill-conditioned. It is therefore common in software packages
to estimate the condition number by obtaining an estimate of ||A^-1||,
without explicitly finding A^-1.
Checkpoint
1. What are the three properties of a vector norm?
2. What are the four properties of a matrix norm?
3. How is the condition number of a matrix defined and how is it
used as a test for ill- conditioning?
EXERCISES
1. For the 5 x 1 vector with elements x1 = 4, x2 = -6, x3 = -5, x4 = 1, and x5
= -1, calculate the 2-norm and the infinity norm.
2. For each of the matrices in Exercise 1 of Step 14, compute
a. the infinity norm,
b. the condition number.
STEP 17
THE EIGEN-VALUE PROBLEM
The power method
Let A be an n x n matrix. If there exists a number λ and a non-zero
vector x such that Ax = λx, then λ is said to be an eigen-value of A,
and x the corresponding eigen-vector. The evaluation of eigen-values
and eigen-vectors of matrices is a problem that arises in a variety of
contexts. Note that if we have an eigen-value λ and an eigen-vector x,
then cx, where c is any non-zero real number, is also an eigen-vector, since
A(cx) = cAx = cλx = λ(cx).
This shows that an eigen-vector is not unique and may be scaled, if
desired (for instance, one might want the sum of the components of
the eigen-vector to be unity).
Writing Ax = λx as
(A - λI)x = 0,
we conclude from the theorem in STEP 11 that this equation can have a
non-zero solution only if the determinant |A - λI| is zero. If we
expand this determinant, then we get an n-th degree polynomial in λ,
known as the characteristic polynomial of A. Thus, one way to find
the eigen-values of A is to obtain its characteristic polynomial and then
to find the n zeros of this polynomial (some of them may be complex!).
For example, let
Then
This last matrix has the determinant
The zeros of this quadratic yield the eigen-values.
Although the characteristic polynomial is easy to work out in this
simple 2 x 2 example, as n increases, it becomes more complicated
(and, of course, it has a correspondingly higher degree). Moreover,
even if we can find the characteristic polynomial, the analytic formulae
for the roots of a cubic or quartic are somewhat inconvenient to use,
and in any case, we must use numerical methods (cf. STEPS 7-10) to
find the roots of polynomials of degree n > 4. It is therefore common
to use alternative, direct numerical methods to find eigen-values
and eigen-vectors of a matrix.
If we are only interested in the eigen-value of largest magnitude,
then a popular approach to its evaluation is the power method. We
shall discuss later on how this method may be modified to find the
eigen-value of smallest magnitude. Methods for finding all the
eigen-values are beyond the scope of this book. (One such method,
called the QR method, is based on QR factorization.)
1. The power method
Suppose that the n eigen-values of A are λ1, λ2, ..., λn, and that they
are ordered in such a way that
|λ1| > |λ2| >= |λ3| >= ... >= |λn|.
Then the power method, which is readily implemented on a computer,
can be used to find λ1. We begin with a starting vector w^(0) and
compute the vectors
w^(j) = A w^(j-1)
for j = 1, 2, ..., so that induction yields
w^(j) = A^j w^(0),
where A^j is A multiplied by itself j times. Thus w^(j) is the product
of w^(0) and the j-th power of A, which explains why this approach
is called the power method.
It turns out that, at the j-th iteration, an approximation to the
eigen-vector x associated with λ1 is given by w^(j), and an
approximation to λ1 is given by
λ1 ≈ w_k^(j) / w_k^(j-1),
where w_k^(j) and w_k^(j-1) are the k-th components of w^(j) and
w^(j-1), respectively, for any k between 1 and n. Although there are n
possible choices for k, it is usual to choose it so that w_k^(j) is the
component of w^(j) with the largest magnitude.
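A direct transcription of the unscaled power method in Python; the matrix and starting vector are hypothetical (the text's own example appears in the next section):

import numpy as np

def power_method(A, w0, iterations=8):
    # w(j) = A w(j-1); the eigen-value estimate is the ratio of the
    # k-th components of successive iterates, with k the index of the
    # largest-magnitude component of w(j).
    w_old = np.array(w0, dtype=float)
    for _ in range(iterations):
        w = A @ w_old
        k = np.argmax(np.abs(w))
        estimate = w[k] / w_old[k]
        w_old = w
    return estimate, w_old

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # eigen-values 3 and 1
lam, w = power_method(A, [1.0, 0.0])
print(lam)                            # tends to 3, the dominant eigen-value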
2. Example
We will now use the power method to find the largest eigen-value
of the matrix
As a starting vector we take
Since the second component of w^(1) has the largest magnitude,
we take k = 2, so that the first approximation to λ1 is
Subsequent iterations of the power method yield
These calculations allow us to conclude that the largest eigen-value
is about 2.6.
3. Variants
In the preceding example, the reader will have noted that the
components of w^(j) grow in size as j increases. Overflow
problems would arise if this growth were to continue; hence, in
practice, it is usual to use instead the scaled power method.
This is identical to the power method, except that the vectors w^(j)
are scaled at each iteration. Thus, suppose w^(0) is given and set
y^(0) = w^(0). Then we carry out, for j = 1, 2, ..., the steps:
a. Calculate w^(j) = A y^(j-1);
b. find p such that |w_p^(j)| is the largest of the |w_i^(j)|, i.e., the p-th
component of w^(j) has the largest magnitude;
c. evaluate an approximation to λ1, i.e.,
λ1 ≈ w_k^(j) / y_k^(j-1)
for some k;
d. calculate y^(j) = w^(j) / w_p^(j).
In Step c, there are n choices for k. As for the unscaled power
method, k is usually chosen to be the value for which w_k^(j) has the
largest magnitude, i.e., k is taken to be the value p obtained in
Step b. Another option is to choose k to be the value of p from
the previous iteration, although often this results in the same
value. The effect of Step d is to produce a vector y^(j) with
components of magnitude not more than unity.
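Steps a to d translate into the following Python sketch (a minimal illustration; the 2 x 2 matrix is the same hypothetical one used earlier, and k is taken equal to p as suggested):

import numpy as np

def scaled_power_method(A, w0, iterations=8):
    y = np.array(w0, dtype=float)
    for _ in range(iterations):
        w = A @ y                     # step (a)
        p = np.argmax(np.abs(w))      # step (b): largest-magnitude component
        estimate = w[p] / y[p]        # step (c) with k = p; assumes y[p] != 0
        y = w / w[p]                  # step (d): components at most 1 in size
    return estimate, y

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, y = scaled_power_method(A, [1.0, 0.0])
print(lam, y)                         # estimate tends to 3, y to (1, 1)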
As an example, we apply the scaled power method to the
matrix in the preceding section. We take the value of k in each
iteration to be p. The starting vector w^(0) is the same as before,
and y^(0) = w^(0). Then the first four iterations of the scaled power
method yield:
Note that the eigen-value estimates are the same as before, and
the w^(j) are just multiples of those obtained in the preceding
section.
We shall now discuss the use of the power method to find the
eigen-value of smallest magnitude. If A has an inverse,
then Ax = λx may be written as
A^-1 x = (1/λ) x.
It follows that the smallest eigen-value of A may be found by
finding the largest eigen-value of A^-1 and then taking the
reciprocal. Thus, if the unscaled power method were used, we
would calculate the vectors
w^(j) = A^-1 w^(j-1).
In general, it is more efficient to solve the linear system
A w^(j) = w^(j-1) than to find the inverse of A (STEP 14).
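A sketch of this inverse power method, with np.linalg.solve standing in for the elimination of the earlier Steps; the matrix is again the hypothetical one used above:

import numpy as np

def inverse_power_method(A, w0, iterations=8):
    # Power method applied to A^-1: solve A w(j) = w(j-1) at each step
    # instead of forming the inverse, then take the reciprocal of the
    # dominant-eigen-value estimate of A^-1.
    w_old = np.array(w0, dtype=float)
    for _ in range(iterations):
        w = np.linalg.solve(A, w_old)
        k = np.argmax(np.abs(w))
        estimate = w[k] / w_old[k]
        w_old = w
    return 1.0 / estimate

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print(inverse_power_method(A, [1.0, 0.0]))   # tends to 1, the smallest eigen-value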
4. Other aspects
It may be shown that the convergence rate of the power
method is linear and that, under suitable conditions, the error in
the eigen-value estimate after j iterations is bounded by
C |λ2/λ1|^j,
where C is some positive constant. Thus, the bigger the gap
between |λ1| and |λ2|, the faster the rate of convergence. Since the
power method is an iterative method, one has to stop at some
stage. It is usual to carry on the process until successive
estimates of the eigen-value agree to a certain tolerance or,
alternatively, until a given maximum number of iterations is
exceeded.
Difficulties with the power method usually arise when our
assumptions about the eigen-values are not valid. For instance, if
|λ2| = |λ1|, then the sequence of estimates for λ1 may not converge.
Even if the sequence does converge, one may not be able to obtain
an approximation to the eigen-vector associated with λ1. A short
discussion of such difficulties may be found in Conte and de Boor
(1980).
Checkpoint
1. How are the eigen-values and eigen-vectors of a matrix
defined?
2. What is the power method for finding the eigen-value of largest
magnitude?
3. What advantage does the scaled power method have over the
power method?
EXERCISES
1. For the 2 x 2 matrix
apply eight iterations of the power method. Find the
characteristic polynomial, and hence find the two true eigen-
values of the matrix. Verify that the approximations are
converging to the eigen-value with the larger magnitude.
2. Apply five iterations of the normal and scaled power
methods to the 3 x 3 matrix
STEP 18
FINITE DIFFERENCES 1
Tables
Historically speaking, numerical analysts have always been concerned
with tables of numbers, and many techniques have been developed for
dealing with mathematical functions, represented in this way. For
example, the value of the function at an untabulated point may be
required, so that interpolation is necessary. It is also possible to
estimate the derivative or the definite integral of a tabulated
function, using some finite processes to approximate the
corresponding (infinitesimal) limiting procedures of calculus. In each
case, it has been traditional to use finite differences. Another
application of finite differences, which is outside the scope of this book,
is the numerical solution of partial differential equations.
1. Tables of values
Many books contain tables of mathematical functions. One of
the most comprehensive is the Handbook of Mathematical
Functions, edited by Abramowitz and Stegun (see the
Bibliography for publication details), which also contains useful
information about numerical methods.
Although most tables use constant argument intervals, some
functions do change rapidly in value in particular regions of their
argument, and hence may best be tabulated using intervals
varying according to the local behaviour of the function.
Tables with varying argument intervals are more difficult to
work with, however, and it is common to adopt uniform
argument intervals wherever possible. As a simple example,
consider the 6S table of the exponential function over
0.10(0.01)0.18 (a notation which specifies the domain
0.10 <= x <= 0.18 and the tabular interval 0.01):
It is extremely important that the interval between successive
values is small enough to display the variation of the tabulated
function, because usually the value of the function will be needed
at some argument value between those specified (for example,
from the above table). If the table is constructed in
this manner, we can obtain such intermediate values to a
reasonable accuracy by using a polynomial representation
(hopefully, of low degree) of the function f.
2. Finite differences
Since Newton, finite differences have been used extensively. The
construction of a table of finite differences for a tabulated
function is simple: One obtains first differences by subtracting
each value from the succeeding value in the table, second
differences by repeating this operation on the first differences,
and so on for higher order differences. From the above table
of the exponential function, one has the difference table (note the
standard layout, with decimal points and leading zeros omitted
from the differences):
(In this case, the differences must be multiplied by 10^-5 for
comparison with the function values.)
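Constructing such a table in code is a repeated subtraction. A minimal Python sketch, applied to e^x over 0.10(0.01)0.18 as in Section 1:

import numpy as np

def difference_table(f_values):
    # Columns of a finite difference table: the tabulated values,
    # then first differences, second differences, and so on.
    columns = [np.asarray(f_values, dtype=float)]
    while len(columns[-1]) > 1:
        columns.append(np.diff(columns[-1]))   # subtract each value from its successor
    return columns

x = np.arange(0.10, 0.185, 0.01)               # 0.10, 0.11, ..., 0.18
for col in difference_table(np.exp(x)):
    print(np.round(col, 6))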
3. Influence of round-off errors
Consider the difference table given below for f(x) = e^x,
x = 0.1(0.05)0.5, to 6S, constructed as in Section 2. As before,
differences of increasing order decrease rapidly in magnitude, but
the third differences are irregular. This is largely a consequence
of round-off errors, as tabulation of the function to 7S and
differencing to fourth order illustrates (compare Exercise 3).
Although the round-off errors in f should be less than 1/2 in the
last significant place, they may accumulate; the greatest error
that can be obtained corresponds to:
A rough working criterion for the expected fluctuations (noise
level) due to round-off error is shown in the table:
Checkpoint
1. What factors determine the intervals of tabulation of a
function?
2. What is the name of the procedure to determine a value
of a tabulated function at an intermediate point?
3. What may be the cause of irregularity in the highest
order differences in a difference table?
EXERCISES
1. Construct the difference table for the function f(x) = x^3 for
x = 0(1)6.
2. Construct difference tables for each of the polynomials:
a. 2x - 1 for x = 0(1)3;
b. 3x^2 + 2x - 4 for x = 0(1)4;
c. 2x^3 + 3x - 3 for x = 0(1)5.
Study your resulting tables carefully; note what happens in the
final few columns of each table. Suggest a general result for
polynomials of degree n and compare your answer with the
theorem in STEP 20.
3. Construct a difference table for the function f(x) = e^x, given to
7D, for x = 0.1(0.05)0.5.
STEP 19
FINITE DIFFERENCES 2
Forward, backward, central difference notations
There are several different notations for the single set of finite
differences, described in the preceding Step. We introduce each of
these three notations in terms of the so-called shift operator, which we
will define first.
1. The shift operator E
Let fj = f(xj), j = 0, 1, 2, ..., be a set of values of the
function f(x), tabulated at the equally spaced points xj = x0 + jh.
The shift operator E is defined by:
E fj = fj+1.
Consequently,
E^2 fj = E(E fj) = E fj+1 = fj+2,
and so on, i.e.,
E^k fj = fj+k,
where k is any positive integer. Moreover, the last formula can be
extended to negative integers, and indeed to all real values of j
and k, so that, for example,
E^-1 fj = fj-1
and
E^(1/2) fj = fj+1/2.
2. The forward difference operator Δ
If we define the forward difference operator Δ by
Δ = E - 1,
then
Δfj = fj+1 - fj,
which is the first-order forward difference at xj. Similarly, we
find that
Δ^2 fj = Δ(fj+1 - fj) = fj+2 - 2fj+1 + fj
is the second-order forward difference at xj, and so on. The
forward difference of order k is
Δ^k fj = (E - 1)^k fj,
where k is any positive integer.
3. The backward difference operator ∇
If we define the backward difference operator ∇ by
∇ = 1 - E^-1,
then
∇fj = fj - fj-1,
which is the first-order backward difference at xj. Similarly,
∇^2 fj = fj - 2fj-1 + fj-2
is the second-order backward difference at xj, etc. The
backward difference of order k is
∇^k fj = (1 - E^-1)^k fj,
where k is any positive integer. Note that ∇fj = Δfj-1.
4. The central difference operator δ
If we define the central difference operator δ by
δ = E^(1/2) - E^(-1/2),
then
δfj = fj+1/2 - fj-1/2,
which is the first-order central difference at xj. Similarly,
δ^2 fj = fj+1 - 2fj + fj-1
is the second-order central difference at xj, etc. The central
difference of order k is
δ^k fj = (E^(1/2) - E^(-1/2))^k fj,
where k is any positive integer. Note that δfj+1/2 = Δfj = ∇fj+1.
5. Differences display
The role of the forward, central, and backward differences is
displayed by the difference table:
Although forward, central, and backward differences
represent precisely the same data:
1. Forward differences are useful near the start of a
table, since they only involve tabulated function values
below xj ;
2. Central differences are useful away from the ends of a
table, where there are available tabulated function values
above and below xj;
3. Backward differences are useful near the end of a
table, since they only involve tabulated function values
above xj.
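A quick check in code confirms that the three notations are merely shifted views of the same numbers (a sketch; f is the tabulation of e^x used earlier):

import numpy as np

f = np.exp(np.arange(0.10, 0.525, 0.05))   # f(x) = e^x, x = 0.1(0.05)0.5

forward = np.diff(f)            # forward differences: f[j+1] - f[j]
backward = f[1:] - f[:-1]       # backward differences, attached to x[j+1]

# The identity (nabla) f[j] = (delta) f[j-1]: the two arrays coincide;
# only the index each entry is attached to differs.
print(np.allclose(forward, backward))                      # True

second = np.diff(f, n=2)        # second-order differences
print(np.allclose(second, forward[1:] - forward[:-1]))     # True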
Checkpoint
1. What is the definition of the shift operator?
2. How are the forward, backward, and central
difference operators defined?
3. When are the forward, backward, and central
difference notations likely to be of special use?
EXERCISES
1. Construct a table of differences for the polynomial
for x = 0(1)4. Use the table to obtain the values of:
1. ;
2. ;
3. .
2. For the difference table in Section 3 of STEP 18 of f(x) = e^x
for x = 0.1(0.05)0.5, determine to six significant digits the
quantities (taking x0 = 0.1):
1. ;
2. ;
3. ;
4. ;
5. ;
3. Prove the statements:
1. ;
2. ;
3. ;
4. .
STEP 20
FINITE DIFFERENCES 3
Polynomials
Since polynomial approximations are used in many areas of Numerical
Analysis, it is important to investigate the finite differences of
polynomials.
1. Finite differences of a polynomial
Consider the finite differences of an n-th degree polynomial
tabulated for equidistant points at the tabular interval h.
Theorem: The n-th difference of a polynomial of degree n is a constant,
equal to a_n n! h^n (where a_n denotes the coefficient of x^n),
and higher order differences are zero.
Proof: For any positive integer k, the binomial expansion
yields
Omitting the subscript of x, we find
In passing, the student may recall that, in the differential calculus, the
increment f(x + h) - f(x) is related to the derivative of f(x) at the point x.
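The theorem is easy to verify numerically. A sketch for a hypothetical cubic, whose third differences should all equal a_n n! h^n:

import math
import numpy as np

a_n, n, h = 2.0, 3, 0.5                 # hypothetical cubic: 2x^3 + x - 1
x = np.arange(0.0, 5.0 + h, h)
f = a_n * x**3 + x - 1.0

third = np.diff(f, n=3)                 # third-order differences
print(third)                            # every entry equals...
print(a_n * math.factorial(n) * h**n)   # ...a_n * n! * h^n = 1.5
print(np.diff(f, n=4))                  # fourth differences: all zero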
2. Example
Construct for f(x) = x^3 with x = 5.0(0.1)5.5 the difference table:
Since in this case n = 3, a_n = 1, h = 0.1, we find that the third
differences are constant and equal to 3! x 1 x (0.1)^3 = 0.006.
Note that round-off error noise may occur; for example, consider the
tabulation of f(x) = x^3 for x = 5.0(0.1)5.5, rounded to two decimal
places:
3. Approximation of a function by a polynomial
Whenever the higher differences of a table become small (allowing for round-off noise), the function represented may be
approximated well by a polynomial. For example, reconsider the difference table to 6D for f(x) = e^x with x =
0.1(0.05)0.5:
Since the fluctuations in the third differences lie within the estimated round-off noise level (cf. the table in Section 3 of
STEP 18), we say that the third differences are constant within round-off error, and deduce that a cubic approximation is
appropriate for e^x over the range 0.1 <= x <= 0.5. An example in which polynomial approximation is inappropriate occurs
when f(x) = 10^x for x = 0(1)4, as is shown by the next table:
Although the function f(x) = 10^x is `smooth', the large tabular interval (h = 1) produces large higher order finite differences.
It should also be understood that there exist functions that cannot usefully be tabulated at all, at least in certain
neighbourhoods; for example, f(x) = sin(1/x) near the origin x = 0. Nevertheless, these are fairly exceptional cases.
Finally, we remark that the approximation of a function by a polynomial is fundamental to the widespread use of finite
difference methods.
Checkpoint
1. What may be said about the higher order (exact) differences of a polynomial?
2. What is the effect of round- off error on the higher order differences of a polynomial?
3. When may a function be approximated by a polynomial?
EXERCISES
1. Construct a difference table for the polynomial f(x) = x^4 for x = 0(0.1)1 when
a. the values of f are exact;
b. the values of f have been rounded to 3D.
Compare the fourth difference round-off errors with the estimate +/-6.
2. Find the degree of the polynomial which fits the data in the table: