MATH243: Errors and Computer Arithmetic
Week 2 Lectures - PS
Recall from our last lecture the possible sources of errors in a numerical computation, namely
1. Modeling errors
2. Human blunders and mistakes
3. Physical measurement errors
4. Machine representation and arithmetic errors
5. Mathematical approximation errors
and the consequences of errors: Errors, such as round-off errors get propagated in numerical computations
resulting in:
1. Loss-of-significance errors
2. Noise in function evaluation (uncertainty)
3. • underflow errors (numbers are too small). The number is taken as zero by default), and
• overflow errors (number too large). For overflow errors, the number has a value ±∞ or N aN .
Our focus here is on machine representation and arithmetic errors.
1 Machine representation and arithmetic errors
We count using the decimal number system (i.e. base 10), but computers use binary arithmetic (base 2).
Other bases, e.g. the octal (base 8) hexadecimal(base 16) number systems are also used.
Binary arithmetic - each number is represented as a binary number (that is, a finite sum of integer powers
of 2).
1. Some numbers can be represented exactly, such as:
• 1563 = 1 × 210 + 1 × 29 + 0 × 28 + 0 × 27 + 0 × 26 + 0 × 25 + 1 × 24 + 1 × 23 + 0 × 22 + 1 × 21 + 1 × 20 =
(11000011011)2
• 2.125 = 1 × 22 + 0 × 2−1 + 0 × 2−2 + 1 × 2−3
2. Other numbers have no exact binary representation (will lead to chopping and or round-off errors),
e.g.
1
• = (0.00011001100110011001100 . . .)2
10 10
1
1
• =
100 10
1
• =
1000 10
• 3.1 = 1 × 21 + 1 × 20 + 1 × 2−4 + 1 × 2−5 + . . ..
• Transcendental numbers, e.g. π have no finite representation, in both decimal or binary systems
Question: How do we convert from decimal to binary numbers? Please see the lecture notes Math243Lnotes2020a
for details. However here is a very brief tour.
In decimal representation of integers we identify an integer N by
dn dn−1 · · · d1 d0 (1)
with
N = dn × 10n + dn−1 × 10n−1 + · · · + d1 × 101 + d0 × 100 (2)
Similarly, an infinite decimal of the form
dn dn−1 · · · d1 d0 .d−1 d−2 d−3 . . . (3)
can be represented as
dn × 10n + dn−1 × 10n−1 + · · · + d1 × 101 + d0 × 100 + d−1 × 10−1 + d−2 × 10−2 + . . . (4)
where dk is an integer from 0 to 9, with dn 6= 0.
Example 1 Base 2 representation of numbers:
23456 = 2 × 104 + 3 × 103 + 4 × 102 + 5 × 101 + 6 × 100
234.56 = 2 × 102 + 3 × 101 + 4 × 100 + 5 × 10−1 + 6 × 10−2
In general, an (n + 1) digit integer I can be expressed in base b as
I = (an an−1 an−2 . . . . . . a2 a1 a0 )b (5)
If b = 10 the digits aj can take values between 0 and 9 (an 6= 0)
If b = 2 the digits aj can either be 0 or 1.
The decimal value of the number (with base b) given in equation (5) is
I = bn × an + bn−1 × an−1 + · · · + b1 × a1 + b0 × a0 . (6)
In a binary system, the mathematical expression of an integer is
I = (an an−1 an−2 . . . . . . a2 a1 a0 )2 (7)
where ai is a binary bit, that is 0 or 1. The decimal value of a binary number is
n
X
I = an × 2n + an−1 × 2n−1 + · · · + a1 × 21 + a0 × 20 = 2k ak . (8)
k=0
2
If the binary number contains a fractional part, i.e if it is of the form
I = (an an−1 an−2 . . . . . . a2 a1 a0 .a−1 a−2 a−3 . . .)2 (9)
it can be represented as
I = 2n an + 2n−1 an−1 + · · · + 21 a1 + 20 a0 + 2−1 a−1 + 2−2 a−2 + 2−3 a−3 + . . . (10)
Example 2 Base 2 to decimal:
(11001)2 = 1 × 24 + 1 × 23 + 0 × 22 + 0 × 21 + 1 × 20 = 16 + 8 + 0 + 0 + 1 = (25)10
(0.0111)2 = 0 × 2−1 + 1 × 2−2 + 1 × 2−3 + 1 × 2−4 = 0 + 0.25 + 0.125 + 0.0625 = (0.4375)10
1.1 Converting the integer part
A decimal number I10 can be represented in binary form as
I10 = an × 2n + an−1 × 2n−1 + . . . + a1 × 21 + a0 × 20 = (an an−1 an−2 . . . . . . a2 a1 a0 )2 . (11)
The following procedure is a more general approach that can be used to convert decimal integers to binary
form.
Given an integer I10 , determine the quotients Qi and the remainders Ri such that
I10 = 2Q0 + R0
Q0 = 2Q1 + R1
Q1 = 2Q2 + R2
.. ..
. .
Qn−1 = 2Qn + Rn
The remainders are 0s and 1s and the binary form for the number I10 is
Rn Rn−1 Rn−2 . . . R1 R0 (12)
It should be noted that
Rn Rn−1 Rn−2 . . . R1 R0 = an an−1 an−2 . . . a1 a0 (13)
Example 3 Convert the following decimal numbers into binary form
(a) (53)10
(b) (315)10
Answer:
(a)
Quotients Remainders
2 53
2 26 1 = a0
2 13 0 = a1
2 6 1 = a2
2 3 0 = a3
2 1 1 = a4
2 0 1 = a5
Thus (53)10 = (110101)2
3
(b)
Quotients Remainders
2 315
2 157 1 = a0
2 78 1 = a1
2 39 0 = a2
2 19 1 = a3
2 9 1 = a4
2 4 1 = a5
2 2 0 = a6
2 1 0 = a7
2 0 1 = a8
Thus (315)10 = (100111011)2
For more details, including on binary shifting please check the notes Math243Lnotes2020a.
2 Computers use two formats for numbers
Definition 1 (Fixed-point numbers) Fixed-point numbers are used to store integers.
Here each number is stored in a computer word of 32 binary digits.
• The answer is always an integer
• The stored numbers are equally spaced
• Fixed-point numbers give a limited range of numbers
Definition 2 (Floating point numbers) Floating point numbers approximate real numbers.
• stored numbers are not equally spaced
• gives a wide range of numbers that can be represented exactly.
3 Algorithms and Convergence
Algorithm, is a word that crops up so often in numerical analysis it important to have an understanding of
what it means.
Definition 3 (Algorithm) An algorithm is a procedure that describes, in an unambiguous manner, a finite
sequence of steps to be performed in a specific order to achieve a certain outcome.
Alternatively:
A numerical algorithm is simply a procedure that:
1. Takes input variables,
2. Through a sequence of steps manipulates the input variables along with additional temporary variables
to give,
3. Output variables.
4
In numerical methods the objective of an algorithm is to implement a procedure to solve a problem or
approximate a solution to a problem.
Definition 4 (Stability of an algorithm) An algorithm is stable if any small change in the initial data
only results in a small change in the final data. Otherwise it is unstable. The algorithm is conditionally
stable if it is only stable for a certain choice of initial data.
Key concepts - error growth
Let f (x) = 0 be an equation to be solved using a numerical algorithm.
Suppose x is the exact solution and x0 , the initial approximation, x1 the second approximation, · · · xn the
nth approximation. Let:
• E0 = |x − x0 | > 0 denote the initial error,
• E1 = |x − x1 | > 0 the error after the first approximation,
• E2 = |x − x2 | > 0 the error after the second approximation,
.
• ..
• En = |x − xn | is the error after n operations.
Definition 5 (Linear growth) If En ≈ CE0 · n (for a constant C, which is independent of n), then the
error growth is linear.
Linear error growth is usually unavoidable, and in the case where C and E0 are small the results are generally
acceptable. The algorithm is stable.
Definition 6 (Exponential error growth) If En ≈ C n E0 for some C > 1, then the error growth is
exponential.
In this case the error will dominate very fast(the result is unreliable).
NB. Exponential error growth is unacceptable. Regardless of the size of E0 the error grows rapidly. The
algorithm is unstable.
Example 4 Decide if the recursive equation
10
pn = pn−1 − pn−3 for n = 2, 3, . . . ,
3
gives linear or exponential error growth if 5-digit rounding is used.
Solution: Please follow the recorded lecture.
4 Rate of Convergence: Sequences
The rate of convergence of a sequence can be linear (slow), quadratic (fast), and so on. For a numerical
method, quadratic convergence is more desirable than linear convergence, higher order convergence allows
us to obtain a solution with fewer steps (iterations).
5
∞
Definition 7 (Rate of Convergence) . Suppose we have two sequences: {βn }n=1 which converges to
∞
zero, and {αn }n=1 which converges to a number α.
If there exists K > 0 with
|αn − α| ≤ K|βn |,
∞
for n large enough, then we say that {αn }n=1 converges to α with a rate of Convergence O(βn ) (read “Big
Oh of βn ”).
We write
αn = α + O(βn ).
∞
Note: The sequence {βn }n=1 is usually chosen to be
1
βn = ,
np
for some positive value of p.
Note: We are generally interested in the largest value of p with
1
αn = α + O .
np
Big-Oh defined
• Big-Oh is about concerned with finding an asymptotic upper bound
• Formal definition: f (N ) = O(g(N )) if there exists constants c and N0 such that
f (N ) ≤ c · g(N ) for all N ≥ N0 .
(read: “f (N ) grows no faster than g(N )”)
• Note: We are interested in how f grows when N is very large.
Example 5 Find the rate of convergence of the sequence
1
αn = α + √ as n → ∞.
n
Solution: see also the recorded lecture.
Strategy: can we find a K > 0 such that |αn − α| ≤ K|βn |?
Take > 0 and note that
1 1
|αn − α| = √ ≤ (1 + ) √ .
n | {z } n
K |{z}
βn
1
It follows that αn = α + O √ .
n
∞ 1
Conclusion: The rate of convergence of the sequence {αn }n=1 to α is O √ .
n
6
Example 6 Determine the rate of convergence of the sequence
1 1
αn = sin − .
n n
Solution:
Start with the Maclaurin series of sin(x) (this is the Taylor series expansion of sin(x) at x = 0):
x3 x5
sin(x) = x − + − ··· .
3! 5!
1 1 1 1
Hence sin = − 3+ − ··· .
n n 6n 120n5
Then
1 1 1 1 1 1
αn = sin − ≈− 3 +O ⇒ |αn − 0| = +O .
n n 6n n5 6n3 n5
1 1 1
Hence αn = 0 + O noting that 3 as n → ∞.
n3 n 5 n
5 Rate of Convergence: Function Limits
Definition 8 (Rate of Convergence) Consider two functions G(x) and F (x) such that
lim G(h) = 0 and lim F (h) = L.
h→0 h→0
If there exists K > 0 such that
|F (h) − L| ≤ K|G(h)| for all h < H,
where H > 0, then
F (h) = L + O (G(h)) .
We say F (h) converges to L with a rate of convergence O (G(h)).
Example 7 Consider the function
α(h) = sin(h) − h.
What is the rate of convergence of α(h) as h → 0?
Solution: This example is similar to the previous one, and so we use the Maclaurin series:
h3
+ O h5 .
sin(h) ≈ h −
6
Then
h3
+ O h5 ⇒ lim α(h) = 0 + O h3 .
|α(h)| =
6 h→0
Conclusion: The rate of convergence is O h3
——- THE END ——-