Numerical Methods (Engr-280)
Errors in Numerical Methods
Roundoff Errors
Dakar American University of Science and Technology (DAUST)
Dr. Amadou Toure
26 March, 2023
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 1 / 15
Source of Error:
Two sources of numerical error:
1 Round off error - Caused by representing a number approximately
2 Truncation error - Caused by truncating or approximating a
mathematical procedure.
1 √
Round off error: ≅ 0.333333, 2 ≅ 1.4142 …
3
Example problems created by round off error:
▶ Over 20 civilians killed by enemy’s Scud missile due to failure in the
defense tracking system to intercept the Scud. Why?
▶ Clock cycle of 1/10 seconds was represented in 24-bit fixed point
register created an error of 9.5 × 10−8 seconds.
▶ The battery was on for consecutive hours, thus causing an inaccuracy of
𝑠 3600𝑠
9.5 × 10−8 × 100ℎ𝑟 × = 0.342𝑠
0.1𝑠 1ℎ𝑟
▶ Calculated shift in the ranging system of the missile ≈ 687 meters.
▶ The target was considered to be out of range at a distance greater
than 137 meters.
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 2 / 15
Binary Representation
Decimal Number Representation (Base 10):
257.76 = 2 × 102 + 5 × 101 + 7 × 100 + 7 × 10−1 + 6 × 10−2
Binary Number Representation (Base 2):
1 × 23 + 0 × 2 2 + 1 × 2 1 + 1 × 2 0
(1011.0011)2 = ( )
+ 0 × 2−1 + 0 × 2−2 + 1 × 2−3 + 1 × 2−4
10
= 11.1875
Convert base-10 to binary:
Quotient Remainder
11/2 5 1=𝑎0 ⎧ 2) 11 ↑
↑ 1 ⎫
{ ↑
↑ }
5/2 2 1=𝑎1 2) 5 ↑
↑ 1
1110 →⎨ ↑
2/2 1 0=𝑎2 2) 2 ↑
↑
↑ 0 ⎬ 10112
{ ↑
↑ }
1/2 0 1=𝑎3 ⎩ 2) 1 ↑ 1 ⎭
Hence (11)10 = (𝑎3 𝑎2 𝑎1 𝑎0 )2 = (1011)2
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 3 / 15
Fractional Decimal Number to Binary
Number Number after decimal Number before decimal
0.1875 × 2 0.375 0.375 0 = 𝑎−1
0.375 × 2 0.75 0.75 0 = 𝑎−2
0.75 × 2 1.75 0.5 1 = 𝑎−3
0.5 × 2 1.0 0.0 1 = 𝑎−4
Hence (0.1875)10 = (𝑎−1 𝑎−2 𝑎−3 𝑎−4 )2 = (0.0011)2
(11.1875)10 = (?.?)2
▶ (11)10 = (1011)2 and (0.1875)10 = (0.0011)2
Thus:
▶ (11.1875)10 = (1011.0011)2
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 4 / 15
All Fractional Decimal Numbers Cannot be Represented Exactly:
Number Number after decimal Number before decimal
0.3 × 2 0.6 0.6 0 = 𝑎−1
0.6 × 2 1.2 0.2 1 = 𝑎−2
0.2 × 2 0.4 0.4 0 = 𝑎−3
0.4 × 2 0.8 0.8 0 = 𝑎−4
0.8 × 2 1.6 0.6 1 = 𝑎−5
⋯ ⋯ ⋯ ⋯
Hence (0.3)10 = (𝑎−1 𝑎−2 𝑎−3 𝑎−4 𝑎−5 )2 = (0.01001)2 = 0.28125
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 5 / 15
Scientific Form: Floating Point
Floating Decimal Point : Scientific Form
256.78 → +2.5678 × 102 0.003678 → +3.678 × 10−3
exponent 𝑒
The form: sign × mantissa × base or 𝜎 × 𝑚 × base
Floating Point Format for Binary Numbers:
𝑦 = 𝜎 × 𝑚 × 2𝑒
𝜎 = sign (0 for +ve, 1 for -ve )
𝑚 - mantissa or fraction (aka significand) - the valuable digits
𝑒 - exponent, how far & in which direction to move the decimal point
(11.1875)10 = (1011.0011)2 = (1.0110011)2 × 23
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 6 / 15
How computer represents number in memory.
Computer represents numbers in binary (1’s and 0’s).
Computer has a limited amount of space for each number or variable.
Thus direct impact to the size, or range, of numbers represented.
▶ a byte (8-bits) to represent 28 = 256 different numbers,
unsigned (+ve): 0 − 256
signed: 2𝑛−1 to 2𝑛−1 − 1 or −128 to 127
▶ a word (16-bits) can be used to represent 216 = 65 536 numbers
▶ a double-word (32-bits) can be used to represent 232 = 4 294 967 296
numbers.
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 7 / 15
IEEE Standard 754
Most (but not all) computer manufacturers use IEEE-754 format
Number represented:
(−1)𝑠 × (1.𝑀 ) × 2𝐸−𝑏𝑖𝑎𝑠 ≡ (−1)𝑠 × 2𝐸−𝑏𝑖𝑎𝑠 × (1.𝑀 )
31 23 0
Sign
Exponent Mantissa
Single Precision (32-bits): The exponent is 8 bits and the mantissa 23
bits.
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 8 / 15
Number Range:
𝑒 = 1111 1111 = 28 − 1 = 255 reserved
𝑒 = 0000 0000 = 0 reserved
so, 𝐸 = 𝑒 − 127 ≡ 1 − 127 ≤ 𝐸 ≤ 254 − 127 ≡ −126 ≤ 𝐸 ≤ 127
Smallest exponent: 𝑒 = 0000 0000, represents denormal numbers
(1.𝑚 → 0.𝑚)
Largest exponent:
𝑒 = 1111 1111, represents ±∞, if 𝑚 = 0
𝑒 = 1111 1111, represents NaN (not a number) , if 𝑚 ≠ 0
Smallest positive normal number
= 1.0000 0000 ⋯ 0000 × 2−126
≈ 1.2 × 10−38
bin: 0000 0000 1000 0000 0000 0000 0000 0000 MATLAB:
realmin(’single’)
Largest positive number
= 1.1111 1111 ⋯ 1111 × 2127
= (1 + (1 − 2−23 )) × 2127
≈ 2128 ≈ 3.4 × 1038
bin: 0111 1111 0111 1111 1111 1111 1111 1111 MATLAB:
realmax(’single’)
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 9 / 15
Machine Epsilon
Machine epsilon (𝜖𝑚 ) is defined as the distance (gap) between 1 and the
next largest floating point number.
For IEEE-754 single precision, 𝜖𝑚 = 2−23 ≈ 1.192 × 10−7 , as shown by:
1 = 1.00000000000000000000000
1 + 𝜖𝑚 = 1⏟ .0000000000000000000000 1⏟
20 2−23
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 10 / 15
(11.1875)10 = (1011.0011)2 = (1.0110011)2 × 23
▶ 𝑠 = 0, +ve ⟹ (−1)0 = 1
▶ 𝑒 = 3 = 𝐸 − bias ⟹ 𝐸 = 3 + 127 = (130) 0 = (100000010)
1 2
▶ 𝑀 = 0110011, fill with 0 to 23-bits
31 23 0
01000001000110011000000000000000
Given ieee-format, convert to decimal:
31 23 0
11000001110110010100000000000000
𝑠 = 1, -ve ⟹ (−1)1 = −1
𝐸 = 28 + 21 + 20 = 128 + 2 + 1 = 131
𝑒 = 𝐸 − 127 = 131 − 127 = 4
𝑀 = 2−1 + 2−3 + 2−4 + 2−7 + 2−9 = 0.697265625
= −1 × 24 × (20 + 0.697265625) = −27.15625
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 11 / 15
Some fractions can’t be represented exactly! (But, 𝜖𝑎 < 𝜖𝑚 )
0.02832 x 2 = 0.05664
0.05664 x 2 = 0.11328
0.11328 x 2 = 0.22656
0.22656 x 2 = 0.45312
0.45312 x 2 = 0.90624 (0.02832)10 = (0.00000111001111111111101011000001)2
0.90624 x 2 = 1.81248 = 1.11001111111111101011000001 × 2−6
0.81248 x 2 = 1.62496 𝐸 = −6 + 127 = 121 = 011110012
0.62496 x 2 = 1.24992
0.24992 x 2 = 0.49984
0.49984 x 2 = 0.99968
0.99968 x 2 = 1.99936
⋯
31 23 0
00111100111001111111111101011000
(0.02832)10 ≈ (0.02831999957561492919921875)2
The Error 𝐸𝑡 = −4.2438507080078125 × 10−10
𝐸𝑡
𝜖𝑎 = ∣ ∣ = 1.498535 × 10−08 < 𝜖𝑚 (𝑒−23 or 1.192 × 10−07 )
0.02832
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 12 / 15
Not Really Real
Floating point numbers approximate real numbers.
▶ The number of bits in the mantissa sets the granularity,
▶ the exponent sets the range of the approximation.
Exple: a 3-bit exponent and 3-bit mantissa represent the following values on
the real line:
Calculations with results greater than greatest value (15 in this case) gives
an overflow.
A result between 0 and the lowest value (1/8 in this case) gives an
underflow.
The distance from 1 to the smallest value greater than 1 is 𝜖𝑚 (machine
epsilon)
Dr. ▶ (in Matlab
Amadou Toure eps)Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 13 / 15
single precision - The exponent is 8 bits and the mantissa 23 bits.
𝜖𝑚 ≈ 1.2 × 10−7
Overflow at ≈ 3.4 × 1038
double precision - The exponent is 11 bits and the mantissa 52 bits.
𝜖𝑚 ≈ 2.2 × 10−16
Overflow at ≈ 1.8 × 10308
The underflow is handled gracefully by unnormalized values, that fill
the gap round zero.
If an underflow still occurs the result is set to 0.
An overflow gives ∞ as result.
Undeterminable results produce a NaN – Not a Number. (ex: 0/0 =
NaN).
Arithmetic operations (+, −, ∗, /) get a result as if they were
calculated in full precision, and then rounded to fit into the floating
point representation.
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 14 / 15
Rounding errors introduced in IEEE floating point arithmetics are modeled
like this:
𝜖
𝑓𝑙(𝑥 + 𝑦) = (𝑥 + 𝑦)(1 + 𝜎), |𝜎| ≤ 𝑚
2
𝜖
𝑓𝑙(𝑥 − 𝑦) = (𝑥 − 𝑦)(1 + 𝜎), |𝜎| ≤ 𝑚
2
𝜖
𝑓𝑙(𝑥 ∗ 𝑦) = (𝑥 ∗ 𝑦)(1 + 𝜎), |𝜎| ≤ 𝑚
2
𝜖𝑚
𝑓𝑙(𝑥/𝑦) = (𝑥/𝑦)(1 + 𝜎), |𝜎| ≤
2
Two uses:
(uncommon) Keep track on errors in calculations
(important) Analyze algorithms with respect to influence of floating
point round off. If round off errors are moderate compared to a
perturbation analysis of the problem, the algorithm is stable.
Dr. Amadou Toure Numerical Methods (Engr-280) Errors in Numerical Methods
26 March, 2023 15 / 15