
Efficient algorithms for verified scientific computing :

Numerical linear algebra using interval arithmetic


Hong Diep Nguyen

To cite this version:


Hong Diep Nguyen. Efficient algorithms for verified scientific computing : Numerical linear algebra
using interval arithmetic. Other [cs.OH]. Ecole normale supérieure de lyon - ENS LYON, 2011.
English. NNT : 2011ENSL0617. tel-00680352

HAL Id: tel-00680352


https://theses.hal.science/tel-00680352v1
Submitted on 19 Mar 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific
research documents, whether they are published or not. The documents may come from teaching and
research institutions in France or abroad, or from public or private research centers.
THESIS
submitted for the degree of
Docteur de l'École Normale Supérieure de Lyon - Université de Lyon

Specialty: Computer Science

Laboratoire de l'Informatique du Parallélisme

École Doctorale de Mathématiques et Informatique Fondamentale

presented and publicly defended on 18 January 2011

by Mr. Hong Diep NGUYEN

Efficient algorithms for verified scientific computing:


numerical linear algebra using interval arithmetic

Thesis advisor: Mr. Gilles VILLARD
Thesis co-advisor: Mrs. Nathalie REVOL

After the reports of: Mr. Philippe LANGLOIS
                      Mr. Jean-Pierre MERLET

Before the examination committee composed of:

Mr. Luc JAULIN, President
Mr. Philippe LANGLOIS, Reviewer
Mr. Jean-Pierre MERLET, Reviewer
Mrs. Nathalie REVOL, Co-advisor
Mr. Siegfried RUMP, Member
Mr. Gilles VILLARD, Advisor
Efficient algorithms for verified scientific computing:
numerical linear algebra using interval arithmetic

Nguyễn Hồng Diệp

18 January 2011
Contents

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Description of the document . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Interval arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.1 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Algebraic properties . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.3 Absolute value properties . . . . . . . . . . . . . . . . . . . . . . . 8
1.3.4 Midpoint and radius properties . . . . . . . . . . . . . . . . . . . . 8
1.3.5 Variable dependency . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.6 Wrapping effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3.8 Implementation environment . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Interval matrix multiplication 17


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 State of the art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Algorithm 1 using endpoints representation . . . . . . . . . . . . . . . . . . 21
2.3.1 Special cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2 General case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Algorithm 2 using midpoint-radius representation . . . . . . . . . . . . . . 34
2.4.1 Scalar interval product . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4.2 Matrix and vector operations . . . . . . . . . . . . . . . . . . . . . 38
2.4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3 Verification of the solution of a linear system 45


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 State of the art, verifylss function . . . . . . . . . . . . . . . . . . . . . 46
3.3 Floating-point iterative refinement . . . . . . . . . . . . . . . . . . . . . . 48
3.4 Interval iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.1 Numerical behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4.2 Updating the computed solution . . . . . . . . . . . . . . . . . . . . 52
3.4.3 Initial solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.4.4 Interval iterative improvement . . . . . . . . . . . . . . . . . . . . . 54
3.4.5 Detailed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 55


3.5 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57


3.5.1 Implementation issues . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4 Relaxation method 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Interval improvement: Jacobi and Gauss-Seidel iterations . . . . . . . . . . 64
4.2.1 Convergence rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2.3 Initial solution of the relaxed system . . . . . . . . . . . . . . . . . 70
4.3 Implementation and numerical experiments . . . . . . . . . . . . . . . . . . 71
4.3.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.3.2 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5 Effects of the computing precision 75


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Implementation using variable computing precisions . . . . . . . . . . . . . 75
5.2.1 Residual computation . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Floating-point working precision . . . . . . . . . . . . . . . . . . . . 76
5.2.3 Needed working precision . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.1 Normwise relative error . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 Componentwise relative error . . . . . . . . . . . . . . . . . . . . . 83
5.4 Doubling the working precision for approximate solution . . . . . . . . . . 84
5.4.1 Implementation issues . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.4.2 Numerical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4.3 Other extended precisions . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

Conclusions and Perspectives 89


List of Figures

1.1 Wrapping effect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.1 Performance of igemm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


2.2 Performance of igemm3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.1 Solution set of the interval linear system (3.3). . . . . . . . . . . . . . . . . 51


3.2 MatLab implementation results for 1000 × 1000 matrices. . . . . . . . . . 60

4.1 Solution set of the relaxed system. . . . . . . . . . . . . . . . . . . . . . . . 65


4.2 Relaxation technique: MatLab implementation results for 1000 × 1000 ma-
trices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.1 Effects of the residual precision. . . . . . . . . . . . . . . . . . . . . . . . . 77


5.2 Effects of the working precision. . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Minimal working precision needed. . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Effect of doubling the working precision for the approximate solution. . . . 87
5.5 Effect of other extended precisions. . . . . . . . . . . . . . . . . . . . . . . 88

List of Algorithms

2.1 Decomposition algorithm for endpoint intervals . . . . . . . . . . . . . . . 29


3.1 Classical iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Interval iterative refinement . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.3 Testing the H-matrix property . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Solving and verifying a linear system . . . . . . . . . . . . . . . . . . . . . 56
4.1 Interval improvement function refine using relaxed Jacobi iterations . . . 72
5.1 Solving and verifying a linear system . . . . . . . . . . . . . . . . . . . . . 76
5.2 Doubling the precision for the approximate solution . . . . . . . . . . . . . 86

Chapter 1

Introduction

This chapter presents the general context and the motivation of my Ph.D.
preparation. It gives a brief introduction to interval arithmetic together with
its advantages as well as a discussion on the main obstacles to its application.
Interval arithmetic always provides reliable results but suffers from low perfor-
mance with respect to floating-point arithmetic. Moreover, a naive application
of interval arithmetic does not provide a sharp enclosure of the exact result, or
can even fail to provide guaranteed results. In this chapter, we present some
properties of interval arithmetic, which will serve as the basis to derive efficient
implementations of interval algorithms in the following chapters.

1.1 Motivation
Floating-point arithmetic is used for scientific computing purposes and allows one to obtain
accurate results at very high speed. In many cases, floating-point arithmetic provides
results of a quality which is satisfactory for user demand. However, with its well-known
effect of rounding errors, computed results are rarely exact. Each number being encoded
by a finite number of digits, the set of floating-point numbers is indeed discrete and thus
cannot represent all real values. Meanwhile the result of a mathematical operation between
real numbers or floating-point numbers is a real value and in many cases is not representable
by another floating-point number. Thus, to be able to be encoded by another floating-point
number, the computed result takes as value a floating-point number which is close to the
exact result. More precisely, the computed result is rounded to a nearby floating-point
value. The difference between the floating-point value and the exact value is called the
rounding error.
These computed results are indeed approximate results. To reduce the effect of round-
ing errors, one can increase the computing precision. The higher the computing precision
is, the smaller the rounding errors are. However, increasing the computing precision means
increasing the storage space as well as increasing significantly the computing time. There-
fore, fixed finite precision is chosen to gain performance. The most common formats are
IEEE single and double precisions. Using the IEEE double precision, which is encoded by
64 bits with 1 sign bit, 11 exponent bits, and 52 mantissa bits, the effect of rounding errors
in many applications is negligible. Nevertheless, in some cases, it is needed to assess how
accurate the computed results are.
There exists a number of techniques for assessing the effect of rounding errors. Firstly,


an a priori error bound upon the computed result can be obtained by performing numer-
ical analysis. This error bound is based on the theory of perturbations [Hig02], and is
usually estimated using the condition number which measures the stability of results with
respect to small changes in inputs. Nevertheless, when the problem is ill-conditioned, these
condition-based error bounds are usually pessimistic and can give only an overview of how
accurate the computed results could be expected, regardless of which computation model
is employed. In some cases, a rigorous assessment of result accuracy is needed.
Another option is to use running error analysis, which yields an a posteriori error bound
simultaneously with the computed approximate result [Hig02, Wil60, Lan04]. Running
error computation is based on the rounding error model: |x ◦ y − fl(x ◦ y)| ≤ ε |fl(x ◦ y)|,
where ε is the machine relative error. Running error computation uses the actual computed
intermediate quantities, therefore a running error bound is usually smaller than an a priori
error bound. However, running error computation depends on every specific problem.
Each individual problem needs to have its own model of computing running errors. This
is disadvantageous when the problem in question is of high complexity.
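As an illustration (this sketch is mine and is not part of the thesis), the rounding error model above can be turned into a running error bound for recursive summation in a few lines of Python; the bound is accumulated from the actual intermediate sums:

    def sum_with_running_error(xs, eps=2.0**-53):
        # Recursive summation with a first-order running error bound:
        # after each addition s = fl(s + x), the rounding error is at most
        # eps * |s|, so accumulating eps * |s| bounds the total error.
        s = 0.0
        bound = 0.0
        for x in xs:
            s += x
            bound += eps * abs(s)
        return s, bound

    s, err = sum_with_running_error([0.1] * 1000)
    print(s, "+/-", err)

Such a bound is usually tighter than an a priori bound because it uses the computed intermediate quantities, but, as noted above, the update rule has to be derived anew for each algorithm.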
Interval arithmetic is designed to automate this process of computing reliable results, as
stated in the dissertation of R. E. Moore entitled "Interval Arithmetic and Automatic Error
Analysis in Digital Computing" [Moo63] which is the most popular among the pioneers to
introduce interval arithmetic in the early 60s. Interval arithmetic does not aim directly at
improving the accuracy of computations, but at providing a certificate of accuracy. The
fundamental principle of interval arithmetic is to enclose all quantities which are not known
exactly during computations. Using intervals to represent data, and interval arithmetic to
perform computations, one obtains not an approximation but a range of values in which
lies the exact result. With this enclosure, one gets a certificate of accuracy, i.e. a bound
on the distance between the exact result and the computed result. This computed result
is said to be verified in the sense that a value which is not included in the computed range
cannot be the exact result.
Interval arithmetic always provides verified results. Nevertheless, if the computed en-
closure is too large, then the provided certificate is of no use. Therefore, accuracy is of
great concern when developing numerical methods that yield reliable results using interval
arithmetic.
Interval arithmetic allows one to obtain a certificate for computed results. Meanwhile,
arbitrary precision allows one to increase the accuracy of computations by reducing the impact
of rounding errors. The combination of these two components provides accurate results
accompanied by a sharp error bound.
On the one hand, having to compute an enclosure of all possible values, interval arith-
metic suffers from low performance. The ratio of execution time of interval arithmetic with
respect to floating-point arithmetic is 4 in theory, as can be seen from the formula for
interval multiplication. In practice, this factor is much higher if care is not taken when im-
plementing it. Moreover, using arbitrary precision also introduces an important overhead
in execution time. Although interval arithmetic is able to obtain reliable results endowed
with a certificate of accuracy, if the computation time is too long then it is not of much
use.
On the other hand, for many problems which have been solved efficiently in floating-
point arithmetic, when porting to interval arithmetic, simply replacing floating-point op-
erations by their interval counterparts does not lead to an efficient algorithm and mostly
does not provide results of high precision, or it can even fail to provide guaranteed results.

Therefore, I carried out my Ph.D. work with the aim of obtaining an efficient
implementation of interval arithmetic for some numerical problems. "Efficient" means
computing enclosures of small relative width at a cost of reasonable execution time, which
is of a constant modest factor with respect to computing approximate results using floating-
point arithmetic. Additionally, increasing the accuracy is likely to increase the computation
time. Hence, in some cases, a trade-off between accuracy and speed is mandatory.
As a first step toward providing a general methodology to obtain an efficient imple-
mentation of interval arithmetic, the three years of my Ph.D. work succeeded in providing
some techniques to improve the efficiency of interval arithmetic for several problems in
numerical linear algebra, as well as some new tradeoffs between accuracy and speed.

1.2 Description of the document

In the context of this Ph.D., we propose several accurate algorithms and efficient implemen-
tations in verified linear algebra using interval arithmetic. Two fundamental problems are
addressed, namely the multiplication of interval matrices and the verification of floating-
point solutions of linear systems.
For the multiplication of interval matrices, we propose two algorithms which are based
respectively on endpoints and midpoint-radius representations of intervals [NR10, Ngu].
These algorithms offer new trade-offs between speed and accuracy. Indeed, they are less
accurate but much faster than the natural algorithm: the speedup is of order 10 for matrices
of dimensions 100 × 100 and might attain order 100 for matrices of dimensions 1000 × 1000.
On the other hand, they are slower than the fastest algorithm (proposed by S. M. Rump),
and the factor of execution time is around 2, but they provide sharper bounds of the
exact results with an overestimation factor less than 1.18, compared to 1.5 for the fastest
algorithm.
To verify the floating-point solution of a linear system, we adapt floating-point iterative
refinements to interval arithmetic [NR]. By using alternately floating-point and interval
refinements, our method provides an accurate solution together with a tight error bound
upon it. To reduce the slowdown due to the intensive use of interval operations, we intro-
duce a relaxation technique, which helps to reduce drastically the algorithm’s execution
time. Additionally, the use of extended precision for the approximate solution yields so-
lutions which are guaranteed to almost the last bit at a cost of an acceptable overhead of
execution time.
This document is organized as follows: the rest of this first chapter briefly introduces
interval arithmetic together with some of its properties. A more thorough introduction to
interval arithmetic and its applications can be found in [Moo63, MB79, MKC09, JKDW01,
Neu90, HKKK08]. The second chapter presents algorithms for implementing the multipli-
cation of interval matrices. The third chapter is devoted to the underlying approach to
verify a linear system of equations. Chapter 4 presents the relaxation technique to improve
the performance of the algorithm presented in chapter 3. Chapter 5 studies the effect of
the computing precisions on the accuracy of computed results.

1.3 Interval arithmetic


Denote by R the set of real numbers. A real interval is a closed connected subset of R,
which is of the form
a = [a̲, a̅],   a̲, a̅ ∈ R ∪ {−∞, +∞},
where a̲ ≤ a̅. In particular, if a̲ = a̅ then a is said to be a thin interval, or a degenerate
interval. If a̲ = −a̅ then a is centered in zero.
The set of real intervals is denoted by IR. Interval arithmetic defines operations over
IR.
Notation. In this document, boldface letters denote intervals, lower case letters denote
scalars and vectors, upper case letters denote matrices.

1.3.1 Operations
An interval algebraic operation is defined by taking all the possible results of the corre-
sponding operation, applied to all the possible pairs of real values taken from the input
intervals. Interval operations must always obey the fundamental property of isotonicity,
regardless of representation or implementation:
∀a ∈ a, ∀b ∈ b, . . . : f(a, b, . . .) ∈ f(a, b, . . .).

Convex hull
The convex hull of a set of real values is the smallest interval which includes all these real
values. Let S be a set of real values, then the convex hull of S, denoted by □S, is defined by

□S = [min(S), max(S)].

Additionally, let a = [a̲, a̅] ∈ IR, b = [b̲, b̅] ∈ IR be two intervals, then the hull of these
two intervals is defined as

□(a, b) = [min(a̲, b̲), max(a̅, b̅)].

Unary operations
For a = [a̲, a̅] ∈ IR, a unary operation f on a is defined as

f(a) = {f(a) : a ∈ a}.

Table 1.1 lists some unary interval operations: the formulas are established using the mono-
tonicity of the operations when possible.
It is easy to prove the isotonicity of these operations. Let us take for example the case
of the square operation. For all a ∈ a, we have a̲ ≤ a ≤ a̅:
• if a̲ ≥ 0 then 0 ≤ a̲ ≤ a ≤ a̅ ⇒ 0 ≤ a̲² ≤ a² ≤ a̅², hence a² ∈ [a̲², a̅²];
• if a̅ ≤ 0 then a̲ ≤ a ≤ a̅ ≤ 0 ⇒ 0 ≤ a̅² ≤ a² ≤ a̲², hence a² ∈ [a̅², a̲²];
• if a̲ < 0 < a̅: because a̲ ≤ a ≤ a̅, we have |a| ≤ max(|a̲|, |a̅|) ⇒ a² ≤ max(a̲², a̅²). It
is always true that a² ≥ 0, hence a² ∈ [0, max(a̲², a̅²)]. Furthermore, as 0, a̲ and a̅
belong to a, the values 0, a̲² and a̅² belong to a², hence [0, max(a̲², a̅²)] is the tightest interval.

Negation    −a = [−a̅, −a̲]

Square      a² = [a̲², a̅²]                 if a̲ ≥ 0
            a² = [a̅², a̲²]                 if a̅ ≤ 0
            a² = [0, max(a̲², a̅²)]         if a̲ < 0 < a̅
Inverse     1/a = [1/a̅, 1/a̲]              if a̲ > 0 or a̅ < 0
            1/a = [−∞, +∞]                if a̲ < 0 < a̅
            1/a = [1/a̅, +∞]               if a̲ = 0
            1/a = [−∞, 1/a̲]               if a̅ = 0

Table 1.1: Unary interval operations.

Binary operations
 
Let a = [a̲, a̅] ∈ IR, b = [b̲, b̅] ∈ IR be two intervals. A binary operation ◦ ∈ {+, −, ∗, /}
between a and b is defined as

a ◦ b = {a ◦ b : a ∈ a, b ∈ b}.

Addition
Due to its monotonicity, the addition of two intervals can be computed by adding their
corresponding endpoints:

a + b = [a̲ + b̲, a̅ + b̅].

The isotonicity of the addition is directly deduced from its monotonicity.
Subtraction
Similarly to the addition, the subtraction of two intervals is computed by

a − b = [a̲ − b̅, a̅ − b̲].

Multiplication
The multiplication of two intervals is formed by taking the minimum and maximum
values of the four products between their pairs of endpoints:

a ∗ b = [min(a̲ ∗ b̲, a̲ ∗ b̅, a̅ ∗ b̲, a̅ ∗ b̅), max(a̲ ∗ b̲, a̲ ∗ b̅, a̅ ∗ b̲, a̅ ∗ b̅)].

Therefore, an interval multiplication requires four punctual products and six compar-
isons. Otherwise, one can reduce the number of punctual products and comparisons by
inspecting the sign of the input intervals. For example, if a ≥ 0 then

a ∗ b = [min(a̲ ∗ b̲, a̅ ∗ b̲), max(a̲ ∗ b̅, a̅ ∗ b̅)].

We can then inspect the sign of the endpoints of b:

• if b̲ ≥ 0 then a ∗ b = [a̲ ∗ b̲, a̅ ∗ b̅],
• if b̅ ≤ 0 then a ∗ b = [a̅ ∗ b̲, a̲ ∗ b̅],
• if b̲ < 0 < b̅ then a ∗ b = [a̅ ∗ b̲, a̅ ∗ b̅].

The worst case is when a̲ < 0 < a̅ and b̲ < 0 < b̅; in this case we say that a and b
contain 0 in their interior. Then

a ∗ b = [min(a̲ ∗ b̅, a̅ ∗ b̲), max(a̲ ∗ b̲, a̅ ∗ b̅)],

so we still have to compute all four punctual products, except that only two comparisons
are needed instead of six. This explains why the cost of interval arithmetic is 4 times the
cost of real arithmetic.
Note that, contrary to real number operations, an interval square is not equivalent
to the multiplication of an interval by itself. For a ∈ IR, it is always true that a² ⊆ a ∗ a,
and equality holds when a does not contain 0 in its interior. For example, if a = [1, 2]
then a² = [1, 4] and a ∗ a = [1, 4], hence a² = a ∗ a. Meanwhile, if a = [−1, 2] then by
definition a² = [0, 4] whereas a ∗ a = [−2, 4].
Division
Interval division is defined via multiplication and inverse, see Table 1.1. Let a ∈ IR, b ∈
IR be two intervals, then

a/b = a ∗ (1/b).
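To fix ideas, the endpoint formulas above can be transcribed directly into code. The following Python sketch is illustrative only (the class and method names are mine, not IntLab's) and ignores rounding errors, which are dealt with in Section 1.3.7:

    from dataclasses import dataclass

    @dataclass
    class Interval:
        lo: float
        hi: float

        def __add__(self, other):
            # [a_lo + b_lo, a_hi + b_hi]
            return Interval(self.lo + other.lo, self.hi + other.hi)

        def __sub__(self, other):
            # [a_lo - b_hi, a_hi - b_lo]
            return Interval(self.lo - other.hi, self.hi - other.lo)

        def __mul__(self, other):
            # min/max of the four endpoint products
            p = (self.lo * other.lo, self.lo * other.hi,
                 self.hi * other.lo, self.hi * other.hi)
            return Interval(min(p), max(p))

        def inverse(self):
            # only defined here when the interval does not contain 0
            assert self.lo > 0 or self.hi < 0
            return Interval(1.0 / self.hi, 1.0 / self.lo)

        def __truediv__(self, other):
            return self * other.inverse()

        def square(self):
            # tighter than self * self when the interval contains 0
            if self.lo >= 0:
                return Interval(self.lo ** 2, self.hi ** 2)
            if self.hi <= 0:
                return Interval(self.hi ** 2, self.lo ** 2)
            return Interval(0.0, max(self.lo ** 2, self.hi ** 2))

    a, b = Interval(1, 2), Interval(-1, 2)
    print(a - a)               # Interval(lo=-1, hi=1): sub-cancellation
    print(b.square(), b * b)   # Interval(lo=0, hi=4) versus Interval(lo=-2, hi=4)

The sign tests that reduce the number of punctual products in the multiplication are omitted here for brevity.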

Set operations
Intervals are closed connected sets of real values, hence some set operations apply to
intervals.
Infimum/supremum
The infimum and supremum of an interval are respectively its inferior and superior
endpoints. For a = [a̲, a̅] ∈ IR,

inf(a) = a̲,
sup(a) = a̅.

Maximal magnitude value, denoted by mag(·) or |·|, is the maximal absolute value
over all real numbers taken from the interval. For a = [a̲, a̅] ∈ IR,

mag(a) = max{|a| : a ∈ a} = max(|a̲|, |a̅|).

Otherwise, the maximal magnitude value can be computed by the following proposition.

Proposition 1.3.1. For a = [a̲, a̅] ∈ IR, it holds that

|a| = max(−a̲, a̅).

Proof. We prove that max(−a̲, a̅) = max(|a̲|, |a̅|). There are three possible cases:

• If a̲ ≥ 0 then a̅ ≥ a̲ ≥ 0 ≥ −a̲. Hence max(|a̲|, |a̅|) = a̅, and max(−a̲, a̅) = a̅. It
implies that max(−a̲, a̅) = max(|a̲|, |a̅|).

• If a̅ ≤ 0 then a̲ ≤ a̅ ≤ 0 ≤ −a̅. Hence max(|a̲|, |a̅|) = −a̲, and max(−a̲, a̅) = −a̲ =
max(|a̲|, |a̅|).

• If a̲ < 0 < a̅ then −a̲ = |a̲| and a̅ = |a̅|. Hence max(−a̲, a̅) = max(|a̲|, |a̅|).



Minimal magnitude value, denoted by mig(·) or ⟨·⟩, is the minimal absolute value
over all real numbers taken from the interval. For a = [a̲, a̅] ∈ IR,

mig(a) = min{|a| : a ∈ a} = 0 if a̲ < 0 < a̅,  and  min(|a̲|, |a̅|) otherwise.

We can also compute the minimal magnitude value by the following proposition.

Proposition 1.3.2. For a = [a̲, a̅] ∈ IR, it holds that

⟨a⟩ = max(0, max(a̲, −a̅)).

The proof is by inspecting the same three cases as in the proof of Proposition 1.3.1.
Midpoint

mid(a) = (a̲ + a̅)/2.

Radius

rad(a) = (a̅ − a̲)/2.
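Continuing the illustrative Python sketch (endpoints stored as a pair (lo, hi); names mine), the four quantities above and Propositions 1.3.1 and 1.3.2 read as follows:

    def mag(lo, hi):
        # maximal magnitude |a| = max(-a_lo, a_hi)            (Proposition 1.3.1)
        return max(-lo, hi)

    def mig(lo, hi):
        # minimal magnitude <a> = max(0, max(a_lo, -a_hi))    (Proposition 1.3.2)
        return max(0.0, max(lo, -hi))

    def mid(lo, hi):
        return 0.5 * (lo + hi)

    def rad(lo, hi):
        return 0.5 * (hi - lo)

    print(mag(-1.0, 2.0), mig(-1.0, 2.0))   # 2.0 0.0  (interval [-1, 2])
    print(mag(1.0, 2.0), mig(1.0, 2.0))     # 2.0 1.0  (interval [1, 2])
    print(mid(1.0, 2.0), rad(1.0, 2.0))     # 1.5 0.5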

1.3.2 Algebraic properties


Let a, b, c ∈ IR be three intervals. The following laws are true:

1. neutral and unit elements: a + 0 = a and a ∗ 1 = a,

2. commutativity: a + b = b + a and a ∗ b = b ∗ a,

3. associativity: (a + b) + c = a + (b + c) and (a ∗ b) ∗ c = a ∗ (b ∗ c),

4. sub-distributivity: a ∗ (b + c) ⊆ a ∗ b + a ∗ c.

The sub-distributivity property is one of the important sources of overestimation in


interval arithmetic.

Example 1. Let a = [1, 2], b = [−1, 2], c = [2, 3].

a ∗ b + a ∗ c = [1, 2] ∗ [−1, 2] + [1, 2] ∗ [2, 3] = [−2, 4] + [2, 6] = [0, 10]


a ∗ (b + c) = [1, 2] ∗ ([−1, 2] + [2, 3]) = [1, 2] ∗ [1, 5] = [1, 10] .

One can see that a ∗ (b + c) ⊂ a ∗ b + a ∗ c and the inclusion is strict.



1.3.3 Absolute value properties


Some properties of the absolute value for real numbers can be extended to intervals.

Proposition 1.3.3. For a, b ∈ IR, the following properties hold:

(1) |−a| = |a| and ⟨−a⟩ = ⟨a⟩,

(2) triangular inequality: |a| − |b| ≤ |a ± b| ≤ |a| + |b|,

(3) |a ∗ b| = |a| ∗ |b|,

(4) ⟨a ± b⟩ ≤ ⟨a⟩ + ⟨b⟩,

(5) ⟨a ∗ b⟩ = ⟨a⟩ ∗ ⟨b⟩.

Proof. (1), (2), (3), (5) can be deduced directly from the properties of the real absolute value.
Let us prove (4). For all a ∈ a, b ∈ b, we have

|a| − |b| ≤ |a ± b| ≤ |a| + |b|  and  |a ∗ b| = |a| ∗ |b|.

Let â ∈ a, b̂ ∈ b be such that |â| = ⟨a⟩ and |b̂| = ⟨b⟩. It is obvious that â ± b̂ ∈ a ± b.
Therefore ⟨a ± b⟩ ≤ |â ± b̂| ≤ |â| + |b̂| ≤ ⟨a⟩ + ⟨b⟩, which implies (4).

Note that the following inequality is not always true: ⟨a⟩ − ⟨b⟩ ≤ ⟨a ± b⟩. For example,
if a = [1, 2] and b = [−1, 2] then ⟨a⟩ − ⟨b⟩ = 1 − 0 = 1. Meanwhile, ⟨a + b⟩ = ⟨[0, 4]⟩ = 0.
Hence, ⟨a⟩ − ⟨b⟩ > ⟨a + b⟩. On the other hand, if a ≥ 0 and b ≥ 0 then ⟨a + b⟩ = a̲ + b̲ =
⟨a⟩ + ⟨b⟩ ≥ ⟨a⟩ − ⟨b⟩.

1.3.4 Midpoint and radius properties


Both addition and subtraction increase the width of intervals:

rad(a ± b) = rad(a) + rad(b).

The following proposition assesses the width of interval multiplication.

Proposition 1.3.4. [MKC09] For a, b ∈ IR, the following properties hold:

(1) |a| = |mid(a)| + rad(a),

(2) rad(a ∗ b) = |a| ∗ rad(b) if a is thin or b is centered in zero,

(3) rad(a ∗ b) ≤ |mid(a)| ∗ rad(b) + rad(a) ∗ |mid(b)| + rad(a) ∗ rad(b),

(4) rad(a ∗ b) ≤ |a| ∗ rad(b) + rad(a) ∗ |b|,

(5) rad(a ∗ b) ≥ rad(a) ∗ rad(b) + ⟨a⟩ ∗ rad(b) + rad(a) ∗ ⟨b⟩.

Proof.

(1) By definition, |a| = max(|a̲|, |a̅|) = max(|mid(a) − rad(a)|, |mid(a) + rad(a)|) = |mid(a)| +
rad(a).
   
(2) If a is thin, i.e. a̲ = a̅ = a ∈ R, then a ∗ b is equal to either [a ∗ b̲, a ∗ b̅] or [a ∗ b̅, a ∗ b̲], which
implies that rad(a ∗ b) = 1/2 |a ∗ b̅ − a ∗ b̲| = 1/2 |a ∗ (b̅ − b̲)| = |a| ∗ 1/2 (b̅ − b̲) = |a| ∗ rad(b).
If b is centered in zero, i.e. −b̲ = b̅ = rad(b) = |b|, then for all a ∈ a, b ∈ b we
have a ∗ b̲ = −(a ∗ b̅), |b| ≤ b̅, and −|a| ∗ b̅ ≤ a ∗ b ≤ |a| ∗ b̅, which implies that
a ∗ b = [−|a| ∗ b̅, |a| ∗ b̅] for every real a ∈ a. Hence a ∗ b = ∪{[−|a| ∗ b̅, |a| ∗ b̅] : a ∈ a} =
[−|a| ∗ b̅, |a| ∗ b̅] ⇒ rad(a ∗ b) = |a| ∗ b̅ = |a| ∗ rad(b), and mid(a ∗ b) = 0.

(3) This proof is established by distinguishing the cases a ≥ 0, a ≤ 0 and a ∋ 0,
and similarly for b. When an interval contains 0, one must furthermore distinguish
according to the sign of the midpoint of the interval. The cases where the two intervals
do not contain 0 and where one interval contains 0 are detailed below, the other cases
are similar.
When a ≤ 0 and b ≥ 0, the formula which applies for the product a ∗ b is
a ∗ b = [a̲ ∗ b̅, a̅ ∗ b̲] and thus rad(a ∗ b) = 1/2 (a̅ ∗ b̲ − a̲ ∗ b̅). The right-hand side of the
inequality can be calculated and simplified into 1/4 (a̲ ∗ b̲ − 3 a̲ ∗ b̅ + a̅ ∗ b̲ + a̅ ∗ b̅). Checking
whether rad(a ∗ b) ≤ 1/4 (a̲ ∗ b̲ − 3 a̲ ∗ b̅ + a̅ ∗ b̲ + a̅ ∗ b̅) is equivalent to checking whether
(a̅ − a̲) ∗ (b̅ − b̲) ≥ 0, which holds.
When a ≥ 0 and b ∋ 0, the formula which applies for the product a ∗ b is
a ∗ b = [a̅ ∗ b̲, a̅ ∗ b̅] and thus rad(a ∗ b) = 1/2 a̅ ∗ (b̅ − b̲). To be able to calculate and simplify
the right-hand side of the inequality, one must distinguish further according to the sign
of mid(b). If mid(b) ≥ 0 then this right-hand side is equal to 1/4 (−a̲ ∗ b̲ − a̲ ∗ b̅ − a̅ ∗ b̲ + 3 a̅ ∗ b̅).
Checking whether rad(a ∗ b) ≤ 1/4 (−a̲ ∗ b̲ − a̲ ∗ b̅ − a̅ ∗ b̲ + 3 a̅ ∗ b̅) is equivalent to checking
whether (a̅ − a̲) ∗ (b̲ + b̅) ≥ 0, which holds since mid(b) = 1/2 (b̲ + b̅) is assumed to be non-
negative. If mid(b) ≤ 0 then this right-hand side is equal to 1/4 (a̲ ∗ b̲ + a̲ ∗ b̅ − 3 a̅ ∗ b̲ + a̅ ∗ b̅).
Checking whether rad(a ∗ b) ≤ 1/4 (a̲ ∗ b̲ + a̲ ∗ b̅ − 3 a̅ ∗ b̲ + a̅ ∗ b̅) is equivalent to checking
whether −(a̅ − a̲) ∗ (b̲ + b̅) ≥ 0, which holds since mid(b) = 1/2 (b̲ + b̅) is assumed to be
non-positive.

(4) The right-hand side of inequality (3) can also be written as (|mid(a)| + rad(a)) ∗ rad(b) +
rad(a) ∗ |mid(b)|, and using (1) one gets

rad(a ∗ b) ≤ |a| ∗ rad(b) + rad(a) ∗ |mid(b)|
           ≤ |a| ∗ rad(b) + rad(a) ∗ |mid(b)| + rad(a) ∗ rad(b).

Using (1) again, one gets a formula with a right-hand side symmetric in a and b:

rad(a ∗ b) ≤ |a| ∗ rad(b) + rad(a) ∗ |b|.

(5) Let α = sign(mid(a)) ∗ rad(a) and β = sign(mid(b)) ∗ rad(b). Then mid(a) ± α ∈ a,
mid(b) ± β ∈ b, and mid(a) ∗ mid(b), mid(a) ∗ β, α ∗ mid(b), α ∗ β all have the same sign.

• If 0 ∈ a and 0 ∈ b then ⟨a⟩ = 0 and ⟨b⟩ = 0. Both (mid(a) + α) ∗ (mid(b) + β)
and (mid(a) + α) ∗ (mid(b) − β) are included in a ∗ b, hence

rad(a ∗ b) ≥ 1/2 |(mid(a) + α) ∗ (mid(b) + β) − (mid(a) + α) ∗ (mid(b) − β)|
           = |mid(a) ∗ β + α ∗ β|
           = |mid(a)| ∗ |β| + |α| ∗ |β| ≥ |α| ∗ |β|
           = rad(a) ∗ rad(b)
           = rad(a) ∗ rad(b) + ⟨a⟩ ∗ rad(b) + rad(a) ∗ ⟨b⟩.

• If 0 ∈ a and 0 ∉ b then ⟨a⟩ = 0. Moreover, (mid(a) + rad(a)) ∗ mid(b) ∈ a ∗ b
and (mid(a) − rad(a)) ∗ mid(b) ∈ a ∗ b, hence

rad(a ∗ b) ≥ 1/2 |(mid(a) + rad(a)) ∗ mid(b) − (mid(a) − rad(a)) ∗ mid(b)|
           = |rad(a) ∗ mid(b)| = rad(a) ∗ |mid(b)|.

As |mid(b)| = ⟨b⟩ + rad(b), then

rad(a ∗ b) ≥ rad(a) ∗ (⟨b⟩ + rad(b))
           = rad(a) ∗ ⟨b⟩ + rad(a) ∗ rad(b)
           = rad(a) ∗ rad(b) + ⟨a⟩ ∗ rad(b) + rad(a) ∗ ⟨b⟩.

• The case 0 ∉ a and 0 ∈ b is similar, as a and b play symmetric roles.

• If 0 ∉ a and 0 ∉ b, then |mid(a)| = ⟨a⟩ + rad(a) and |mid(b)| = ⟨b⟩ + rad(b).
Both (mid(a) + α) ∗ (mid(b) + β) and (mid(a) − α) ∗ (mid(b) − β) are included in
a ∗ b, hence, thanks to the sign conditions imposed on α and β,

rad(a ∗ b) ≥ 1/2 |(mid(a) + α) ∗ (mid(b) + β) − (mid(a) − α) ∗ (mid(b) − β)|
           = |mid(a) ∗ β + α ∗ mid(b)|
           = |mid(a)| ∗ |β| + |α| ∗ |mid(b)|
           = (⟨a⟩ + rad(a)) ∗ rad(b) + rad(a) ∗ (⟨b⟩ + rad(b))
           ≥ rad(a) ∗ rad(b) + ⟨a⟩ ∗ rad(b) + rad(a) ∗ ⟨b⟩.

There is no cancellation in interval arithmetic. For every non-degenerate interval
a ∈ IR, a − a strictly contains 0. This phenomenon is called sub-cancellation. By
definition, a − a = [a̲ − a̅, a̅ − a̲], hence the result is in fact an interval which is centered
in zero. For example, if a = [1, 2], then a − a = [−1, 1]. Another way of expressing the
sub-cancellation property is: (a − b) + b ⊇ a.
Similarly, division is not the inverse of the multiplication: for every non-degenerate
interval a ∈ IR, a/a strictly contains 1. For example, if a = [1, 2], then a/a = [1/2, 2] ∋ 1.

Proposition 1.3.5. For a ∈ IR, b ∈ IR such that b does not contain zero, and γ ∈ R with
γ ≠ 0, the following holds:

(1) |a/b| = |a|/⟨b⟩,

(2) rad(a/γ) = rad(a)/|γ|,

(3) rad(a/b) ≥ rad(a)/⟨b⟩,

(4) |mid(a/b)| ≤ |mid(a)|/⟨b⟩.

Proof. (1) is trivial. If γ > 0 then a/γ = [a̲/γ, a̅/γ], hence rad(a/γ) = 1/2 (a̅/γ − a̲/γ) =
1/2 (a̅ − a̲)/γ = rad(a)/|γ|. On the other hand, if γ < 0 then a/γ = [a̅/γ, a̲/γ], hence
rad(a/γ) = 1/2 (a̲/γ − a̅/γ) = 1/2 (a̅ − a̲)/(−γ) = rad(a)/|γ|. Therefore we have (2). Let
b̂ ∈ b be such that |b̂| = ⟨b⟩, then a/b̂ ⊆ a/b. Hence rad(a/b) ≥ rad(a/b̂) = rad(a)/|b̂| = rad(a)/⟨b⟩,
which implies (3). Following Proposition 1.3.4, |mid(a/b)| = |a/b| − rad(a/b). Therefore,
following (1) and (3), |mid(a/b)| = |a|/⟨b⟩ − rad(a/b) ≤ |a|/⟨b⟩ − rad(a)/⟨b⟩ ≤ (|a| −
rad(a))/⟨b⟩. Moreover, |a| − rad(a) = |mid(a)|, thus we have (4).

1.3.5 Variable dependency


Sub-cancellation and sub-distributivity properties are in fact the simplest cases of the de-
pendency problem in interval arithmetic. When an expression is evaluated in floating-point
arithmetic, each occurrence of the same variable takes exactly the same value. Differently,
when the expression is evaluated using interval arithmetic, each occurrence of a variable
represents a range of values. The computation assumes implicitly that the quantity in each
occurrence of a variable varies independently from the quantities in other occurrences of
the same variable. It means that we no longer have the dependency between occurrences
of a single variable when evaluating an interval expression whereas, in fact, they are cor-
related. That leads to an unwanted expansion of the computed result even if each interval
operation is evaluated exactly. The difference between the interval arithmetic evaluation
and the actual range of an expression is called overestimation.
Consider the following example of interval expression evaluation:

x² − 2x    where x = [1, 2].

Performing interval operations, we have x² = [1, 4], 2x = [2, 4], and finally x² − 2x =
[−3, 2].
Meanwhile, if we rewrite the above expression as

(x − 1)² − 1

then the computed result is: [0, 1]² − 1 = [−1, 0].


As this can be seen in the above example, mathematically equivalent forms give different
results when they are evaluated in interval arithmetic. That is because by rewriting the
expression, the dependency between variables is expressed differently. One rule to obtain
an optimal result is to rewrite the expression in such a way that each variable appears
only once in the expression. In the above example, the variable x appears only once in the
second expression, thus the result computed by the second expression is sharper than that
computed by the first expression, and it is even exactly the range of this expression.
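The two evaluations can be reproduced with a few lines of Python (an illustrative sketch with exact endpoint arithmetic; names mine):

    def imul(a, b):
        p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
        return (min(p), max(p))

    def isub(a, b):
        return (a[0] - b[1], a[1] - b[0])

    def isquare(a):
        lo, hi = a
        if lo >= 0:
            return (lo * lo, hi * hi)
        if hi <= 0:
            return (hi * hi, lo * lo)
        return (0.0, max(lo * lo, hi * hi))

    x = (1.0, 2.0)
    # x^2 - 2x: the two occurrences of x are treated as independent
    print(isub(isquare(x), imul((2.0, 2.0), x)))            # (-3.0, 2.0)
    # (x - 1)^2 - 1: x occurs only once, so the exact range is obtained
    print(isub(isquare(isub(x, (1.0, 1.0))), (1.0, 1.0)))   # (-1.0, 0.0)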
Nevertheless, not all expressions can be rewritten that way. Therefore the dependency
problem is a major source of overestimation. Another source of overestimation comes from
the so-called wrapping effect when computing in multiple dimensions.

1.3.6 Wrapping effect


The wrapping effect is the overestimation due to the enclosure of multidimensional sets by
the Cartesian product of intervals, i.e. geometrically, by boxes with sides parallel to the
axes.

Example 2. Let x = ([4, 6], [−1, 1]) ∈ IR² be a box whose coordinates are the two
intervals x₁ = [4, 6] and x₂ = [−1, 1]. This Cartesian product of intervals x is depicted by
the black box in Figure 1.1.
Let us now rotate x by the rotation matrix

    A = [ 4/5  −3/5 ]
        [ 3/5   4/5 ]

and then rotate it back by the inverse matrix

    A⁻¹ = [ 4/5   3/5 ]
          [ −3/5  4/5 ] .

[Figure 1.1: Wrapping effect.]

In theory, x should not change after these two rotations. However, this computation
performed using interval arithmetic gives a result which is a larger interval vector x″ includ-
ing x in its interior. After the first rotation, we get an interval vector x′, represented
by the gray box in Figure 1.1:

    x′ = [ 4/5  −3/5 ] [ [4, 6]  ]  =  [ [13/5, 27/5] ]
         [ 3/5   4/5 ] [ [−1, 1] ]     [ [8/5, 22/5]  ] .

This box x′ wraps the exact image of x (dotted lines) in a box whose sides are parallel
to the axes. After the second rotation, we get the final result, represented by the white
box:

    x″ = [ 4/5   3/5 ] [ [13/5, 27/5] ]  =  [ [76/25, 174/25] ]
         [ −3/5  4/5 ] [ [8/5, 22/5]  ]     [ [−49/25, 49/25] ] .

One can see in Figure 1.1 that x″ is much larger than x. The factor of overestimation
of x″ with respect to x is (49/25, 49/25), which means that each side of the original
box has been expanded by a factor of nearly 2.
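The two rotations can be reproduced with exact rational arithmetic in Python (an illustrative sketch; names mine), which shows that the growth comes from the enclosure by boxes and not from rounding errors:

    from fractions import Fraction as F

    def imul(a, b):
        p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
        return (min(p), max(p))

    def iadd(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def matvec(M, v):
        # interval matrix-vector product; the entries of M are thin intervals
        return [iadd(imul((M[i][0], M[i][0]), v[0]),
                     imul((M[i][1], M[i][1]), v[1])) for i in range(2)]

    A    = [[F(4, 5), F(-3, 5)], [F(3, 5), F(4, 5)]]
    Ainv = [[F(4, 5), F(3, 5)], [F(-3, 5), F(4, 5)]]
    x = [(F(4), F(6)), (F(-1), F(1))]

    x1 = matvec(A, x)        # [[13/5, 27/5], [8/5, 22/5]]
    x2 = matvec(Ainv, x1)    # [[76/25, 174/25], [-49/25, 49/25]]
    print(x1)
    print(x2)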

1.3.7 Implementation
Interval arithmetic is implemented on computers using floating-point arithmetic. The most
common representation of intervals uses their endpoints. Each endpoint of an interval is
encoded by one floating-point number. Therefore, we have a new set of intervals whose
endpoints are discrete
IF = {[a̲, a̅] ∈ IR : a̲, a̅ ∈ F}.
Using floating-point arithmetic, we have to handle the problem of rounding errors.
In the presence of rounding errors, the exactness of endpoint computations is no longer
guaranteed. Let us consider, for example, the addition of two intervals. The result’s upper
bound is, by definition, the sum of the upper bounds of the two summands, which is a
real number and in most cases not representable by a floating-point number. Hence, to
guarantee the isotonicity property, we adapt the definition to operations on IF.
Definition 1.3.6. For x ∈ IF, y ∈ IF, the result of an operation x ◦ y, ◦ ∈ {+, −, ∗, /}, is
the smallest interval z ∈ IF containing

{x ◦ y : x ∈ x, y ∈ y}.

Directed rounding modes


To satisfy the principle of inclusion, the upper bound of the computed result must be
greater than or equal to the upper bound of the exact interval result. Similarly, the lower
bound of the computed result must be smaller than or equal to the lower bound of the
exact interval result. Therefore, one straightforward and effective way to handle rounding
errors when implementing interval arithmetic is to use directed rounding modes, namely
the upward rounding mode for the upper bound and the downward rounding mode for the
lower bound.
In upward rounding mode, the result of an arithmetic operation is the smallest floating-
point number which is greater than or equal to the exact result. In downward rounding
mode, the result of an arithmetic operation is the greatest floating-point number which is
smaller than or equal to the exact result. In this document, the following rounding modes
are used:

∆ upward rounding mode,

∇ downward rounding mode.

For arithmetic operations, directed rounding modes allow one to obtain a rigorous
and tight enclosure of the exact result interval.
Let a = [a̲, a̅] ∈ IF, b = [b̲, b̅] ∈ IF. Using directed rounding modes, basic interval
arithmetic operations can easily be implemented.
Addition

a ⊕ b = [∇(a̲ + b̲), ∆(a̅ + b̅)].

The isotonicity of this operation is simple to check. For all a ∈ a, b ∈ b, it is obvious that

∇(a̲ + b̲) ≤ a̲ + b̲ ≤ a + b ≤ a̅ + b̅ ≤ ∆(a̅ + b̅).

Hence, a + b ∈ a + b ⊂ a ⊕ b. In much the same way, we can prove the isotonicity of
the following operations.
Subtraction

a ⊖ b = [∇(a̲ − b̅), ∆(a̅ − b̲)].

Multiplication

a ⊗ b = [min(∇(a̲ ∗ b̲), ∇(a̲ ∗ b̅), ∇(a̅ ∗ b̲), ∇(a̅ ∗ b̅)),
         max(∆(a̲ ∗ b̲), ∆(a̲ ∗ b̅), ∆(a̅ ∗ b̲), ∆(a̅ ∗ b̅))].

In exact arithmetic, an interval multiplication requires four punctual products. How-


ever, when floating-point arithmetic is used to implement interval arithmetic, each sub-
product must be computed twice, once in downward rounding mode and once in upward
rounding mode. Hence, in total, an interval multiplication requires eight floating-point
products. As with real interval arithmetic, one can inspect the sign of input intervals to
reduce the number of sub-products. This solution suffers from low performance due to
branching and testing the sign of floating-point numbers.
For the rest of the document, we will refer to this implementation as directed interval
arithmetic.

Successor and predecessor


Directed rounding modes might not be supported on some architectures. For example,
on the Synergistic Processor Elements of the IBM Cell Broadband Engine Architecture
[KDH+ 05], only the rounding mode toward zero, also called truncation rounding mode, is
supported for IEEE single precision. In addition, some programming languages, Java for
example, do not support functions for changing rounding modes. In these cases, interval
arithmetic can be implemented using predecessor and successor functions.
The predecessor and successor functions compute respectively the previous and
the next floating-point number with respect to a given floating-point number. Hence, they
are used to simulate downward and upward rounding modes. Work on predecessor and
successor functions, or interval arithmetic in rounding to nearest can be found in [RZBM09].
One advantage of this approach over directed rounding mode interval arithmetic is that
it does not require any rounding mode changes. Hence, it triggers no context switching
during computation and can be better pipelined. Nevertheless, this approach suffers from
a significant overhead of execution time introduced by predecessor and successor functions.
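For instance, in Python (version 3.9 or later), math.nextafter can play the role of the successor and predecessor functions. The sketch below is mine and deliberately simplified: it steps outward by one floating-point number after a round-to-nearest computation, which yields a valid enclosure for a single operation but is coarser than the techniques of [RZBM09]:

    import math

    def pred(x):
        # predecessor: the floating-point number just below x
        return math.nextafter(x, -math.inf)

    def succ(x):
        # successor: the floating-point number just above x
        return math.nextafter(x, math.inf)

    def iadd(a, b):
        # outward-rounded interval addition, no rounding mode change needed
        return (pred(a[0] + b[0]), succ(a[1] + b[1]))

    def imul(a, b):
        p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
        return (pred(min(p)), succ(max(p)))

    x = (0.1, 0.1)        # a thin floating-point interval
    print(iadd(x, x))     # a rigorous enclosure of the real sum of the endpoints
    print(imul(x, x))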
In this document, we assume that directed rounding mode interval arithmetic is used.
We will show that the problem of changing rounding mode can be overcome for operations
on matrices and vectors.

Implementation issues
In IEEE 754-1985 processors, i.e. on most processors including IA-32 [Int07], changing the
rounding mode of floating-point operations is done by changing the value of some global
register. This requires flushing the floating-point pipeline: all the floating-point operations
initiated beforehand must be terminated before performing the new ones with the new
rounding mode. Therefore changing rounding mode frequently as required to implement
interval arithmetic will lead to a dramatic slowdown in performance: in practice, the
computation is slowed down compared to its floating-point counterpart, by a factor between
20 and 50 [Rum99a].
This problem is eliminated in more recent instruction sets which allow changing the
rounding mode per instruction without penalty [MBdD+10]. Unfortunately, this feature
can be accessed only from low-level programming languages, by manipulating directly the
instruction word, not from current high-level languages, which are designed to be compliant
with the IEEE 754-1985 [IEE85] standard. The new IEEE 754-2008 [IEE08] standard for
floating-point arithmetic corrects this issue. However, it may take some time before it is
fully supported.
Otherwise, the inclusion property of interval arithmetic usually requires obtaining the
maximal and minimal values from intermediate results, e.g. interval multiplication or
square. These maximum and minimum functions require themselves at least a comparison
followed by a selection of the right value. They are usually coded by a test and branch or on
some particular processors, such as the CELL processor, by comparison and bit selection.
These max and min functions introduce a significant overhead of execution time.
With the obstacles mentioned above, a lot of work is needed in order to obtain an
efficient interval arithmetic implementation, which is well pipelined and parallelized. It may
even require redesigning and reimplementing everything from scratch. That is not desirable,
as there exists a number of efficient implementations for the floating-point versions of
the corresponding algorithms which have been well optimized and parallelized. A more
realistic solution is to exploit these existing optimized implementations to obtain an efficient


implementation for interval arithmetic. This idea is the foundation of my Ph.D. work, targeting
an efficient implementation of verified linear algebra using interval arithmetic.

1.3.8 Implementation environment


During my Ph.D. preparation, I used the IntLab library, which is a library in MatLab
developed by professor S.M. Rump [Rum99b] for interval arithmetic, to implement our
algorithms. The main reason is that the IntLab library is based on BLAS and uses di-
rected rounding modes to implement interval arithmetic. The BLAS (Basic Linear Algebra
Subprograms) [LHKK79] are routines that provide standard building blocks for perform-
ing basic vector and matrix operations. To my knowledge, IntLab is the most efficient
implementation of interval arithmetic for linear algebra. Its extensive use of the BLAS
routines is the main key to achieve high speed for interval computations. Another main
advantage of using the BLAS routines is that a code using the BLAS routines can run fast
on a variety of computers. Since the BLAS routines are widely available, they are supported
on a vast majority of processors. Manufacturers usually provide very fast, well-optimized
implementations of the BLAS routines for their specific computers.
Another reason to choose IntLab as the implementation environment is its ease of use.
Since IntLab is a library in MatLab, which is an interactive programming environment,
algorithms can be implemented in MatLab in such a way that is close to pseudo-code.
Especially matrix operations are expressed in a very simple and straightforward way.
Nevertheless, the algorithms presented in this document are not restricted to be imple-
mented using the IntLab library. The implementations presented in this document, which
use the IntLab library, are only reference implementations to demonstrate the correctness
as well as the performance of these algorithms. Otherwise, the presented algorithms can
be implemented using any optimized BLAS library, in any programming language which
supports functions for changing rounding modes.

1.4 Conclusion
Interval arithmetic paves the way for a new powerful tool to obtain verified results in scien-
tific computing. Instead of giving an approximation, algorithms using interval arithmetic
provide an enclosure of the results. It means that the exact results are guaranteed to
be included in the computed results. On the contrary, a value not included in the com-
puted results can never be the exact result. Therefore using interval arithmetic provides
some level of certification. The accuracy of computed results can also be easily checked by
assessing the width of computed intervals.
Nevertheless, there are obstacles that prevent interval arithmetic from being broadly
used. Firstly, interval computations usually provide results of large width. Some impor-
tant sources of overestimation can be named here: variable dependency and wrapping
effects. A naive application of interval arithmetic to existing algorithms for floating-point
arithmetic does not, in the general case, provide sharp enclosures of the results. The width of
intervals explodes very fast if care is not taken when implementing an algorithm in interval
arithmetic.

Secondly, to ensure the isotonicity property of interval computations, redundant oper-


ations must be performed to obtain the convex hull of the exact results. Therefore the
factor of interval arithmetic with respect to floating-point arithmetic in terms of execution
time in theory is 4. It is even much worse in implementation when rounding errors and
parallelism are taken into account.
Although some general techniques are available, such as making use of original punctual
data or performing as many real operations as possible without having to change the rounding
mode, the way to obtain an efficient algorithm depends on every specific problem. More-
over, for many problems there is no perfect solution to these two obstacles, i.e. there
is no algorithm which provides very sharp enclosures and at the same time runs very fast.
Therefore a tradeoff between accuracy and performance must usually be made.
In the context of this thesis, we deal with two fundamental problems of linear algebra,
namely the multiplication of interval matrices (Chapter 2) and the verification of floating-
point solutions of linear systems of equations (Chapters 3, 4 and 5). Nonetheless, techniques
introduced in this document are applicable to other problems as well since they are based
on interval properties such as symmetry and sub-distributivity.
Chapter 2

Interval matrix multiplication

This chapter presents two new algorithms for the multiplication of interval
matrices, which are based on endpoints and midpoint-radius representations of
interval respectively. These algorithms are based on a decomposition of inter-
vals into two parts, one which is centered in zero, and the other which captures
the largest part not containing zero in its interior. They are implemented using
9 and 7 floating-point matrix products respectively. They are a bit slower than
the most efficient algorithm, to my knowledge, proposed by S. M. Rump, which
costs only 4 floating-point matrix products. Nonetheless, the overestimation
factor of the results computed by the two new algorithms is always bounded
by 1.18, compared to 1.5 for Rump’s algorithm. Both algorithms are imple-
mented using IntLab, which is a BLAS-based library for interval arithmetic, to
demonstrate their performance.

2.1 Introduction
In this chapter we study the case of matrix multiplication which involves O(n3 ) operations
on O(n2 ) data. Interval matrix product implemented by the natural algorithm provides
good results in terms of accuracy but suffers from low performance in terms of execution
time. An idea is to exploit existing libraries which are well optimized for floating-point
matrix operations such as BLAS, ATLAS, etc. Rump [Rum99a] proposed an efficient algo-
rithm to calculate the interval matrix multiplication with a cost of 4 floating-point matrix
products. The factor of overestimation in the worst case of this algorithm is 1.5. In this
chapter, we propose a new algorithm which offers a trade-off between performance and
accuracy: this algorithm costs 7 floating-point matrix products with a factor of overesti-
mation in the worst case of 1.18.

2.2 State of the art


Let A, B ∈ IR^{n×n}. The product of A and B is also an interval matrix, whose elements
are defined as

Ci,j = Σ_{k=1}^{n} Ai,k ∗ Bk,j .        (2.1)


Using floating-point arithmetic to implement interval arithmetic, we have to deal with
rounding errors. Let A, B ∈ IF^{n×n}; the product of A and B cannot be represented
exactly by an interval matrix in IF^{n×n}. In fact, to compute the product of
A and B we compute an interval matrix C ∈ IF^{n×n} which includes the exact product:

Σ_{k=1}^{n} Ai,k ∗ Bk,j ⊆ Ci,j .

It should be noted here that a Strassen-like algorithm [vzGG03, chap. 12], [Hig02,
chap. 23], [Str69, Hig90], used to obtain a fast implementation of matrix multiplication
costing O(n^ω) floating-point operations with ω < 3, cannot be applied to interval arith-
metic [CK04], as such algorithms are based on identities such as (a + b) ∗ x + b ∗ (−x + y) =
a ∗ x + b ∗ y for a, b, x, y ∈ R, which do not hold in interval arithmetic. For a, b, x, y ∈ IR,
we only have (a + b) ∗ x + b ∗ (−x + y) ⊇ a ∗ x + b ∗ y: the two expressions are
mathematically equivalent, but in the second one each variable appears only once, hence
the second expression provides the exact range and the inclusion follows. For
example, if a = [0, 1], b = [−1, 1], x = [0, 1], y = [−1, 1] then
a ∗ x + b ∗ y = [−1, 2], meanwhile (a + b) ∗ x + b ∗ (−x + y) = [−3, 4] is much larger than
a ∗ x + b ∗ y. Therefore a recursive application of Strassen-like formulas leads to a very fast
increase of the width of the result components.
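The example can be checked directly; the short Python sketch below (illustrative, exact endpoint arithmetic) evaluates both expressions:

    def imul(a, b):
        p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
        return (min(p), max(p))

    def iadd(a, b):
        return (a[0] + b[0], a[1] + b[1])

    def ineg(a):
        return (-a[1], -a[0])

    a, b, x, y = (0.0, 1.0), (-1.0, 1.0), (0.0, 1.0), (-1.0, 1.0)

    direct   = iadd(imul(a, x), imul(b, y))                           # a*x + b*y
    strassen = iadd(imul(iadd(a, b), x), imul(b, iadd(ineg(x), y)))   # (a+b)*x + b*(-x+y)
    print(direct)     # (-1.0, 2.0)
    print(strassen)   # (-3.0, 4.0): the rearranged expression is much wider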
Using the natural algorithm (2.1), the computation of each result element requires n interval
products and n interval additions. Moreover, given that interval arithmetic is implemented
on computer using floating-point arithmetic and directed rounding mode as explained
in Section 1.3.7, each interval addition requires two floating-point additions, meanwhile
each interval product requires eight floating-point products. Hence, in total formula (2.1)
requires 8n3 floating-point products and 2n3 floating-point additions.
The problem is that, using directed interval arithmetic for each interval operation, we
have to switch between upward and downward rounding modes to compute the upper and
lower bounds respectively. Moreover, the interval product uses max and min functions
which are costly. Otherwise, we can test the sign of input intervals to reduce the number
of floating-point products. But testing the sign is also costly. These problems make the
natural algorithm difficult to pipeline and to parallelize. For instance, Rump's
approach is faster than the BIAS approach [Knu06] which is based on the natural algorithm
by a factor 20 to 100 [Rum99a].
To overcome this problem, Rump proposes in [Rum99a] an algorithm to compute ef-
ficiently an interval matrix product. His idea is to (i) perform as many floating-point
operations as possible without having to change the rounding mode, and (ii) sacrifice a bit
of the result accuracy in order to improve the performance significantly.
In reality, the fastest algorithm to compute the multiplication of interval matrices is
provided by Ogita et al [OO05]. This algorithm is also based on Rump’s algorithm and
relaxes the radius computation to gain speed: it is 2 times faster in comparison with Rump’s
algorithm. Nevertheless, the accuracy of the results computed by Ogita’s algorithm is
pessimistic with an overestimation factor up to 5. Therefore we consider here only Rump’s
algorithm. Note that the matrices in question here are non-degenerate. Algorithms to
enclose a product of point matrices can be found in [Rum99a, OOO10].
Rump's algorithm is based on the midpoint-radius representation of intervals: each interval
is represented by its midpoint and its radius. Let a = ⟨â, α⟩ and b = ⟨b̂, β⟩ be two intervals
represented in midpoint-radius format, with â, b̂ the midpoints of a, b respectively, and
α, β the radii of a, b respectively. The basic operations are defined as follows:

    a ⊕ b = ⟨â + b̂, α + β⟩,
    a ⊙ b = ⟨â ∗ b̂, |â| ∗ β + |b̂| ∗ α + α ∗ β⟩.        (2.2)

The isotonicity of (2.2) is clear. For all a ∈ a and b ∈ b, we have

|a − â| ≤ α  and  |b − b̂| ≤ β.

Hence

a ∗ b − â ∗ b̂ = â ∗ (b − b̂) + (a − â) ∗ b̂ + (a − â) ∗ (b − b̂)
⇒ |a ∗ b − â ∗ b̂| ≤ |â ∗ (b − b̂)| + |(a − â) ∗ b̂| + |(a − â) ∗ (b − b̂)|
                 ≤ |â| ∗ β + |b̂| ∗ α + α ∗ β
⇒ a ∗ b ∈ ⟨â ∗ b̂, |â| ∗ β + |b̂| ∗ α + α ∗ β⟩ = a ⊙ b.

Regarding the exactness of operations, it can easily be verified that a ⊕ b = a + b. For the
case of multiplication, except when one of the two multipliers is punctual
or is centered in zero, (2.2) always provides overestimated results. The overestimation
factor of this formula is given in Proposition 2.2.2 [Rum99a].

Definition 2.2.1. A real interval a = ⟨â, α⟩ not containing 0 is said to be of relative
precision e, 0 ≤ e ∈ R, if

α ≤ e ∗ |â|.

A real interval containing 0 is said to be of relative precision 1.
D E
Proposition 2.2.2. Let a = ⟨â, α⟩ ∈ IR, b = ⟨b̂, β⟩ ∈ IR, and denote the overestimation
of a ⊙ b as defined by (2.2) with respect to a ∗ b by

    ρ = rad(a ⊙ b) / rad(a ∗ b),

where 0/0 is interpreted as 1. Then the following holds.

1. It always holds that ρ ≤ 1.5.

2. If one of the intervals a and b is of relative precision e, then

    ρ ≤ 1 + e / (1 + e).

3. If interval a is of relative precision e and b is of relative precision f, then

    ρ ≤ 1 + e·f / (e + f).

All estimations are sharp.



Matrix operations can be defined in the same way as scalar operations. Suppose that
each interval matrix is represented by two punctual matrices, one representing the midpoints
and the other representing the radii. Let A = ⟨A, Λ⟩ ∈ IR^{n×n} and B = ⟨B, Σ⟩ ∈ IR^{n×n}; then

    A ⊕ B = ⟨A + B, Λ + Σ⟩,
    A ⊙ B = ⟨A ∗ B, |A| ∗ Σ + Λ ∗ |B| + Λ ∗ Σ⟩.        (2.3)
The main advantage of Rump’s algorithm is that using formula (2.3), an interval matrix
product can be computed by performing merely point matrix products. It means that the
interval matrix product as defined by formula (2.3) can be implemented using existing
optimized libraries for floating-point matrix operations, and can benefit from pipelined or
parallelized implementations.
Nonetheless, care must be taken when implementing (2.3) in finite precision. Indeed, in
the presence of rounding errors, one cannot compute exactly the product of two floating-
point matrices. Therefore, firstly, we have to enclose the midpoint of the result by comput-
ing it following (2.3) twice, in downward and upward rounding modes. Secondly, to guaran-
tee the property of isotonicity, we must compute an upper bound of the radius. According to
(2.3), the expression to compute the radius involves only products of non-negative matrices.
Thus an upper bound of the radius can be obtained by simply evaluating it in upward rounding
mode. Additionally, (2.3) can be rewritten as A ⊙ B = ⟨A ∗ B, (|A| + Λ) ∗ Σ + Λ ∗ |B|⟩.
Therefore, in total, the implementation of Rump's algorithm in finite precision to compute
an interval matrix product requires only 4 floating-point matrix products, together with a
few rounding mode changes.
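In finite precision this amounts to the following sketch (the function name rump_imm is ours; mA, rA, mB, rB denote the midpoint and radius matrices of A and B):

    % Rump's interval matrix product (2.3) with 4 floating-point products.
    function [Cinf, Csup] = rump_imm(mA, rA, mB, rB)
      setround(-1);
      Cdown = mA * mB;                             % lower bound of the midpoint product
      setround(1);
      Cup = mA * mB;                               % upper bound of the midpoint product
      rC  = (abs(mA) + rA) * rB + rA * abs(mB);    % upper bound of the radius
      Csup = Cup + rC;
      setround(-1);
      Cinf = Cdown - rC;
      setround(0);
    end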
In regard to accuracy, the addition A ⊕ B is always exactly equal to A + B in infinite
precision. For the case of multiplication:

    A ∗ B = { Σ_{k=1}^{n} Ai,k ∗ Bk,j , i = 1:n, j = 1:n }
          ⊆ { Σ_{k=1}^{n} Ai,k ⊙ Bk,j , i = 1:n, j = 1:n }
          ⊆ { Σ_{k=1}^{n} ⟨Ai,k ∗ Bk,j , |Ai,k| ∗ Σk,j + Λi,k ∗ |Bk,j| + Λi,k ∗ Σk,j⟩ , i = 1:n, j = 1:n }
          ⊆ { ⟨ Σ_{k=1}^{n} Ai,k ∗ Bk,j , Σ_{k=1}^{n} ( |Ai,k| ∗ Σk,j + Λi,k ∗ |Bk,j| + Λi,k ∗ Σk,j ) ⟩ , i = 1:n, j = 1:n }
          ⊆ ⟨A ∗ B, |A| ∗ Σ + Λ ∗ |B| + Λ ∗ Σ⟩
          = A ⊙ B.
The overestimation factor of the interval matrix product defined by (2.3) is also given
in [Rum99a]:

Proposition 2.2.3. For A an interval matrix with all components of relative precision e,
and for B an interval matrix with all components of relative precision f, the overestimation
of each component of the result of the multiplication is bounded by

    1 + ef/(e + f).

In any case, the overestimation is uniformly bounded by 1.5.

The overestimation factor of 1.5 can be achieved, as in the following example.

Example 3. Let us compute the following interval matrix product by both the natural
algorithm and Rump's algorithm:

    ( [0,4]  [0,2] )   ( [0,2]  [0,2] )   ( [0,12]  [0,12] )
    ( [0,2]  [0,4] ) ∗ ( [0,2]  [0,2] ) = ( [0,12]  [0,12] ) ,

    ( [0,4]  [0,2] )   ( [0,2]  [0,2] )   ( ⟨2,2⟩  ⟨1,1⟩ )   ( ⟨1,1⟩  ⟨1,1⟩ )
    ( [0,2]  [0,4] ) ⊙ ( [0,2]  [0,2] ) = ( ⟨1,1⟩  ⟨2,2⟩ ) ⊙ ( ⟨1,1⟩  ⟨1,1⟩ )

                                        = ( ⟨3,9⟩  ⟨3,9⟩ )
                                          ( ⟨3,9⟩  ⟨3,9⟩ )

                                        = ( [−6,12]  [−6,12] )
                                          ( [−6,12]  [−6,12] ) .

The overestimation factor of each component is ρ = 18/12 = 1.5.

In practice, the factor between results computed by (2.3) and those computed by (2.1)
is, interestingly, much smaller than this bound, and in many cases close to 1. In some cases,
when the majority of input elements are narrow, i.e. their radius is small with respect to their
midpoint, the matrix product computed by (2.3) is even sharper than that computed by
(2.1).
On the other hand, if all components of A and B are of large relative precision e ≈ 1,
then the overestimation of each component of the result computed by (2.3) is close to 1.5.
This might not be acceptable in some applications, for example in global optimization. In
the following sections we introduce new algorithms which reduce this overestimation factor
without significantly deteriorating the performance.

2.3 Algorithm 1 using endpoints representation


In this section, we assume that each interval matrix is represented by two real matrices,
one representing the lower bound and the other representing the upper bound. First, we
study some special cases.

2.3.1 Special cases


If some information about the input matrices is known in advance, then we can efficiently
compute the multiplication using the natural algorithm (2.1) without any overestimation.

Centered-in-zero multiplier

Let a = [a̲, ā] ∈ IR be a centered-in-zero interval, meaning that a̲ = −ā and ā > 0. Let
b = [b̲, b̄] be an interval; then the product of a and b can be computed by

    a ∗ b = [ a̲ ∗ max(|b̲|, |b̄|), ā ∗ max(|b̲|, |b̄|) ]
          = [ −ā ∗ mag(b), ā ∗ mag(b) ].

Note that this formula is still valid in the presence of rounding errors, with ā ∗ mag(b)
being computed in upward rounding mode. Indeed, suppose that c = △(ā ∗ mag(b)), where △
denotes evaluation with upward rounding; then c ≥ ā ∗ mag(b), which implies
−c ≤ −ā ∗ mag(b) ≤ ā ∗ mag(b) ≤ c. Hence a ∗ b ⊆ [−c, c].
Due to the symmetry, the interval multiplication can be computed by one punctual
multiplication for the upper bound, followed by one negation to compute the lower bound.
This also holds for a matrix product.
Let A = [A̲, Ā] ∈ IR^{n×n} be a centered-in-zero interval matrix, i.e. A̲ = −Ā and Ā > 0.
Let B = [B̲, B̄] be an interval matrix; then the product of A and B can be computed by

    A ∗ B = [ −Ā ∗ mag(B), Ā ∗ mag(B) ].        (2.4)

The implementation in floating-point arithmetic simply sets the rounding mode to
upward before computing Ā ∗ mag(B).
The lower bound is the opposite of the upper bound, hence the product A ∗ B requires
only one punctual matrix product.
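A direct sketch of (2.4) follows (the function name czero_imm and the argument convention are ours):

    % Product of a centered-in-zero interval matrix [-Au, Au] (Au >= 0)
    % with an interval matrix [Bl, Bu], following (2.4).
    function [Cl, Cu] = czero_imm(Au, Bl, Bu)
      magB = max(abs(Bl), abs(Bu));   % mag(B), componentwise
      setround(1);                    % upward rounding
      Cu = Au * magB;                 % the only punctual matrix product
      Cl = -Cu;                       % lower bound by symmetry
      setround(0);
    end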

Non-negative multiplier

Let a = [a̲, ā] ∈ IR be an interval with ā ≥ a̲ ≥ 0, and let b = [b̲, b̄] ∈ IR be an interval.
Then

    max(a̲ ∗ b̲, a̲ ∗ b̄) = a̲ ∗ b̄,    max(ā ∗ b̲, ā ∗ b̄) = ā ∗ b̄,
    min(a̲ ∗ b̲, a̲ ∗ b̄) = a̲ ∗ b̲,    min(ā ∗ b̲, ā ∗ b̄) = ā ∗ b̲.

Hence

    sup(a ∗ b) = max(a̲ ∗ b̄, ā ∗ b̄),
    inf(a ∗ b) = min(a̲ ∗ b̲, ā ∗ b̲).

Denote by x⁺ = x if x ≥ 0 and 0 if x < 0, and by x⁻ = 0 if x ≥ 0 and x if x < 0.
It is easy to verify that

    max(a̲ ∗ b̄, ā ∗ b̄) = ā ∗ b̄⁺ + a̲ ∗ b̄⁻,
    min(a̲ ∗ b̲, ā ∗ b̲) = a̲ ∗ b̲⁺ + ā ∗ b̲⁻
    ⇒ a ∗ b = [ a̲ ∗ b̲⁺ + ā ∗ b̲⁻ , ā ∗ b̄⁺ + a̲ ∗ b̄⁻ ].

By induction on the dimension of the matrices, we can generalize this formula to matrix
multiplication.

Proposition 2.3.1. For A = [A̲, Ā] ∈ IR^{n×n} with A̲i,j ≥ 0 for all 1 ≤ i ≤ n, 1 ≤ j ≤ n,
and B = [B̲, B̄] ∈ IR^{n×n}, the multiplication is computed by

    A ∗ B = [ A̲ ∗ B̲⁺ + Ā ∗ B̲⁻ , Ā ∗ B̄⁺ + A̲ ∗ B̄⁻ ].        (2.5)

Hence, in this case, an interval matrix product requires four punctual matrix products.
Similarly, when A is non-positive, the multiplication is also computed by four punctual
matrix products:

    A ∗ B = [ A̲ ∗ B̄⁺ + Ā ∗ B̄⁻ , Ā ∗ B̲⁺ + A̲ ∗ B̲⁻ ].        (2.6)

Not containing zero multiplier


More generally, if one multiplier does not contain zero then we can combine the two previous
subcases.
For a ∈ IR, denote by

    a⁺ = {a ∈ a | a ≥ 0},
    a⁻ = {a ∈ a | a ≤ 0},

where the empty interval is replaced by 0.

Proposition 2.3.2. If a does not contain zero then for any b ∈ IR,

    a ∗ b = a⁺ ∗ b + a⁻ ∗ b.

Proof. a does not contain zero, hence

• if a ≥ 0 then a⁺ = a and a⁻ = 0, which gives a⁺ ∗ b + a⁻ ∗ b = a ∗ b + 0 ∗ b = a ∗ b;

• if a ≤ 0 then a⁺ = 0 and a⁻ = a, which gives a⁺ ∗ b + a⁻ ∗ b = 0 ∗ b + a ∗ b = a ∗ b.

Proposition 2.3.2 means that the computation of the interval multiplication in this case
requires eight punctual multiplications.
We can generalize Proposition 2.3.2 to matrix multiplication by induction. A matrix
does not contain zero if none of its elements contains zero.

Proposition 2.3.3. For A ∈ IR^{n×n} not containing zero, and B ∈ IR^{n×n}, the multiplication
A ∗ B can be computed by eight punctual matrix multiplications:

    A ∗ B = A⁺ ∗ B + A⁻ ∗ B
          = [ A̲⁺ ∗ B̲⁺ + Ā⁺ ∗ B̲⁻ , Ā⁺ ∗ B̄⁺ + A̲⁺ ∗ B̄⁻ ]
            + [ A̲⁻ ∗ B̄⁺ + Ā⁻ ∗ B̄⁻ , Ā⁻ ∗ B̲⁺ + A̲⁻ ∗ B̲⁻ ]
          = [ A̲⁺ ∗ B̲⁺ + Ā⁺ ∗ B̲⁻ + A̲⁻ ∗ B̄⁺ + Ā⁻ ∗ B̄⁻ ,
              Ā⁺ ∗ B̄⁺ + A̲⁺ ∗ B̄⁻ + Ā⁻ ∗ B̲⁺ + A̲⁻ ∗ B̲⁻ ].

Note that all the formulae for these special cases provide results without overestimation.

2.3.2 General case


When no information about the input matrices is known in advance, then to compute
exactly the matrix multiplication, it is mandatory to use the natural algorithm (2.1), which
is not efficient. Rump’s algorithm can be seen as decomposing input matrices into two parts,
one punctual part and one centered-in-zero part, then computing sub-products which can
be computed exactly and efficiently, and finally accumulating all the sub-products. Even
though each sub-product can be computed exactly, the final computed result is an enclosure
of the exact result because of the sub-distributivity property of interval arithmetic.

Let A = ⟨A, Λ⟩ ∈ IR^{n×n} and B = ⟨B, Σ⟩ ∈ IR^{n×n}. We have

    A = A + ⟨0, Λ⟩,
    B = B + ⟨0, Σ⟩
    ⇒ A ∗ B = (A + ⟨0, Λ⟩) ∗ (B + ⟨0, Σ⟩)
            ⊆ A ∗ B + A ∗ ⟨0, Σ⟩ + ⟨0, Λ⟩ ∗ B + ⟨0, Λ⟩ ∗ ⟨0, Σ⟩
            ⊆ ⟨A ∗ B, 0⟩ + ⟨0, |A| ∗ Σ⟩ + ⟨0, Λ ∗ |B|⟩ + ⟨0, Λ ∗ Σ⟩
            ⊆ ⟨A ∗ B, |A| ∗ Σ + Λ ∗ |B| + Λ ∗ Σ⟩.

With this decomposition, the overestimation in the worst case is 1.5. Our idea is to
find another decomposition such that:
• sub-products can be efficiently computed, and

• the overestimation is small.


As we can see in Section 2.3.1, an interval product can be efficiently computed when
one of the two multipliers is centered in zero or does not contain zero. This leads to our
following proposition of decomposition.
Proposition 2.3.4. Let A be an interval matrix. If A is decomposed into two interval
matrices A0 and A∗ which satisfy, for every component (i, j):

    if 0 ≤ A̲i,j ≤ Āi,j           then A0i,j = 0,                 A∗i,j = Ai,j ,
    if A̲i,j ≤ Āi,j ≤ 0           then A0i,j = 0,                 A∗i,j = Ai,j ,
    if A̲i,j < 0 < |A̲i,j| ≤ Āi,j  then A0i,j = [A̲i,j , −A̲i,j],    A∗i,j = [0, Āi,j + A̲i,j] ,
    if A̲i,j < 0 < Āi,j < |A̲i,j|  then A0i,j = [−Āi,j , Āi,j],     A∗i,j = [A̲i,j + Āi,j , 0] ,
                                                                                      (2.7)
then

(1) A0 is centered in zero,

(2) A∗ does not contain zero,

(3) A0 + A∗ = A, and

(4) mag(A0) + mag(A∗) = mag(A).

Proof. Easily deduced from the formulae of the decomposition.
Proposition 2.3.5. Let A and B be two interval matrices and (A0, A∗) a decomposition of
A given by Proposition 2.3.4. Let C be the interval matrix computed by

    C = A0 ∗ B + A∗ ∗ B,        (2.8)

then A ∗ B is contained in C. Denote A ~ B = C.


Proof. Following Proposition 2.3.4: A = A0 +A∗ . Interval multiplication is sub-distributive,
hence
A ∗ B = (A0 + A∗ ) ∗ B ⊆ A0 ∗ B + A∗ ∗ B = C.

Moreover, A0 is centered in zero and A∗ does not contain zero; according to Section
2.3.1, A0 ∗ B and A∗ ∗ B can be computed exactly using respectively one and eight punctual
matrix products. Hence, it requires in total nine punctual matrix products to compute an
enclosure of A ∗ B by Proposition 2.3.5. Nevertheless, because of the sub-distributivity,
the result computed by this proposition may be an overestimation of the exact result.
The following theorem provides the overestimation factor of this algorithm in the case of
scalar inputs.
Theorem 2.3.6. For a ∈ IR, b ∈ IR, with a decomposed, following Proposition
2.3.4, into two parts a = a0 + a∗, the overestimation of a ~ b = a0 ∗ b + a∗ ∗ b with respect
to a ∗ b is defined as

    ρ1 = diam(a ~ b) / diam(a ∗ b).

The following holds:

(1) ρ1 is always bounded by 4 − 2√2;

(2) if either a or b is centered in zero then ρ1 = 1;

(3) if either a or b does not contain zero then ρ1 = 1.


Proof. We will first prove property (2).
If a is centered in zero then a∗ = 0, hence a ~ b = a0 ∗ b = a ∗ b. Thus the result is exact, i.e.
ρ1 = 1.
If b is centered in zero then

    a ∗ b = [ −mag(a) ∗ b̄, mag(a) ∗ b̄ ],
    a0 ∗ b + a∗ ∗ b = [ −mag(a0) ∗ b̄, mag(a0) ∗ b̄ ] + [ −mag(a∗) ∗ b̄, mag(a∗) ∗ b̄ ]
                    = [ −(mag(a0) + mag(a∗)) ∗ b̄, (mag(a0) + mag(a∗)) ∗ b̄ ].

Moreover, following Proposition 2.3.4,

    mag(a0) + mag(a∗) = mag(a),

which implies that a ~ b = a ∗ b, i.e. ρ1 = 1. Hence, property (2) is true.

Properties (1) and (3) can be proven by inspecting subcases related to the values of a
and b. There are two cases for the value of a: either a does not contain zero, or a contains zero.
If a does not contain zero, then a0 = 0 ⇒ a ~ b = a∗ ∗ b = a ∗ b ⇒ ρ1 = 1.
If a contains zero, this means that a̲ < 0 < ā. The case ā < |a̲| is similar to the case
ā > |a̲|, hence without loss of generality suppose that ā > |a̲|. Following Proposition 2.3.4,
a0 = [a̲, −a̲] and a∗ = [0, ā + a̲]. Hence, following Proposition 2.3.5,

    a ~ b = a0 ∗ b + a∗ ∗ b
          = [a̲, −a̲] ∗ b + [0, ā + a̲] ∗ b
          = [ a̲ ∗ mag(b), −a̲ ∗ mag(b) ] + [0, ā + a̲] ∗ b.

There are four subcases for the value of b: b̲ ≥ 0, b̄ ≤ 0, b̄ > 0 > b̲ with b̄ ≥ |b̲|,
and b̄ > 0 > b̲ with b̄ < |b̲|.

1. Case b̲ ≥ 0:

    mag(b) = b̄
    ⇒ a ~ b = [ a̲ ∗ mag(b), −a̲ ∗ mag(b) ] + [0, ā + a̲] ∗ b
            = [ a̲ ∗ b̄, −a̲ ∗ b̄ ] + [ 0, (ā + a̲) ∗ b̄ ]
            = [ a̲ ∗ b̄, ā ∗ b̄ ].

Meanwhile, b̲ ≥ 0 implies a ∗ b = [ a̲ ∗ b̄, ā ∗ b̄ ]. Hence a ~ b = a ∗ b. It means that there
is no overestimation in this case, i.e. ρ1 = 1.

2. Case b̄ ≤ 0: this case is similar to the case b̲ ≥ 0. It means that if a contains zero
but b does not contain zero then ρ1 = 1. Hence property (3) is true.

3. Case b̄ > 0 > b̲ and b̄ ≥ |b̲|: then mag(b) = b̄ and (ā + a̲) ∗ b̲ < 0 < (ā + a̲) ∗ b̄. Hence

    a ~ b = [ a̲ ∗ mag(b), −a̲ ∗ mag(b) ] + [0, ā + a̲] ∗ b
          = [ a̲ ∗ b̄, −a̲ ∗ b̄ ] + [ (ā + a̲) ∗ b̲, (ā + a̲) ∗ b̄ ]
          = [ (ā + a̲) ∗ b̲ + a̲ ∗ b̄, ā ∗ b̄ ]
    ⇒ diam(a ~ b) = ā ∗ b̄ − ((ā + a̲) ∗ b̲ + a̲ ∗ b̄)
                  = ā ∗ b̄ ∗ (1 − a̲/ā − b̲/b̄ − (a̲/ā) ∗ (b̲/b̄)).

As ā > 0 > a̲ > −ā and b̄ > 0 > b̲ ≥ −b̄,

    a̲ ∗ b̄ < 0,  ā ∗ b̲ < 0 < a̲ ∗ b̲ ≤ ā ∗ b̄
    ⇒ a ∗ b = [ min(a̲ ∗ b̄, ā ∗ b̲), ā ∗ b̄ ]
    ⇒ diam(a ∗ b) = ā ∗ b̄ − min(a̲ ∗ b̄, ā ∗ b̲)
                  = ā ∗ b̄ ∗ (1 − min(a̲/ā, b̲/b̄)).

Denote by M = max(|a̲|/ā, |b̲|/b̄) and m = min(|a̲|/ā, |b̲|/b̄); then

    0 < m ≤ M ≤ 1,
    min(a̲/ā, b̲/b̄) = −M,
    a̲/ā + b̲/b̄ = −m − M,                        (2.9)
    (a̲/ā) ∗ (b̲/b̄) = m ∗ M.

The factor of overestimation is computed by:

    ρ1 = diam(a ~ b) / diam(a ∗ b)
       = (1 − a̲/ā − b̲/b̄ − (a̲/ā) ∗ (b̲/b̄)) / (1 − min(a̲/ā, b̲/b̄))
       = (1 + M + m − M ∗ m) / (1 + M)
       = 1 + m ∗ (1 − M)/(1 + M)
       ≤ 1 + M ∗ (1 − M)/(1 + M) ≡ f(M).

By studying the variations of the function f(M), with M varying between 0 and 1,
we obtain that f(M) ≤ 4 − 2√2. Equality holds when M = √2 − 1.
Hence, ρ1 ≤ 4 − 2√2. Equality holds when a̲/ā = b̲/b̄ = 1 − √2.

4. Case b̄ > 0 > b̲ and b̄ < |b̲|: this case is similar to the case b̄ > 0 > b̲ and b̄ ≥ |b̲|.
Denoting M = max(|a̲|/ā, b̄/|b̲|) and m = min(|a̲|/ā, b̄/|b̲|), the factor of
overestimation in this case is

    ρ1 = 1 + m ∗ (1 − M)/(1 + M),

which is also bounded by 4 − 2√2. Equality holds when a̲/ā = b̄/b̲ = 1 − √2.

We can also generalize this theorem to the case of matrix multiplication.

Theorem 2.3.7. For A ∈ IR^{n×n}, B ∈ IR^{n×n}, the overestimation factor of each component
of the multiplication defined by Proposition 2.3.5 is bounded by

    ρ ≤ 4 − 2√2.

Moreover, if all the components of either A or B do not contain zero then all the computed
components are exact.

Proof. By induction from Theorem 2.3.6 and from the fact that

    diam(a + b) = diam(a) + diam(b).

Hence, it is clear that the overestimation in the worst case of our algorithm, which is
4 − 2√2 ≈ 1.18, is smaller than the overestimation in the worst case of Rump's algorithm,
which is 1.5. For each individual pair of inputs, the following proposition gives a rough
comparison between the two algorithms.
 
Proposition 2.3.8. For a = [a̲, ā] = ⟨â, α⟩ ∈ IR, b = [b̲, b̄] = ⟨b̂, β⟩ ∈ IR,

    a ~ b ⊆ a ⊙ b.

Equality holds when either one of the two inputs is punctual or centered in zero.

Proof. We distinguish between two cases for the values of a and b:

1. Either a or b does not contain zero.
In this case, following Theorem 2.3.6, the overestimation of a ~ b is ρ1 = 1, in other
words a ~ b = a ∗ b. Hence, it is obvious that

    a ~ b ⊆ a ⊙ b.

2. Both a and b contain zero. There are four possible sub-cases:

(a) ā ≥ |a̲| > 0 > a̲ and b̄ ≥ |b̲| > 0 > b̲,

(b) ā ≥ |a̲| > 0 > a̲ and b̲ < 0 < b̄ ≤ |b̲|,

(c) a̲ < 0 < ā ≤ |a̲| and b̄ ≥ |b̲| > 0 > b̲,

(d) a̲ < 0 < ā ≤ |a̲| and b̲ < 0 < b̄ ≤ |b̲|.

We will consider here only the first sub-case. The proof of the other 3 sub-cases is
similar.
[a̲, ā] and ⟨â, α⟩ represent the same interval, i.e.

    a̲ = â − α,  ā = â + α   and   â = (a̲ + ā)/2,  α = (ā − a̲)/2.

Similarly we have

    b̲ = b̂ − β,  b̄ = b̂ + β   and   b̂ = (b̲ + b̄)/2,  β = (b̄ − b̲)/2.

We have ā ≥ |a̲| > 0 > a̲ and b̄ ≥ |b̲| > 0 > b̲, hence α > â ≥ 0 and β > b̂ ≥ 0.
a ⊙ b as defined by (2.2) satisfies

    mid(a ⊙ b) = â ∗ b̂,
    rad(a ⊙ b) = |â| ∗ β + α ∗ |b̂| + α ∗ β = â ∗ β + α ∗ b̂ + α ∗ β.

As mentioned in the proof of Theorem 2.3.6, when ā ≥ |a̲| > 0 > a̲ and b̄ ≥ |b̲| >
0 > b̲, a ~ b is

    a ~ b = [ (ā + a̲) ∗ b̲ + a̲ ∗ b̄, ā ∗ b̄ ]
          = [ 2 ∗ â ∗ (b̂ − β) + (â − α) ∗ (b̂ + β), (â + α) ∗ (b̂ + β) ]
          = [ 3 ∗ â ∗ b̂ − â ∗ β − α ∗ b̂ − α ∗ β, â ∗ b̂ + â ∗ β + α ∗ b̂ + α ∗ β ]
          = [ 2 ∗ â ∗ b̂ + mid(a ⊙ b) − rad(a ⊙ b), mid(a ⊙ b) + rad(a ⊙ b) ]
          ⊆ [ mid(a ⊙ b) − rad(a ⊙ b), mid(a ⊙ b) + rad(a ⊙ b) ]   since â ∗ b̂ ≥ 0
          = a ⊙ b.

This proposition can be generalized to the multiplication of interval matrices by induction
on row and column indices, using the fact that if a′ ⊆ a and b′ ⊆ b then
a′ + b′ ⊆ a + b. Therefore Proposition 2.3.5 always provides results at least as sharp as those
computed by Rump's formula (2.3).

2.3.3 Implementation
We will now detail the implementation of the formula given in Proposition 2.3.5. Using
floating-point arithmetic to implement interval arithmetic, we have to deal with rounding
errors.
For the matrix decomposition as defined by Proposition 2.3.4, the only source of round-
ing error is the addition between the two bounds, as the rest can be expressed in terms of
max and min functions which are free of rounding errors. Hereafter, MatLab notations are
used to describe algorithms: = means assignment. Operations take place between scalars,
vectors or matrices as long as their dimensions agree. Indeed, to increase the parallelism
of the algorithms, instead of using loops, we will express them solely in terms of matrix
operations, using MatLab-like syntax.
Let A = [A̲, Ā] ∈ IF^{n×n} be an interval matrix. The following algorithm decomposes A
into two parts as defined by Proposition 2.3.4. All operations are componentwise.
into two parts as defined by Proposition 2.3.4. All operations are componentwise.

Algorithm 2.1 Decomposition algorithm for endpoint intervals

    A0 = max(min(−A̲, Ā), 0)
    A1 = ▽(A̲ + A0)
    A2 = △(Ā − A0)

The validity of this algorithm is given by the following proposition.

Proposition 2.3.9. Let A = [A̲, Ā] ∈ IF^{n×n} be an interval matrix. For A0, A1, A2 ∈ F^{n×n}
computed by Algorithm 2.1, it holds that

1. A0 = [−A0, A0],

2. A∗ ⊆ [A1, A2].
Proof. We will inspect several subcases for the component Ai,j, 1 ≤ i, j ≤ n, of A:

• If 0 ≤ A̲i,j ≤ Āi,j, then A0i,j = 0 and A∗i,j = Ai,j. For A0i,j we have

    A0i,j = max(min(−A̲i,j, Āi,j), 0) = max(−A̲i,j, 0) = 0.

Since the floating-point addition or subtraction of 0 is exact,

    A1i,j = ▽(A̲i,j + A0i,j) = A̲i,j,
    A2i,j = △(Āi,j − A0i,j) = Āi,j.

Hence [−A0i,j, A0i,j] = 0 = A0i,j, and [A1i,j, A2i,j] = [A̲i,j, Āi,j] = A∗i,j.

• If A̲i,j ≤ Āi,j ≤ 0, then A0i,j = 0 and A∗i,j = Ai,j. For A0i,j we have

    A0i,j = max(min(−A̲i,j, Āi,j), 0) = max(Āi,j, 0) = 0.

Therefore A1i,j = A̲i,j and A2i,j = Āi,j. Hence [−A0i,j, A0i,j] = 0 = A0i,j, and
[A1i,j, A2i,j] = [A̲i,j, Āi,j] = A∗i,j.
 

• If A̲i,j < 0 < |A̲i,j| < Āi,j, then A0i,j = [A̲i,j, −A̲i,j] and A∗i,j = [0, Āi,j + A̲i,j].

    A0i,j = max(min(−A̲i,j, Āi,j), 0) = max(−A̲i,j, 0) = −A̲i,j
    ⇒ A1i,j = ▽(A̲i,j + A0i,j) = ▽(A̲i,j − A̲i,j) = 0,
      A2i,j = △(Āi,j − A0i,j) = △(Āi,j + A̲i,j).

It is obvious that △(Āi,j + A̲i,j) ≥ Āi,j + A̲i,j, hence [0, Āi,j + A̲i,j] ⊆ [0, △(Āi,j + A̲i,j)].
Hence A∗i,j ⊆ [A1i,j, A2i,j], and [−A0i,j, A0i,j] = [A̲i,j, −A̲i,j] = A0i,j.

• If A̲i,j < 0 < Āi,j < |A̲i,j|, then A0i,j = [−Āi,j, Āi,j] and A∗i,j = [A̲i,j + Āi,j, 0].

    A0i,j = max(min(−A̲i,j, Āi,j), 0) = max(Āi,j, 0) = Āi,j
    ⇒ A1i,j = ▽(A̲i,j + A0i,j) = ▽(A̲i,j + Āi,j),
      A2i,j = △(Āi,j − A0i,j) = 0.

Again, it is obvious that ▽(A̲i,j + Āi,j) ≤ A̲i,j + Āi,j, hence [A̲i,j + Āi,j, 0] ⊆ [▽(A̲i,j + Āi,j), 0].
Hence A∗i,j ⊆ [A1i,j, A2i,j], and [−A0i,j, A0i,j] = [−Āi,j, Āi,j] = A0i,j.

Now we have to compute the two sub-products, which are [−A0, A0] ∗ B and [A1, A2] ∗ B.
Following Section 2.3.1,

    [−A0, A0] ∗ B = [ −A0 ∗ mag(B), A0 ∗ mag(B) ].

Hence to compute [−A0, A0] ∗ B, it is only needed to compute its upper bound, which is
A0 ∗ mag(B). In the presence of rounding errors, it must be computed in upward rounding
mode.
To compute the multiplication [A1, A2] ∗ B using Proposition 2.3.3, we need the
negative and positive parts of the two bounds of [A1, A2] and B.
For a ∈ F, it is true that a⁻ = min(a, 0) and a⁺ = max(a, 0).
For a = [a̲, ā] ∈ IF not containing zero, it is true that

    a⁺ = [ max(a̲, 0), max(ā, 0) ],
    a⁻ = [ min(a̲, 0), min(ā, 0) ].

Hence, using Proposition 2.3.3 we can compute the multiplication [A1, A2] ∗ B by:

    inf([A1, A2] ∗ B) = ▽( max(A2, 0) ∗ min(B̲, 0) + max(A1, 0) ∗ max(B̲, 0)
                           + min(A2, 0) ∗ min(B̄, 0) + min(A1, 0) ∗ max(B̄, 0) ),
    sup([A1, A2] ∗ B) = △( max(A2, 0) ∗ max(B̄, 0) + max(A1, 0) ∗ min(B̄, 0)
                           + min(A2, 0) ∗ max(B̲, 0) + min(A1, 0) ∗ min(B̲, 0) ).

Finally, the full algorithm is implemented in MatLab as a function called igemm using
the IntLab library as follows.

 1  function R = igemm(A, B)
 2  A0 = max(0, min(-A.inf, A.sup));
 3  setround(-1);
 4  A1 = A.inf + A0;
 5  setround(1);
 6  A2 = A.sup - A0;
 7
 8  A2a = max(A2, 0);
 9  A2i = min(A2, 0);
10  A1a = max(A1, 0);
11  A1i = min(A1, 0);
12
13  B2 = B.sup;
14  B1 = B.inf;
15  B2a = max(B2, 0);
16  B2i = min(B2, 0);
17  B1a = max(B1, 0);
18  B1i = min(B1, 0);
19
20  zeroA = all(all(A0 == 0));
21  if (zeroA)
22      A0 = zeros(size(A0, 1), size(B, 2));
23  else
24      A0 = A0 * max(-B.inf, B.sup);    % A0 * mag(B), in upward rounding
25  end
26
27  RU = A2a * B2a + A1a * B2i + A2i * B1a + A1i * B1i + A0;
28  setround(-1);
29  RI = A2a * B1i + A1a * B1a + A2i * B2i + A1i * B2a - A0;
30
31  R = infsup(RI, RU);
32  end

The IntLab library uses endpoints format to represent real intervals. Each real interval
term (scalar, vector, or matrix) is represented by two floating-point terms of the same size,
one for the lower bound and one for the upper bound. One can obtain the lower bound and
the upper bound of each interval term by using the getter functions .inf and .sup respectively.
Besides basic MatLab functions, we use here some additional functions:

• setround(rnd) to change the rounding mode according to the input parameter rnd:
1 means upward rounding mode, −1 means downward rounding mode, and 0 means
to-nearest rounding mode;

• infsup(P1, P2) to construct an interval matrix whose lower bound and upper bound
are respectively P1 and P2;

• zeros(m, n) to construct an m × n matrix of null components;

• all(X) to test whether all the elements of X are non-zero: if X is a vector, it returns 1
if and only if none of the elements of X is zero; if X is a matrix, all(X) operates on the
columns of X, returning a row vector of logical 1's and 0's. Hence all(all(A0 == 0)) at
line 20 returns 1 if and only if every component of A0 is zero.
Lines 2–6 perform the decomposition following Algorithm 2.1, which requires 2 rounding
mode changes. Then we do not change the rounding mode and keep the upward rounding

Figure 2.1: Performance of igemm. [Plot: execution time relative to one floating-point matrix multiplication (vertical axis, 0 to 30) versus matrix dimension (horizontal axis, 0 to 1000), for IntLab and Algorithm 1.]

mode for the following lines: this has no impact on lines 8–22, and it is the required
rounding mode for line 24 which consists in computing A0 ∗ mag(B), and line 27 which
computes the upper bound of the result. Then we change to downward rounding mode to
compute the lower bound of the result in line 29. In total, this implementation uses only
3 changes of rounding mode.

Practical performance
The igemm function uses in total 9 floating-point matrix products, along with some matrix
operations of order O(n2 ) to manipulate data. Hence, in theory, the factor in terms of
execution time between igemm and a floating-point matrix product is 9.
In practice, when the matrix dimension is small, the difference between operations of
order O(n2 ) and operations of order O(n3 ) is small. That is not only because the theoretical
factor n is small, but also that operations of order O(n3 ) better exploit memory locality
and parallelism.
Test matrices for this set of experiments as well as other experiments in this section
are generated by function rand(n) from MatLab, given n the dimension of the matrix.
This function returns a matrix of size n × n whose components are uniformly distributed
random numbers between 0 and 1.
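One possible driver for such an experiment is sketched below; the way the interval radii are derived from rand(n) is our own assumption, as it is not specified here:

    % Build random interval test matrices and compare timings.
    n  = 500;
    mA = rand(n);  mB = rand(n);                 % random midpoints in [0, 1]
    rA = 1e-3 * rand(n);  rB = 1e-3 * rand(n);   % small radii (arbitrary choice)
    A  = infsup(mA - rA, mA + rA);               % IntLab interval matrices
    B  = infsup(mB - rB, mB + rB);
    tic; C1 = A * B;        t_intlab = toc;      % IntLab product (Rump's algorithm)
    tic; C2 = igemm(A, B);  t_igemm  = toc;      % Algorithm 1
    tic; P  = mA * mB;      t_float  = toc;      % floating-point reference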
Since the igemm function uses only a small constant number of BLAS level 3 routines,
the overhead in execution time introduced by the interpretation is negligible. It means that
there is almost no difference in performance in terms of execution time between a MatLab
implementation and other implementations on C or FORTRAN programming language.
As shown on Figure 2.1, when the matrix dimension is small, i.e smaller than 100, the
execution time of igemm with respect to a floating-point matrix multiplication, marked by
the × symbol, is high. Nevertheless, when the matrix dimension gets higher, this factor
gets smaller and gets close to the theoretical factor.
The same phenomenon can be observed with Rump’s algorithm, marked by the +
symbol.
In the implementation, for some special cases, we can reduce the number of floating-

point matrix products by inspecting the signs of input matrices. For example, if A is a
non-negative matrix then, following Proposition 2.3.1, the interval multiplication requires
only four punctual multiplications. Furthermore, if the upper bounds of all the components
of B have the same sign, then either B̄⁻ or B̄⁺ is zero. Hence, in total it requires only
three punctual matrix multiplications.
The best case is when all the components of A are of the same sign, and all the
components of B are of the same sign, though it is not a very common case. In this
case, we need only two punctual matrix multiplications to implement an interval matrix
multiplication.

Implementation issues
Following Proposition 2.3.8, results computed by this algorithm are always included in
results computed by the IntLab library. In particular, when one of the multipliers does not
contain zero, there is no overestimation with this algorithm. For the implementation
using floating-point arithmetic, we face the problem of rounding errors. Therefore, when
the input data are of small relative precision, results computed by the IntLab library
can curiously be even tighter than results computed by this algorithm. This can be explained
by the following example.
Denote by ε the relative machine error for double precision, which is 2⁻⁵³; then 1 + 2ε,
1 + 4ε and 1 + 6ε are floating-point numbers, but 1 + ε, 1 + 3ε and 1 + 5ε are not.
Let us consider the interval multiplication a ∗ b where a = [1 − 2ε, 1 + 2ε] = ⟨1, 2ε⟩ and
b = [1 + 2ε, 1 + 6ε] = ⟨1 + 4ε, 2ε⟩.
Using the natural algorithm,

    a ∗ b = [ ▽((1 − 2ε) ∗ (1 + 2ε)), △((1 + 2ε) ∗ (1 + 6ε)) ]
          = [ ▽(1 − 4ε²), △(1 + 8ε + 12ε²) ]
          = [ 1 − ε, 1 + 10ε ].

Using the midpoint-radius representation,

    mid(a ⊙ b) = [ ▽(1 ∗ (1 + 4ε)), △(1 ∗ (1 + 4ε)) ] = 1 + 4ε,
    rad(a ⊙ b) = △(2ε ∗ (1 + 4ε + 2ε) + 1 ∗ 2ε)
               = △(4ε + 12ε²)
               = 4ε ∗ (1 + 4ε).

Without having to transform back to the endpoints representation, we can see that

    diam(a ⊙ b) = 8ε ∗ (1 + 4ε) < 11ε = diam(a ∗ b).

This example is somewhat tricky, but its purpose is to show that, taking into account
rounding errors, algorithms’ behavior is different from that of the mathematical model.
Hence, although in theory our algorithm is shown to be always better than Rump’s algo-
rithm in terms of accuracy, in practice it is only true when the intervals’ width is large
enough.
In the following section, we will explore another algorithm which takes advantage of
the decomposition proposition, and which is less prone to rounding errors.

2.4 Algorithm 2 using midpoint-radius representation


The interval matrix multiplication algorithm addressed in the previous section is based
on the representation of intervals by endpoints. Using this representation, the product
between one interval matrix, which does not contain zero in its interior, and another
interval matrix requires 8 floating-point matrix products. In general, an interval matrix
multiplication as defined by Proposition 2.3.5 requires 9 punctual matrix multiplications.
The overestimation in the worst case of this algorithm is 4 − 2√2 ≈ 1.18.
In this section, we introduce another algorithm, based on the midpoint-radius
representation, to compute the product of two interval matrices. This algorithm also uses
the decomposition of one multiplier matrix into two parts, one which does not contain
zero and the other which is centered in zero.

2.4.1 Scalar interval product


Let us first consider the case of the scalar interval product.
Let a = ⟨a, α⟩ and b = ⟨b, β⟩ be two intervals in midpoint-radius representation. Let
us split the problem into four cases: a ≥ 0, a < 0, 0 ≤ a < α, and −α < a ≤ 0 (in the first
two cases the condition bears on the interval a, in the last two on its midpoint a).
For each case, we apply the decomposition proposition to compute the product of a
and b, and eventually deduce a homogeneous formula in midpoint-radius format covering all
four cases.

Case a ≥ 0

When the interval a is non-negative, a + α ≥ 0 and a − α ≥ 0. Therefore

    sup(a ∗ b) = max( (a + α) ∗ (b + β), (a − α) ∗ (b + β) )
               = a ∗ (b + β) + max( α ∗ (b + β), −α ∗ (b + β) )
               = a ∗ (b + β) + α ∗ |b + β|,
    inf(a ∗ b) = min( (a + α) ∗ (b − β), (a − α) ∗ (b − β) )
               = a ∗ (b − β) + min( α ∗ (b − β), −α ∗ (b − β) )
               = a ∗ (b − β) − α ∗ |b − β|.

Hence

    mid(a ∗ b) = (sup(a ∗ b) + inf(a ∗ b)) / 2
               = ( a ∗ (b + β) + α ∗ |b + β| + a ∗ (b − β) − α ∗ |b − β| ) / 2
               = a ∗ b + α ∗ (|b + β| − |b − β|) / 2,
    rad(a ∗ b) = (sup(a ∗ b) − inf(a ∗ b)) / 2
               = ( a ∗ (b + β) + α ∗ |b + β| − (a ∗ (b − β) − α ∗ |b − β|) ) / 2
               = a ∗ β + α ∗ (|b + β| + |b − β|) / 2.

Case a < 0

When the interval a is negative, a + α < 0 and a − α < 0. Therefore

    sup(a ∗ b) = max( (a + α) ∗ (b − β), (a − α) ∗ (b − β) )
               = a ∗ (b − β) + max( α ∗ (b − β), −α ∗ (b − β) )
               = a ∗ (b − β) + α ∗ |b − β|,
    inf(a ∗ b) = min( (a + α) ∗ (b + β), (a − α) ∗ (b + β) )
               = a ∗ (b + β) + min( α ∗ (b + β), −α ∗ (b + β) )
               = a ∗ (b + β) − α ∗ |b + β|.

Hence

    mid(a ∗ b) = (sup(a ∗ b) + inf(a ∗ b)) / 2
               = ( a ∗ (b − β) + α ∗ |b − β| + a ∗ (b + β) − α ∗ |b + β| ) / 2
               = a ∗ b − α ∗ (|b + β| − |b − β|) / 2,
    rad(a ∗ b) = (sup(a ∗ b) − inf(a ∗ b)) / 2
               = ( a ∗ (b − β) + α ∗ |b − β| − (a ∗ (b + β) − α ∗ |b + β|) ) / 2
               = −a ∗ β + α ∗ (|b + β| + |b − β|) / 2.

Denoting by sign(a) the sign of a, these two cases, where a does not contain zero, can be
rewritten with the same formula:

    mid(a ∗ b) = a ∗ b + sign(a) ∗ α ∗ (|b + β| − |b − β|) / 2,
    rad(a ∗ b) = |a| ∗ β + α ∗ (|b + β| + |b − β|) / 2.

Moreover β ≥ 0, hence

• If b > β ≥ 0 then |b + β| + |b − β| = 2 ∗ b and |b + β| − |b − β| = 2 ∗ β.

• If β ≥ b ≥ 0 then |b + β| + |b − β| = 2 ∗ β and |b + β| − |b − β| = 2 ∗ b.

• If 0 ≥ b ≥ −β then |b + β| + |b − β| = 2 ∗ β and |b + β| − |b − β| = 2 ∗ b.

• If b < −β ≤ 0 then |b + β| + |b − β| = −2 ∗ b and |b + β| − |b − β| = −2 ∗ β.

It is easy to deduce that

    (|b + β| + |b − β|) / 2 = max(|b|, β),
    (|b + β| − |b − β|) / 2 = sign(b) ∗ min(|b|, β).

Therefore

mid(a ∗ b) = a ∗ b + sign(a) ∗ α ∗ sign(b) ∗ min(|b|, β),
(2.10)
rad(a ∗ b) = |a| ∗ β + α ∗ max(|b|, β).

Case 0 ≤ a < α
We have a − α < 0 < |a − α| < a + α. Using the technique of decomposition from section
2.3, a = ha, αi = [a − α, a + α] is decomposed into two parts as follows:
a0 = [a − α, α − a] = h0, α − ai ,
a∗ = [0, 2 ∗ a] = ha, ai .
As a0 is centered in zero, mid(a0 ∗ b) = 0 and
rad(a0 ∗ b) = (α − a) ∗ mag(b)
= (α − a) ∗ (|b| + β).
Moreover, a∗ does not contain zero, and following (2.10)
mid(a∗ ∗ b) = a ∗ b + a ∗ sign(b) ∗ min(|b|, β),


rad(a∗ ∗ b) = a ∗ β + a ∗ max(|b|, β).


Due to the sub-distributivity property of interval arithmetic, we have
a ∗ b ⊆ a0 ∗ b + a∗ ∗ b ≡ a ~ b
where
mid(a ~ b) = mid(a∗ ∗ b) + mid(a0 ∗ b)
= a ∗ b + a ∗ sign(b) ∗ min(|b|, β),
rad(a ~ b) = rad(a∗ ∗ b) + rad(a0 ∗ b)
= a ∗ β + a ∗ max(|b|, β) + (α − a) ∗ (|b| + β)
= a ∗ β + α ∗ |b| + α ∗ β − a ∗ (|b| + β − max(|b|, β))
= a ∗ β + α ∗ |b| + α ∗ β − a ∗ min(|b|, β).

Case 0 ≥ a > −α
We have a − α < 0 < a + α < |a − α|. Using the technique of decomposition from section
2.3, a is decomposed into two parts as follows:
a0 = [−(a + α), α + a] = h0, a + αi ,
a∗ = [2 ∗ a, 0] = ha, −ai .
As a0 is centered in zero, mid(a0 ∗ b) = 0 and
rad(a0 ∗ b) = (α + a) ∗ mag(b)
= (α + a) ∗ (|b| + β).
Again, a∗ does not contain zero, following (2.10)
mid(a∗ ∗ b) = a ∗ b + sign(a) ∗ (−a) ∗ sign(b) ∗ min(|b|, β)
= a ∗ b + a ∗ sign(b) ∗ min(|b|, β),
rad(a∗ ∗ b) = |a| ∗ β − a ∗ max(|b|, β).

Due to the sub-distributive property of interval arithmetic, we have

a ∗ b ⊆ a0 ∗ b + a∗ ∗ b ≡ a ~ b

where
mid(a ~ b) = a ∗ b + a ∗ sign(b) ∗ min(|b|, β),
rad(a ~ b) = |a| ∗ β − a ∗ max(|b|, β) + (α + a) ∗ (|b| + β)
= |a| ∗ β + α ∗ |b| + α ∗ β + a ∗ (|b| + β − max(|b|, β))
= |a| ∗ β + α ∗ |b| + α ∗ β − |a| ∗ min(|b|, β).

Hence, the last two cases where 0 ≤ a < α or 0 ≥ a > −α can be rewritten in the same
formula 
mid(a ~ b) = a ∗ b + a ∗ sign(b) ∗ min(|b|, β),
(2.11)
rad(a ~ b) = |a| ∗ β + α ∗ |b| + α ∗ β − |a| ∗ min(|b|, β).
Let’s now compare the two formulae (2.10) and (2.11) to deduce a homogeneous formula
for the general case. Let us recall Formula (2.10) for the case where a does not contain
zero: 
mid(a ∗ b) = a ∗ b + sign(a) ∗ α ∗ sign(b) ∗ min(|b|, β),
rad(a ∗ b) = |a| ∗ β + α ∗ max(|b|, β).
If a does not contain zero, then min(|a|, α) = α.
Otherwise, if a contains zero, then min(|a|, α) = |a|. Hence a = sign(a) ∗ |a| =
sign(a) ∗ min(|a|, α).
Therefore, the midpoint of the two formulae (2.10) and (2.11) can be rewritten in the
same form:

mid(a ~ b) = a ∗ b + sign(a) ∗ min(|a|, α) ∗ sign(b) ∗ min(|b|, β).

Moreover, when a does not contain zero

rad(a ∗ b) = |a| ∗ β + α ∗ max(|b|, β)


= |a| ∗ β + α ∗ (|b| + β − min(|b|, β))
= |a| ∗ β + α ∗ |b| + α ∗ β − α ∗ min(|b|, β))
= |a| ∗ β + α ∗ |b| + α ∗ β − min(|a|, α) ∗ min(|b|, β).

Finally, these two formulae can be rewritten in the same form for both cases, whether
a contains zero or not:


mid(a ~ b) = a ∗ b + sign(a) ∗ min(|a|, α) ∗ sign(b) ∗ min(|b|, β),
(2.12)
rad(a ~ b) = |a| ∗ β + α ∗ |b| + α ∗ β − min(|a|, α) ∗ min(|b|, β).

mmr(a) = sign(a) ∗ min(|a|, α)
Denote by
mmr(b) = sign(b) ∗ min(|b|, β)
then (2.12) is rewritten by:

mid(a ~ b) = a ∗ b + mmr(a) ∗ mmr(b),
(2.13)
rad(a ~ b) = |a| ∗ β + α ∗ |b| + α ∗ β − |mmr(a)| ∗ |mmr(b)|.

This formula differs from Rump's formula (2.2) only by the term mmr(a) ∗ mmr(b). It
can be considered as a correction term to Rump's algorithm. Using this correction term
and the fact that |mmr(a) ∗ mmr(b)| ≤ |mmr(a)| ∗ |mmr(b)|, it is easy to deduce that a ~ b is
always included in a ⊙ b for all a ∈ IR and b ∈ IR. The following proposition provides an
estimation of the accuracy of a ~ b.

Proposition 2.4.1. Let a, b be two intervals. The product of a and b as defined by
(2.13) is:

• equal to the exact product when a or b does not contain zero,

• an enclosure of the exact product with an overestimation factor less than or equal
to 4 − 2√2 when both a and b contain zero in their interior.

Proof. Using the decomposition proposition to deduce the formula, the overestimation
factor can easily be deduced from the previous section.
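For a concrete feeling of (2.13), here is a small numerical sketch; the example data are ours, and rounding errors are ignored for readability:

    % a = <ma, ra>, b = <mb, rb> in midpoint-radius form.
    ma = 1;  ra = 2;  mb = -0.5;  rb = 3;             % example data
    mma = sign(ma) * min(abs(ma), ra);                % mmr(a)
    mmb = sign(mb) * min(abs(mb), rb);                % mmr(b)
    mc = ma * mb + mma * mmb;                         % midpoint with correction term
    rc_rump = abs(ma) * rb + ra * abs(mb) + ra * rb;  % Rump's radius (2.2)
    rc = rc_rump - abs(mma) * abs(mmb);               % corrected radius (2.13)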

2.4.2 Matrix and vector operations


Formula (2.13) for the product of two intervals can be generalized to matrix and vector
operations.
Proposition 2.4.2. For A = hA, Λi ∈ IRm×n , B = hB, Σi ∈ IRn×k , the multiplication
A ~ B is defined as

mid(A ~ B) = A ∗ B + mmr(A) ∗ mmr(B),
(2.14)
rad(A ~ B) = |A| ∗ Σ + Λ ∗ |B| + Λ ∗ Σ − |mmr(A)| ∗ |mmr(B)|.
Then A ∗ B ⊆ A ~ B.
Proof. Matrix and vector operations are both based on the dot product. Therefore, the
isotonicity of (2.14) can be proven by considering the case of the interval dot product.
Let a and b be two vectors of n intervals.

    a ∗ bᵀ = Σ_{i=1}^{n} aᵢ ∗ bᵢ
           ⊆ Σ_{i=1}^{n} aᵢ ~ bᵢ
           ⊆ Σ_{i=1}^{n} ⟨ aᵢ ∗ bᵢ + mmr(aᵢ) ∗ mmr(bᵢ), |aᵢ| ∗ βᵢ + αᵢ ∗ |bᵢ| + αᵢ ∗ βᵢ − |mmr(aᵢ)| ∗ |mmr(bᵢ)| ⟩
           ⊆ ⟨ Σ_{i=1}^{n} ( aᵢ ∗ bᵢ + mmr(aᵢ) ∗ mmr(bᵢ) ),
               Σ_{i=1}^{n} ( |aᵢ| ∗ βᵢ + αᵢ ∗ |bᵢ| + αᵢ ∗ βᵢ − |mmr(aᵢ)| ∗ |mmr(bᵢ)| ) ⟩
           ⊆ ⟨ a ∗ bᵀ + mmr(a) ∗ mmr(b)ᵀ, |a| ∗ βᵀ + α ∗ |b|ᵀ + α ∗ βᵀ − |mmr(a)| ∗ |mmr(b)|ᵀ ⟩
           ⊆ a ~ bᵀ.

Let a and b be two vectors of n intervals. Because there is no overestimation in interval
addition, and thus diam(x + y) = diam(x) + diam(y), we can prove by induction that

    diam(a ~ b) ≤ (4 − 2√2) ∗ diam(a ∗ b).

Moreover, if no component of a contains zero, then a ~ b = a ∗ b. As a consequence, we
get the following theorem.

Theorem 2.4.3. For A = ⟨A, Λ⟩ ∈ IR^{m×n}, B = ⟨B, Σ⟩ ∈ IR^{n×k}, the multiplication A ~ B
is defined by (2.14). It holds that:

1. the overestimation for each component of A ~ B is always bounded by 4 − 2√2;

2. if no component of A or B contains zero, then there is no overestimation.

2.4.3 Implementation
To implement this algorithm, we first have to implement the function mmr. Getting the
sign of a matrix is often costly, hence instead of using the original formula we use an
alternative formula which employs only maximum and minimum functions.
Proposition 2.4.4. For a = ha, αi ∈ IR

mmr(a) = min(α, max(a, −α)).

Proof. The equivalence of this alternative formula and the original formula can be checked
by inspecting all possible relative positions of a with respect to 0 and α.
1. If a ≥ 0 then sign(a) = 1 and a = |a|. Hence

min(α, max(a, −α)) = min(α, a) = sign(a) ∗ min(|a|, α) = mmr(a).

2. If a < 0 then sign(a) = −1 and max(a, −α) = max(−|a|, −α) = −min(|a|, α) < 0 <
α. Hence

min(α, max(a, −α)) = −min(|a|, α) = sign(a) ∗ min(|a|, α) = mmr(a).

The implementation in IntLab of the function mmr, using the previous proposition,
exhibits some minor improvement. It is faster than the straightforward implementation
using the original formula by a factor 4/3.
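A direct implementation of Proposition 2.4.4 is straightforward (the function name mmr below is ours; mA and rA are the midpoint and radius, and all operations are componentwise, so the same code works for scalars, vectors and matrices):

    function M = mmr(mA, rA)
      % mmr(<mA, rA>) = min(rA, max(mA, -rA)), componentwise
      M = min(rA, max(mA, -rA));
    end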
Although IntLab library supports both endpoints and midpoint-radius representations
of interval, for real input data, intervals are exposed to the user in endpoints format.
Midpoint-radius format is only used internally for complex data. Hence, to implement our
algorithm, we have to transform the input data from endpoints representation to midpoint-
radius representation before performing computations, and finally transform results from
midpoint-radius representation back to endpoints representation for output.
Using floating-point arithmetic to implement interval arithmetic, rounding errors must
be taken care of when transforming interval data. Transforming from midpoint-radius
representation to endpoints representation is simple.

Proposition 2.4.5. For a = ⟨a, α⟩ ∈ IF an interval in midpoint-radius representation, an
enclosing endpoints representation is the interval [a̲, ā] computed by:

    a̲ = ▽(a − α),
    ā = △(a + α).

It holds that a ⊆ [a̲, ā].


To transform intervals from the endpoints representation to the midpoint-radius representation,
we use an algorithm from the IntLab library [Rum], which is due to Oishi [Ois98].

Proposition 2.4.6. For a = [a̲, ā] ∈ IF an interval in endpoints representation, an
enclosing midpoint-radius representation is the interval ⟨a, α⟩ computed by:

    a = △(a̲ + ā) ∗ 0.5,
    α = △(a − a̲).

It holds that a ⊆ ⟨a, α⟩.
The isotonicity of this algorithm is proven below.

Proof. Using directed rounding modes, we have

    α = △(a − a̲) ≥ a − a̲.

Hence a − α ≤ a̲.
Moreover,

    a = △(a̲ + ā) ∗ 0.5 ≥ (a̲ + ā) ∗ 0.5.

Hence a + α ≥ a + (a − a̲) = 2 ∗ a − a̲ ≥ (a̲ + ā) − a̲ = ā.
Combining this with a − α ≤ a̲, we get a − α ≤ a̲ ≤ ā ≤ a + α, and thus a ⊆ ⟨a, α⟩.
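The two conversions can be sketched as follows (the function names ep2mr and mr2ep are ours; in MatLab each function would live in its own file):

    function [m, r] = ep2mr(lo, hi)   % endpoints -> midpoint-radius (Prop. 2.4.6)
      setround(1);                    % upward rounding
      m = (hi + lo) * 0.5;
      r = m - lo;
      setround(0);
    end

    function [lo, hi] = mr2ep(m, r)   % midpoint-radius -> endpoints (Prop. 2.4.5)
      setround(-1);
      lo = m - r;
      setround(1);
      hi = m + r;
      setround(0);
    end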
Like in Rump’s algorithm, because of rounding errors, the midpoint of the result cannot
be computed exactly using formula (2.14). Hence, we have to compute an enclosure of the
midpoint. It means that it must be computed twice, once with downward rounding mode
and once with upward rounding mode.
What follows is a function named igemm3 implementing this algorithm in MatLab,
using the IntLab library.
 1  function C = igemm3(A, B)
 2  setround(1);                 % switch to upward rounding mode
 3  mA = (sup(A) + inf(A)) * 0.5;
 4  mB = (sup(B) + inf(B)) * 0.5;
 5  rA = mA - inf(A);
 6  rB = mB - inf(B);
 7  mmA = min(max(mA, -rA), rA);
 8  mmB = min(max(mB, -rB), rB);
 9
10  mCu = mA * mB + mmA * mmB;
11  setround(-1);                % switch to downward rounding mode
12  mCd = mA * mB + mmA * mmB;
13
14  rC = abs(mmA) * abs(mmB);
15  setround(1);                 % switch to upward rounding mode
16  rC = rA * (rB + abs(mB)) + abs(mA) * rB - rC;
17
18  mCu = mCu + rC;
19  setround(-1);                % switch to downward rounding mode
20  mCd = mCd - rC;
21
22  C = infsup(mCd, mCu);
23  end

This implementation is straightforward with respect to the theoretical formula (2.14).


It is obvious that rC computed at line 14 of the code snippet above is non-negative. Hence
rC computed at line 16 must be smaller than or equal to rA ∗ (rB + abs(mB)) + abs(mA) ∗ rB,
which is the radius of Rump's algorithm. Therefore, apparently, the results computed by this
implementation should be tighter than results computed by the IntLab library. However,
in practice, when all the components of A and B are of small relative precision then the
results computed by the IntLab library are slightly tighter. This is in fact the consequence
of rounding errors. For example, using double precision with mantissa of length 52 bits for
the computations:

    △(1 + 2⁻⁵⁴ − 2⁻⁵³) = △(△(1 + 2⁻⁵⁴) − 2⁻⁵³) = △((1 + 2⁻⁵²) − 2⁻⁵³) = 1 + 2⁻⁵² > 1.

With the presence of rounding errors, computed results depend not only on the input
data but also on the order of the computations. With the same example as above, if we
change the order of computations then the computed result is better.

    △(1 + 2⁻⁵⁴ − 2⁻⁵³) = △(1 + △(2⁻⁵⁴ − 2⁻⁵³)) = △(1 − 2⁻⁵⁴) = 1.

One useful trick to minimize the effect of rounding errors when performing floating-point
summation is to sum the smaller terms first. It is shown in [Hig02, chap.4] to be the best
strategy (in terms of the accuracy of the final result) when all summands are non-negative.
It is often a good heuristic otherwise. By definition of mmA and mmB, mmA ∗ mmB and
abs(mmA) ∗ abs(mmB) should be the smallest among the other sub-products. Because
in the implementation of A ~ B, we need to compute the addition and the subtraction
between midpoint and radius to get the final result in endpoints representations, we will first
compute these two smallest products, then compute the other sub-products to hopefully
reduce rounding errors. This leads to another implementation of the formula (2.14).
 1  function C = igemm4(A, B)
 2  setround(1);
 3  mA = (sup(A) + inf(A)) * 0.5;
 4  mB = (sup(B) + inf(B)) * 0.5;
 5  rA = mA - inf(A);
 6  rB = mB - inf(B);
 7
 8  mmA = min(max(mA, -rA), rA);
 9  mmB = min(max(mB, -rB), rB);
10
11  setround(-1);
12
13  rC = abs(mmA) * abs(mmB);
14  mCd = mmA * mmB + rC;
15
16  setround(1);
17  mCu = mmA * mmB - rC;
18  rC = rA * (rB + abs(mB)) + abs(mA) * rB;
19
20  mCu = mCu + rC;
21  mCu = mCu + mA * mB;
22  setround(-1);
23  mCd = mCd - rC;
24  mCd = mCd + mA * mB;
25
26  C = infsup(mCd, mCu);
27  end

Figure 2.2: Performance of igemm3. [Plot not reproduced.]

With this new implementation, results computed by (2.14) are at least as accurate as
results computed by the IntLab library when the input intervals are narrow. On the other hand,
when the input intervals are thick, this implementation provides results tighter than those
computed by the IntLab library.
Using 7 floating-point matrix multiplications, the theoretical factor of this algorithm
in terms of execution time with respect to a floating-point multiplication is 7. In practice,
because of the data manipulation, which is of order O(n²), when the matrix dimension is small
the practical factor is much higher than the theoretical factor. The theoretical factor is
only reached when the matrix dimension is large enough, i.e. larger than 100 × 100, see Figure
2.2.

2.5 Conclusion
Using solely floating-point matrix operations to implement interval matrix products, the
performance of our algorithms depends strongly on the performance of the employed
floating-point matrix library. Moreover, these algorithms can be easily parallelized by
using parallelized versions of the floating-point matrix library.
We can use whatever library optimized for the running system is available, with only
one requirement, which is that this library must support upward and downward rounding

modes. We can cite here for example the ATLAS and GotoBLAS libraries. Unfortunately,
this requirement prevents us from using Strassen-like algorithms or even faster ones [Str69,
Pan84, Hig02] to reduce the complexity of the floating-point matrix product to less than O(n³).
Rump’s algorithm computes the interval matrix products very fast, with only four
floating-point matrix products, but suffers from a factor of overestimation in the worst
case of 1.5. Our algorithms provide a trade-off between performance and accuracy. With
five more floating-point matrix products for the first algorithm, and three more floating-
point matrix products for the second algorithm, we manage to reduce the overestimation
factor in the worst case to approximately 1.18. Moreover, when one multiplier does not
contain zero, then our algorithms provide exact results in exact arithmetic. In the presence
of rounding errors, the second proposed algorithm always provides results of smaller width
than those computed by Rump’s algorithm.
Chapter 3

Verification of the solution of a linear system

This chapter deals with the verification of floating-point solutions of linear
systems of equations. Our approach is based on floating-point iterative refinement,
applied to the system pre-multiplied by an approximate inverse of the
coefficient matrix. In our method, interval arithmetic is used to compute the
pre-multiplication and the residual. Then interval iterative improvements are
applied, namely the interval Jacobi or Gauss-Seidel iterations. The computed
result is a tight enclosure of the correction term and not merely an approximation
of it. This method is implemented in MatLab using the IntLab library as a
function called certifylss, in order to perform numerical experiments and to compare
our method with an algorithm proposed by S.M. Rump, which is available in the
IntLab library as the function verifylss. Numerical results show the
high accuracy of the two verified functions at the cost of a longer execution time
in comparison with the MatLab non-verified built-in function. When the condition
number of the matrix of the system is small or moderate, certifylss is
faster than verifylss. Nevertheless, certifylss gets slower than verifylss
as the coefficient matrix becomes ill-conditioned.

3.1 Introduction
In this chapter, our approach is presented for solving a linear system in floating-point arith-
metic and at the same time verifying the computed solution. “Verify” means to compute
an enclosure of the error by switching from floating-point arithmetic to interval arithmetic
to solve the residual system: this yields a guaranteed enclosure of the error between the
exact result and the approximate floating-point result. This idea can be found in the
verifylss function of the IntLab library [Rum]. The function verifylss first computes a
floating-point approximation of the solution using iterative refinement, then it switches to
interval arithmetic to compute a “tight” interval error bound using a method due to Neu-
maier [Neu99]. Our proposal is to use alternately floating-point arithmetic and interval
arithmetic to refine simultaneously the approximation and the error bound.
The use of the residual is the basis of iterative refinement methods [Mol67, Ske80,
Wil63, Hig02]. An enclosure of the error can be computed, using interval arithmetic, by


adapting one of these iterative refinement methods. These two building blocks, i.e, the
floating-point solution of a linear system and the iterative refinement of the error bounds
using interval arithmetic, are combined to produce a more accurate solution along with a
tight enclosure of the error. In practice, the error bound is tight enough to indicate the
number of correct digits of the approximate solution.

3.2 State of the art, verifylss function


It is well known that interval arithmetic allows us to obtain verified results. A very naive
way to apply interval arithmetic is by taking existing floating-point algorithms, replacing
all the input data by punctual intervals, as well as replacing all the arithmetic operations
by their corresponding interval operations. Hence the computed result will be an interval
which includes the exact result. For example, to solve a lower triangular linear system of
equations using Gaussian elimination, performing all the operations in interval arithmetic
yields an enclosure of the solution. Some people tend to think of this technique as a
“sloppy” way to compute a verified result, “sloppy” meaning here “lazy”.
In practice, in some cases, this technique might provide results which are verified but
have a large relative error. In the general case, this technique fails to provide a computed result
at all. In such a situation, however, it is incorrect to conclude that the problem is singular. Consider
the following example, taken from [Rum05].

Example 4. Let us solve the triangular linear system

    ( 1−ε                )   ( x1 )   ( 1 )
    (  1    1            )   ( x2 )   ( 1 )
    (  1    1    1       ) ∗ ( x3 ) = ( 1 )
    (  1    1    1    1  )   ( x4 )   ( 1 )
    (  1    1    1    1    1 ) ( x5 )   ( 1 )

where ε = 2⁻⁵³ is the machine relative error. Hence 1 − ε and 1 + 2ε are floating-point
numbers. Nevertheless, 1 + ε is not a floating-point number.
Using forward substitution to solve this lower triangular system in interval arithmetic,
it is obvious that

    x1 = 1/(1 − ε) = [1, 1 + 2ε],
    x2 = 1 − x1 = 1 − [1, 1 + 2ε] = [−2ε, 0],
    x3 = 1 − x1 − x2 = 1 − [1, 1 + 2ε] − [−2ε, 0] = [−2ε, 2ε],
    x4 = 1 − x1 − x2 − x3 = 1 − [1, 1 + 2ε] − [−2ε, 0] − [−2ε, 2ε] = [−4ε, 4ε],
    x5 = 1 − x1 − x2 − x3 − x4 = 1 − [1, 1 + 2ε] − [−2ε, 0] − [−2ε, 2ε] − [−4ε, 4ε] = [−8ε, 8ε].

Note that, except for the computation of the first element, which leads to an interval
of small width, all the other computations are exact, i.e. no rounding error occurs. By
induction one can prove that, for i ≥ 3,

    xi = [ −2^(i−2) ε, 2^(i−2) ε ].
 

Hence the width of the computed solution's components explodes exponentially fast
with respect to the row index. Meanwhile it is easy to verify that the exact solution is
[1/(1 − ε), −ε/(1 − ε), 0, 0, 0]ᵀ. Moreover, the coefficient matrix is well-conditioned, because

    ( 1−ε                ) ⁻¹    (  1/(1−ε)                  )
    (  1    1            )       ( −1/(1−ε)   1              )
    (  1    1    1       )     = (            −1    1        )
    (  1    1    1    1  )       (                  −1    1  )
    (  1    1    1    1    1 )   (                        −1    1 ) .

The main reason for that explosion is that there is no cancellation in interval arithmetic:
for every non-degenerate x ∈ IR, x − x ≠ 0. Addition and subtraction both increase the width of
intervals. Hence, a naive application of interval arithmetic does not provide good results.
One important trick when working with interval arithmetic is to use the original data, in
punctual form, as long as possible [Rum05].
The verifylss function of the IntLab library [Rum] verifies the solution of a linear
system. Its signature is given below

x = verifylss(A, b)

where A can be a floating-point or interval square matrix, and b can be a floating-point or


interval vector. The result x is an interval vector which encloses the hull of the solution
set of the system Ax = b, denoted by Σ(A, b).
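For instance (the data below are our own illustrative example; IntLab must be initialized, e.g. via its startintlab script, so that intval and verifylss are available):

    A = [4 -1; -1 3];           % point coefficient matrix
    b = intval([1; 2]);         % interval right-hand side
    x = verifylss(A, b);        % interval vector enclosing Sigma(A, b)
    xm = mid(x), xr = rad(x)    % midpoint approximation and radius bound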
First, verifylss computes an approximate floating-point solution x̃ of the system
mid(A)x = mid(b) using floating-point iterative refinement. Then it switches to interval
arithmetic to compute an enclosure of the error bound upon the floating-point solution.
The interval part of verifylss comprises two stages. The first stage is based on Krawczyk
operator [Rum80, Rum91, Rum05] and the second stage is based on a theorem by Ning &
Kearfott [Han92, Roh93, NK97].
Theorem 3.2.1 (Krawczyk). Let A ∈ IR^{n×n}, b ∈ IR^n, R ∈ R^{n×n}, x̃ ∈ R^n, eK ∈ IR^n be given and
suppose

    R(b − Ax̃) + (I − RA)eK ⊆ eK.        (3.1)

Then for all Ã ∈ A and for all b̃ ∈ b, Ã⁻¹b̃ ∈ x̃ + eK.

Usually, R is set to an approximate inverse of mid(A). In that way, I − RA is hopefully
close to zero, or centered in zero if R is a good approximation of mid(A)⁻¹.
Furthermore, to compute such an interval vector eK, verifylss uses the epsilon-inflation
technique [Rum98, OO09]. Denote z = R(b − Ax̃) and C = I − RA; we then look for a fixed
point of the Krawczyk operator x ↦ z + Cx. The interval eK is initialized to eK₀ =
z + 0.1 ∗ rad(z) ∗ [−1, 1]. At each step the fixed-point property (3.1) is checked. If (3.1) is
satisfied then the iterations are terminated. Otherwise, eK is updated simultaneously by
the Krawczyk operator and by inflation as follows:

    eK_{i+1} = z + C ∗ eK_i + 0.1 ∗ rad(eK_i) ∗ [−1, 1].
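A minimal sketch of this first stage (ours, not verifylss itself) could look as follows; A is an intval matrix, b an intval vector, xt a floating-point approximate solution, and in0 is IntLab's test for inclusion in the interior:

    R = inv(mid(A));                        % approximate inverse of mid(A)
    z = R * (b - A * xt);                   % interval evaluation of R(b - A*xt)
    C = eye(length(xt)) - R * A;            % C = I - RA
    e = z + 0.1 * rad(z) .* infsup(-1, 1);  % initial candidate eK_0
    success = false;
    for k = 1:7                             % at most 7 inflation steps
      if all(in0(z + C * e, e))             % fixed-point property (3.1)
        success = true; break;
      end
      e = z + C * e + 0.1 * rad(e) .* infsup(-1, 1);   % update + inflation
    end
    % if success, every A~ in A and b~ in b satisfy  A~^{-1} b~  in  xt + e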

The maximum number of iterations is set to 7 in verifylss. If it is not possible to


obtain such an interval vector eK after 7 iterations then verifylss switches to the second
step to compute a tight enclosure of the solution of the system (RA)e = z according to

[HS81, Han92, Neu99]. The formula for the interval enclosure of the solution set of interval
linear systems of equations, in the case where the coefficient matrix is an H-matrix, is given
by the following theorem of Ning and Kearfott [NK97].
Theorem 3.2.2 (Ning-Kearfott [NK97]). Let A ∈ IR^{n×n} be an H-matrix, b ∈ IR^n a
right-hand side,

    u = ⟨A⟩⁻¹ |b|,    di = (⟨A⟩⁻¹)ii,

and

    αi = ⟨A⟩ii − 1/di,    βi = ui/di − |bi|,

where ⟨A⟩ is the comparison matrix of A, which is defined by

    ⟨A⟩i,j = mig(Ai,j)   for i = j,
    ⟨A⟩i,j = −mag(Ai,j)  for i ≠ j.

Then Σ(A, b) is contained in the vector x with components

    xi = (bi + [−βi, βi]) / (Aii + [−αi, αi]).

Moreover, if the midpoint of A is diagonal then Σ(A, b) = x.
Applying the Ning-Kearfott theorem requires an upper bound for ⟨A⟩⁻¹, which
amounts in fact to computing an upper bound for |A⁻¹| [Neu90, Sections 3.6-3.7, pp. 104-114].
To that end, according to [Neu99], an approximate inverse of A must
be computed, which costs 2n³ FLOPs. This renders the second stage more costly than the
first stage. In return, the second stage allows to obtain a tight enclosure of the solution of
the residual system, even for ill-conditioned problems.
In the following sections, we will introduce our proposed method, which avoids having to
compute such a tight enclosure of the solution of the residual system without reducing the
accuracy of the result. In order to obtain an accurate solution, our approach is also based on
iterative refinement. Therefore, floating-point iterative refinement will be briefly recalled
prior to explaining how to adapt it to interval arithmetic.

3.3 Floating-point iterative refinement


Iterative refinement is a technique for improving a computed solution x̃ of a linear system
Ax = b. First, the approximate solution x̃ is computed by some method. Then, the residual
r = b − Ax̃ of the system for this approximation is computed. The exact error e is the
solution of a linear system involving the matrix A, with r as the right-hand side: Ae = r.
By solving, again approximately, the residual system, a correction term ẽ is obtained and
is used to update the floating-point approximation. This method is sketched in Algorithm
3.1 using MatLab notations. In particular, A \ b means the solution of the linear system
of equations Ax = b computed by some method.
If r and ẽ could be computed exactly, then obviously A(x̃ + ẽ) = Ax̃ + r = b. Hence,
the iteration would converge in just one step. Nevertheless, because of rounding errors,
neither of them is computed exactly.
In the first versions of this method, floating-point iterative refinement was used with
Gaussian elimination. Its convergence conditions are provided in [Mol67, Ske80]. Higham

Algorithm 3.1 Classical iterative refinement


x̃ = A \ b
while (stopping criterion not verified) do
r = b − Ax̃
ẽ = A \ r
x̃ = x̃ + ẽ
end while

Higham [Hig97] gives a more thorough and general analysis for a generic solver and for both fixed
and mixed computing precision. In fixed precision iterative refinement, the working pre-
cision is used for all computations. In mixed precision iterative refinement, residuals are
computed in twice the working precision.
First, it is stated in [Mol67, Hig97] that the rates of convergence of mixed and fixed
precision iterative refinements are similar. It is also stated in [Mol67] that the computing
precision used for residual computations does not materially affect the rate of convergence.
However, the computing precision used for residual computations affects the accuracy of
results. In this section, only forward errors on x̃ are of concern, namely the relative error ‖x − x̃‖∞/‖x‖∞ and the componentwise relative error ‖((xi − x̃i)/xi)‖∞.
Indeed, given a matrix A of dimension n which is not too ill-conditioned, the relative error, after convergence, of the mixed precision iterative refinement is of order ‖x − x̃‖∞/‖x‖∞ ≈ ε, with ε being the relative machine error related to the working precision [Hig97]. For fixed precision iterative refinement, the relative error, after convergence, is only of order ‖x − x̃‖∞/‖x‖∞ ≈ 2nε cond(A, x): a relative error of order ε is no longer ensured. Here cond(A, x) denotes the condition number of the linear system when only A
is subject to perturbations [Ske79]. However, this relative error bound is the best that can
be obtained without using higher precision. Usually, fixed precision iterative refinement is
used to get a stable solution of linear systems, such as in [LD03, LLL+ 06]. Indeed, only
one iteration of iterative refinement with only fixed precision accumulation of the residual
suffices to make Gaussian elimination componentwise backward stable [Ske80].

3.4 Interval iterative refinement


The idea of the interval version of iterative refinement is to compute an enclosure of the
error term instead of approximating it. This enclosure of the error, added to the approxi-
mate solution, yields an enclosure of the exact result. The algorithm proceeds as follows:
a floating-point approximation x̃ of the solution is computed first. Then the residual is
computed using interval arithmetic: it contains the exact residual. The residual system
is now an interval linear system. Hence, the solution e of this interval linear system con-
tains the exact correction of the floating-point approximation. Thus, x̃ + e contains the
exact solution of the original system. Finally, the floating-point approximate solution x̃ is
updated by adding the midpoint of e to it, meanwhile e is centered to zero.
Actually, when computing the initial solution, the residual system is usually pre-
multiplied by R, a floating-point approximate inverse of A, and that multiplication is
performed using interval arithmetic. This operation leads to another interval system:

Ke = z, where K = [RA], z = [Rr]. (3.2)



Algorithm 3.2 Interval iterative refinement


1: x̃ = A \ b
2: while (stopping criterion not verified) do
3: r = [b − Ax̃] % interval computation
4: e=A\r
5: x̃ = x̃ + mid (e)
6: e = e − mid (e)
7: end while
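For instance, line 3 of Algorithm 3.2 can be written in one line with IntLab; this is only the plain working-precision version, whereas our implementation evaluates the residual in twice the working precision (see Section 3.5.1). Here xt denotes the floating-point approximation x̃.

    % Enclosing the exact residual: promoting A to an interval matrix makes every
    % operation outward rounded, so r contains b - A*xt.
    r = b - intval(A) * xt;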

The goal is to make the iteration contracting and thus to ensure its convergence. However,
the system now has an interval matrix and solving such a system is NP-hard [RK95, Roh05].
For A ∈ IRn×n , b ∈ IRn , the solution of the interval linear system Ax = b, denoted
by Σ(A, b), is the set of all the solutions of all the possible punctual linear systems taken
from the interval system:
Σ(A, b) = {x ∈ Rn | ∃A ∈ A, ∃b ∈ b : Ax = b}.
Usually, Σ(A, b) is not an interval vector. It is not even a convex set.
Example 5. Consider the following example:

    [ 2         [−1, 1] ] [ x1 ]   [ [−2, 2] ]
    [ [−1, 1]   2       ] [ x2 ] = [ 0       ]        (3.3)

The solution set of the above system is given by

    Σ(A, b) = {x ∈ R^2 | 2|x2| ≤ |x1|, 2|x1| ≤ 2 + |x2|},

which is depicted in Figure 3.1 by the grayed out region.
If A is regular, i.e. every A ∈ A is regular, then Σ(A, b) is also bounded, since A and b are bounded intervals. Hence, the hull of the solution set is defined as follows:

    A^H b = hull(Σ(A, b)) = [inf(Σ(A, b)), sup(Σ(A, b))].        (3.4)

The hull of the solution set of the system (3.3) is depicted by the dashed rectangular box in Figure 3.1. Let A^-1 denote the matrix inverse of A, which is

    A^-1 = {Ã^-1 | Ã ∈ A};

then we have the following inclusion [Neu90, Eq. (6), Section 3.4, p.93]:

    A^H b ⊆ A^-1 b,    with equality if A is thin.
Algorithms for solving interval linear systems return a box containing the convex hull
of the solution set (3.4), which is thus not the minimal enclosure. Direct algorithms to
enclose the solution of an interval linear system exist [Han92, Roh93, Neu99]. Here we
prefer an iterative refinement method, which reuses some of the previous floating-point
computations. In our case, e is usually relatively small with respect to the approximate
solution of the original system x̃. Hence ideally, the enclosure of the error corresponds to
the last bits of x̃, and does not need to be too tight. Indeed, as it will be shown in the
next section, such an enclosure of that convex hull can be obtained at a cost of O(n2 )
operations. This complexity is asymptotically negligible compared to the cost of solving
the original system using floating-point arithmetic.

Figure 3.1: Solution set of the interval linear system (3.3).

3.4.1 Numerical behavior


Although we switch from floating-point arithmetic to interval arithmetic, we show in this section that, under some conditions, the numerical behavior of x̃ in the interval iterative refinement is similar to its behavior in the floating-point refinement.

Lemma 3.4.1. Let e∗ be the exact solution of the system Ae = mid(r). If ẽ is a solution
of the system Ae = r then 2e∗ − ẽ is also a solution of the system Ae = r.

Proof. e∗ is the exact solution of the system Ae = mid(r) hence Ae∗ = mid(r).
ẽ is a solution of the system Ae = r, which means that there exists a vector r̃ ∈ r such
that Aẽ = r̃.
Moreover, r̃ ∈ r ⇒ |r̃−mid(r)| ≤ rad(r) ⇒ mid(r)+(mid(r)− r̃) ∈ r ⇒ 2mid(r)− r̃ ∈ r.
Replacing mid(r) and r̃ by Ae∗ and Aẽ respectively, we get

2Ae∗ − Aẽ ∈ r
⇒ A(2e∗ − ẽ) ∈ r.

Therefore 2e∗ − ẽ is also a solution of the system Ae = r.

The above lemma means that the exact solution of the system Ae = mid(r) is the midpoint of the solution set of the system Ae = r. Hence, if we can obtain a good enclosure of the solution of the system Ae = r, then

    mid(A \ r) = e* ≈ A \ mid(r).

r is an enclosure of the residual at x̃; it is computed as

    r = [∇(b + A(−x̃)), Δ(b + A(−x̃))],

where ∇ and Δ denote evaluation in downward and upward rounding mode respectively; hence

    mid(r) ≈ b − Ax̃.

It means that mid(r) is an approximation of b − Ax̃.


Therefore, ignoring the interval parts, Algorithm 3.2 can be considered as analogous to
the following algorithm

x̃ = A \ b
while (stopping criterion not verified) do
r̂ = b − Ax̃
ê = A \ r
x̃ = x̃ + ê
end while

One can see that the numerical behavior of x̃ in the above algorithm is similar to that
of the floating-point iterative refinement.

3.4.2 Updating the computed solution


After each step the approximate solution is updated, and also the error bound on it (lines
5 and 6 of Algorithm 3.2). The purpose of this update is to ensure that x̃ + e always
contains the exact solution in its interior. It is worth noting here that, using floating-point
arithmetic, all the punctual operations are subject to rounding errors. Thus to have a
correctly updated solution, interval operations must be used. Lines 5 and 6 of Algorithm
3.2 should be replaced by the following snippet.

x̃′ = x̃ + mid(e)
e = e − [x̃′ − x̃]
x̃ = x̃′

Using [x̃′ − x̃] instead of mid(e) seems to inflate the interval a little bit. In fact, if mid(e) is small in magnitude with respect to x̃, i.e. |mid(e)| ≤ |x̃|/2, which is often the case, then x̃ and x̃′ have the same sign and even satisfy x̃/2 ≤ x̃′ ≤ 2x̃; thus, following the Sterbenz lemma [Ste74], x̃′ − x̃ is exact in floating-point arithmetic.
Moreover, the following inclusion holds:

    x̃′ + (e − [x̃′ − x̃]) ⊇ x̃′ + (e − (x̃′ − x̃)) = x̃ + e.

Hence the inclusion property is ensured by this updating process. This updated error
bound will serve as the starting point for the next step, or as the initial solution for the
next refinement: there is no need to recompute it.
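A possible IntLab rendering of this update is sketched below; xt denotes the floating-point approximation x̃ and e the current error enclosure (names ours).

    xnew = xt + mid(e);              % update the floating-point approximation
    e    = e - (intval(xnew) - xt);  % e - [x' - x], evaluated in interval arithmetic
    xt   = xnew;                     % xt + e still contains the exact solution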
Still, to start the refinement, an initial solution of the residual system is needed. In
the following section, we explain how to compute an initial solution of an interval linear
system.

3.4.3 Initial solution


The determination of an initial enclosure of the solution is a main issue in the refinement
method. The entire real line could be considered as an initial enclosure of each component,
but formulae shown in Section 3.4.4 cannot refine it.
The necessary condition to be able to compute an initial enclosure of the interval
linear system Ke = z is that K is invertible, i.e. each real matrix K ∈ K is invertible.
Nevertheless, checking the invertibility of an interval matrix is NP-hard [Roh96, Roh05].

However, if some sufficient condition is satisfied, an initial enclosure of the solution


of an interval linear system can be computed. For this purpose, we apply the following
proposition to the original, real, matrix A and to the interval right-hand side r.

Proposition 3.4.2 ([Neu90, Prop. 4.1.9, p.121]). Let A ∈ IR^{n×n} be an interval matrix of dimensions n × n and let C, C′ ∈ R^{n×n}.

1. If CAC′ is an H-matrix then, for all b ∈ IR^n, we have

    |A^H b| ≤ |C′| ⟨CAC′⟩^-1 |Cb|,

2. if ⟨CAC′⟩u ≥ v > 0 for some u ≥ 0 then

    |A^H b| ≤ ‖Cb‖v · |C′|u,

    A^H b ⊆ ‖Cb‖v · |C′|[−u, u],

where A^H b is the convex hull of the solution set of the system Ax = b, ⟨A⟩ is the comparison matrix of A, whose components are defined by:

    ⟨A⟩i,i = min(|ai,i|, ai,i ∈ Ai,i),    ⟨A⟩i,j = −max(|ai,j|, ai,j ∈ Ai,j) for j ≠ i,

and ‖b‖v is the scaled norm with respect to v: ‖b‖v = max(|bi|/vi), i ∈ 1, . . . , n.

Following this proposition, a sufficient condition to be able to compute an initial solution is that we exhibit C and C′ such that CAC′ is an H-matrix. In our algorithm, we use C = R, with R an approximate inverse of A, and C′ = I. If R is a good approximation of the inverse of A, then CAC′ = RA ≈ I is an H-matrix. We also need to exhibit u and v as in the second part of the proposition, to get e = ‖Rr‖v · [−u, u] as an initial enclosure of the system Ae = r.
Because all computations are performed using floating-point (as opposed to exact)
arithmetic, it is necessary to compute RA using interval arithmetic in order to get a
guaranteed result. So the previous proposition is modified as shown below.

Proposition 3.4.3. Let A ∈ F^{n×n} and let R ∈ F^{n×n} be a floating-point approximate inverse of A. If ⟨[RA]⟩u ≥ v > 0 for some u ≥ 0 then:

    |A^-1 r| ≤ ‖Rr‖v · u,
    A^-1 r ⊆ ‖Rr‖v · [−u, u].

What remains now is to find a positive vector u such that ⟨[RA]⟩u > 0. Using [Neu90, Prop. 3.7.3, p.112], u is set to an approximate solution ũ of the linear system ⟨[RA]⟩u = e, where e = (1, . . . , 1)^T. Because two important properties of an H-matrix A are that ⟨A⟩ is regular and that ⟨A⟩^-1 e > 0, ũ satisfies ũ ≈ ⟨[RA]⟩^-1 e > 0 and ⟨[RA]⟩ũ ≈ e > 0, for sufficiently good approximations ũ.
In our case, A is a floating-point matrix. If A is well preconditioned, or more precisely if RA is close to the identity, or diagonally dominant, then it suffices to use the vector (1, . . . , 1)^T as the value of u and the product ⟨[RA]⟩u is positive. If the test of positivity of ⟨[RA]⟩u fails with u = (1, . . . , 1)^T, then ⟨[RA]⟩ is not a diagonally dominant matrix. However, it

does not mean that ⟨[RA]⟩ is not an H-matrix. In this case, we apply some floating-point Jacobi iterations in order to compute an approximate solution of the system ⟨[RA]⟩u = e ≡ (1, . . . , 1)^T. Let D, E be the diagonal and off-diagonal parts of ⟨[RA]⟩. The Jacobi iterations can be expressed by the following formula, where u is the previous iterate, u′ is the new iterate, and v is the product of ⟨[RA]⟩ and u:

    u′ = D^-1(e − Eu) = u + D^-1(e − (D + E)u) = u + D^-1(e − v).

Let K = ⟨[RA]⟩ be the comparison matrix of [RA]; the algorithm to determine u is detailed in Algorithm 3.3.

Algorithm 3.3 Testing the H-matrix property


D = diag(K)
e = (1, . . . , 1)T
u=e
v =K ∗u
nbiter = 10
i=0
while ( any(v ≤ 0) && i < nbiter ) do
i=i+1
u = abs(u + (e − v)./D)
v =K ∗u
end while

Note that, in order to ensure that u is a non-negative vector, we take the absolute value
of the updated solution after each refinement. Moreover, as it is not necessary to compute
an accurate solution, the refinement stops as soon as Ku > 0.
If the Jacobi iterative refinement fails to obtain a vector u ≥ 0 such that v = K ∗ u > 0
after 10 iterations, then our algorithm stops and issues a message of failure. Each Jacobi
refinement costs only O(n2 ) floating-point operations, hence the overhead of execution time
introduced by this refinement is negligible.
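For illustration, the comparison matrix K = ⟨[RA]⟩ used by Algorithm 3.3 and the cheap diagonal-dominance test with u = (1, . . . , 1)^T could be written as follows (IntLab assumed; iK denotes the interval matrix [RA], a name of ours).

    n = size(iK, 1);
    K = -mag(iK);                  % off-diagonal entries: -mag([RA]_ij)
    K(1:n+1:end) = mig(diag(iK));  % diagonal entries:      mig([RA]_ii)
    u = ones(n, 1);
    v = K * u;                     % v > 0 means <[RA]> is diagonally dominant
    if any(v <= 0)
        % fall back to the floating-point Jacobi iterations of Algorithm 3.3
    end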
Nevertheless, Proposition 3.4.3 does not yield a tight enclosure of the solution. The
next section presents methods to improve this interval enclosure.

3.4.4 Interval iterative improvement


There exist several interval iterative improvement methods, such as the Jacobi and Gauss-Seidel methods (introduced below) or the Krawczyk method. The Krawczyk method has the quadratic approximation property [Neu90, Section 2.3, p.56], but the Gauss-Seidel iteration always yields tighter intervals than the Krawczyk iteration when applied to a preconditioned system [Neu90, Theorem 4.3.5, p.138].
The essence of interval iterative improvement methods is to compute an improved
enclosure of the solution from a given initial enclosure of the solution. Given an initial
enclosure e to the interval linear system Ke = z, an improved approximate enclosure is
obtained by writing the linear system satisfied by

ẽ ∈ e : ∃K̃ ∈ K, ∃z̃ ∈ z : K̃ ẽ = z̃.



Developing the i-th line and separating the i-th component, one gets:

    ẽi = ( z̃i − Σ_{j=1}^{i-1} K̃i,j ẽj − Σ_{j=i+1}^{n} K̃i,j ẽj ) / K̃i,i,    with ẽi ∈ ei.

Replacing punctual terms by the corresponding interval terms yields the formula of the interval Jacobi iteration:

    ẽi ∈ e′i := ( ( zi − Σ_{j=1}^{i-1} Ki,j ej − Σ_{j=i+1}^{n} Ki,j ej ) / Ki,i ) ∩ ei.

Taking into account the fact that, for the components that have already been refined, ẽj belongs both to the original value ej and to the refined value e′j, ej can be replaced by e′j for j < i to obtain the Gauss-Seidel iteration:

    e′i := ( ( zi − Σ_{j=1}^{i-1} Ki,j e′j − Σ_{j=i+1}^{n} Ki,j ej ) / Ki,i ) ∩ ei.        (3.5)

Taking the intersection with the former iterate ei implies the contracting property of
both Jacobi and Gauss-Seidel iterations. Hence, both iterations converge. Nevertheless,
making full use of the refined values, Gauss-Seidel iterations converge much more quickly
than Jacobi iterations. As mentioned in Section 3.4.3, our sufficient condition to compute
an initial solution of the interval residual system is that [RA] is an H-matrix. Under this
condition, Gauss-Seidel iterations converge very quickly.
Let us now detail the complexity of an iteration. For both Jacobi and Gauss-Seidel iterations, the improvement of each component requires n − 1 interval multiplications, n interval additions and one interval division. Thus, in total, each Jacobi/Gauss-Seidel iteration costs O(n^2) interval operations. Hence, theoretically, the improvement stage should not affect the overall cost of the method, which is O(n^3).
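A direct, unoptimized MatLab/IntLab sketch of one Gauss-Seidel sweep (Eq. (3.5)) is given below; K, z and e are interval quantities and the function name is ours.

    function e = gauss_seidel_sweep(K, z, e)
      n = length(z);
      for i = 1:n
        idx = [1:i-1, i+1:n];                % all indices except i
        s   = z(i) - K(i,idx) * e(idx);      % components j < i are already refined
        e(i) = intersect(s / K(i,i), e(i));  % contract with the previous enclosure
      end
    end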

3.4.5 Detailed algorithm


The complete algorithm we propose is given below, it uses all the building blocks introduced
above.

Complexity
The complexity of the whole algorithm is determined by counting the number of operations
for each step. To be more precise, we distinguish here several types of operations, namely:

• FLOP a floating-point operation in general,


• FMUL a floating-point multiplication,
• FADD a floating-point addition,
• IOP an interval operation in general,
• IMUL an interval multiplication,
• IADD an interval addition.

Algorithm 3.4 Solving and verifying a linear system


1: Compute the LU decomposition of A: A = L × U
2: Compute an approximation x̃ with a forward and a backward substitution using L and
U
3: Compute R, an approximate inverse of A, by solving RL = inv(U ) [Hig02, Chapter
14]
4: Pre-multiply A by R: K = [RA]
5: Test if K is an H-matrix by computing a non-negative vector u such that ⟨K⟩u ≥ v > 0
6: if (fail to compute u) then
7: print Failed to verify the solution. The system may be either singular or too ill-
conditioned.
8: return
9: end if
10: Compute the residual r = [b − Ax̃] in double the working precision
11: Pre-multiply the residual by R: z = Rr
12: Compute an initial error bound e = ‖z‖v · [−u, u]
13: while (not converged) do
14: Apply 5 Gauss-Seidel iterations on K, z, and e
15: Update x̃ and e
16: Recompute r and z
17: end while

The LU decomposition of a floating-point matrix costs 4/3 n^3 FLOPs.
Forward and backward substitutions using the upper and lower triangular matrices cost n^2 + O(n) FLOPs each [Hig02, chap.8].
Matrix inversion using the LU decomposition costs 8/3 n^3 FLOPs.
Pre-multiplying A by R, assuming that we use directed rounding modes, requires two floating-point matrix multiplications, one for the upper bound and the other for the lower bound. Hence, in total, the preconditioning requires 4n^3 FLOPs.
Computing the comparison matrix of K requires 2n^2 + O(n) FLOPs.
Testing if ⟨K⟩ is diagonally dominant is equivalent to a matrix-vector operation. Hence it costs 2n^2 + O(n) FLOPs.
Computing the residual in interval arithmetic requires computing the residual twice in floating-point arithmetic, once in downward rounding mode and once in upward rounding mode. Moreover, to increase the result accuracy, the residual is computed in twice the working precision. Therefore, we count it as 2n^2 + O(n) FLOP2s, where FLOP2 denotes a floating-point operation in twice the working precision.
Pre-multiplying the residual is a floating-point matrix-interval vector multiplication, which costs 2n^2 + O(n) FLOPs.
Computing the scaled norm of a vector costs only O(n) FLOPs.
For each refinement iteration:

• each Jacobi/Gauss-Seidel iteration costs n^2 IMULs + n^2 IADDs + O(n) IOPs,

• the updating process requires only O(n) FLOPs,

• recomputing the residual system requires (2n^2 + O(n)) FLOP2s + (2n^2 + O(n)) FLOPs.

As will be shown in Section 3.5.1, a floating-point operation in twice the working precision can be implemented by a few floating-point operations in the working precision, i.e. 1 FLOP2 = O(1) FLOPs. One can see that the complexity of the algorithm is 8n^3 + O(n^2) FLOPs + O(n^2) IOPs. Meanwhile, solving a linear system using floating-point arithmetic costs 4/3 n^3 + O(n^2) FLOPs. Therefore, the theoretical slowdown factor of solving a linear system in interval arithmetic with respect to floating-point arithmetic is 6.

Stopping criteria
As mentioned in Section 3.4, the width of the interval error decreases after each step. So
it is a non-negative non-increasing series. This property leads to two stopping criteria (for
a study of stopping criteria, see [DHK+ 06]).
Firstly, we stop the computation as soon as a required number of accurate bits is reached. For example, when working with double floating-point numbers, there is no point in getting a result which is accurate to more than 52 bits. Using interval arithmetic, it is quite easy to compute the number of exact bits in the result via componentwise relative errors:

    nb_bits = −log2( max_i ( rad(ei) / |x̃i| ) ).        (3.6)
Nevertheless, the program can end up without reaching the required number of exact
bits. Hence, we have to detect whether the iteration yields extra correct digits or stagnates.
Let e and e′ be two successive iterates; a second stopping criterion is that no more correct digits are obtained in any component of the approximation:

    max_j ( ( rad(e′j) − rad(ej) ) / |x̃j| ) < ε,        (3.7)

with ε being the machine relative error, which is 2^-53 for the IEEE 754 binary double precision format.
In cases where none of these two criteria above is matched, it is necessary to stop
the computation when one has tried enough. Practically, in our implementations, the
maximum number of iterations is set to 10. Also, the maximum number of Gauss-Seidel
iterations is set to 5 due to its quick convergence property.
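In MatLab, with e and e_old two successive interval error enclosures and xt the current approximation (names ours), these two tests could look as follows, assuming no component of xt is zero.

    nb_bits   = -log2(max(rad(e) ./ abs(xt)));                      % Eq. (3.6)
    stagnates = max(abs(rad(e) - rad(e_old)) ./ abs(xt)) < 2^-53;   % Eq. (3.7)
    converged = (nb_bits >= 52) || stagnates;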

3.5 Numerical experiments


In what follows, the implementation in MatLab, using the interval library IntLab, of Al-
gorithm 5.1, is called certifylss. The computing precision used for all computations is
the IEEE double floating-point precision, except for the residual computation, which is
computed in twice the working precision. The matrices used in the tests are generated by the MatLab function gallery('randsvd', dim, cond), where dim is the matrix dimension (taken as 1000), and cond is the condition number, which varies between 2^5 and 2^50.

3.5.1 Implementation issues


Higher precision for the residual
At the moment, almost no processor supports native quadruple precision. Hence, in order
to obtain a higher precision for the residual computation, we use simulated double-double

precision, which is based on error-free transforms according to [Dek71, HLB01, LDB+02, ORO05, LL08b, LL08a, Rum09b, Rum09a, NGL09, GNL09].
More precisely, it uses the error-free transforms for the sum and for the product of two floating-point numbers. The error-free transform of the sum of two floating-point numbers, called TwoSum, costs 6 FADDs. Moreover, the error-free transform of the product of two floating-point numbers, called TwoProd, costs 10 FADDs and 7 FMULs, unless an operation of type fused multiply-add is available [BM05]. Hence the cost of computing the residual, accounted as 2n^2 + O(n) FLOP2s in the previous section, translates into 7n^2 FMULs + 16n^2 FADDs when using simulated double-double precision. Hence, in theory, this cost can still be considered as O(n^2) FLOPs.
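For reference, here are minimal MatLab versions of these two error-free transforms (rounding to nearest assumed); the residual code below uses the same Veltkamp splitting and the same formulas inline.

    function [s, e] = TwoSum(a, b)        % s + e = a + b exactly
      s = a + b;
      v = s - a;
      e = (a - (s - v)) + (b - v);
    end

    function [p, e] = TwoProd(a, b)       % p + e = a .* b exactly
      [a1, a2] = Split(a);
      [b1, b2] = Split(b);
      p = a .* b;
      e = a2 .* b2 - (((p - a1 .* b1) - a2 .* b1) - a1 .* b2);
    end

    function [h, l] = Split(a)            % Veltkamp splitting for binary64
      c = (2^27 + 1) .* a;
      h = c - (c - a);
      l = a - h;
    end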
function res = residual(A, x, b)
  setround(0)
  factor = 2^27 + 1;

  % Split the vector -x into two parts x1 and x2:
  % x1 holds the first (high order) half of the mantissa,
  % x2 holds the second (low order) half of the mantissa.
  x = -x;
  y = factor * x;
  xbig = y - x;
  x1 = y - xbig;
  x2 = x - x1;

  n = size(A, 1);

  r = b;

  er2 = zeros(n, 1);
  er1 = zeros(n, 1);
  for i = 1 : n
    setround(0);
    % Split a column of A
    a = A(:, i);
    a2 = a * factor;
    abig = a2 - a;
    a1 = a2 - abig;
    a2 = a - a1;

    % EFT for the product a * x(i)
    p_h = a * x(i);
    p_l = a2 * x2(i) - (((p_h - a1 * x1(i)) - a2 * x1(i)) - a1 * x2(i));

    % EFT for the sum of the high parts of the products
    s = r + p_h;
    v = s - r;
    e = (r - (s - v)) + (p_h - v);
    r = s;

    % Accumulate the neglected terms with directed rounding to get an enclosure
    setround(1);
    er2 = er2 + e + p_l;
    er1 = (er1 - e) - p_l;
  end

  res = r + infsup(-er1, er2);
end

However, it is much worse in practice. The implementation of error-free transforms


needs separate access to columns and rows of the matrix, which is disadvantageous in MatLab.
That causes the residual computation to be very slow. For example, on a computer Intel(R)
Core(TM)2 Quad CPU Q9450 of frequency 2.66GHz, a product between two matrices of
dimensions 1000 × 1000 takes 0.376557 seconds, a product between a matrix of dimensions
1000 × 1000 and a vector of 1000 elements takes 0.001773 seconds. Meanwhile the residual
computed by function residual with the same dimensions takes 0.195453 seconds, which
is roughly half the execution time of matrix multiplication.

Gauss-Seidel vs Jacobi
As mentioned in Section 3.4.5, each Gauss-Seidel iteration as well as each Jacobi iteration costs n^2 IMULs + n^2 IADDs + O(n) IOPs. Nevertheless, making full use of refined values, Gauss-Seidel iterations converge more quickly than Jacobi iterations. In our implementation, this observation still holds in terms of number of iterations. On the other hand, in terms of execution time, it does not always hold, because of parallelism and memory-access issues in MatLab. Each Jacobi iteration can be translated directly into a
matrix-vector multiplication. Hence its performance relies totally on the underlying library.
Meanwhile, a Gauss-Seidel iteration is translated into a series of dot products, which
reduces the possibility of parallelism. Moreover, accessing rows individually in MatLab
is disadvantageous. Hence, in an implementation in MatLab, a Gauss-Seidel iteration is
much slower than a Jacobi iteration.
Therefore, for problems of moderate dimensions, Jacobi iterations are preferable to Gauss-Seidel iterations in MatLab. Their slower convergence with respect to Gauss-Seidel iterations can be compensated by increasing the number of iterations, while still
gaining in execution time. This phenomenon has been observed in [BT89, chap.2, Section
2.4 and 2.5, pp.130–142] for parallel implementation of classical iterative methods.
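The contrast can be seen on one vectorized interval Jacobi step, which reduces to a single interval matrix-vector product; the enclosure below is slightly wider than Eq. (3.5) because K*e and d.*e are evaluated independently (IntLab assumed, names ours).

    d  = diag(K);                       % interval vector of diagonal entries
    e1 = (z - (K * e - d .* e)) ./ d;   % encloses (z_i - sum_{j~=i} K_ij e_j) / K_ii
    e  = intersect(e, e1);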

3.5.2 Experimental results


Figure 3.2 depicts results computed by our certifylss function, by the function verifylss
of the IntLab library with the residual being computed in quadruple precision, and non-
verified results computed by MatLab. Using higher precision for residual computation,
verified results provided by the two verified functions: certifylss and verifylss are
much more accurate than non-verified results provided by MatLab, at the price of a higher
execution time. In this experimental set, the execution time of a call to the MatLab’s
function for computing an approximate solution is about 0.1s.
When the coefficient matrix is not too ill-conditioned, i.e. when the condition number is less than 2^37 in these experiments, both verified functions provide the same accuracy (52 bits). It means that both verified functions succeed in verifying the last bit of the approximate solution. Nevertheless, certifylss is better than verifylss in terms of execution time.

[Figure: two panels showing, as functions of log2(condition number), the accuracy in bits and the execution time in seconds of matlab, verifylss, and certifylss.]
Figure 3.2: MatLab implementation results for 1000 × 1000 matrices.

When the condition number increases, an accuracy of 52 bits cannot be obtained any
more. However, the execution time for certifylss becomes higher than the execution
time for verifylss. This is explained by the fact that verifylss uses only floating-point
iterations to refine the approximate solutions, while certifylss uses alternately floating-
point iterations and interval iterations to refine the approximate solutions, and at the
same time the error bounds. As the condition number increases, the number of iterations
also increases. Interval operations are slower than floating-point operations, hence, the
execution time of certifylss increases faster than the execution time of verifylss along
with the increase of the number of iterations.
When the coefficient matrix is not too ill-conditioned, the factor of execution time of
certifylss with respect to the non-verified function of MatLab is about 8. Nevertheless,
when the condition number increases, this theoretical factor is no longer satisfied. It is not
only that interval operations are slow, but also that as the condition number increases, the
number of iterations as well as the time of recomputing the residual increase. As explained
in the previous section, residual computing is costly and it introduces a significant overhead
of execution time to both certifylss and verifylss functions.
When the condition number gets close to 1/ε, both functions fail to verify the solution. Indeed, a good approximation of the inverse of A cannot be obtained, thus the sufficient condition that RA is an H-matrix does not hold.

3.6 Conclusion
Combining floating-point iterative refinement and interval improvement, we are able to
provide accurate results together with a tight error bound upon the approximate solutions.
This error bound, which is indeed an interval vector, enables us to state how many guar-
anteed bits there are in the approximate solutions. When the condition number of the
system is small to moderate, our method succeeds in verifying up to the last bit of the
approximate solution. It means that the computed solution is the exact solution rounded
to a floating-point vector in faithful rounding mode.
Using alternately floating-point iterative refinement and interval improvement also en-
ables us to detect easily when to terminate the refinements, by counting the number of
guaranteed bits as well as by checking improvement in the width of the error bound.
Moreover, our method also checks the feasibility before performing any further refinement.
Therefore our method exhibits better performance than Rump’s algorithm when the matrix
is either well-conditioned or too ill-conditioned.
On the other hand, when the matrix is ill-conditioned, the computation time of our
method increases very fast and becomes much slower than Rump’s algorithm. That is
because the proposed method uses intensively interval operations. The next chapter will
present a technique to overcome this problem.
Chapter 4

Relaxed method to verify the solution of a linear system

This chapter presents an improvement of the method presented in Chapter


3. Due to the intensive use of interval operations, the execution time of the
function certifylss increases substantially when the condition number of the
coefficient matrix increases. We introduce a relaxation technique to reduce this
effect. This technique consists in inflating the matrix of the pre-multiplied sys-
tem in such a way that it is centered in a diagonal matrix. Then a product
of that inflated matrix with another interval vector is equivalent to a floating-
point matrix vector product in terms of execution time. The relaxed version of
the function certifylss, also implemented using the IntLab library, is called
certifylss_relaxed. Numerical results show that, when applied to the resid-
ual system, the relaxation technique helps to reduce significantly the execution
time of the method without deteriorating the accuracy of the final results.

4.1 Introduction

The interval improvement step is used to improve the error bound upon a computed ap-
proximation. Nevertheless, an intensive use of interval operations makes the interval im-
provement costly in terms of execution time. Especially when the condition number of the
system is high, the number of iterations is high. The overhead of execution time introduced
by the interval improvement becomes significant and even larger than the execution time
of computing a tight error bound enclosure according to [Neu99].
In this chapter, we introduce a technique, which we shall call the relaxation technique,
to reduce the overhead of execution time introduced by the interval improvement. By
nature, the error bound (i) is an enclosure of the difference between the approximation
and the exact solution and hence (ii) is used to update the approximation. This error
bound should correspond to lost bits in the approximation. Hence, it is not necessary to
compute an error bound with high accuracy. Thus we relax the tightness requirement on
this error bound to speed up the program.


4.2 Interval improvement: Jacobi and Gauss-Seidel iterations
Let A ∈ IRn×n be an H-matrix, b ∈ IRn , and e0 be an initial enclosure of the solution of
the interval linear system
Ae = b. (4.1)
The essence of interval improvement is to compute an improved enclosure of the solution from e0:

    e′ = {e ∈ e0 | ∃A ∈ A, ∃b ∈ b : Ae = b}.

Let D, L, U be respectively the diagonal, the lower triangular and the upper triangular parts of A; then A = L + D + U. The Jacobi iteration can be expressed by

    e′ = D^-1(b − (L + U)e0).

Likewise, the Gauss-Seidel iteration can also be expressed symbolically in the same way:

    e′ = D^-1(b − (Le′ + Ue0)).

D is a diagonal matrix; its inverse is also a diagonal matrix, whose diagonal elements are the inverses of the corresponding elements of D. Hence, the most expensive part of a Jacobi or Gauss-Seidel iteration is the interval matrix-vector multiplication by L and U. Even if we use Rump’s fast algorithm for interval matrix multiplication, the theoretical ratio of execution time between an interval matrix-vector product and a floating-point matrix-vector product of the same size is 4. However, if we take into
When applied to a preconditioned system, A is close to the identity, hence L, U are
close to zero. As it has been observed in Chapter 2 Eq. (2.4), an interval multiplication
between a centered-in-zero matrix and another interval matrix or vector is equivalent to
its floating-point counterpart in terms of execution time. Hence, to gain in performance,
the matrix of the system is enlarged, so as to have its off-diagonal elements centered in
0. Formulas for subsequent operations are thus simplified: the computed intervals are
symmetrical around 0 and only one endpoint is computed, using floating-point arithmetic.
More precisely, we will work with

    L̂ = [−|L|, |L|],    Û = [−|U|, |U|].        (4.2)

Lemma 4.2.1. For x ∈ IR, let x̂ = [−|x|, |x|] then

x ⊆ x̂ and |x| = |x̂|.

We will call x̂ the relaxed version of x.


Proof. From the definition of mag, for all x ∈ x we have |x| ≤ |x|, hence −|x| ≤ x ≤ |x|, i.e. x ∈ x̂. Therefore x ⊆ x̂.

It is obvious that |x̂| = |x|.



Hence, L ⊆ L̂ and U ⊆ Û. Denote Â = L̂ + D + Û; then A ⊆ Â, and we replace the original system by the relaxed system

    Âe = b.        (4.3)

Proposition 4.2.2. Let A, A′ ∈ IR^{n×n} be two regular matrices. Let b ∈ IR^n be an interval vector. If A ⊆ A′ then Σ(A, b) ⊆ Σ(A′, b).

Proof. For all x̃ ∈ Σ(A, b), there exist Ã ∈ A and b̃ ∈ b such that Ãx̃ = b̃. As A ⊆ A′, Ã ∈ A′. It implies that x̃ ∈ Σ(A′, b). Therefore Σ(A, b) ⊆ Σ(A′, b).


From Proposition 4.2.2, the solution of the original system (4.1) is included in the
solution of the relaxed system (4.3).
Example 6. Consider the following interval linear system:

    [ 4           [−0.5, 1] ] [ x1 ]   [ 4 ]
    [ [−0.5, 1]   4         ] [ x2 ] = [ 2 ].

The relaxed version of this system is

    [ 4         [−1, 1] ] [ x1 ]   [ 4 ]
    [ [−1, 1]   4       ] [ x2 ] = [ 2 ].        (4.4)

The solution set of the original system is depicted in Figure 4.1 by the grayed out
region. Meanwhile, the solution set of the relaxed system is depicted by the white region.
Figure 4.1: Solution set of the relaxed system.

One can see that the solution set of the relaxed system overestimates the solution set
of the original system. Nonetheless, if the off-diagonal elements of the original system are
almost centered in zero then the grayed out region is close to the white region.
Let us apply the interval improvement to this relaxed system.
The relaxed Jacobi iterations are

    e′ = D^-1(b − (L̂ + Û)e0).



The relaxed Gauss-Seidel iterations are

    e′ = D^-1(b − (L̂e′ + Ûe0)).

Since L̂, Û are centered in zero, L̂ + Û is also centered in zero. According to Section
2.3.1, the product between L̂, Û or L̂ + Û with any interval vector can be computed by
just one punctual matrix-vector product. For x ∈ IRn :

L̂x = [−|L||x|, |L||x|] ,


Ûx = [−|U||x|, |U||x|] ,
(L̂ + Û)x = [−|L + U||x|, |L + U||x|] .

Therefore the execution time of applying the interval improvement to the relaxed system
is much smaller than applying it to the original system. In theory, the factor of execution
time between an interval matrix-vector product with respect to a punctual matrix-vector
product is 4. The relaxed interval iterations also exhibit some advantages over the original
interval iterations regarding storage memory. As  is a matrix whose off-diagonal com-
ponents are centered in zero, to store Â, it is only needed to store its upper bound as a
floating-point matrix, and its diagonal as an interval vector.
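Concretely, the product of the centered part L̂ + Û with an interval vector x can be computed with a single floating-point matrix-vector product in upward rounding, for instance as follows (IntLab's setround and mag assumed; offd denotes the floating-point matrix |L + U|, a name of ours).

    setround(1)                 % upward rounding
    t = offd * mag(x);          % upper bound of |L + U| * |x|
    p = infsup(-t, t);          % p encloses (L^ + U^) * x, which is centered in zero
    setround(0)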

4.2.1 Convergence rate


Let us consider the case of the Jacobi iterations

    e′ = D^-1(b − (L̂ + Û)e0).

Since L̂ + Û is centered in zero, (L̂ + Û)e0 is centered in zero and

    |b − (L̂ + Û)e0| = |b| + |(L̂ + Û)e0| = |b| + |L + U||e0|.

Moreover, D^-1 is a positive diagonal matrix, hence

    |D^-1_ii| = |1/Dii| = 1/⟨Dii⟩,

which implies that |D^-1| = ⟨D⟩^-1. Therefore

    |e′| = ⟨D⟩^-1(|b| + |L + U||e0|).        (4.5)
Equation (4.5) is in fact the Jacobi iteration to compute an approximate solution of the system (⟨D⟩ − |L + U|)x = |b|, or ⟨Â⟩x = |b|. Therefore, the convergence rate of the relaxed interval Jacobi iterations is the same as the convergence rate of the real Jacobi iterations when applied to the system ⟨A⟩x = |b|.
Denote by u = ⟨A⟩^-1|b| the exact solution of the system ⟨A⟩x = |b|; then

    ⟨D⟩|e′| = |b| + |L + U||e0|
            = ⟨A⟩u + |L + U||e0|
            = (⟨D⟩ − |L + U|)u + |L + U||e0|
    ⇒ ⟨D⟩(|e′| − u) = |L + U|(|e0| − u).

If hAi is a diagonally dominant matrix, then the Jacobi iterations converge very quickly.
Denote P
k6=i |Aik |

θ = maxi i = (1, . . . , n) = khDi−1 |L + U|k∞
hAii i
then

k|e0 | − uk∞ = khDi−1 |L + U|(|e0 | − u)k∞


≤ khDi−1 |L + U|k∞ k|e0 | − uk∞
≤ θk|e0 | − uk∞ .

Therefore, the series k|e0 | − uk∞ converges more quickly than a geometric series of ratio
θ. Moreover, the Jacobi iterations converge even more quickly if the coefficient matrix is
close to the identity, since in this case θ is close to 0.
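The contraction factor θ can be estimated cheaply from the comparison matrix, for instance as follows (cK denotes the floating-point comparison matrix ⟨[RA]⟩, a name of ours; implicit expansion of recent MatLab versions assumed).

    d     = diag(cK);
    offd  = abs(cK) - diag(d);      % |L + U|: magnitudes of the off-diagonal entries
    theta = norm(offd ./ d, inf);   % theta = || <D>^-1 |L + U| ||_inf; fast convergence if theta << 1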

4.2.2 Accuracy
It is evident that using the relaxed system to improve the solution will produce a larger
enclosure of the solution than the original system. Nonetheless, in the context of our
method, the interval system is a preconditioned system, hence it is nearly centered in a
diagonal matrix if the original system is not too ill-conditioned.
This relaxation technique is not really a new idea. Although it is not explicitly stated so,
this relaxation technique is already used in [HS81, Han92, Neu99]. Let us recall Theorem
3.2.2: the solution of the original system (4.1) is bounded by

    Σ(A, b) ⊆ e^o,

where

    u = ⟨A⟩^-1|b|,    di = (⟨A⟩^-1)ii,
    αi = ⟨A⟩ii − 1/di,    βi = ui/di − |bi|,

and

    e^o_i = (bi + [−βi, βi]) / (Aii + [−αi, αi]).
As L̂ and Û are centered in zero, the midpoint of Â is a diagonal matrix. Thus, applying Theorem 3.2.2, the hull of the solution set of the interval system Âe = b is computed by

    Σ(Â, b) = ê,

where

    û = ⟨Â⟩^-1|b|,    d̂i = (⟨Â⟩^-1)ii,
    α̂i = ⟨Â⟩ii − 1/d̂i,    β̂i = ûi/d̂i − |bi|,

and

    êi = (bi + [−β̂i, β̂i]) / (Âii + [−α̂i, α̂i]).

We have that Â = L̂ + D + Û and thus Âii = Dii = Aii. Moreover, following Lemma 4.2.1, ⟨L̂⟩ = ⟨L⟩ and ⟨Û⟩ = ⟨U⟩, so ⟨Â⟩ = ⟨L̂⟩ + ⟨D⟩ + ⟨Û⟩ = ⟨L⟩ + ⟨D⟩ + ⟨U⟩ = ⟨A⟩. It means that û = u, d̂ = d, α̂ = α, β̂ = β. Hence e^o = ê. In other words, Theorem 3.2.2 computes an enclosure of the solution of (4.1) by computing the hull of the solution set of the relaxed system (4.3).
Take Example 6 again. Applying Theorem 3.2.2 to compute an enclosure of the solution, we have

    ⟨A⟩ = ( 4  −1 ; −1  4 ),    ⟨A⟩^-1 = ( 4/15  1/15 ; 1/15  4/15 ),
    u = ⟨A⟩^-1|b| = ( 18/15 ; 12/15 ),    d = ( 4/15 ; 4/15 ),
    α = ( 1/4 ; 1/4 ),    β = ( 1/2 ; 1 ),
    e^o_1 = (4 + [−1/2, 1/2]) / (4 + [−1/4, 1/4]) = [14/17, 6/5],
    e^o_2 = (2 + [−1, 1]) / (4 + [−1/4, 1/4]) = [4/17, 4/5].

The interval vector e^o is in fact the hull of the solution set of the relaxed system (4.4), which is depicted by the outer dashed box in Figure 4.1. Meanwhile, the exact hull of the solution set of the original system is

    e* = ( [14/15, 68/63] ; [4/15, 40/63] ),

which is depicted by the inner dash-dotted box. It is obvious that e* is included in e^o. Therefore, it is suggested by Neumaier [Neu99] that if the midpoint of the coefficient matrix is not a diagonal matrix, then some Gauss-Seidel iterations should be applied in order to obtain a tighter enclosure of the solution.
Nevertheless, the computed result of the interval improvement is not the hull of the solution of the system Âx = b. The Jacobi iterations converge to an interval vector which is the solution of the following system of equations:

    ei = (bi − Σ_{k≠i} Âik ek) / Âii,    i = 1, . . . , n.

The uniqueness of this solution is provided by [Neu90, Theorem 4.4.4, p145]; we recall it below.

Theorem 4.2.3. Let A ∈ IR^{n×n} be an H-matrix. Then for all b ∈ IR^n, the system of equations

    ei = (bi − Σ_{k≠i} Aik ek) / Aii,    i = 1, . . . , n

has a unique solution z ∈ IR^n.
Denote A^F b = z. If A is thin then A^F b = A^-1 b.

Applied to the relaxed system, the midpoint of Â is a diagonal matrix and the fixed point of the Jacobi iterations can be computed by the following theorem [Neu90, Theorem 4.4.10, p150].

Theorem 4.2.4. Let A ∈ IR^{n×n} be an H-matrix and suppose that mid(A) is diagonal. Then

    |A^F b| = ⟨A⟩^-1|b|,

and z = A^F b can be expressed in terms of u = ⟨A⟩^-1|b| as

    zi = (bi + (⟨A⟩ii ui − |bi|)[−1, 1]) / Aii,    i = 1, . . . , n.

Consequently, we can compare the enclosure of the solution set of the system Âx = b,
and the enclosure computed by Jacobi iterations.
Corollary 4.2.5. Let A ∈ IR^{n×n} be an H-matrix and suppose that mid(A) is diagonal. Then for all b ∈ IR^n:

(1) |A^F b| = |A^H b|,

(2) rad(A^F b) ≤ 2·rad(A^H b).

Proof. Following Theorem 3.2.2, since mid(A) is diagonal, we have

    (A^H b)i = (bi + [−βi, βi]) / (Aii + [−αi, αi]).

Since A is an H-matrix, 0 ∉ Aii. Suppose that Aii > 0; then

    |(A^H b)i| = (|bi| + βi) / (⟨A⟩ii − αi) = (|bi| + ui/di − |bi|) / (⟨A⟩ii − (⟨A⟩ii − 1/di)) = (ui/di) / (1/di) = ui = (⟨A⟩^-1|b|)i.

Following Theorem 4.2.4, |A^F b| = ⟨A⟩^-1|b|. Therefore the first equality holds.
Because of the subdistributivity property, we have

    (A^F b)i = (bi + (⟨A⟩ii ui − |bi|)[−1, 1]) / Aii ⊆ bi/Aii + (⟨A⟩ii ui − |bi|)·[−1, 1]/Aii ≡ yi,

where

    |yi| ≤ |bi/Aii| + (⟨A⟩ii ui − |bi|)·|[−1, 1]/Aii| = |bi|/⟨A⟩ii + (⟨A⟩ii ui − |bi|)·(1/⟨A⟩ii) = ui,
    mid(yi) = mid(bi/Aii) + mid((⟨A⟩ii ui − |bi|)·[−1, 1]/Aii) = mid(bi/Aii).

Therefore

    rad(yi) = |yi| − |mid(yi)| ≤ ui − |mid(bi/Aii)|.

It is obvious that

    mid(bi/Aii) ∈ bi/Aii ⊆ (bi + [−βi, βi]) / (Aii + [−αi, αi]) = (A^H b)i.

From the definition of the minimal magnitude value, we get

    |mid(bi/Aii)| ≥ ⟨(A^H b)i⟩.

Hence

    2·rad(A^H b)i ≥ |A^H b|i − ⟨A^H b⟩i ≥ |A^F b|i − |mid(bi/Aii)| ≥ |A^F b|i − |mid(A^F b)i| ≥ rad(A^F b)i.

The case Aii < 0 is similar to the case Aii > 0.


In other words, the above corollary means that when the Jacobi iterations are applied
to the relaxed system, the convergent interval of the iterations is at worst twice larger than
the exact hull of the solution set of the relaxed system.

4.2.3 Initial solution of the relaxed system


With the assumption of A being an H-matrix, an initial enclosure of the solution of the original system (4.1) can be computed using Proposition 3.4.3.
Let v ∈ R^n be a non-negative vector such that ⟨A⟩v ≥ w > 0 for some w; then

    A^-1 b ⊆ ‖b‖w · [−v, v].

Meanwhile, ⟨Â⟩ = ⟨A⟩ ⇒ ⟨Â⟩v ≥ w > 0. It means that the solution of the relaxed system (4.3) is bounded by the same interval vector as the original system:

    Â^-1 b ⊆ ‖b‖w · [−v, v].

In Example 6, if v is set to (1, 1)^T then

    w = ⟨A⟩v = ( 3 ; 3 ) > 0.

Hence, the initial solution of both the original system and the relaxed system is computed by

    ‖b‖w · [−v, v] = 4/3 · ( [−1, 1] ; [−1, 1] ) = ( [−4/3, 4/3] ; [−4/3, 4/3] ).

This initial solution is much larger than the hull of the solution set computed by Theorem 3.2.2, which is ( [14/17, 6/5] ; [4/17, 4/5] ).
On the other hand, from Theorem 4.2.4, if we can obtain an initial solution e0 = [−u, u], with u = ⟨A⟩^-1|b|, then the interval improvement converges after only one Jacobi iteration:

    e^1_i = (bi − Σ_{k≠i} Aik e0_k) / Aii
          = (bi − Σ_{k≠i} [−|Aik|uk, |Aik|uk]) / Aii
          = (bi − (Σ_{k≠i} |Aik|uk) [−1, 1]) / Aii.

Since u = ⟨A⟩^-1|b|, we have ⟨A⟩ii ui + Σ_{k≠i} |Aik|uk = |bi| and thus Σ_{k≠i} |Aik|uk = |bi| − ⟨A⟩ii ui. It implies that

    e^1_i = (bi − (|bi| − ⟨A⟩ii ui)[−1, 1]) / Aii = zi,

where zi is defined as in Theorem 4.2.4.


Therefore, instead of (1, . . . , 1)^T, if we set the value of v to an approximation of the solution of the system ⟨A⟩x = |b|, then we obtain a better initial enclosure.
In this example, if we set v to ⟨A⟩^-1|b| = ( 1.2 ; 0.8 ), then

    w = ⟨A⟩v = ( 4 ; 2 ) > 0.

Hence, the initial enclosure is

    ‖b‖w · [−v, v] = ( [−1.2, 1.2] ; [−0.8, 0.8] ).

This initial enclosure is better than that obtained for v = (1, . . . , 1)^T and makes the iterations converge more quickly. Nonetheless, the computation of an approximation of the solution of the system ⟨A⟩x = |b| costs O(n^3) floating-point operations and therefore introduces a significant overhead of execution time to the overall algorithm. Thus we do not use it in our algorithms.

4.3 Implementation and numerical experiments


In this section we implement a relaxed version, called certifylss_relaxed, of the function
certifylss, in MatLab using the IntLab library to explore the performance as well as the
accuracy of the relaxation technique.

4.3.1 Implementation
There are only some minor modifications in certifylss_relaxed with respect to its pre-
decessor certifylss. First, the preconditioned matrix is inflated, hence it is not necessary
to store the off-diagonal components of the preconditioned matrix as an interval but to
store only their maximal magnitude value. Consequently, to store the product of that
matrix with an interval vector, it is sufficient to store only the maximal magnitude of the
result.
Algorithm 4.1 describes the interval improvement using Jacobi iterations and the re-
laxation technique.
Note that D and K are computed only once, and need not to be recomputed each time
the interval improvement is applied.

4.3.2 Numerical results


The enclosure computed by the relaxed residual system will be of larger relative error than
that of the original system. Nevertheless, in our case, this enclosure is used only to update

Algorithm 4.1 Interval improvement function refine using relaxed Jacobi iterations
Require: e ∈ IFn be an initial enclosure of the solution of the interval system Ae = z,
where A ∈ IFn×n be an H-matrix, and z ∈ IFn
Ensure: e is improved using Jacobi iterations
D = diag(A)
K = mag(A)
Set the diagonal components of K to 0
while (termination criteria not satisfied) do
setround(1)
tmp = K ∗ mag(e)
e1 = (z + infsup(−tmp, tmp)) ./ D
e = intersect(e, e1)
end while

the approximate solution and to bound the error on this solution. Hence it is not necessarily true that the solution computed by certifylss_relaxed is less accurate than that computed by certifylss.
Figure 4.2 depicts numerical results with matrices of dimensions 1000 × 1000 and condition numbers varying from 2^5 to 2^50 for the three verified functions, namely verifylss of the IntLab library, certifylss, and certifylss_relaxed, together with results computed by MatLab. Again, the matrices used in the tests are generated by the MatLab function gallery('randsvd', dim, cond).
One can see that, although it relaxes the tightness of the error bound enclosure, the
function certifylss_relaxed always provides results of the same accuracy as those com-
puted by the function certifylss. This is also the reason why we do not use the algo-
rithms presented in Chapter 2 for the interval matrix product. Indeed, in this case, it is
not necessary to compute a very accurate enclosure of the interval matrix vector product.
Moreover, if certifylss_relaxed fails to verify the solution then certifylss also
fails and vice versa. That is because the two functions share the same initial error bound
enclosure. Failure to obtain an initial error bound means failure to verify the solution.
Nevertheless, using the relaxation technique helps to reduce a lot the overhead of execu-
tion time introduced by the interval improvement, and therefore it reduces a lot the overall
execution time of the function. As it can be seen on Figure 4.2, certifylss_relaxed is
even almost faster than verifylss except for too ill-conditioned systems.
When the condition number gets close to the inverse of the machine relative error, there is a drop in execution time for both certifylss_relaxed and verifylss. It corresponds, for both, to a failure to compute an initial error bound. In such a case, there is no need to perform any further iterative refinement and the two functions terminate prematurely without any certification of the computed approximate solution.
Even though the use of the relaxation technique makes the error bound improvement
phase become negligible in terms of execution time, when the condition number increases,
the execution time increases significantly. Indeed, when the condition number increases,
the number of iterations for refining the approximate solution increases. More precisely,
the dominant computation, in terms of execution time, is the computation of the residual,
as the computing precision is doubled, and thus the computation of the successive residuals
is no more negligible compared to the overall execution time.

[Figure: two panels showing, as functions of log2(condition number), the accuracy in bits and the execution time in seconds of matlab, verifylss, certifylss, and certifylss_relaxed.]
Figure 4.2: Relaxation technique: MatLab implementation results for 1000×1000 matrices.

4.4 Conclusion
By relaxing the exactness to exploit the symmetry of data as well as of operations, the performance has been significantly improved. At the same time, the relaxed method still achieves good accuracy. Following Section 4.2.2, the hull of the solution set of the relaxed system is exactly the enclosure of the solution computed by the Ning-Kearfott method [NK97]. Moreover, Corollary 4.2.5 shows that the enclosure produced by applying interval Jacobi iterations to the relaxed system is at worst twice as wide as the exact hull of the solution set of the relaxed system. It means that, in comparison with the Ning-Kearfott method, the relaxed method to compute an enclosure of the solution set of an interval linear system loses only 1 bit, whereas it costs only O(n^2) floating-point operations, instead of O(n^3) floating-point operations for the Ning-Kearfott method.
In the context of our problem, the relaxation technique is applied to the residual system.
Therefore, although deteriorating the tightness of the error bound, in effect, it still allows
to recover the lost bits in the approximate solution. Therefore, it does not deteriorate
the accuracy of the main result. As shown in the Section 4.3, the relaxed method for the
verification of the solution of a linear system is able to guarantee the same number of exact
bits as the original method, which is explained in Chapter 3.
Additionally, using the relaxation technique, each interval improvement step in particular, and the whole method in general, can now be expressed almost solely with punctual operations, except for the componentwise vector division. Therefore, the performance of the method relies entirely on the floating-point library employed. It also means that, without having to use a dedicated interval library, the method can be implemented using only a floating-point matrix library and appropriate rounding-mode control. We can therefore use existing libraries which are well optimized for floating-point matrix operations, such as ATLAS [WP05], GotoBlas [TAC], LAPACK, or the Intel Math Kernel Library, to implement our method.
The relaxation technique works very well with matrices whose midpoint is nearly di-
agonal. Hence it can be applied to any algorithm that needs to precondition the original
system, for example the Krawczyk iterations, to improve the performance of the algorithm.
Chapter 5

Effects of the computing precision

The effect of the computing precisions used for each step of the function
certifylss on the accuracy, i.e. the width, of the final result is studied. The
error analysis shows that, not only the computing precision for the residual
computation, but also the computing precisions used for inverting the matrix,
for pre-multiplying the matrix and for storing the approximate solution influ-
ence the accuracy of the final result. Nonetheless, for the sake of performance,
we choose to increase the computing precision only for storing the approximate
solution. Each approximate solution is stored as an unevaluated sum of two
vectors in the working precision. By decoupling the computation of the resid-
ual and refining the two parts of the approximate solution successively, we can
still obtain results which are guaranteed to almost the last bit with an acceptable
overhead in execution time, even when the coefficient matrix is ill-conditioned.

5.1 Introduction
Classically, in the implementation of iterative refinement using floating-point arithmetic,
the residual is computed using twice the working precision in order to obtain a solution
accurate to the working precision. The working precision is used for all the other compu-
tations, as well as for storing program variables.
In this chapter, we will study the effect of computing precision for each step of our
algorithm. The use of a higher precision will increase the result quality but reduce the
performance of the program. Conversely, a lower precision will increase the performance
of the program at the cost of lower accuracy. Studying the effect of computing precision
at each step allows us to find a trade-off between performance and accuracy.

5.2 Implementation using variable computing precisions


MPFR [FHL+ 07] and MPFI [RR05] are libraries that offer arbitrary precision for floating-
point arithmetic and interval arithmetic, respectively. Using these two libraries to imple-
ment our algorithms, the computing precision can be tuned at each step in order to study
the effect of these precisions on the result accuracy.
The matrices used in the following tests are generated by the MatLab function gallery('randsvd', dim, cond), where dim is the matrix dimension (taken as 1000), and cond is the


condition number. The coefficients of the generated matrices are IEEE double floating-
point numbers. The condition number varies between 2^5 and 2^50.
In order to clarify each experimental set, let us recall the algorithm.

Algorithm 5.1 Solving and verifying a linear system


1: Compute the LU decomposition of A: A = L × U
2: Compute an approximation x̃ with a forward and a backward substitution using L and
U
3: Compute R, an approximate inverse of A, by solving RL = inv(U ) [Hig02, Chapter
14]
4: Pre-multiply A by R: K = [RA]
5: Test if K is an H-matrix by computing a non-negative vector u such that ⟨K⟩u ≥ v > 0
6: if (fail to compute u) then
7: print Failed to verify the solution. The system may be either singular or too ill-
conditioned.
8: return
9: end if
10: Compute the residual r = [b − Ax̃] in double the working precision
11: Pre-multiply the residual by R: z = Rr
12: Compute an initial error bound e = ‖z‖v · [−u, u]
13: while (not converged) do
14: Apply five Gauss-Seidel iterations on K, z, and e
15: Update x̃ and e
16: Recompute r and z
17: end while
18: return x̃ + e

5.2.1 Residual computation


First, the precision for all computations is fixed to the working precision, except for the
residual computation, to study its effect on the result accuracy. It means that only line 10
is performed in arbitrary precision, then the result is rounded to the working precision.
Figure 5.1 depicts results obtained with a fixed working precision of 53 bits. When
the condition number decreases, the number of guaranteed correct bits (computed as in
Eq.(3.6)) increases nearly linearly. When the residual computation precision increases,
the number of guaranteed correct bits also increases linearly. However, when the residual
computation precision gets a bit higher than twice the working precision, the number of
correct bits stops increasing: this precision becomes too high and it does not have any
effect when the other precisions remain low. Practically, this phenomenon justifies the use
of twice the working precision for residual computations in iterative methods. From now
on, the computing precision for the residual will be set to twice the working precision.

5.2.2 Floating-point working precision


Next the working precision varies. Only the output precision is fixed. The purpose of this
set of experiments is to check, for a fixed output precision, whether it is mandatory to use


Figure 5.1: Effects of the residual precision.

a precision equal to the output precision to get a result which is correct to the last bit.
Experimental results are shown in figure 5.2.
For a fixed output precision, the needed working precision depends on the condition
number, and this dependency is again linear. As a rule of thumb [Hig02, chap.1, p.10],
the solution loses one digit of accuracy for every power of 10 in the condition number. On
Figure 5.2, this rule can be interpreted as follows: in order to be able to verify the solution
of the system, when the condition number increases by 1 bit, the computing precision
should be increased also by 1 bit. Nevertheless, the phenomenon is different when the
componentwise relative error is of concern. As revealed on Figure 5.2 by the white line,
when the condition number increases by 2 bits, the computing precision should be increased
by 1 bit in order to get a result which is correct to the last bit.

5.2.3 Needed working precision


In the final set of experiments, the demanded number of correct bits varies, and the minimal
working precision that reaches the prescribed accuracy is determined.
As shown in Figure 5.3, the minimal working precision depends nearly linearly on the
condition number and the number of correct bits demanded.

[Plot: floating-point working precision needed, seen from above.]

Figure 5.2: Effects of the working precision.

Figure 5.3: Minimal working precision needed.

5.3 Error analysis


As we can observe in the first set of experiments, and on Figure 5.1, the working precision should be doubled for the residual computation in order to obtain solutions of smallest relative error. Nevertheless, for ill-conditioned matrices, even if the residual
is computed in twice the working precision or higher, we still cannot guarantee the last
bits of the solution. This phenomenon has been covered in [DHK+ 06] for floating-point
iterative refinement. For interval iterative refinement, failure to guarantee the last bits
of the solution does not necessarily mean that an accurate floating-point solution cannot
be obtained, but maybe that the iterations fail to compute a tight enclosure upon this
approximate solution.
In this section, the error analysis of the interval iterative refinement is performed in
order to study the effect of computing precisions on the result accuracy. For the sake of
generality, different precisions will be used for each computation phase and for each variable
storage format. Precisions are denoted by their machine relative error ε. For a floating-point format of radix β where the mantissa is coded on p digits, the machine error is ε = β^(−p+1); for instance, for the IEEE double precision format (β = 2, p = 53), ε = 2^(−52).

Precisely, the following precisions will be used in our analysis:

• ε_w is the working precision; if not mentioned explicitly, the computations are performed in the working precision and the variables are stored in the working precision,
• ε_x is the precision used to store the approximate solution x,
• ε_i is the precision used to compute an inverse of A,
• ε_p is the precision used to pre-multiply the coefficient matrix,
• ε_r and ε_r̄ are the precisions used to compute and to store the residual, respectively.
Let us assume that R is computed by Gaussian elimination, using method B following [Hig02, chap.14, Section 14.3, p.268], which is used by LINPACK's xGEDI, LAPACK's xGETRI, and Matlab's inv function. Then the backward error is bounded by:

|RA − I| ≤ c′_n ε_i |R| · |L̃| · |Ũ|     (5.1)

where c′_n is a constant depending on n, the dimension of the problem.

If L̃, Ũ are non-negative, from [Hig02, chap.9, Section 9.3, p.164] we have

|L̃| · |Ũ| ≤ (1/(1 − γ_{n,i})) |A|,   where γ_{n,i} = nε_i / (1 − nε_i).
For the general case, using GE without pivoting could return an arbitrarily large ratio ‖|L̃| · |Ũ|‖/‖A‖. To analyze the stability of GE, we use the growth factor [Hig02, chap.9, Section 9.3, p.165]

ρ_n = max_{i,j,k} |a_{ij}^{(k)}| / max_{i,j} |a_{ij}|

which involves all the elements a_{ij}^{(k)} that occur during the elimination. Using this term we have the following result, established in [Hig02, Lemma 9.6, Chap.9, p.165]:

Lemma 5.3.1 ([Hig02]). If A = LU ∈ R^{n×n} is an LU factorization produced by GE without pivoting then

‖|L̃| · |Ũ|‖_∞ ≤ (1 + 2(n² − n)ρ_n) ‖A‖_∞.
Let us now study the effect of pre-multiplying the coefficient matrix by its approximate inverse in precision ε_p. Following [Hig02, chap.3, Section 3.5, p.71], the forward error of a floating-point matrix multiplication is bounded by:

K̃ = fl(RA) verifies |K̃ − RA| ≤ γ_{n,p} |R| · |A|,   where γ_{n,p} = nε_p / (1 − nε_p).

Hence, the width of the pre-multiplied matrix is bounded by:

diam(K) ≤ 2 γ_{n,p} |R| · |A|.

Its maximal magnitude value is bounded by:

|K − I| = |K − RA + (RA − I)|
        ≤ |K − RA| + |RA − I|
        ≤ γ_{n,p} |R| · |A| + c′_n ε_i |R| · |L̃| · |Ũ|.     (5.2)

Let us recall that ε_r and ε_r̄ are the precisions used to compute and to store the residual, respectively. Let r̃ ∈ F^n be the result of b − Ax̃ computed in floating-point arithmetic with some rounding mode; then, following [Hig02, chap.12, Section 12.1, p.233],

r̃ = b − Ax̃ + Δr,   |Δr| ≤ [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] |A| · |x* − x̃| + 2 (1 + ε_r̄) γ̄_{n+1,r} |A| · |x*|

where γ̄_{n+1,r} = (n + 1) ε_r / (1 − (n + 1) ε_r). Hence

r ⊆ A(x* − x̃) + midrad(0, [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] |A| · |x* − x̃| + 2 (1 + ε_r̄) γ̄_{n+1,r} |A| · |x*|).

Ignoring rounding errors introduced by pre-multiplying the residual, we have

z = Rr ⊆ RA(x* − x̃) + midrad(0, |R| ([ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] |A| · |x* − x̃| + 2 (1 + ε_r̄) γ̄_{n+1,r} |A| · |x*|))

which gives

|z| ≤ |RA(x* − x̃)| + |R| ([ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] |A| · |x* − x̃| + 2 (1 + ε_r̄) γ̄_{n+1,r} |A| · |x*|)
    ≤ |RA| · |x* − x̃| + [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] |R| · |A| · |x* − x̃| + 2 (1 + ε_r̄) γ̄_{n+1,r} |R| · |A| · |x*|.

Since x̃ is a floating-point approximate solution, the best bound that can be expected on the difference between x̃ and the exact solution x* is:

|x̃ − x*| ≤ (1/2) ε_x |x*|.

Hence

|z| ≤ (1/2) ε_x |RA| · |x*| + [(1/2) ε_x [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] + 2 (1 + ε_r̄) γ̄_{n+1,r}] |R| · |A| · |x*|.     (5.3)
Let D denote the diagonal of K, and E denote the off-diagonal part of K. Suppose that, with the interval improvement, i.e. Jacobi or Gauss-Seidel iterations, the enclosure of the solution of the residual system converges to a fixed point e^f; then e^f satisfies:

e^f = D^{-1} (z − E e^f)
⇒ |e^f| = |D^{-1} (z − E e^f)| = |D^{-1}| · |z − E e^f|.

The necessary condition for Jacobi/Gauss-Seidel to be convergent is that K is an H-matrix. It implies that D does not contain zero in its interior. Therefore we can suppose here that D is positive; if it is not the case, we simply multiply by −1 the rows of K whose diagonal element is negative, together with the corresponding elements of z. Hence

(D^{-1})_{ii} = 1 / D_{ii}
⇒ |D^{-1}|_{ii} = 1 / ⟨D_{ii}⟩
⇒ |e^f|_i = |z − E e^f|_i / ⟨D_{ii}⟩
⇒ |e^f|_i ⟨D_{ii}⟩ = |z − E e^f|_i
⇒ |e^f|_i = |z − E e^f|_i + (I − ⟨D⟩)_{ii} |e^f|_i
⇒ |e^f| ≤ |z| + |E| · |e^f| + |D − I| · |e^f|
        ≤ |z| + |K − I| · |e^f|.

Lemma 5.3.2. If the midpoint of K is equal to the identity then:

|e^f| = |z| + |K − I| · |e^f|.

Proof. The fact that the midpoint of K is equal to the identity implies that E is centered in zero and that the midpoint of D is equal to the identity. One can deduce that:

⟨D⟩ + |D| = inf(D) + |D| = 2I
⇒ I − ⟨D⟩ = |D| − I = |D − I|.

E e^f = [−|E| · |e^f|, |E| · |e^f|]
⇒ |z + E e^f| = |[inf(z) − |E| · |e^f|, sup(z) + |E| · |e^f|]|
              = max(|inf(z)|, |sup(z)|) + |E| · |e^f|
              = |z| + |E| · |e^f|.

Moreover, D − I is a diagonal matrix and the diagonal components of E are zero, hence E + (D − I) = E + D − I = K − I, and |E| + |D − I| = |E + D − I| = |K − I|, which gives:

|z − E e^f| + (I − ⟨D⟩) |e^f| = |z| + |E| · |e^f| + |D − I| · |e^f|
                             = |z| + (|E| + |D − I|) · |e^f|
⇒ |e^f| = |z| + |K − I| · |e^f|

which is the sought equality.
Because the best bound that we can expect on |x̃ − x*| is (1/2) ε_x |x*|, the best bound that we can expect on |e^f| is also (1/2) ε_x |x*|, which gives

|e^f| ≤ |z| + (1/2) ε_x |K − I| · |x*|.

Using the bound for |K − I| given in (5.2) and the bound for |z| given in (5.3), we have

|e^f| ≤ (1/2) ε_x |RA| · |x*| + d_n |R| · |A| · |x*| + (1/2) c′_n ε_i ε_x |R| · |L̃| · |Ũ| · |x*|

where

d_n = (1/2) ε_x [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r}] + 2 (1 + ε_r̄) γ̄_{n+1,r} + (1/2) ε_x γ_{n,p}.

From (5.1), one can deduce that

|RA| ≤ I + c′_n ε_i |R| · |L̃| · |Ũ|.

Hence

|e^f| ≤ (1/2) ε_x (I + c′_n ε_i |R| · |L̃| · |Ũ|) |x*| + d_n |R| · |A| · |x*| + (1/2) c′_n ε_i ε_x |R| · |L̃| · |Ũ| · |x*|
      ≤ (1/2) ε_x |x*| + d_n |R| · |A| · |x*| + c′_n ε_i ε_x |R| · |L̃| · |Ũ| · |x*|.
The first term of the above bound corresponds to the rounding error of the computed result to the working precision, which is unavoidable. Meanwhile, the other terms of the above bound are related to different calculations during the computing process, and indeed influence the accuracy of the final result: the second term is influenced by the computing precisions ε_r and ε_r̄ used for the residual computation and storage and by the computing precision ε_p used for the product K = [RA], and the third term is influenced by the computing precision ε_i used to compute the approximate inverse R of A and by the precision ε_x used to store the approximate solution x̃.

5.3.1 Normwise relative error



The normwise relative error of the certified solution is defined as ‖x̃ − x*‖_∞ / ‖x*‖_∞. Let us assume that x̃ is the approximate solution after convergence; then x̃ − x* ∈ e^f. Therefore the normwise relative error is bounded by:

‖x̃ − x*‖_∞ / ‖x*‖_∞ ≤ ‖e^f‖_∞ / ‖x*‖_∞
  ≤ (1/2) ε_x + d_n ‖|R| · |A| · |x*|‖_∞ / ‖x*‖_∞ + c′_n ε_i ε_x ‖|R| · |L̃| · |Ũ| · |x*|‖_∞ / ‖x*‖_∞
  ≤ (1/2) ε_x + d_n ‖R‖_∞ · ‖A‖_∞ · ‖x*‖_∞ / ‖x*‖_∞ + c′_n ε_i ε_x ‖R‖_∞ · ‖|L̃| · |Ũ|‖_∞ · ‖x*‖_∞ / ‖x*‖_∞
  ≤ (1/2) ε_x + d_n ‖R‖_∞ · ‖A‖_∞ + c′_n ε_i ε_x ‖R‖_∞ · ‖|L̃| · |Ũ|‖_∞.

Following Lemma 5.3.1 we get

‖x̃ − x*‖_∞ / ‖x*‖_∞ ≤ (1/2) ε_x + d_n ‖R‖_∞ · ‖A‖_∞ + c′_n ε_i ε_x (1 + 2(n² − n)ρ_n) ‖R‖_∞ · ‖A‖_∞
  ≤ (1/2) ε_x + [d_n + c′_n ε_i ε_x (1 + 2(n² − n)ρ_n)] κ̃(A)
  ≤ (1/2) ε_x + (1/2) ε_x [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r} + γ_{n,p}] κ̃(A)
      + 2 (1 + ε_r̄) γ̄_{n+1,r} κ̃(A) + c′_n ε_i ε_x (1 + 2(n² − n)ρ_n) κ̃(A).     (5.4)

Note that, here, we use κ̃(A) = ‖R‖_∞ · ‖A‖_∞ instead of κ(A), because by definition κ(A) = ‖|A^{-1}| · |A|‖_∞, whereas we only have that R is an approximate inverse of A. Since we expect R ≈ A^{-1}, we have κ̃(A) ≈ κ(A).
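For illustration only, the two quantities can be compared in MatLab as follows; the variable names are ours, and in floating-point arithmetic both values are only approximations of the corresponding exact quantities.

    R = inv(A);
    kappa_tilde = norm(R, inf) * norm(A, inf);       % the quantity κ̃(A) appearing in (5.4)
    kappa_skeel = norm(abs(inv(A)) * abs(A), inf);   % the definition of κ(A) used above

Both quantities grow with the ill-conditioning of A; the analysis uses κ̃(A) because only the approximate inverse R, not A^{-1}, is available during the computation.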
Equation (5.4) shows the dependence of the result quality on the condition number of A, as well as on the different precisions employed in the algorithm. Note that, although the precision ε_x is used to store x, the final result is still returned in the working precision ε_w. Hence, in order to achieve a solution which is guaranteed to the last bit, the following inequality must hold:

‖x̃ − x*‖_∞ / ‖x*‖_∞ ≤ ε_w.

Suppose that κ̃(A) varies between 1 and 1/ε_w. Obviously, x must be stored in a precision at least as high as the working precision, i.e. ε_x ≤ ε_w. Therefore the term (1/2) ε_x [ε_r̄ + (1 + ε_r̄) γ̄_{n+1,r} + γ_{n,p}] κ̃(A) is of order O(ε_w) and does not affect significantly the result accuracy.

Nevertheless, the term 2 (1 + ε_r̄) γ̄_{n+1,r} κ̃(A) in (5.4) influences the result relative error. This is in fact the effect of the precision used for the residual computation on the result quality. When A is ill-conditioned, say κ̃(A) = 1/(k ε_w) with k > 1 but not too big, in order that 2 (1 + ε_r̄) γ̄_{n+1,r} κ̃(A) = O(ε_w), we must have γ̄_{n+1,r} = O(ε_w²). It means that twice the working precision must be used for the residual computation in order to get a result accurate to the last bit in the working precision.

Nevertheless, doubling the working precision for the residual computation does not suffice to obtain a result accurate to the last bit in the working precision. One can see on Figure 4.2 that when the condition number is high, there is a drop in result accuracy.

That is due to the term c′_n ε_x ε_i (1 + 2(n² − n)ρ_n) κ̃(A) in (5.4). When A is ill-conditioned, say again κ̃(A) = 1/(k ε_w) with k > 1 but not too big, if we use the working precision to store the approximate solution and to compute the inverse of A, i.e. ε_x = ε_w and ε_i = ε_w, then

c′_n ε_x ε_i (1 + 2(n² − n)ρ_n) κ̃(A) = (c′_n (1 + 2(n² − n)ρ_n) / k) ε_w.

It seems that this term is of order O(ε_w) and does not affect the result quality much. This is true when the matrix dimension is small, but no longer true for matrices of large dimensions. For example, if n = 1000 then 2(n² − n) ≈ 2^21. This makes the term significant and reduces the result accuracy.
In order to reduce the impact of the term c′_n ε_x ε_i (1 + 2(n² − n)ρ_n) κ̃(A) on the result quality, one can reduce either ε_x or ε_i. That means either increasing the precision used to store x or increasing the precision used to compute the inverse of A. Nevertheless, inverting A costs O(n³) FLOPs. Hence, increasing the precision to invert A would introduce a significant overhead in the algorithm's execution time.

Therefore, we choose to increase the precision of the storage format of x. As will be shown in the next section, doubling the working precision for the approximate solution yields results accurate to the last bit without deteriorating significantly the performance of the overall algorithm.

5.3.2 Componentwise relative error


Again, since x̃ − x* ∈ e^f, the componentwise relative error of x̃ is bounded by:

max_i (|x̃ − x*|_i / |x*|_i) ≤ max_i (|e^f|_i / |x*|_i)
  ≤ (1/2) ε_x + d_n max_i ((|R| · |A| · |x*|)_i / |x*|_i) + c′_n ε_i ε_x max_i ((|R| · |L̃| · |Ũ| · |x*|)_i / |x*|_i)
  ≤ (1/2) ε_x + d_n ‖R‖_∞ · ‖A‖_∞ · ‖x*‖_∞ / min_i(|x*|_i) + c′_n ε_i ε_x ‖R‖_∞ · ‖|L̃| · |Ũ|‖_∞ · ‖x*‖_∞ / min_i(|x*|_i)
  ≤ (1/2) ε_x + [d_n ‖R‖_∞ · ‖A‖_∞ + c′_n ε_i ε_x ‖R‖_∞ · ‖|L̃| · |Ũ|‖_∞] · max_i(|x*|_i) / min_i(|x*|_i).

One can see that the componentwise relative error bound differs from the normwise relative error bound only by the factor max_i(|x*|_i) / min_i(|x*|_i), which corresponds to the variation in absolute value of the components of the exact solution. If all the components of x* are equal in absolute value then the componentwise relative error is equal to the normwise relative error.

A problem occurs with the componentwise relative error when there is a null component in the exact solution. In that case the factor max_i(|x*|_i) / min_i(|x*|_i) is infinite, and the relative error of the null component will also be infinite. Hence we suppose that there is no null component in the exact solution.

5.4 Doubling the working precision for the approximate solution
To verify the effect of using twice the working precision for the approximate solution, in this
section we implement, in MatLab using the Intlab library, a function called certifylssx,
which follows the Algorithm 5.1 and uses twice the working precision for both computing
the residual and storing the floating-point solution.
The use of a higher precision for the approximate solution is likely to increase the algorithm's execution time. But we will show later in this section that, to achieve the same accuracy, there is hardly any difference between certifylss and certifylssx in terms of execution time. Nonetheless, certifylssx allows more bits to be guaranteed when the matrix is ill-conditioned.

5.4.1 Implementation issues


Since quadruple precision is not available, to store the floating-point solution x̃ we use a pair of vectors of double precision floating-point numbers {x̃1, x̃2}, where x̃1 represents the high part, i.e. the first 53 bits, of x̃, and x̃2 represents the low part of x̃: x̃ = x̃1 + x̃2. In other words, the approximate solution is now stored as an unevaluated sum x̃1 + x̃2. Therefore, the residual upon x̃ is computed, always in twice the working precision, by

r = [b − Ax̃]_2 = [b − A(x̃1 + x̃2)]_2.
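To make the unevaluated-sum representation concrete, here is a minimal MatLab sketch of how a small floating-point correction c could be absorbed into the pair without losing information; the names are ours, the Fast2Sum step assumes |x1| ≥ |c| componentwise and round-to-nearest, and the thesis itself avoids a full double-double package, as explained just below.

    s  = x1 + c;           % new high part, rounded to working precision
    t  = c - (s - x1);     % exact rounding error of that addition (Fast2Sum)
    x1 = s;                % updated high part
    x2 = x2 + t;           % low part accumulates what the high part could not represent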

An arithmetic using twice the working precision and this representation of numbers as
sums of two floating-point numbers, is called a double-double arithmetic. Implementing
a fully supported package of simulated double-double precision operations, namely sum
and product, is very costly [Dek71, HLB01, LDB+ 02]. Instead, we reuse the function
for computing the residual in twice the working precision with a minor modification. A
new function called residualxx is implemented. Instead of returning an interval vector
in the working precision which encloses the exact residual, residualxx returns a triple
{r, er1, er2}, where r, er1, and er2 are vectors of double precision floating-point numbers: r is an approximation of the exact residual, and er1, er2 are respectively the lower
bound and the upper bound of the difference between r and the exact residual. The pair
{r, infsup(er1, er2)} is in fact the intermediate result of the function residual just be-
fore returning the guaranteed residual as r + infsup(er1, er2) = r + [er1, er2]. Thus,
b − A ∗ x̃ ∈ (r + [er1, er2]) − A ∗ x̃2 ⊆ [r − A ∗ x̃2 ] + [er1, er2] . Therefore, with this function,
we can compute the residual upon x̃ = {x̃1 , x̃2 } in twice the working precision by:

[r1 er1 er2] = residualxx(A, x1, b)          % b - A*x1 is enclosed in r1 + [er1, er2]
r = infsup(er1, er2) + residual(A, x2, r1)   % hence r encloses b - A*(x1 + x2)

where x1 in the code corresponds to x̃1 and x2 to x̃2.


Moreover, let e2 be the error bound upon x̃ = {x̃1 , x̃2 }, then x̃2 + e2 is also an error
bound upon x̃1 . On the other hand, if e1 is the error bound upon x̃1 , then we can use
mid(e1 ) as x̃2 and e1 − mid(e1 ) as e2 . This means that we can decouple the interval
improvement into two phases to refine x1 and x2 separately. Suppose that x1, an initial
floating-point approximate solution in the working precision, and e1, an initial error bound
upon this approximation, have been computed according to Section 3.2:

• Phase 1: compute the residual upon x1 as a pair of approximation vector and error
bound vector {r1, er1}. Use the system Ke = (r1 + er1) to improve e1.

• Phase 2: set x2 = mid(e1) and e2 = e1 − mid(e1). Compute the residual upon x1 + x2 and use it to refine e2.

A straightforward implementation would exhibit a slowdown by a factor of 2. However,


even if the approximate solution is computed and stored in twice the working precision,
it is possible to implement Phase 2 in such a way that the execution time remains the
same as before, where only the residual computation is performed in twice the working
precision. Indeed, Phase 2 can be considered as updating x1 by x2 and refining the error
bound upon this updated solution. In other words, by decoupling each interval iteration
into two phases, where each one corresponds to one iteration in working precision, there
is no overhead of execution time between the algorithm using extended-precision for the
approximate solution and the one using working precision for the approximate solution.
Regarding the accuracy, the updating process in phase 2 can be considered as being
performed in infinite precision for the approximate solution, which is stored in twice the
working precision. Only the error bound is updated using the working precision. Hence,
it increases the accuracy of the interval iterative refinement.
During the execution, there might be cases where x1 does not change after one iteration,
such as, for example, when we obtain a good approximation of the solution but fail to
compute a tight enclosure upon it. In those cases, there is no need to recompute the
residual upon x1.
Additionally, a tight enclosure of the solution can be obtained right after phase 1 of the
iteration. In that case, the program can terminate without having to proceed to phase 2.
The algorithm is detailed in Algorithm 5.2. Note that, for the sake of simplicity,
Algorithm 5.2 does not use the relaxation technique. In reality, the relaxation technique is
also applied to this algorithm in order to improve the performance.

5.4.2 Numerical results


We performed numerical experiments to compare the two functions certifylssx and certifylss both in terms of execution time and result accuracy. Matrices are again generated by the function gallery of MatLab, and the condition number varies between 2^5 and 2^50. Numerical results for matrices of dimension 1000 × 1000 are depicted on Figure 5.4.
One can see that results computed by certifylssx are guaranteed to almost the last
bit, except in the case of certification failure. Meanwhile, when the matrix is too ill-
conditioned, both functions certifylss and verifylss fail to guarantee the last bit.
Nonetheless, when the matrix is not too ill-conditioned, there is no overhead of execution
time in certifylssx with respect to certifylss. When the matrix is too ill-conditioned,
certifylssx is a bit slower than certifylss at the benefit of gaining guaranteed bits.

5.4.3 Other extended precisions


From the bound (5.4), to improve the result accuracy one can reduce either ε_x or ε_i to compensate for the constant c′_n (1 + 2(n² − n)ρ_n). As shown in the previous section, doubling the working precision for the solution allows very accurate results to be obtained.

Algorithm 5.2 Doubling the precision for the approximate solution


Require: A ∈ Fn×n , b ∈ Fn
Ensure: x ∈ Fn is an approximate solution of Ax = b, e ∈ IFn is an error bound on x
R = inv(A)
x=R∗b
K = R ∗ intval(A)
Acom = compmat(K)
Find u ≥ 0 such that v = Acom ∗ u > 0 according to Algorithm 3.3
[r1 er1 er2] = residualxx(A, x, b)
r = r1 + infsup(er1, er2)
z=R∗r
e = max(mag(z)./v) ∗ infsup(−u, u)
while (termination criteria not satisfied) do
if (not the first iteration && x has changed) then
[r1 er1 er2] = residualxx(A, x, b)
r = r1 + infsup(er1, er2)
z=R∗r
end if
e = refine(K, e, z)
if (termination criteria satisfied) then
break
end if
x2 = mid(e)
e = e − x2
r2 = residual(A, x2, r1) + infsup(er1, er2)
e = refine(K, e, R ∗ r2)
e = x2 + e
tmp = x + mid(e)
e = e + (x − intval(tmp))
x = tmp
end while
return x and e
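A hypothetical usage of this routine could look as follows; the call signature [x, e] = certifylssx(A, b) matches Algorithm 5.2's inputs and outputs, but the right-hand side and the formula used to count guaranteed bits (cf. Eq. (3.6)) are our own illustrative choices.

    n = 1000;
    A = gallery('randsvd', n, 2^40);        % ill-conditioned test matrix
    b = A * ones(n, 1);                     % right-hand side with a known solution
    [x, e] = certifylssx(A, b);             % approximate solution and verified error bound
    bits = -log2(max(mag(e) ./ abs(x)));    % rough number of guaranteed correct bits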

[Two plots vs. log2(condition number): accuracy in bits (top) and execution time in seconds (bottom), comparing matlab, verifylss, certifylss_relaxed, and certifylssx.]
Figure 5.4: Effect of doubling the working precision for the approximate solution.

In fact, it is not really necessary to use twice the working precision. A lower precision is indeed sufficient
to achieve results accurate to the last bit in the working precision. For example, x̃ can be
coded by a couple {x̃1 , x̃2 } where x̃1 is one vector of IEEE double floating-point numbers
and x̃2 is one vector of IEEE single floating-point numbers. Using a lower precision for
the solution can help to reduce the computation time of each interval improvement step.
Nevertheless, using a lower precision will increase the error upon the residual computation.
Hence, it will decrease the rate of convergence, or in other words, increase the number of
iterations and thus the number of times residual computation is performed.
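One plausible way to form such a mixed-precision pair in MatLab/INTLAB is sketched below; this is our reading of the idea, not necessarily the code of certifylssxs.

    x2 = single(mid(e));       % store the low part of the solution in IEEE single precision
    e  = e - double(x2);       % error bound that remains once this part has been split off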
Figure 5.5 shows a comparison between the above function certifylssx and another
version of it called certifylssxs which uses IEEE single precision for x̃2. One can see that certifylssxs still succeeds in obtaining very accurate results. Moreover, there is not much difference between the two functions in terms of execution time.

5.5 Conclusion
In contrast to the previous chapter, which consists in relaxing the exactness to improve the
speed, in this chapter, we relax the speed requirement in order to improve the accuracy
by studying the effect of the computing precision on both the accuracy and the speed of
the algorithm. Classically, twice the working precision is used for the residual computation

[Two plots vs. log2(condition number): accuracy in bits (top) and execution time in seconds (bottom), comparing matlab, certifylssx, and certifylssxs.]

Figure 5.5: Effect of other extended precisions.

in order to obtain results which are accurate in the working precision. Nonetheless, in
our case where rigorous error bounds are needed or componentwise relative errors are in
question, doubling the working precision only for the residual computation does not suffice
to guarantee the last bits of the computed results when the matrix is too ill-conditioned.
By performing the error analysis as well as some experiments using arbitrary precision libraries, namely the MPFR and MPFI libraries, to study the effect of the precision of each step on the final accuracy, we observed that in order to verify the last bits of the approximate solution, it is necessary to increase either the precision for inverting the coefficient matrix or the precision for storing the intermediate approximate solution. Nevertheless, an increase of precision for inverting the coefficient matrix would lead to a drastic increase of execution time and memory usage. Therefore we chose to increase the precision for storing the approximate solution. By reusing the procedure for computing the residual in twice the working precision, the use of twice the working precision for storing the approximate solution does not harm the algorithm's speed when the condition number is small to intermediate. Furthermore, when the matrix is too ill-conditioned, the use of twice the working precision for the approximate solution helps to obtain very accurate results, up to the last bit, at the cost of an acceptable overhead in execution time, as revealed in the experimental section.
Conclusions and Perspectives

Conclusion
In this thesis, we concentrated on efficient implementations for verified linear algebra using
interval arithmetic. Thanks to the fundamental property of inclusion, interval arithmetic makes it possible to compute an enclosure of the exact result, which acts as a certificate of accuracy. Nevertheless, there are obstacles to the application of interval arithmetic. Firstly, interval arithmetic suffers from low performance in comparison with floating-point arithmetic. Secondly, algorithms using interval arithmetic usually produce largely overestimated results if care is not taken when implementing them. Our main contribution is to find and implement
algorithms which run fast and at the same time provide sharp enclosures of the exact re-
sults. Two fundamental problems are addressed in this thesis, namely the multiplication of
interval matrices and the verification of the solution to a linear system. For each problem,
our contributions can be summarized as follows.

Multiplication of interval matrices


The natural algorithm for the multiplication of interval matrices suffers from very low performance due to the numerous changes of rounding mode as well as the use of maximum and minimum functions. S. M. Rump proposed an efficient algorithm which is based on the midpoint-radius representation of intervals and uses merely floating-point matrix products. It is the fastest implementation to date and costs only 4 floating-point matrix products. Nevertheless, Rump's algorithm always provides overestimated results, and the factor of overestimation in the worst case is 1.5.

We present in this thesis two new algorithms, which are based respectively on the endpoints and midpoint-radius representations of intervals and offer new trade-offs between speed and accuracy. In order to reduce the overestimation factor, we introduce a new scheme for decomposing intervals. Each component of one input matrix is decomposed into 2 parts: one which is centered in zero, and the other which is the largest possible interval which does not contain zero in its interior. With this decomposition, both new algorithms can be implemented using merely floating-point matrix products. They cost 9 and 7 floating-point matrix products respectively, and thus they are a bit slower than Rump's algorithm. In return, the factor of overestimation in the worst case of both new algorithms is reduced to less than 1.18. Furthermore, if none of the components of one multiplier matrix contains zero, then the computed result is exact in the absence of rounding errors. In practice, even in the presence of rounding errors, our implementation of the second algorithm yields accurate results, conforming to this theoretical result.
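One plausible reading of this decomposition, written with INTLAB primitives, is sketched below; the exact scheme is the one given in the corresponding chapter, and the variable names are ours.

    m  = mid(A);  r = rad(A);
    rs = min(abs(m), r);                   % radius of the part that keeps the original midpoint
    As = midrad(m, rs);                    % part that does not contain zero in its interior
    A0 = midrad(zeros(size(m)), r - rs);   % zero-centered remainder, so that A is contained in As + A0

Each product involving As can then exploit the fact that its entries have a constant sign, while products involving the zero-centered part A0 reduce to magnitude computations.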


Verifying the solution of a linear system of equations


In order to compute a sharp enclosure of the solution of a linear system, we adapt floating-
point iterative refinement to interval arithmetic. First, an initial approximate solution is
computed using floating-point arithmetic. Then we switch to interval arithmetic to com-
pute the residual. Still in interval arithmetic, the residual system is pre-multiplied by an
approximate inverse of the matrix and is used to refine the error bound using either interval
Jacobi or interval Gauss-Seidel iterations. The combination of floating-point refinement
and interval improvement allows us to obtain accurate results together with a tight en-
closure of error bound. However, as the condition number increases, the execution time
increases drastically. Moreover, as the matrix gets too ill-conditioned, this method fails to
guarantee the last bits of computed result.
To tackle the first problem, we introduce the relaxation technique which consists in
relaxing the exactness to gain in performance. The pre-multiplied matrix is indeed inflated
in such a way that its midpoint is a diagonal matrix. Exploiting the symmetry of interval
data helps to reduce significantly the execution time of the interval improvement phase.
Nonetheless, the result accuracy is not affected by this relaxation. Although the error
bound becomes larger due to inflated data, it is still able to recover the lost bits in the
approximate solution.
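As a rough sketch of this relaxation with INTLAB primitives (our names; the precise inflation is described in the corresponding chapter):

    K      = R * intval(A);                    % pre-multiplied interval matrix
    Kd     = diag(diag(mid(K)));               % keep only the diagonal of the midpoint
    Krelax = Kd + midrad(zeros(size(Kd)), rad(K) + abs(mid(K) - Kd));
    % Krelax encloses K and has a diagonal midpoint, so the off-diagonal entries are
    % zero-centered and each Gauss-Seidel sweep only needs their magnitudes.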
The second problem is solved by applying extra-precise refinement, where extended
precision is used not only for residual computation but also for storing the approximate
solution, since the precision for the approximate solution influences the result accuracy
when the matrix is too ill-conditioned. Numerical experiments have shown that by using
extended precision for the approximate solution, our method provides very accurate re-
sults which are guaranteed to almost the last bit at the cost of an acceptable overhead of
execution time.

Perspectives
During my Ph.D. preparation, MatLab was used as the experimental environment. It makes it possible to quickly obtain a “proof of concept” for the performance as well as the accuracy of the proposed algorithms. Nevertheless, because of the overhead in execution time introduced
by the interpretation, functions implemented in MatLab suffer from low performance when
employing loops such as the for and while loops, or when accessing separately rows and
columns of matrices.
Nonetheless, an implementation in other programming languages, such as C or FOR-
TRAN, does not suffer from these problems. By making direct calls to BLAS level 3
routines and accessing directly columns of the matrices, one can obtain a better implemen-
tation of the Gauss-Seidel iterations, and especially the residual computation. Although the
residual computation costs O(n2 ) floating-point operations, an implementation in MatLab
is much more expensive in terms of execution time as explained in Section 3.5.1. Therefore,
a better implementation of the residual computation will reduce significantly the execution
time of the method.
Our method for verifying the solution of a linear system makes it possible to compute a very accurate solution together with a tight enclosure of the error bound when the matrix is not too ill-conditioned. Nevertheless, when the matrix is too ill-conditioned, i.e. when the condition number of the matrix is close to 1/ε with ε being the machine relative error,

then both our functions and the function verifylss of the IntLab library fail to verify the solution. This is because a good approximation of the inverse of the coefficient matrix cannot be obtained.

To extend the range of condition numbers of verifiable systems, we can use a technique proposed by Rump [Rum09b] to invert extremely ill-conditioned matrices. This technique is based on multiplicative correction terms and K-fold precision. For example, using 2-fold precision, i.e. twice the working precision, one can obtain a good approximate inverse for matrices whose condition number falls between 1/ε and 1/ε². This technique has also been used in [OOR09] to refine the solution of ill-conditioned linear equations, as well as in [Ogi10] to factorize ill-conditioned matrices.
A research direction that I am also interested in is the implementation of interval
arithmetic on new hardware architectures. To date, there is no dedicated processor for interval arithmetic. There was in fact a hardware chip called XPA 3233, developed by Ulrich W. Kulisch in 1993 [Kul08], which supports long interval arithmetic through the exact dot product. However, this chip has never been put into production. The implementation of interval arithmetic on traditional processors suffers from some limits
such as changing rounding mode, testing and branching, etc. Meanwhile new hardware
architectures, for example the Cell processor [GLRM08], the Graphics Processing Unit
(GPU) [CFD08], or the Field-Programmable Gate Array (FPGA), offer new computational
capacities as well as new optimization options. On these new hardware architectures,
numerous computations can be run simultaneously, for example the SIMD instructions on
the Cell processor and the GPU. However, they have also some restrictions. For example,
the Cell processor supports only the rounding mode toward zero for the single precision,
meanwhile the GPU supports only the rounding modes toward nearest and toward zero.
Therefore, care must be taken in order to obtain an efficient implementation of interval
arithmetic on these architectures.
Currently, the midpoint-radius representation is less used than the endpoints represen-
tation except for matrix operations because of its disadvantage of lower accuracy. However,
when the intervals are of small width then the midpoint-radius representation is almost
of the same accuracy as the endpoints representation. As observed in Section 2.3.3, in
some specific cases, the midpoint-radius representation is even more accurate. When implementing interval arithmetic on new hardware architectures, the midpoint-radius representation exhibits another advantage over the endpoints representation: the midpoint and the radius can be stored in different precisions. For intervals of small width, a smaller precision can be used to represent the radius. This reduces not only the memory usage but also the cost of data transfers.
Bibliography

[BM05] Sylvie Boldo and Jean-Michel Muller, Some Functions Computable with a
Fused-MAC, Proceedings of the 17th Symposium on Computer Arithmetic
(Cape Cod, USA) (Paolo Montuschi and Eric Schwarz, eds.), 2005, pp. 52–58.

[BT89] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and distributed computation, Pren-
tice Hall, 1989.

[CFD08] Sylvain Collange, Jorge Flóres, and David Defour, A GPU interval library
based on Boost interval, Real Numbers and Computers, July 2008, pp. 61–72.

[CK04] Martine Ceberio and Vladik Kreinovich, Fast Multiplication of Interval Matri-
ces (Interval Version of Strassen’s Algorithm), Reliable Computing 10 (2004),
no. 3, 241–243.

[Dek71] T.J. Dekker, A floating-point technique for extending the available precision,
Numer. Math. 18 (1971), 224–242.

[DHK+ 06] James Demmel, Yozo Hida, William Kahan, Xiaoye S. Li, Sonil Mukher-
jee, and E. Jason Riedy, Error bounds from extra-precise iterative refinement,
ACM Trans. Mathematical Software 32 (2006), no. 2, 325–351.

[FHL+ 07] Laurent Fousse, Guillaume Hanrot, Vincent Lefèvre, Patrick Pélissier, and
Paul Zimmermann, MPFR: A Multiple-Precision Binary Floating-Point Li-
brary with Correct Rounding, ACM Trans. Mathematical Software 33 (2007),
no. 2.

[GLRM08] Stef Graillat, Jean-Luc Lamotte, Siegfried M. Rump, and Svetoslav Markov,
Interval arithmetic on the Cell processor, 13th GAMM - IMACS International
Symposium on Scientific Computing, Computer Arithmetic and Verified Nu-
merical Computations SCAN’08, 2008, pp. 54–54.

[GNL09] Stef Graillat, Hong Diep Nguyen, and Jean-Luc Lamotte, Error-Free Trans-
formation in rounding mode toward zero, Lecture Notes in Computer Science
(LNCS) 5492 (2009), 217–229, Numerical Validation in Current Hardware
Architecture.

[Han92] E. R. Hansen, Bounding the Solution of Interval Linear Equations, SIAM


Journal on Numerical Analysis 29 (1992), no. 5, 1493–1503.

[Hig90] N. J. Higham, Is fast matrix multiplication of practical use ?, SIAM News


(1990), 12–14.


[Hig97] Nicholas J. Higham, Iterative refinement for linear systems and LAPACK,
IMA Journal of Numerical Analysis 17 (1997), no. 4, 495–509.

[Hig02] , Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM Press,
2002.

[HKKK08] Chenyi Hu, R. Baker Kearfott, Andre de Korvin, and Vladik Kreinovich,
Knowledge processing with interval and soft computing, 1 ed., Springer Pub-
lishing Company, Incorporated, 2008.

[HLB01] Yozo Hida, Xiaoye S. Li, and David H. Bailey, Algorithms for quad-double pre-
cision floating point arithmetic, 15th IEEE Symposium on Computer Arith-
metic, 2001, pp. 155–162.

[HS81] Eldon Hansen and Saumyendra Sengupta, Bounding solutions of systems of


equations using interval analysis, BIT Numerical Mathematics 21 (1981), 203–
211.

[IEE85] IEEE Task P754, ANSI/IEEE 754-1985, Standard for Binary Floating-Point
Arithmetic, IEEE, New York, NY, USA, August 1985.

[IEE08] , IEEE 754-2008, Standard for Floating-Point Arithmetic, IEEE, New


York, NY, USA, August 2008.

[Int07] Intel Corporation, Intel R 64 and IA-32 Architectures Software Developer’s


Manual, Specification, 2007, http://www.intel.com/products/processor/
manuals/index.htm.

[JKDW01] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter, Applied Interval Analysis,


Springer, 2001.

[KDH+ 05] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and


D. Shippy, Introduction to the cell multiprocessor, IBM J. Res. Dev. 49 (2005),
589–604.

[Knu06] O. Knüppel, PROFIL/BIAS–A fast interval library, Computing 53 (2006),


no. 3–4, 277–287.

[Kul08] Ulrich W. Kulisch, Computer Arithmetic and Validity: Theory, Implementa-


tion, and Applications, de Gruyter Studies in Mathematics, Berlin, New York
(Walter de Gruyter), 2008.

[Lan04] Philippe Langlois, More accuracy at fixed precision, Journal of Computational


and Applied Mathematics 162 (2004), no. 1, 57–77.

[LD03] Xiaoye S. Li and James W. Demmel, SuperLU_DIST: A scalable distributed-


memory sparse direct solver for unsymmetric linear systems, ACM Trans.
Mathematical Software 29 (2003), no. 2, 110–140.

[LDB+ 02] Xiaoye S. Li, James W. Demmel, David H. Bailey, Greg Henry, Yozo Hida,
Jimmy Iskandar, William Kahan, Suh Y. Kang, Anil Kapur, Michael C. Mar-
tin, Brandon J. Thompson, Teresa Tung, and Daniel J. Yoo, Design, im-
plementation and testing of extended and mixed precision blas, ACM Trans.
Mathematical Software 28 (2002), 152–205.

[LHKK79] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh, Basic Linear


Algebra Subprograms for Fortran Usage, ACM Trans. Mathematical Software
5 (1979), 308–323.

[LL08a] Philippe Langlois and Nicolas Louvet, Accurate solution of triangular linear
system, 13th GAMM - IMACS International Symposium on Scientific Com-
puting, Computer Arithmetic, and Validated Numerics, El Paso (TX), USA,
September 2008.

[LL08b] , Compensated Horner algorithm in K times the working precision,


RNC-8, Real Numbers and Computer Conference, Santiago de Compostela,
Spain (J.D. Brugera and M. Daumas, eds.), July 2008.

[LLL+ 06] Julie Langou, Julien Langou, Piotr Luszczek, Jakub Kurzak, Alfredo Buttari,
and Jack Dongarra, Exploiting the Performance of 32 bit Floating Point Arith-
metic in Obtaining 64 bit Accuracy (Revisiting Iterative Refinement for Linear
Systems) - Article 113 (17 pages), Proc. of the 2006 ACM/IEEE conference
on Supercomputing, 2006.

[MB79] Ramon E. Moore and Fritz Bierbaum, Methods and Applications of Interval
Analysis (SIAM Studies in Applied and Numerical Mathematics) (Siam Stud-
ies in Applied Mathematics, 2.), Soc for Industrial & Applied Math, 1979.

[MBdD+ 10] Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre


Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien
Stehlé, and Serge Torres, Handbook of Floating-Point Arithmetic, Birkhäuser
Boston, 2010.

[MKC09] Ramon E. Moore, R. Baker Kearfott, and Michael J. Cloud, Introduction to


Interval Analysis, SIAM, Philadelphia, PA, USA, 2009.

[Mol67] Cleve B. Moler, Iterative Refinement in Floating Point, J. ACM 14 (1967),


no. 2, 316–321.

[Moo63] Ramon Edgar Moore, Interval arithmetic and automatic error analysis in
digital computing, Ph.D. thesis, Stanford University, Stanford, CA, USA, 1963.

[Neu90] Arnold Neumaier, Interval Methods for Systems of Equations, Cambridge Uni-
versity Press, 1990.

[Neu99] , A Simple Derivation of the Hansen-Bliek-Rohn-Ning-Kearfott En-


closure for Linear Interval Equations, Reliable Computing 5 (1999), no. 2,
131–136.

[NGL09] Hong Diep Nguyen, Stef Graillat, and Jean-Luc Lamotte, Extended precision
with a rounding mode toward zero environement. Application on the Cell pro-
cessor, Int. J. Reliability and Safety 3 (2009), no. 1/2/3, 153–173.

[Ngu] Hong Diep Nguyen, Efficient implementation of interval matrix multiplica-


tion, Proceedings of PARA 2010: State of the Art in Scientific and Parallel
Computing.

[NK97] S. Ning and R. B. Kearfott, A comparison of some methods for solving linear
interval equations, SIAM J. Numer. Anal. 34 (1997), no. 4, 1289–1305.

[NR] Hong Diep Nguyen and Nathalie Revol, Resolve and Certify a Linear System,
to appear on Reliable Computing (Special volume for SCAN 2008).

[NR10] , High performance linear algebra using interval arithmetic, PASCO


’10: Proceedings of the 4th International Workshop on Parallel and Symbolic
Computation (Marc Moreno Maza and Jean-Louis Roch, eds.), ACM, 2010,
pp. 171–172.

[Ogi10] Takeshi Ogita, Accurate matrix factorization: inverse lu and inverse qr fac-
torizations, SIAM J. Math. Anal. Appli. 31 (2010), no. 5, 2477–2497.

[Ois98] Shin’ichi Oishi, Finding all solutions of nonlinear systems of equations using
linear programming with guaranteed accuracy, J. Universal Computer Science
4 (1998), no. 2, 171–177.

[OO05] Takeshi Ogita and Shin’ichi Oishi, Fast inclusion of interval matrix multipli-
cation, Reliable Computing 11 (2005), no. 3, 191–205.

[OO09] , Fast verified solutions of linear systems, Japan Journal of Industrial


and Applied Mathematics 26 (2009), 169–190, 10.1007/BF03186530.

[OOO10] K. Ozaki, Takeshi Ogita, and Shin’ichi Oishi, Tight and efficient enclosure of
matrix multiplication by using optimized BLAS, preprint available at http:
//onlinelibrary.wiley.com/doi/10.1002/nla.724/pdf.

[OOR09] Shin’ichi Oishi, Takeshi Ogita, and Siegfried M. Rump, Iterative Refinement
for Ill-conditioned Linear Equations, Japan J. Indust. Appl. Math. 26 (2009),
no. 2, 465–476.

[ORO05] Takeshi Ogita, Siegfried M. Rump, and Shin’ichi Oishi, Accurate Sum and
Dot Product, SIAM J. Sci. Comput. 26 (2005), no. 6, 1955–1988.

[Pan84] Victor Pan, How to multiply matrices faster, Springer-Verlag New York, Inc.,
New York, NY, USA, 1984.

[RK95] Jiří Rohn and Vladik Kreinovich, Computing Exact Componentwise Bounds
on Solutions of Linear Systems with Interval Data is NP-Hard, SIAM J. Ma-
trix Anal. Appl. 16 (1995), no. 2, 415–420.

[Roh93] Jiří Rohn, Cheap and Tight Bounds: The Recent Result by E. Hansen Can Be
Made More Efficient, Interval Computations (1993), no. 4, 13–21.

[Roh96] , Checking Properties of Interval Matrices, Tech. Report 686, Czech


Academy of Sciences, 1996, http://uivtx.cs.cas.cz/~rohn/publist/92.
pdf.

[Roh05] , A Handbook of Results on Interval Linear Problems, 2nd ed.,


Czech Academy of Sciences, 2005, http://www.cs.cas.cz/~rohn/publist/
handbook.

[RR05] Nathalie Revol and Fabrice Rouillier, Motivations for an Arbitrary Precision
Interval Arithmetic and the MPFI Library, Reliable Computing 11 (2005),
no. 4, 275–290.

[Rum] Siegfried M. Rump, INTLAB - INTerval LABoratory, http://www.ti3.


tu-hamburg.de/rump/intlab.

[Rum80] , Kleine Fehlerschranken bei Matrixproblemen, Ph.D. thesis, Univer-


sität Karlsruhe, 1980.

[Rum91] , On the solution of interval linear systems, Computing 47 (1991),


337–353.

[Rum98] , A note on epsilon-inflation, Reliable Computing 4 (1998), 371–375.

[Rum99a] , Fast And Parallel Interval Arithmetic, BIT Numerical Mathematics


39 (1999), no. 3, 539–560.

[Rum99b] S.M. Rump, INTLAB - INTerval LABoratory, Developments in Reliable Com-


puting (Tibor Csendes, ed.), Kluwer Academic Publishers, Dordrecht, 1999,
http://www.ti3.tu-harburg.de/rump/, pp. 77–104.

[Rum05] Siegfried M. Rump, Handbook on Accuracy and Reliability in Scientific Com-


putation (edited by Bo Einarsson), ch. Computer-assisted Proofs and Self-
validating Methods, pp. 195–240, SIAM, 2005.

[Rum09a] , Error-Free Transformations and ill-conditioned problems, Interna-


tional Workshop on Verified Computations and Related Topics, 2009.

[Rum09b] , Inversion of extremely ill-conditioned matrices in floating-point,


Japan J. Indust. Appl. Math (JJIAM) 26 (2009), 249–277.

[RZBM09] Siegfried M. Rump, Paul Zimmermann, Sylvie Boldo, and Guillaume


Melquiond, Computing predecessor and successor in rounding to nearest, BIT
Numerical Mathematics 49 (2009), no. 2, 419–431.

[Ske79] Robert D. Skeel, Scaling for Numerical Stability in Gaussian Elimination, J.


ACM 26 (1979), no. 3, 494–526.

[Ske80] , Iterative Refinement Implies Numerical Stability for Gaussian Elim-


ination, Mathematics of Computation 35 (1980), no. 151, 817–832.

[Ste74] Pat H. Sterbenz, Floating-point computation, Prentice-Hall, Englewood Cliffs,


New Jersey, 1974.

[Str69] Volker Strassen, Gaussian elimination is not optimal, Numer. Math. 13


(1969), 354–356.

[TAC] TACC, GotoBLAS2, http://www.tacc.utexas.edu/tacc-projects/


gotoblas2/.

[vzGG03] J. von zur Gathen and J. Gerhard, Modern computer algebra, 2nd edition,
Cambridge University Press, 2003.

[Wil60] J. H. Wilkinson, Error analysis of floating-point computation, Numerische


Mathematik 2 (1960), no. 1, 319–340.

[Wil63] James H. Wilkinson, Rounding errors in algebraic processes, Prentice-Hall.


Reprinted in 1994 by Dover Publications, Incorporated, 1963.

[WP05] R. Clint Whaley and Antoine Petitet, Minimizing development


and maintenance costs in supporting persistently optimized BLAS,
Software: Practice and Experience 35 (2005), no. 2, 101–121,
http://www.cs.utsa.edu/~whaley/papers/spercw04.ps.
