Efstratios Gallopoulos
Bernard Philippe
Ahmed H. Sameh
Parallelism in Matrix Computations
Scientific Computation
Efstratios Gallopoulos
Computer Engineering and Informatics Department
University of Patras
Patras, Greece

Bernard Philippe
INRIA/IRISA, Campus de Beaulieu
Rennes Cedex, France

Ahmed H. Sameh
Department of Computer Science
Purdue University
West Lafayette, IN, USA
To the memory of Daniel L. Slotnick,
parallel processing pioneer
Preface
of order larger than 50, in practice one can handle much larger systems: The
majority of problems have very degenerate matrices and we do not need to store
anything like as much as … since the coefficients in these equations are very
systematic and mostly zero. The computational challenges we face today are certainly
different in scale from those above, but they are surprisingly similar in their
dependence on matrix computations and numerical linear algebra. In the early
1980s, during the construction of the experimental parallel computing platform "Cedar",
led by David Kuck at the University of Illinois at Urbana-Champaign, a table was
compiled that identifies the common computational bottlenecks of major science
and engineering applications, and the parallel algorithms that need to be designed,
together with their underlying kernels, in order to achieve high performance.
Among the algorithms listed, matrix computations are the most prominent.
A similar list was created by UC Berkeley in 2009. Among Berkeley’s 13 parallel
algorithmic methods that capture patterns of computation and communication,
which are called “dwarfs”, the top two are matrix computation-based. Not only are
matrix computations, and especially sparse matrix computations, essential in
advancing science and engineering disciplines such as computational mechanics,
electromagnetics, nanoelectronics among others, but they are also essential for
manipulation of the large graphs that arise in social networks, sensor networks, data
mining, and machine learning, to list just a few. Thus, we conclude that realizing
high performance in dense and sparse matrix computations on parallel computing
platforms is central to many applications and hence justifies our focus.
Our goal in this book is therefore to provide researchers and practitioners with
the basic principles necessary to design efficient parallel algorithms for dense and
sparse matrix computations. In fact, for each fundamental matrix computation
problem, such as solving banded linear systems, we present a family of
algorithms. The "optimal" choice of a member of this family will depend on the
linear system and the architecture of the parallel computing platform under con-
sideration. Clearly, however, executing a computation on a parallel platform
requires the combination of many steps ranging from: (i) the search for an “optimal”
parallel algorithm that minimizes the required arithmetic operations, memory ref-
erences and interprocessor communications, to (ii) its implementation on the
underlying platform. The latter step depends on the specific architectural charac-
teristics of the parallel computing platform. Since these architectural characteristics
are still evolving rapidly, we will refrain in this book from exposing fine imple-
mentation details for each parallel algorithm. Rather, we focus on algorithm
robustness and opportunities for parallelism in general. In other words, even though
our approach is geared towards numerically reliable algorithms that lend themselves
to practical implementation on parallel computing platforms that are currently
available, we will also present classes of algorithms that expose the theoretical
limitations of parallelism if one were not constrained by the number of cores/
processors, or the cost of memory references or interprocessor communications.
Acknowledgments
We wish to thank all of our current and previous collaborators who have been,
directly or indirectly, involved with topics discussed in this book. We thank
especially: Guy-Antoine Atenekeng-Kahou, Costas Bekas, Michael Berry, Olivier
Bertrand, Randy Bramley, Daniela Calvetti, Peter Cappello, Philippe Chartier,
Michel Crouzeix, George Cybenko, Ömer Eğecioğlu, Jocelyne Erhel, Roland
Freund, Kyle Gallivan, Ananth Grama, Joseph Grcar, Elias Houstis, William Jalby,
Vassilis Kalantzis, Emmanuel Kamgnia, Alicia Klinvex, Çetin Koç, Efi
Kokiopoulou, George Kollias, Erricos Kontoghiorghes, Alex Kouris, Ioannis
Koutis, David Kuck, Jacques Lenfant, Murat Manguoğlu, Dani Mezher, Carl
Christian Mikkelsen, Maxim Naumov, Antonio Navarra, Louis Bernard Nguenang,
Nikos Nikoloutsakos, David Padua, Eric Polizzi, Lothar Reichel, Yousef Saad,
Miloud Sadkane, Vivek Sarin, Olaf Schenk, Roger Blaise Sidje, Valeria Simoncini,
Aleksandros Sobczyk, Danny Sorensen, Andreas Stathopoulos, Daniel Szyld,
Maurice Tchuente, Tayfun Tezduyar, John Tsitsiklis, Marian Vajteršic, Panayot
Vassilevski, Ioannis Venetis, Brigitte Vital, Harry Wijshoff, Christos Zaroliagis,
Dimitris Zeimpekis, Zahari Zlatev, and Yao Zhu. Any errors and omissions, of
course, are entirely our responsibility.
In addition we wish to express our gratitude to Yousuff Hussaini who encour-
aged us to have our book published by Springer, to Connie Ermel who typed a
major part of the first draft, to Eugenia-Maria Kontopoulou for her help in preparing
the index, and to our Springer contacts, Kirsten Theunissen and Aldo Rampioni.
Finally, we would like to acknowledge the remarkable contributions of the late
Gene Golub—a mentor and a friend—from whom we learned a lot about matrix
computations. Further, we wish to pay our respect to the memory of our late
collaborators and friends: Theodore Papatheodorou, and John Wisniewski.
Last, but not least, we would like to thank our families, especially our spouses,
Aristoula, Elisabeth and Marilyn, for their patience during the time it took us to
produce this book.
Contents
Part I Basics
2 Fundamental Kernels
  2.1 Vector Operations
  2.2 Higher Level BLAS
      2.2.1 Dense Matrix Multiplication
      2.2.2 Lowering Complexity via the Strassen Algorithm
      2.2.3 Accelerating the Multiplication of Complex Matrices
  2.3 General Organization for Dense Matrix Factorizations
      2.3.1 Fan-Out and Fan-In Versions
      2.3.2 Parallelism in the Fan-Out Version
      2.3.3 Data Allocation for Distributed Memory
      2.3.4 Block Versions and Numerical Libraries
  2.4 Sparse Matrix Computations
      2.4.1 Sparse Matrix Storage and Matrix-Vector Multiplication Schemes
      2.4.2 Matrix Reordering Schemes
  References
Index
Notations
and
$[\alpha_{i,i-1}, \alpha_{i,i}, \alpha_{i,i+1}]_{1:n} =
\begin{pmatrix}
\alpha_{1,1} & \alpha_{1,2} & & & \\
\alpha_{2,1} & \alpha_{2,2} & \alpha_{2,3} & & \\
 & \ddots & \ddots & \ddots & \\
 & & \ddots & \ddots & \alpha_{n-1,n} \\
 & & & \alpha_{n,n-1} & \alpha_{n,n}
\end{pmatrix}$
Chapter 1
Parallel Programming Paradigms

In this chapter, we briefly present the main concepts in parallel computing. This is
an attempt to make more precise some definitions that will be used throughout this
book, rather than a survey of the topic. Interested readers are referred to one of the
many books available on the subject, e.g. [1–8].
In what follows:
• p denotes the number of processors.
• T p denotes the number of steps of the parallel algorithm on p processors. Each
step is assumed to consist of (i) the interprocessor communication and memory
references, and (ii) the arithmetic operations performed immediately after by the
active processors. Thus, for example, as will be shown in Sect. 2.1 of Chap. 2, an
inner product of two n-sized vectors requires at least T_p = O(log n) steps on
p = n processors.
• O_p denotes the number of arithmetic operations required by the parallel algorithm
when run on p processors.
• R_p denotes the arithmetic redundancy. This is the ratio of the total number of
arithmetic operations, O_p, in the parallel algorithm, over the least number of arithmetic
operations, O_1, required by the sequential algorithm. Specifically,
R_p = O_p / O_1.
• S_p denotes the speedup and E_p the parallel efficiency:
S_p = T_1 / T_p,  and  E_p = S_p / p.
• V_p denotes the computational rate, i.e. the number of operations performed per step:
V_p = O_p / T_p.
[Fig. 1.1: a control unit (CU) and processing elements (PE) connected to memory banks through an interconnection network.]
(Whenever such quantities turn out to be rational functions in the problem size, it must be
understood that we are referring to an integer value; typically this is the ceiling of the fraction.)
The earliest computers constructed to achieve high performance through the use of
parallelism were of the Single Instruction Multiple Data (SIMD) type, according to the
standard classification of parallel architectures [9]. These are characterized by the
fact that several processing elements (PE) are controlled by a single control unit
(CU); see Fig. 1.1, where MB denotes the memory banks.
For executing a program on such an architecture,
• the CU runs the program and sends: (i) the characteristics of the vectors to the
memory, (ii) the permutation to be applied to these vectors by the interconnection
network in order to achieve the desired alignment of the operands, and (iii) the
instructions to be executed by the PEs,
• the memory delivers the operands which are permuted through the interconnection
network,
• the processors perform the operation, and
• the results are sent back via the interconnection network to be correctly stored in
the memory.
The computational rate of the operation depends on the vector length. Denoting by t
the duration of the operation on one slice of N ≤ p components, the number of parallel
steps needed to perform the operation on a vector of length n is given by
T_p = (n/p) t.
Thus, the computational rate realized via the use of this SIMD architecture is given
by,
V_p = n / T_p.

[Figure: computational rate V_p(n) versus the vector length n.]
ci = ai + bi , for i = 1, . . . , n,
[Figure: a four-stage addition pipeline with stages C, A, S, N; the operand pairs (x_i, y_i) enter one after another and the results emerge at regular intervals τ after an initial delay, with times 0, τ, 2τ, 3τ, 4τ, ..., 8τ shown.]
Thus, the elapsed time corresponding to the addition of two vectors of length n is
given by,
T p = t0 + nτ
Figure 1.4 depicts the computational rate realized with respect to the vector length.
The asymptotic computational rate, r_∞ = 1/τ, is not reached for finite vector lengths.
Half of that asymptotic rate, however, is reached for a vector of length
n_{1/2} = t_0/τ = s − 1. The two numbers r_∞ and n_{1/2} characterize the pipeline
performance since it is easy to see that
V_p = r_∞ (1 − n_{1/2}/(n + n_{1/2})).
[Figure: computational rate V_p(n) versus the vector length n.]
operands of length n must be partitioned into slices each of length N . This approach
favors operations on short vectors since it decreases the start-up time. By assuming
that vector registers are ready with slices of the operands so that an operation is
immediately performed on a given slice once the operation on a previous slice is
completed, then the rate due to pipelining can be expressed as
V_p(n) = n / (k t_0 + nτ).
[Figure: computational rate V_p(n) versus the vector length n.]
For illustration, let us consider the following vector operation that is quite common
in a variety of numerical algorithms (see Sect. 2.1):
di = ai + bi ci , for i = 1, . . . , n.
Assuming that the time for each stage of the two pipelines (which perform the
multiplication and the addition) is equal, by chaining the pipelines the results are
still obtained at the same rate. Therefore, the speedup is doubled since two operations
are performed at the same time. It is worth noting that in several cases pipelining and
chaining are used to increase the performance of elementary operations. In particular,
the scalar operation a + bc is implemented as a single instruction, often called Fused
Multiply-Add (FMA). This also delivers results with smaller roundoff error than if one
were to implement it as two separate operations; cf. [12].
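To make the discussion concrete, the following C fragment is a minimal sketch of the vector operation d_i = a_i + b_i c_i using the fma function of the standard math library; whether this maps to an actual FMA instruction depends on the compiler and the target architecture.

#include <math.h>    /* fma() */
#include <stddef.h>

/* d[i] = a[i] + b[i]*c[i]; fma(x, y, z) computes x*y + z with a single
   rounding, which is the behavior described above.                      */
void fused_update(size_t n, const double *a, const double *b,
                  const double *c, double *d)
{
    for (size_t i = 0; i < n; i++)
        d[i] = fma(b[i], c[i], a[i]);
}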
Pipelining can also be applied to the interconnection networks between memories and
PEs, as well as within multiprocessors, to enhance memory bandwidth.
Many machines today adopt a hierarchical design in the sense that both the process-
ing and memory systems are organized in layers, each having specific characteristics.
Regarding the processing hierarchy, it consists of units capable of scalar processing,
vector and SIMD processing, multiple cores each able to process one or multiple
threads, the multiple cores organized into interconnected clusters. The memory hierarchy
consists of short and long registers, various levels of cache memory local
to each processor, and memory shared between processors in a cluster. In many
respects, it can be argued that most parallel systems are of this type or simplified
versions thereof.
which each iteration is independent of the rest. These directives steer the compiler in
its restructuring of the program. The most prominent programming paradigm of this
category is called OpenMP (Open Multiprocessing Application Program Interface),
e.g. see [21, 22]. In such a paradigm, the tasks are implemented via threads.
In a given program, directives allow the user to define parallel regions which
invoke fork and join mechanisms at the beginning and the end of the region, with the
ability to define the shared variables of the region. Other directives allow specifying
parallelism through loops. For parallel loops, the programmer specifies whether the
distribution of the iterations among the tasks (threads) is static or dynamic. A static
distribution lowers the overhead cost whereas a dynamic allocation adapts better to
irregular task loads. Several techniques are provided for synchronizing threads; these
include locks, barriers, and critical sections.
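The following C sketch illustrates these OpenMP mechanisms (the array size and the work inside the loops are arbitrary choices made for the example): a parallel region with a shared array, a loop with a static schedule, a loop with a dynamic schedule for irregular work, and a critical section protecting a shared variable.

#include <omp.h>
#include <stdio.h>

#define N 1000

int main(void)
{
    double x[N], sum = 0.0;

    #pragma omp parallel shared(x, sum)       /* fork at entry, join at exit */
    {
        #pragma omp for schedule(static)      /* static distribution of iterations */
        for (int i = 0; i < N; i++)
            x[i] = 1.0 / (i + 1);

        #pragma omp for schedule(dynamic, 16) /* dynamic allocation of iterations */
        for (int i = 0; i < N; i++)
            x[i] *= x[i];

        #pragma omp critical                  /* one thread at a time */
        sum += (double)omp_get_thread_num();
    }
    printf("sum = %f\n", sum);
    return 0;
}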
In this book, when necessary for the sake of illustration, we consider three types
of loops which are shown in Table 1.1:
(a) do in which the iterations are run sequentially,
(b) doall in which the iterations are run independently and possibly simultaneously,
(c) doacross in which the iterations are pipelined.
Table 1.1 illustrates these three cases. The doacross loop enables the use of pipelining
between iterations, so that an iteration can start before the previous one has been
completed. For this reason, it also depends on the use of synchronization mechanisms
(such as wait and post) to control the execution of the iterations.
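In OpenMP terms, a doall loop maps directly onto an ordinary parallel loop, while a doacross loop can be expressed (in OpenMP 4.5 and later) with the ordered construct and cross-iteration dependences, which play the role of the wait and post synchronizations mentioned above. The following sketch uses a hypothetical first-order recurrence for the doacross case.

void doall_example(int n, double *a, const double *b)
{
    /* doall: independent iterations, possibly executed simultaneously */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

void doacross_example(int n, double *x)
{
    /* doacross: iteration i waits only for the result of iteration i-1 */
    #pragma omp parallel for ordered(1)
    for (int i = 1; i < n; i++) {
        double t = 0.5 * x[i];                   /* independent work         */
        #pragma omp ordered depend(sink: i - 1)  /* wait for iteration i-1   */
        x[i] = x[i - 1] + t;
        #pragma omp ordered depend(source)       /* post: release iteration i+1 */
    }
}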
It is also worth noting that frequently, it is best to write programs that utilize both
of the above paradigms, that is MPI and OpenMP. This hybrid mode is natural in
order to take advantage of the hierarchical structure of high performance computer
systems.
It was observed quite early in the history of parallel computing that there is a limit
to the available parallelism in a given fixed-size computation. This limit is governed
by what is now known as Amdahl's law.
Proposition 1.1 (Amdahl's law) [23] Let O_p be the number of operations of a parallel
program implemented on p processors. If the portion f_p (0 < f_p < 1) of the O_p
operations is inherently sequential, then the speedup is bounded by S_p < 1/f_p.

Proof From the definitions of T_1 and T_p it is obvious that T_p ≥ (f_p + (1 − f_p)/p) T_1;
the upper bound immediately follows.
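As a small numerical illustration (with an arbitrarily chosen sequential fraction), the following C fragment evaluates the bound S_p ≤ 1/(f_p + (1 − f_p)/p) together with its limit 1/f_p.

#include <stdio.h>

/* Amdahl bound on the speedup for a sequential fraction f and p processors */
static double amdahl_bound(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    double f = 0.05;                 /* assumed sequential fraction */
    for (int p = 2; p <= 1024; p *= 2)
        printf("p = %4d   S_p <= %6.2f   (limit 1/f = %.1f)\n",
               p, amdahl_bound(f, p), 1.0 / f);
    return 0;
}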
This simple result gives a rule-of-thumb for the maximum speedup and efficiency
that can be expected of a given algorithm. The limits to parallel processing implied
by the law but also the limits of the law’s basic assumptions have been discussed
extensively in the literature, by designers of parallel systems and algorithms; for
example, cf. [24–31]. We make the following remarks.
Remark 1.1 Even for a fully parallel program, there is a loss of efficiency due to
the overhead (interprocessor communication time, memory references, and parallel
management) which usually implies that S p < p.
Remark 1.2 An opposite situation may occur when the observed speedup is “super-
linear”, i.e. S p > p. This happens, for example, when the data set is too large for the
local memory of one processor; whereas storing it across the p processor memories
becomes possible. In fact, the ability to manipulate larger datasets is an important
advantage of parallel processing that is becoming especially relevant in the context
of data intensive computations.
Remark 1.3 The speedup bound in Amdahl’s law refers strictly to the performance of
a program running in single-user mode on a parallel computer, using one or all of the
processors of the system. This was a reasonable assumption at the time of large-scale
SIMD systems, but today one can argue that it is too strong of a simplification. For
instance, it does not consider the effect of systems that can handle multiple parallel
computations across the processors. It also does not capture the possibility of parallel
systems consisting of heterogeneous processors with different performance.
Remark 1.4 The aspect of Amdahl’s law that has been under criticism by various
researchers is the assumption that the program whose speedup is evaluated is solving a
problem of fixed size. As was observed by Gustafson in [32], the problem size should
be allowed to vary with the number of processors. This meant that the parallel fraction
of the algorithm is not constant as the number of processors varies. We discuss this
issue in greater detail next.
Assuming that this holds independently of the size of the problem, the program is
characterized as strongly scalable.
In many cases, the speedup graph exhibits a two-stage behavior: there is some
threshold value, say p̃, such that S p increases linearly as long as p̃ ≥ p, whereas for
p > p̃, S p stagnates or even decreases. This is the sign that for p > p̃, the overhead,
i.e. the time spent in managing parallelism, becomes too high when the size of the
problem is too small for that number of processors.
The above performance constraint can be partially or fully removed, if we allow
the size of the problem to increase with the number of processors. Intuitively, it is
not surprising to expect that as the computer system becomes more powerful, so
would the size of the problem to be solved. This also means that the fraction of the
computations that are computed in parallel does not remain constant as the number
of processors increases. The notion of weak scalability is then relevant.
Naturally, weak scalability is easier to achieve than strong scalability, which seeks
constant efficiency for a problem of fixed size. On the other hand, even when selecting
the largest problem size that can be run on one processor, the problem eventually
becomes too small to be run efficiently on a large number of processors.
In conclusion, we note that in order to investigate the scalability potential of a
program, it is well worth analyzing the graph of the mapping p → (1 − f p )O p , that
is “processors to the total number of operations that can be performed in parallel”
(assuming no redundancy). Defining the problem size as O_1, that is, the total number
of operations needed to solve the problem, one question is how fast the problem
size should increase in order to keep the efficiency constant as the number of
processors increases. An appropriate rate, sometimes called isoefficiency [33], could
indeed be neither of the two extremes, namely the constant problem size of Amdahl
and the linear increase suggested in [32]; cf. [24] for a discussion.
References
1. Arbenz, P., Petersen, W.: Introduction to Parallel Computing. Oxford University Press (2004)
2. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation. Prentice Hall, Engle-
wood Cliffs (1989)
3. Culler, D., Singh, J., Gupta, A.: Parallel Computer Architecture: A Hardware/Software
Approach. Morgan Kaufmann, San Francisco (1998)
4. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and
Analysis of Algorithms, 2nd edn. Addison-Wesley (2003)
5. Casanova, H., Legrand, A., Robert, Y.: Parallel Algorithms. Chapman & Hall/CRC Press (2008)
6. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach. Elsevier Science
& Technology (2011)
7. Hockney, R., Jesshope, C.: Parallel Computers 2: Architecture, Programming and Algorithms,
2nd edn. Adam Hilger (1988)
8. Tchuente, M.: Parallel Computation on Regular Arrays. Algorithms and Architectures for
Advanced Scientific Computing. Manchester University Press (1991)
9. Flynn, M.: Some computer organizations and their effectiveness. IEEE Trans. Comput. C-21,
948–960 (1972)
10. Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming, 1st edn.
Morgan Kaufmann Publishers Inc., San Francisco (2013)
11. Hockney, R.: The Science of Computer Benchmarking. SIAM, Philadelphia (1996)
12. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
13. Karp, R., Sahay, A., Santos, E., Schauser, K.: Optimal broadcast and summation in the logP
model. In: Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Archi-
tectures SPAA’93, pp. 142–153. ACM Press, Velen (1993). http://doi.acm.org/10.1145/165231.
165250
14. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken,
T.: LogP: towards a realistic model of parallel computation. In: Principles and Practice of Parallel
Programming, pp. 1–12 (1993). http://citeseer.ist.psu.edu/culler93logp.html
15. Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G.E., Gabriel, E., Dongarra, J.: Performance
analysis of MPI collective operations. In: Fourth International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS’05).
Denver (2005). (Submitted)
16. Breshears, C.: The Art of Concurrency - A Thread Monkey’s Guide to Writing Parallel Appli-
cations. O’Reilly (2009)
17. Rauber, T., Rünger, G.: Parallel Programming—for Multicore and Cluster Systems. Springer
(2010)
18. Darema, F.: The SPMD model: past, present and future. In: Recent Advances in Parallel
Virtual Machine and Message Passing Interface. LNCS, vol. 2131/2001, p. 1. Springer, Berlin
(2001)
19. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message
Passing Interface. MIT Press, Cambridge (1994)
20. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference
(1995). http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
21. Chapman, B., Jost, G., Pas, R.: Using OpenMP: Portable Shared Memory Parallel Program-
ming. The MIT Press, Cambridge (2007)
22. OpenMP Architecture Review Board: OpenMP Application Program Interface (Version 3.1).
(2011). http://www.openmp.org/mp-documents/
23. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing
capabilities. Proc. AFIPS Spring Jt. Comput. Conf. 31, 483–485 (1967)
24. Juurlink, B., Meenderinck, C.: Amdahl’s law for predicting the future of multicores considered
harmful. SIGARCH Comput. Archit. News 40(2), 1–9 (2012). doi:10.1145/2234336.2234338.
http://doi.acm.org/10.1145/2234336.2234338
25. Hill, M., Marty, M.: Amdahl’s law in the multicore era. In: HPCA, p. 187. IEEE Computer
Society (2008)
26. Sun, X.H., Chen, Y.: Reevaluating Amdahl’s law in the multicore era. J. Parallel Distrib.
Comput. 70(2), 183–188 (2010)
27. Flatt, H., Kennedy, K.: Performance of parallel processors. Parallel Comput. 12(1), 1–20
(1989). doi:10.1016/0167-8191(89)90003-3. http://www.sciencedirect.com/science/article/
pii/0167819189900033
28. Kuck, D.: High Performance Computing: Challenges for Future Systems. Oxford University
Press, New York (1996)
29. Kuck, D.: What do users of parallel computer systems really need? Int. J. Parallel Program.
22(1), 99–127 (1994). doi:10.1007/BF02577794. http://dx.doi.org/10.1007/BF02577794
30. Kumar, V., Gupta, A.: Analyzing scalability of parallel algorithms and architectures. J. Parallel
Distrib. Comput. 22(3), 379–391 (1994)
31. Worley, P.H.: The effect of time constraints on scaled speedup. SIAM J. Sci. Stat. Comput.
11(5), 838–858 (1990)
32. Gustafson, J.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)
33. Grama, A., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms
and architectures. IEEE Parallel Distrib. Technol. 12–21 (1993)
Chapter 2
Fundamental Kernels
In this chapter we discuss the fundamental operations that are the building blocks
of dense and sparse matrix computations. They are termed kernels because in most
cases they account for most of the computational effort. Because of this, their
implementation directly impacts the overall efficiency of the computation. They occur
often at the lowest level where parallelism is expressed.
Most basic kernels are of the form C = C + AB, where A, B and C can be
matrix, vector and possibly scalar operands of appropriate dimensions. For dense
matrices, the community has converged on a standard application programming
interface, termed Basic Linear Algebra Subroutines (BLAS), with specific syntax
and semantics. The set is organized into three separate levels of instructions. The
first part of this chapter describes these sets. It then considers several basic sparse
matrix operations that are essential for the implementation of algorithms presented
in future chapters. In this chapter we frequently make explicit reference to communi-
cation costs, on account of the well known growing discrepancy, in the performance
characteristics of computer systems, between the rate of performing computations
(typically measured by a base unit of the form flops per second) and the rate of
moving data (typically measured by a base unit of the form words per second).
2.1 Vector Operations

A common feature of these instructions is that the minimal amount of data that needs
to be read (loaded) into memory and then stored back in order for the operation to take
place is O(n). Moreover, the number of computations required on a uniprocessor
is also O(n). Therefore, the ratio of instructions that load from and store to memory
relative to purely arithmetic operations is O(1).
With p = n processors, the _AXPY primitive requires 2 steps, which yields
a perfect speedup of n. The _DOT primitive involves a reduction, namely the sum of n
numbers to obtain a scalar. We assume temporarily, for the sake of clarity, that n = 2^m.
At the first step, each processor computes the product of two components, and the
result can be expressed as the vector (s_i^{(0)})_{1:n}. This computation is then followed by
m steps such that at each step k the vector (s_i^{(k−1)})_{1:2^{m−k+1}} is transformed into the
vector (s_i^{(k)})_{1:2^{m−k}} by computing in parallel s_i^{(k)} = s_{2i−1}^{(k−1)} + s_{2i}^{(k−1)},
for i = 1, ..., 2^{m−k}, with the final result being the scalar s_1^{(m)}. Therefore, the inner product
consumes T_p = m + 1 = (1 + log n) steps, with a speedup of S_p = 2n/(1 + log n)
and an efficiency of E_p = 2/(1 + log n).
On vector processors, these procedures can obtain high performance, especially
for the _AXPY primitive which allows chaining of the pipelines for multiplication
and addition.
Implementing these instructions on parallel architectures is not a difficult task. It
is realized by splitting the vectors in slices of the same length, with each processor
performing the operation on its own subvectors. For the _DOT operation, there is an
additional summation of all the partial results to obtain the final scalar. Following
that, this result has to be broadcast to all the processors. These final steps entail extra
costs for data movement and synchronization, especially for computer systems with
distributed memory and a large number of processors.
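On a shared memory machine, the slicing and the final summation just described are expressed compactly with an OpenMP reduction, as in the following sketch (the vector arguments are hypothetical); each thread accumulates the partial sum of its slice, and the partial results are combined when the loop ends.

#include <stddef.h>

/* _DOT-like kernel: partial sums per thread, combined by the reduction clause */
double dot(size_t n, const double *x, const double *y)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+ : s) schedule(static)
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}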
We analyze this issue in greater detail next, departing on this occasion from the
assumptions made in Chap. 1 and taking explicitly into account the communication
costs in evaluating T_p. The message is that inner products are harmful to the parallel
performance of many algorithms.
Inner Products Inhibit Parallel Scalability
A major part of this book deals with parallel algorithms for solving large sparse linear
systems of equations using preconditioned iterative schemes. The most effective
classes of these methods are dominated by a combination of a “global” inner product,
that is applied on vectors distributed across all the processors, followed by fan-
out operations. As we will show, the overheads involved in such operations cause
inefficiency and less than optimal speedup.
To illustrate this point, we consider such a combination in the form of the following
primitive for vectors u, v, w of size n that appears often in many computations:
w = w − (u^T v) u.   (2.1)
We assume that the vectors are stored in a consistent way to perform the operations
on the components (each processor stores slices of components of the two vectors
with identical indices). The _DOT primitive involves a reduction and therefore an all-
to-one (fan-in) communication. Since the result of a dot product is usually needed
by all processors in the sequel, the communication actually becomes an all-to-all
(fan-out) procedure.
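A minimal distributed memory sketch of the primitive (2.1) is the following (the slice length n_loc and the array names are assumptions): each process forms the partial inner product of its slices, and a single MPI_Allreduce performs the fan-in of the partial sums together with the fan-out of the result, after which the update of w is purely local.

#include <mpi.h>

/* w = w - (u'v) u on slices of length n_loc owned by each process */
void update_w(int n_loc, const double *u_loc, const double *v_loc,
              double *w_loc, MPI_Comm comm)
{
    double partial = 0.0, dot = 0.0;

    for (int i = 0; i < n_loc; i++)            /* local partial inner product */
        partial += u_loc[i] * v_loc[i];

    /* fan-in of the partial sums and fan-out of the result in one call */
    MPI_Allreduce(&partial, &dot, 1, MPI_DOUBLE, MPI_SUM, comm);

    for (int i = 0; i < n_loc; i++)            /* purely local update */
        w_loc[i] -= dot * u_loc[i];
}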
To evaluate the weak scalability on p processors (see Definition 1.1) by taking
communication into account (as mentioned earlier, we depart here from our usual
definition of T p ), let us assume that n = pq. The number of steps required by the prim-
itive (2.1) is 4q −1. Assuming no overlap between communication and computation,
the cost on p processors, T p ( pq), is the sum of the computational and communication
costs: T_p^cal(pq) and T_p^com(pq), respectively.
2.2 Higher Level BLAS

In order to increase efficiency, vector operations are often packed into a global task
of higher level. This occurs for the multiplication of a matrix by a vector which in
a code is usually expressed by a doubly nested loop. Classic kernels of this type are
gathered into the set known as Level_2 Basic Linear Algebraic Subroutines (BLAS2)
[2]. The most common operations of this type, assuming general matrices, are
_GEMV: given x ∈ R^n, y ∈ R^m and A ∈ R^{m×n}, this performs the matrix-vector
multiplication and accumulation y = y + Ax. It is also possible to multiply a (row)
vector by the matrix, and to scale the result before accumulating.
_TRSV: given b ∈ R^n and A ∈ R^{n×n} upper or lower triangular, this solves the
triangular system Ax = b.
_GER: given a scalar α, x ∈ R^m, y ∈ R^n and A ∈ R^{m×n}, this performs the rank-one
update A = A + α x y^T.
A common feature of these instructions is that the smallest amount of data that
needs to be read into memory and then stored back in order for the operation to take
place when m = n is O(n^2). Moreover, the number of computations required on a
uniprocessor is also O(n^2). Therefore, the ratio of instructions that load from and
store to memory relative to purely arithmetic ones is O(1). Typically, the constants
involved are a little smaller than those for the BLAS1 instructions. On the other hand,
these kernels are of far more interest in terms of efficiency.
Although of interest, the efficiencies realized by these kernels are easily surpassed
by those of (BLAS3) [3], where one additional loop level is considered, e.g. matrix
multiplication, and rank-k updates (k > 1). The next section of this chapter is devoted
to matrix-matrix multiplications.
The set of BLAS is designed for a uniprocessor and used in parallel programs in
the sequential mode. Thus, an efficient implementation of the BLAS is of the utmost
importance to enable high performance. Versions of the BLAS that are especially
fine-tuned for certain types of processors are available (e.g. the Intel Math Kernel
Library [4] or the open source set GotoBLAS [5]). Alternately, one can create a
parametrized BLAS set which can be tuned on any processor by an automatic code
optimizer, e.g. ATLAS [6, 7]. Yet, it is hard to outperform well designed methods
that are based on accurate architectural models and domain expertise; cf. [8, 9].
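With any of the tuned libraries just mentioned, a typical Level-3 operation such as the multiply-and-accumulate C = C + AB reduces to a single call through the C interface of the BLAS; the following sketch assumes row-major storage and hypothetical dimensions.

#include <cblas.h>

/* C (n1 x n3) += A (n1 x n2) * B (n2 x n3), all row-major */
void accumulate_product(int n1, int n2, int n3,
                        const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n1, n3, n2,
                1.0, A, n2,      /* alpha, A, lda */
                     B, n3,      /* B, ldb        */
                1.0, C, n3);     /* beta, C, ldc  */
}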
2.2.1 Dense Matrix Multiplication

The basic operation considered here is the multiply-and-accumulate
C = C + AB.   (2.2)
We adopt our discussion from [11], where the authors consider the situation of a
cluster of p processors with a common cache and count loads and stores in their
evaluation. We simplify that discussion and outline the implementation strategy for
a uniprocessor equipped with a cache memory that is characterized by fast access.
The purpose is to highlight some critical design decisions that will have to be faced by
the sequential as well as by the parallel algorithm designer. We assume that reading
one floating-point word of the type used in the multiplication from cache can be
accomplished in one clock period. Since the storage capacity of a cache memory is
limited, the goal of a code developer is to reuse, as much as possible, data stored in
the cache memory.
Let M be the storage capacity of the cache memory and let us assume that matrices
A, B and C are, respectively, n 1 × n 2 , n 2 × n 3 and n 1 × n 3 matrices. Partitioning
these matrices into blocks of sizes m 1 × m 2 , m 2 × m 3 and m 1 × m 3 , respectively,
where n i = m i ki for all i = 1, 2, 3, our goal is then to estimate the block sizes m i
which maximize data reuse under the cache size constraint.
Instruction (2.2) can be expressed as the nested loop,
do i = 1 : k_1,
   do k = 1 : k_2,
      do j = 1 : k_3,
         C_{ij} = C_{ij} + A_{ik} × B_{kj};
      end
   end
end
Further, since the blocks are obviously smaller than the original matrices, we need
the additional constraints:
1 ≤ m i ≤ n i for i = 1, 2, 3. (2.4)
Evaluating the volume of the data moves using the number of data loads necessary for
the whole procedure and assuming that the constraints (2.3) and (2.4) are satisfied,
we observe that
• all the blocks of the matrix A are loaded only once;
• the blocks of the matrix B are loaded k1 times;
• the blocks of the matrix C are loaded k2 times.
if n_1 n_2 ≤ M then
   m_1 = n_1 and m_2 = n_2;
else if n_2 ≤ √M then
   m_1 = M/n_2 and m_2 = n_2;
else if n_1 ≤ √M then
   m_1 = n_1 and m_2 = M/n_1;
else
   m_1 = √M and m_2 = √M;
end if
In practice, M should be slightly smaller than the total cache volume to allow for
storing the neglected vectors. With this parameter adjustment, at the innermost level,
the block multiplication involves 2m 1 m 2 operations and m 1 +m 2 loads as long as Aik
resides in the cache. This indicates why the matrix multiplication can be a compute
bound program.
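A sequential sketch of such a cache-blocked multiplication is given below; the block size is a placeholder to be tuned to the cache capacity M, and the dimensions are assumed to be multiples of the block size for brevity.

#define BS 64   /* assumed block size, to be tuned to the cache capacity */

/* C (n1 x n3) += A (n1 x n2) * B (n2 x n3), row-major,
   with n1, n2, n3 multiples of BS                       */
void blocked_gemm(int n1, int n2, int n3,
                  const double *A, const double *B, double *C)
{
    for (int i = 0; i < n1; i += BS)
        for (int k = 0; k < n2; k += BS)
            for (int j = 0; j < n3; j += BS)
                /* block update C_{ij} = C_{ij} + A_{ik} B_{kj} */
                for (int ii = i; ii < i + BS; ii++)
                    for (int kk = k; kk < k + BS; kk++) {
                        double aik = A[ii * n2 + kk];
                        for (int jj = j; jj < j + BS; jj++)
                            C[ii * n3 + jj] += aik * B[kk * n3 + jj];
                    }
}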
These considerations reveal the important decisions that need to be made by the code
developer, and lead to a scheme that is very similar to the parallel multiplication.
In [11], the authors consider the situation of a cluster of p processors with a
common cache and count loads and stores in their evaluation. However, the final
decision tree is similar to the one presented here.
2.2.2 Lowering Complexity via the Strassen Algorithm

The classical multiplication algorithms implementing the operation (2.2) for dense
matrices use 2n^3 operations. We next describe the scheme proposed by Strassen,
which reduces the number of operations in the procedure [12]: assuming that n is
even, the operands can be decomposed into 2 × 2 matrices of n/2 × n/2 blocks:
$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} =
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}. \qquad (2.6)$
Then, the multiplication can be performed by the following operations on the blocks
T_0 = A_{11},           S_0 = B_{11},           Q_0 = T_0 S_0,   U_1 = Q_0 + Q_3,
T_1 = A_{12},           S_1 = B_{21},           Q_1 = T_1 S_1,   U_2 = U_1 + Q_4,
T_2 = A_{21} + A_{22},  S_2 = B_{12} − B_{11},  Q_2 = T_2 S_2,   U_3 = U_1 + Q_2,
T_3 = T_2 − A_{11},     S_3 = B_{22} − S_2,     Q_3 = T_3 S_3,   C_{11} = Q_0 + Q_1,   (2.8)
T_4 = A_{11} − A_{21},  S_4 = B_{22} − B_{12},  Q_4 = T_4 S_4,   C_{12} = U_3 + Q_5,
T_5 = A_{12} − T_3,     S_5 = B_{22},           Q_5 = T_5 S_5,   C_{21} = U_2 − Q_6,
T_6 = A_{22},           S_6 = S_3 − B_{21},     Q_6 = T_6 S_6,   C_{22} = U_2 + Q_2.
Clearly, (2.7) and (2.8) are still valid for rectangular blocks. If n = 2^γ, the approach
can be repeated for implementing the multiplications of the blocks. If it is recursively
applied up to 2 × 2 blocks, the total complexity of the process becomes O(n^{ω_0}), where
ω_0 = log 7. More generally, if the process is iteratively applied until we get blocks
of order m ≤ n_0, the total number of operations is
T(n) = c_s n^{ω_0} − 5n^2,   (2.9)
with c_s = (2n_0 + 4)/n_0^{ω_0 − 2}, which achieves its minimum for n_0 = 8; cf. [14].
The numerical stability of the above methods has been considered by several
authors. In [15], it is shown that the rounding errors in the Strassen algorithm can
be worse than those in the classical algorithm for multiplying two matrices, with the
situation somewhat worse in Winograd’s algorithm. However, ref. [15] indicates that
it is possible to get a fast and stable version of _GEMM by incorporating in it steps
from the Strassen or Winograd-type algorithms.
Both the Strassen algorithm (2.7), and the Winograd version (2.8), can be imple-
mented on parallel architectures. In particular, the seven block multiplications are
independent, as well as most of the block additions. Moreover, each of these opera-
tions has yet another inner level of parallelism.
2.2.3 Accelerating the Multiplication of Complex Matrices

Savings may be realized in multiplying two complex matrices, e.g. see [18]. Let
A = A_1 + iA_2 and B = B_1 + iB_2 be two complex matrices, where A_j, B_j ∈ R^{n×n}
for j = 1, 2. The real and imaginary parts C_1 and C_2 of the matrix C = AB can
be obtained using only three multiplications of real matrices (and not four as in the
classical expression):
T_1 = A_1 B_1,    T_2 = A_2 B_2,
C_1 = T_1 − T_2,   C_2 = (A_1 + A_2)(B_1 + B_2) − T_1 − T_2.   (2.10)
The savings are realized through the way the imaginary part C2 is computed. Unfor-
tunately, the above formulation may suffer from catastrophic cancellations, [18].
For large n, there is a 25 % benefit in arithmetic operations over the conven-
tional approach. Although remarkable, this benefit does not lower the complexity
which remains the same, i.e. O(n^3). To push this advantage further, one may use
Strassen's approach in the three matrix multiplications above to realize O(n^{ω_0})
arithmetic operations.
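A compact sketch of the scheme (2.10), for square matrices stored as separate real and imaginary parts, is given below; the helper routines are placeholders and could equally call _GEMM or a Strassen-type multiplication, as noted above.

#include <stdlib.h>

/* helpers on n x n row-major matrices (placeholders for BLAS calls) */
static void mat_add(int n, const double *X, const double *Y, double *Z)
{
    for (int i = 0; i < n * n; i++) Z[i] = X[i] + Y[i];
}

static void mat_mul(int n, const double *X, const double *Y, double *Z)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += X[i * n + k] * Y[k * n + j];
            Z[i * n + j] = s;
        }
}

/* C = A B with A = A1 + i A2 and B = B1 + i B2, using three real products */
void gemm3m(int n, const double *A1, const double *A2,
            const double *B1, const double *B2, double *C1, double *C2)
{
    size_t sz = (size_t)n * n * sizeof(double);
    double *T1 = malloc(sz), *T2 = malloc(sz), *SA = malloc(sz), *SB = malloc(sz);

    mat_mul(n, A1, B1, T1);                /* T1 = A1 B1          */
    mat_mul(n, A2, B2, T2);                /* T2 = A2 B2          */
    mat_add(n, A1, A2, SA);                /* SA = A1 + A2        */
    mat_add(n, B1, B2, SB);                /* SB = B1 + B2        */
    mat_mul(n, SA, SB, C2);                /* C2 = (A1+A2)(B1+B2) */

    for (int i = 0; i < n * n; i++) {
        C1[i] = T1[i] - T2[i];             /* real part           */
        C2[i] -= T1[i] + T2[i];            /* imaginary part      */
    }
    free(T1); free(T2); free(SA); free(SB);
}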
Parallelism is achieved at several levels:
• All the matrix operations are additions and multiplications. They can be imple-
mented with full efficiency. In addition, the multiplication can be realized through
the Strassen algorithm as implemented in CAPS, see Sect. 2.2.2.
• The three matrix multiplications are independent, once the two additions are per-
formed.
2.3 General Organization for Dense Matrix Factorizations
In this section, we describe the usual techniques for expressing parallelism in the
factorization schemes (i.e. the algorithms that compute any of the well-known decom-
positions such as LU, Cholesky, or QR). More specific factorizations are included in
the ensuing chapters of the book.
2.3.1 Fan-Out and Fan-In Versions

Factorization schemes can be based on one of two basic templates: the fan-out
template (see Algorithm 2.1) and the fan-in version (see Algorithm 2.2). Each of these
templates involves two basic procedures which we generically call compute(j) and
update(j, k). The two versions, however, differ only by a single loop interchange.
The above implementations are also known, respectively, as the right-looking and
the left-looking versions. The exact definitions of the basic procedures, when applied
to a given matrix A, are displayed in Table 2.1 together with their arithmetic complex-
ities on a uniprocessor. They are based on a column oriented organization. For the
analysis of loop dependencies, it is important to consider that column j is unchanged
by task update( j, k) whereas column k is overwritten by the same task; column j is
overwritten by task compute( j).
The two versions are based on vector operations (i.e. BLAS1). It can be seen,
however, that for a given j, the inner loop of the fan-out algorithm is a rank-one
update (i.e. BLAS2), with a special feature for the Cholesky factorization, where
only the lower triangular part of A is updated.
Table 2.1 Elementary factorization procedures; MATLAB index notation used for submatrices
Factorization   Procedures                                      Complexity
Cholesky        compute(j): A(j:n, j) = A(j:n, j)/√A(j, j)      (1/3) n^3 + O(n^2)
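A minimal C sketch of the fan-out (right-looking) template for the Cholesky case, with compute(j) and update(j, k) written out as in Table 2.1 and the doall nature of the inner loop indicated by an OpenMP directive, is the following; it is meant only to illustrate the template (column-major storage, no pivoting or error handling).

#include <math.h>

/* Fan-out Cholesky sketch on a column-major n x n array A; on exit the
   lower triangle of A holds the Cholesky factor.                       */
void fanout_cholesky(int n, double *A)
{
    for (int j = 0; j < n; j++) {
        /* compute(j): A(j:n, j) = A(j:n, j) / sqrt(A(j, j)) */
        double d = sqrt(A[j + j * n]);
        for (int i = j; i < n; i++)
            A[i + j * n] /= d;

        /* update(j, k): the iterations over k are independent (doall) */
        #pragma omp parallel for
        for (int k = j + 1; k < n; k++)
            for (int i = k; i < n; i++)        /* lower triangular part only */
                A[i + k * n] -= A[i + j * n] * A[k + j * n];
    }
}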
2.3.2 Parallelism in the Fan-Out Version

In the fan-out version, the inner loop (loop k) of Algorithm 2.1 involves independent
iterations whereas in the fan-in version, the inner loop (loop j) of Algorithm 2.2
must be sequential because of a recursion on vector k.
The inner loop of Algorithm 2.1 can be expressed as a doall loop. The resulting
algorithm is referred to as Algorithm 2.3.
At the outer iteration j, there are n − j independent tasks with identical cost.
When the outer loop is regarded as a sequential one, idle processors will result at
the end of most of the outer iterations. Let p be the number of processors used,
and for the sake of simplicity, let n = pq + 1 and assume that the time spent
by one processor in executing task compute( j) or task update( j, k) is the same
which is taken as the time unit. Note that this last assumption is valid only for
the Gram-Schmidt orthogonalization, since for the other algorithms, the cost of
task compute( j) and task update( j, k) are proportional to n − j or even smaller
for the Cholesky factorization. A simple computation shows that the sequential
process consumes T_1 = n(n + 1)/2 steps, whereas the parallel process on p processors
consumes T_p = 1 + p Σ_{i=2}^{q+1} i = pq(q + 3)/2 + 1 = (n − 1)(n − 1 + 3p)/(2p) + 1 steps.
Table 2.2 Benefit of pipelining the outer loop in MGS (QR factorization).
Notation: C(j) = compute(j), U(j,k) = update(j,k).

(a) Sequential outer loop:
  step  parallel runs
   1    C(1)
   2    U(1,2) U(1,3) U(1,4) U(1,5)
   3    U(1,6) U(1,7) U(1,8) U(1,9)
   4    C(2)
   5    U(2,3) U(2,4) U(2,5) U(2,6)
   6    U(2,7) U(2,8) U(2,9)
   7    C(3)
   8    U(3,4) U(3,5) U(3,6) U(3,7)
   9    U(3,8) U(3,9)
  10    C(4)
  11    U(4,5) U(4,6) U(4,7) U(4,8)
  12    U(4,9)
  13    C(5)
  14    U(5,6) U(5,7) U(5,8) U(5,9)
  15    C(6)
  16    U(6,7) U(6,8) U(6,9)
  17    C(7)
  18    U(7,8) U(7,9)
  19    C(8)
  20    U(8,9)
  21    C(9)

(b) doacross outer loop:
  step  parallel runs
   1    C(1)
   2    U(1,2) U(1,3) U(1,4) U(1,5)
   3    C(2) U(1,6) U(1,7) U(1,8)
   4    U(1,9) U(2,3) U(2,4) U(2,5)
   5    C(3) U(2,6) U(2,7) U(2,8)
   6    U(2,9) U(3,4) U(3,5) U(3,6)
   7    C(4) U(3,7) U(3,8) U(3,9)
   8    U(4,5) U(4,6) U(4,7) U(4,8)
   9    C(5) U(4,9)
  10    U(5,6) U(5,7) U(5,8) U(5,9)
  11    C(6)
  12    U(6,7) U(6,8) U(6,9)
  13    C(7)
  14    U(7,8) U(7,9)
  15    C(8)
  16    U(8,9)
  17    C(9)
[Fig. 2.1 Efficiencies of the doall approach with a sequential outer loop in MGS: efficiency versus vector length (100 to 1000) for p = 4, 8, 16, 32, 64.]
2.3.3 Data Allocation for Distributed Memory

The previous analysis is valid for shared or distributed memory architectures. However,
for distributed memory systems we need to discuss the data allocation. As an
illustration consider a ring of p processors, numbered from 0 to p − 1, on which r
consecutive columns of A are stored in a round-robin mode. By denoting j̃ = j − 1,
column j is stored on processor t when j̃ = r(pv + t) + s with 0 ≤ s < r and
0 ≤ t < p.
As soon as column j is ready, it is broadcast to the rest of the processors so they
can start tasks update( j, k) for the columns k which they own. This implements the
doacross/doall strategy of the fan-out approach, listed as Algorithm 2.5.
To reduce the number of messages, one may transfer only the blocks of r consecutive
vectors when they are all ready to be used (i.e. the corresponding compute(j)
tasks are completed). The drawback of this option is that it increases the periods during
which there are idle processors. Therefore, the block size r must be chosen so as to
obtain a better trade-off between using as many processors as possible and reduc-
ing communication cost. Clearly, the optimum value is architecture dependent as it
depends on the smallest efficient task granularity.
The discussion above could easily be extended to the case of a torus configuration
where each processor of the previous ring is replaced by a ring of q processors. Every
column of the matrix A is now distributed into slices on the corresponding ring in
a round-robin mode. This, in turn, implies global communication in each ring of q
processors.
2.3.4 Block Versions and Numerical Libraries

Well designed block algorithms for matrix multiplication and rank-k updates on
hierarchical machines with multiple levels of memory and parallelism are of critical
importance for designing solvers that achieve high performance and scalability for
the problems considered in this chapter. The library LAPACK [20], which solves the
classic matrix problems, is a case in point, being based on BLAS3, as is its parallel
version ScaLAPACK [21]:
• LAPACK: This is the main reference for a software library for numerical linear
algebra. It provides routines for solving systems of linear equations and linear least
squares, eigenvalue problems, and singular value decomposition. The involved
matrices can be stored as dense matrices or band matrices. The procedures are
based on BLAS3 and are proved to be backward stable. LAPACK was originally
written in FORTRAN 77, but moved to Fortran 90 in version 3.2 (2008).
• ScaLAPACK: This library can be seen as the parallel version of the LAPACK
library for distributed memory architectures. It is based on the Message Passing
Interface standard MPI [22]. Matrices and vectors are stored on a process grid
into a two-dimensional block-cyclic distribution. The library is often chosen as
the reference against which any newly developed procedure is compared. (A minimal
calling sketch of the sequential LAPACK interface is shown below.)
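As an illustration of the calling sequence (a sketch using the LAPACKE C interface, with a single hypothetical right-hand side), a dense system Ax = b is solved by one driver call:

#include <lapacke.h>
#include <stdlib.h>

/* Solve A x = b for a dense n x n system; A is overwritten by its LU
   factors and b by the solution. Returns the LAPACK info code.       */
int solve_dense(int n, double *A, double *b)
{
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);
    lapack_int info  = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1,
                                     A, n,      /* matrix and its leading dimension */
                                     ipiv,      /* pivot indices                    */
                                     b, 1);     /* right-hand side and its ldb      */
    free(ipiv);
    return (int)info;
}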
In fact, many users became fully aware of these gains even when using high-
level problem solving environments like MATLAB (cf. [23]). As early works on the
subject had shown (we consider it rewarding for the reader to consider the pioneering
analyses in [11, 24]), the task of designing primitives is far from simple, if one desires
to provide a design that closely resembles the target computer model. The task
becomes more difficult as the complexity of the computer architectures increases.
It becomes even harder when the target is to build methods that can deliver high
performance for a spectrum of computer architectures.
2.4 Sparse Matrix Computations

Methods designed for dense matrix computations are rarely suitable for sparse
matrices since they quickly destroy the sparsity of the original matrix, leading to
the need to store a much larger number of nonzeros. However, with the avail-
ability of large memory capacities in new architectures, factorization methods (LU
and QR) exist that control fill-in and manage the needed extra storage. We do not
present such algorithms in this book but refer the reader to existing literature, e.g. see
[26–28]. Another option is to use matrix-free methods in which the sparse matrix is
not generated explicitly but used as an operator through the matrix-vector multipli-
cation kernel.
2.4.1 Sparse Matrix Storage and Matrix-Vector Multiplication Schemes

To make large scale computations that involve sparse matrices feasible, such matrices
are encoded in a suitable sparse matrix storage format in which only the nonzero
elements of the matrix are stored, together with sufficient information regarding their
row and column locations to access them in the course of operations.
Let A = (αi j ) ∈ Rn×n be a sparse matrix, and nnz the number of nonzero entries
in A.
Definition 2.1 (Graph of a sparse matrix) The graph of the matrix is given by the
pair of nodes and edges (< 1 : n >, G), where G is characterized by
(i, j) ∈ G iff α_{ij} ≠ 0.
The corresponding MV kernel is given by Algorithm 2.6. The inner loop implements
a sparse inner product through a so-called gather procedure.
The corresponding MV kernel is given by Algorithm 2.7. The inner loop implements
a sparse _AXPY through a so-called scatter procedure.
The corresponding MV kernel is given by Algorithm 2.8. It involves both the scatter
and gather procedures.
The MTV kernel, which implements the multiplication by the transpose of the matrix,
is expressed for a CRS-stored matrix by Algorithm 2.7 and for a CCS-stored one by
Algorithm 2.6. For a COO-stored matrix, the algorithm is obtained by inverting the
roles of the arrays ia and ja in Algorithm 2.8.
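The two storage-dependent kernels can be sketched in C as follows (0-based indexing; the array names ia, ja and a for the pointers, indices and values are assumptions, and the sketches are not reproductions of Algorithms 2.6 and 2.7): the first is the gather-based kernel natural for a CRS-stored matrix, the second the scatter-based kernel natural for a CCS-stored one.

/* w = w + A v for A in CRS format: ia holds the n+1 row pointers,
   ja the column indices, a the values; the inner loop gathers entries of v. */
void mv_crs(int n, const int *ia, const int *ja, const double *a,
            const double *v, double *w)
{
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = ia[i]; k < ia[i + 1]; k++)
            s += a[k] * v[ja[k]];              /* gather */
        w[i] += s;
    }
}

/* w = w + A v for A in CCS format: ia holds the n+1 column pointers,
   ja the row indices, a the values; the inner loop scatters into w.   */
void mv_ccs(int n, const int *ia, const int *ja, const double *a,
            const double *v, double *w)
{
    for (int j = 0; j < n; j++)
        for (int k = ia[j]; k < ia[j + 1]; k++)
            w[ja[k]] += a[k] * v[j];           /* scatter (sparse _AXPY) */
}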
Nowadays, the scatter-gather procedures (see step 3 in Algorithm 2.6 and step 3
in Algorithm 2.7) are pipelined on architectures allowing vector computations.
However, their startup time is often large (i.e., the order of magnitude of n_{1/2}, as defined
in Sect. 1.1.2, is in the hundreds; if the matrix a in MV were dense, n_{1/2} would be in the
tens). The vector lengths in Algorithms 2.6 and 2.7 are determined by the number of
nonzero entries per row or per column. They often are so small that the computations
are run at sequential computational rates. There have been many attempts to define
sparse storage formats that favor larger vector lengths (e.g. see the jagged diagonal
format mentioned in [30, 31]).
An efficient storage format which combines the advantages of dense and sparse
matrix computations attempts to define a square block structure of a sparse matrix in
which most of the blocks are empty. The nonempty blocks are stored in any of the
above formats, e.g. CRS, or in regular dense storage, depending on the sparsity
density. Such a sparse storage format is called either Block Compressed Row
storage (BCRS), where the sparse nonempty blocks are stored using the CRS format,
or Block Compressed Column storage (BCCS), where the sparse nonempty blocks
are stored using the CCS format.
Basic Implementation on Distributed Memory Architecture
Let us consider the implementation of w = w + Av and w = w + A^T v on a distributed
memory parallel architecture with p processors, where A ∈ R^{n×n} and v, w ∈ R^n. The
first stage consists of partitioning the matrix and allocating the respective parts to the
local processor memories. Each processor Pq with q = 1, . . . , p, receives a block
of rows of A and the corresponding slices of the vectors v and w:
$A = \begin{pmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,p} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,p} \\
\vdots  & \vdots  &        & \vdots  \\
A_{p,1} & A_{p,2} & \cdots & A_{p,p}
\end{pmatrix}, \quad
v = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_p \end{pmatrix}, \quad
w = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_p \end{pmatrix},$
with processor P_q holding block row q of A and the slices v_q and w_q.
The efficiencies of the two procedures MV and MTV are often quite different,
depending on the chosen sparse storage format.
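A minimal distributed sketch of w = w + Av with this block-row layout is the following (each process owns n_loc rows in CRS format with global column indices; the counts and displs arrays describing the slice sizes, as well as the scratch vector v_full of global length, are assumptions): the slices of v are first assembled on every process with MPI_Allgatherv, after which the product is purely local.

#include <mpi.h>

/* w_loc = w_loc + A_loc v, where A_loc holds n_loc rows of A in CRS format */
void mv_blockrow(int n_loc, const int *ia, const int *ja, const double *a,
                 const double *v_loc, double *w_loc, double *v_full,
                 const int *counts, const int *displs, MPI_Comm comm)
{
    /* gather all slices of v so that every process holds the full vector */
    MPI_Allgatherv(v_loc, n_loc, MPI_DOUBLE,
                   v_full, counts, displs, MPI_DOUBLE, comm);

    for (int i = 0; i < n_loc; i++) {          /* local CRS matrix-vector product */
        double s = 0.0;
        for (int k = ia[i]; k < ia[i + 1]; k++)
            s += a[k] * v_full[ja[k]];
        w_loc[i] += s;
    }
}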
[Fig. 2.2 Partition of a block-diagonal matrix with overlapping blocks A_1, ..., A_p; the vector v is decomposed in overlapping slices ν_1, ..., ν_p.]
where β is the latency for a message and τc the time for sending a word to an
immediate neighbouring node regardless of the latency. Since T p is independent of
p, weak scalability is assured.
2.4.2 Matrix Reordering Schemes

Algorithms for reordering sparse matrices play a vital role in enhancing the parallel
scalability of various sparse matrix algorithms and their underlying primitives, e.g.,
see [32, 33].
Hence, depending on the original matrix A, the matrix B can be extracted as:
(a) “narrow-banded” of bandwidth β much smaller than the order n of the matrix
A, i.e., β = 10^{-4} n for n ≥ 10^6, for example (the most fortunate situation), see
Fig. 2.5,
(b) “medium-banded”, i.e., of the block-tridiagonal form [H, G, J ], in which the
elements of the off-diagonal blocks H and J are all zero except for their small
upper-right and lower-left corners, respectively, see Fig. 2.6, or
(c) “wide-banded”, i.e., consisting of overlapped diagonal blocks, in which each
diagonal block is a sparse matrix, see Fig. 2.7.
The motivation for desiring such a reordering scheme is three-fold. First, B can
be used as a preconditioner of a Krylov subspace method when solving a linear
system Ax = f of order n. Since E is of a rank p much less than n, the precondi-
tioned Krylov subspace scheme will converge quickly. In exact arithmetic, the Krylov
subspace method will converge in exactly p iterations. In floating-point arithmetic,
however, this translates into the method achieving small relative residuals in less
than p iterations. Second, since we require the diagonal of B to be zero-free with the
product of its entries maximized, and that the Frobenius norm of B is close to that of
A, this will enhance the possibility that B is nonsingular, or close to a nonsingular
matrix. Third, multiplying C by a vector can be implemented on a parallel archi-
tecture with higher efficiency by splitting the operation into two parts: multiplying
the “generalized-banded” matrix B by a vector, and a low-rank sparse matrix E by
a vector. The former, e.g. v = Bu, can be achieved with high parallel scalability on
distributed-memory architectures requiring only nearest neighbor communication,
e.g. see Sect. 2.4.1 for the scalable parallel implementation of an overlapped block
diagonal matrix-vector multiplication scheme. The latter, e.g. w = Eu, however,
incurs much less irregular addressing penalty compared to y = Au since E contains
far fewer nonzero entries than A.
Since A is nonsymmetric, in general, we could reduce its profile by using RCM
(i.e. via symmetric permutations only) applied to (|A| + |A^T|), [40], or by using
the spectral reordering introduced in [41]; see also [42]. However, this will neither
realize a zero-free diagonal, nor ensure bringing the heaviest off-diagonal elements
close to the diagonal. Consequently, RCM alone will not realize a central “band”
B with its Frobenius norm satisfying ‖B‖_F ≥ (1 − ε) ‖A‖_F. In order to solve
this weighted bandwidth reduction problem, we use a weighted spectral reordering
technique which is a generalization of spectral reordering. To alleviate the shortcom-
ings of using only symmetric permutations, and assuming that the matrix A is not
structurally singular, this weighted spectral reordering will need to be coupled with
i.e., to minimize the maximum distance of a nonzero entry from the main diagonal.
Let us assume for the time being that A is a symmetric matrix, and that we aim at
extracting a central band B = (βi j ) of minimum bandwidth such that, for a given
tolerance ε,
Σ_{i,j} |α_{ij} − β_{ij}| / Σ_{i,j} |α_{ij}| ≤ ε,   (2.16)
and
β_{ij} = α_{ij} if |i − j| ≤ k,   β_{ij} = 0 otherwise.   (2.17)
The idea behind this formulation is that if a significant part of the matrix is packed
into a central band B, then the rest of the nonzero entries can be dropped to obtain an
effective preconditioner. In order to find a heuristic solution to the weighted band-
width reduction problem, we use a generalization of spectral reordering. Spectral
reordering is a linear algebraic technique that is commonly used to obtain approx-
imate solutions to various intractable graph optimization problems [51]. It has also
been successfully applied to the bandwidth and envelope reduction problems for
sparse matrices [41]. The core idea of spectral reordering is to compute a vector
x = (ξ_i) that minimizes
σ_A(x) = Σ_{i,j: α_{ij} ≠ 0} (ξ_i − ξ_j)^2,   (2.18)
Note that the matrix L is positive semidefinite, and the smallest eigenvalue of this
matrix is equal to zero. The eigenvector x that minimizes σ_A(x) = x^T L x, such that
‖x‖_2 = 1 and x^T e = 0, is the eigenvector corresponding to the second smallest
eigenvalue of the Laplacian, i.e. the symmetric eigenvalue problem
Lx = λx,   (2.20)
and is known as the Fiedler vector. The Fiedler vector of a sparse matrix can be
computed efficiently using any of the eigensolvers discussed in Chap. 11, see also
[53].
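For a matrix of modest size, the Fiedler vector can be obtained with a dense symmetric eigensolver, as in the following sketch (LAPACKE interface; the routine and its arguments are placeholders, not the sparse eigensolvers of Chap. 11): it builds the Laplacian of a symmetric nonnegative weight matrix W, computes its eigenvectors, and returns the vertices ordered by increasing entries of the eigenvector associated with the second smallest eigenvalue.

#include <lapacke.h>
#include <stdlib.h>

static const double *g_f;                      /* Fiedler vector used by cmp */
static int cmp(const void *p, const void *q)
{
    double d = g_f[*(const int *)p] - g_f[*(const int *)q];
    return (d > 0) - (d < 0);
}

/* W: dense symmetric n x n weight matrix (row-major, w_ij >= 0, zero diagonal);
   perm: output ordering of the vertices by increasing Fiedler-vector entry.    */
int spectral_order(int n, const double *W, int *perm)
{
    double *L   = calloc((size_t)n * n, sizeof *L);
    double *eig = malloc((size_t)n * sizeof *eig);
    double *f   = malloc((size_t)n * sizeof *f);

    for (int i = 0; i < n; i++)                /* Laplacian: L = D - W */
        for (int j = 0; j < n; j++)
            if (i != j) {
                L[i * n + j]  = -W[i * n + j];
                L[i * n + i] +=  W[i * n + j];
            }

    /* eigenvalues in ascending order; eigenvectors overwrite L column-wise */
    int info = LAPACKE_dsyev(LAPACK_ROW_MAJOR, 'V', 'U', n, L, n, eig);
    if (info == 0) {
        for (int i = 0; i < n; i++) f[i] = L[i * n + 1];  /* second column */
        for (int i = 0; i < n; i++) perm[i] = i;
        g_f = f;
        qsort(perm, (size_t)n, sizeof *perm, cmp);
    }
    free(L); free(eig); free(f);
    return info;
}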
While spectral reordering is shown to be effective in bandwidth reduction, the
classical approach described above ignores the magnitude of nonzeros in the matrix.
Therefore, it is not directly applicable to the weighted bandwidth reduction problem.
However, Fiedler’s result can be directly generalized to the weighted case [54]. More
precisely, the reordering is now based on the eigenvector x that corresponds to the
second smallest eigenvalue of the weighted Laplacian L = (λ_{ij}), defined as

λ_{ij} = −|α_{ij}| if i ≠ j,   λ_{ii} = \sum_{j \neq i} |α_{ij}|.   (2.22)
We now show how weighted spectral reordering can be used to obtain a continuous
approximation to the weighted bandwidth reduction problem. For this purpose, we
first define the relative bandweight of a specified band of the matrix as follows:

w_k(A) = \frac{\sum_{i,j:\, |i−j|<k} |α_{ij}|}{\sum_{i,j} |α_{ij}|}.   (2.23)
Let Ā be the matrix obtained by reordering the rows and columns of A according to
π , i.e.,
Ā(πi , π j ) = αi j for 1 ≤ i, j ≤ n. (2.26)
σ̂_k(A) = \sum_{i,j} |α_{ij}|\, δ_k(i, j),   (2.27)

then

σ̂_k(A) = (1 − w_k(Ā)) \sum_{i,j} |α_{ij}|.   (2.28)

Therefore, for a fixed α, the α-bandwidth of the matrix Ā is equal to the smallest
k that satisfies σ̂_k(A) / \sum_{i,j} |α_{ij}| ≤ 1 − α.
Note that the problem of minimizing σ̄_x(A) is a continuous relaxation of the
problem of minimizing σ̂_k(A) for a given k. Therefore, the Fiedler vector of the
weighted Laplacian L provides a good basis for reordering A to minimize σ̂_k(A).
Consequently, for a fixed ε, this vector provides a heuristic solution to the problem
of finding a reordered matrix Ā = (ᾱ_{ij}) with minimum (1 − ε)-bandwidth. Once the
reordered matrix is obtained, we extract the central band B as in (2.17).
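A small dense sketch of the reordering step is given below (assumptions: the pattern is symmetrized via |A| + |A^T|, the Fiedler vector is obtained with a dense eigensolver for simplicity, and the helper names are illustrative, not the authors' implementation):

import numpy as np

# Sketch of weighted spectral reordering: build the weighted Laplacian of (2.22),
# take the eigenvector of its second-smallest eigenvalue (the Fiedler vector),
# and sort the unknowns by its entries.  The central band B is then extracted
# as in (2.17).
def weighted_spectral_ordering(A):
    W = np.abs(A) + np.abs(A.T)           # symmetrized weights
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W        # weighted Laplacian
    vals, vecs = np.linalg.eigh(L)        # dense eigensolver; fine for a sketch
    fiedler = vecs[:, 1]                  # eigenvector of 2nd smallest eigenvalue
    return np.argsort(fiedler)            # permutation pi

def extract_band(A_perm, k):
    i, j = np.indices(A_perm.shape)
    B = np.where(np.abs(i - j) <= k, A_perm, 0.0)
    return B, A_perm - B                  # central band and dropped remainder

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8)) * (rng.random((8, 8)) < 0.4)
p = weighted_spectral_ordering(A)
A_perm = A[np.ix_(p, p)]
B, E = extract_band(A_perm, k=2)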
References
1. Lawson, C., Hanson, R., Kincaid, D., Krogh, F.: Basic linear algebra subprograms for Fortran
usage. ACM Trans. Math. Softw. 5(3), 308–323 (1979)
2. Dongarra, J., Croz, J.D., Hammarling, S., Hanson, R.: An extended set of FORTRAN basic
linear algebra subprograms. ACM Trans. Math. Softw. 14(1), 1–17 (1988)
3. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
4. Intel company: Intel Math Kernel Library. http://software.intel.com/en-us/intel-mkl
28. Zlatev, Z.: Computational Methods for General Sparse Matrices, vol. 65. Kluwer Academic
Publishers, Dordrecht (1991)
29. Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H.: Templates for the Solution of
Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia (2000)
30. Melhem, R.: Toward efficient implementation of preconditioned conjugate gradient methods
on vector supercomputers. Int. J. Supercomput. Appl. 1(1), 70–98 (1987)
31. Philippe, B., Saad, Y.: Solving large sparse eigenvalue problems on supercomputers. Technical
report RIACS TR 88.38, NASA Ames Research Center (1988)
32. Schenk, O.: Combinatorial Scientific Computing. CRC Press, Switzerland (2012)
33. Kepner, J., Gilbert, J.: Graph Algorithms in the Language of Linear Algebra. SIAM, Philadel-
phia (2011)
34. George, J., Liu, J.: Computer Solutions of Large Sparse Positive Definite Systems. Prentice
Hall (1981)
35. Pissanetzky, S.: Sparse Matrix Technology. Academic Press, New York (1984)
36. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proceedings
of 24th National Conference Association Computer Machinery, pp. 157–172. ACM Publica-
tions, New York (1969)
37. Liu, W., Sherman, A.: Comparative analysis of the Cuthill-McKee and the reverse Cuthill-
McKee ordering algorithms for sparse matrices. SIAM J. Numer. Anal. 13, 198–213 (1976)
38. D’Azevedo, E.F., Forsyth, P.A., Tang, W.P.: Ordering methods for preconditioned conjugate
gradient methods applied to unstructured grid problems. SIAM J. Matrix Anal. 13(3), 944–961
(1992)
39. Duff, I., Meurant, G.: The effect of ordering on preconditioned conjugate gradients. BIT 29,
635–657 (1989)
40. Reid, J., Scott, J.: Reducing the total bandwidth of a sparse unsymmetric matrix. SIAM J.
Matrix Anal. Appl. 28(3), 805–821 (2005)
41. Barnard, S., Pothen, A., Simon, H.: A spectral algorithm for envelope reduction of sparse
matrices. Numer. Linear Algebra Appl. 2, 317–334 (1995)
42. Spielman, D., Teng, S.: Spectral partitioning works: planar graphs and finite element meshes.
Linear Algebra Appl. 421, 284–305 (2007)
43. Duff, I.: On algorithms for obtaining a maximum transversal. ACM Trans. Math. Softw. 7,
315–330 (1981)
44. Duff, I., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse matrix.
SIAM J. Matrix Anal. Appl. 22(4), 973–996 (2001)
45. Duff, I., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal
of sparse matrices. SIAM J. Matrix Anal. Appl. 20, 889–901 (1999)
46. The HSL mathematical software library. See http://www.hsl.rl.ac.uk/index.html
47. Tarjan, R.: Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160
(1972)
48. Cheriyan, J., Mehlhorn, K.: Algorithms for dense graphs and networks on the random access
computer. Algorithmica 15, 521–549 (1996)
49. Dijkstra, E.: A Discipline of Programming, Chapter 25. Prentice Hall, Englewood Cliffs (1976)
50. Manguoğlu, M., Koyutürk, M., Sameh, A., Grama, A.: Weighted matrix ordering and parallel
banded preconditioners for iterative linear system solvers. SIAM J. Sci. Comput. 32(3), 1201–
1216 (2010)
51. Hendrickson, B., Leland, R.: An improved spectral graph partitioning algorithm for mapping
parallel computations. SIAM J. Sci. Comput. 16(2), 452–469 (1995). http://citeseer.nj.nec.
com/hendrickson95improved.html
52. Fiedler, M.: Algebraic connectivity of graphs. Czechoslovak Math. J. 23, 298–305 (1973)
53. Kruyt, N.: A conjugate gradient method for the spectral partitioning of graphs. Parallel Comput.
22, 1493–1502 (1997)
54. Chan, P., Schlag, M., Zien, J.: Spectral k-way ratio-cut partitioning and clustering. IEEE Trans.
CAD-Integr. Circuits Syst. 13, 1088–1096 (1994)
Part II
Dense and Special Matrix Computations
Chapter 3
Recurrences and Triangular Systems
for some integer n. It is not difficult to verify that ψ_k + 8ψ_{k−1} = 1/k, which is a linear
recurrence for evaluating ψ_k, k ≥ 1, with ψ_0 = log_e(9/8).
Example 3.2 We wish to use a finite-difference method for solving the second-
order differential equation ψ′′ = φ(ξ, ψ) with the initial conditions ψ(0) = α and
ψ′(0) = γ. Replacing the derivatives ψ′ and ψ′′ by the differences

ψ′(ξ_k) ≈ \frac{ψ_{k+1} − ψ_{k−1}}{2h},   ψ′′(ξ_k) ≈ \frac{ψ_{k+1} − 2ψ_k + ψ_{k−1}}{h^2},
where h = ξk+1 − ξk , k ≥ 0, we obtain the linear recurrence relation
The undefined value ψ_{−1} can be eliminated using (3.1) (with k = 0) and (3.2).
Hence, the linear recurrence (3.1) can be started with ψ_0 = α and ψ_1 = ψ_0 + hγ +
\frac{1}{2} h^2 φ(ξ_0, ψ_0).
For example, the Chebyshev polynomials T_k(ξ) = cos(k cos^{−1} ξ), k = 0, 1, 2, . . . ,
satisfy the three-term recurrence T_{k+1}(ξ) − 2ξ T_k(ξ) + T_{k−1}(ξ) = 0.
ξ0 = β.
See also [3] for examples of recurrences in logic design and restructuring com-
pilers.
Stability Issues
It is well known that computations with recurrence relations are prone to error growth;
at each step, computations are performed and new errors generated by operating on
past data that may already be contaminated with errors [1]. This error propagation
and gradual accumulation could be catastrophic. For example, consider the linear
recurrence from Example 3.1,

ψ_k + 8ψ_{k−1} = \frac{1}{k}

with ψ_0 = log_e(9/8). Using three decimals (with rounding) throughout the eval-
uation of ψ_i, i ≥ 0, and taking ψ_0 ≈ 0.118, the recurrence yields ψ_1 ≈ 0.056,
ψ_2 ≈ 0.052, and ψ_3 ≈ −0.083. The reason for obtaining a negative ψ_3 (note that
ψ_k > 0 for all values of k) is that the initial error δ_0 in ψ_0 has been highly magnified
in ψ_3. In fact, even if we use exact arithmetic in evaluating ψ_1, ψ_2, ψ_3, . . . the initial
rounding error δ_0 propagates such that the error δ_k in ψ_k is given by (−8)^k δ_0. There-
fore, since |δ_0| ≤ 0.5 × 10^{−3} we get |δ_3| ≤ 0.256; a very high error, given that the true
value of ψ3 (to three decimals) is 0.028. Note that this numerical instability cannot be
eliminated by using higher precision (six decimals, say). Such wrong results will only
be postponed, and will show up at a later stage. The relation between the available
precision and the highest subscript k of a reasonably accurate ψk , however, will play
an important role in constructing stable parallel algorithms for handling recurrence
relations in general. Such numerical instability may be avoided by observing that the
definite integral of Example 3.1 decreases as the value of k increases. Hence, if we
assume that ψ_5 ≈ 0 (say) and evaluate the recurrence ψ_k + 8ψ_{k−1} = 1/k backwards,
we should obtain reasonably accurate values for at least ψ_0 and ψ_1, since in this
case δ_{k−1} = (−1/8)δ_k. Performing the calculations to three decimals, we obtain
ψ_4 ≈ 0.025, ψ_3 ≈ 0.028, ψ_2 ≈ 0.038, ψ_1 ≈ 0.058, and ψ_0 ≈ 0.118.
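The effect is easy to reproduce in a few lines of code (a sketch only, run in double precision rather than three-decimal arithmetic, so the forward recurrence only breaks down for somewhat larger k; the function names are illustrative):

import numpy as np

# psi_k = integral_0^1 x^k/(x+8) dx satisfies psi_k + 8*psi_{k-1} = 1/k.
# Forward evaluation amplifies the initial rounding error by (-8)^k; backward
# evaluation from a crude guess damps the error by a factor (-1/8) per step.
def forward(n, psi0):
    psi = [psi0]
    for k in range(1, n + 1):
        psi.append(1.0 / k - 8.0 * psi[-1])
    return psi

def backward(n, extra=10):
    psi = 0.0                              # crude guess psi_{n+extra} ~ 0
    for k in range(n + extra, n, -1):
        psi = (1.0 / k - psi) / 8.0        # psi_{k-1} = (1/k - psi_k)/8
    out = [psi]                            # this is psi_n
    for k in range(n, 0, -1):
        psi = (1.0 / k - psi) / 8.0
        out.append(psi)
    return out[::-1]                       # psi_0 .. psi_n

print(forward(20, np.log(9.0 / 8.0))[-1])  # psi_20 is swamped by the amplified rounding error
print(backward(20)[-1])                    # psi_20 computed backwards is accurate (about 0.0053)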
In the remainder of this section we discuss algorithms that are particularly suitable
for evaluating linear and certain nonlinear recurrence relations on parallel architec-
tures.
ξ_1 = φ_1,
ξ_i = φ_i − \sum_{j=k}^{i−1} λ_{ij} ξ_j,   i = 2, . . . , n, with k = max(1, i − m),   (3.4)

where

x = (ξ_1, ξ_2, . . . , ξ_n)^T,
f = (φ_1, φ_2, . . . , φ_n)^T, and
Here L is unit lower triangular with bandwidth m + 1, i.e., λ_{ii} = 1, and λ_{ij} = 0 for
i − j > m. The sequential algorithm given by (3.4), forward substitution, requires
2mn + O(m^2) arithmetic operations.
We present parallel algorithms for solving the unit lower triangular system (3.6)
for the following cases:
(i) dense systems (m = n − 1), denoted by R<n>,
(ii) banded systems (m ≪ n), denoted by R<n,m>, and
(iii) Toeplitz systems, denoted by R̂<n> and R̂<n,m>, where λ_{ij} = λ̃_{i−j}.
The kernels used as the basic building blocks of these algorithms will be those dense
BLAS primitives discussed in Chap. 2.
Here φ_i^{(1)} = φ_i − φ_1 λ_{i1}, i = 2, 3, . . . , n. The process may be repeated to obtain the
rest of the components of the solution vector. Assuming we have (n − 1) processors,
this algorithm requires 2(n − 1) parallel steps with no arithmetic redundancy. This
method is often referred to as the column-sweep algorithm; we list it as Algorithm 3.1
(CSweep). It is straightforward to show that the cost becomes 3(n − 1) parallel
operations for non-unit triangular systems. The column-sweep algorithm can be
Algorithm 3.1 CSweep: Column-sweep method for unit lower triangular system
Input: Lower triangular matrix L of order n with unit diagonal, right-hand side f
Output: Solution of L x = f
1: set φ_j^{(0)} = φ_j, j = 1, . . . , n   //that is, f^{(0)} = f
2: do i = 1 : n
3:    ξ_i = φ_i^{(i−1)}
4:    doall j = i + 1 : n
5:       φ_j^{(i)} = φ_j^{(i−1)} − φ_i^{(i−1)} λ_{j,i}   //compute f^{(i)} = N_i^{−1} f^{(i−1)}
6:    end
7: end
modified, however, to solve (3.6) in fewer parallel steps but with higher arithmetic
redundancy.
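A compact sketch of the column-sweep recursion is given below (sequential code; the inner update corresponds to the doall loop that would be distributed over processors; the function name is an assumption):

import numpy as np

# Sketch of Algorithm 3.1 (CSweep) for a unit lower triangular L.
def csweep(L, f):
    n = len(f)
    phi = f.astype(float).copy()          # phi holds the current f^{(i)}
    x = np.zeros(n)
    for i in range(n):
        x[i] = phi[i]
        # doall j = i+1 : n  -- independent updates, one per processor
        phi[i + 1:] -= x[i] * L[i + 1:, i]
    return x

L = np.tril(np.random.default_rng(2).standard_normal((6, 6)), -1) + np.eye(6)
f = np.arange(1.0, 7.0)
assert np.allclose(csweep(L, f), np.linalg.solve(L, f))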
Theorem 3.1 ([4, 5]) The triangular system of equations Lx = f, where L is a unit
lower triangular matrix of order n, can be solved in T_p = \frac{1}{2}\log^2 n + \frac{3}{2}\log n parallel
steps using no more than p = \frac{15}{1024} n^3 + O(n^2) processors, yielding an arithmetic
redundancy of R_p = O(n).
L = N1 N2 N3 . . . Nn−1 ,
where

N_j = \begin{pmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & λ_{j+1,j} & 1 & & \\
& & \vdots & & \ddots & \\
& & λ_{n,j} & & & 1
\end{pmatrix}   (3.8)
and the inverse of N_j is trivially obtained by reversing the signs of the λ_{ij} in (3.8), so that

x = L^{−1} f = N_{n−1}^{−1} N_{n−2}^{−1} \cdots N_2^{−1} N_1^{−1} f.   (3.9)

Forming the product (3.9) one factor at a time, as shown in Fig. 3.1, we obtain the column-sweep
algorithm which requires 2(n − 1) parallel steps to compute the solution vector x, given
(n − 1) processors.
Assuming that n is a power of 2 and utilizing a fan-in approach to compute

M_{n−1}^{(0)} \cdots M_1^{(0)} f

in parallel, the number of stages in Fig. 3.1 can be reduced from n − 1 to log n, as
shown in Fig. 3.2, where M_i^{(0)} = N_i^{−1}. This is the approach used in Algorithm 3.2
(DTS).
Fig. 3.1 Sequential solution of lower triangular system Lx = f using CSweep (column-sweep
Algorithm 3.1): the product N_{n−1}^{−1} N_{n−2}^{−1} \cdots N_3^{−1} N_2^{−1} N_1^{−1} f is formed one factor at a time

Fig. 3.2 Computation of solution of lower triangular system Lx = f using the fan-in approach of
DTS (Algorithm 3.2): the terms M_k^{(j)} and f^{(j)} are combined pairwise at each stage, ending with
f^{(\log n)} ≡ x

To derive the total costs, it is important to consider the structure of the terms
in the tree of Fig. 3.2. The initial terms M_i^{(0)}, i = 1, . . . , n − 1, can each be computed
using one parallel division of length n − i and a sign reversal. This requires at most
2 steps using at most n(n + 1)/2 processors.
The next important observation, which we show below, is that each M_k^{(j)} has a
maximum of 2^j + 1 elements in any given row. Therefore, the products

M_k^{(j+1)} = M_{2k+1}^{(j)} M_{2k}^{(j)}   (3.10)

f^{(j+1)} = M_1^{(j)} f^{(j)}   (3.11)

at stage j can each be formed in at most j + 2 parallel steps, so that

T_p = \sum_{j=0}^{\log n − 1} (j + 2) + 2 = \frac{1}{2}\log^2 n + \frac{3}{2}\log n.
We now show our claim about the number of independent inner products in
the pairwise products occurring at each stage. It is convenient, at each stage j =
1, . . . , \log n, to partition each M_k^{(j)} as follows:

M_k^{(j)} = \begin{pmatrix} I_q & & \\ & L_k^{(j)} & \\ & W_k^{(j)} & I_r \end{pmatrix},   (3.12)

where L_k^{(j)} is unit lower triangular of order s = 2^j, I_q and I_r are the identities
of order q = ks − 1 and r = (n + 1) − (k + 1)s, respectively, and W_k^{(j)} is of order r-by-s. For
j = 0, s = 1,

L_k^{(0)} = 1, and
W_k^{(0)} = −(λ_{k+1,k}, . . . , λ_{n,k})^T.
Observe that the number of nonzeros at the first stage, j = 0, in each row of the matrices
M_k^{(0)} is at most 2 = 2^0 + 1. Assume the result to be true at stage j. Partitioning W_k^{(j)}
as

W_k^{(j)} = \begin{pmatrix} U_k^{(j)} \\ V_k^{(j)} \end{pmatrix},

where U_k^{(j)} is a square matrix of order s, then from

M_k^{(j+1)} = M_{2k+1}^{(j)} M_{2k}^{(j)},

we obtain

L_k^{(j+1)} = \begin{pmatrix} L_{2k}^{(j)} & 0 \\ L_{2k+1}^{(j)} U_{2k}^{(j)} & L_{2k+1}^{(j)} \end{pmatrix},   (3.13)

and

W_k^{(j+1)} = (W_{2k+1}^{(j)} U_{2k}^{(j)} + V_{2k}^{(j)},\  W_{2k+1}^{(j)}).   (3.14)
From (3.12)–(3.14) it follows that the maximum number of nonzeros in each row
of M_k^{(j+1)} is 2^{j+1} + 1. Also, if we partition f^{(j)} as

f^{(j)} = \begin{pmatrix} g_1^{(j)} \\ g_2^{(j)} \\ g_3^{(j)} \end{pmatrix}

in which g_1^{(j)} is of order (s − 1), and g_2^{(j)} is of order s, then

g_1^{(j+1)} = g_1^{(j)},
g_2^{(j+1)} = L_1^{(j)} g_2^{(j)}, and
g_3^{(j+1)} = W_1^{(j)} g_2^{(j)} + g_3^{(j)},
with the first two partitions constituting the first (2s − 1) elements of the solution.
We next estimate the number of processors needed to accomplish DTS. The terms L_{2k+1}^{(j)} U_{2k}^{(j)}
in (3.13) and W_{2k+1}^{(j)} U_{2k}^{(j)} in (3.14) can be computed simultaneously. Each column
of the former requires s inner products of pairs of vectors of sizes 1, . . . , s. There-
fore, each column requires \sum_{i=1}^{s} i = s(s + 1)/2 processors. Moreover, the term
W_{2k+1}^{(j)} U_{2k}^{(j)} necessitates sr inner products of length s, where, as noted above, at this
stage r = (n + 1) − (k + 1)s. The total number of processors for this product therefore
is s^2 r. The total becomes s^2(s + 1)/2 + s^2 r and, substituting for r, we obtain that the
number of processors necessary is

p_M^{(j+1)}(k) = \frac{s^2}{2}(2n + 3) − \frac{s^3}{2}(2k + 1).

The remaining matrix addition in (3.14) can be performed with fewer processors.
Similarly, the evaluation of f^{(j+1)} requires p_f^{(j+1)} = \frac{1}{2}(2n + 3)s − \frac{3}{2}s^2 processors.
Therefore the total number of processors necessary for stage j + 1, where j =
0, 1, . . . , \log n − 2, is

p^{(j+1)} = \sum_{k=1}^{n/2s − 1} p_M^{(j+1)}(k) + p_f^{(j+1)}
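The fan-in organization itself can be illustrated with a dense sketch (an illustration only: full matrices stand in for the structured factors M_k^{(j)}, n is assumed to be a power of 2, and the function name is not the book's):

import numpy as np

# Dense sketch of the fan-in idea behind DTS: write L^{-1} = M_{n-1} ... M_1 with
# M_i = N_i^{-1}, then combine the factors (and the right-hand side) pairwise over
# log n stages, as in (3.10)-(3.11).  The structure results (3.12)-(3.14) are not
# exploited here.
def dts_fanin(L, f):
    n = len(f)                                 # assumed to be a power of 2
    Ms = []
    for i in range(n - 1):                     # M_i^{(0)} = N_i^{-1}
        N_inv = np.eye(n)
        N_inv[i + 1:, i] = -L[i + 1:, i]
        Ms.append(N_inv)
    vec = f.astype(float)
    while Ms:                                  # one stage per iteration
        vec = Ms[0] @ vec                      # f^{(j+1)} = M_1^{(j)} f^{(j)}
        Ms = [Ms[k + 1] @ Ms[k] for k in range(1, len(Ms) - 1, 2)]  # pairwise products
    return vec

L = np.tril(np.random.default_rng(3).standard_normal((8, 8)), -1) + np.eye(8)
f = np.random.default_rng(4).standard_normal(8)
assert np.allclose(dts_fanin(L, f), np.linalg.solve(L, f))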
In many practical applications, the order m of the linear recurrence (3.6) is much less
than n. For such a case, algorithm DTS for dense triangular systems can be modified
to yield the result in fewer parallel steps.
Theorem 3.2 ([5]) Let L be a banded unit lower triangular matrix of order n and
bandwidth m + 1, where m ≤ n/2, and λik = 0 for i − k > m. Then the system
L x = f can be solved in less than T p = (2 + log m) log n parallel steps using fewer
than p = m(m + 1)n/2 processors.
Proof The matrix L and the vector f can be written in the form
L = \begin{pmatrix}
L_1 & & & \\
R_1 & L_2 & & \\
& R_2 & L_3 & \\
& & \ddots & \ddots \\
& & R_{n/m − 1} & L_{n/m}
\end{pmatrix},  \quad
f = \begin{pmatrix} f_1 \\ f_2 \\ f_3 \\ \vdots \\ f_{n/m} \end{pmatrix},
where L_i and R_i are m×m unit lower triangular and upper triangular matrices, respec-
tively. Premultiplying both sides of Lx = f by the matrix D = diag(L_1^{−1}, . . . , L_{n/m}^{−1}),
we obtain the system L^{(0)} x = f^{(0)}, where
L^{(0)} = \begin{pmatrix}
I_m & & & \\
G_1^{(0)} & I_m & & \\
& G_2^{(0)} & I_m & \\
& & \ddots & \ddots \\
& & G_{n/m − 1}^{(0)} & I_m
\end{pmatrix},

with

L_1 f_1^{(0)} = f_1, and
L_i (G_{i−1}^{(0)}, f_i^{(0)}) = (R_{i−1}, f_i),   i = 2, 3, . . . , n/m.   (3.15)
From Theorem 3.1, we can show that solving the systems in (3.15) requires T^{(0)} =
\frac{1}{2}\log^2 m + \frac{3}{2}\log m parallel steps, using p^{(0)} = \frac{21}{128} m^2 n + O(mn) processors. Now
we form matrices D^{(j)}, j = 0, 1, . . . , \log(n/2m), such that if
then L^{(μ)} = I and x = f^{(μ)}, where μ ≡ \log(n/m). Each matrix L^{(j)} is of the form

L^{(j)} = \begin{pmatrix}
I_r & & & \\
G_1^{(j)} & I_r & & \\
& G_2^{(j)} & I_r & \\
& & \ddots & \ddots \\
& & G_{n/r − 1}^{(j)} & I_r
\end{pmatrix},
where r = 2^j · m. Therefore, D^{(j)} = diag((L_1^{(j)})^{−1}, . . . , (L_{n/2r}^{(j)})^{−1}), in which

(L_i^{(j)})^{−1} = \begin{pmatrix} I_r & \\ −G_{2i−1}^{(j)} & I_r \end{pmatrix},   i = 1, 2, . . . , \frac{n}{2r},

and

f_i^{(j+1)} = \begin{pmatrix} f_{2i−1}^{(j)} \\ −G_{2i−1}^{(j)} f_{2i−1}^{(j)} + f_{2i}^{(j)} \end{pmatrix},   i = 1, 2, . . . , \frac{n}{2r}.
Observing that all except the last m columns of each matrix G_i^{(j)} are zero, the products
G_{2i+1}^{(j)} G_{2i}^{(j)} and G_{2i−1}^{(j)} f_{2i−1}^{(j)}, for all i, can be evaluated simultaneously in 1 + \log m
parallel arithmetic operations using p′ = \frac{1}{2} m(m + 1)n − r m^2 processors. In one final
subtraction, we evaluate f_i^{(j+1)} and G_i^{(j+1)}, for all i, using p″ = p′/m processors.
Therefore, T(j + 1) = 2 + \log m parallel steps using p^{(j+1)} = \max\{p′, p″\} =
\frac{1}{2} m(m + 1)n − r m^2 processors. The total number of parallel steps is thus given by

T_p = \sum_{j=0}^{\log(n/m)} T(j) = (2 + \log m)\log n − \frac{1}{2}(\log m)(1 + \log m)   (3.16)

using

p ≡ p^{(1)} = \frac{1}{2} m(m + 1)n − m^3   (3.17)
processors.
It is well known that substitution algorithms for solving triangular systems, includ-
ing CSweep, are backward stable: in particular, the computed solution x̃ satisfies a
relation of the form \|b − L x̃\|_∞ = O(\|L\|_∞ \|x̃\|_∞ u); cf. [6]. Not only that, but in many
cases the forward error is much smaller than what is predicted by the usual upper
bound involving the condition number (either based on normwise or componentwise
analysis) and the backward error. For some classes of matrices, it is even possi-
ble to prove that the theoretical forward error bound does not depend on a condition
number. This is the case, for example, for the lower triangular matrix arising from an
LU factorization with partial or complete pivoting strategies, and the upper triangular
matrix resulting from the QR factorization with column pivoting [6]. This is also
the case for some triangular matrices arising in the course of parallel factorization
algorithms; see for example [7].
The upper bounds obtained for the error of the parallel triangular solvers are less
satisfactory; cf. [5, 8]. This is due to the anticipated error accumulation in the (loga-
rithmic length) stages consisting of matrix multiplications building the intermediate
values in DTS. Bounds for the residual were first obtained in [5]. These were later
improved in [8]. In particular, the residual corresponding to the computed solution,
x̃, of DTS satisfies
where M(L) is the matrix with values |λi,i | on the diagonal, and −|λi, j | in the
off-diagonal positions, with dn and d̃n constants of the order n log n. When L is an
M-matrix and b ≥ 0, DTS can be shown to be componentwise backward stable to
first order; cf. [8].
Toeplitz triangular systems, in which λij = λ̃i− j , for i > j, arise frequently in
practice. The algorithms presented in the previous two sections do not take advan-
tage of this special structure of L. Efficient schemes for solving Toeplitz triangular
systems require essentially the same number of parallel arithmetic operations as in
the general case, but need fewer processors: O(n^2) rather than O(n^3) processors for
dense systems, and O(mn) rather than O(m^2 n) processors for banded systems. The
solution of more general banded Toeplitz systems is discussed in Sect. 6.2 of Chap. 6.
To pave the way for a concise presentation of the algorithms for Toeplitz systems,
we present the following fundamental lemma.
Lemma 3.1 ([9]) If L is Toeplitz, then L^{−1} is also Toeplitz, where

L = \begin{pmatrix}
1 & & & & \\
λ_1 & 1 & & & \\
λ_2 & λ_1 & 1 & & \\
\vdots & \ddots & \ddots & \ddots & \\
λ_{n−1} & \cdots & λ_2 & λ_1 & 1
\end{pmatrix}.
L(J xi ) = ei+1 ,
where we have used the fact that L and J commute. Therefore, x_{i+1} = J x_i, and L^{−1}
can be written as

L^{−1} = (x_1, J x_1, J^2 x_1, . . . , J^{n−1} x_1),
Theorem 3.3 Let L be a dense Toeplitz unit lower triangular matrix of order n.
Then the system Lx = f can be solved in T_p = \log^2 n + 2\log n − 1 parallel steps,
using no more than p = n^2/4 processors.
Proof From Lemma 3.1, we see that the first column of L −1 determines L −1
uniquely. Using this observation, consider a leading principal submatrix of the
Toeplitz matrix L,
\begin{pmatrix} L_1 & 0 \\ G_1 & L_1 \end{pmatrix},
and doubling the size every stage, we obtain the inverse of the leading principal
submatrix M of order n/2 in
\sum_{j=1}^{\log(n/4)} 2(j + 1) = \log^2 n − \log n − 2

parallel steps with n^2/16 processors. Thus, the solution of a Toeplitz system Lx =
f, or
\begin{pmatrix} M & 0 \\ N & M \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} f_1 \\ f_2 \end{pmatrix},
is given by
x_1 = M^{−1} f_1, and
x_2 = M^{−1} f_2 − M^{−1} N M^{−1} f_1.
Since we have already obtained M^{−1} (actually the first column of the Toeplitz matrix
M^{−1}), x_1 and x_2 can be computed in (1 + 3\log n) parallel arithmetic operations
using no more than n^2/4 processors. Hence, the total number of parallel arithmetic operations
for solving a Toeplitz system of linear equations is (\log^2 n + 2\log n − 1), using n^2/4
processors.
Let now n = 2^ν, Le_1 = (1, λ_1, λ_2, . . . , λ_{n−1})^T, and let G_k be a square Toeplitz matrix of
order 2^k with its first and last columns given by

G_k e_1 = (λ_{2^k}, λ_{2^k+1}, . . . , λ_{2^{k+1}−1})^T

and

G_k e_{2^k} = (λ_1, λ_2, . . . , λ_{2^k})^T.
Based on this discussion, we construct the triangular Toeplitz solver TTS that is
listed as Algorithm 3.4.
Theorem 3.4 Let L be a Toeplitz banded unit lower triangular matrix of order n
and bandwidth (m + 1) ≥ 3. Then the system L x = f can be solved in less than
(3 + 2 log m) log n parallel steps using no more than 3mn/4 processors.
At the jth stage of this algorithm, we consider the leading principal submatrix L_j
of order 2r = m 2^j,

L_j = \begin{pmatrix} L_{j−1} & 0 \\ R_{j−1} & L_{j−1} \end{pmatrix},

and compute

x_i^{(j)} = L_j^{−1} f_i^{(j)},   i = 1, 2, . . . , n/2r,
or
\begin{pmatrix} x_{2i−1}^{(j−1)} \\ y_{2i}^{(j−1)} \end{pmatrix} =
\begin{pmatrix} L_{j−1}^{−1} f_{2i−1}^{(j−1)} \\ L_{j−1}^{−1} f_{2i}^{(j−1)} − L_{j−1}^{−1} R_{j−1} L_{j−1}^{−1} f_{2i−1}^{(j−1)} \end{pmatrix}.   (3.21)
Note that the first 2r components of the solution vector x are given by x_1^{(j)}. Assuming
that we have already obtained L_{j−1}^{−1} f_{2i−1}^{(j−1)} and L_{j−1}^{−1} f_{2i}^{(j−1)} from stage (j − 1), then
(3.21) may be computed as shown in Fig. 3.3, in T = (3 + 2\log m) parallel
steps, using p = (mr − \frac{1}{2}m^2 − \frac{1}{2}m)(n/2r) processors. This is possible only if we
have L_{j−1}^{−1} explicitly, i.e., the first column of L_{j−1}^{−1}.
Fig. 3.3 Evaluation of (3.21): the terms L_{j−1}^{−1} f_{2i}^{(j−1)} and L_{j−1}^{−1} R_{j−1} L_{j−1}^{−1} f_{2i−1}^{(j−1)} are formed
and then subtracted
m(n + m)/4 processors [5]. After stage ν, where ν = log(n/2m), the last n/2
components of the solution are obtained by solving the system
\begin{pmatrix} L_ν & 0 \\ R_ν & L_ν \end{pmatrix} \begin{pmatrix} x_1^{(ν)} \\ x_2^{(ν)} \end{pmatrix} = \begin{pmatrix} f_1^{(ν)} \\ f_2^{(ν)} \end{pmatrix}.
T_p = \sum_{k=0}^{\log(n/m)} T(k) = (3 + 2\log m)\log n − (\log^2 m + \log m + 1).
The maximum number of processors p used in any given stage, therefore, does not
exceed 3mn/4. For n ≫ m the number of parallel steps required is less than twice
that of the general-purpose banded solver while the number of processors used is
reduced roughly by a factor of 2m/3.
In Sects. 3.2.1–3.2.4, we presented algorithms that require the least known number of
parallel steps, usually at the expense of a rather large number of processors, especially
for dense triangular systems. Throughout this section, we present alternative schemes
that achieve the least known number of parallel steps for a given number of processors.
First, we consider banded systems of order n and bandwidth (m + 1), i.e., R<n, m>,
and assume that the number of available processors p satisfies the inequality m <
p ≪ n. Clearly, if p = m we can use the column-sweep algorithm to solve the
triangular system in 2(n − 1) steps. Our main goal here is to develop a more suitable
algorithm for p > m that requires O(m^2 n/p) parallel steps given only p > m
processors.
In the following, we present two algorithms; one for obtaining all the elements of
the solution of L x = f , and one for obtaining only the last m components of x.
Theorem 3.5 Given p processors, m < p ≪ n, a unit lower triangular system of n
equations with bandwidth (m + 1) can be solved in

T_p = 2(m − 1) + \frac{n − m}{p(p + m − 1)}\, τ   (3.22)
L_1 h_1 = g_1 − R_0 z_0,   (3.26)

τ_1 = 2m^2 p

parallel steps, while each of the (p − 1) systems in (3.27) can be solved sequentially
in

τ_2 = (2m^2 + m) p − \frac{m}{2}(2m^2 + 3m + 1)
parallel steps. Now the reason for choosing q = mp is clear; we need to make the
difference in parallel steps between solving (3.26) and any of the ( p − 1) systems
in (3.27) as small as practically possible. This will minimize the number of parallel
steps during which some of the processors remain idle. In fact, τ1 = τ2 for p =
(2m^2 + 3m + 1)/2. Assigning one processor to each of the p systems in (3.27), they
can be solved simultaneously in τ3 = max{τ1 , τ2 } parallel steps. From (3.25), the
solution vector z is given by
z1 = h1,
z i = h i − G i−1 z i−1 , i = 2, 3, . . . , p.
Observing that only the last m columns of each G i are different from zero and
using the available p processors, each z i can be computed in 2m parallel steps. Thus,
z 2 , z 3 , . . . , z p are obtained in τ4 = 2m( p − 1) parallel steps, and the system L ẑ = g
in (3.24) is solved in
τ = τ_3 + τ_4 = \max\left\{(2m^2 + 3m) p − \frac{m}{2}(2m^2 + 3m + 5),\  2m(m + 1) p − 2m\right\}   (3.28)
Vi xi = ( f i − Ui xi−1 ), i = 1, 2, 3, . . . , k. (3.30)
Solving (3.29) by the column-sweep method in 2(m − 1) parallel steps (note that
p > m), the k systems in (3.30) can then be solved one at a time using the algorithm
developed for solving (3.24). Consequently, using p processors, L x = f is solved
in
T_p = 2(m − 1) + \frac{n − m}{p(p + m − 1)}\, τ
Theorem 3.6 Consider the unit lower triangular system L x = f of order n and
bandwidth 2. Given p processors, 1 < p ≪ n, we can obtain ξ_n, the last component
of x, in

T_p = 3(q − 1) + 2\log p
L 1 x1 = f 1 , (3.32)
Splitting the unit lower triangular system of order p whose elements are encircled in
Fig. 3.4, the p elements ξr , ξr +q , ξr +2q , . . . , ξn can be obtained using the algorithm
of Theorem 3.2 in 2 log p parallel steps using ( p − 1) processors. Hence, we can
obtain ξn in
T p = σ + 2 log p.
will, however, consider an interesting special case; a parallel Horner’s rule for the
evaluation of polynomials.
Theorem 3.7 Given p processors, 1 < p ≪ n, we can evaluate a polynomial of
degree n in T_p = 2(k − 1) + 2\log(p − 1) parallel steps, where k = \lceil (n + 1)/(p − 1) \rceil.
Proof Consider the evaluation of the polynomial
P_n(ξ) = α_1 ξ^n + α_2 ξ^{n−1} + \cdots + α_n ξ + α_{n+1}
β1 = α1 ,
βi+1 = θβi + αi+1 i = 1, 2, . . . , n,
where βn+1 = Pn (θ ). This amounts to obtaining the last component of the solution
vector of the triangular system
\begin{pmatrix}
1 & & & & \\
−θ & 1 & & & \\
& −θ & 1 & & \\
& & \ddots & \ddots & \\
& & & −θ & 1
\end{pmatrix}
\begin{pmatrix} β_1 \\ β_2 \\ β_3 \\ \vdots \\ β_{n+1} \end{pmatrix} =
\begin{pmatrix} α_1 \\ α_2 \\ α_3 \\ \vdots \\ α_{n+1} \end{pmatrix}.   (3.34)
where,
L̂h 1 = a1 ,
(3.36)
Lh i = ai , i = 2, 3, . . . , p − 1,
and
L(Ĝe j ) = L(Gek ) = −θ e1 . (3.37)
Assigning one processor to each of the bidiagonal systems (3.36) and (3.37), we
obtain h_i and G e_k in 2(k − 1) parallel steps. Since L is Toeplitz, one can easily show
that

g = Ĝ e_j = G e_k = (−θ, −θ^2, . . . , −θ^k)^T.
In a manner similar to the algorithm of Theorem 3.6, we split from (3.35) a smaller
linear system of order (p − 1),

\begin{pmatrix}
1 & & & & \\
−θ^k & 1 & & & \\
& −θ^k & 1 & & \\
& & \ddots & \ddots & \\
& & & −θ^k & 1
\end{pmatrix}
\begin{pmatrix} b_j \\ b_{j+k} \\ b_{j+2k} \\ \vdots \\ b_{n+1} \end{pmatrix} =
\begin{pmatrix} e_j^T h_1 \\ e_k^T h_2 \\ e_k^T h_3 \\ \vdots \\ e_k^T h_{p−1} \end{pmatrix}.   (3.38)
From Theorem 3.2 we see that using (p − 2) processors, we obtain b_{n+1} in 2\log(p − 1)
parallel steps. Thus the total number of parallel steps for evaluating a polynomial of
degree n using p ≪ n processors is given by
Table 3.1 Summary of bounds for parallel steps, efficiency, redundancy and processor count in
algorithms for solving linear recurrences

  L                          T_p                               p (proportional to)      E_p (proportional to)   R_p
  Dense triangular R<n>      (1/2)log^2 n + (3/2)log n         (15/1024)n^3 + O(n^2)    1/(n log^2 n)           O(n)
  Dense triangular R̂<n>      log^2 n + 2 log n − 1             n^2/4                    1/log^2 n               O(1) (= 5/4)
  Banded triangular R<n,m>   (2 + log m)log n + O(log^2 m)     (1/2)m(m + 1)n − m^3     1/(m log m log n)       O(m log n)
  Banded triangular R̂<n,m>   (3 + 2 log m)log n + O(log^2 m)   3mn/4                    1/(log m log n)         O(log n)

Here n is the order of the unit triangular system L x = f and m + 1 is the bandwidth of L
Example 3.6 Let n = 1024 and p = 16. Hence, k = 69 and T p = 144. This is
roughly (1/14) of the number of arithmetic operations required by the sequential
scheme, which is very close to (1/ p).
The various results presented in this chapter so far are summarized in Tables 3.1 and
3.2. In both tables, we deal with the problem of solving a unit lower triangular system
of equations L x = f . Table 3.1 shows upper bounds for the number of parallel steps,
processors, and the corresponding arithmetic redundancy for the unlimited number
of processors case. In this table, we observe that when the algorithms take advantage
of the special Toeplitz structure to reduce the number of processors and arithmetic
redundancy, the number of parallel steps is increased. Table 3.2 gives upper bounds
on the parallel steps and the redundancy for solving banded unit lower triangular
systems given only a limited number of processors, p, m < p ≪ n. It is also of
interest to note that in the general case, it hardly makes a difference in the number
of parallel steps whether we are seeking all or only the last m components of the
solution. In the Toeplitz case, on the other hand, the parallel steps are cut in half if
we seek only the last m components of the solution.
We have already seen in Sects. 3.1 and 3.2 that parallel algorithms for evaluating
linear recurrence relations can achieve significant reduction in the number of required
parallel steps compared to sequential schemes. For example, from Table 3.1 we see
that the speedup for evaluating short recurrences (small m) as a function of the
problem size behaves like S p = O(n/ log n). Therefore, if we let the problem size
vary, the speedup is unbounded with n. Even discounting the fact that efficiency goes
to 0, the result is quite impressive for recurrence computations that appear sequential
at first sight. As we see next, and has been known for a long time, speedups are far
more restricted for nonlinear recurrences.
Table 3.2 Summary of linear recurrence bounds using a limited number, p, of processors, where
m < p ≪ n and m and n are as defined in Table 3.1

Problem: solve the banded unit lower triangular system L x = f

  Case                                                          T_p                              E_p                    R_p
  General case, all or last m components of x                   2m^2 n/(p + m − 1) + O(mn/p)     proportional to 1/m    O(m)
  General case, m = 1, last component of x                      3n/p + O(log p)                  2/3                    2
  Toeplitz case, all components of x                            4mn/p + O(mp)                    1/2                    O(m)
  Toeplitz case, last m components of x                         2mn/(p − m) + O(m^2 log p)       1 − m/p                1 + 1/p
  Toeplitz case, m = 1 (e.g., evaluation of a polynomial
    of degree n − 1)                                            2n/(p − 1) + O(log p)            1 − 1/p                1 + 1/p
The mathematics literature dealing with the above recurrence is extensive, see for
example [10, 11].
The most obvious method for speeding up the evaluation of (3.39) is through
linearization, if possible, and then using the algorithms of Sects. 3.2 and 3.3.
Example 3.7 ([10]) Consider the first order rational recurrence of degree one
ξ_{k+1} = \frac{α_k ξ_k + β_k}{γ_k + ξ_k},   k ≥ 0.   (3.40)
we obtain

ξ_{k+1}(γ_k + ξ_k) = \frac{1}{ω_k}\bigl(ω_{k+2} − γ_{k+1} ω_{k+1}\bigr),
and

α_k ξ_k + β_k = \frac{1}{ω_k}\bigl(α_k ω_{k+1} + (β_k − α_k γ_k) ω_k\bigr).

Equating these two expressions yields the second-order linear recurrence

ω_{k+2} + δ_{k+1} ω_{k+1} + ζ_k ω_k = 0,

where

δ_{k+1} = −(α_k + γ_{k+1}), and
ζ_k = α_k γ_k − β_k.
Theorem 3.8 ([12]) Let ξi+1 = f (ξi ) be a rational recurrence of order 1 and
degree d > 1. Then the speedup for the parallel evaluation of the final term, ξn , of
the recurrence on p processors is bounded as follows
S_p ≤ \frac{O_1(f(ξ))}{\log d}.
ξ_{k+2} = \frac{α(ξ_k + ξ_{k+1})}{ξ_k ξ_{k+1} − α}.
This is equivalent to
Fig. 3.5 The fixed-point iteration ψ = f(ξ): starting from ξ_0, the iterates ξ_1, ξ_2, . . . approach the
attractive fixed point
using p = 3n processors, whereas the sequential algorithm for handling the nonlinear
recurrence directly requires O1 = 5(n − 1) arithmetic operations.
Therefore, provided that τ ≪ n, we can achieve speedup S_p = O(n/\log n).
This can be further enhanced if we are seeking only ξn and have a parallel algorithm
for evaluating the tan and arctan functions. In many cases the first-order nonlinear
recurrences ξk+1 = f (ξk ) arise when one attempts to converge to a fixed point. The
goal in this case is not the evaluation of ξi for i ≥ 1 (given the initial iterate ξ0 ), but an
approximation of \lim_{i→∞} ξ_i. The classical sequential algorithm is illustrated in Fig. 3.5,
where |f′(ξ)| < 1 in the neighborhood of the root α. In this case, α is called an
attractive fixed point. An efficient parallel algorithm, therefore, should not attempt
to linearize the nonlinear recurrence ξk+1 = f (ξk ), but rather seek an approximation
of α by other methods for obtaining roots of a nonlinear function.
ξ_{k+1} = \frac{2ξ_k(ξ_k^2 + 6)}{3ξ_k^2 + 2} = f(ξ_k).
then

ν ≤ \frac{t + \log_{10}(γ/2)}{\log_{10}(p − 1)},

and

T_p ≤ 2 + 3ν.
References
1. Gautschi, W.: Computational aspects of three-term recurrence relations. SIAM Rev. 9, 24–82
(1967)
2. Rivlin, T.: The Chebyshev Polynomials. Wiley-Interscience, New York (1974)
3. Kuck, D.: The Structure of Computers and Computations. Wiley, New York (1978)
4. Chen, S.C., Kuck, D.: Time and parallel processor bounds for linear recurrence systems. IEEE
Trans. Comput. C-24(7), 701–717 (1975)
5. Sameh, A., Brent, R.: Solving triangular systems on a parallel computer. SIAM J. Numer. Anal.
14(6), 1101–1113 (1977)
6. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
7. Sameh, A., Kuck, D.: A parallel QR algorithm for symmetric tridiagonal matrices. IEEE Trans.
Comput. 26(2), 147–153 (1977)
8. Higham, N.: Stability of parallel triangular system solvers. SIAM J. Sci. Comput. 16(2), 400–
413 (1995)
9. Lafon, J.: Base tensorielle des matrices de Hankel (ou de Toeplitz). Numer. Math. 23,
349–361 (1975)
10. Boole, G.: Calculus of Finite Differences. Chelsea Publishing Company, New York (1970)
11. Wimp, J.: Computation with Recurrence Relations. Pitman, Boston (1984)
12. Kung, H.: New algorithms and lower bounds for the parallel evaluation of certain rational
expressions and recurrences. J. Assoc. Comput. Mach. 23(2), 252–261 (1976)
13. Miranker, W.: Parallel methods for solving equations. Math. Comput. Simul. 20(2), 93–101
(1978). doi:10.1016/0378-4754(78)90032-0. http://www.sciencedirect.com/science/article/
pii/0378475478900320
14. Gal, S., Miranker, W.: Optimal sequential and parallel search for finding a root. J. Combinatorial
Theory 23, 1–14 (1977)
Chapter 4
General Linear Systems
such that
Mw = ω1 e1 . (4.3)
μ j1 = −(ω j /ω1 ).
If A ≡ A_1 = \begin{pmatrix} α_{11} & a^T \\ b & B \end{pmatrix} is diagonally dominant, then α_{11} ≠ 0 and B is also
diagonally dominant. One can construct an elementary lower triangular matrix M_1
such that A_2 = M_1 A_1 is upper triangular as far as its first column is concerned, i.e.,

A_2 = \begin{pmatrix} α_{11} & a^T \\ 0 & C \end{pmatrix}   (4.4)
Mn−1 · · · M2 M1 A = U (4.5)
or
M̂n−1 M̂n−2 · · · M̂1 (Pn−1 · · · P2 P1 )A = U,
As outlined above, using a pivoting strategy such as partial pivoting requires obtaining
the element of maximum modulus of a vector of order n (say). In a parallel imple-
mentation, that vector could be stored across p nodes. Identifying the pivotal row
will thus require accessing data along the entire matrix column, potentially giving
rise to excessive communication across processors. Not surprisingly, more aggres-
sive forms of pivoting that have the benefit of smaller potential error, are likely to
involve even more communication. To reduce these costs, one approach is to utilize
a pairwise pivoting strategy. Pairwise pivoting leads to a factorization of the form
\begin{pmatrix} 1 & 0 \\ −γ & 1 \end{pmatrix}   or   \begin{pmatrix} −γ & 1 \\ 1 & 0 \end{pmatrix}.
(a) factorization:
(S2n−3 · · · S2 S1 )A = U̇ , or A = Ṡ −1 U̇ ,
(Annihilation schedule for pairwise pivoting on a matrix with seven subdiagonal columns: column j
is eliminated from the bottom up starting at step 2j − 1, so that the entire factorization completes
in 2n − 3 = 13 steps.)
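A sequential sketch of the idea is given below (on a parallel machine the 2×2 eliminations are scheduled in the wavefronts indicated above; the code is an illustration with assumed names, not the book's implementation, and it returns only the transformed upper triangular factor):

import numpy as np

# Pairwise (neighbor) pivoting: each subdiagonal entry is annihilated by a 2x2
# transformation acting on two adjacent rows, choosing the entry of larger modulus
# as pivot so that the multiplier gamma satisfies |gamma| <= 1.
def pairwise_pivoting_upper(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    for j in range(n - 1):
        for i in range(n - 1, j, -1):          # annihilate column j from the bottom up
            a, b = A[i - 1, j], A[i, j]
            if b == 0.0:
                continue
            if abs(b) > abs(a):                # swap so the pivot row holds the larger entry
                A[[i - 1, i], j:] = A[[i, i - 1], j:]
                a, b = A[i - 1, j], A[i, j]
            gamma = b / a
            A[i, j:] -= gamma * A[i - 1, j:]
            A[i, j] = 0.0                      # annihilated exactly
    return A                                   # upper triangular factor

A = np.random.default_rng(11).standard_normal((6, 6))
U = pairwise_pivoting_upper(A)
assert np.allclose(np.tril(U, -1), 0.0)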
A detailed error analysis of this pivoting strategy was conducted in [16], and in a
much refined form in [17], which shows that
This upper bound is 2^{n−1} larger than that obtained for the partial pivoting strategy.
Nevertheless, except in very special cases, extensive numerical experiments show
that the difference in quality (as measured by relative residuals) of the solutions
obtained by these two pivoting strategies is imperceptible.
Another pivoting approach that is related to pairwise pivoting is termed incre-
mental pivoting [18]. Incremental pivoting has also been designed so as to reduce
the communication costs in Gaussian elimination with partial pivoting.
Now, the process is repeated for B0 , i.e., choosing a window Ḃ0 consisting of the
first ν columns of B0 , obtaining an LU factorization of Ṗ1 Ḃ0 followed by obtaining
ν additional rows and columns of the factorization (4.16)
\begin{pmatrix} I_ν & \\ & Ṗ_1 \end{pmatrix} Ṗ_0 A =
\begin{pmatrix} L_{11} & & \\ L_{21} & L_{22} & \\ L_{31} & L_{32} & I_{n−2ν} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} & U_{13} \\ & U_{22} & U_{23} \\ & & B_1 \end{pmatrix}.
A = Ŝ^{−1} Û   (4.18)

in which

G_{ij} = A_{ji} A_{ii}^{−1}, and   (4.20)
(Schedule of the block annihilation steps: the off-diagonal block in position (i, j), i > j, is
annihilated at step i + j − 2; for eight block rows the last block is annihilated at step 13.)
Steps Nodes →
↓ 1 2 3 4 5 6 7 8
1 fact(A11)
2 [1, 2]
3 [1, 3] fact(A22)
4 [1, 4] [2, 3]
5 [1, 5] [2, 4] fact(A33)
6 [1, 6] [2, 5] [3, 4]
7 [1, 7] [2, 6] [3, 5] fact(A44)
8 [1, 8] [2, 7] [3, 6] [4, 5]
9 [2, 8] [3, 7] [4, 6] fact(A55)
10 [3, 8] [4, 7] [5, 6]
11 [4, 8] [5, 7] fact(A66)
12 [5, 8] [6, 7]
13 [6, 8] fact(A77)
14 [7, 8]
15 fact(A88)
We consider next the general case, where the matrix A is not diagonally domi-
nant. One attractive possibility is to proceed with the factorization of each Aii without
partial pivoting but applying, whenever necessary, the procedure of “diagonal boost-
ing” which insures the nonsingularity of each diagonal block. Such a technique was
originally proposed for Gaussian elimination in [20].
Diagonal boosting is invoked whenever any diagonal pivot violates the following
condition:

|pivot| > ε_u \|A_j\|_1,

where ε_u is a multiple of the unit roundoff, u. If the diagonal pivot does not satisfy
the above condition, its value is “boosted”.
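The boosting update itself is not reproduced above; the following sketch simply assumes the common choice of replacing a tiny pivot by sign(pivot)·ε_u‖A_j‖_1 (an assumption for illustration, not necessarily the authors' formula):

import numpy as np

# LU without pivoting but with "diagonal boosting": whenever a pivot is too small
# relative to eps_u * ||A||_1, it is replaced by a boosted value of that magnitude
# (assumed update).  The result is the exact factorization of a slightly perturbed
# matrix, to be corrected by an outer Krylov iteration.
def lu_boosted(A, eps_u=1e-8):
    A = A.astype(float).copy()
    n = A.shape[0]
    norm1 = np.linalg.norm(A, 1)
    boosted = False
    for k in range(n - 1):
        if abs(A[k, k]) <= eps_u * norm1:
            A[k, k] = np.copysign(eps_u * norm1, A[k, k] if A[k, k] != 0 else 1.0)
            boosted = True
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return A, boosted          # L (unit, strictly lower part) and U stored in place

LU, was_boosted = lu_boosted(np.array([[1e-16, 1.0], [1.0, 1.0]]))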
4.4 Remarks
Well designed kernels for matrix multiplication and rank-k updates for hierarchical
machines with multiple levels of memory and parallelism are of critical importance
for the design of dense linear system solvers. ScaLAPACK based on such kernels in
BLAS3 is a case in point. The subroutines of this library achieve high performance
and parallel scalability. In fact, many users became fully aware of these gains even
when using high-level problem solving environments like MATLAB (cf. [21]). As
early work on the subject had shown (we consider it rewarding for the reader to
consider the pioneering analyses undertaken in [22, 23]), the task of designing kernels
with high parallel scalability is far from simple, if one desires to provide a design that
closely resembles the target computer model. The task becomes even more difficult as
the complexity of the computer architecture increases. It becomes even harder when
the target is to build methods that can deliver high performance across a spectrum of
computer architectures.
References
1. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
2. Stewart, G.: Matrix Algorithms. Vol. I. Basic Decompositions. SIAM, Philadelphia (1998)
3. Grcar, J.: John Von Neumann’s analysis of Gaussian elimination and the origins of modern
numerical analysis. SIAM Rev. 53(4), 607–682 (2011). doi:10.1137/080734716. http://dx.doi.
org/10.1137/080734716
4. Stewart, G.: The decompositional approach in matrix computations. IEEE Comput. Sci. Eng.
Mag. 50–59 (2000)
5. von Neumann, J., Goldstine, H.: Numerical inverting of matrices of high order. Bull. Am. Math.
Soc. 53, 1021–1099 (1947)
6. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
7. Skillicorn, D.: Understanding Complex Datasets: Data Mining using Matrix Decompositions.
Chapman Hall/CRC Press, Boca Raton (2007)
8. Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S.: Dense linear algebra on accelerated multicore
hardware. In: Berry, M., et al. (eds.) High-Performance Scientific Computing: Algorithms and
Applications, pp. 123–146. Springer, New York (2012)
9. Luszczek, P., Kurzak, J., Dongarra, J.: Looking back at dense linear algebra software. J. Parallel
Distrib. Comput. 74(7), 2548–2560 (2014). DOI http://dx.doi.org/10.1016/j.jpdc.2013.10.005
10. Igual, F.D., Chan, E., Quintana-Ortí, E.S., Quintana-Ortí, G., van de Geijn, R.A., Zee, F.G.V.:
The FLAME approach: from dense linear algebra algorithms to high-performance multi-
accelerator implementations. J. Parallel Distrib. Comput. 72, 1134–1143 (2012)
11. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and
sequential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), 206–239 (2012). doi:10.
1137/080731992
12. Grigori, L., Demmel, J., Xiang, H.: CALU: a communication optimal LU factorization algo-
rithm. SIAM J. Matrix Anal. Appl. 32(4), 1317–1350 (2011). doi:10.1137/100788926. http://
dx.doi.org/10.1137/100788926
13. Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix
multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004). doi:10.1016/j.jpdc.2004.
03.021. http://dx.doi.org/10.1016/j.jpdc.2004.03.021
14. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication
lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23, 1–155
(2014)
15. Sameh, A.: Numerical parallel algorithms—a survey. In: Kuck, D., Lawrie, D., Sameh, A.
(eds.) High Speed Computer and Algorithm Optimization, pp. 207–228. Academic Press, New
York (1977)
16. Stern, J.: A fast Gaussian elimination scheme and automated roundoff error analysis on S.I.M.D.
machines. Ph.D. thesis, University of Illinois (1979)
17. Sorensen, D.: Analysis of pairwise pivoting in Gaussian elimination. IEEE Trans. Comput.
C-34(3), 274–278 (1985)
18. Quintana-Ortí, E.S., van de Geijn, R.A.: Updating an LU factorization with pivoting. ACM
Trans. Math. Softw. (TOMS) 35(2), 11 (2008)
19. Dongarra, J., Duff, I., Sorensen, D., van der Vorst, H.: Numerical Linear Algebra for High-
Performance Computers. SIAM, Philadelphia (1998)
20. Stewart, G.: Modifying pivot elements in Gaussian elimination. Math. Comput. 28(126), 537–
542 (1974)
21. Moler, C.: MATLAB incorporates LAPACK. Mathworks Newsletter (2000). http://www.
mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html
22. Gallivan, K., Jalby, W., Meier, U.: The use of BLAS3 in linear algebra on a parallel processor
with a hierarchical memory. SIAM J. Sci. Statist. Comput. 8(6), 1079–1084 (1987)
23. Gallivan, K., Jalby, W., Meier, U., Sameh, A.: The impact of hierarchical memory systems on
linear algebra algorithm design. Int. J. Supercomput. Appl. 2(1) (1988)
Chapter 5
Banded Linear Systems
Using the classical Gaussian elimination scheme, with partial pivoting, outlined
in Chap. 4, for handling banded systems results in limited opportunity for paral-
lelism. This limitation becomes more pronounced the narrower the system’s band.
To illustrate this limitation, consider a banded system of bandwidth (2m + 1), shown
in Fig. 5.1 for m = 4.
The scheme will consider the triangularization of the leading (2m +1) × (2m+1)
window with partial pivoting, leading to possible additional fill-in. The process is re-
peated for the following window that slides down the diagonal, one diagonal element
at a time.
In view of such limitation, an alternative LU factorization considered in [1] was
adopted in the ScaLAPACK library; see also [2, 3]. For ease of illustration, we
consider the two-partitions case of a tridiagonal system Ax = f , of order 18 × 18,
initially setting A0 ≡ A and f 0 ≡ f ; see Fig. 5.2 for matrix A. After row permuta-
tions, we obtain A1 x = f 1 where A1 = P0 A0 , with P0 = [e18 , e1 , e2 , . . . , e17 ]
in which e j is the jth column of the identity I18 , see Fig. 5.3. Following this by the
column permutations, A2 = A1 P1 , with
P_1 = \begin{pmatrix} I_7 & 0 & 0 & 0 \\ 0 & 0 & I_2 & 0 \\ 0 & I_7 & 0 & 0 \\ 0 & 0 & 0 & I_2 \end{pmatrix},
we get

A_2 = \begin{pmatrix} B_1 & 0 & E_1 & F_1 \\ 0 & B_2 & E_2 & F_2 \end{pmatrix},
and

L_j^{−1}(E_j, F_j) = \begin{pmatrix} E_j′ & F_j′ \\ E_j″ & F_j″ \end{pmatrix},
is also nonsingular and hence y, and subsequently, the unique solution x can be
computed.
In Sects. 5.2 and 5.3, we present alternatives to this parallel direct solver where we
present various hybrid (direct/iterative) schemes that possess higher parallel scala-
bility. Clearly, these hybrid schemes are most suitable when the user does not require
obtaining solutions whose corresponding residuals have norms of the order of the
unit roundoff.
Parallel banded linear system solvers have been considered by many authors
[1, 4–11]. We focus here on one of these solvers—the Spike algorithm which dates
back to the 1970s (the original algorithm created for solving tridiagonal systems on
parallel architectures [12] is discussed in detail in Sect. 5.5.5 later in this chapter).
Further developments and analysis are given in [13–18]. A distinct feature of the
Spike algorithm is that it avoids the poorly scalable parallel scheme of obtaining
the classical global LU factorization of the narrow-banded coefficient matrix. The
overarching strategy of the Spike scheme consists of two main phases:
First, we review the Spike algorithm and its features. The Spike algorithm consists
of the four stages listed in Algorithm 5.1.
For ease of illustration, we assume for the time being, that any block-diagonal part of
A is nonsingular. Later, however, this assumption will be removed, and the systems
AX = F are solved via a Krylov subspace method with preconditioners which are
low-rank perturbations of the matrix A. Solving systems involving such banded
preconditioners in each outer Krylov iteration will be achieved using a member of
the Spike family of algorithms.
Preprocessing Stage
This stage starts with partitioning the banded linear system into a block tridiagonal
form with p diagonal blocks A_j (j = 1, . . . , p), each of order n_j = n/p (assuming
that n is an integer multiple of p), with coupling matrices (each of order m ≪ n) B_j (j =
1, . . . , p − 1), and C_j (j = 2, . . . , p), associated with the super- and sub-block
diagonals, respectively. Note that it is not a requirement that all the diagonal blocks
A j be of the same order. A straightforward implementation of the Spike algorithm
on a distributed memory architecture could be realized by choosing the number of
partitions p to be the same as the number of available multicore nodes q. In general,
however, p ≤ q. This stage concludes by computing the LU-factorization of each
diagonal block A j . Figure 5.5 illustrates the partitioning of the matrices A and F for
p = 4.
The Spike Factorization Stage
Based on our assumption above that each A j is nonsingular, the matrix A can be
factored as A = DS, where D is a block-diagonal matrix consisting only of the
factored diagonal blocks A j ,
D = diag(A1 , . . . , A p ),
Fig. 5.5 Spike partitioning of the matrix A and block of right-hand sides F with p = 4
(The Spike matrix S = D^{−1}A for p = 4 partitions: identity blocks on the diagonal, with the spikes
V_1, V_2, V_3 to the right of, and W_2, W_3, W_4 to the left of, the corresponding diagonal blocks.)
A_j (V_j, W_j) = \begin{pmatrix} 0 & C_j \\ \vdots & 0 \\ 0 & \vdots \\ B_j & 0 \end{pmatrix}.   (5.2)
Postprocessing Stage
Solving the system AX = F now consists of two steps: (a) solve DG = F, and (b) solve SX = G.
The solution of the linear system DG = F in step (a), yields the modified right
hand-side matrix G needed for step (b). Assigning one partition to each node, step
(a) is performed with perfect parallelism. If we decouple the pre- and post-processing
stages, step (a) may be combined with the generation of the spikes in Eq. (5.2).
Let the spikes V_j and W_j be partitioned as follows:

V_j = \begin{pmatrix} V_j^{(t)} \\ V_j′ \\ V_j^{(b)} \end{pmatrix}   and   W_j = \begin{pmatrix} W_j^{(t)} \\ W_j′ \\ W_j^{(b)} \end{pmatrix}   (5.5)

and

V_j^{(t)} = [I_m\ 0]\, V_j;   W_j^{(b)} = [0\ I_m]\, W_j.   (5.7)
Thus, in solving SX = G in step (b), we observe that the problem can be reduced
further by solving the reduced system,
Ŝ X̂ = Ĝ, (5.9)
Finally, once the solution X̂ of the reduced system (5.9) is obtained, the global
solution X is reconstructed with perfect parallelism from X_k^{(b)} (k = 1, . . . , p − 1)
and X_k^{(t)} (k = 2, . . . , p) as follows:

X_1 = G_1 − V_1 X_2^{(t)},
X_j = G_j − V_j X_{j+1}^{(t)} − W_j X_{j−1}^{(b)},   j = 2, . . . , p − 1,
X_p = G_p − W_p X_{p−1}^{(b)}.
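For concreteness, the whole procedure can be sketched for p = 2 partitions (dense storage, nonsingular diagonal blocks and a banded test matrix are assumed; the names are illustrative, not the library implementation):

import numpy as np

# Basic Spike scheme for p = 2: factor D = diag(A1, A2), generate the spikes,
# solve the 2m x 2m reduced system, and retrieve the solution by two decoupled
# block corrections.
def spike_two_partitions(A, f, m):
    n = A.shape[0]
    h = n // 2
    A1, A2 = A[:h, :h], A[h:, h:]
    B1 = A[h - m:h, h:h + m]          # coupling block above the diagonal
    C2 = A[h:h + m, h - m:h]          # coupling block below the diagonal
    # spikes: A1 V1 = [0; B1],  A2 W2 = [C2; 0]
    V1 = np.linalg.solve(A1, np.vstack([np.zeros((h - m, m)), B1]))
    W2 = np.linalg.solve(A2, np.vstack([C2, np.zeros((h - m, m))]))
    G1 = np.linalg.solve(A1, f[:h])
    G2 = np.linalg.solve(A2, f[h:])
    # reduced system (5.12): [[I, V1_b], [W2_t, I]] [x1_b; x2_t] = [G1_b; G2_t]
    S_hat = np.block([[np.eye(m), V1[-m:, :]], [W2[:m, :], np.eye(m)]])
    y = np.linalg.solve(S_hat, np.concatenate([G1[-m:], G2[:m]]))
    x1_b, x2_t = y[:m], y[m:]
    # retrieval: x1 = G1 - V1 x2_t,  x2 = G2 - W2 x1_b
    return np.concatenate([G1 - V1 @ x2_t, G2 - W2 @ x1_b])

m, h = 2, 8
rng = np.random.default_rng(7)
A = np.diag(rng.standard_normal(2 * h) + 6.0)
for d in range(1, m + 1):                           # fill a band of half-width m
    A += np.diag(rng.standard_normal(2 * h - d), d) + np.diag(rng.standard_normal(2 * h - d), -d)
f = rng.standard_normal(2 * h)
assert np.allclose(spike_two_partitions(A, f, m), np.linalg.solve(A, f))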
Remark 5.1 Note that the generation of the spikes is included in the factorization
step. In this way, the solver makes use of the spikes stored in memory thus allowing
solving the reduced system quickly and efficiently. Since in many applications one
has to solve many linear systems with the same coefficient matrix A but with different
right hand-sides, optimization of the solver step is crucial for assuring high parallel
scalability of the Spike algorithm.
Several options are available for efficient implementation of the Spike algorithm on
parallel architectures. These choices depend on the properties of the linear system
as well as the parallel architecture at hand. More specifically, each of the following
three tasks in the Spike algorithm can be handled in several ways:
1. factorization of the diagonal blocks A j ,
2. computation of the spikes,
3. solution of the reduced system.
In the first task, each linear system associated with A j can be solved via: (i) a direct
method making use of the LU-factorization with partial pivoting, or the Cholesky
factorization if A j is symmetric positive definite, (ii) a direct method using an LU-
factorization without pivoting but with a diagonal boosting strategy, (iii) an iterative
method with a preconditioning strategy, or (iv) via an appropriate approximation of
the inverse of A j . If more than one node is associated with each partition, then linear
systems associated with each A j may be solved using the Spike algorithm creating
yet another level of parallelism. In the following, however, we consider only the case
in which each partition is associated with only one multicore node.
In the second task, the spikes can be computed either explicitly (fully or partially)
using Eq. (5.2), or implicitly—“on-the-fly”.
In the third task, the reduced system (5.9) can be solved via (i) a direct method
such as LU with partial pivoting, (ii) a “recursive” form of the Spike algorithm, (iii)
a preconditioned iterative scheme, or (iv) a “truncated” form of the Spike scheme,
which is ideal for diagonally dominant systems, as will be discussed later.
Note that an outer iterative method will be necessary to assure solutions with
acceptable relative residuals for the original linear system whenever we do not use
numerically stable direct methods for solving systems (5.3) and (5.9). In such a case,
the overall hybrid solver consists of an outer Krylov subspace iteration for solving
Ax = f , in which the Spike algorithm is used as a solver of systems involving a
banded preconditioner consisting of an approximate Spike factorization of A in each
outer iteration. In the remainder of this section, we describe several variants of the
Spike algorithm depending on: (a) whether the whole spikes are obtained explicitly,
or (b) whether we obtain only an approximation of the independent reduced system.
We describe these variants depending on whether the original banded system is
diagonally dominant.
If it is known apriori that all the diagonal blocks A j are nonsingular, then the LU-
factorization of each block A j with partial pivoting is obtained using either the
relevant single core LAPACK routine [19], or its multithreaded counterpart on one
multicore node. Solving the resulting banded triangular system to generate the spikes
in (5.2), and update the right hand-sides in (5.3), may be realized using a BLAS3
based primitive [16], instead of BLAS2 based LAPACK primitive. In the absence of
any knowledge of the nonsingularity of the diagonal blocks A j , an LU-factorization
is performed on each diagonal block without pivoting, but with a diagonal boosting
strategy to overcome problems associated with very small pivots. Thus, we either ob-
tain an LU-factorization of a given diagonal block A j , or the factorization of a slightly
perturbed A j . Such a strategy circumvents difficulties resulting from computation-
ally singular diagonal blocks. If diagonal boosting is required for the factorization of
any diagonal block, an outer Krylov subspace iteration is used to solve Ax = f with
the preconditioner being M = D̂ Ŝ, where D̂ and Ŝ are the Spike factors resulting
after diagonal boosting. In this case, the Spike scheme reduces to solving systems
involving the preconditioner in each outer Krylov subspace iteration.
One natural way to solve the reduced system (5.9) in parallel is to make use of an
inner Krylov subspace iterations with a block Jacobi preconditioner obtained from
the diagonal blocks of the reduced system (5.10). For these non-diagonally dominant
systems, however, this preconditioner may not be effective in assuring a small relative
residual without a large number of outer iterations. This, in turn, will result in high
interprocessor communication cost. If the unit cost of interprocessor communication
is excessively high, the reduced system may be solved directly on a single node.
Such alternative, however, may have memory limitations if the size of the reduced
system is large. Instead, we propose the following parallel recursive scheme for solv-
ing the reduced system. This “Recursive” scheme involves successive applications
of the Spike algorithm resulting in better balance between the computational and
communication costs.
First, we dispense with the simple case of two partitions, p = 2. In this case, the
reduced system consists only of one diagonal block (5.10), extracted from the central
part of the system (5.4)
\begin{pmatrix} I_m & V_1^{(b)} \\ W_2^{(t)} & I_m \end{pmatrix}
\begin{pmatrix} X_1^{(b)} \\ X_2^{(t)} \end{pmatrix} =
\begin{pmatrix} G_1^{(b)} \\ G_2^{(t)} \end{pmatrix},   (5.12)
Let the spikes of the new reduced system at level 1 of the recursion be denoted by
Vk[1] and Wk[1] , where
V_k^{[1]} = \begin{pmatrix} V_k^{(t)} \\ V_k^{(b)} \end{pmatrix}   and   W_k^{[1]} = \begin{pmatrix} W_k^{(t)} \\ W_k^{(b)} \end{pmatrix}.   (5.14)
In preparation for level 2 of the recursion of the Spike algorithm, we choose now
to partition the matrix S̃1 using p/2 partitions with diagonal blocks each of size 4m.
The matrix can then be factored as
S̃1 = D1 S̃2 ,
where S̃2 represents the new Spike matrix at level 2 composed of the spikes Vk[2] and
Wk[2] . For p = 4 partitions, these matrices are of the form
and
In general, at level i of the recursion, the spikes V_k^{[i]} and W_k^{[i]}, with k ranging from
1 to p/2^i, are of order 2^i m × m. Thus, if the number of the original partitions p is
equal to 2^d, the total number of recursion levels is d − 1 and the matrix S̃_1 is given
by the product

S̃_1 = D_1 D_2 . . . D_{d−1} S̃_d,

where the matrix S̃_d has only two spikes V_1^{[d]} and W_2^{[d]}. Thus, the reduced system
can be written as

S̃_d X̃ = B,   (5.15)

B = D_{d−1}^{−1} . . . D_2^{−1} D_1^{−1} G̃.   (5.16)
If we assume that the spikes V_k^{[i]} and W_k^{[i]}, for all k, of the matrix S̃_i are known at
a given level i, then we can compute the spikes V_k^{[i+1]} and W_k^{[i+1]} at level i + 1 as
follows:
STEP 1: Denoting the bottom and the top blocks of the spikes at level i by the
superscripts (b) and (t), respectively, we solve 2m × 2m systems such as

\begin{pmatrix} I_m & V_{2k−1}^{[i](b)} \\ W_{2k}^{[i](t)} & I_m \end{pmatrix}
\begin{pmatrix} Ẇ_k^{[i+1]} \\ Ẅ_k^{[i+1]} \end{pmatrix} =
\begin{pmatrix} W_{2k−1}^{[i](b)} \\ 0 \end{pmatrix},   k = 2, 3, . . . , \frac{p}{2^{i+1}}.   (5.18)
These reduced systems are solved similarly to (5.12) to obtain the central portion
of all the spikes at level i + 1.
STEP 2: The rest of the spikes at level i + 1 are retrieved as follows:
and
As outlined above for the two-partition case, once the reduced system is solved,
we purify the right hand-side from the contributions of the coupling blocks B j and
C j , thus decoupling the system into independent subsystems, one corresponding to
each diagonal block. Solving these independent systems simultaneously using the
previously computed LU- or UL-factorizations as shown below:
A_1 X_1 = F_1 − \begin{pmatrix} 0 \\ I_m \end{pmatrix} B_1 X_2^{(t)},

A_j X_j = F_j − \begin{pmatrix} 0 \\ I_m \end{pmatrix} B_j X_{j+1}^{(t)} − \begin{pmatrix} I_m \\ 0 \end{pmatrix} C_j X_{j−1}^{(b)},   j = 2, . . . , p − 1,

A_p X_p = F_p − \begin{pmatrix} I_m \\ 0 \end{pmatrix} C_p X_{p−1}^{(b)}.
We introduce yet another variant of the Spike scheme for solving banded systems
that depends on solving simultaneously several independent underdetermined linear
least squares problems under certain constraints [7].
Consider the nonsingular linear system
Ax = f, (5.21)
where A is an n×n banded matrix with bandwidth β = 2m+1. For ease of illustration
we consider first the case in which n is even, e.g. n = 2s, and A is partitioned into
two equal block rows (each of s rows), as shown in Fig. 5.7. The linear system can
be represented as
A2
C2
106 5 Banded Linear Systems
⎛ ⎞
x1
A1 B1 ⎝ ξ ⎠ = f1 , (5.22)
C 2 A2 f2
x2
While these two underdetermined systems can be solved independently, they will
yield the solution of the global system (5.22), only if ξ = ξ̃ . Here, the vectors x1 , x2
are of order (s − m), and ξ , ξ̃ each of order 2m.
Let the matrices of the underdetermined systems be denoted as follows:
E 1 = (A1 , B1 ),
E 2 = (C2 , A2 ),
then the general form of the solution of the two underdetermined systems (5.23) and
(5.24) is given by
vi = pi + Z i yi , i = 1, 2,
ξ(y1 ) = ξ̃ (y2 ),
i.e.
p1,2 + Z 1,2 y1 = p2,1 + Z 2,1 y2 .
5.3 The Spike-Balance Scheme 107
assures us of obtaining the global solution of (5.21) and yields the balance linear
system
My = g
Bz = d,
where B is an (s × (s + m)) matrix with full row-rank. The general solution is given
by
z = B + d + PN (B) u,
The general solution z can be obtained using a variety of techniques, e.g., see
[21, 22]. Using sparse orthogonal factorization, e.g. see [23], we obtain the decom-
position,
R
B = (Q, Z ) ,
0
z = z̄ + Z y,
where z̄ = Q R −T d.
and
Im Q
12 Q 21
à à = .
Q 21 Q 12 Im
Let, H = Q
12 Q 21 . Then, since ( Ã Ã ) is symmetric positive definite,
λ( Ã Ã ) = 1 ± η > 0
Q i Q i = I − Z i Z i i = 1, 2,
+ Z Z ),
N = 2I − (Z 12 Z 12 21 21
= 2I − M M .
5.3 The Spike-Balance Scheme 109
But, the eigenvalues of N lie within the spectrum of à Ã, or à à , i.e.,
1 − ρ ≤ 2 − λ(MM ) ≤ 1 + ρ
0 < 1 − ρ ≤ λ(MM ) ≤ 1 + ρ.
With p equal block rows, the columns of A are divided into 2 p − 1 column blocks,
where the unknowns ξi , i = 1, . . . p − 1, are common to consecutive blocks of the
B
A 1
1
C
2 B
A 2
2
c
3 B
A 3
3
C B
p –1 p–1
A
p –1
C
p A
p
E 1 = (A1 , B1 ),
E i = (Ci , Ai , Bi ), i = 2, . . . , p − 1,
E p = (C p , A p ),
vi = pi + Z i yi ,
Here, each ξ and ξ̃ is of order 2m, and each yi , i = 2, 3, . . . .., p − 1 is of the same
order 2m, with y1 , y p being of order m. Since the submatrices E i are of full row-
rank, the above underdetermined linear systems are consistent, and we can enforce
the following conditions to insure obtaining the unique solution of the global system:
Thus, the banded linear system (5.27) may be solved by simultaneously computing
the solutions of the underdetermined systems (5.28)–(5.30), followed by solving
the balance system M y = g. Next, we summarize the Spike-balance scheme for p
blocks.
The Spike-Balance Scheme
In fact, for large linear systems (5.21), and a relatively large number of partitions p,
e.g. when solving (5.21) on a cluster with large number of nodes, it is preferable not
to form the resulting large balance matrix M. Instead, we use an iterative scheme in
which the major kernels are: (i) matrix–vector multiplication, and (ii) solving systems
involving the preconditioner. From the above derivation of the balance system, we
observe the following:
M y = r (0) − r (y),
where r (0) = g. In this case, however, one needs to solve the underdetermined
systems in each iteration.
Note that conditioning the matrix M is critical for the rapid convergence of any
iterative method used for solving the balance system. The following theorem provides
an estimate of the condition number of M, κ(M).
Theorem 5.1 ([7]) The coefficient matrix M of the balance system has a condition
number which is at most equal to that of the coefficient matrix A of the original
banded system (5.21), i.e.,
κ(M) ≤ κ(A).
In order to form the coefficient matrix M of the balance system, we need to obtain the
matrices Z i , i = 1, 2, . . . ., p. Depending on the size of the system and the parallel
architecture at hand, however, the storage requirements could be excessive. Below,
we outline an alternate approach that does not form the balance system explicitly.
A Projection-Based Spike-Balance Scheme
In this approach, the matrix M is not computed explicitly, rather, the balance system
is available only implicitly in the form of a matrix-vector product, in which the matrix
under consideration is given by MM . As a result, an iterative scheme such as the
conjugate gradient method (CG) can be used to solve the system MM ŵ = g instead
of the balance system in step 2 of the Spike-balance scheme.
This algorithm is best illustrated by considering the 2-partition case (5.23) and
(5.24),
x1
E1 = f1,
ξ
ξ̃
E2 = f2 ,
x2
5.3 The Spike-Balance Scheme 113
where the general solution for these underdetermined systems are given by
x1
= p1 + (I − P1 )u 1 , (5.35)
ξ
ξ̃
= p2 + (I − P2 )u 2 , (5.36)
x2
pi = E i (E i E i )−1 f i , i = 1, 2,
Pi = E i (E i E i )−1 E i , i = 1, 2. (5.37)
in which,
g = (0, Iˆ) p1 − ( Iˆ, 0) p2 ,
N1 = −(0, Iˆ)(I − P1 ),
N2 = ( Iˆ, 0)(I − P2 ).
MM w = g, (5.38)
where ˆ
0 I
MM = (0, Iˆ)(I − P1 ) ˆ + ( Iˆ, 0)(I − P2 ) . (5.39)
I 0
The system (5.38) is solved using the CG scheme with a preconditioning strategy. In
each CG iteration, the multiplication of MM by a vector requires the simultaneous
multiplications (or projections),
c j = (I − P j )b j , j = 1, 2
114 5 Banded Linear Systems
forming an outer level of parallelism, e.g., using two nodes. Such a projection cor-
responds to computing the residuals of the least squares problems,
min b j − E j c j 2 , j = 1, 2.
cj
These residuals, in turn, can be computed using inner CG iterations with each iteration
involving matrix-vector multiplication of the form,
h i = E i E i vi , i = 1, 2
forming an inner level of parallelism, e.g., taking advantage of the multicore archi-
tecture of each node. Once the solution w of the balance system in (5.38) is obtained,
the right-hand sides of (5.35) and (5.36) are updated as follows:
x1 0
= p1 − (I − P1 ) ,
ξ w
and
ξ̂ w
= p2 + (I − P2 ) .
x2 0
For the general p-partition case, the matrix MM in (5.38) becomes of the form,
p
MM = Mi Mi , (5.40)
i=1
I − Pi = I − E i+ E i . (5.41)
More specifically,
− Iˆ 0 0 − Iˆ 0 0
M̃i M̃i = (I − Pi ) .
0 0 Iˆ 0 0 Iˆ
Hence, it can be seen that MM can be expressed as the sum of sections of the
projectors (I − Pi ), i = 1, . . . , p. As outlined above, for the 2-partition case, such a
form of MM allows for the exploitation of multilevel parallelism when performing
matrix-vector product in the conjugate gradient algorithm for solving the balance
system. Once the solution of the balance system is obtained, the individual particular
solutions for the p partitions are updated simultaneously as outlined above for the
2-partition case.
5.3 The Spike-Balance Scheme 115
While the projection-based approach has the advantage of replacing the orthogonal
factorization of the block rows E i with projections onto the null spaces of E i , leading
to significant savings in storage and computation, we are now faced with a problem
of solving balance systems in the form of the normal equations. Hence, it is essential
to adopt a preconditioning strategy (e.g. using the block-diagonal of MM as a
preconditioner) to achieve a solution for the balance system in few CG iterations.
5.4.1 Introduction
Here we consider solving wide-banded linear systems that can be expressed as over-
lapping diagonal blocks in which each is a block-tridiagonal matrix. Our approach
in this tearing-based scheme is different from the Spike algorithm variants discussed
above. This scheme was first outlined in [24], where the study was restricted to diag-
onally dominant symmetric positive definite systems . This was later generalized to
nonsymmetric linear systems without the requirement of diagonal dominance, e.g.
see [25]. Later, in Chap. 10, we extend this tearing scheme for the case when we strive
to obtain a central band preconditioner that encapsulates as many nonzero elements
as possible.
First, we introduce the algorithm by showing how it tears the original block-
tridiagonal system and extract a smaller balance system to solve. Note that the
extracted balance system here is not identical to that described in the Spike-balance
scheme. Second, we analyze the conditions that guarantee the nonsingularity of the
balance system. Further, we show that if the original system is symmetric positive
definite and diagonally dominant then the smaller balance system is also symmet-
ric positive definite as well. Third, we discuss preconditioned iterative methods for
solving the balance system.
5.4.2 Partitioning
(k)
The blocks Aμν = Aη+μ,η+ν for η = 2(k − 1) and μ, ν = 1, 2, 3, except for the
overlaps between partitions. The overlaps consist of the top left blocks of the last and
middle partitions, and bottom right blocks of the first and middle partitions. For these
(k−1) (k)
blocks the following equality holds A33 + A11 = Aη+1,η+1 . The exact choice of
(k−1) (k)
splitting A33 into A33 and A11 will be discussed below.
Thus, we can rewrite the original system as a set of smaller linear systems, k =
1, 2, 3,
⎛ ⎞⎛ ⎞
A(k) (k) ⎛x1(k) ⎞
11 A12 (1 − αk−1 ) f η+1 − yk−1
⎜ (k) (k) (k) ⎟ ⎜ (k) ⎟
⎜ A A A ⎟⎜x ⎟ = ⎝ f η+2 ⎠, (5.45)
⎝ 21 22 23 ⎠ ⎝ 2 ⎠
α f
k η+3 + y
A(k) (k)
x3(k)
k
32 A33
(ζ ) (ζ +1)
x3 = x1 for ζ = 1, 2. (5.46)
5.4 A Tearing-Based Banded Solver 117
i.e.
(k) (k) (k) (k)
x1 = B11 ((1 − αk−1 ) f η+1 − yk−1 ) + B12 f η+2 + B13 (αk f η+3 + yk ),
(k) (k) (k) (k)
x3 = B31 ((1 − αk−1 ) f η+1 − yk−1 ) + B32 f η+2 + B33 (αk f η+3 + yk ).
(5.48)
Using (5.46) and (5.48) we obtain
(ζ ) (ζ +1) (ζ ) (ζ +1)
(B33 + B11 )yζ = gζ + B31 yζ −1 + B13 yζ +1 (5.49)
for ζ = 1, 2, where
⎛ ⎞
f η−1
⎜ fη ⎟
(ζ ) (ζ ) (ζ +1) (ζ ) (ζ +1) (ζ +1) ⎜ ⎟
gζ = ((αζ −1 −1)B31 , −B32 , (1−αζ )B11 −αζ B33 , B12 , αζ +1 B13 )⎜
⎜ f η+1 ⎟
⎟ . (5.50)
⎝ f η+2 ⎠
f η+3
Finally, letting g = (g1 , g2 ), the adjustment vector y can be found by solving the
balance system
M y = g, (5.51)
where
(1) (2) (2)
B33 + B11 −B13
M= (2) (2) (3) . (5.52)
−B31 B33 + B11
Once y is obtained,we can solve the linear systems in (5.45) independently for
each k = 1, 2, 3. Next we focus our attention on solving (5.51). First we note that the
matrix M is not available explicitly, thus using a direct method to solve the balance
system (5.52) is not possible and we need to resort to iterative schemes that utilize
M implicitly for performing matrix-vector multiplications of the form z = M ∗ v.
For example, one can use Krylov subspace methods, e.g. CG or BiCGstab for the
118 5 Banded Linear Systems
Since the tearing approach that enables parallel scalability depends critically on how
effectively we can solve the balance system via an iterative method, we explore
further the characteristics of its coefficient matrix M.
The Symmetric Positive Definite Case
First, we assume that A is symmetric positive definite (SPD) and investigate the
conditions under which the balance system is also SPD.
Theorem 5.2 ([25]) If the partitions Ak in (5.44) are SPD for k = 1, . . . , p then
the balance system in (5.51) is also SPD .
Proof Without loss of generality, let p = 3. Since each Ak is SPD then A−1
k is also
SPDṄext, let
I 0 0
Q = , (5.54)
0 0 −I
which is also symmetric positive definite. But, since the balance system coefficient
matrix M can be written as the sum
(2) (2)
(1)
B33 0 B11 −B13 0 0
M = M1 + M2 + M3 = + (2) (2)
+ (3) , (5.56)
0 0 −B31 B33 0 B11
5.4 A Tearing-Based Banded Solver 119
(ζ,1) (ζ ) 1 (ζ ) (ζ +1)
h ii = ei |A32 | e + ei | offdiag (A33 + A11 )| e, and (5.57)
2
(ζ +1,2) (ζ +1) 1 (ζ ) (ζ +1)
h ii = ei |A12 | e + ei | offdiag (A33 + A11 )| e, (5.58)
2
(ζ,1) (ζ +1,2)
respectively. Note that h ii and h ii are the sum of absolute values of all the
off-diagonal elements, with elements in the overlap being halved, in the ith row
to the left and right of the diagonal, respectively. Next, let the difference between
the positive diagonal elements and the sums of absolute values of all off-diagonal
elements in the same row be given by
Now, if
(ζ ) 1 1 (ζ ) (ζ +1)
A33 = Hζ(1) + Dζ + offdiag (A33 + A11 ),
2 2
(ζ +1) (2) 1 1 (ζ ) (ζ +1)
A11 = Hζ +1 + Dζ + offdiag (A33 + A11 ), (5.60)
2 2
(ζ ) (ζ +1)
it is easy to verify that A33 + A11 = A2ζ +1,2ζ +1 and each Ak , for k = 1, . . . , p,
is SPDd.d. Consequently, if (5.43) is SPD and d.d., so are the partitions Ak and by
Theorem 5.2, the balance system is guaranteed to be SPD.
The Nonsymmetric Case
Next, if A is nonsymmetric, we explore under which conditions will the balance
system (5.51) become nonsingular.
120 5 Banded Linear Systems
Theorem 5.4 ([25]) Let the matrix A in (5.42) be nonsingular with partitions Ak ,
k = 1, 2, . . . , p, in (5.44) that are also nonsingular. Then the coefficient matrix M
of the balance system in (5.51), is nonsingular.
Next, let ⎛ ⎞ ⎛ ⎞
A1 Is−τ
AL = ⎝ Is−2τ ⎠ , AR = ⎝ A2 ⎠, (5.62)
A3 Is−τ
C = A−1 A A−1
⎛L R ⎞⎛ ⎞ ⎛ ⎞⎛ ⎞
Is Is−τ A−1 0s−τ
⎠ . (5.63)
1
= ⎝ 0s−2τ ⎠⎝ A−1
2
⎠+⎝ Is−2τ ⎠⎝ Is
Is Is−τ A−1
3
0s−τ
where Is and 0s are the identity and zero matrices of order s, respectively.
Using (5.61), (5.63) and (5.47), we obtain
⎛ (1)
⎞
⎛ ⎞ 0τ 0 B13
Iτ ⎜ (1) ⎟
⎜ ⎟ ⎜ 0 0s−2τ B23 ⎟
⎜ Is−2τ ⎟ ⎜ ⎟
⎜ (2) (2) (2) ⎟ ⎜
⎜ (1) ⎟
⎟
⎜ B11 B12 B13 ⎟ ⎜ 0 0 B33 ⎟
⎜ ⎟ ⎜ ⎟
C =⎜
⎜ 0 0s−2τ 0 ⎟+⎜
⎟ ⎜ I s−2τ ⎟,
⎜ ⎟ ⎜ ⎟
⎜ (2) (2) (2) ⎟ ⎜ (3) ⎟
⎜ B31 B32 B33 ⎟ ⎜ B 0 0 ⎟
⎝ ⎠ ⎜
11
⎟
B21 0s−2τ 0 ⎟
(3)
Is−2τ
Iτ ⎝ ⎠
(3)
B31 0 0τ
⎛ (1)
⎞
Iτ 0 B13
⎜ (1) ⎟
⎜ 0 Is−2τ B23 ⎟
⎜ ⎟
⎜ (1) (2) (2) (2) ⎟
⎜0 0 B +B B B ⎟
⎜ 33 11 12 13 ⎟
⎜ ⎟
=⎜ 0 Is−2τ 0 ⎟, (5.64)
⎜ ⎟
⎜ (2) (2) (2)
+
(3) ⎟
⎜ B B B B 0 0 ⎟
⎜ 31 32 33 11
⎟
⎜ (3) ⎟
⎝ B21 I s−2τ 0 ⎠
(3)
B31 0 Iτ
where the zero matrices denoted by 0 without subscripts, are considered to be of the
appropriate sizes. Using the orthogonal matrix P, of order 3s − 2τ , given by
5.4 A Tearing-Based Banded Solver 121
⎛ ⎞
Iτ
⎜ Is−2τ 0 ⎟
⎜ ⎟
⎜ 0 Is−2τ 0 ⎟
⎜ ⎟
P =⎜
⎜ 0 0 Is−2τ ⎟,
⎟ (5.65)
⎜ 0 0 Iτ ⎟
⎜ ⎟
⎝ Iτ 0 ⎠
−Iτ
we obtain
⎛ (1)
⎞
Iτ B13
⎜ ⎟
⎜ Is−2τ B23
(1) ⎟
⎜ ⎟
⎜ Is−2τ ⎟
⎜ ⎟
⎜ (3)
−B21 ⎟
⎜ Is−2τ ⎟
P C P = ⎜ ⎟. (5.66)
⎜ (3) ⎟
⎜ Iτ B31 ⎟
⎜ ⎟
⎜ ⎟
⎜ B12
(2) (1)
B33 + B11
(2)
−B13
(2) ⎟
⎝ ⎠
(2) (2) (2) (3)
−B32 −B31 B33 + B11
we see that premultiplying the first block row of (5.68) by Z 2 and noticing that
Z 2 Z 1 = 0, we obtain
(1 − λ)Z 2 u 1 = 0. (5.69)
Thus, the eigenvalues of C are either 1 or identical to those of the balance system,
i.e.,
λ(C) = λ(P C P) ⊆ {1, λ(M)}. (5.71)
Since the size of the coefficient matrix M of the balance system is much smaller
than n, the above theorem indicates that A L and A R are effective left and right
preconditioners of system (5.42).
Next, we explore those conditions which guarantee that there exists a splitting of
the coefficient matrix in (5.42) resulting in nonsingular partitions Ak , and provide a
scheme for computing such a splitting.
Theorem 5.5 ([25]) Assume that the matrix A in (5.42) is nonsymmetric with a
positive definite symmetric part H = 21 (A + A ). Also let
⎛ ⎞ ⎛ ⎞
0(s−τ )×τ 0(s−τ )×τ
⎜ ⎟ ⎜ ⎟
⎜ A(2) ⎟
T
⎜ (2)
A11 ⎟ ⎜ ⎟
⎜ ⎟ ⎜
11
⎟
⎜ (2) ⎟ ⎜ A(2)T ⎟
⎜ A21 ⎟ ⎜ ⎟
⎜ ⎟ ⎜ 12
⎟
⎜ (3) ⎟ ⎜ (3)T ⎟
⎜ A11 ⎟ ⎜ A11 ⎟
⎜ ⎟ ⎜ ⎟
⎜ (3) ⎟ ⎜ ⎟
⎜ A21 ⎟ ⎜ (3)T
⎟.
B1 = ⎜ ⎟ and B2 = ⎜
A12
⎟
⎜ ⎟ ⎜ ⎟
⎜ .. ⎟ ⎜ .. ⎟
⎜ . ⎟ ⎜ . ⎟
⎜ ⎟ ⎜ ⎟
⎜ ( p) ⎟ ⎜ ⎟
⎜ A11 ⎟ ⎜ ( p)T
⎟
⎜ ⎟ ⎜
A11
⎟
⎜ ( p) ⎟ ⎜ ⎟
⎜ A21 ⎟ ⎜ ( p)T ⎟
⎝ ⎠ ⎝ A12 ⎠
0τ 0τ
(5.72)
Assuming that partial symmetry holds, i.e., B1 = B2 = B, then there exists a splitting
such that the partitions Ak in (5.44) for k = 1, . . . , p are nonsingular.
Proof Let p = 3, and let à be the block-diagonal matrix in which the blocks are the
partitions Ak .
⎛ ⎞
A(1) (1)
11 A12
⎜ (1) (1) (1) ⎟
⎜ A21 A22 A23 ⎟
⎜ ⎟
⎜ (1) (1) ⎟
⎜ A32 A33 ⎟
⎜ ⎟
⎜ (2) (2) ⎟
⎜ A11 A12 ⎟
⎜ ⎟
⎜ (2) (2) (2) ⎟
⎜
à = ⎜ A A A ⎟. (5.73)
21 22 23 ⎟
⎜ (2) (2) ⎟
⎜ A32 A33 ⎟
⎜ ⎟
⎜ (3) (3) ⎟
⎜ A11 A12 ⎟
⎜ ⎟
⎜ (3) (3) (3) ⎟
⎜ A21 A22 A23 ⎟
⎝ ⎠
(3) (3)
A32 A33
5.4 A Tearing-Based Banded Solver 123
(5.74)
then,
⎛ ⎞
A(1) (1)
11 A12
⎜ ⎟
⎜ A(1) A(1) (1) ⎟
⎜ 21 22 A23 ⎟
⎜ ⎟
⎜ A(1) (1) (2)
A(2) A(2) ⎟
⎜ 32 A33 + A11 ⎟
⎜ 12 11
⎟
⎜ (2) (2) (2) (2) ⎟
⎜ A21 A22 A23 A21 ⎟
⎜ ⎟
⎜ (2) (2) (3) (3) (3) ⎟
J Ã J = ⎜
⎜
A32 A33 + A11 A12 A11 ⎟ .
⎟
⎜ (3) ⎟
⎜
⎜ A(3)
21 A(3) (3)
22 A23 A21 ⎟⎟
⎜ (3) (3) ⎟
⎜ A32 A33 ⎟
⎜ ⎟
⎜ ⎟
⎜ (2) (2) (2) ⎟
⎜ A11 A12 A11 ⎟
⎝ ⎠
A(3)
11 A(3)
12
(3)
A11
(5.75)
Writing (5.75) as a block 2 × 2 matrix, we have
⎛ ⎞
A B1
J Ã J = ⎝ B K ⎠ . (5.76)
2
(k) (k−1)
Using the splitting A11 = 21 (Aη+1,η+1 +A η+1,η+1 )+β I and A33 = 21 (Aη+1,η+1 −
Aη+1,η+1 ) − β I , we can choose β so as to ensure that B1 and B2 are of full rank and
K is SPD Thus, using Theorem 3.4 on p. 17 of [26], we conclude that J Ã J is of
full rank, hence à has full rank and consequently the partitions Ak are nonsingular.
Note that the partial symmetry assumption B1 = B2 in the theorem above is not as
restrictive as it seems. Recalling that the original matrix is banded, it is easy to see
that the matrices A(k) (k)
12 and A21 are almost completely zero except for small parts in
their respective corners, which are of size no larger than the overlap. This condition
then can be viewed as a requirement of symmetry surrounding the overlaps.
124 5 Banded Linear Systems
Let us now focus on two special cases. First, if the matrix A is SPD the conditions
of Theorem 5.5 are immediately satisfied and we obtain the following.
Corollary 5.1 If the matrix A in (5.42) is SPD then there is a splitting (as described
in Theorem 5.5) such that the partitions Ak in (5.44) for k = 1, . . . , p are nonsin-
gular and consequently the coefficient matrix M of the balance system in (5.51), is
nonsingular.
Second, note that Theorem 5.3 still holds even if the symmetry requirement is
dropped. Combining the results of Theorems 5.3 and 5.4, without any requirement
of symmetry, we obtain the following.
Corollary 5.2 If the matrix A in (5.42) is d.d., then the partitions Ak in (5.44)
can be chosen such that they are also nonsingular and d.d. for k = 1, . . . , p and
consequently the coefficient matrix M, of the balance system in (5.51), is nonsingular.
Next, we show how one can compute the residual rinit needed to start a Krylov
subspace scheme for solving the balance system. Rewriting (5.47) as
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
(k) (k) (k)
x1 h1 ȳ1
⎜ (k) ⎟ ⎜ (k) ⎟ ⎜ (k) ⎟
⎜ x ⎟ = ⎜ h ⎟ + ⎜ ȳ ⎟ , (5.77)
⎝ 2 ⎠ ⎝ 2 ⎠ ⎝ 2 ⎠
(k) (k) (k)
x3 h3 ȳ3
where,
⎛ ⎞ ⎛ ⎞ ⎛ (k) ⎞
h (k) (1 − αk−1 ) f η+1 ȳ1 ⎛ ⎞
1
⎜ (k) ⎟ −yk−1
⎜ h ⎟ = A−1 ⎜ ⎟ ⎜ ⎟
⎠,⎜
(k) ⎟ −1 ⎝
⎝ 2 ⎠ k ⎝ f η+2 ⎝ ȳ2 ⎠ = Ak 0 ⎠, (5.78)
(k) αk f η+3 (k) yk
h3 ȳ3
where the second equality in (5.79) follows from the combination of (5.46), (5.48),
(5.50) and (5.52). Let the initial guess be yinit = 0, then we have
5.4 A Tearing-Based Banded Solver 125
⎛ (2) (1) ⎞
h1 − h3
⎜ (3) ⎟
⎜ h 1 − h (2) ⎟
⎜ 3 ⎟
rinit =g=⎜ . ⎟. (5.80)
⎜ .. ⎟
⎝ ⎠
( p) ( p−1)
h1 − h3
Thus, to compute the initial residual we must solve the p independent linear systems
ζ
(5.45) and subtract the bottom part of the solution vector of partition ζ , h 3 , from the
(ζ +1)
top part of the solution vector of partition ζ + 1, h 1 , for ζ = 1, . . . , p − 1.
Finally, to compute matrix-vector products of the form q = M p, we use (5.79)
and (5.80), to obtain
⎛ (1) (2) ⎞
ȳ3 − ȳ1
⎜ (2) ⎟
⎜ ȳ3 − ȳ1(3) ⎟
⎜ ⎟
M y = g − r = rinit − r = ⎜ .. ⎟. (5.81)
⎜ ⎟
⎝ . ⎠
( p−1) ( p)
ȳ3 − ȳ1
Hence, we can compute the matrix-vector products M p for any vector p in a fashion
similar to computing the initial residual using (5.81) and (5.78). The modified Krylov
subspace methods (e.g. CG, or BiCGstab) used to solve (5.51) are the standard ones
except that the initial residual and the matrix-vector products are computed using
(5.80) and (5.81), respectively. We call these solvers Domain-Decomposition-CG
(DDCG) and Domain-Decomposition-BiCGstab (DDBiCGstab). They are schemes
in which the solutions of the smaller independent linear systems in (5.45) are obtained
via a direct solver, while the adjustment vector y is obtained using an iterative method,
CG for SPD and BiCGStab for nonsymmetric linear systems. The outline of the
DDCG algorithm is shown in Algorithm 5.2. The usual steps of CG are omitted, but
the two modified steps are shown in detail. The outline of the DDBiCGstab scheme
is similar and hence ommitted.
The balance system is preconditioned using a block-diagonal matrix of the form,
⎛ (1) (2) ⎞
B̃33 + B̃11
⎜ .. ⎟
M̃ = ⎝ . ⎠, (5.82)
( p−1) ( p)
B̃33 + B̃11
So far, we discussed algorithms for general banded systems; this section is devoted to
tridiagonal systems because their simple structure makes possible the use of special
purpose algorithms. Moreover, many applications and algorithms contain a tridiag-
onal system solver as a kernel, used directly or indirectly.
The parallel solution of linear systems of equations, Ax = f , with coefficient
matrix A that is point (rather than block) tridiagonal,
5.5 Tridiagonal Systems 127
⎛ ⎞
α1,1 α1,2
⎜α2,1 α2,2 α2,3 ⎟
⎜ ⎟
⎜ .. .. .. ⎟
A=⎜
⎜ . . . ⎟
⎟ (5.84)
⎜ . .. ⎟
⎝ .. . αn−1,n ⎠
αn,n−1 αn,n
and abbreviated as [αi,i−1 , αi,i , αi,i+1 ] when the dimension is known from the con-
text, has been the subject of many studies. Even though the methods can be extended
to handle multiple right-hand sides, here we focus on the case of only one, that we
denote by f . Because of their importance in applications, tridiagonal solvers have
been developed for practically every type of parallel computer system to date. The
classic monographs [27–29] discuss the topic extensively. The activity is contin-
uing, with refinements to algorithms and implementations in libraries for parallel
computer systems such as ScaLAPACK, and with implementations on different
computer models and novel architectures; see e.g. [30–32].
In some cases, the matrix is not quite tridiagonal but it differs from one only by
a low rank modification. It is then possible to express the solution (e.g. by means
of the Sherman-Morrison-Woodbury formula [22]) in a manner that involves the
solution of tridiagonal systems, possibly with multiple right-hand sides. The latter is
a special case of a class of problems that requires the solution of multiple systems,
all with the same or with different tridiagonal matrices. This provides the algorithm
designer with more opportunities for parallelism. For very large matrices, it might
be preferable to use a parallel algorithm for a single or just a few right-hand sides
so as to be able to maintain all required coefficients in fast memory; see for instance
remarks in [33, 34]. Sometimes the matrices have special properties such as Toeplitz
structure, diagonal dominance, or symmetric positive definiteness. Algorithms that
take advantage of these properties are more effective and sometimes safer in terms of
their roundoff error behavior. We also note that because the cost of standard Gaussian
elimination for tridiagonal systems is already linear, we do not expect impressive
speedups. In particular, observe that in a general tridiagonal system of order n, each
solution element, ξi , depends on the values of all inputs, that is 4n − 2 elements.
This is also evident by the fact that A−1 is dense (though data sparse, which could
be helpful in designing inexact solvers). Therefore, with unbounded parallelism, we
need O(log n) steps to compute each ξi . Without restricting the number of processors,
the best possible speedup of a tridiagonal solver is expected to be O( logn n ); see also
[35] for this fan-in based argument.
The best known and earliest parallel algorithms for tridiagonal systems are recur-
sive doubling, cyclic reduction and parallel cyclic reduction. We first review these
methods as well as some variants. We then discuss hybrid and divide-and-conquer
strategies that are more flexible as they offer hierarchical parallelism and ready adap-
tation for limited numbers of processors.
128 5 Banded Linear Systems
We first describe a very fast parallel but potentially unstable algorithm described in
[12, Algorithm III]. This was inspired by the shooting methods [36–38] and by the
marching methods described in [39, 40] and discussed also in Sect. 6.4.7 of Chap. 6
It is worth noting that the term “marching” has been used since the early days of
numerical methods; cf. [41]. In fact, a definition can be found in the same book by
Richardson [42, Chap. I, p. 2] that also contains the vision of the “human parallel
computer” that we discussed in the Preface.
In the sequel we will assume that the matrix is irreducible, thus all elements in
the super- and subdiagonal are nonzero; see also Definition 9.1 in Chap. 9. Otherwise
the matrix can be brought into block-upper triangular (block diagonal, if symmetric)
form with tridiagonal submatrices along the diagonal, possibly after row and column
permutations and solved using block back substitution, where the major computation
that has the leading cost at each step is the solution of a smaller tridiagonal system.
In practice, we assume that there exists a preprocessing stage which detects such
elements and if such are discovered, the system is reduced to smaller ones.
The key observation is the following: If we know the last element, ξn , of the solu-
tion x = (ξi )1:n , then the remaining elements can be computed from the recurrence
1
ξn−k−1 = (φn−k − αn−k,n−k ξn−k − αn−k,n−k+1 ξn−k+1 ). (5.85)
αn−k,n−k−1
where for simplicity we denote R̃ = A2:n,1:n−1 that is non-singular and banded upper
triangular with bandwidth 3, b = A2:n,n , a = A1,1:n−1 , x̂ = x1:n−1 , and g = f 2:n .
Applying block LU factorization
R̃ b I 0 R̃ b
= −1
a 0 a R̃ 1 0 −a R̃ −1 b
it follows that
and
x̂ = R̃ −1 g − ξn R̃ −1 b.
Taking into account that certain terms in the formulas for ξn and x̂ are repeated, the
leading cost is due to the solution of a linear system with coefficient matrix R̃ and
two right-hand sides ( f 2:n , A2:n,n ). From these results, ξ1 and x2:n can be obtained
with few parallel operations. If the algorithm of Theorem 3.2 is extended to solve
non-unit triangular systems with two right-hand sides, the overall parallel cost is
T p = 3 log n + O(1) operations on p = 4n + O(1) processors. The total number
of operations is O p = 11n + O(1), which is only slightly larger than Gaussian
elimination, and the efficiency E p = 12 11 log n .
Remark 5.2 The above methodology can also be applied to more general banded
matrices. In particular, any matrix of bandwidth (2m +1) (it is assumed that the upper
and lower half-bandwidths are equal) of order n can be transformed by reordering
the rows, into
R B
(5.86)
C 0
where R is of order n −m and upper triangular with bandwidth 2m +1, and the corner
zero matrix is of order m. This property has been used in the design of other banded
solvers, see e.g. [43]. As in Sect. “Notation”, let J = e1 e2 + · · · + en−1 en , and set
S = J + en e1 be the “circular shift matrix”. Then S m A is as (5.86). Indeed, this is
straightforward to show by probing its elements, e.g. ei S m Ae j = 0 if j < i < n −m
or if i + 2m < j < n − m. Therefore, instead of solving Ax = b, we examine the
equivalent system S m Ax = S m b and then use block LU factorization. We then solve
by exploiting the banded upper triangular structure of R and utilizing the parallel
algorithms described in Chap. 3.
The marching algorithm has the smallest parallel cost among the tridiagonal
solvers described in this section. On the other hand, it has been noted in the lit-
erature that marching methods are prone to considerable error growth that render
them unstable unless special precautions are taken. An analysis of this issue when
solving block-tridiagonal systems that arise from elliptic problems was conducted
in [40, 44]; cf. Sect. 6.4.7 in Chap. 6. We propose an explanation of the source of
instability that is applicable to the tridiagonal case. Specifically, the main kernel of
the algorithm is the solution of two banded triangular systems with the same coeffi-
cient matrix. The speed of the parallel marching algorithm is due to the fact that these
systems are solved using a parallel algorithm, such as those described in Chap. 3.
As is well known, serial substitution algorithms for solving triangular systems are
130 5 Banded Linear Systems
extremely stable and the actual forward error is frequently much smaller than what
a normwise or componentwise analysis would predict; cf. [45] and our discussion
in Sect. 3.2.3. Here, however, we are interested in the use of parallel algorithms
for solving the triangular systems. Unfortunately, normwise and componentwise
analyses show that their forward error bounds, in the general case, depend on the
cube of the condition number of the triangular matrix; cf. [46]. As was shown in
[47], the 2-norm condition number, κ2 (An ), of order-n triangular matrices An whose
nonzero entries are independent and normally√ distributed with mean 0 and vari-
ance 1 grows exponentially, in particular n κ2 (An ) → 2 almost surely. Therefore,
the condition is much worse than that for random dense matrices, where it grows
linearly. There is therefore the possibility of exponentially fast increasing condi-
tion number compounded by the error’s cubic dependence on it. Even though we
suspect that banded random matrices are better behaved, our experimental results
indicate that the increase in condition number is still rapid for triangular matrices
such as R̃ that have bandwidth 3.
It is also worth noting that even in the serial application of marching, in which
case the triangular systems are solved by stable back substitution, we cannot assume
any special conditions for the triangular matrix except that it is banded. In fact, in
experiments that we conducted with random matrices, the actual forward error also
increases rapidly as the dimension of the system grows. Therefore, when the problem
is large, the parallel marching algorithm is very likely to suffer from severe loss of
accuracy.
Cyclic reduction (cr) relies on the combination of groups of three equations, each
consisting of an even indexed one, say 2i, together with its immediate neighbors,
indexed (2i ± 1), and involve five unknowns. The equations are combined in order to
eliminate odd-indexed unknowns and produce one equation with three unknowns per
group. Thus, a reduced system with approximately half the unknowns is generated.
Assuming that the system size is n = 2k −1, the process is repeated for k steps, until a
single equation involving one unknown remains. After this is solved, 2, 22 , . . . , 2k−1
unknowns are computed and the system is fully solved. Consider, for instance, three
adjacent equations from (5.84).
⎛ ⎞
⎛ ⎞ ξi−2 ⎛ ⎞
αi−1,i−2 αi−1,i−1 αi−1,i ⎜ξi−1 ⎟ φi−1
⎜ ⎟
⎝ αi,i−1 αi,i αi,i+1 ⎠ ⎜ ξi ⎟ = ⎝ φi ⎠
⎜ ⎟
αi+1,i αi+1,i+1 αi+1,i+2 ⎝ξi+1 ⎠ φi+1
ξi+2
5.5 Tridiagonal Systems 131
under the assumption that unknowns indexed n or above and 0 or below are zero. If
both sides are multiplied by the row vector
α αi,i+1
− αi−1,i−1
i,i−1
, 1, − αi+1,i+1
This involves only the unknowns ξi−2 , ξi , ξi+2 . Note that 12 floating-point operations
suffice to implement this transformation.
To simplify the description, unless mentioned otherwise we assume that cr is
applied on a system of size n = 2k − 1 for some k and that all the steps of the algo-
rithm can be implemented without encountering division by 0. These transformations
can be applied independently for i = 2, 2 × 2, . . . , 2 × (2k−1 − 1) to obtain a tridi-
agonal system that involves only the (even numbered) unknowns ξ2 , ξ4 , . . . , ξ2k−1 −2
and is of size 2k−1 −1 which is almost half the size of the previous one. If one were to
compute these unknowns by solving this smaller tridiagonal system then the remain-
ing ones can be recovered using substitution. Cyclic reduction proceeds recursively
by applying the same transformation to the smaller tridiagonal system until a single
scalar equation remains and the middle unknown, ξ2k−1 is readily obtained. From
then on, using substitutions, the remaining unknowns are computed.
The seminal paper [48] introduced odd-even reduction for block-tridiagonal sys-
tems with the special structure resulting from discretizing Poisson’s equation; we
discuss this in detail in Sect. 6.4. That paper also presented an algorithm named
“recursive cyclic reduction” for the multiple “point” tridiagonal systems that arise
when solving Poisson’s equation in 2 (or more) dimensions.
An key observation is that cr can be interpreted as Gaussian elimination with
diagonal pivoting applied on a system that is obtained from the original one after
renumbering the unknowns and equations so that those that are odd multiples of 20
are ordered first, then followed by the odd multiples of 21 , the odd multiples of 22 ,
etc.; cf. [49]. This equivalence is useful because it reveals that in order for cr to be
numerically reliable, it must be applied to matrices for which Gaussian elimination
without pivoting is applicable. See also [50–52].
The effect of the above reordering on the binary representation of the indices of the
unknown and right-hand side vectors, is an unshuffle permutation of the bit-permute-
complement (BPC) class; cf. [53, 54]. If Q denotes the matrix that implements the
unshuffle, then
Do B
Q AQ = ,
C De
132 5 Banded Linear Systems
Note that B and C are of size 2k−1 × (2k−1 − 1). Then we can write
I2k−1 0 Do B
Q AQ = .
C Do−1 I2k−1 −1 0 De − C Do−1 B
Therefore the system can be solved by computing the subvectors containing the odd
and even numbered unknowns x (o) , x (e)
where f (o) , f (e) are the subvectors containing the odd and even numbered elements
of the right-hand side. The crucial observation is that the Schur complement De −
C Do−1 B is of almost half the size, 2k−1 − 1, and tridiagonal, because the term
C Do−1 B is tridiagonal and De diagonal. The tridiagonal structure of C Do−1 B is
due to the fact that C and B are upper and lower bidiagonal respectively (albeit not
square). It also follows that cr is equivalent to Gaussian elimination with diagonal
pivoting on a reordered matrix. Specifically, the unknowns are eliminated in the
nested dissection order (cf. [55]) that is obtained by the repeated application of the
above scheme: First the odd-numbered unknowns (indexed by 2k − 1), followed by
those indexed 2(2k − 1), followed by those indexed 22 (2k − 1), and so on. The cost
to form the Schur complement tridiagonal matrix and corresponding right-hand side
is 12 operations per equation and so the cost for the first step is 12 × (2k−1 − 1)
operations. Applying the same method recursively on the tridiagonal system with
the Schur complement matrix for the odd unknowns one obtains after k − 1 steps
1 scalar equation for ξ2k−1 that is the unknown in middle position. From then on,
the remaining unknowns can be recovered using repeated applications of the second
equation above. The cost is 5 operations per computed unknown, independent for
each. On p = O(n) processors, the cost is approximately T p = 17 log n parallel
operations and O p = 17n − 12 log n in total, that is about twice the number of
operations needed by Gaussian elimination with diagonal pivoting. cr is listed as
Algorithm 5.3. The cr algorithm is a component in many other parallel tridiagonal
5.5 Tridiagonal Systems 133
(A + δ A)x̃ = f, δ A ∞ ≤ 10(log n) A ∞ u
x̃ − x ∞
≤ 10(log n)κ∞ (A)u
x̃ ∞
Lemma 5.1 The product of two t-lower (resp. upper) diagonal matrices of equal
size is 2t-lower (resp. upper) diagonal. Also if 2t > n then the product is the zero
5.5 Tridiagonal Systems 135
matrix. If L is t-lower diagonal and U is t-upper diagonal, then LU and UL are both
diagonal. Moreover the first t elements of the diagonal of LU are 0 and so are the
last t diagonal elements of UL.
Proof The nonzero structure of a 1-upper diagonal matrix is the same (without loss
of generality, we ignore the effect of zero values along the superdiagonal) with that of
J , where as usual J = e1 e2 +e2 e3 +· · ·+en−1 en is the 1-upper diagonal matrix
with all nonzero elements equal to 1. A similar result holds for 1-lower diagonal
matrices, except that we use J . Observe that (J )2 = e1 e3 + · · · + en−2 en which
is 2-upper diagonal and in general
(J )t = e1 et+1
+ · · · en−t en .
which is t-upper diagonal and has the same nonzero structure as any t-upper diagonal
U . If we multiply two t-upper diagonal matrices, the result has the same nonzero
structure as (J )2t and thus will be 2t-upper diagonal. A similar argument holds for
the t-lower diagonal case, which proves the first part of the lemma. Note next that
(J )t J t = (e1 et+1
+ · · · en−t en )(e1 et+1
+ · · · en−t en )
= e1 e1 + · · · + en−t en−t
which is a diagonal matrix with zeros in the last t elements of its diagonal. In the
same way, we can show that J t (J )t is diagonal with zeros in the first t elements.
Corollary 5.3 Let A = D − L − U be a t-tridiagonal matrix for some nonnegative
integer t such that t < n, L (resp. U ) is t-lower (resp. upper) diagonal and all
diagonal elements of D are nonzero. Then (D + L + U )D −1 A is 2t-tridiagonal. If
2t > n then the result of all the above products is the zero matrix.
Proof After some algebraic simplifications we obtain
(I + L D −1 + U D −1 )A = D − (L D −1 U + U D −1 L) − (L D −1 L + U D −1 U ). (5.87)
Multiplications with D −1 have no effect on the nonzero structure of the results. Using
the previous lemma and the nonzero structure of U and L, it follows that UD−1 U
is 2t-upper diagonal, LD−1 L is 2t-lower diagonal. Also UD−1 L is diagonal with its
last t diagonal elements zero and LD−1 U is diagonal with its first t diagonal elements
zero. Therefore, in formula (5.87), the second right-hand side term (in parentheses)
is diagonal and the third is the sum of a 2t-lower diagonal and a 2t-upper diagonal
matrix, proving the claim.
Let now n = 2k and consider the sequence of transformations
for j = 1, . . . , k −1, where (A(1) , f (1) ) = (A, f ). Observe the structure of A( j) . The
initial A(1) is tridiagonal and so from Corollary 5.3, matrix A(2) is 2-tridiagonal, A(3)
136 5 Banded Linear Systems
It is frequently the case that the dominance of the diagonal terms becomes more
pronounced as cr and paracr progress. One can thus consider truncated forms of the
above algorithms to compute approximate solutions. This idea was explored in [51]
where it was called incomplete cyclic reduction, in the context of cyclic reduction for
block-tridiagonal matrices. The idea is to stop the reduction stage before the log n
steps, and instead of solving the reduced system exactly, obtain an approximation
followed by the necessary back substitution steps. It is worth noting that terminating
the reduction phase early alleviates or avoids completely the loss of parallelism of cr.
5.5 Tridiagonal Systems 137
10 10
15 15
0 5 10 15 0 5 10 15
nz = 44 nz = 40
0 0
5 5
10 10
15 15
0 5 10 15 0 5 10 15
nz = 32 nz = 16
Incomplete point and block cyclic reduction were studied in detail in [64, 65];
it was shown that if the matrix is diagonally dominant by rows then if the row
dominance factor (revealing the degree of diagonal dominance) of A, defined by
⎧ ⎫
⎨ 1 ⎬
rdf(A) := max |αi, j |
i ⎩ |αi,i | ⎭
j=i
is 2-tridiagonal (and thus triadic). The first term in parentheses provides the diagonal
elements and the second term the off-diagonal ones. Next note that A ∞ = |A| ∞
and the terms L D −1 L and U D −1 U have their nonzeros at different position. There-
fore, the row dominance factor of A(2) is equal to
rdf(A(2) ) = (D − L D −1 U − U D −1 L)−1 (L D −1 L + U D −1 U ) ∞ .
138 5 Banded Linear Systems
and because of strict row diagonal dominance, ψ̂i θ̂i + ψi θi < 1. Therefore if we
compute for i = 1, . . . , n the maximum values of the function
ψ̂ζi + ψηi
gi (ψ̂, ψ) =
1 − ψ̂ θ̂i − ψθi
over (ψ, ψ̂) assuming that conditions such as (5.89) hold, then the row dominance
factor of A(2) is less than or equal to the maximum of these values. It was shown in
5.5 Tridiagonal Systems 139
ψ̂ζi + ψηi
≤ ε2 . (5.90)
1 − ψ̂ θ̂i − ψθi
We also observe that it is possible to use the matrix splitting framework in order
to describe cr. We sketch the basic idea, assuming this time that n = 2k − 1. At
each step j = 1, . . . , k − 1, of the reduction phase the following transformation is
implemented:
( j) ( j)
(A( j+1) , f ( j+1) ) = (I + L e (D ( j) )−1 + Ue (D ( j) )−1 )(A( j) , f ( j) ), (5.91)
where initially (A(1) , f (1) ) = (A, f ). We assume that the steps of the algorithm can
be brought to completion without division by zero. Let A( j) = D ( j) − L ( j) − U ( j)
denote the splitting of A( j) into its diagonal and strictly lower and upper triangular
( j) ( j)
parts. From L ( j) and U ( j) we extract the strictly triangular matrices L e and Ue
( j) (
following the rule that they contain the values of L and U at locations (i2 , i2 −
j) j j
2 j−1 ) and (i2 j −2 j−1 , i2 j ) respectively for i = 1, . . . , 2k− j −1 and zero everywhere
else. Then at the end of the reduction phase, row 2k−1 (at the middle) of matrix A(k−1)
will only have its diagonal (middle) element nonzero and the unknown ξ2k−1 can be
computed in one division. This is the first step of the back substitution phase which
consists of k steps. At step j = 1, . . . , k, there is a vector division of length 2 j−1
to compute the unknowns indexed by (1, 3, . . . , 2 j − 1) · 2k− j and 2 j _AXPY,
BLAS1, operations on vectors of length 2k− j − 1 for the updates. The panels in
Fig. 5.10 illustrate the matrix structures that result after k steps of cyclic reduction
(k = 3 in the left and k = 4 in the right panel).
The previous procedure can be extended to block-tridiagonal systems. An analysis
similar to ours for the case of block-tridiagonal systems and block cyclic reduction
was described in [66]; cf. Sect. 6.4.
Assume that all leading principal submatrices of A are nonsingular. Then there exists
a factorization A = L DU where D = diag(δ1 , . . . , δn ) is diagonal and
140 5 Banded Linear Systems
0 0
1 2
2 4
3 6
4 8
5 10
6 12
7 14
8 16
0 2 4 6 8 0 5 10 15
nz = 15 nz = 37
Fig. 5.10 Nonzero structure of A(3) ∈ R7×7 (left) and A(4) ∈ R15×15 (right). In both panels, the
unknown corresponding to the middle equation is obtained using the middle value of each matrix
enclosed by the rectangle with double border. The next set of computed unknowns (2 of them)
correspond to the diagonal elements enclosed by the simple rectangle, the next set of computed
unknowns (22 of them) correspond to the diagonal elements enclosed by the dotted rectangles. For
A(4) , the final set of 23 unknowns correspond to the encircled elements
⎛ ⎞ ⎛ ⎞
1 1 υ2
⎜λ2 ⎟ ⎜ .. .. ⎟
⎜ ⎟ ⎜ . . ⎟
L=⎜ . . ⎟ ,U = ⎜ ⎟.
⎝ .. .. ⎠ ⎝ υn ⎠
λn 1 1
We can write the matrix A as a sum of the rank-1 matrices formed by the columns
of L and the rows of U multiplied by the corresponding diagonal element of D.
Observing the equalities along the diagonal, the following recurrence of degree 1
holds:
αi−1,i αi,i−1
δi = αi,i − , i = 2 : n. (5.92)
δi−1
This recurrence is linearized as described in Sect. 3.4. Specifically, using new vari-
ables τi and setting δi = τi /τi−1 with τ0 = 1, τ1 = α1,1 it follows that
The right-hand side is the unit vector, thus the solution is the first column of the
inverse of the coefficient matrix, that is lower triangular with bandwidth 3. This
specific system, as was shown in [18, Lemma 2], can be solved in 2 log n + O(1)
steps using at most 2n processors. It is worth noting that this is faster than the
approximate cost of 3 log n steps predicted by Theorem 3.2 and this is due to the unit
vector in the right-hand side. Once the τi ’s are available, the elements of D, L and
U are computable in O(1) parallel steps using n processors. Specifically
αi+1,i αi,i+1
δi = τi /τi−1 , λi+1 = , υi = .
di di
This is a vector recurrence for the row vector t (i) = (τi , τi−1 ) that we can write as
t (i) = t (i−1) Si ,
142 5 Banded Linear Systems
This can be done using a matrix (product) parallel prefix algorithm on the set
t (1) , S1 , . . . , Sn . Noticing that t (i) and t (i+1) both contain the value τi , it is suffi-
cient to compute approximately half the terms, say t (1) , t (3) , . . . . For convenience
assume that n is odd. Then this can be accomplished by first computing the prod-
ucts t (1) S1 and S2i S2i+1 for i = 1, . . . , (n − 1)/2 and then applying parallel prefix
matrix product on these elements. For example, when n = 7, the computation can
be accomplished in 3 parallel steps, shown in Table 5.1.
Table 5.1 Steps of a parallel prefix matrix product algorithm to compute the recurrence (5.93)
Step 1 t (1) S1 S2:3 S4:5 S6:7
Step 2 t (1) S1:3 S2:5 S4:7
Step 3 t (1) S1:5 t (1) S1:7
For j > i, the term Si: j denotes the product Si Si+1 · · · S j
5.5 Tridiagonal Systems 143
Regarding stability in finite precision, to achieve the stated complexity, the algo-
rithm uses either parallel prefix matrix products or a lower banded triangular solver
(that also uses parallel prefix). An analysis of the latter process in [18] and some
improvements in [68], show that the bound for the 2-norm of the absolute forward
error contains a factor σ n+1 , where σ = maxi Si 2 . This suggests that the absolute
forward error can be very large in norm. Focusing on the parallel prefix approach, it
is concluded that the bound might be pessimistic, but deriving a tighter one, for the
general case, would be hard. Therefore, the algorithm must be used with precaution,
possibly warning the user when the error becomes large.
αi,i αi,i−1 φi
ξi+1 = − ξi − ξi−1 +
αi, i + 1 αi, i + 1 αi,i+1
⎛ ⎞
ξi
If we set x̂i = ⎝ξi−1 ⎠ then we can write the matrix recurrence
1
⎛ ⎞
ρi σi τi
αi,i αi,i−1 φi
x̂i+1 = Mi x̂i , where Mi = ⎝ 1 0 0 ⎠ , ρi = − , σi = − , τi = .
0 0 1 αi,i+1 αi,i+1 αi,i+1
The initial value is x̂1 = (ξ1 , 0, 1) . Observe that ξ1 is yet unknown. If it were
available, then the matrix recurrence can be used to compute all elements of x from
x̂2 , . . . , x̂n :
Algorithm 5.6 rd_pref: tridiagonal system solver using rd and matrix parallel
prefix.
Input: Irreducible A = [αi,i−1 , αi,i , αi,i+1 ] ∈ Rn×n , and right-hand side f ∈ Rn .
Output: Solution of Ax = f .
1: doall i = 1, . . . , n
αi,i αi,i−1 φi
2: ρi = − αi,i+1 , σi = − αi,i+1 , τi = αi,i+1 .
3: end
4: Compute the products P2⎛= M2 M1⎞, . . . , Pn = Mn · · · M1 using a parallel prefix matrix product
ρi σi τi
algorithm, where Mi = ⎝ 1 0 0 ⎠.
0 0 1
5: Compute ξ1 = −(Pn )1,3 /(Pn )1,1
6: doall i = 2, . . . , n
7: x̂i = Pi x̂1 where x̂1 = (ξ1 , 0, 1) .
8: end
9: Gather the elements of x from {x̂1 , . . . , x̂n }
We first consider the Givens based parallel algorithm from [12, Algorithm I]).
As will become evident, the algorithm shares many features with the parallel LDU
factorization (Algorithm 5.5). Then in Sect. 5.5.5 we revisit the Givens rotation solver
based on Spike partitioning from [12, Algorithm II].
We will assume that the matrix is irreducible and will denote the elementary
2 × 2 Givens rotation submatrix that will be used to eliminate the element in position
(i + 1, i) of the tridiagonal matrix by
ci si
G i+1,i = (5.95)
−si ci .
5.5 Tridiagonal Systems 145
(i) (1)
Let A(0) = A and A(i) = G i+1,i · · · G 2,1 A(0) be the matrix after i = 1, . . . , n − 1
rotation steps so that the subdiagonal elements in positions (2, 1), . . . , (i, i − 1) of
A(i) are 0 and A(n−1) = R is upper triangular. Let
⎛ ⎞
λ1 μ1 ν1
⎜ λ2 μ2 ν2 ⎟
⎜ ⎟
⎜ . .
.. .. . .. ⎟
⎜ ⎟
⎜ ⎟
A(i−1) =⎜
⎜ λ i−1 μ i−1 νi−1
⎟
⎟ (5.96)
⎜ π c α ⎟
⎜ i i−1 i,i+1 ⎟
⎜ α α α ⎟
⎝ i+1,i i+1,i+1 i+1,i+2 ⎠
.. .. ..
. . .
be the result after the subdiagonal entries in rows 2, . . . , i have been annihilated.
Observe that R is banded, with bandwidth 3. If i = n − 1, the process terminates
and the element in position (n, n) is πn ; otherwise, if i < n − 1 then rows i and i + 1
need to be multiplied by a rotation matrix to zero out the subdiagonal element in row
i + 1.
Next, rows i and i + 1 of A(i−1) need to be brought into their final form. The
following relations hold for i = 1, . . . , n − 1 and initial values c0 = 1, s0 = 0:
πi αi+1,i
ci = , si = (5.97)
λi λi
πi = ci−1 αi,i − si−1 ci−2 αi−1,i (5.98)
μi = si αi+1,i+1 + ci ci−1 αi,i+1 (5.99)
νi = si αi+1,i+2 (5.100)
that we write as
ci αi,i−1 αi−1,i
λi = αi,i − ci−1 .
ci−2 λi−1
ci−1
146 5 Banded Linear Systems
Observe the similarity with the nonlinear recurrence (5.92) we encountered in the
steps leading to the parallel LDU factorization Algorithm 5.5. Following the same
approach, we apply the change of variables
τi ci
= λi (5.101)
τi−1 ci−1
θi = αi+1,i
2
θi−1 + τi2 , for i = 1, . . . , n − 1 (5.106)
with initial values θ0 = τ0 = 1. This amounts to a linear recurrence for θi that can
be expressed as the unit lower bidiagonal system
5.5 Tridiagonal Systems 147
⎛ ⎞
⎛ ⎞⎛ ⎞ τ02
1 θ0 ⎜ ⎟
⎜−α2,1
2 1 ⎟⎜ θ1 ⎟ ⎜ τ12 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ −α 2 ⎟⎜ θ2 ⎟ ⎜ τ22 ⎟
⎜ 3,2 1 ⎟⎜ ⎟=⎜ ⎟ (5.107)
⎜ .. .. ⎟⎜ .. ⎟ ⎜ ⎟
⎝ . . ⎠⎝ . ⎠ ⎜ .. ⎟
⎝ . ⎠
−αn,n−1
2 1 θn−1
τn−1
2
On 2(n − 1) processors, one step is needed to form the coefficient matrix and right-
hand side. This lower bidiagonal system can be solved in 2 log n steps on n − 1
processors, (cf. [18, Lemma 1] and Theorem 3.2).
As before, we can also write the recurrence (5.106) as
2
αi+1,i 0
θi 1 = θi−1 1 , θ0 = 1. (5.108)
τi2 1
It is easy to see that all θi ’s can be obtained using a parallel algorithm for the prefix
matrix product. Moreover, the matrices are all nonnegative, and so the componen-
twise relative errors are expected to be small; cf. [68]. This further strengthens the
finding in [12, Lemma 3] that the normwise relative forward error bound in comput-
ing the θi ’s only grows linearly with n.
Using the values {θi , τi } the following values can be computed in 3 parallel steps
for i = 0, . . . , n − 1:
τi2 2 θi αi+1,i
2
ci2 = , λi = , si2 = . (5.109)
θi θi−1 λi2
From these, the values μi , νi along the superdiagonals of R = A(n−1) can be obtained
in 3 parallel steps on 2n processors.
The solution of a system using QR factorization also requires the multiplication of
the right-hand side vector by the matrix Q that is the product of the n − 1 associated
Givens rotations. By inspection, it can be easily seen that this product is lower
Hessenberg; cf. [73]. An observation with practical implications made in [12] that is
not difficult is that Q can be expressed as Q = W LY + S J [12], where L is the
lower triangular matrix with all nonzero elements equal to 1, J = (e2 , . . . , en , 0)
and W = diag(ω1 , . . . , ωn ), Y = diag(η1 , . . . , ηn ) and S = diag(s1 , . . . , sn−1 , 0),
where the {ci , si } are as before and
ci−1
ωi = ci ρi−1 , ηi = , i = 1, . . . , n, (5.110)
ρi−1
i
and ρi = (−1)i s j , i = 1, . . . , n − 1 with ρ0 = c0 = cn = 1.
j=1
148 5 Banded Linear Systems
The elements ρi are obtained from a parallel prefix product in log n steps using
n−2
2 processors. It follows that multiplication by Q can be decomposed into five
easy steps involving arithmetic, 3 being multiplications of a vector with the diagonal
matrices W, Y and S, and a computation of all partial sums of a vector (the effect of
L). The latter is a prefix sum and can be accomplished in log n parallel steps using
n processors; see for instance [74].
Algorithm 5.7, called pargiv, incorporates all these steps to solve the tridiagonal
system using the parallel generation and application of Givens rotations.
The leading cost of pargiv is 9 log n parallel operations: There are 2 log n in
each of stages I , II because of the special triangular systems that need to be solved,
2 log n in Stage IV (line 4) because of the special structure of the factors W, L , Y, S, J
and finally, another 3 log n in stage IV (line 6) also because of the banded trangular
system. Stage III contributes only a constant to the cost.
From the preceding discussion the following theorem holds:
Theorem 5.6 ([12, Theorem 3.1, Lemma 3.1]) Let A be a nonsingular irreducible
tridiagonal matrix of order n. Algorithm pargiv for solving Ax = f based on the
orthogonal factorization QA = R constructed by means of Givens rotations takes
T p = 9 log n + O(1) parallel steps on 3n processors. The resulting speedup is
S p = O( logn n ) and the efficiency E p = O( log1 n ). Specifically, the speedup over
8n
the algorithm implemented on a single processor is approximately 3 log n and over
Gaussian elimination with partial pivoting approximately Moreover, if B ∈ 4n
3 log n .
R , then the systems AX = B can be solved in the same number of parallel
n×k
is required in using the method in the general case. Another difficulty is that the
computation of the diagonal elements of W and Y in Stage I V can lead to underflow
and overflow. Both of the above become less severe when n is small. We next describe
a partitioning approach for circumventing these problems.
with all the diagonal submatrices being square. Furthermore, to render the exposition
simpler we assume that p divides n and that each Ai,i is of size, m = n/ p. Under these
assumptions, the Ai,i ’s are tridiagonal, the subdiagonal blocks A2,1 , . . . , A p, p−1 are
multiples of e1 em and each superdiagonal block A , . . . , A
1,2 p−1, p is a multiple of
em e1 .
The algorithm we describe here uses the Spike methodology based on a Givens-
QR tridiagonal solver applied to each subsystem. In fact, the first version of this
method was proposed in [12] as a way to restrict the size of systems on which pargiv
(Algorithm 5.7) is applied and thus prevent the forward error from growing too large.
This partitioning for stabilization was inspired by work on marching methods; cf. [12,
p. 87] and Sect. 6.4.7 of Chap. 6. Today, of course, partitioning is considered primarily
as a method for adding an extra level of parallelism and greater opportunities for
exploiting the underlying architecture. Our Spike algorithm, however, does not have
to use pargiv to solve each subsystem. It can be built, for example, using a sequential
method for each subsystem.
Key to the method are the following two results that restrict the rank deficiency
of tridiagonal systems.
Proposition 5.5 Let the nonsingular tridiagonal matrix A be of order n, where
n = pm and let it be partitioned into a block-tridiagonal matrix of p blocks of order
m each in the diagonal with rank-1 off-diagonal blocks. Then each of the tridiagonal
submatrices along the diagonal and in block positions 2 up to p − 1 have rank at
least m − 2 and the first and last submatrices have rank at least m − 1. Moreover, if
A is irreducible, then the rank of each diagonal submatrix is at least m − 1.
The proof is omitted; cf. related results in [12, 80]. Therefore, the rank of each
diagonal block of an irreducible tridiagonal matrix is at least n −1. With this property
in mind, in infinite precision at least, we can correct rank deficiencies by only adding
a rank-1 matrix. Another result will also be useful.
Proposition 5.6 ([12, Lemma 3.2]) Let the order n irreducible tridiagonal matrix
A be singular. Then its rank is exactly n − 1 and the last row of the triangular factor
R in any Q R decomposition of A will be zero, that is πn = 0. Moreover, if this
is computed via the sequence of Givens rotations (5.95), the last element satisfies
cn−1 = 0.
The method, we call SP_Givens, proceeds as follows. Given the partitioned sys-
tem (5.111), each of the submatrices Ai,i along the diagonal is reduced to upper
triangular form using Givens transformations. This can be done, for example, by
means of Algorithm 5.7 (pargiv).
5.5 Tridiagonal Systems 151
In the course of this reduction, we obtain (in implicit form) the orthogonal matrices
Q 1 , . . . , Q p such that Q i Ai,i = Ri , where Ri is upper triangular. Applying the same
transformations on the right-hand side, the original system becomes
⎛ ⎞⎛ ⎞ ⎛ ⎞
R1 B2 x1 f˜1
⎜C 2 R2 B3 ⎟ ⎜ x2 ⎟ ⎜ f˜2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ .. .. .. ⎟ ⎜ . ⎟=⎜ . ⎟ (5.112)
⎝ . . . ⎠ ⎝ .. ⎠ ⎝ .. ⎠
Cp Rp xp f˜p
where
Q i = Wi LYi + Si J
where the elements of the diagonal matrices Wi , Yi , Si , see (5.109) and (5.110)
correspond to the values obtained in the course of the upper triangularization of Ai,i
by Givens rotations. Recalling that Ai,i−1 = α(i−1)m+1,(i−1)m e1 em and A
i,i+1 =
αim,im+1 em e1 , then Ci and Bi+1 are given by:
= α(i−1)m+1,(i−1)m w(i) em
, where w(i) = diag(Wi ),
(i)
(note that J e1 = 0 and e1 Yi e1 = η1 = 1 from (5.110)) and
(i)
From relations (5.110), and the fact that cm = 1, it holds that
(i) (i)
Bi+1 = αim,im+1 (cm−1 em + sm−1 em−1 )e1 . (5.113)
Note that the only nonzero terms of each Ci = Q i Ai,i−1 are in the last column
whereas the only nonzero elements of Bi+1 = Q i Ai,i+1 are the last two elements of
the first column.
Consider now any block row of the above partition. If all diagonal blocks Ri
are invertible, then we proceed exactly as with the Spike algorithm. Specifically, for
152 5 Banded Linear Systems
each block row, consisting of (Ci , Ri , Bi+1 ) and the right-hand side f˜i , the following
(3 p − 2) independent triangular systems can be solved simultaneously:
Note that we need to solve upper triangular systems with 2 or 3 right-hand sides
to generate the spikes as well as update the right-hand side systems.
We next consider the case when one or more of the tridiagonal blocks Ai,i is
singular. Then it is possible to modify à so that the triangular matrices above are
made invertible. In particular, since the tridiagonal matrix A is invertible, so wiil
be the coefficient matrix à in (5.112). When one or more of the of the submatrices
Ai,i is singular, then so will be the corresponding submatrices Ri and, according to
Proposition 5.6, this would manifest itself with a zero value appearing at the lower
right corner of the corresponding Ri ’s. To handle this situation, the algorithm applies
multiple boostings to shift away from zero the values at the corners of the triangular
blocks Ri along the diagonal.) As we will see, we do not have to do this for the
last block R p . Moreover, all these boostings are independent and can be applied in
parallel. This step for blocks other than the last one is represented as a multiplication
of à with a matrix, Pboost , of the form
p−1
Pboost = In + ζi eim+1 eim ,
i=1
1 if |(Ri )m,m | < threshold
where ζi =
0 otherwise.
Observe that the right-hand side is nonzero since in case (Rk )m,m = 0, the term
immediately to its right (that is in position (km, km + 1)) is nonzero. In this way,
matrix à Pboost has all diagonal blocks nonsingular except possibly the last one. Call
these (modified in case of singularity, unmodified otherwise) diagonal blocks R̃i and
set the bock diagonal matrix R̃ = diag[ R̃1 , . . . , R̃ p−1 , R̃ p ]. If A p, p is found to be
singular, then R̃ p = diag[ R̂ p , 1], where R̂ p is the leading principal submatrix of R p
of order m − 1 which will be nonsingular in case R p was found to be singular. Thus,
as constructed, R̃ is nonsingular.
5.5 Tridiagonal Systems 153
p−1
−1
Pboost = In − ζi eim+1 eim .
i=1
where the matrix à Pboost is block-tridiagonal with all its p diagonal blocks upper
triangular and invertible with the possible exception of the last one, which might
contain a zero in the last diagonal element. It is also worth noting that the above
amounts to the Spike DS factorization of a modified matrix, in particular
A Pboost = D̃ S̃ (5.116)
where T is tridiagonal and nonsingular. The reduced system T x̂1 = fˆ1 is solved
first. Note that even when its order is 2 p − 1 and the last diagonal element of T is
zero, the reduced system is invertible.
Finally, once x̂1 is computed, the remaining unknowns x̂2 are easily computed in
parallel.
We next consider the solution of the reduced system. This is essentially tridiagonal
and thus a hybrid scheme can be used: That is, apply the same methodology and use
partitioning and DS factorization producing a new reduced system, until the system
is small enough to switch to a parallel method not based on partitioning, e.g. pargiv,
and later to a serial method.
The idea of boosting was originally described in [81] as a way to reduce row
interchanges and preserve sparsity in the course of Gaussian elimination. Boosting
was used in sparse solvers including Spike; cf. [16, 82]. The mechanism consists
of adding a suitable value whenever a small diagonal element appears in the pivot
position so as to avoid instability. This is equivalent to a rank-1 modification of the
matrix. For instance, if before the first step of LU on a matrix A, element α1,1 were
found to be small enough, then A would be replaced by A + γ1 e1 e1 with γ chosen
154 5 Banded Linear Systems
which amounts to right preconditioning with a low rank modification of the identity
and this as well as the inverse transformation can be applied without solving auxiliary
systems.
We note that the method is appropriate when the partitioning results in blocks
along the diagonal whose singularities are revealed by the Givens QR factorization.
In this case, the blocks can be rendered nonsingular by rank-1 modifications. It will
not be effective, however, if some blocks are almost, but not exactly singular, and
QR is not rank revealing without extra precautions, such as pivoting. See also [83]
for more details regarding this algorithm and its implementation on GPUs.
Another partition-based method that is also applicable to general banded matrices
and does not necessitate that the diagonal submatrices are invertible is proposed
in [80]. The algorithm uses row and possibly column pivoting when factoring the
diagonal blocks so as to prevent loss of stability. The linear system is partitioned as
in [5], that is differently from the Spike.
Factorization based on block diagonal pivoting without interchanges
If the tridiagonal matrix is symmetric and all subsystems Ai,i are nonsingular, then
it is possible to avoid interchanges by computing an L B L factorization, where L
is unit lower triangular and B is block-diagonal, with 1 × 1 and 2 × 2 diagonal
blocks. This method avoids the loss of symmetry caused by partial pivoting when
the matrix is indefinite; cf. [84, 85] as well as [45]. This strategy can be extended
for nonsymmetric systems computing an L B M factorization with M also unit
lower triangular; cf. [86, 87]. These methods become attractive in the context of
high performance systems where interchanges can be detrimental to performance. In
[33] diagonal pivoting of this type was combined with Spike as well as “on the fly”
recursive Spike (cf. Sect. 5.2.2 of this chapter) to solve large tridiagonal systems on
GPUs, clusters of GPUs and systems consisting of CPUs and GPUs. We next briefly
describe the aforementioned L B M factorization.
The first step of the algorithm illustrates the basic idea. The crucial observation is
that if a tridiagonal matrix, say A, is nonsingular, then it must hold that either α1,1 or
the determinant of its 2 × 2 leading principal submatrix, that is α1,1 α2,2 − α1,2 α2,1
must be nonzero. Therefore, if we partition the matrix as
5.5 Tridiagonal Systems 155
A1,1 A1,2
A= ,
A2,1 A2,2
Moreover, the corresponding Schur complement, Sd = A1,1 − A2,1 A−1 1,1 A1,2 is tridi-
agonal and nonsingular and so the same strategy can be recursively applied until the
final factorization is computed. In finite precision, instead of testing √
if any of these val-
ues are exactly 0, a different approach is used. In this, the root μ = ( 5−1)/2 ≈ 0.62
of the quadratic equation μ2 + μ − 1 = 0 plays an important role. Two pivoting
strategies proposed in the literature are as follows. If the largest element, say τ , of
A is known, then select d = 1 if
and apply the same strategy, (5.118). From the error analysis conducted in [86], it is
claimed that the method is backward stable.
It is one of the basic tenets of numerical linear algebra that determinants and Cramer’s
rule are not to be used for solving general linear systems. When the matrix is tridi-
agonal, however, determinental equalities can be used to show that the inverse has
a special form. This can be used to design methods for computing elements of the
inverse, and for solving linear systems. These methods are particularly useful when
we want to compute one or few selected elements of the inverse or the solution,
something that is not possible with standard direct methods. In fact, there are several
applications in science and engineering that require such selective computations, e.g.
see some of the examples in [88].
156 5 Banded Linear Systems
It has long been known, for example, that the inverse of symmetric irreducible
tridiagonal matrices, has special structure; reference [89] provides several remarks
on the history of this topic, and the relevant discovery made in [90]. An algorithm
for inverting such matrices was presented in [91]. Inspired by that work, a paral-
lel algorithm for computing selected elements of the solution or the inverse that is
applicable for general tridiagonal matrices was introduced in [67]. A similar algo-
rithm for solving tridiagonal linear systems was presented in [92]. We briefly outline
the basic idea of these methods.
For the purpose of this discussion we recall the notation A = [αi,i−1 , αi,i , αi,i+1 ].
As before we assume that it is nonsingular and irreducible. Let
A−1 = [τi, j ] i, j = 1, 2, . . . , n,
then
τi, j = (−1)i+ j det(A( j, i))/det(A)
where Ai−1 and à j+1 are tridiagonal of order i −1 and n− j respectively, S j−i is of or-
der j −i and is lower triangular with diagonal elements αi,i+1 , αi+1,i+2 , . . . , α j−1, j .
(i−1)
Moreover, Z 1 is of size ( j − i) × (i − 1) with first row αi,i−1 (ei−1 ) and Z 2 of
( j−i)
order (n − j) × ( j − i) with first row α j+1, j (e j−i ) Hence, for i ≤ j,
j−1
det(Ai−1 )det( Ã j+1 ) k=i αk,k+1
τi, j = (−1) i+ j
.
det(A)
This shows that the lower (resp. upper) triangular part of the inverse of an irreducible
tridiagonal matrix is the lower (resp. upper) triangular part of a rank-1 matrix. The
respective rank-1 matrices are generated by the vectors u = (υ1 , . . . , υn ) v =
(ν1 , . . . , νn ) and w = (ω1 , . . . , ωn ) . Moreover, to compute one element of the
inverse we only need a single element from each of u, v and w. Obviously, if the
vectors u, v and w are known, then we can compute any of the elements of the inverse
and with little extra work, any element of the solution. Fortunately, these vectors can
be computed independently from the solution of three lower triangular systems of
very small bandwidth and very special right-hand sides, using, for example, the
algorithm of Theorem 3.2. Let us see how to do this.
Let dk = det(Ak ). Observing that in block form
(k−1)
Ak−1 αk−1,k ek−1
Ak = (k−1) ,
αk,k−1 (ek−1 ) αk,k
Thus,
υi = −αi−1, i−1 υi−1 − αi−2, i−1 υi−2 i = 2, 3, . . . , n
in which
α̂i = αi,i /αi+1,i
β̂i = αi,i+1 /αi+2,i+1
υ0 = 0, υ1 = −1.
(n)
L 1 u = e1 (5.119)
where ⎛ ⎞
1
⎜ α̂1 1 ⎟
⎜ ⎟
⎜ β̂1 α̂2 1 ⎟
L1 = ⎜ ⎟.
⎜ . . . . . . ⎟
⎝ . . . ⎠
β̂n−2 α̂n−1 1
158 5 Banded Linear Systems
We next follow a similar approach considering instead the trailing tridiagonal sub-
matrix
αk,k αk,k+1 (e1(n−k) )
Ãk = (n−k) .
αk+1,k e1 Ãk+1
where,
γ̂ j = α j, j /α j−1, j ,
δ̂ j = α j, j−1 /α j−2, j−1 , and
νn+1 = 0, νn = (−1)n .
where ⎛ ⎞
1
⎜ γ̂n 1 ⎟
⎜ ⎟
⎜ δ̂n γ̂n−1 1 ⎟
L2 = ⎜ ⎟.
⎜ . .. .. .. ⎟
. .
⎝ ⎠
δ̂2 γ̂1 1
Finally, from
n−1
i
ωi = αk,k+1 αk,k−1 /det(T )
k=i k=2
we have
ωk+1 = θk+1 ωk k = 1, 2, . . . , n − 1
where,
θi = αi,i−1 /αi,i+1
ω1 = (α1,2 α2,3 · · · αk,k+1 )/det(T ) = (1/ν0 ).
5.5 Tridiagonal Systems 159
(n)
L 3 w = (1/ν0 )e1 (5.121)
where, ⎛ ⎞
1
⎜ −θ2 1 ⎟
⎜ ⎟
L3 = ⎜ .
⎝ . .. .. ⎟
. ⎠
−θn 1
L 1 u = e1(n) ,
(n+1)
L 2 v = (−1)n e1 , and
(n)
L 3 w̃ = e1
References
1. Arbenz, P., Hegland, M.: On the stable parallel solution of general narrow banded linear systems.
High Perform. Algorithms Struct. Matrix Probl. 47–73 (1998)
2. Arbenz, P., Cleary, A., Dongarra, J., Hegland, M.: A comparison of parallel solvers for general
narrow banded linear systems. Parallel Distrib. Comput. Pract. 2(4), 385–400 (1999)
3. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). URL http://www.netlib.org/scalapack
4. Conroy, J.: Parallel algorithms for the solution of narrow banded systems. Appl. Numer. Math.
5, 409–421 (1989)
5. Dongarra, J., Johnsson, L.: Solving banded systems on a parallel processor. Parallel Comput.
5(1–2), 219–246 (1987)
160 5 Banded Linear Systems
6. George, A.: Numerical experiments using dissection methods to solve n by n grid problems.
SIAM J. Numer. Anal. 14, 161–179 (1977)
7. Golub, G., Sameh, A., Sarin, V.: A parallel balance scheme for banded linear systems. Numer.
Linear Algebra Appl. 8, 297–316 (2001)
8. Johnsson, S.: Solving narrow banded systems on ensemble architectures. ACM Trans. Math.
Softw. 11, 271–288 (1985)
9. Meier, U.: A parallel partition method for solving banded systems of linear equations. Parallel
Comput. 2, 33–43 (1985)
10. Tang, W.: Generalized Schwarz splittings. SIAM J. Sci. Stat. Comput. 13, 573–595 (1992)
11. Wright, S.: Parallel algorithms for banded linear systems. SIAM J. Sci. Stat. Comput. 12,
824–842 (1991)
12. Sameh, A., Kuck, D.: On stable parallel linear system solvers. J. Assoc. Comput. Mach. 25(1),
81–91 (1978)
13. Dongarra, J.J., Sameh, A.: On some parallel banded system solvers. Technical Report
ANL/MCS-TM-27, Mathematics Computer Science Division at Argonne National Labora-
tory (1984)
14. Gallivan, K., Gallopoulos, E., Sameh, A.: CEDAR—an experiment in parallel computing.
Comput. Math. Appl. 1(1), 77–98 (1994)
15. Lawrie, D.H., Sameh, A.: The computation and communication complexity of a parallel banded
system solver. ACM TOMS 10(2), 185–195 (1984)
16. Polizzi, E., Sameh, A.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel
Comput. 32, 177–194 (2006)
17. Polizzi, E., Sameh, A.: SPIKE: a parallel environment for solving banded linear systems.
Compon. Fluids 36, 113–120 (2007)
18. Sameh, A., Kuck, D.: A parallel QR algorithm for symmetric tridiagonal matrices. IEEE Trans.
Comput. 26(2), 147–153 (1977)
19. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
20. Demko, S., Moss, W., Smith, P.: Decay rates for inverses of band matrices. Math. Comput.
43(168), 491–499 (1984)
21. Björck, Å.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
22. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins. University Press,
Baltimore (2013)
23. Davis, T.: Algorithm 915, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse
QR factorization. ACM Trans. Math. Softw. 38(1), 8:1–8:22 (2011). doi:10.1145/2049662.
2049670, URL http://doi.acm.org/10.1145/2049662.2049670
24. Lou, G.: Parallel methods for solving linear systems via overlapping decompositions. Ph.D.
thesis, University of Illinois at Urbana-Champaign (1989)
25. Naumov, M., Sameh, A.: A tearing-based hybrid parallel banded linear system solver. J. Com-
put. Appl. Math. 226, 306–318 (2009)
26. Benzi, M., Golub, G., Liesen, J.: Numerical solution of saddle-point problems. Acta Numer.
1–137 (2005)
27. Hockney, R., Jesshope, C.: Parallel Computers. Adam Hilger (1983)
28. Ortega, J.M.: Introduction to Parallel and Vector Solution of Linear Systems. Plenum Press,
New York (1988)
29. Golub, G., Ortega, J.: Scientific Computing: An Introduction with Parallel Computing. Acad-
emic Press Inc., San Diego (1993)
30. Davidson, A., Zhang, Y., Owens, J.: An auto-tuned method for solving large tridiagonal systems
on the GPU. In: Proceedings of IEEE IPDPS, pp. 956–965 (2011)
31. Lopez, J., Zapata, E.: Unified architecture for divide and conquer based tridiagonal system
solvers. IEEE Trans. Comput. 43(12), 1413–1425 (1994). doi:10.1109/12.338101
32. Santos, E.: Optimal and efficient parallel tridiagonal solvers using direct methods. J. Super-
comput. 30(2), 97–115 (2004). doi:10.1023/B:SUPE.0000040615.60545.c6, URL http://dx.
doi.org/10.1023/B:SUPE.0000040615.60545.c6
References 161
33. Chang, L.W., Stratton, J., Kim, H., Hwu, W.M.: A scalable, numerically stable, high-
performance tridiagonal solver using GPUs. In: Proceedings International Conference High
Performance Computing, Networking Storage and Analysis, SC’12, pp. 27:1–27:11. IEEE
Computer Society Press, Los Alamitos (2012). URL http://dl.acm.org/citation.cfm?id=
2388996.2389033
34. Goeddeke, D., Strzodka, R.: Cyclic reduction tridiagonal solvers on GPUs applied to mixed-
precision multigrid. IEEE Trans. Parallel Distrib. Syst. 22(1), 22–32 (2011)
35. Codenotti, B., Leoncini, M.: Parallel Complexity of Linear System Solution. World Scientific,
Singapore (1991)
36. Ascher, U., Mattheij, R., Russell, R.: Numerical Solution of Boundary Value Problems for
Ordinary Differential Equations. Classics in Applied Mathematics. SIAM, Philadelphia (1995)
37. Isaacson, E., Keller, H.B.: Analysis of Numerical Methods. Wiley, New York (1966)
38. Keller, H.B.: Numerical Methods for Two-Point Boundary-Value Problems. Dover Publica-
tions, New York (1992)
39. Bank, R.E.: Marching algorithms and block Gaussian elimination. In: Bunch, J.R., Rose, D.
(eds.) Sparse Matrix Computations, pp. 293–307. Academic Press, New York (1976)
40. Bank, R.E., Rose, D.: Marching algorithms for elliptic boundary value problems. I: the constant
coefficient case. SIAM J. Numer. Anal. 14(5), 792–829 (1977)
41. Roache, P.: Elliptic Marching Methods and Domain Decomposition. CRC Press Inc., Boca
Raton (1995)
42. Richardson, L.F.: Weather Prediction by Numerical Process. Cambridge University Press.
Reprinted by Dover Publications, 1965 (1922)
43. Arbenz, P., Hegland, M.: The stable parallel solution of narrow banded linear systems. In: Heath,
M., et al. (eds.) Proceedings of Eighth SIAM Conference Parallel Processing and Scientific
Computing SIAM, Philadelphia (1997)
44. Bank, R.E., Rose, D.: Marching algorithms for elliptic boundary value problems. II: the variable
coefficient case. SIAM J. Numer. Anal. 14(5), 950–969 (1977)
45. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
46. Higham, N.: Stability of parallel triangular system solvers. SIAM J. Sci. Comput. 16(2), 400–
413 (1995)
47. Viswanath, D., Trefethen, L.: Condition numbers of random triangular matrices. SIAM J.
Matrix Anal. Appl. 19(2), 564–581 (1998)
48. Hockney, R.: A fast direct solution of Poisson’s equation using Fourier analysis. J. Assoc.
Comput. Mach. 12, 95–113 (1965)
49. Gander, W., Golub, G.H.: Cyclic reduction: history and applications. In: Luk, F., Plemmons, R.
(eds.) Proceedings of the Workshop on Scientific Computing, pp. 73–85. Springer, New York
(1997). URL http://people.inf.ethz.ch/gander/papers/cyclic.pdf
50. Amodio, P., Brugnano, L.: Parallel factorizations and parallel solvers for tridiagonal linear
systems. Linear Algebra Appl. 172, 347–364 (1992). doi:10.1016/0024-3795(92)90034-8,
URL http://www.sciencedirect.com/science/article/pii/0024379592900348
51. Heller, D.: Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems.
SIAM J. Numer. Anal. 13(4), 484–496 (1976)
52. Lambiotte Jr, J., Voigt, R.: The solution of tridiagonal linear systems on the CDC STAR 100
computer. ACM Trans. Math. Softw. 1(4), 308–329 (1975). doi:10.1145/355656.355658, URL
http://doi.acm.org/10.1145/355656.355658
53. Nassimi, D., Sahni, S.: An optimal routing algorithm for mesh-connected parallel computers.
J. Assoc. Comput. Mach. 27(1), 6–29 (1980)
54. Nassimi, D., Sahni, S.: Parallel permutation and sorting algorithms and a new generalized
connection network. J. Assoc. Comput. Mach. 29(3), 642–667 (1982)
55. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10(2),
345–363 (1973). URL http://www.jstor.org/stable/2156361
56. Amodio, P., Brugnano, L., Politi, T.: Parallel factorization for tridiagonal matrices. SIAM J.
Numer. Anal. 30(3), 813–823 (1993)
162 5 Banded Linear Systems
57. Johnsson, S.: Solving tridiagonal systems on ensemble architectures. SIAM J. Sci. Stat. Com-
put. 8, 354–392 (1987)
58. Zhang, Y., Cohen, J., Owens, J.: Fast tridiagonal solvers on the GPU. ACM SIGPLAN Not.
45(5), 127–136 (2010)
59. Amodio, P., Mazzia, F.: Backward error analysis of cyclic reduction for the solution of tridi-
agonal systems. Math. Comput. 62(206), 601–617 (1994)
60. Higham, N.: Bounding the error in Gaussian elimination for tridiagonal systems. SIAM J.
Matrix Anal. Appl. 11(4), 521–530 (1990)
61. Zhang, Y., Owens, J.: A quantitative performance analysis model for GPU architectures. In:
Proceedings of the 17th IEEE International Symposium on High-Performance Computer Ar-
chitecture (HPCA 17) (2011)
62. El-Mikkawy, M., Sogabe, T.: A new family of k-Fibonacci numbers. Appl. Math. Com-
put. 215(12), 4456–4461 (2010). URL http://www.sciencedirect.com/science/article/pii/
S009630031000007X
63. Fang, H.R., O’Leary, D.: Stable factorizations of symmetric tridiagonal and triadic matrices.
SIAM J. Math. Anal. Appl. 28(2), 576–595 (2006)
64. Mikkelsen, C., Kågström, B.: Parallel solution of narrow banded diagonally dominant linear
systems. In: Jónasson, L. (ed.) PARA 2010. LNCS, vol. 7134, pp. 280–290. Springer (2012).
doi:10.1007/978-3-642-28145-7_28, URL http://dx.doi.org/10.1007/978-3-642-28145-7_28
65. Mikkelsen, C., Kågström, B.: Approximate incomplete cyclic reduction for systems which are
tridiagonal and strictly diagonally dominant by rows. In: Manninen, P., Öster, P. (eds.) PARA
2012. LNCS, vol. 7782, pp. 250–264. Springer (2013). doi:10.1007/978-3-642-36803-5_18,
URL http://dx.doi.org/10.1007/978-3-642-36803-5_18
66. Bini, D., Meini, B.: The cyclic reduction algorithm: from Poisson equation to stochastic
processes and beyond. Numer. Algorithms 51(1), 23–60 (2008). doi:10.1007/s11075-008-
9253-0, URL http://www.springerlink.com/index/10.1007/s11075-008-9253-0; http://www.
springerlink.com/content/m40t072h273w8841/fulltext.pdf
67. Sameh, A.: Numerical parallel algorithms—a survey. In: Kuck, D., Lawrie, D., Sameh, A.
(eds.) High Speed Computer and Algorithm Optimization, pp. 207–228. Academic Press, Sans
Diego (1977)
68. Mathias, R.: The instability of parallel prefix matrix multiplication. SIAM J. Sci. Comput.
16(4) (1995), to appear
69. Eğecioğlu, O., Koç, C., Laub, A.: A recursive doubling algorithm for solution of tridiagonal
systems on hypercube multiprocessors. J. Comput. Appl. Math. 27, 95–108 (1989)
70. Dubois, P., Rodrigue, G.: An analysis of the recursive doubling algorithm. In: Kuck, D., Lawrie,
D., Sameh, A. (eds.) High Speed Computer and Algorithm Organization, pp. 299–305. Acad-
emic Press, San Diego (1977)
71. Hammarling, S.: A survey of numerical aspects of plane rotations. Report Maths. 1, Middlesex
Polytechnic (1977). URL http://eprints.ma.man.ac.uk/1122/. Available as Manchester Institute
for Mathematical Sciences MIMS EPrint 2008.69
72. Bar-On, I., Codenotti, B.: A fast and stable parallel QR algorithm for symmetric tridiagonal
matrices. Linear Algebra Appl. 220, 63–95 (1995). doi:10.1016/0024-3795(93)00360-C, URL
http://www.sciencedirect.com/science/article/pii/002437959300360C
73. Gill, P.E., Golub, G., Murray, W., Saunders, M.: Methods for modifying matrix factorizations.
Math. Comput. 28, 505–535 (1974)
74. Lakshmivarahan, S., Dhall, S.: Parallelism in the Prefix Problem. Oxford University Press,
New York (1994)
75. Cleary, A., Dongarra, J.: Implementation in ScaLAPACK of divide and conquer algorithms
for banded and tridiagonal linear systems. Technical Report UT-CS-97-358, University of
Tennessee Computer Science Technical Report (1997)
76. Bar-On, I., Codenotti, B., Leoncini, M.: Checking robust nonsingularity of tridiagonal matrices
in linear time. BIT Numer. Math. 36(2), 206–220 (1996). doi:10.1007/BF01731979, URL
http://dx.doi.org/10.1007/BF01731979
References 163
77. Bar-On, I.: Checking non-singularity of tridiagonal matrices. Electron. J. Linear Algebra 6,
11–19 (1999). URL http://math.technion.ac.il/iic/ela
78. Bondeli, S.: Divide and conquer: a parallel algorithm for the solution of a tridiagonal system
of equations. Parallel Comput. 17, 419–434 (1991)
79. Wang, H.: A parallel method for tridiagonal equations. ACM Trans. Math. Softw. 7, 170–183
(1981)
80. Wright, S.: Parallel algorithms for banded linear systems. SIAM J. Sci. Stat. Comput. 12(4),
824–842 (1991)
81. Stewart, G.: Modifying pivot elements in Gaussian elimination. Math. Comput. 28(126), 537–
542 (1974)
82. Li, X., Demmel, J.: SuperLU-DIST: A scalable distributed-memory sparse direct solver for
unsymmetric linear systems. ACM TOMS 29(2), 110–140 (2003). URL http://doi.acm.org/10.
1145/779359.779361
83. Venetis, I.E., Kouris, A., Sobczyk, A., Gallopoulos, E., Sameh, A.: A direct tridiagonal solver
based on Givens rotations for GPU-based architectures. Technical Report HPCLAB-SCG-
06/11-14, CEID, University of Patras (2014)
84. Bunch, J.: Partial pivoting strategies for symmetric matrices. SIAM J. Numer. Anal. 11(3),
521–528 (1974)
85. Bunch, J., Kaufman, K.: Some stable methods for calculating inertia and solving symmetric
linear systems. Math. Comput. 31, 162–179 (1977)
86. Erway, J., Marcia, R.: A backward stability analysis of diagonal pivoting methods for solv-
ing unsymmetric tridiagonal systems without interchanges. Numer. Linear Algebra Appl. 18,
41–54 (2011). doi:10.1002/nla.674, URL http://dx.doi.org/10.1002/nla.674
87. Erway, J.B., Marcia, R.F., Tyson, J.: Generalized diagonal pivoting methods for tridiagonal
systems without interchanges. IAENG Int. J. Appl. Math. 4(40), 269–275 (2010)
88. Golub, G.H., Meurant, G.: Matrices, Moments and Quadrature with Applications. Princeton
University Press, Princeton (2009)
89. Vandebril, R., Van Barel, M., Mastronardi, N.: Matrix Computations and Semiseparable
Matrices. Volume I: Linear Systems. Johns Hopkins University Press (2008)
90. Gantmacher, F., Krein, M.: Sur les matrices oscillatoires et complèments non négatives. Com-
position Mathematica 4, 445–476 (1937)
91. Bukhberger, B., Emelyneko, G.: Methods of inverting tridiagonal matrices. USSR Comput.
Math. Math. Phys. 13, 10–20 (1973)
92. Swarztrauber, P.N.: A parallel algorithm for solving general tridiagonal equations. Math. Com-
put. 33, 185–199 (1979)
93. Yamamoto, T., Ikebe, Y.: Inversion of band matrices. Linear Algebra Appl. 24, 105–111 (1979).
doi:10.1016/0024-3795(79)90151-4, URL http://www.sciencedirect.com/science/article/pii/
0024379579901514
94. Strang, G., Nguyen, T.: The interplay of ranks of submatrices. SIAM Rev. 46(4), 637–646
(2004). URL http://www.jstor.org/stable/20453569
Chapter 6
Special Linear Systems
One key idea when attempting to build algorithms for large scale matrix problems
is to detect if the matrix has special properties, possibly due to the characteristics
of the application, that could be taken into account in order to design faster solution
methods. This possibility was highlighted early on by Turing himself, when he noted
in his report for the Automatic Computing Engine (ACE) that even though with the
storage capacities available at that time it would be hard to store and handle systems
larger than 50 × 50,
... the majority of problems have very degenerate matrices and we do not need to store
anything like as much (...) the coefficients in these equations are very systematic and mostly
zero. [1].
The special systems discussed in this chapter encompass those that Turing char-
acterized as “degenerate” in that they can be represented and stored much more
economically than general matrices as their entries are systematic (in ways that will
be made precise later), and frequently, most are zero. Because the matrices can be
represented with fewer parameters, they are also termed structured [2] or data sparse.
In this chapter, we are concerned with the solution of linear systems with methods
that are designed to exploit the matrix structure. In particular, we show the opportuni-
ties for parallel processing when solving linear systems with Vandermonde matrices,
banded Toeplitz matrices, a class of matrices that are called SAS-decomposable, and
special matrices that arise when solving elliptic partial differential equations which
are amenable to the application of fast direct methods, commonly referred as rapid
elliptic solvers (RES).
Observe that to some degree, getting high speedup and efficiency out of parallel
algorithms for matrices with special structure is more challenging than for general
ones since the gains are measured vis-a-vis serial solvers of reduced complexity.
It is also worth noting that in some cases, the matrix structure is not known a
priori or is hidden and it becomes necessary to convert the matrix into a special
representation permitting the construction of fast algorithms; see for example [3–5].
This can be a delicate task because arithmetic and data representation are in finite
precision.
© Springer Science+Business Media Dordrecht 2016 165
E. Gallopoulos et al., Parallelism in Matrix Computations,
Scientific Computation, DOI 10.1007/978-94-017-7188-7_6
166 6 Special Linear Systems
Another type of structure that is present in the Vandermonde and Toeplitz matrices
is that they have small displacement rank, a property introduced in [6] to characterize
matrices for which it is possible to construct low complexity algorithms; cf. [7].
What this means is that if A is the matrix under consideration, there exist lower
triangular matrices P, Q such that the rank of either A − PAQ or PA − AQ (called
displacements) is small. A similar notion, of block displacement, exists for block
matrices. For this reason, such matrices are also characterized as “low displacement”.
Finally note that even if a matrix is only approximately but not exactly structured,
that is it can be expressed as A = S + E, where S structured and E is nonzero but
small, in some sense, (e.g. has small rank or small norm), this can be valuable because
then the corresponding structured matrix S could be an effective preconditioner in
an iterative scheme.
A detailed treatment of structured matrices (in particular structured rank matrices,
that is matrices for which any submatrix that lies entirely below or above the main
diagonal has rank that is bounded above by some fixed value smaller than its size)
can be found in [8]. See also [9, 10] regarding data sparse matrices.
We recall that Vandermonde matrices are determined from one vector, say x =
(ξ1 , . . . , ξn ) , as ⎛ ⎞
1 1 ··· 1
⎜ ξ1 ξ2 · · · ξn ⎟
⎜ ⎟
Vm (x) = ⎜ .. .. .. ⎟ ,
⎝ . . . ⎠
ξ1m−1 ξ2m−1 · · · ξnm−1
where m indicates the number of rows and the number of columns is the size of x.
When the underlying vector or the row dimension are implied by the context, the
symbols are omitted.
If V (x) is a square Vandermonde matrix of order n and Q = diag(ξ1 , . . . , ξn ),
then rank(V (x) − JV(x)Q) = 1. It follows that Vandermonde matrices have small
displacement rank (equal to 1). We are interested in the following problems for any
given nonsingular Vandermonde matrix V and vector of compatible size b.
1. Compute the inverse V −1 .
2. Solve the primal Vandermonde system V a = b.
3. Solve the dual Vandermonde system V a = b.
The inversion of Vandermonde matrices (as described in [11]) and the solution
of Vandermonde systems (using algorithms in [12] or via inversion and multipli-
cation as proposed in [13]) can be accomplished with fast and practical algorithms
that require only O(n 2 ) arithmetic operations rather than the O(n 3 ) predicted by
6.1 Vandermonde Solvers 167
(structure-oblivious) Gaussian elimination. See also [14, 15] and historical remarks
therein. Key to many algorithms is a fundamental result from the theory of poly-
nomial interpolation, namely that given n + 1 interpolation (node, value)-pairs,
{(ξk , βk )}k=0:n , where the ξk are all distinct, there exists a unique polynomial of
degree at most n, say pn , that satisfies pn (ξ kn) = βk j for k = 0, . . . , n. Writing
the polynomial in power form, pn (ξ ) = j=0 α j ξ , the vector of coefficients
a = (α0 , . . . , αn ) is the solution of problem (3). We also recall the most com-
mon representations for the interpolating polynomial (see for example [16, 17]):
Lagrange form:
n
pn (ξ ) = βk lk (ξ ), (6.1)
k=0
n
(ξ − ξ j )
lk (ξ ) = . (6.2)
(ξk − ξ j )
j=0
k= j
Newton form:
pn (ξ ) = γ0 + γ1 (ξ − ξ0 ) + · · · + γn (ξ − ξ0 ) · · · (ξ − ξn−1 ), (6.3)
n
p(ξ ) = γ (ξ − ξ j ) (6.5)
j=1
is the product form representation of the polynomial, we seek fast and practical
algorithms for computing the transformation
F : (ξ1 , . . . , ξn , γ ) → (α0 , . . . , αn ),
n
where i=0 αi ξ i is the power form representation of p(ξ ). The importance of making
available the power form representation of the interpolation polynomials was noted
early on in [11]. By implication, it is important to provide fast and practical trans-
formations between representations. The standard serial algorithm, implemented for
example by MATLAB’s poly function, takes approximately n 2 arithmetic opera-
tions. The algorithms presented in this section for converting from product to power
form (6.1) and (6.2) are based on the following lemma.
(n+1) (n+1)
Proposition 6.1 Let u j = e2 − ρ j e1 forj = 1, . . . , n be the vectors con-
taining the coefficients for term x − ρ j of p(x) = nj=1 (x − ρ j ), padded with zeros.
Denote by dftk (resp. dft−1k ) the discrete (resp. inverse discrete) Fourier transform
of length k and let a = (α0 , . . . , αn ) be the vector of coefficients of the power form
of pn (x). Then
a = dft−1
n+1 dftn+1 (u 1 ) · · · dftn+1 (u n ) .
6.1 Vandermonde Solvers 169
The proposition can be proved from classical results for polynomial multiplication
using convolution (e.g. see [24] and [25, Problem 2.4]).
In the following observe that if V is as given in (6.4), then
1
dftn+1 (u) = V u, and dft−1
n+1 (u) = V ∗ u, (6.6)
n+1
Algorithm 6.1 (pr2pw) uses the previous proposition to compute the coefficients
of the power form from the roots of the polynomial.
Algorithm 6.1 pr2pw: Conversion of polynomial from product form to power form.
Input: r = (ρ1 , . . . , ρn ) //product form nj=1 (x − ρ j )
n
Output: coefficients a = (α0 , . . . , αn ) //power form i=0 αi ξ j
(n+1) (n+1) (n)
1: U = −e1 r + e2 (e )
2: doall j = 1 : n
3: Û:, j = dftn+1 (U:, j )
4: end
5: doall i = 1 : n + 1
6: α̂i = prod (Ûi,: )
7: end
8: a = dft−1
n+1 (â) //â = (αˆ1 , . . . , α̂n+1 )
We next comment on Algorithm 6.1 (pr2pw). Both dft and prod can be imple-
mented using parallel algorithms. If p = O(n 2 ) processors are available, the cost
is O(log n). On a distributed memory system with p ≤ n processors, using a one-
dimensional block-column distribution for U , then in the first loop (lines 2–4) only
local computations in each processor are performed. In the second loop (lines 5–7)
there are (n + 1)/ p independent products of vectors of length n performed sequen-
tially on each processor at a total cost of (n − 1)(n + 1)/ p. The result is a vector of
length n + 1 distributed across the p processors. Vector â would then be distributed
across the processors, so the final step consists of a single transform of length n + 1
p log(n +1))
that can be performed in parallel over the p processors at a cost of O( n+1
parallel operations. So the parallel cost for Algorithm 6.1 (pr2pw) is
n (n + 1)
Tp = τ1 + (n − 1) + τ p , (6.7)
p p
polynomials being multiplied are all linear and monic. From (6.6) it follows that the
DFT of each u j is then
Since all terms are independent, this can be computed in 2n(n + 1)/ p parallel oper-
ations. It follows that the overall cost to convert from product to power form on p
processors is approximately
n(n + 1)
Tp = 2 + τp, (6.9)
p
Algorithm 6.2 powform: Conversion from full product (roots and leading coef-
ficient) (6.5) to power form (coefficients) using the explicit formula (6.8) for the
transforms.
Function: r = powform(r, γ )
Input: vector r = (ρ1 , . . . , ρn ) and scalar γ
Output: power form coefficients a = (α0 , . . . , αn )
1: ω = exp(−ι2π/(n + 1))
2: doall i = 1 : n + 1
3: doall j = 1 : n
4: υ̂i, j = ωi−1 − ρ j
5: end
6: end
7: doall i = 1 : n + 1
8: α̂i−1 = prod (Ûi,: )
9: end
10: a = dft−1 n+1 (â)
11: if γ = 1 then
12: a = γ r
13: end if
⎛ ⎞
v̂1
⎜ ⎟
V −1 = ⎝ ... ⎠
v̂n+1
then each row v̂i , i = 1, . . . , n + 1, is the vector of coefficients of the power form
representation of the corresponding Lagrange basis polynomial li−1 (ξ ).
Algorithm 6.3 for computing the inverse of V , by rows, is based on the previous
proposition and Algorithm 6.2 (powform).
Most operations of ivand occur in the last loop, where n + 1 conversions to
power form are computed. Using pr2pw on O(n 3 ) processors, these can be done in
O(log n) operations; the remaining computations can be accomplished in equal or
fewer steps so the total parallel cost of ivand is O(log n).
The solution of primal or dual Vandermonde systems can be computed by first obtain-
ing V −1 using Algorithm 6.3 (ivand) and then using a dense BLAS2 (matrix vector
multiplication) to obtain V −1 b or V − b.
We next describe an alternative approach for dual Vandermonde systems,
(V (x)) a = b, that avoids the computation of the inverse. The algorithm consists
of the following major stages.
1. From the values in (x, b) compute the divided difference coefficients g =
(γ0 , . . . , γn ) for the Newton form representation (6.3).
2. Using g and x construct the coefficients of the power form representation for each
j−1
term γ j i=0 (ξ − ξi ) of (6.3).
3. Combine these terms to produce the solution a as the vector of coefficients of the
power form representation for the Newton polynomial (6.3).
Algorithm 6.4 Neville: computing the divided differences by the Neville method.
Input: x = (ξ0 , . . . , ξn ) , b = (β0 , . . . , βn ) where the ξ j are pairwise distinct.
Output: Divided difference coefficients c = (γ0 , . . . , γn )
1: c = b //initialization
2: do k = 0 : n − 1
3: doall i = k + 1 : n
4: γi = (γi − γi−1 )/(ξi − ξi−k−1 )
5: end
6: end
6.1 Vandermonde Solvers 173
U = xe − ex + I, (6.10)
Observe that the matrix is shifted skew-symmetric and that it is also used in Algorithm
6.3 (line 1). The divided differences can be expresssed as a linear combination of the
values β j with coefficients computed from U . Specifically, for l = 0, . . . , n,
1 1 1
γl = β0 + β1 + · · · + βl .
υ1,1 · · · υ1,l+1 υ2,1 · · · υ2,l+1 υl+1,1 · · · υl+1,l+1
(6.11)
The coefficients of the terms β0 , . . . , βl for γl are the (inverses) of the products of the
elements in rows 1 to l +1 of the order l +1 leading principal submatrix, U1:l+1,1:l+1 ,
of U . The key to the parallel algorithm is the following observation [29, 30]:
For fixed j, the coefficients of β j in the divided difference coefficients γ0 , γ1 , . . . , γn are
the (inverses) of the n + 1 prefix products {υ j,1 , υ j,1 υ j,2 , . . . , υ j,1 · · · υ j,n+1 }.
The need to compute prefixes arises frequently in Discrete and Computational Math-
ematics and is recognized as a kernel and as such has been studied extensively in
the literature. For completeness, we provide a short description of the problem and
parallel algorithms in Sect. 6.1.3.
We now return to computing the divided differences using (6.11). We list the
steps as Algorithm 6.5 (dd_pprefix). First U is calculated. Noting that the matrix is
shifted skew-symmetric, this needs 1 parallel subtraction on n(n + 1)/2. Applying
n +1 independent instances of parallel prefix (Algorithm 6.8) and 1 parallel division,
all of the inverse coefficients of the divided differences are computed in log n steps
using (n + 1)n processors. Finally, the application of n independent instances of
a logarithmic depth tree addition algorithm yields the sought values in log(n + 1)
arithmetic steps on (n + 1)n/2 processors. Therefore, the following result holds.
Proposition 6.3 ([30]) The divided difference coefficients of the Newton inter-
polating polynomial of degree n can be computed from the n + 1 value pairs
{(ξk , βk ), k = 0, . . . , n} in 2 log(n + 1) + 2 parallel arithmetic steps using (n + 1)n
processors.
174 6 Special Linear Systems
Steps 2–3: Constructing power form coefficients from the Newton form
The solution a of the dual Vandermonde system V (x)a = b contains the coeffients
of the power form representation of the Newton interpolation polynomial in (6.3).
Based on the previous discussion, this will be accomplished by independent invoca-
tions of Algorithm 6.2 (powform). Each of these returns the set of coefficients for
the power form of an addend, γl l−1 j=0 (ξ − ξ j ), in (6.3). Finally, each of these inter-
mediate vectors is summed to return the corresponding element of a. When O(n 2 )
processors are available, the parallel cost of this algorithm is T p = O(log n). These
steps are listed in Algorithm 6.6 (dd2pw).
Definition 6.1 Let be an associative binary operation on a set S . The prefix compu-
tation problem is as follows: Given an ordered n-tuple (α1 , α2 , . . . , αn ) of elements
of S , compute all n prefixes α1 · · · αi for i = 1, 2, . . . , n.
The computations of interest in this book are prefix sums on scalars and prefix
products on scalars and matrices. Algorithm 6.7 accomplishes the prefix computation;
to simplify the presentation, it is assumed that n = 2k . The cost of the algorithm is
T p = O(log n) parallel “” operations.
6.1 Vandermonde Solvers 175
The pioneering array oriented programming language APL included the scan
instruction for computing prefixes [34]. In [35] many benefits of making prefix avail-
able as a primitive instruction are outlined. The monograph [32] surveys early work
on parallel prefix. Implementations have been described for several parallel archi-
tectures, e.g. see [36–40], and via using the Intel Threading Building Blocks C++
template library (TBB) in [41].
6.2.1 Introduction
where J is the lower bidiagonal matrix defined in section and the rank of the term
on the right-hand side of the equality above is at most 2.
Lemma 6.1 ([50]) Let A ∈ Rn×n be a nonsingular Toeplitz matrix. Then, both A
and A−1 are persymmetric; i.e., E n AE n = A , and E n A−1 E n = A− .
The proof follows directly from the fact that A is nonsingular, and E n2 = I .
Lemma 6.2 ([61]) Let A ∈ R2n×2n be a symmetric Toeplitz matrix of the form
B C
A= .
C B
we have
B + CEn 0
P2n A P2n = .
0 B − CEn
Theorem 6.1 ([53, 62, 63]) Let A ∈ Rn×n be a Toeplitz matrix with all its leading
principal submatrices being nonsingular. Let also, Au = αe1 and Av = βen , where
u = (1, μ1 , μ2 , . . . , μn−1 ) ,
v = (νn−1 , νn−2 , . . . , ν1 , 1) .
α A−1 = UV − Ṽ Ũ (6.12)
⎛ ⎞ ⎛ ⎞
⎜ 1 ⎟ ⎜ 1 ν1 ν n−2 ν n−1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ μ1 ⎟ ⎜ ν n−2 ⎟
⎜ 1 ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
U =⎜ ⎜ ⎟ V =⎜ ⎟
⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ μ n−2 ⎟ ⎜ ν1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
μ n−1 μ n−2 μ1 1 1
⎛ ⎞ ⎛ ⎞
⎜ 0 ⎟ ⎜ 0 μ n−1 μ2 μ 1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ν n−1 0 ⎟ ⎜ μ2 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
Ṽ = ⎜
⎜
⎟ Ũ = ⎜
⎟ ⎜
⎟
⎟
⎜ ⎟ ⎜ ⎟
⎜ ν2 ⎟ ⎜ μ n−1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
ν1 ν2 ν n−1 0 0
Equation (6.12) was first given in [62]. It shows that A−1 is completely determined
by its first and last columns, and can be used to compute any of its elements. Cancel-
lation, however, is assured if one attempts to compute an element of A−1 below the
cross-diagonal. Since A−1 is persymmetric, this situation is avoided by computing
the corresponding element above the cross-diagonal. Moreover, if A is symmetric,
its inverse is both symmetric and persymmetric, and u = E n v completely determines
A−1 via
α A−1 = UU − Ũ Ũ . (6.13)
Finally, if A is a lower triangular Toeplitz matrix, then A−1 = α −1 U , see Sect. 3.2.3.
Definition 6.5 Let A ∈ Rn×n be a banded symmetric Toeplitz matrix with elements
αij = α|i− j| = αk , k = 0, 1, . . . , m. where m < n, αm = 0 and αk = 0 for k > m.
Then the complex rational polynomial
φ(ξ ) = αm ξ −m + · · · + α1 ξ −1 + α0 + α1 ξ + · · · + αm ξ m (6.14)
Lemma 6.3 Let A be as in Definition 6.5. Then A is positive definite for all n > m
if and only if:
m
(i) φ(eiθ ) = α0 + 2 αk cos kθ 0 for all θ , and
k=1
(ii) φ(eiθ ) is not identically zero.
Note that condition (ii) is readily satisfied since αm = 0.
Theorem 6.2 ([65]) Let A in Definition 6.5 be positive semi-definite for all n > m.
Then the symbol φ(ξ ) can be factored as
where
ψ(ξ ) = β0 + β1 ξ + · · · + βm ξ m (6.16)
is a real polynomial with βm = 0, β0 > 0, and with no roots strictly inside the unit
circle.
This result is from [66], see also [67, 68]. The factor ψ(ξ ) is called the Hurwitz
factor. Such a factorization arises mainly in the time series analysis and the realization
of dynamic systems. The importance of the Hurwitz factor will be apparent from the
following theorem.
Theorem 6.3 Let φ(eiθ ) > 0, for all θ , i.e. A is positive definite and the symbol
(6.14) has no roots on the unit circle. Consider the Cholesky factorization
A = LL (6.17)
where, ⎛ ⎞
(1)
λ0
⎜ ⎟
⎜ λ(2) λ(2) ⎟
⎜ 1 0 ⎟
⎜
⎜ λ(3)
2 λ(3)
1 λ(3)
0
⎟
⎟
⎜ .. .. .. ⎟
⎜ . . . ⎟
L=⎜
⎜ λ(m+1) (m+1) (m+1) (m+1)
⎟
⎟ (6.18)
⎜ m λm−1 λm−2 · · · λ0 ⎟
⎜ ⎟
⎜ λ(m−2)
m
(m+2)
λm−1 · · · λ(m+2) λ(m+2) ⎟
⎜ 1 0 ⎟
⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
(k) (k) (k) (k)
0 λm λm−1 λ1 λ0
Then,
lim λ(k)
j = βj, 0 j m,
k→∞
where β j is given by (6.16). In fact, if τ is the root of ψ(ξ ) closest to the unit circle
(note that |τ | > 1) with multiplicity p, then we have
180 6 Special Linear Systems
(k)
λ j = β j + O[(k − j)2( p−1) /|τ |2(k− j) ]. (6.19)
This theorem shows that the rows of the Cholesky factor L converge, linearly, with
an asymptotic convergence factor that depends on the magnitude of the root of ψ(ξ )
closest to the unit circle. The larger this magnitude, the faster the convergence.
Next, we consider circulant matrices that play an important role in one of the
algorithms in Sect. 6.2.2. In particular, we consider these circulants associated with
banded symmetric or nonsymmetric Toeplitz matrices. Let A = [α−m , . . . , α−1 , α0 ,
α1 , . . . , αm ] ∈ Rn×n denote a banded Toeplitz matrix of bandwidth (2m + 1) where
2m + 1 n, and αm , α−m = 0. Writing this matrix as
⎛ ⎞
B C
⎜ ⎟
⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎜ ⎟
A=⎜
⎜
⎟
⎟
⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎝ ⎠
D B
(6.20)
⎛ ⎞
B C D
⎜ ⎟
⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎜ ⎟
à = ⎜
⎜
⎟
⎟
⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎝ ⎠
C D B
(6.21)
m
m
à = αj K j + α− j K n− j ,
j=0 j=1
⎛ ⎞
0 1
⎜ ⎟
⎜ 0 1 ⎟
K =⎜
⎜
⎟
⎟
⎝ 0 1 ⎠
1 0 0
(6.22)
W ∗ KW = Ω
(6.24)
= diag(1, ω, ω2 , . . . , ωn−1 ),
182 6 Special Linear Systems
and
W ∗ ÃW = Γ
m
m
(6.25)
= αjΩ j + α− j Ω − j
j=0 j=1
γk = φ̃(ωk ) (6.26)
where
m
φ̃(ξ ) = αjξ j
j=−m
In this section we present three algorithms for solving the banded Toeplitz linear
system
Ax = f (6.28)
where A is given by (6.20) and n m. These have been presented first in [59]. We
state clearly the conditions under which each algorithm is applicable, and give com-
plexity upper bounds on the number of parallel arithmetic operations and processors
required.
The first algorithm, listed as Algorithm 6.9, requires the least number of par-
allel arithmetic operations of all three, 6 log n + O(1). It is applicable only when
the corresponding circulant matrix is nonsingular. The second algorithm, numbered
Algorithm 6.10 may be used if A, in (6.28), is positive definite or if all its principal
minors are non-zero. It solves (6.28) in O(m log n) parallel arithmetic operations.
The third algorithm, Algorithm 6.11, uses a modification of the second algorithm to
compute the Hurwitz factorization of the symbol φ(ξ ) of a positive definite Toeplitz
matrix. The Hurwitz factor, in turn, is then used by an algorithm proposed in [65] to
6.2 Banded Toeplitz Linear Systems Solvers 183
solve (6.28). This last algorithm is applicable only if none of the roots of φ(ξ ) lie on
the unit circle. In fact, the root of the factor ψ(ξ ) nearest to the unit circle should be
far enough away to assure early convergence of the modified Algorithm 6.10. The
third algorithm requires O(log m log n) parallel arithmetic operations provided that
the Hurwitz factor has already been computed. It also requires the least storage of all
three algorithms. In the next section we discuss another algorithm that is useful for
block-Toeplitz systems that result from the discretization of certain elliptic partial
differential equations.
A Banded Toeplitz Solver for Nonsingular Associated Circulant Matrices:
Algorithm 6.9
First, we express the linear system (6.28), in which the banded Toeplitz matrix A is
given by (6.20), as
( Ã − S)x = f (6.29)
in which
Im 0 · · · 0 0
U = .
0 0 · · · 0 Im
where −1
−1 0 D
G = U Ã U− .
C 0
Algorithm 6.9 A banded Toeplitz solver for nonsymmetric systems with nonsingular
associated circulant matrices
Input: Banded nonsymmetric Toeplitz matrix A as in (6.20) and the right-hand side f .
Output: Solution of the linear system Ax = f
//Stage 1 //Consider the circulant matrix à associated with A as given by Eq. (6.21). First,
determine whether à is nonsingular and, if so, determine Ã−1 and y = Ã−1 f . Since the inverse
of a circulant matrix is also circulant, Ã−1 is completely determined by solving Ãv1 = e1 . This
is accomplished via (6.25), i.e., y = W Γ −1 W ∗ f and v1 = W Γ −1 W ∗ e1 . This computation is
organized as follows: √
1: Simultaneously form nW a and W ∗ f //see (6.27) and Theorem 6.4 √ (FFT). This is an inex-
pensive test for the nonsingularity of Ã. If none of the elements of nW a (eigenvalues of Ã)
vanish, we proceed to step (2). √
2: Simultaneously, obtain Γ −1 (W ∗ e1 ) and Γ −1 (W ∗ f ). //Note that nW ∗ e1 = (1, 1, . . . , 1) .
3: Simultaneously obtain v1 = W (Γ W e1 ) and y = W (Γ W ∗ f ) via the FFT.
−1 ∗ −1 //see
Theorem 6.4
//Stage 2 //solve the linear system
y1
Gz = U y = (6.31)
yν
O(mn) elements, which may not be acceptable for large n and relatively large m.
Even if the corresponding circulant matrix is positive definite, Theorem 6.3 indicates
that convergence can indeed be slow if the magnitude of that root of the Hurwitz
factor ψ(ξ ) (see Theorem 6.2) nearest to the unit circle is only slightly greater than
1. If s is that row of L at which convergence takes place, the parallel Cholesky
factorization and the subsequent forward and backward sweeps needed for solving
(6.28) are more efficient the smaller s is compared to n.
In the following, we present an alternative parallel algorithm that solves the
same positive definite system in O(m log n) parallel arithmetic operations with O(n)
processors, which requires no more than 2n + O(m 2 ) temporary storage locations.
For n = 2q p, the algorithm consists of (q + 1) stages which we outline as follows.
Stage 0.
Let the pth leading principal submatrix of A be denoted by A0 , and the right-hand
side of (6.28) be partitioned as
(0) (0)
f = ( f1 , f2 , . . . , f η(0) ),
(0) ( p) (0)
A0 (z 0 , y1 , . . . , yη(0) ) = (e1 , f 1 , . . . , f p(0) ) (6.32)
( p)
where e1 is the first column of the identity I p . From Theorem 4.1 and the discussion
of the column sweep algorithm (CSweep) in Sect. 3.2.1 it follows that the above
systems can be solved in 9( p −1) parallel arithmetic operations using mη processors.
Stage ( j = 1, 2, . . . , q).
Let ⎛ ⎞
A j−1 0
⎜ C ⎟
Aj = ⎜
⎝
⎟
⎠ (6.33)
C
0 A j−1
( j)
where f i ∈ R2r , and ν = n/2r . Here, we simultaneously solve the (ν + 1) linear
systems
A j z j = e1(2r ) , (6.34)
( j) ( j)
A j yi = fi , i = 1, 2, . . . , ν, (6.35)
(r ) ( j−1) ( j−1)
where stage ( j − 1) has already yielded z j−1 = A−1 j−1 e1 and yi = A−1j−1 f i ,
i = 1, 2, . . . , 2ν. Next, we consider solving the ith linear system in (6.35). Observing
that
( j−1)
( j) f 2i−1
fi = ( j−1) ,
f 2i
D j = diag(A−1 −1
j−1 , A j−1 )
where
0 C
Gj = A−1
j−1 and H j = A−1
j−1 .
C 0
u = αz j−1
= (1, μ1 , μ2 , . . . , μr −1 ) ,
and
ũ = Jr Er u.
Hence, H j is given by
α H j = (YY
1 − Ỹ Ỹ1 )C , (6.37)
⎛ ⎞
1
⎜ ⎟
⎜ ⎟
⎜ μ1 1 ⎟
⎜ ⎟
⎜ ⎟
Y1 = (Im , 0)Y = ⎜
⎜ μ2 μ1 1 ⎟
⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎝ ⎠
μ m−1 μ2 μ1 1
and
⎛ ⎞
0
⎜ ⎟
⎜ ⎟
⎜ μ r−1 0 ⎟
⎜ ⎟
⎜ ⎟
Ỹ1 = (Im , 0)Ỹ = ⎜
⎜ μ r−2 μ r−1 0 ⎟
⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎝ ⎠
μ r−m+1 μ r−2 μ r−1 0
( j)
where the central block N2 is clearly nonsingular since A j is invertible. Note
( j)
also that the eigenvalues of N2 ∈ R2m×2m are the same as those eigenvalues of
(D −1 −1
j A j ) that are different from 1. Since D j A j is similar to the positive definite
−1/2 −1/2 ( j)
matrix D j A j D j , N2 is not only nonsingular, but also has all its eigenvalues
positive. Hence, the solution of (6.36) is trivially obtained if we first solve the middle
2m equations,
( j)
N2 h = g, (6.38)
or
Im E m M j E m h 1,i g1,i
= , (6.39)
Mj Im h 2,i g2,i
where
M j = (Im , 0)H j = α −1 (Y1 Y1 − Ỹ1 Ỹ1 )C
188 6 Special Linear Systems
( j) ( j)
and gk,i , h k,i , k = 1, 2, are the corresponding partitions of fi and yi , respectively.
( j)
Observing that N2 is centrosymmetric (see Definition 6.4), then from Lemma 6.2
we reduce (6.39) to the two independent linear systems
( j) has all its leading principal minors positive, these two systems
Since P2m N2 P2m
may be simultaneously solved using Gaussian elimination without partial pivoting.
( j)
Once h 1 and E m h 2 are obtained, yi is readily available.
( j)
( j) y2i−1 − Er H j E m h 2,i
yi = ( j−1) . (6.41)
y2i − H j h 1,i
⎛ ⎞
S1
⎜ ⎟
⎜ ⎟
⎜ V1 S2 ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
L=⎜
⎜
Vi−1 Si ⎟
⎟
⎜ ⎟
⎜ V S ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ V S ⎟
⎝ ⎠
(6.42)
in which ı̂ = mi + 1, S j and V j ∈ Rm×m are lower and upper triangular, respectively,
and S, V are Toeplitz and given by
6.2 Banded Toeplitz Linear Systems Solvers 189
⎛ ⎞ ⎛ ⎞
⎜ β0 ⎟ ⎜ β m β m−1 β1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ β1 ⎟ ⎜ β2 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
S=⎜ ⎟ and V = ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ β m−1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
β m−1 β1 β0 βm
B = SS + VV ,
and
C = SV ,
where B, and C are given in (6.20), note that in this case D = C . Hence, A can be
expressed as
Im
A = R R + VV (Im , 0) (6.43)
0
⎛ ⎞
S
⎜ ⎟
⎜ ⎟
⎜ V S ⎟
⎜ ⎟
⎜ ⎟
R =⎜
⎜
⎟
⎟
⎜ ⎟
⎜ V S ⎟
⎜ ⎟
⎝ ⎠
V S
(6.44)
m
Assuming that an approximate Hurwitz factor, ψ̃(ξ ) = β̃ j ξ j , is obtained from
j=0
the Cholesky factorization of A, i.e.,
190 6 Special Linear Systems
⎛ ⎞
⎜ S̃ ⎟
⎜ ⎟
⎜ Ṽ ⎟
⎜ S̃ ⎟
⎜ ⎟
⎜ ⎟
R̃ = ⎜
⎜
⎟
⎟
⎜ ⎟
⎜ Ṽ S̃ ⎟
⎜ ⎟
⎝ ⎠
Ṽ S̃
is available, the solution of the linear system (6.28) may be computed via the
Sherman-Morrison-Woodbury formula [14], e.g. see [65],
−1 −1 Ṽ
x=F f −F Q −1 [Ṽ , 0]F −1 f (6.45)
0
where
F = R̃ R̃,
and
−1 Ṽ
Q = Im + (Ṽ , 0)F .
0
E m Mk E m = S − S −1 C. (6.46)
In other words, the matrices M j in (6.39) converge to a matrix M (say), and the ele-
ments β̃ j , 0 j m −1, of S̃ are obtained by computing the Cholesky factorization
of −1
−1 −1 0
E m C M E m = (0, Im )A j−1 .
Im
Now, the only remaining element of Ṽ , namely β̃m is easily obtained as αm /β̃0 ,
which can be verified from the relation C = SV .
Observing that we need only to compute the matrices M j , Algorithm 6.10 may be
(2r )
modified so that in any stage j, we solve only the linear systems A j z j = e1 , where
j = 0, 1, 2, . . ., and 2r = 2 j+1 p. Since we assume that convergence takes place in
the kth stage, where r̃ = 2k p n/2, the number of parallel arithmetic operations
required for computing the Hurwitz factor is approximately 6mk 6m log(n/2 p)
using 2m r̃ mn processors.
Using positive definite Toeplitz systems with bandwidths 5, i.e. m = 2. Lemma 6.3
states that all sufficiently large pentadiagonal Toeplitz matrices of the form [1, σ, δ,
σ, 1] are positive definite provided that the symbol function φ(eiθ ) = δ + 2σ cos θ +
2 cos 2θ has no negative values. For positive δ, this condition is satisfied when (σ, δ)
corresponds to a point on or above the lower curve in Fig. 6.1. The matrix is diagonally
dominant when the point lies above the upper curve. For example, if we choose test
matrices for which δ = 6 and 0 σ 4; every point on the dashed line in Fig. 6.1
represents one of these matrices. In [59], numerical experiments were conducted to
⎧
⎪ 2
2 δ = ⎨ (σ + 8) ⁄ 4 if σ ≤ 4
⎪ 2( σ – 1) if σ ≥ 4
⎩
0
-4 -2 0 2 4
σ
192 6 Special Linear Systems
Table 6.1 Parallel arithmetic operations, number of processors, and overall operation counts for
Algorithms 6.9, 2 and 3
Parallel arith. ops Number of Overall ops. Storage
processors
Algorithm 6.9 6 log n 2n 10n log n 2n
a Algorithm 6.10 18 log n 4n 4n log n 2n
b Algorithm 6.11 16 log n 2n 12n log n n
a The 2-by-2 linear systems (6.40) are solved via Cramer’s rule
b Algorithm 6.11 does not include the computation of the Hurwitz factors
compare the relative errors in the computed solution achieved by the above three
algorithm and other solvers. Algorithm 6.9 with only one step of iterative refinement
semed to yield the lowest relative error.
Table 6.1, summarizes the number of parallel arithmetic operations, and the num-
ber of required processors for each of the three pentadiagonal Toeplitz solvers (Algo-
rithms 6.9, 6.10, and 6.11). In addition, Table 6.1 lists the overall number of arithmetic
operations required by each solver if implemented on a uniprocessor, together with
the required storage. Here, the pentadiagonal test matrices are of order n, with the
various entries showing only the leading term.
It is clear, however, that implementation details of each algorithm on a given
parallel architecture will determine the cost of internode communications, and the
cost of memory references within each multicore node. It is such cost, rather than the
cost of arithmetic operations, that will determine the most scalable parallel banded
Toeplitz solver on a given architecture.
u+v =b (6.47)
where
u = Pu and v = −Pv. (6.48)
U +V = A (6.49)
where
U = PUP and V = −PVP. (6.50)
Proof Similar to Theorem 6.5, the proof is easily established if one takes U =
2 (A + PAP) and V = 2 (A − PAP).
1 1
P = Er ⊗ ±E s , P = Er ⊗ ±Is , P = Ir ⊗ ±E s , (6.54)
with P1 being some signed permutation matrix of order n, for instance. Now, consider
the orthogonal matrix
6.3 Symmetric and Antisymmetric Decomposition (SAS) 195
1 I −P1
X=√ . (6.57)
2 P1 I
From (6.56), it is clear that the linear system has been decomposed into two inde-
pendent subsystems that can be solved simultaneously. This decoupling is a direct
consequence of the assumption that the matrix A is reflexive with respect to P.
In many cases, both of the submatrices A11 + A12 P1 and A22 − A21 P1 still
possess the SAS property, with respect to some other reflection matrix. For exam-
ple, in three-dimensional linear isotropic or orthotropic elasticity problems that are
symmetrically discretized using rectangular hexahedral elements, the decomposition
can be further carried out to yield eight independent subsystems each of order n/4,
and not possessing the SAS property; e.g. see [73, 75]. Now, the smaller decoupled
subsystems in (6.58) can each be solved by either direct or preconditioned iterative
methods offering a second level of parallelism.
While for a large number of orthotropic elasticity problems, the resulting stiffness
matrices can be shown to possess the special SAS property, some problems yield
stiffness matrices A that are low-rank perturbations of matrices that possess the
SAS property. As an example consider a three-dimensional isotropic elastic long
bar with asymmetry arising from the boundary conditions. Here, the bar is fixed at
its left end, ξ = 0, and supported by two linear springs at its free end, ξ = L,
as shown in Fig. 6.2. The spring elastic constants K 1 and K 2 are different. The
dimensionless constants and material properties are given as: length L, width b,
height c, Young’s Modulus E, and the Poisson’s Ratio ν. The loading applied to
Fig. 6.2 Prismatic bar with one end fixed and the other elastically supported
this bar is a uniform simple bending moment M across the cross section at its right
end, and a concentrated force P at the point (L, −b/2, c/2). For the finite element
discretization, we
If the matrix A, in the eigenvalue problem Ax = λx, possesses the SAS property,
it can be shown (via similarity transformations) that the proposed decomposition
approach can be used for solving the problem much more efficiently. For instance,
if P is of the form
P = E 2 ⊗ E n/2 (6.60)
To see how much effort can be saved, we consider the QR iterations for a real-
valued full matrix of order N. In using the QR iterations for obtaining the eigenpairs
of a matrix, we first reduce this matrix to the upper Hessenberg form (or a tridiagonal
matrix for symmetric problems). On a uniprocessor, the reduction step takes about
cN 3 flops [14] for some constant c. If the matrix A satisfying PAP = A can be
decomposed into four submatrices each of order N /4; then the amount of floating
point operations required in the reduction step is reduced to cN 3 /16. In addition,
because of the fully independent nature of the four subproblems, we can further
reduce the computing time on parallel architectures.
Depending on the form of the signed reflection matrix P, several similarity trans-
formations can be derived for this special class of matrices A = PAP. Next, we
present another computationally useful similarity transformation.
Theorem 6.8 ([75]) Let A ∈ Rn×n be partitioned as (Ai, j ), i,j = 1,2, and 3 with
A11 and A33 of order r and A22 of order s, where 2r + s = n. If A = PAP where P
is of the form

P = \begin{pmatrix} 0 & 0 & P_1 \\ 0 & I_s & 0 \\ P_1 & 0 & 0 \end{pmatrix}
in which P1 is some signed permutation matrix of order r, then there exists an orthog-
onal matrix,
X = \frac{1}{\sqrt{2}} \begin{pmatrix} I & 0 & -P_1 \\ 0 & \sqrt{2}\, I_s & 0 \\ P_1 & 0 & I \end{pmatrix}.   (6.62)
It should be noted that if A is symmetric, then both diagonal blocks in (6.63) are
also symmetric. The same argument holds for the two diagonal blocks in (6.59).
Note that the application of the above decomposition method can be extended to
the generalized eigenvalue problem Ax = λBx if B also satisfies the SAS property,
namely B = PBP.
6.4 Rapid Elliptic Solvers

In this section we consider the parallelism in the solution of linear systems with
matrices that result from the discretization of certain elliptic partial differential equa-
tions. As it turns out, there are times when the equations, boundary conditions and
discretization are such that the structure of the coefficient matrix for the linear system
allows one to develop very fast direct solution methods, collectively known as Rapid
Elliptic Solvers (RES for short) or Fast Poisson Solvers. The terms are due to the
fact that their computational complexity on a uniprocessor is only O(mn log mn) or
less for systems of mn unknowns compared to the O((mn)3 ) complexity of Gaussian
elimination for dense systems.
Here we are concerned with direct methods, in the sense that in the absence of
roundoff the solvers return an exact solution. When well implemented, RES are
faster than other direct and iterative methods [77, 78]. The downside is their limited
applicability: RES can only be used directly for special elliptic PDEs (meaning the
equation and its domain of definition) under suitable discretization; moreover, their perfor-
mance in general depends on the boundary conditions and problem size. On the other
hand, there are many cases when RES cannot be used directly but can be helpful as
preconditioners. Interest in the design and implementation of parallel algorithms for
RES started in the early 1970s; see Sect. 6.4.9 for some historical notes. Theoretically,
parallel RES can solve the linear systems under consideration in O(log mn) parallel
operations on O(mn) processors, in contrast with the fastest but impractical algorithm [79]
for general linear systems that requires O(log² mn) parallel operations on O((mn)⁴)
processors. In the sequel we describe and evaluate the properties of some interesting
algorithms from this class.
6.4.1 Preliminaries
The focus of this chapter is on parallel RES for the following model problem:
\begin{pmatrix}
T & -I & & & \\
-I & T & -I & & \\
 & \ddots & \ddots & \ddots & \\
 & & -I & T & -I \\
 & & & -I & T
\end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ \vdots \\ u_n \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ \vdots \\ f_n \end{pmatrix},
\qquad (6.64)
where T = [−1, 4, −1]_m and the unknowns and right-hand side u, f are conformally
partitioned into subvectors u_i = (υ_{i,1}, . . . , υ_{i,m})^⊤ ∈ R^m and f_i = (φ_{i,1}, . . . , φ_{i,m})^⊤.
See Sect. 9.1 of Chap. 9 for some details regarding the derivation of this system but
also books such as [80, 81]. We sometimes refer to the linear system as discrete
Poisson system and to the block-tridiagonal matrix A in (6.64) as Poisson matrix.
Note that this is a 2-level Toeplitz matrix, that is a block-Toeplitz matrix with Toeplitz
blocks [82]. The 2-level Toeplitz structure is a consequence of constant coefficients
in the differential equation and the Dirichlet boundary conditions. It is worth noting
that the discrete Poisson system (6.64) can be reordered and rewritten as
This time the Poisson matrix has m blocks of order n each. The two systems are
equivalent; for that reason, in the algorithms that follow, the user can select the
formulation that minimizes the cost. For example, if the cost is modeled by T_p =
γ_1 log m log n + γ_2 log² n for positive γ_1, γ_2, then it is preferable to choose n ≤ m.
The aforementioned problem is a special version of the block-tridiagonal system
where T, W ∈ Rm×m are symmetric and commute under multiplication so that they
are simultaneously diagonalizable by the same orthogonal similarity transformation.
Many of the parallel methods we discuss here also apply with minor modifications
to the solution of system (6.65).
It is also of interest to note that A is of low block displacement rank. In particular
if we use our established notation for matrices Im and Jn then
where

X = ½ (T, W, 0, . . . , 0),   Y = (I_m, 0, . . . , 0).
Therefore the rank of the result in (6.66) is at most 2m. This can also be a starting point
for constructing iterative solvers using tools from the theory of low displacement rank
matrices; see for example [57, 82].
We define the Chebyshev polynomials that are useful here and in later chapters. See
also refs. [83–86].
Definition 6.6 The degree-k Chebyshev polynomial of the 1st kind is defined as:

T_k(ξ) = \begin{cases} \cos(k \arccos \xi) & \text{when } |\xi| \le 1, \\ \cosh(k\, \mathrm{arccosh}\, \xi) & \text{when } |\xi| > 1. \end{cases}
Tk+1 (ξ ) = 2ξ Tk (ξ ) − Tk−1 (ξ )
where T0 (ξ ) = 1, T1 (ξ ) = ξ.
The degree-k modified Chebyshev polynomial of the 2nd kind is defined as:
\hat U_k(ξ) = \begin{cases} \dfrac{\sin((k+1)\theta)}{\sin\theta}, \ \cos\theta = \dfrac{\xi}{2}, & \text{when } 0 \le \xi < 2, \\ k+1, & \text{when } \xi = 2, \\ \dfrac{\sinh((k+1)\psi)}{\sinh\psi}, \ \cosh\psi = \dfrac{\xi}{2}, & \text{when } \xi > 2. \end{cases}
A similar formula, involving Chebyshev polynomials can be written for the inverse
of the Poisson matrix.
Proposition 6.5 For any nonsingular T ∈ Rm×m , the matrix A = [−I, T, −I ]n is
nonsingular if and only if Uˆn (T ) is nonsingular. Then, A−1 can be written as a block
matrix that has, as block (i, j) the order m submatrix
(A^{-1})_{i,j} = \begin{cases} \hat U_n^{-1}(T)\, \hat U_{i-1}(T)\, \hat U_{n-j}(T), & j \ge i, \\ \hat U_n^{-1}(T)\, \hat U_{j-1}(T)\, \hat U_{n-i}(T), & i \ge j. \end{cases}   (6.68)
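The definition and formula (6.68) can be checked on a small example. The sketch below (an illustration only) assumes the standard three-term recurrence Û_{k+1}(ξ) = ξ Û_k(ξ) − Û_{k−1}(ξ), with Û_0 = 1 and Û_1 = ξ, which follows from elementary trigonometric identities although it is not restated here.

import numpy as np

def Uhat_seq(nmax, X):
    # matrix polynomials Uhat_0(X), ..., Uhat_nmax(X) via the assumed three-term recurrence
    seq = [np.eye(X.shape[0]), X.copy()]
    for _ in range(nmax - 1):
        seq.append(X @ seq[-1] - seq[-2])
    return seq

# scalar check of Definition 6.6: Uhat_k(2 cos(theta)) = sin((k+1) theta) / sin(theta)
theta, kdeg = 0.7, 5
U = Uhat_seq(kdeg, np.array([[2.0 * np.cos(theta)]]))
print(np.isclose(U[kdeg][0, 0], np.sin((kdeg + 1) * theta) / np.sin(theta)))

# block check of (6.68) for A = [-I, T, -I]_n with T = [-1, 4, -1]_m
m, n = 3, 4
T = 4.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
Ainv = np.linalg.inv(A)
U = Uhat_seq(n, T)
Un_inv = np.linalg.inv(U[n])
for i in range(1, n + 1):
    for j in range(i, n + 1):                 # j >= i; the case i >= j is symmetric
        blk = Ainv[(i - 1) * m:i * m, (j - 1) * m:j * m]
        assert np.allclose(blk, Un_inv @ U[i - 1] @ U[n - j])
print("(6.68) verified on the sample problem")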
Matrix decomposition (MD) refers to a large class of methods for solving the model
problem we introduced earlier as well as more general problems, including (6.65);
cf. [91]. As was stated in [92] “It seldom happens that the application of L processors
would yield an L-fold increase in efficiency relative to a single processor, but that
is the case with the MD algorithm.” The first detailed study of parallel MD for the
Poisson equation was presented in [93]. The Poisson matrix can be written as a
Kronecker sum of Toeplitz tridiagonal matrices. Specifically,
where T̃k = [−1, 2, −1]k for k = m, n. To describe MD, we make use of this
representation. We also use the vec operator: acting on a matrix Y = (y_1, . . . , y_n),
it returns the vector that is formed by stacking the columns of Y, that is
vec(y_1, . . . , y_n) = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}.
The unvecn operation does the reverse: For a vector of mn elements, it selects its
n contiguous subvectors of length m and returns the matrix with these as columns.
Finally, the permutation matrix Π_{m,n} ∈ R^{mn×mn} is defined as the unique matrix,
sometimes called the vec-permutation matrix, such that vec(A) = Π_{m,n} vec(A^⊤), where A ∈
R^{m×n}.
MD algorithms consist of 3 major stages. The first amounts to transforming A
and f , then solving a set of independent systems, followed by back transforming to
compute the final solution. These 3 stages are characteristic of MD algorithms; cf.
[91]. Moreover, directly or after transformations, they consist of several independent
subproblems that enable straightforward implementation on parallel architectures.
Denote by Q x the matrix of eigenvectors of T̃m . In the first stage, both sides of
the discrete Poisson system are multiplied by the block-diagonal orthogonal matrix
I_n ⊗ Q_x. We now write

(I_n ⊗ Q_x^⊤)(I_n ⊗ T̃_m + T̃_n ⊗ I_m)(I_n ⊗ Q_x)(I_n ⊗ Q_x^⊤) u = (I_n ⊗ Q_x^⊤) f,
(I_n ⊗ Λ̃_m + T̃_n ⊗ I_m)(I_n ⊗ Q_x^⊤) u = (I_n ⊗ Q_x^⊤) f.   (6.70)
Matrix B is block-diagonal with diagonal blocks of the form T̃_n + λ_i^{(m)} I, where
λ_1^{(m)}, . . . , λ_m^{(m)} are the eigenvalues of T̃_m. Recall also from Proposition 6.4 that the
eigenvalues are computable from closed formulas. Therefore, the transformed system
is equivalent to m independent subsystems of order n, each of which has the same
structure as T̃n . These are solved in the second stage of MD. The third stage of MD
consists of the multiplication of the result of the previous stage with In ⊗ Q x . From
(6.71), it follows that
u = (I_n ⊗ Q_x) Π_{m,n} B^{-1} Π_{m,n}^⊤ (I_n ⊗ Q_x^⊤) f.   (6.72)
Algorithm 6.12 MD-Fourier: matrix decomposition method for the discrete Poisson system.
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f = (f_1; . . . ; f_n)
//Stage I: apply fast DST on each subvector f_1, . . . , f_n
1: doall j = 1 : n
2:    f̂_j = Q_x^⊤ f_j
3:    set F̂ = ( f̂_1, . . . , f̂_n )
4: end
//Stage II: implemented with suitable solver (Toeplitz, tridiagonal, multiple shifts, multiple right-hand sides)
5: doall i = 1 : m
6:    compute (T̃_n + λ_i^{(m)} I)^{-1} F̂_{i,:} and store the result in the ith row of a temporary matrix Û;
      λ_1^{(m)}, . . . , λ_m^{(m)} are the eigenvalues of T̃_m
7: end
//Stage III: apply fast DST on each column of Û
8: doall j = 1 : n
9:    u_j = Q_x Û_{:,j}
10: end
When both T̃m and T̃n in (6.69) are diagonalizable with Fourier-type transforms, it
becomes possible to solve Eq. (6.64) using another approach, called the complete
Fourier transform method (CFT) that we list as Algorithm 6.13 [94]. Here, instead
of (6.72) we write the solution of (6.64) as
Algorithm 6.13 CFT: complete Fourier transform method for the discrete Poisson
system.
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f
//The right-hand side arranged as the m × n matrix F = (f_1, . . . , f_n)
Output: Solution u = (u_1; . . . ; u_n)
//Stage Ia: DST on columns of F
1: doall j = 1 : n
2:    f̂_j = Q_x^⊤ f_j
3:    set F̂ = ( f̂_1, . . . , f̂_n )
4: end
//Stage Ib: DST on rows of F̂
5: doall i = 1 : m
6:    F̃_{i,:} = Q_y F̂_{i,:}
7: end
//Stage II: elementwise division of F̃ by the eigenvalues of A
8: doall i = 1 : m
9:    doall j = 1 : n
10:      F̃_{i,j} = F̃_{i,j} / (λ_i^{(m)} + λ_j^{(n)})
11:   end
12: end
//Stage IIIa: DST on rows of F̃
13: doall i = 1 : m
14:    F̂_{i,:} = Q_y F̃_{i,:}
15: end
//Stage IIIb: DST on columns of F̂
16: doall j = 1 : n
17:    u_j = Q_x F̂_{:,j}
18: end
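For illustration, a compact serial sketch of CFT in NumPy/SciPy is given below; it relies on the orthonormal DST-I (whose transform matrix is an orthonormal eigenvector matrix of T̃_k), so the batched transforms along each axis play the role of the doall loops. Grid sizes and the final dense check are arbitrary illustration choices.

import numpy as np
from scipy.fft import dst

def cft_poisson_solve(F):
    # solve (I_n ⊗ T̃_m + T̃_n ⊗ I_m) vec(U) = vec(F), i.e. the discrete Poisson system (6.64),
    # by the complete Fourier transform approach sketched in Algorithm 6.13
    m, n = F.shape
    lam_m = 2.0 - 2.0 * np.cos(np.arange(1, m + 1) * np.pi / (m + 1))  # eigenvalues of T̃_m
    lam_n = 2.0 - 2.0 * np.cos(np.arange(1, n + 1) * np.pi / (n + 1))  # eigenvalues of T̃_n
    # Stages Ia/Ib: orthonormal DST-I on the columns, then on the rows
    Fhat = dst(dst(F, type=1, norm='ortho', axis=0), type=1, norm='ortho', axis=1)
    # Stage II: elementwise division by the eigenvalues of A
    Fhat /= lam_m[:, None] + lam_n[None, :]
    # Stages IIIa/IIIb: transform back (the orthonormal DST-I is its own inverse)
    return dst(dst(Fhat, type=1, norm='ortho', axis=1), type=1, norm='ortho', axis=0)

# check against a direct solve on a small grid
m, n = 7, 9
T = 4.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
F = np.random.rand(m, n)
U = cft_poisson_solve(F)
print(np.allclose(A @ U.flatten(order='F'), F.flatten(order='F')))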
BCR for the discrete Poisson system (6.65) is a method that generalizes the cr
algorithm (cf. 5.5) for point tridiagonal systems (cf. Chap. 5) while taking advantage
of its 2-level Toeplitz structure; cf. [95]. BCR is more general than Fourier-MD in the
sense that it does not require knowledge of the eigenstructure of T nor does it deploy
Fourier-type transforms. We outline it next for the case that the number of blocks
is n = 2^k − 1; any other value can be accommodated following the modifications
proposed in [96].
For steps r = 1, . . . , k − 1, adjacent blocks of equations are combined in groups
of 3 to eliminate 2 blocks of unknowns; in the first step, for instance, unknowns from
even numbered blocks are eliminated and a reduced system with block-tridiagonal
coefficient matrix A^{(1)} = [−I, T² − 2I, −I]_{2^{k−1}−1}, containing approximately only
half of the blocks, remains. The right-hand side is transformed accordingly. Setting
T^{(0)} = T, f^{(0)} = f, and T^{(r)} = (T^{(r−1)})² − 2I, the reduced system at the rth step
is

[−I, T^{(r)}, −I]_{2^{k−r}−1} u^{(r)} = f^{(r)},

where T^{(r)} admits the analytic expression

T^{(r)} = 2\, T_{2^r}\!\left(\tfrac{T}{2}\right).   (6.74)
From the closed form expressions of the eigenvalues of T and the roots of the Chebyshev
polynomial of the 1st kind T_{2^r}, and relation (6.74), it is straightforward to show
that the roots of the matrix polynomial T^{(r)} are

ρ_i^{(r)} = 2 \cos\left( \frac{(2i − 1)\pi}{2^{r+1}} \right), \quad i = 1, \ldots, 2^r,   (6.75)

so that

T^{(r)} = (T − ρ_{2^r}^{(r)} I) \cdots (T − ρ_1^{(r)} I).   (6.76)
The roots (cf. (6.75)) are distinct; therefore the inverse can also be expressed in terms
of the partial fraction representation of the rational function 1/(2T_{2^r}(ξ/2)):
(T^{(r)})^{-1} = \sum_{i=1}^{2^r} γ_i^{(r)} (T − ρ_i^{(r)} I)^{-1}.   (6.77)
From the analytic expression for T (r ) in (6.74) and standard formulas, the partial
fraction coefficients are equal to
γ_i^{(r)} = (−1)^{i+1} \frac{1}{2^r} \sin\left( \frac{(2i − 1)\pi}{2^{r+1}} \right), \quad i = 1, \ldots, 2^r.
The partial fraction approach for solving linear systems with rational matrix coeffi-
cients is discussed in detail in Sect. 12.1 of Chap. 12. Here we list as Algorithm 6.14
(simpleSolve_PF) one version that can be applied to solving systems of the form
\left( \prod_{j=1}^{d} (T − ρ_j I) \right) x = b for mutually distinct ρ_j's. This will be applied to solve
any systems with coefficient matrix such as (6.76).
Algorithm 6.14 simpleSolve_PF: solving \left( \prod_{j=1}^{d} (T − ρ_j I) \right) x = b for mutually
distinct values ρ_j from partial fraction expansions.
Input: T ∈ R^{m×m}, b ∈ R^m and distinct values {ρ_1, . . . , ρ_d}, none of them equal to an eigenvalue of T.
Output: Solution x = \left( \prod_{j=1}^{d} (T − ρ_j I) \right)^{-1} b.
1: doall j = 1 : d
2:    compute coefficient γ_j = 1/p'(ρ_j), where p(ζ) = \prod_{i=1}^{d} (ζ − ρ_i)
3:    solve (T − ρ_j I) x_j = b
4: end
5: set c = (γ_1, . . . , γ_d)^⊤, X = (x_1, . . . , x_d)
6: compute and return x = Xc
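A NumPy sketch of simpleSolve_PF is shown below; it is exercised on the product form (6.76), with the shifts taken as the roots (6.75) of T^{(r)}, and the result compared against a direct solve. The dense shifted solves merely stand in for whatever tridiagonal kernel would be used in practice; the d solves are the independent tasks exploited in a parallel setting.

import numpy as np

def simple_solve_pf(T, rho, b):
    # solve (prod_j (T - rho_j I)) x = b via partial fractions, for mutually distinct shifts rho_j
    d, m = len(rho), T.shape[0]
    X = np.empty((m, d))
    gamma = np.empty(d)
    for j in range(d):                                   # independent tasks
        gamma[j] = 1.0 / np.prod([rho[j] - rho[i] for i in range(d) if i != j])  # 1/p'(rho_j)
        X[:, j] = np.linalg.solve(T - rho[j] * np.eye(m), b)
    return X @ gamma

# check on T^{(r)} = 2 T_{2^r}(T/2) with T = [-1, 4, -1]_m, using the roots (6.75)
m, r = 8, 3
T = 4.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
Tr = T.copy()
for _ in range(r):
    Tr = Tr @ Tr - 2.0 * np.eye(m)                       # T^{(r)} = (T^{(r-1)})^2 - 2I
rho = 2.0 * np.cos((2.0 * np.arange(1, 2 ** r + 1) - 1.0) * np.pi / 2 ** (r + 1))
b = np.random.rand(m)
x = simple_solve_pf(T, rho, b)
print(np.allclose(Tr @ x, b))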
Next, implementations of block cyclic reduction are presented that deploy the
product form (6.76) and the partial fraction representation (6.77) for operations with T^{(r)}.
After r < k − 1 block cyclic reduction steps the system has the form

[−I, T^{(r)}, −I]\, u^{(r)} = f^{(r)}, \quad \text{where} \quad
f^{(r)} = \begin{pmatrix} f^{(r)}_{2^r} \\ f^{(r)}_{2\cdot 2^r} \\ \vdots \\ f^{(r)}_{(2^{k-r}-1)\cdot 2^r} \end{pmatrix}.   (6.78)

During back substitution, the unknowns eliminated at step r are recovered from

\mathrm{diag}[T^{(r-1)}]
\begin{pmatrix} u_{1\cdot 2^{r-1}} \\ u_{3\cdot 2^{r-1}} \\ \vdots \\ u_{(2^{k-r+1}-1)\cdot 2^{r-1}} \end{pmatrix}
=
\begin{pmatrix}
f^{(r-1)}_{1\cdot 2^{r-1}} + u_{1\cdot 2^{r}} \\
f^{(r-1)}_{3\cdot 2^{r-1}} + u_{3\cdot 2^{r-1}-2^{r-1}} + u_{3\cdot 2^{r-1}+2^{r-1}} \\
\vdots \\
f^{(r-1)}_{(2^{k-r+1}-1)\cdot 2^{r-1}} + u_{(2^{k-r+1}-1)\cdot 2^{r-1}-2^{r-1}}
\end{pmatrix},

starting from the single remaining block equation

T^{(k-1)} u_{2^{k-1}} = f^{(k-1)}_{2^{k-1}}.   (6.79)
Algorithm 6.15 CORF: Block cyclic reduction for the discrete Poisson system
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f. It is assumed that n = 2^k − 1.
Output: Solution u = (u_1; . . . ; u_n)
//Stage I: Reduction
1: do r = 1 : k − 1
2:    doall j = 1 : 2^{k−r} − 1
3:       f^{(r)}_{j2^r} = f^{(r−1)}_{j2^r−2^{r−1}} + f^{(r−1)}_{j2^r+2^{r−1}} + T^{(r−1)} f^{(r−1)}_{j2^r}
         //multiply T^{(r−1)} f^{(r−1)}_{j2^r} exploiting the product form (6.76)
4:    end
5: end
//Stage II: Solution by back substitution
6: Solve T^{(k−1)} u_{2^{k−1}} = f^{(k−1)}_{2^{k−1}}
7: do r = k − 1 : −1 : 1
8:    doall j = 1 : 2^{k−r}
9:       solve T^{(r−1)} u_{(2j−1)·2^{r−1}} = f^{(r−1)}_{(2j−1)·2^{r−1}} + u_{(2j−1)·2^{r−1}−2^{r−1}} + u_{(2j−1)·2^{r−1}+2^{r−1}}
10:   end
11: end
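A direct serial transcription of CORF for small problems is sketched below, purely for illustration: the matrices T^{(r)} are formed and factored densely, which, as discussed next, is neither numerically safe nor parallel-friendly for larger r.

import numpy as np

def corf(T, F):
    # CORF (Algorithm 6.15) for [-I, T, -I]_n u = f with n = 2^k - 1; F holds f_1, ..., f_n as columns
    m, n = F.shape
    k = int(round(np.log2(n + 1)))
    assert n == 2 ** k - 1
    f = np.zeros((m, n + 2)); f[:, 1:n + 1] = F          # pad f_0 = f_{2^k} = 0
    u = np.zeros((m, n + 2))                             # pad u_0 = u_{2^k} = 0
    Tr = [T]
    for _ in range(k - 1):
        Tr.append(Tr[-1] @ Tr[-1] - 2.0 * np.eye(m))     # T^{(r)} = (T^{(r-1)})^2 - 2I
    # Stage I: reduction
    for r in range(1, k):
        s = 2 ** (r - 1)
        for j in range(1, 2 ** (k - r)):                 # independent across j
            i = j * 2 ** r
            f[:, i] = f[:, i - s] + f[:, i + s] + Tr[r - 1] @ f[:, i]
    # Stage II: back substitution
    u[:, 2 ** (k - 1)] = np.linalg.solve(Tr[k - 1], f[:, 2 ** (k - 1)])
    for r in range(k - 1, 0, -1):
        s = 2 ** (r - 1)
        for j in range(1, 2 ** (k - r) + 1):             # independent across j
            h = (2 * j - 1) * s
            u[:, h] = np.linalg.solve(Tr[r - 1], f[:, h] + u[:, h - s] + u[:, h + s])
    return u[:, 1:n + 1]

# quick check on a small grid
m, n = 5, 2 ** 4 - 1
T = 4.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
F = np.random.rand(m, n)
U = corf(T, F)
print(np.allclose(A @ U.flatten(order='F'), F.flatten(order='F')))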
Therefore all eigenvalues of T lie in the interval (2, 6) so the largest eigenvalue of
T^{(r)} will be of the order of 2T_{2^r}(3 − δ), for some small δ. This is very large even
for moderate values of r since T_{2^r} is known to grow very large for arguments of
magnitude greater than 1. Therefore, computing (6.73) will involve the combination
of elements of greatly varying magnitude causing loss of information. Fortunately,
both the computational bottleneck and the stability problem in the reduction phase
can be overcome. In fact, as we show in the sequel, resolving the stability problem
also enables enhanced parallelism.
In one version of this scheme, the recurrence (6.73) for f_{j2^r} is replaced by the
recurrences

p^{(r)}_j = p^{(r−1)}_j − (T^{(r−1)})^{-1} \left( p^{(r−1)}_{j−2^{r−1}} + p^{(r−1)}_{j+2^{r−1}} − q^{(r−1)}_j \right),
q^{(r)}_j = q^{(r−1)}_{j−2^{r−1}} + q^{(r−1)}_{j+2^{r−1}} − 2\, p^{(r)}_j.
The steps of stabilized BCR are listed as Algorithm 6.16 (BCR). Note that because
the sequential costs of multiplication and solution with tridiagonal matrices are both
linear, the number of arithmetic operations in Buneman stabilized BCR is the same
as that of CORF.
Algorithm 6.16 BCR: Block cyclic reduction with Buneman stabilization for the
discrete Poisson system
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f. It is assumed that n = 2^k − 1.
Output: Solution u = (u_1; . . . ; u_n)
//Initialization
1: p^{(0)}_j = 0_{m,1} and q^{(0)}_j = f_j (j = 1 : n)
//Stage I: Reduction. Vectors with subscript 0 or 2^k are taken to be 0
2: do r = 1 : k − 1
3:    doall j = 1 : 2^{k−r} − 1
4:       p^{(r)}_{j2^r} = p^{(r−1)}_{j2^r} − (T^{(r−1)})^{-1} \left( p^{(r−1)}_{j2^r−2^{r−1}} + p^{(r−1)}_{j2^r+2^{r−1}} − q^{(r−1)}_{j2^r} \right)
5:       q^{(r)}_{j2^r} = q^{(r−1)}_{j2^r−2^{r−1}} + q^{(r−1)}_{j2^r+2^{r−1}} − 2\, p^{(r)}_{j2^r}
6:    end
7: end
//Stage II: Solution by back substitution. It is assumed that u_0 = u_{2^k} = 0
8: Solve T^{(k−1)} u_{2^{k−1}} = q^{(k−1)}_{2^{k−1}}
9: do r = k − 1 : −1 : 1
10:   doall j = 1 : 2^{k−r}
11:      solve T^{(r−1)} û_{(2j−1)2^{r−1}} = q^{(r−1)}_{(2j−1)2^{r−1}} − \left( u_{(2j−1)2^{r−1}−2^{r−1}} + u_{(2j−1)2^{r−1}+2^{r−1}} \right)
12:      u_{(2j−1)2^{r−1}} = û_{(2j−1)2^{r−1}} + p^{(r−1)}_{(2j−1)2^{r−1}}
13:   end
14: end
In terms of operations with T^{(r)}, the reduction phase (line 4) now consists only
of applications of (T^{(r)})^{-1}, so parallelism is enabled by utilizing the partial fraction
representation (6.77) as in the back substitution phase of CORF.
Therefore, solutions with coefficient matrix T^{(r)} and 2^{k−r} − 1 right-hand sides
for r = 1, . . . , k − 1 can be accomplished by solving 2^r independent tridiagonal
systems for each right-hand side and then combining the partial solutions by multiplying
2^{k−r} − 1 matrices, each of size m × 2^r, with the vector of 2^r partial fraction
coefficients. Therefore, Algorithm 6.16 can be efficiently implemented on parallel
architectures using partial fractions to solve one or more independent linear systems
with coefficient matrix T^{(r)} in lines 4, 8 and 11. If we assume that the cost of solving
a tridiagonal system of order m using m processors is τ(m), then if there are P = mn
processors, the parallel cost of BCR is approximately equal for the 2 stages: it is easy
to see that there is a total of 2kτ(m) + k² + O(k) operations. If we use paracr to
solve the tridiagonal systems, the cost becomes 16 log n log m + log² n + O(log n).
This is somewhat more than the cost of parallel Fourier-MD and CFT, but BCR is
applicable to a wider range of problems, as we noted earlier.
Historically, the invention of stabilized BCR by Buneman preceded the parallelization
of BCR based on partial fractions [97, 98]. In light of the preceding discussion,
we can view the Buneman scheme as a method that resolves the parallelization bottleneck
in the reduction stage of CORF while also handling the instability, and as an example
where the introduction of multiple levels of parallelism stabilizes a numerical
process. This is interesting, especially in view of discussions regarding the interplay
between numerical stability and parallelism; cf. [99].
It was assumed so far for convenience that n = 2k − 1 and that BCR was applied
for the discrete Poisson problem that originated from a PDE with Dirichlet boundary
conditions. For general values of n or other boundary conditions, the use of reduction
can be shown to generate more general matrix rational functions numerator that has
nonzero degree that is smaller than than of the demoninator. The systems with these
matrices are then solved using the more general method listed as Algorithm 12.3
of Chap. 12. This not only enables parallelization but also eliminates the need for
multiplications with the numerator polynomial; cf. the discussion in the Notes and
References Sect. 6.4.9. In terms of the kernels of Sect. 6.4.2, one calls kernels from category
(1.iii) (when r = k − 1) or (1.v) for other values of r.
It is also worth noting that the use of partial fractions in these cases is numerically
safe; cf. [100] as well as the discussion in Sect. 12.1.3 of Chap. 12 for more details
on the numerical issues that arise when using partial fractions.
There is another way to resolve the computational bottleneck of CORF and BCR
and to stabilize the process. This is to monitor the reduction stage and terminate it
before the available parallelism is greatly reduced and accuracy is compromised, and then
switch to another method for the smaller system. Therefore the algorithm is made to
adapt to the available computational resources yielding acceptable solutions.
The method presented next is in this spirit but combines early stopping with the
MD-Fourier technique. The Fourier analysis-cyclic reduction method (FACR) is a
hybrid method consisting of l block cyclic reduction steps as in CORF, followed
by MD-Fourier for the smaller block-tridiagonal system and back substitution to
compute the final solution. To account for limiting the number of reduction steps,
l, the method is denoted by FACR(l). It is based on the fact that at any step of the
reduction stage of BCR, the coefficient matrix is A^{(r)} = [−I, T^{(r)}, −I]_{2^{k−r}−1} and,
since T^{(r)} = 2T_{2^r}(T/2) has the same eigenvectors as T with eigenvalues 2T_{2^r}(λ_i^{(m)}/2),
the reduced system can be solved using Fourier-MD. If reduction is applied without
stabilization, l must be small. Several analyses of the computational complexity of
the algorithm (see e.g. [101]) indicate that for l ≈ log log m, the sequential
complexity is O(mn log log n). Therefore, properly designed FACR is faster than
MD-Fourier and BCR. In practice, the best choice for l depends on the relative
performance of the underlying kernels and other characteristics of the target computer
platform. The parallel implementation of all steps can proceed using the techniques
deployed for BCR and MD; FACR(l) can be viewed as an alternative to partial
fractions to avoid the parallel implementation bottleneck that was observed after a
few steps of reduction in BCR. For example, the number of systems solved in parallel
in the BCR algorithm can be monitored in order to trigger a switch to MD-Fourier
before they become so few as to not make full use of the available parallel resources.
Proof Let P_b and P_c be the permutations that order b and c so that their nonzero elements
are listed first in P_b b and P_c c, respectively. Then

c^⊤ A b = (P_c c)^⊤ (P_c A P_b^⊤)(P_b b) = \begin{pmatrix} c_{nz} \\ 0 \end{pmatrix}^{\!\top} \begin{pmatrix} \hat A & \ast \\ \ast & \ast \end{pmatrix} \begin{pmatrix} b_{nz} \\ 0 \end{pmatrix} = c_{nz}^{\top} \hat A\, b_{nz},

where  is the leading block of P_c A P_b^⊤ of dimension nnz(c) × nnz(b), with μ̂ = nnz(c), ν̂ = nnz(b), and b_{nz} and c_{nz}
are the subvectors of nonzero elements of b and c. This can be computed in O(nnz(b) nnz(c))
operations, proving the lemma. This cost can be reduced even further if A is also
sparse.
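In code, the computation amounts to restricting A to the rows indexed by the support of c and the columns indexed by the support of b, as in the following small NumPy sketch (names and sizes are illustrative only).

import numpy as np

def sparse_bilinear(A, c_idx, c_val, b_idx, b_val):
    # compute c^T A b when b and c are given by their nonzero indices and values;
    # the cost is O(nnz(b) nnz(c)) operations, independent of the order of A
    A_hat = A[np.ix_(c_idx, b_idx)]
    return c_val @ (A_hat @ b_val)

n = 12
A = np.random.rand(n, n)
b = np.zeros(n); b[[2, 7]] = [1.0, -3.0]
c = np.zeros(n); c[[0, 5, 9]] = [2.0, 1.0, 4.0]
print(np.isclose(sparse_bilinear(A, [0, 5, 9], c[[0, 5, 9]], [2, 7], b[[2, 7]]), c @ A @ b))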
From the explicit formulas in Proposition 6.4 we obtain the following result.
Proposition 6.7 Let T = [−1, α, −1]_m be a tridiagonal Toeplitz matrix and consider
the computation of ξ = c^⊤ T^{-1} b for sparse vectors b, c. Then ξ can be computed
in O(nnz(b) nnz(c)) arithmetic operations.
This holds if the participating elements of T −1 are already available or if they can
be computed in O(nnz (b)nnz (c)) operations as well.
When we seek only k elements of the solution vector, then the cost becomes
O(k nnz (b)). From Proposition 6.5 it follows that these ideas can also be used to
reduce costs when applying the inverse of A. To compute only u n , we write
hence
u_n = \hat U_n^{-1}(T) \sum_{j=1}^{n} \hat U_{j-1}(T) f_j,   (6.80)
where
H = \begin{pmatrix}
-I & T & -I & & \\
 & -I & T & -I & \\
 & & \ddots & \ddots & \ddots \\
 & & & -I & T \\
 & & & & -I
\end{pmatrix}.
The term H −1 f¯ can be computed first using a block recurrence, each step of which
consists of a multiplication of T with a vector and some other simple vector oper-
ations, followed by the solution with Uˆn (T ) which can be computed by utilizing
its product form or the partial fraction representation of its inverse. Either way, this
requires solving n linear systems with coefficient matrices that are simple shifts of
T , a case of kernel (1.iv).
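Formula (6.80) can be checked directly on a small problem, again generating Û_j(T) with the three-term recurrence assumed earlier and comparing u_n with the last block of a direct solve (sizes are illustrative).

import numpy as np

def Uhat_seq(nmax, X):
    # matrix polynomials Uhat_0(X), ..., Uhat_nmax(X) via the assumed three-term recurrence
    seq = [np.eye(X.shape[0]), X.copy()]
    for _ in range(nmax - 1):
        seq.append(X @ seq[-1] - seq[-2])
    return seq

m, n = 4, 6
T = 4.0 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
F = np.random.rand(m, n)                                 # columns f_1, ..., f_n
u = np.linalg.solve(A, F.flatten(order='F'))
U = Uhat_seq(n, T)
un = np.linalg.solve(U[n], sum(U[j - 1] @ F[:, j - 1] for j in range(1, n + 1)))  # formula (6.80)
print(np.allclose(un, u[-m:]))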
If the right-hand side is also very sparse, so that f j = 0 except for l values of the
index j, then the sum for u n consists of only l block terms. Further cost reductions
are possible if the nonzero f j ’s are also sparse and if we only seek few elements
from u n . One way to compute u n based on these observations is to diagonalize T
as in the Fourier-MD methods, applying the DFT on each of the non-zero f j terms.
Further savings are possible if these f j ’s are also sparse, since DFTs could then be
computed directly as BLAS, without recourse to FFT kernels.
After the transform, the coefficients in the sum (6.80) are diagonal matrices with
entries \hat U_n^{-1}(λ_i^{(m)})\, \hat U_{j-1}(λ_i^{(m)}), i = 1, . . . , m. The terms can be computed directly from
Definition 6.6. Therefore, each term of the sum (6.80) is the result of the element-by-element
multiplication of the aforementioned diagonal with the vector f̂_j, which is the DFT
of f_j. Thus

u_n = Q \sum_{j=1}^{n} \begin{pmatrix} \hat U_n^{-1}(λ_1^{(m)})\, \hat U_{j-1}(λ_1^{(m)})\, \hat f_{j,1} \\ \vdots \\ \hat U_n^{-1}(λ_m^{(m)})\, \hat U_{j-1}(λ_m^{(m)})\, \hat f_{j,m} \end{pmatrix}.
Proposition 6.5, in combination with partial fractions makes possible the design of a
method for computing all or selected subvectors of the solution vector of the discrete
Poisson system.
From Definition 6.6, the roots of the Chebyshev polynomial \hat U_n are

ρ_j = 2 \cos\left( \frac{j\pi}{n+1} \right), \quad j = 1, \ldots, n.
From Proposition 6.5, we can write the subvector u i of the solution of the discrete
Poisson system as
u_i = \sum_{j=1}^{n} (A^{-1})_{i,j} f_j
    = \sum_{j=1}^{i-1} \hat U_n^{-1}(T)\, \hat U_{j-1}(T)\, \hat U_{n-i}(T) f_j + \sum_{j=i}^{n} \hat U_n^{-1}(T)\, \hat U_{i-1}(T)\, \hat U_{n-j}(T) f_j.
Because each block (A^{-1})_{i,j} is a rational matrix function with a denominator of degree
n and a numerator of smaller degree, and the roots of the denominator are
distinct, (A^{-1})_{i,j} can be expressed as a partial fraction sum of n terms.
u_i = \sum_{j=1}^{n} \sum_{k=1}^{n} γ_{j,k}^{(i)} (T − ρ_k I)^{-1} f_j   (6.82)
    = \sum_{k=1}^{n} (T − ρ_k I)^{-1} \sum_{j=1}^{n} γ_{j,k}^{(i)} f_j   (6.83)
    = \sum_{k=1}^{n} (T − ρ_k I)^{-1} \tilde f_k^{(i)},   (6.84)
where F = (f_1, . . . , f_n) ∈ R^{m×n}, and ρ_1, . . . , ρ_n and γ_{j,1}^{(i)}, . . . , γ_{j,n}^{(i)} are the roots
of the denominator and the partial fraction coefficients of the rational polynomial for
(A^{-1})_{i,j}, respectively. Let G^{(i)} = [γ_{j,k}^{(i)}] be the matrix that contains along each row
the partial fraction coefficients (γ_{j,1}^{(i)}, . . . , γ_{j,n}^{(i)}) used in the inner sum of (6.82). In
column format, G^{(i)} = (g_1^{(i)}, . . . , g_n^{(i)}), where in (6.84), \tilde f_k^{(i)} = F g_k^{(i)}. Based on this
formula, it is straightforward to compute the u_i's as shown in Algorithm 6.17; cf. [103].
Algorithm 6.17 EES: Explicit Elliptic Solver for the discrete Poisson system
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f = (f_1; . . . ; f_n).
Output: Solution u = (u_1; . . . ; u_n) //the method can be readily modified to produce only selected subvectors
1: Compute the n roots ρ_k of \hat U_n(x) in (6.82)
2: Compute the coefficients γ_{j,k}^{(i)} in (6.83)
3: doall i = 1 : n
4:    doall k = 1 : n
5:       compute \tilde f_k^{(i)} = \sum_{j=1}^{n} γ_{j,k}^{(i)} f_j
6:       compute \bar u_k^{(i)} = (T − ρ_k I)^{-1} \tilde f_k^{(i)}
7:    end
8:    compute u_i = \sum_{k=1}^{n} \bar u_k^{(i)}
9: end
Algorithm EES offers abundant parallelism and all u_i's can be evaluated
in O(log n + log m) parallel operations. For this, however, O(n³m) processors appear to be necessary,
which is too high. An O(n) reduction in the processor count is possible by first
observing that, for each i, the matrix G^{(i)} of partial fraction coefficients is the sum of
a Toeplitz matrix and a Hankel matrix, and then using fast multiplications with these
special matrices; cf. [103].
the Poisson matrix inverse (cf. (6.72)) but has some interest as it reveals a direct
connection between MD- Fourier and EES.
The elements and structure of G (i) are key to establishing the connection. For
i ≥ j (the case i < j can be treated similarly)
where \hat U_n' denotes the derivative of \hat U_n. From standard trigonometric identities it
follows that for i ≥ j the numerator of (6.85) is equal to

\sin(jθ_k)\, \sin((n + 1 − i)θ_k) = (−1)^{k+1} \sin\frac{jk\pi}{n+1}\, \sin\frac{ik\pi}{n+1}.   (6.86)
The same equality holds when i < j since relation (6.86) is symmetric with respect
to i and j. So from now on we use this numerator irrespective of the relative ordering
of i and j. With some further algebraic manipulations it can be shown that
\hat U_n'(x)\big|_{x=ρ_k} = (−1)^{k+1} \frac{n+1}{2\sin^2 θ_k}.
From (6.85) and (6.86) it follows that the elements of G (i) are
Observe that multiplication of a vector with the symmetric matrix Q is a DST. From
(6.83) it follows that
u_i = \sum_{k=1}^{n} (T − ρ_k I)^{-1} h_k^{(i)}.   (6.88)
This amounts to applying m independent DSTs, one for each row of F. These can
be accomplished in O(log n) steps if O(mn) processors are available. Following
that and the multiplication with D (i) , it remains to solve n independent tridiagonal
systems, each with coefficient matrix (T − ρk I ) and right-hand side consisting of the
kth column of H (i) . With O(nm) processors this can be done in O(log m) operations.
Finally, u i is obtained by adding the n partial results, in O(log n) steps using the
same number of processors. Therefore, the overall parallel cost for computing any
u i is O(log n + log m) operations using O(mn) processors.
It is actually possible to compute all subvectors u 1 , . . . , u n without exceeding the
O(log n + log m) parallel cost while still using O(mn) processors. Set Ĥ = FQ and
denote the columns of this matrix by ĥ k . These terms do not depend on the index
i. Therefore, in the above steps, the multiplication with the diagonal matrix D (i)
can be deferred until after the solution of the independent linear systems in (6.88).
Specifically, we first compute the columns of the m × n matrix
We then multiply each of the columns of H̆ with the appropriate diagonal element
of D (i) and then sum the partial results to obtain u i . However, this is equivalent to
multiplying H̆ with the column vector consisting of the diagonal elements of D (i) .
Let us call this vector d (i) , then from (6.87) it follows that the matrix (d (1) , . . . , d (n) )
is equal to the DST matrix Q. Therefore, the computation of
H̆ (d (1) , . . . , d (n) ) = H̆ Q,
can be performed using independent DSTs on the m rows of H̆ . Finally note that the
coefficient matrices of the systems we need to solve are given by,
T − ρ_k I = T − 2\cos\frac{k\pi}{n+1}\, I = \left[-1,\; 2 + \Bigl(2 − 2\cos\frac{k\pi}{n+1}\Bigr),\; -1\right]_m = \tilde T_m + λ_k^{(n)} I.
It follows that the major steps of EES can be implemented as follows. First apply
m independent DSTs, one for each row of F, and assemble the results in matrix
Ĥ. Then solve n independent tridiagonal linear systems, each with a coefficient matrix of the form T̃_m + λ_k^{(n)} I given above.
6.4.9 Notes
Some important milestones in the development of RES are (a) reference [104],
where Fourier analysis and marching were proposed to solve Poisson’s equation; (b)
reference [87] appears to be the first to provide the explicit formula for the Poisson
matrix inverse, and also [88–90], where this formula, spectral decomposition, and a
first version of MD and CFT were described (called the semi-rational and rational
solutions respectively in [88]); (c) reference [105], on Kronecker (tensor) product
solvers.
The invention of the FFT together with reference [106] (describing FACR(1) and
the solution of tridiagonal systems with cyclic reduction) mark the beginning of the
modern era of RES. These and reference [95], with its detailed analysis of the major
RES were extremely influential in the development of the field; see also [107, 108]
and the survey of [80] on the numerical solution of elliptic problems including the
topic of RES.
The first discussions of parallel RES date from the early 1970s, in particular
references [92, 93] while the first implementation on the Illiac IV was reported in
[109]. There exist descriptions of implementations of RES for most important high
performance computing platforms, including vector processors, vector multiproces-
sors, shared memory symmetric multiprocessors, distributed memory multiproces-
sors, SIMD and MIMD processor arrays, clusters of heterogeneous processors and
in Grid environments. References for specific systems can be found for the Cray-1
[110, 111]; the ICL DAP [112]; Alliant FX/8 [97, 113]; Caltech and Intel hypercubes
[114–118]; Thinking Machines CM-2 [119]; Denelcor HEP [120, 121]; the Univer-
sity of Illinois Cedar machine [122, 123]; Cray X-MP [110]; Cray Y-MP [124]; Cray
T3E [125, 126]; Grid environments [127]; Intel multicore processors and processor
clusters [128]; GPUs [129, 130]. In the same spirit, it is worth noting the extensive
studies conducted at Yale on the FPS-164, an early attached array processor [131]. Proposals for special
purpose hardware can be found in [132]. RES were included in the PELLPACK
parallel problem-solving environment for elliptic PDEs [133].
The inverses of the special Toeplitz tridiagonal and 2-level Toeplitz matrices that
we encountered in this chapter follow directly from the usual expansion formula for
determinants and the adjoint formula for the matrix inverse; see [134].
Exploiting sparsity and solution probing was observed early on for RES in
reference [26] as we already noted in Sect. 6.1. These ideas were developed fur-
ther in [126, 139–141]. A RES for an SGI with 8 processors and a 16-processor
Beowulf cluster based on [139, 140] was presented in [142].
References
22. Gautschi, W.: Optimally scaled and optimally conditioned Vandermonde and Vandermonde-
like matrices. BIT Numer. Math. 51, 103–125 (2011)
23. Gunnels, J., Lee, J., Margulies, S.: Efficient high-precision matrix algebra on parallel archi-
tectures for nonlinear combinatorial optimization. Math. Program. Comput. 2(2), 103–124
(2010)
24. Aho, A., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms.
Addison-Wesley, Reading (1974)
25. Pan, V.: Complexity of computations with matrices and polynomials. SIAM Rev. 34(2), 255–
262 (1992)
26. Banegas, A.: Fast Poisson solvers for problems with sparsity. Math. Comput. 32(142), 441–
446 (1978). http://www.jstor.org/stable/2006156
27. Cappello, P., Gallopoulos, E., Koç, Ç.: Systolic computation of interpolating polynomials.
Computing 45, 95–118 (1990)
28. Koç, Ç., Cappello, P., Gallopoulos, E.: Decomposing polynomial interpolation for systolic
arrays. Int. J. Comput. Math. 38, 219–239 (1991)
29. Koç, Ç.: Parallel algorithms for interpolation and approximation. Ph.D. thesis, Department
of Electrical and Computer Engineering, University of California, Santa Barbara, June 1988
30. Eğecioğlu, Ö., Gallopoulos, E., Koç, Ç.: A parallel method for fast and practical high-order
Newton interpolation. BIT 30, 268–288 (1990)
31. Breshaers, C.: The Art of Concurrency—A Thread Monkey’s Guide to Writing Parallel Appli-
cations. O’Reilly, Cambridge (2009)
32. Lakshmivarahan, S., Dhall, S.: Parallelism in the Prefix Problem. Oxford University Press,
New York (1994)
33. Harris, M., Sengupta, S., Owens, J.: Parallel prefix sum (scan) with CUDA. GPU Gems 3(39),
851–876 (2007)
34. Falkoff, A., Iverson, K.: The evolution of APL. SIGPLAN Not. 13(8), 47–57 (1978). doi:10.
1145/960118.808372. http://doi.acm.org/10.1145/960118.808372
35. Blelloch, G.E.: Scans as primitive operations. IEEE Trans. Comput. 38(11), 1526–1538 (1989)
36. Chatterjee, S., Blelloch, G., Zagha, M.: Scan primitives for vector computers. In: Proceed-
ings of the 1990 ACM/IEEE Conference on Supercomputing, pp. 666–675. IEEE Computer
Society Press, Los Alamitos (1990). http://dl.acm.org/citation.cfm?id=110382.110597
37. Hillis, W., Steele Jr, G.: Data parallel algorithms. Commun. ACM 29(12), 1170–1183 (1986).
doi:10.1145/7902.7903. http://doi.acm.org/10.1145/7902.7903
38. Dotsenko, Y., Govindaraju, N., Sloan, P.P., Boyd, C., Manferdelli, J.: Fast scan algorithms on
graphics processors. In: Proceedings of the 22nd International Conference on Supercomputing
ICS’08, pp. 205–213. ACM, New York (2008). doi:10.1145/1375527.1375559. http://doi.
acm.org/10.1145/1375527.1375559
39. Sengupta, S., Harris, M., Zhang, Y., Owens, J.: Scan primitives for GPU computing. Graphics
Hardware 2007, pp. 97–106. ACM, New York (2007)
40. Sengupta, S., Harris, M., Garland, M., Owens, J.: Efficient parallel scan algorithms for many-
core GPUs. In: Kurzak, J., Bader, D., Dongarra, J. (eds.) Scientific Computing with Multicore
and Accelerators, pp. 413–442. CRC Press, Boca Raton (2010). doi:10.1201/b10376-29
41. Intel Corporation: Intel(R) Threading Building Blocks Reference Manual, revision 1.6 edn.
(2007). Document number 315415-001US
42. Bareiss, E.: Numerical solutions of linear equations with Toeplitz and vector Toeplitz matrices.
Numer. Math. 13, 404–424 (1969)
43. Gallivan, K.A., Thirumalai, S., Van Dooren, P., Varmaut, V.: High performance algorithms
for Toeplitz and block Toeplitz matrices. Linear Algebra Appl. 241–243, 343–388 (1996)
44. Justice, J.: The Szegö recurrence relation and inverses of positive definite Toeplitz matrices.
SIAM J. Math. Anal. 5, 503–508 (1974)
45. Trench, W.: An algorithm for the inversion of finite Toeplitz matrices. J. Soc. Ind. Appl. Math.
12, 515–522 (1964)
46. Trench, W.: An algorithm for the inversion of finite Hankel matrices. J. Soc. Ind. Appl. Math.
13, 1102–1107 (1965)
47. Phillips, J.: The triangular decomposition of Hankel matrices. Math. Comput. 25, 599–602
(1971)
48. Rissanen, J.: Solving of linear equations with Hankel and Toeplitz matrices. Numer. Math.
22, 361–366 (1974)
49. Xi, Y., Xia, J., Cauley, S., Balakrishnan, V.: Superfast and stable structured solvers for Toeplitz
least squares via randomized sampling. SIAM J. Matrix Anal. Appl. 35(1), 44–72 (2014)
50. Zohar, S.: Toeplitz matrix inversion: The algorithm of W. Trench. J. Assoc. Comput. Mach.
16, 592–701 (1969)
51. Watson, G.: An algorithm for the inversion of block matrices of Toeplitz form. J. Assoc.
Comput. Mach. 20, 409–415 (1973)
52. Rissanen, J.: Algorithms for triangular decomposition of block Hankel and Toeplitz matrices
with applications to factoring positive matrix polynomials. Math. Comput. 27, 147–154 (1973)
53. Kailath, T., Vieira, A., Morf, M.: Inverses of Toeplitz operators, innovations, and orthogonal
polynomials. SIAM Rev. 20, 106–119 (1978)
54. Gustavson, F., Yun, D.: Fast computation of Padé approximants and Toeplitz systems of
equations via the extended Euclidean algorithm. Technical report 7551, IBM T.J. Watson
Research Center, New York (1979)
55. Brent, R., Gustavson, F., Yun, D.: Fast solution of Toeplitz systems of equations and compu-
tation of Padé approximants. J. Algorithms 1, 259–295 (1980)
56. Morf, M.: Doubling algorithms for Toeplitz and related equations. In: Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing, pp. 954–959 (1980)
57. Chandrasekaran, S., Gu, M., Sun, X., Xia, J., Zhu, J.: A superfast algorithm for Toeplitz
systems of linear equations. SIAM J. Matrix Anal. Appl. 29(4), 1247–1266 (2007). doi:10.
1137/040617200. http://dx.doi.org/10.1137/040617200
58. Xia, J., Xi, Y., Gu, M.: A superfast structured solver for Toeplitz linear systems via randomized
sampling. SIAM J. Matrix Anal. Appl. 33(3), 837–858 (2012)
59. Grcar, J., Sameh, A.: On certain parallel Toeplitz linear system solvers. SIAM J. Sci. Stat.
Comput. 2(2), 238–256 (1981)
60. Aitken, A.: Determinants and Matrices. Oliver Boyd, London (1939)
61. Cantoni, A., Butler, P.: Eigenvalues and eigenvectors of symmetric centrosymmetric matrices.
Numer. Linear Algebra Appl. 13, 275–288 (1976)
62. Gohberg, I., Semencul, A.: On the inversion of finite Toeplitz matrices and their continuous
analogues. Mat. Issled 2, 201–233 (1972)
63. Gohberg, I., Feldman, I.: Convolution equations and projection methods for their solution.
Translations of Mathematical Monographs, vol. 41. AMS, Providence (1974)
64. Gohberg, I., Levin, S.: Asymptotic properties of Toeplitz matrix factorization. Mat. Issled 1,
519–538 (1978)
65. Fischer, D., Golub, G., Hald, O., Leiva, C., Widlund, O.: On Fourier-Toeplitz methods for
separable elliptic problems. Math. Comput. 28(126), 349–368 (1974)
66. Riesz, F., Sz-Nagy, B.: Functional Analysis. Frederick Ungar, New York (1956). (Translated
from second French edition by L. Boron)
67. Szegö, G.: Orthogonal Polynomials. Technical Report, AMS, Rhode Island (1959). (Revised
edition AMS Colloquium Publication)
68. Grenander, U., Szegö, G.: Toeplitz Forms and their Applications. University of California
Press, California (1958)
69. Pease, M.: The adaptation of the fast Fourier transform for parallel processing. J. Assoc.
Comput. Mach. 15(2), 252–264 (1968)
70. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
71. Morf, M., Kailath, T.: Recent results in least-squares estimation theory. Ann. Econ. Soc. Meas.
6, 261–274 (1977)
72. Franchetti, F., Püschel, M.: Fast Fourier transform. In: Padua, D. (ed.) Encyclopedia of Parallel
Computing. Springer, New York (2011)
73. Chen, H.C.: The SAS domain decomposition method. Ph.D. thesis, University of Illinois at
Urbana-Champaign (1988)
74. Chen, H.C., Sameh, A.: Numerical linear algebra algorithms on the Cedar system. In: Noor,
A. (ed.) Parallel Computations and Their Impact on Mechanics. Applied Mechanics Division,
vol. 86, pp. 101–125. American Society of Mechanical Engineers, New York (1987)
75. Chen, H.C., Sameh, A.: A matrix decomposition method for orthotropic elasticity problems.
SIAM J. Matrix Anal. Appl. 10(1), 39–64 (1989)
76. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford University Press, Oxford (1965)
77. Botta, E.: How fast the Laplace equation was solved in 1995. Appl. Numer. Math.
24(4), 439–455 (1997). doi:10.1016/S0168-9274(97)00041-X
78. Knightley, J.R., Thompson, C.P.: On the performance of some rapid elliptic solvers on a vector
processor. SIAM J. Sci. Stat. Comput. 8(5), 701–715 (1987)
79. Csanky, L.: Fast parallel matrix inversion algorithms. SIAM J. Comput. 5(4), 618–623 (1976)
80. Birkhoff, G., Lynch, R.: Numerical Solution of Elliptic Problems. SIAM, Philadelphia (1984)
81. Iserles, A.: Introduction to Numerical Methods for Differential Equations. Cambridge Uni-
versity Press, Cambridge (1996)
82. Olshevsky, V., Oseledets, I., Tyrtyshnikov, E.: Superfast inversion of two-level Toeplitz matri-
ces using Newton iteration and tensor-displacement structure. Recent Advances in Matrix and
Operator Theory. Birkhäuser Verlag, Basel (2007)
83. Bank, R.E., Rose, D.: Marching algorithms for elliptic boundary value problems. I: the con-
stant coefficient case. SIAM J. Numer. Anal. 14(5), 792–829 (1977)
84. Lanczos, C.: Tables of the Chebyshev Polynomials Sn (x) and Cn (x). Applied Mathematics
Series, vol. 9. National Bureau of Standards, New York (1952)
85. Rivlin, T.: The Chebyshev Polynomials. Wiley-Interscience, New York (1974)
86. Abramowitz, M., Stegun, I.: Handbook of Mathematical Functions. Dover, New York (1965)
87. Karlqvist, O.: Numerical solution of elliptic difference equations by matrix methods. Tellus
4(4), 374–384 (1952). doi:10.1111/j.2153-3490.1952.tb01025.x. http://dx.doi.org/10.1111/
j.2153-3490.1952.tb01025.x
88. Bickley, W.G., McNamee, J.: Matrix and other direct methods for the solution of systems of
linear difference equations. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 252(1005), 69–131
(1960). doi:10.1098/rsta.1960.0001. http://rsta.royalsocietypublishing.org/cgi/doi/10.1098/
rsta.1960.0001
89. Egerváry, E.: On rank-diminishing operations and their application to the solution of linear
equations. Zeitschrift fuer angew. Math. und Phys. 11, 376–386 (1960)
90. Egerváry, E.: On hypermatrices whose blocks are computable in pair and their application in
lattice dynamics. Acta Sci. Math. Szeged 15, 211–222 (1953/1954)
91. Bialecki, B., Fairweather, G., Karageorghis, A.: Matrix decomposition algorithms for elliptic
boundary value problems: a survey. Numer. Algorithms (2010). doi:10.1007/s11075-010-
9384-y. http://www.springerlink.com/index/10.1007/s11075-010-9384-y
92. Buzbee, B.: A fast Poisson solver amenable to parallel computation. IEEE Trans. Comput.
C-22(8), 793–796 (1973)
93. Sameh, A., Chen, S.C., Kuck, D.: Parallel Poisson and biharmonic solvers. Computing 17,
219–230 (1976)
94. Swarztrauber, P.N., Sweet, R.A.: Vector and parallel methods for the direct solution of Pois-
son’s equation. J. Comput. Appl. Math. 27, 241–263 (1989)
95. Buzbee, B., Golub, G., Nielson, C.: On direct methods for solving Poisson’s equation. SIAM
J. Numer. Anal. 7(4), 627–656 (1970)
96. Sweet, R.A.: A cyclic reduction algorithm for solving block tridiagonal systems of arbitrary
dimension. SIAM J. Numer. Anal. 14(4), 707–720 (1977)
97. Gallopoulos, E., Saad, Y.: Parallel block cyclic reduction algorithm for the fast solution of
elliptic equations. Parallel Comput. 10(2), 143–160 (1989)
98. Sweet, R.A.: A parallel and vector cyclic reduction algorithm. SIAM J. Sci. Stat. Comput.
9(4), 761–765 (1988)
99. Demmel, J.: Trading off parallelism and numerical stability. In: Moonen, M.S., Golub, G.H.,
Moor, B.L.D. (eds.) Linear Algebra for Large Scale and Real-Time Applications. NATO ASI
Series E, vol. 232, pp. 49–68. Kluwer Academic Publishers, Dordrecht (1993)
100. Calvetti, D., Gallopoulos, E., Reichel, L.: Incomplete partial fractions for parallel evaluation
of rational matrix functions. J. Comput. Appl. Math. 59, 349–380 (1995)
101. Temperton, C.: On the FACR(l) algorithm for the discrete Poisson equation. J. Comput. Phys.
34, 314–329 (1980)
102. Sameh, A., Kuck, D.: On stable parallel linear system solvers. J. Assoc. Comput. Mach. 25(1),
81–91 (1978)
103. Gallopoulos, E., Saad, Y.: Some fast elliptic solvers for parallel architectures and their com-
plexities. Int. J. High Speed Comput. 1(1), 113–141 (1989)
104. Hyman, M.: Non-iterative numerical solution of boundary-value problems. Appl. Sci. Res. B
2, 325–351 (1951–1952)
105. Lynch, R., Rice, J., Thomas, D.: Tensor product analysis of partial differential equations. Bull.
Am. Math. Soc. 70, 378–384 (1964)
106. Hockney, R.: A fast direct solution of Poisson’s equation using Fourier analysis. J. Assoc.
Comput. Mach. 12, 95–113 (1965)
107. Haigh, T.: Bill Buzbee, Oral History Interview (2005). http://history.siam.org/buzbee.htm
108. Cooley, J.: The re-discovery of the fast Fourier transform algorithm. Mikrochim. Acta III,
33–45 (1987)
109. Ericksen, J.: Iterative and direct methods for solving Poisson’s equation and their adaptabil-
ity to Illiac IV. Technical report UIUCDCS-R-72-574, Department of Computer Science,
University of Illinois at Urbana-Champaign (1972)
110. Sweet, R.: Vectorization and parallelization of FISHPAK. In: Dongarra, J., Kennedy, K.,
Messina, P., Sorensen, D., Voigt, R. (eds.) Proceedings of the Fifth SIAM Conference on
Parallel Processing for Scientific Computing, pp. 637–642. SIAM, Philadelphia (1992)
111. Temperton, C.: Fast Fourier transforms and Poisson solvers on Cray-1. In: Hockney, R.,
Jesshope, C. (eds.) Infotech State of the Art Report: Supercomputers, vol. 2, pp. 359–379.
Infotech Int. Ltd., Maidenhead (1979)
112. Hockney, R.W.: Characterizing computers and optimizing the FACR(l) Poisson solver on
parallel unicomputers. IEEE Trans. Comput. C-32(10), 933–941 (1983)
113. Jwo, J.S., Lakshmivarahan, S., Dhall, S.K., Lewis, J.M.: Comparison of performance of three
parallel versions of the block cyclic reduction algorithm for solving linear elliptic partial
differential equations. Comput. Math. Appl. 24(5–6), 83–101 (1992)
114. Chan, T., Resasco, D.: Hypercube implementation of domain-decomposed fast Poisson
solvers. In: Heath, M. (ed.) Proceedings of the 2nd Conference on Hypercube Multiprocessors,
pp. 738–746. SIAM (1987)
115. Resasco, D.: Domain decomposition algorithms for elliptic partial differential equations.
Ph.D. thesis, Yale University (1990). http://www.cs.yale.edu/publications/techreports/tr776.
pdf. YALEU/DCS/RR-776
116. Cote, S.: Solving partial differential equations on a MIMD hypercube: fast Poisson solvers
and the alternating direction method. Technical report UIUCDCS-R-91-1694, University of
Illinois at Urbana-Champaign (1991)
117. McBryan, O., Van De Velde, E.: Hypercube algorithms and implementations. SIAM J. Sci.
Stat. Comput. 8(2), s227–s287 (1987)
118. Sweet, R., Briggs, W., Oliveira, S., Porsche, J., Turnbull, T.: FFTs and three-dimensional
Poisson solvers for hypercubes. Parallel Comput. 17, 121–131 (1991)
119. McBryan, O.: Connection machine application performance. Technical report CH-CS-434-
89, Department of Computer Science, University of Colorado, Boulder (1989)
120. Briggs, W.L., Turnbull, T.: Fast Poisson solvers for MIMD computers. Parallel Comput. 6,
265–274 (1988)
121. McBryan, O., Van de Velde, E.: Elliptic equation algorithms on parallel computers. Commun.
Appl. Numer. Math. 2, 311–318 (1986)
122. Gallivan, K.A., Heath, M.T., Ng, E., Ortega, J.M., Peyton, B.W., Plemmons, R.J., Romine,
C.H., Sameh, A., Voigt, R.G.: Parallel Algorithms for Matrix Computations. SIAM, Philadel-
phia (1990)
123. Gallopoulos, E., Sameh, A.: Solving elliptic equations on the Cedar multiprocessor. In: Wright,
M.H. (ed.) Aspects of Computation on Asynchronous Parallel Processors, pp. 1–12. Elsevier
Science Publishers B.V. (North-Holland), Amsterdam (1989)
124. Chan, T.F., Fatoohi, R.: Multitasking domain decomposition fast Poisson solvers on the Cray
Y-MP. In: Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific
Computing. SIAM (1989) (to appear)
125. Giraud, L.: Parallel distributed FFT-based solvers for 3-D Poisson problems in meso-scale
atmospheric simulations. Int. J. High Perform. Comput. Appl. 15(1), 36–46 (2001). doi:10.
1177/109434200101500104. http://hpc.sagepub.com/cgi/content/abstract/15/1/36
126. Rossi, T., Toivanen, J.: A parallel fast direct solver for block tridiagonal systems with separable
matrices of arbitrary dimension. SIAM J. Sci. Stat. Comput. 20(5), 1778–1796 (1999)
127. Tromeur-Dervout, D., Toivanen, J., Garbey, M., Hess, M., Resch, M., Barberou, N., Rossi,
T.: Efficient metacomputing of elliptic linear and non-linear problems. J. Parallel Distrib.
Comput. 63(5), 564–577 (2003). doi:10.1016/S0743-7315(03)00003-0
128. Intel Cluster Poisson Solver Library—Intel Software Network. http://software.intel.com/en-
us/articles/intel-cluster-poisson-solver-library/
129. Rossinelli, D., Bergdorf, M., Cottet, G.H., Koumoutsakos, P.: GPU accelerated simulations of
bluff body flows using vortex particle methods. J. Comput. Phys. 229(9), 3316–3333 (2010)
130. Wu, J., JaJa, J., Balaras, E.: An optimized FFT-based direct Poisson solver on CUDA GPUs.
IEEE Trans. Parallel Distrib. Comput. 25(3), 550–559 (2014). doi:10.1109/TPDS.2013.53
131. O’Donnell, S.T., Geiger, P., Schultz, M.H.: Solving the Poisson equation on the FPS-164.
Technical report, Yale University, Department of Computer Science (1983)
132. Vajteršic, M.: Algorithms for Elliptic Problems: Efficient Sequential and Parallel Solvers.
Kluwer Academic Publishers, Dordrecht (1993)
133. Houstis, E.N., Rice, J.R., Weerawarana, S., Catlin, A.C., Papachiou, P., Wang, K.Y., Gai-
tatzes, M.: PELLPACK: a problem-solving environment for PDE-based applications on mul-
ticomputer platforms. ACM Trans. Math. Softw. (TOMS) 24(1) (1998). http://portal.acm.org/
citation.cfm?id=285864
134. Meurant, G.: A review on the inverse of symmetric tridiagonal and block tridiagonal matrices.
SIAM J. Matrix Anal. Appl. 13(3), 707–728 (1992)
135. Hoffmann, G.R., Swarztrauber, P., Sweet, R.: Aspects of using multiprocessors for meteoro-
logical modelling. In: Hoffmann, G.R., Snelling, D. (eds.) Multiprocessing in Meteorological
Models, pp. 125–196. Springer, New York (1988)
136. Johnsson, S.: The FFT and fast Poisson solvers on parallel architectures. Technical Report
583, Yale University, Department of Computer Science (1987)
137. Hockney, R., Jesshope, C.: Parallel Computers. Adam Hilger, Bristol (1983)
138. Bini, D., Meini, B.: The cyclic reduction algorithm: from Poisson equation to stochastic
processes and beyond. Numerical Algorithms 51(1), 23–60 (2008). doi:10.1007/s11075-008-
9253-0. http://www.springerlink.com/content/m40t072h273w8841/fulltext.pdf
139. Kuznetsov, Y.A., Matsokin, A.M.: On partial solution of systems of linear algebraic equations.
Sov. J. Numer. Anal. Math. Model. 4(6), 453–467 (1989)
140. Vassilevski, P.: An optimal stabilization of the marching algorithm. Comptes Rendus Acad.
Bulg. Sci. 41, 29–32 (1988)
141. Rossi, T., Toivanen, J.: A nonstandard cyclic reduction method, its variants and stability.
SIAM J. Matrix Anal. Appl. 20(3), 628–645 (1999)
142. Bencheva, G.: Parallel performance comparison of three direct separable elliptic solvers. In:
Lirkov, I., Margenov, S., Wasniewski, J., Yalamov, P. (eds.) Large-Scale Scientific Computing.
Lecture Notes in Computer Science, vol. 2907, pp. 421–428. Springer, Berlin (2004). http://
dx.doi.org/10.1007/978-3-540-24588-9_48
Chapter 7
Orthogonal Factorization and Linear Least
Squares Problems
7.1 Definitions
A = Q R, (7.1)
where Q is orthogonal and R is either upper triangular when m ≥ n (i.e. R is of the
form R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}), or upper trapezoidal when m < n.
If A is of maximal column rank n, and if we require that the nonzero diagonal
elements of R_1 be positive, then the QR-factorization is unique. Further, R_1 is the
transpose of the Cholesky factor of the matrix A^⊤A. Computing the factorization
(7.1) consists of pre-multiplying A by a finite number of elementary orthogonal
transformations Q_1^⊤, . . . , Q_q^⊤, such that

Q_q^⊤ · · · Q_2^⊤ Q_1^⊤ A = R,   (7.2)
A = U R1 , (7.3)
This is referred to as thin factorization, whereas the one obtained in (7.2) is referred
to as thick factorization.
7.2 QR Factorization via Givens Rotations

For the sake of illustrating this parallel factorization scheme, we consider only square
matrices A ∈ Rn×n even though the procedure can be easily extended to rectangular
matrices A ∈ Rm×n , with m > n. In this procedure, each orthogonal matrix Q j
which appears in (7.2) is built as the product of several plane rotations of the form
R_{ij} = \begin{pmatrix}
I_{i-1} & & & & \\
 & c_i^{(j)} & & s_i^{(j)} & \\
 & & I_{j-i-1} & & \\
 & -s_i^{(j)} & & c_i^{(j)} & \\
 & & & & I_{n-j}
\end{pmatrix},   (7.6)

where the rotation entries occupy the ith and jth rows and columns, and (c_i^{(j)})² + (s_i^{(j)})² = 1.
For example, one of the simplest organizations successively eliminates the entries
below the main diagonal column by column, starting with the first column, i.e. j =
1, 2, . . . , n − 1; Q is then the product of n(n − 1)/2 plane rotations.
Let us denote by R_{i,i+1}^{(j)} the plane rotation which uses rows i and (i + 1) of A
to annihilate the off-diagonal element in position (i + 1, j). Such a rotation is given
by:
Theorem 7.1 ([6]) Let A be a nonsingular matrix of even order n. Then the factorization (7.2) can be obtained in T_p = 10n − 15 parallel arithmetic operations and (2n − 3) square roots using p = n(3n − 2)/2 processors. This results in a speedup of O(n^2), efficiency O(1), and no redundant arithmetic operations.
Proof Let
U_{i,i+1}^{(j)} = \begin{pmatrix} c_i^{(j)} & s_i^{(j)} \\ -s_i^{(j)} & c_i^{(j)} \end{pmatrix}
be that plane rotation that acts on rows i and i + 1 for annihilating the element in position (i + 1, j). Starting with A_1 = A, we construct the sequence A_{k+1} = Q_k A_k, k = 1, . . . , 2n − 3, where Q_k is the direct sum of the independent plane rotations U_{i,i+1}^{(j)}, with the indices i, j as given in Algorithm 7.1. Thus, each orthogonal transformation Q_k consists of independent 2 × 2 rotations. Let such a rotation U rotate the two row vectors u and v ∈ R^q so as to annihilate v_1, the first component of v. Then U can be determined in 3 parallel steps and one square root using two processors. The resulting vectors û and v̂ have the elements

û_1 = (u_1^2 + v_1^2)^{1/2},
û_i = c u_i + s v_i,   2 ≤ i ≤ q,    (7.8)
v̂_i = −s u_i + c v_i,  2 ≤ i ≤ q.
Each pair û i and v̂i in (7.8) is evaluated in 2 parallel arithmetic operations using
4 processors. Note that the cost of computing û 1 has already been accounted for in
determining U . Consequently, the total cost of each transformation Ak+1 = Q k Ak
is 5 parallel arithmetic operations and one square root. Thus we can triangularize A
in 5(2n − 3) parallel arithmetic operations and (2n − 3) square roots. The maximum
number of processors is needed for k = n − 1, and is given by
p = 4 \sum_{j=1}^{n/2} (n − j) = n(3n − 2)/2.
Other orderings for parallel Givens reduction are given in [7–9], with the ordering
presented here judged as being asymptotically optimal [10].
If n is not even, we simply consider the orthogonal factorization of diag(A, 1).
Also, using square root-free Givens’ rotations, e.g., see [11] or [12], we can obtain the
positive diagonal matrix D and the unit upper triangular matrix R in the factorization
Q A = D 1/2 R in O(n) parallel arithmetic operations (and no square roots) employing
O(n 2 ) processors.
Observe that Givens rotations are also useful for determining QR factorization of
matrices with structures such as Hessenberg matrices or more special patterns (see
e.g. [13]).
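To make the rotation-based elimination concrete, the following Python/NumPy sketch (ours, not taken from the references) applies a single Givens rotation to two rows u and v so as to annihilate the leading entry of v, mirroring the update (7.8); the function name and interface are illustrative only. The trailing updates of the two rows are independent and could be carried out in parallel.

```python
import numpy as np

def givens_rows(u, v):
    """Rotate rows u and v so that the first entry of v becomes zero.

    Returns (c, s, u_hat, v_hat) with u_hat[0] = sqrt(u[0]**2 + v[0]**2),
    mirroring (7.8).  The q-1 trailing pairwise updates are independent.
    """
    r = np.hypot(u[0], v[0])          # sqrt(u1^2 + v1^2), computed robustly
    if r == 0.0:
        return 1.0, 0.0, u.copy(), v.copy()
    c, s = u[0] / r, v[0] / r
    u_hat = c * u + s * v             # \hat u_i = c u_i + s v_i
    v_hat = -s * u + c * v            # \hat v_i = -s u_i + c v_i
    v_hat[0] = 0.0                    # the annihilated entry
    return c, s, u_hat, v_hat

# toy usage: eliminate A[1, 0] using rows 0 and 1
A = np.array([[3.0, 1.0, 2.0],
              [4.0, 5.0, 6.0]])
c, s, A[0], A[1] = givens_rows(A[0].copy(), A[1].copy())
print(A)   # the first column becomes (5, 0)^T
```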
with vk chosen such that all but the first entry of the column vector H (vk )Ak (k : m, k)
(in Matlab notation) are zero.
At step k, the rest of Ak+1 = Pk Ak is obtained using the multiplication
H (vk )Ak (k : m, k : n). It can be implemented by two BLAS2 routines [15], namely
the matrix-vector multiplication DGEMV and the rank-one update DGER. The pro-
cedure stores the matrix Ak+1 in place of Ak .
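As a concrete illustration of the BLAS2 organization just described, here is a minimal NumPy sketch (ours, not LAPACK's code) of one Householder step: the reflector H(v_k) is applied to the trailing submatrix through one matrix–vector product (the DGEMV-like kernel) and one rank-one update (the DGER-like kernel). The choice of the Householder vector is the standard stable one.

```python
import numpy as np

def householder_step(A, k):
    """Apply one Householder step in place so that A[k+1:, k] becomes zero.

    The trailing update uses one matrix-vector product and one rank-one
    update, mimicking the DGEMV/DGER organization of the text.
    """
    x = A[k:, k].copy()
    alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] != 0 else -np.linalg.norm(x)
    v = x.copy()
    v[0] -= alpha                       # v = x - alpha e_1
    vnorm2 = v @ v
    if vnorm2 == 0.0:
        return v                        # column already reduced
    beta = 2.0 / vnorm2
    w = beta * (v @ A[k:, k:])          # BLAS2-like: matrix-vector product
    A[k:, k:] -= np.outer(v, w)         # BLAS2-like: rank-one update
    return v                            # v can be stored below the diagonal

# usage: after steps k = 0, ..., n-1 the matrix is upper triangular
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
for k in range(A.shape[1]):
    householder_step(A, k)
print(np.triu(A[:4, :]))                # the R factor (up to signs)
```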
On a uniprocessor, the total procedure involves 2n^2(m − n/3) + O(mn) arithmetic operations for obtaining R. When necessary for subsequent calculations, the sequence of vectors (v_k)_{k=1,…,n} can be stored, for instance, in the lower part of the transformed matrix A_k. When an orthogonal basis Q_1 = [q_1, . . . , q_n] ∈ R^{m×n} of the range of A is needed, it is obtained by pre-multiplying the matrix \begin{pmatrix} I_n \\ 0 \end{pmatrix} successively by P_n, P_{n−1}, . . . , P_1. On a uniprocessor, this procedure involves 2n^2(m − n/3) + O(mn) additional arithmetic operations, instead of the 4(m^2 n − mn^2 + n^3/3) + O(mn) operations required when the whole matrix Q = [q_1, · · · , q_m] ∈ R^{m×m} must be assembled.
Fine Grain Parallelism
The general pipelined implementation has already been presented in Sect. 2.3. Addi-
tional details are given in [16]. Here, however, we explore the finest grain tasks that
can be used in this orthogonal factorization.
Similar to Sect. 7.2, and in order to present an account similar to that in [17], we
assume that A is a square matrix of order n.
The classical Householder reduction produces the sequence Ak+1 = Pk Ak , k =
1, 2, . . . , n − 1, so that An is upper triangular, with Ak+1 given by,
A_{k+1} = \begin{pmatrix} R_{k−1} & b_{k−1} & B_{k−1} \\ 0 & ρ_{kk} e_1 & H_k C_{k−1} \end{pmatrix},
where Rk−1 is upper triangular of order k −1. The elementary reflector Hk = Hk (νk ),
and the scalar ρkk can be obtained in (3+log(n−k +1)) parallel arithmetic operations
\sum_{r=1}^{n−1} (7 + 2 \log(n − r + 1)) = 2n \log n + O(n)
parallel arithmetic operations and (n −1) square roots, using no more than n 2 proces-
sors. Since the sequential algorithm requires T1 = O(n 3 ) arithmetic operations,
we obtain a speedup of S p = O(n 2 / log n) and an efficiency E p proportional to
(1/ log n) using p = O(n 2 ) processors. Such a speedup is not as good as that real-
ized by parallel Givens’ reduction.
Block Reflectors
Block versions of this Householder reduction have been introduced in [18] (the W Y
form), and in [19] (the GG form). The generation and application of the W Y form
are implemented in the routines of LAPACK [20]. They involve BLAS3 primitives
[21]. More specifically, the W Y form consists of considering a narrow window of a few columns that is reduced to upper triangular form using elementary reflectors. If s is the window width, the s elementary reflectors are first accumulated in the form P_{k+s} · · · P_{k+1} = I + W Y^⊤, where W, Y ∈ R^{m×s}. This expression allows the use of BLAS3 in updating the remaining part of A.
On a limited number of processors, the block version of the algorithm is the
preferred parallel scheme; it is implemented in ScaLAPACK [22] where the matrix
A is distributed on a two-dimensional grid of processes according to the block cyclic
scheme. The block size is often chosen large enough to allow BLAS3 routines to
achieve maximum performance on each involved uniprocessor. Note, however, that
while increasing the block size improves the granularity of the computation, it may
negatively affect concurrency. An optimal tradeoff therefore must be found depending
on the architecture of the computing platform.
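The following sketch (ours) illustrates the idea behind the WY form under the common parametrization P_i = I − τ_i v_i v_i^⊤ (an assumption, since the text does not reproduce the formulas): the s Householder vectors are accumulated into factors W and Y with P_1 · · · P_s = I + W Y^⊤, after which the trailing block is updated with two matrix–matrix (BLAS3-like) products.

```python
import numpy as np

def build_WY(V, tau):
    """Accumulate W, Y with  P_1 P_2 ... P_s = I + W Y^T,
    where P_i = I - tau[i] * v_i v_i^T and v_i is column i of V."""
    m, s = V.shape
    W, Y = np.zeros((m, s)), np.zeros((m, s))
    for i in range(s):
        v = V[:, i]
        if i == 0:
            w = -tau[i] * v
        else:
            # apply the current product to v:  Q_{i-1} v = v + W (Y^T v)
            w = -tau[i] * (v + W[:, :i] @ (Y[:, :i].T @ v))
        W[:, i], Y[:, i] = w, v
    return W, Y

def apply_WY_transpose(W, Y, C):
    """Compute (I + W Y^T)^T C = C + Y (W^T C) with two GEMM-like products."""
    return C + Y @ (W.T @ C)

# consistency check against applying the reflectors one by one
rng = np.random.default_rng(1)
m, s = 8, 3
V = rng.standard_normal((m, s))
tau = 2.0 / np.sum(V * V, axis=0)           # makes each P_i a true reflector
W, Y = build_WY(V, tau)
Q = np.eye(m)
for i in range(s):                           # Q = P_1 P_2 ... P_s
    Q = Q @ (np.eye(m) - tau[i] * np.outer(V[:, i], V[:, i]))
C = rng.standard_normal((m, 4))
print(np.allclose(Q, np.eye(m) + W @ Y.T))                  # True
print(np.allclose(Q.T @ C, apply_WY_transpose(W, Y, C)))    # True
```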
that of MGS (2mn^2), while B2GS involves an additional 2mnq arithmetic operations to reorthogonalize the blocks of size m × q. Thus, the number of arithmetic operations required by B2GS is (1 + q/n) times that of MGS. Consequently, for B2GS to be competitive, we need the number of blocks n/q to be large, or equivalently the number of columns in each block to be small. Having blocks with a small number of columns, however, will not allow us to capitalize on the higher performance afforded by BLAS3. Using blocks of 32 or 64 columns is often a reasonable compromise on many architectures.
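A minimal sketch (ours) of a block Gram–Schmidt scheme of the B2GS flavor is given below: each block of q columns is orthogonalized against the previously computed basis with matrix–matrix products, orthonormalized internally, and the two steps are repeated once (block reorthogonalization). The exact organization of B2GS in [24] may differ in detail; the triangular factor is omitted to keep the sketch short.

```python
import numpy as np

def block_gram_schmidt(A, q):
    """Return an orthonormal basis Q of range(A), built block by block.

    Each block of q columns is (i) orthogonalized against the current basis
    with BLAS3-like products, (ii) orthonormalized internally, with the two
    steps repeated once (block reorthogonalization)."""
    m, n = A.shape
    Q = np.empty((m, 0))
    for j in range(0, n, q):
        B = A[:, j:j + q].copy()
        for _ in range(2):                    # one pass + one reorthogonalization
            if Q.shape[1]:
                B -= Q @ (Q.T @ B)            # project out the current basis
            B, _r = np.linalg.qr(B)           # orthonormalize the block itself
        Q = np.hstack([Q, B])
    return Q

rng = np.random.default_rng(2)
A = rng.standard_normal((200, 32))
Q = block_gram_schmidt(A, q=8)
print(np.linalg.norm(Q.T @ Q - np.eye(32)))   # close to machine precision
```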
For linear least squares problems (7.4) in which m ≫ n, normal equations are very often used in several applications, including statistical data analysis. In the matrix computation literature, however, one is warned of possible loss of information in forming A^⊤A explicitly and solving the linear system

A^⊤A x = A^⊤ b.    (7.10)
As an alternative to the normal equations approach for solving the linear least squares
problem (7.4) when m ≫ n, we offer the following orthogonal factorization scheme
for tall-narrow matrices of maximal column rank. Such matrices arise in a variety of
applications. In this book, we consider two examples: (i) the adjustment of geodetic
networks (see Sect. 7.7), and (ii) orthogonalization of a nonorthogonal Krylov basis
(see Sect. 9.3.2). In both cases, m could be larger than 106 while n is as small as
several hundreds.
The algorithm for the orthogonal factorization of a tall-narrow matrix A on a
distributed memory architecture was first introduced in [25]. For the sake of illus-
tration, consider the case of using two multicore nodes. Partitioning A by rows as A^⊤ = (A_1^⊤, A_2^⊤), in the first stage of the algorithm the orthogonal factorizations of A_1 and A_2 are computed simultaneously, yielding the upper triangular factors T_1 and T_2, respectively (Fig. 7.2a). Thus, the first and most time consuming stage is performed fully in parallel.
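The first stage, and the subsequent combination of the two triangular factors, can be sketched as follows (ours; only the n × n triangular factor of A is formed, and the function name is illustrative). Recovering the orthogonal factor would additionally require applying the stored node-local factors, which is omitted here.

```python
import numpy as np

def two_node_tall_qr(A):
    """QR of a tall-narrow A by factoring two row blocks independently.

    Stage 1 (parallel): A1 = Q1 T1 and A2 = Q2 T2, one block per node.
    Stage 2 (small):    stack T1 over T2 and factor the 2n x n result.
    """
    m, n = A.shape
    A1, A2 = A[:m // 2, :], A[m // 2:, :]
    _, T1 = np.linalg.qr(A1)                    # node 1
    _, T2 = np.linalg.qr(A2)                    # node 2, concurrently
    _, R = np.linalg.qr(np.vstack([T1, T2]))    # combine the two n x n factors
    return R

rng = np.random.default_rng(3)
A = rng.standard_normal((10_000, 50))
R = two_node_tall_qr(A)
# R agrees, up to row signs, with the triangular factor of a direct QR of A
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1]), atol=1e-6))
```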
A = \begin{pmatrix}
B_1 & & & & C_1 \\
 & B_2 & & & C_2 \\
 & & \ddots & & \vdots \\
 & & & B_p & C_p
\end{pmatrix},    (7.11)
where A, as well as each block B_i, i = 1, . . . , p, is of full column rank. The block angular form implies that the QR factorization of A yields an upper triangular matrix with the same block structure; this can be seen from the Cholesky factorization of the block arrowhead matrix A^⊤A. Therefore, similar to the first stage of the algorithm described above in Sect. 7.6, for i = 1, . . . , p, multicore node i first performs the orthogonal factorization of the block (B_i, C_i). The transformed block is then of the form
\begin{pmatrix} R_i & E_{i1} \\ 0 & E_{i2} \end{pmatrix}.
As a result, after the first orthogonal transformation, and an appropriate permutation, we have a matrix of the form:
A_1 = \begin{pmatrix}
R_1 & & & & E_{11} \\
 & R_2 & & & E_{21} \\
 & & \ddots & & \vdots \\
 & & & R_p & E_{p1} \\
 & & & & E_{12} \\
 & & & & E_{22} \\
 & & & & \vdots \\
 & & & & E_{p2}
\end{pmatrix}.    (7.12)
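To illustrate the two stages on the block angular form (7.11) and (7.12), here is a small sketch (ours): each node factors its bordered block (B_i, C_i), producing R_i, E_{i1} and E_{i2}, and the stacked E_{i2} blocks are then factored in a second, much smaller, stage. Names and data layout are illustrative, and it is assumed that each block has more rows than columns.

```python
import numpy as np

def block_angular_qr(Bs, Cs):
    """Two-stage QR of the block angular matrix with diagonal blocks B_i and
    coupling columns C_i (cf. (7.11)-(7.12))."""
    Rs, E1s, E2s = [], [], []
    for B, C in zip(Bs, Cs):                    # embarrassingly parallel stage
        ni = B.shape[1]
        _, T = np.linalg.qr(np.hstack([B, C]))  # QR of the bordered block
        Rs.append(T[:ni, :ni])                  # R_i
        E1s.append(T[:ni, ni:])                 # E_i1
        E2s.append(T[ni:, ni:])                 # E_i2
    _, R_c = np.linalg.qr(np.vstack(E2s))       # second stage on the coupling part
    return Rs, E1s, R_c

rng = np.random.default_rng(4)
p, mi, ni, nc = 4, 300, 20, 10                  # p blocks, coupling width nc
Bs = [rng.standard_normal((mi, ni)) for _ in range(p)]
Cs = [rng.standard_normal((mi, nc)) for _ in range(p)]
Rs, E1s, R_c = block_angular_qr(Bs, Cs)
print(R_c.shape)                                # (nc, nc)
```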
F(x) = q (7.13)
where x is the vector containing the unknown coordinates, and q represents the
observation vector. Using the Gauss-Newton method for solving the nonlinear system
(7.13), we will need to solve a sequence of linear least squares problems of the
form (7.4), where A denotes the Jacobian of F at the current vector of unknown
coordinates, initially x 0 , and r 0 = q − F(x 0 ). The least squares solution vector y
is the adjustment vector, i.e., x 1 = x 0 + y is the improved approximation of the
coordinate vector x, e.g. see [30, 31].
The geodetic adjustment problem just described has the computationally con-
venient feature that the geodetic network domain can be readily decomposed into
smaller subproblems. This decomposition is based upon the Helmert blocking of the
network as described in [32]. Here, the observation matrix A is assembled into the
block angular form (7.11) by the way in which the data is collected by regions.
where T_2 and T_3 are upper triangular and nonsingular. This annihilation is organized
so as to minimize intercluster communication. To illustrate this procedure, consider
the annihilation of T1 by T2 where each is of order n b . It is performed by elimination
of diagonals as outlined in Sect. 7.6 (see Fig. 7.2). The algorithm may be described
as follows:
do k = 1 : n_b,
   rotate e_k^⊤ T_2 and e_1^⊤ T_1 to annihilate the element of T_1 in
   position (1, k), where e_k denotes the k-th column of the identity
   (requires intercluster communication).
   do i = 1 : n_b − k,
      rotate rows i and i + 1 of T_1 to annihilate the element in
      position (i + 1, i + k) (local to the first cluster).
   end
end
Stage 3: This is similar to stage 2. Each cluster i obtains the orthogonal factoriza-
tion of Si in (7.17). Notice here, however, that the computational load is not perfectly
balanced among the four clusters since S2 and S3 may have fewer rows than S1 and
S4 . The resulting matrix A(3a) is of the form,
A^{(3a)} = \begin{pmatrix}
R_1 & * & * \\
0 & 0 & V_1 \\
0 & 0 & 0 \\
R_2 & * & * \\
0 & T_2 & * \\
0 & 0 & V_2 \\
R_3 & * & * \\
0 & T_3 & * \\
0 & 0 & V_3 \\
R_4 & * & * \\
0 & 0 & V_4 \\
0 & 0 & 0
\end{pmatrix},    (7.18)

where the first three block rows are held by cluster 1, the next three by cluster 2, the next three by cluster 3, and the last three by cluster 4.
where the diagonal entries of the matrix Σ = diag(σ_1, . . . , σ_q) are the singular values of A, and U ∈ R^{n×n}, V ∈ R^{m×m} are orthogonal matrices. If b is the right-hand side of the linear least squares problem (7.4), and c = V^⊤ b = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} with c_1 ∈ R^q, then x is a solution in S if and only if there exists y ∈ R^{n−q} such that
x = U \begin{pmatrix} Σ^{−1} c_1 \\ y \end{pmatrix}.
In this case, the corresponding minimum residual norm is equal to ‖c_2‖. In S, the classically selected solution is the one with smallest 2-norm, i.e. the one with y = 0.
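The following sketch (ours) computes the minimum 2-norm least squares solution exactly as described: form c = V^⊤b, invert the q retained singular values, and set y = 0; the number q is determined here by a simple threshold on the singular values. NumPy's SVD factors (here Vnp, s, Ut) play the roles of the text's V, σ_i and U^⊤.

```python
import numpy as np

def min_norm_lsq(A, b, eta):
    """Minimum 2-norm least squares solution via the SVD.

    In the notation of the text: c = V^T b, x = U (Sigma_q^{-1} c_1; 0),
    where q is the number of singular values larger than the threshold eta.
    """
    Vnp, s, Ut = np.linalg.svd(A, full_matrices=False)
    q = int(np.sum(s > eta))                 # numerical rank
    c1 = Vnp[:, :q].T @ b                    # leading part of c = V^T b
    z = np.zeros(A.shape[1])
    z[:q] = c1 / s[:q]                       # Sigma_q^{-1} c_1, and y = 0
    x = Ut.T @ z                             # x = U (Sigma^{-1} c_1 ; 0)
    return x, q, np.linalg.norm(A @ x - b)   # residual norm equals ||c_2||

rng = np.random.default_rng(5)
A = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 12))   # rank 8
b = rng.standard_normal(100)
x, q, r = min_norm_lsq(A, b, eta=1e-10)
print(q, r)                                  # q == 8
```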
Householder Orthogonalization with Column Pivoting
In order to avoid computing the singular-value decomposition which is quite ex-
pensive, the factorization (7.20) is often replaced by the orthogonal factorization
introduced in [33]. This procedure computes the QR factorization by Householder
reductions with column pivoting. Factorization (7.2) is now replaced by

Q_q · · · Q_2 Q_1 A P_1 P_2 · · · P_q = \begin{pmatrix} T \\ 0 \end{pmatrix},    (7.21)

where the P_i are permutation matrices and

T = \begin{pmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{pmatrix}

is an upper triangular matrix in which R_{11} ∈ R^{q×q}, where q is the number of singular values larger than an a priori fixed threshold η > 0, and ‖R_{22}‖_2 = O(η). Here, the parameter η > 0 is chosen relative to the 2-norm of R since, at the end of the process, cond(R_{11}) ≤ ‖R‖_2/η. An upper bound on the Frobenius norm of the error in the corresponding approximation of the generalized inverse can be given in terms of η.
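For concreteness, here is a sketch (ours) of the column-pivoted Householder procedure of [33] with an η-based stopping rule: at each step the remaining column of largest norm is swapped into the pivot position and eliminated by a reflector, and the process stops once the largest trailing column norm drops below η, so that the returned q plays the role of the numerical rank in (7.21). For simplicity the column norms are recomputed rather than downdated.

```python
import numpy as np

def qr_column_pivoting(A, eta):
    """Householder QR with column pivoting and an eta-based stopping rule.

    Returns (R, perm, q): the transformed matrix, the column permutation,
    and the number of steps performed (the estimated numerical rank)."""
    A = A.astype(float).copy()
    m, n = A.shape
    perm = np.arange(n)
    q = 0
    for k in range(min(m, n)):
        norms = np.linalg.norm(A[k:, k:], axis=0)        # trailing column norms
        j = k + int(np.argmax(norms))
        if norms[j - k] <= eta:                          # ||R22|| is O(eta): stop
            break
        A[:, [k, j]] = A[:, [j, k]]                      # column interchange
        perm[[k, j]] = perm[[j, k]]
        x = A[k:, k]
        alpha = -np.sign(x[0]) * np.linalg.norm(x) if x[0] else -np.linalg.norm(x)
        v = x.copy(); v[0] -= alpha
        beta = 2.0 / (v @ v)
        A[k:, k:] -= np.outer(v, beta * (v @ A[k:, k:])) # apply the reflector
        q += 1
    return A, perm, q

rng = np.random.default_rng(6)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 6))   # numerical rank 3
R, perm, q = qr_column_pivoting(A, eta=1e-8)
print(q)            # 3
```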
Algorithm 7.5 ensures that ‖R_{22}‖ = O(η), as shown in the following theorem:

Theorem 7.2 Let us assume that the smallest singular values of the blocks defined by Eq. (7.23) satisfy the following conditions:
σ_min(R_{11}) > η   and   σ_min \begin{pmatrix} R_{11} & u_1 \\ 0 & u_2 \end{pmatrix} ≤ η
for any column \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} of the matrix \begin{pmatrix} R_{12} \\ R_{22} \end{pmatrix} ∈ R^{n×(n−q)}. Then the following bounds hold:

‖u_2‖ ≤ \left( \frac{\sqrt{1 + ‖R_{11}^{−1} u_1‖^2}}{1 − η‖R_{11}^{−1}‖} \right) η,    (7.24)

‖R_{22}‖_F ≤ \left( \frac{\sqrt{n − q + ‖R_{11}^{−1} R_{12}‖_F^2}}{1 − η‖R_{11}^{−1}‖} \right) η,    (7.25)

where R_{11} ∈ R^{q×q} and ‖·‖_F denotes the Frobenius norm.
Proof We have
η ≥ σ_min \begin{pmatrix} R_{11} & u_1 \\ 0 & u_2 \end{pmatrix} = σ_min \begin{pmatrix} R_{11} & u_1 \\ 0 & ρ \end{pmatrix},
where ρ = ‖u_2‖. Therefore,
\frac{1}{η} ≤ \left\| \begin{pmatrix} R_{11}^{−1} & −\frac{1}{ρ} R_{11}^{−1} u_1 \\ 0 & \frac{1}{ρ} \end{pmatrix} \right\| ≤ ‖R_{11}^{−1}‖ + \frac{1}{ρ} \sqrt{1 + ‖R_{11}^{−1} u_1‖^2},
which implies (7.24). The bound (7.25) is obtained by summing the squares of the bounds (7.24) over all the columns of R_{22}.
Remark 7.1 The hypotheses of the theorem ensure that τ_sep = 1 − η‖R_{11}^{−1}‖ is positive. This quantity, however, might be too small if the singular values of R are not well separated in the neighborhood of η.
Remark 7.2 The number of column permutations corresponds to the rank deficiency:
K = n − q,
(L , 0) = T Q. (7.27)
This post-processing rank-revealing procedure offers only limited opportunities for exploiting parallelism. In fact, if the matrix A is tall and narrow, or when its rank deficiency is relatively small, this post-processing step can simply be performed on a uniprocessor.
References
1. Halko, N., Martinsson, P., Tropp, J.: Finding structure with randomness: probabilistic algo-
rithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011).
doi:10.1137/090771806
2. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series
in Statistics. Springer, New York (2001)
3. Kontoghiorghes, E.: Handbook of Parallel Computing and Statistics. Chapman & Hall/CRC,
New York (2005)
4. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
5. Björck, Å.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
6. Sameh, A., Kuck, D.: On stable parallel linear system solvers. J. Assoc. Comput. Mach. 25(1),
81–91 (1978)
7. Modi, J., Clarke, M.: An alternative Givens ordering. Numerische Mathematik 43, 83–90 (1984)
8. Cosnard, M., Muller, J.M., Robert, Y.: Parallel QR decomposition of a rectangular matrix.
Numerische Mathematik 48, 239–249 (1986)
9. Cosnard, M., Daoudi, E.: Optimal algorithms for parallel Givens factorization on a coarse-
grained PRAM. J. ACM 41(2), 399–421 (1994). doi:10.1145/174652.174660
10. Cosnard, M., Robert, Y.: Complexity of parallel QR factorization. J. ACM 33(4), 712–723
(1986)
11. Gentleman, W.M.: Least squares computations by Givens transformations without square roots.
IMA J. Appl. Math. 12(3), 329–336 (1973)
12. Hammarling, S.: A note on modifications to the Givens plane rotation. J. Inst. Math. Appl. 13,
215–218 (1974)
13. Kontoghiorghes, E.: Parallel Algorithms for Linear Models: Numerical Methods and Estima-
tion Problems. Advances in Computational Economics. Springer, New York (2000). http://
books.google.fr/books?id=of1ghCpWOXcC
14. Lawson, C., Hanson, R., Kincaid, D., Krogh, F.: Basic linear algebra subprograms for Fortran
usage. ACM Trans. Math. Softw. 5(3), 308–323 (1979)
15. Dongarra, J., Croz, J.D., Hammarling, S., Hanson, R.: An extended set of FORTRAN basic
linear algebra subprograms. ACM Trans. Math. Softw. 14(1), 1–17 (1988)
16. Gallivan, K.A., Plemmons, R.J., Sameh, A.H.: Parallel algorithms for dense linear algebra
computations. SIAM Rev. 32(1), 54–135 (1990). doi:10.1137/1032002
17. Sameh, A.: Numerical parallel algorithms—a survey. In: Kuck, D., Lawrie, D., Sameh, A.
(eds.) High Speed Computer and Algorithm Optimization, pp. 207–228. Academic Press, San
Diego (1977)
18. Bischof, C., van Loan, C.: The WY representation for products of Householder matrices. SIAM
J. Sci. Stat. Comput. 8(1), 2–13 (1987). doi:10.1137/0908009
19. Schreiber, R., Parlett, B.: Block reflectors: theory and computation. SIAM J. Numer. Anal.
25(1), 189–205 (1988). doi:10.1137/0725014
20. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
21. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
22. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). http://www.netlib.org/scalapack
23. Björck, Å.: Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT 7,
1–21 (1967)
24. Jalby, W., Philippe, B.: Stability analysis and improvement of the block Gram-Schmidt algo-
rithm. SIAM J. Stat. Comput. 12(5), 1058–1073 (1991)
25. Sameh, A.: Solving the linear least-squares problem on a linear array of processors. In: Sny-
der, L., Gannon, D., Jamieson, L.H., Siegel, H.J. (eds.) Algorithmically Specialized Parallel
Computers, pp. 191–200. Academic Press, San Diego (1985)
26. Sidje, R.B.: Alternatives for parallel Krylov subspace basis computation. Numer. Linear Alge-
bra Appl. 4, 305–331 (1997)
27. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and se-
quential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), 206–239 (2012). doi:10.1137/
080731992
28. Chang, X.W., Paige, C.: An algorithm for combined code and carrier phase based GPS posi-
tioning. BIT Numer. Math. 43(5), 915–927 (2003)
29. Chang, X.W., Guo, Y.: Huber’s M-estimation in relative GPS positioning: computational as-
pects. J. Geodesy 79(6–7), 351–362 (2005)
30. Bomford, G.: Geodesy, 3rd edn. Clarendon Press, England (1971)
31. Golub, G., Plemmons, R.: Large scale geodetic least squares adjustment by dissection and
orthogonal decomposition. Numer. Linear Algebra Appl. 35, 3–27 (1980)
32. Wolf, H.: The Helmert block method—its origin and development. In: Proceedings of the
Second Symposium on Redefinition of North American Geodetic Networks, pp. 319–325
(1978)
33. Businger, P., Golub, G.H.: Linear least squares solutions by Householder transformations.
Numer. Math. 7, 269–276 (1965)
34. Quintana-Ortí, G., Quintana-Ortí, E.: Parallel algorithms for computing rank-revealing QR fac-
torizations. In: Cooperman, G., Michler, G., Vinck, H. (eds.) Workshop on High Performance
Computing and Gigabit Local Area Networks. Lecture Notes in Control and Information Sci-
ences, pp. 122–137. Springer, Berlin (1997). doi:10.1007/3540761691_9
35. Quintana-Ortí, G., Sun, X., Bischof, C.: A BLAS-3 version of the QR factorization with column
pivoting. SIAM J. Sci. Comput. 19, 1486–1494 (1998)
36. Bischof, C.: A parallel QR factorization algorithm using local pivoting. In: Proceedings of
1988 ACM/IEEE Conference on Supercomputing, Supercomputing’88, pp. 400–499. IEEE
Computer Society Press, Los Alamitos (1988)
37. Bischof, C.H.: Incremental condition estimation. SIAM J. Matrix Anal. Appl. 11, 312–322
(1990)
Chapter 8
The Symmetric Eigenvalue
and Singular-Value Problems
Eigenvalue problems form the second most important class of problems in numerical
linear algebra. Unlike linear system solvers, which can be direct or iterative, eigensolvers are necessarily iterative in nature. In this chapter, we consider real symmetric eigenvalue problems (and, by extension, complex Hermitian eigenvalue problems), as
well as the problem of computing the singular value decomposition.
Given a symmetric matrix A ∈ R^{n×n}, the standard eigenvalue problem consists of computing the eigen-elements of A, which are:
• either all, or a few selected, eigenvalues only: these are all the n, or p ≪ n selected, roots λ of the nth degree characteristic polynomial
det(A − λI) = 0,    (8.1)
• or the eigenpairs, i.e. the eigenvalues together with corresponding nonzero eigenvectors x satisfying
(A − λI)x = 0.    (8.2)
The procedures discussed in this chapter apply with minimal changes in case the
matrix A is complex Hermitian since the eigenvalues are still real and the complex
matrix of eigenvectors is unitary.
Since any symmetric or Hermitian matrix can be diagonalized by an orthogonal
or a unitary matrix, the eigenvalues are insensitive to small symmetric perturbations
of A. In other words, when A + E is a symmetric update of A, the eigenvalues of
A + E cannot be at a distance exceeding ‖E‖_2 from those of A. Therefore,
the computation of the eigenvalues of A with largest absolute values is always well
conditioned. Only the relative accuracy of the eigenvalues with smallest absolute
values can be affected by perturbations. Accurate computation of the eigenvectors is
more difficult in case of poorly separated eigenvalues, e.g. see [1, 2].
V^⊤ A U = Σ,
Parallel Schemes
We consider the following three classical standard eigenvalue solvers for problems
(8.1) and (8.2):
• Jacobi iterations,
• QR iterations, and
• the multisectioning method.
Details of the above three methods for uniprocessors are given in many references,
see e.g. [1–4].
While Jacobi’s method is capable of yielding more accurate eigenvalues and per-
fectly orthogonal eigenvectors, in general it consumes more time on uniprocessors, or
even on some parallel architectures, compared to tridiagonalization followed by QR
iterations. It requires more arithmetic operations and memory references. Its high
potential for parallelism, however, warrants its examination. The other two meth-
ods are often more efficient when combined with the tridiagonalization process:
T = Q^⊤ A Q, where Q is orthogonal and T is tridiagonal.
Computing the full or partial singular value decomposition of a matrix is based on symmetric eigenvalue problem solvers involving either A_nrm or A_aug, as given by
T = B^⊤ B,    (8.6)
where A ∈ Rn×n is a dense symmetric matrix. The original Jacobi’s method for
determining all the eigenpairs of (8.8) reduces the matrix A to the diagonal form by
an infinite sequence of plane rotations.
A_{k+1} = U_k^⊤ A_k U_k,   k = 1, 2, . . . ,
tan 2θ_{ij}^k = \frac{2 α_{ij}^k}{α_{ii}^k − α_{jj}^k},

c_k = \frac{1}{\sqrt{1 + t_k^2}}   and   s_k = c_k t_k,

t_k^2 + 2 τ_k t_k − 1 = 0,   in which   τ_k = \cot 2θ_{ij}^k = \frac{α_{ii}^k − α_{jj}^k}{2 α_{ij}^k},    (8.9)

t_k = \frac{\mathrm{sign}(τ_k)}{|τ_k| + \sqrt{1 + τ_k^2}}.    (8.10)
Each A_{k+1} remains symmetric and differs from A_k only in the ith and jth rows and columns, with the modified off-diagonal elements (r ≠ i, j) given by
α_{ir}^{k+1} = c_k α_{ir}^k + s_k α_{jr}^k,
α_{jr}^{k+1} = −s_k α_{ir}^k + c_k α_{jr}^k.    (8.12)
A_k = D_k + E_k + E_k^⊤,    (8.13)
As k increases, D_k approaches the diagonal matrix of the eigenvalues diag(λ_1, . . . , λ_n), while ‖E_k‖_F approaches zero (here, ‖·‖_F denotes the Frobenius norm). Similarly, the transpose of the product (U_k · · · U_2 U_1) approaches a matrix whose jth column is an eigenvector corresponding to λ_j.
Several schemes are possible for selecting the sequence of elements αikj to be
eliminated via the plane rotations Uk . Unfortunately, Jacobi’s original scheme,
which consists of sequentially searching for the largest off-diagonal element, is
too time consuming for implementation on a multiprocessor. Instead, a simpler
scheme in which the off-diagonal elements (i, j) are annihilated in the cyclic fash-
ion (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), . . . , (n − 1, n) is usually adopted as
its convergence is assured [6]. We refer to each sequence of n(n − 1)/2 rotations as a
sweep. Furthermore, quadratic convergence for this sequential cyclic Jacobi scheme
has been established (e.g. see [8, 9]). Convergence usually occurs within a small
number of sweeps, typically in O(log n) sweeps.
A parallel version of this cyclic Jacobi algorithm is obtained by the simultaneous
annihilation of several off-diagonal elements by a given Uk , rather than annihilat-
ing only one off-diagonal element and its symmetric counterpart as is done in the
sequential version. For example, let A be of order 8 and consider the orthogonal
matrix Uk as the direct sum of 4 independent plane rotations, where the ci ’s and si ’s
for i = 1, 2, 3, 4 are simultaneously determined. An example of such a matrix is
where Rk (i, j) is that rotation which annihilates the (i, j) and ( j, i) off-diagonal
elements. Now, a sweep can be seen as a collection of orthogonal similarity trans-
formations where each of them simultaneously annihilates several off-diagonal pairs
and such that each of the off-diagonal entries is annihilated only once by the sweep.
For a matrix of order 8, an optimal sweep will consist of 7 successive orthogonal
transformations with each one annihilating distinct groups of 4 off-diagonal elements
simultaneously, as shown in the left array of Table 8.1, where the similarity trans-
formation of (8.14) is U_6. The right array of Table 8.1 shows the corresponding sweep for a matrix of odd order, n = 9; in both arrays the entries indicate the step in which the corresponding elements are annihilated.
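One standard way to generate such a parallel ordering is the round-robin (tournament) scheme sketched below (ours): it produces n − 1 steps, each consisting of n/2 disjoint pairs, so that every pair (i, j) occurs exactly once per sweep. The specific pairing of Table 8.1 may differ in detail, but it has the same structure.

```python
def round_robin_ordering(n):
    """Return n-1 steps; each step is a list of n/2 disjoint (i, j) pairs
    covering every pair exactly once per sweep (n even, 1-based indices)."""
    assert n % 2 == 0
    players = list(range(1, n + 1))
    steps = []
    for _ in range(n - 1):
        pairs = [(min(players[k], players[n - 1 - k]),
                  max(players[k], players[n - 1 - k])) for k in range(n // 2)]
        steps.append(pairs)
        # keep the first element fixed and rotate the rest by one position
        players = [players[0]] + [players[-1]] + players[1:-1]
    return steps

for step, pairs in enumerate(round_robin_ordering(8), start=1):
    print(step, pairs)
# all 28 pairs (i, j), i < j, appear in exactly one of the 7 steps
```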
The derivation of the one-sided Jacobi method is motivated by the singular value
decomposition of rectangular matrices. It can be used to compute the eigenvalues of
the square matrix A in (8.8) when A is symmetric positive definite. It is considerably
more efficient to apply a one-sided Jacobi method which in effect only post-multiplies
A by plane rotations. Let A ∈ Rm×n with m ≥ n and rank(A) = r ≤ n. The singular
value decomposition of A is defined by the corresponding notations:
A = V Σ U^⊤,    (8.15)

q_i^⊤ q_j = σ_i^2 δ_{ij},

then the factorization (8.15) is entirely determined. We construct the matrix U via the plane rotations
(a_i, a_j) \begin{pmatrix} c & −s \\ s & c \end{pmatrix} = (ã_i, ã_j),   i < j,
so that
ã_i^⊤ ã_j = 0   and   ‖ã_i‖ ≥ ‖ã_j‖,    (8.17)
where a_i designates the ith column of the matrix A. This is accomplished by choosing

c = \left( \frac{β + γ}{2γ} \right)^{1/2}   and   s = \frac{α}{2γ c}   if β > 0,    (8.18)

or

s = \left( \frac{γ − β}{2γ} \right)^{1/2}   and   c = \frac{α}{2γ s}   if β < 0,    (8.19)
Usi = U1 U2 · · · U2q−1 ,
where q = (n + 1)/2, and hence
a_j^{k+1} = −s a_i^k + c a_j^k,    (8.22)
where aik denotes the ith column of Ak , and c, s are determined by either (8.18) or
(8.19). On a parallel architecture with vector capability, one expects to realize high
performance in computing (8.21) and (8.22). Each processor is assigned one rotation
and hence orthogonalizes one pair of the n columns of matrix Ak .
\frac{a_i^⊤ a_j}{\sqrt{(a_i^⊤ a_i)(a_j^⊤ a_j)}},    (8.23)

falls below a given tolerance in any given sweep, with the algorithm terminating when the counter reaches n(n − 1)/2, the total number of column pairs, after any sweep.
Upon termination, the first r columns of the matrix A are overwritten by the matrix
Q from (8.16) and hence the non-zero singular values σi can be obtained via the r
square roots of the first r diagonal entries of the updated A^⊤A. The matrix V in (8.15), which contains the leading r left singular vectors of the original matrix A, is readily
obtained by column scaling of the updated matrix A (now overwritten by Q = V Σ)
by the r non-zero singular values. Similarly, the matrix U , which contains the right
singular vectors of the original matrix A, is obtained as in (8.20) as the product of the
orthogonal Uk ’s. This product is accumulated in a separate two-dimensional array
by applying the rotations used in (8.21) and (8.22) to the identity matrix of order n. It
is important to note that the use of the fraction in (8.23) is preferable to using a_i^⊤ a_j, since this inner product is necessarily small for relatively small singular values.
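A compact sketch (ours) of the one-sided scheme is given below: each rotation orthogonalizes one pair of columns using (8.18)–(8.19), and the sweep loop stops once the normalized inner products (8.23) of all column pairs fall below a tolerance. The quantities α = 2 a_i^⊤ a_j, β = ‖a_i‖² − ‖a_j‖², γ = (α² + β²)^{1/2} are the usual conventions for these rotations and are an assumption here, since their definitions do not appear in the excerpt above.

```python
import numpy as np

def one_sided_jacobi(A, tol=1e-12, max_sweeps=30):
    """One-sided (Hestenes-type) Jacobi SVD sketch.

    Returns (V, sigma, U) with A ~ V diag(sigma) U^T.  Assumed conventions:
    alpha = 2 a_i^T a_j, beta = ||a_i||^2 - ||a_j||^2, gamma = hypot(alpha, beta),
    with c, s chosen as in (8.18)-(8.19)."""
    A = A.astype(float).copy()
    m, n = A.shape
    U = np.eye(n)
    for _ in range(max_sweeps):
        passed = 0
        for i in range(n - 1):
            for j in range(i + 1, n):
                aii, ajj, aij = A[:, i] @ A[:, i], A[:, j] @ A[:, j], A[:, i] @ A[:, j]
                if aij * aij <= tol * tol * aii * ajj:     # test (8.23)
                    passed += 1
                    continue
                alpha, beta = 2.0 * aij, aii - ajj
                gamma = np.hypot(alpha, beta)
                if beta >= 0.0:                            # (8.18)
                    c = np.sqrt((beta + gamma) / (2.0 * gamma))
                    s = alpha / (2.0 * gamma * c)
                else:                                      # (8.19)
                    s = np.sqrt((gamma - beta) / (2.0 * gamma))
                    c = alpha / (2.0 * gamma * s)
                for M in (A, U):                           # post-multiply by the rotation
                    Mi, Mj = M[:, i].copy(), M[:, j].copy()
                    M[:, i], M[:, j] = c * Mi + s * Mj, -s * Mi + c * Mj
        if passed == n * (n - 1) // 2:                     # all pairs passed the test
            break
    sigma = np.linalg.norm(A, axis=0)
    V = A / np.where(sigma > 0, sigma, 1.0)                # columns of Q = V*Sigma, rescaled
    return V, sigma, U

rng = np.random.default_rng(7)
A = rng.standard_normal((30, 6))
V, sigma, U = one_sided_jacobi(A)
print(np.allclose(V * sigma @ U.T, A))                     # True
print(np.allclose(np.sort(sigma), np.sort(np.linalg.svd(A, compute_uv=False))))
```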
Although 1JAC concerns the singular value decomposition of rectangular matri-
ces, it is most effective for handling the eigenvalue problem (8.8) when obtaining all
the eigenpairs of square symmetric matrices. Thus, if 1JAC is applied to a symmet-
ric matrix A ∈ Rn×n , the columns of the resulting V ∈ Rn×r are the eigenvectors
corresponding to the nonzero eigenvalues of A. The eigenvalue corresponding to vi
(i = 1, . . . , r) is obtained by the Rayleigh quotient λ_i = v_i^⊤ A v_i; therefore |λ_i| = σ_i.
The null space of A is the orthogonal complement of the subspace spanned by the
columns of V .
Algorithm 1JAC has two advantages over 2JAC: (i) no need to access both rows
and columns, and (ii) the matrix U need not be accumulated.
A = QR (8.24)
where A1 ≡ A, and Uk = Uk (i, j, φikj ), Vk = Vk (i, j, θikj ) are plane rotations that
affect rows and columns i, and j. It follows that Ak approaches the diagonal matrix
Σ = diag(σ1 , σ2 , . . . , σn ), where σi is the ith singular value of A, and the products
(Vk · · · V2 V1 ), (Uk · · · U2 U1 ) approach matrices whose ith column is the respective
left and right singular vector corresponding to σi . When the σi ’s are not pathologically
close, it has been shown in [22] that the row (or column) cyclic Kogbetliantz method
ultimately converges quadratically. For triangular matrices, it has been demonstrated
in [23] that the Kogbetliantz algorithm converges quadratically for those matrices having
multiple or clustered singular values provided that singular values of the same cluster
occupy adjacent diagonal elements of Aν , where ν is the number of sweeps required
for convergence. Even if we were to assume that R in (8.24) satisfies this condition for
quadratic convergence of the parallel Kogbetliantz method in [22], the ordering of the
rotations and subsequent row (or column) permutations needed to maintain the upper-
triangular form is less efficient on many parallel architectures. One clear advantage
of using 1JAC for obtaining the singular value decomposition of R lies in that the
rotations given in (8.18) or (8.19), and applied via the parallel ordering illustrated in
Table 8.1, see also Algorithm 8.1, require no processor synchronization among any set of the n/2 or (n − 1)/2 simultaneous plane rotations. The convergence rate of
1JAC, however, does not necessarily match that of the Kogbetliantz algorithm.
Let
S_k = R_k^⊤ R_k = D̃_k + Ẽ_k + Ẽ_k^⊤,    (8.26)
The above algorithms are well-suited for shared memory architectures. While they
can also be implemented on distributed memory systems, their efficiency on such
systems will suffer due to communication costs. In order to increase the granularity
of the computation (i.e., to increase the number of arithmetic operations between
two successive message exchanges), block algorithms are considered.
Allocating blocks of a matrix in place of single entries or vectors and replacing
the basic Jacobi rotations by more elaborate orthogonal transformations, we obtain
what is known as block Jacobi schemes. Let us consider the treatment of the pair of
off-diagonal blocks (i, j) in such algorithms:
• Two-sided Jacobi for the symmetric eigenvalue problem: the matrix A ∈ R^{n×n} is partitioned into p × p blocks A_{ij} ∈ R^{q×q} (1 ≤ i, j ≤ p and n = pq). For any pair (i, j) with i < j, let
J_{ij} = \begin{pmatrix} A_{ii} & A_{ij} \\ A_{ij}^⊤ & A_{jj} \end{pmatrix} ∈ R^{2q×2q}.
The matrix U(i, j) is the orthogonal matrix that diagonalizes J_{ij}.
• One-sided Jacobi for the SVD: the matrix A ∈ R^{m×n} (m ≥ n = pq) is partitioned into p blocks A_i ∈ R^{m×q} (1 ≤ i ≤ p). For any pair (i, j) with i < j, let J_{ij} = (A_i, A_j) ∈ R^{m×2q}. The matrix U(i, j) is the orthogonal matrix that diagonalizes J_{ij}^⊤ J_{ij}.
• Two-sided Jacobi for the SVD (Kogbetliantz algorithm): the matrix A ∈ R^{n×n} is partitioned into p × p blocks A_{ij} ∈ R^{q×q} (1 ≤ i, j ≤ p). For any pair (i, j) with i < j, let
J_{ij} = \begin{pmatrix} A_{ii} & A_{ij} \\ A_{ji} & A_{jj} \end{pmatrix} ∈ R^{2q×2q}.
The matrices U(i, j) and V(i, j) are the orthogonal matrices defined by the SVD of J_{ij}.
For one-sided algorithms, each processor is allocated a block of columns instead of
a single column. The main features of the algorithm remain the same as discussed above, with the ordering of the rotations within a sweep as given in [25].
For the two-sided version, the allocation manipulates 2-D blocks instead of single
entries of the matrix. A modification of the basic algorithm in which one annihilates,
in each step, two symmetrically positioned off-diagonal blocks by performing a full
SVD on the smaller sized off-diagonal block has been proposed in [26, 27]. While
reasonable performance can be realized on distributed memory architectures, this
block strategy increases the number of sweeps needed to achieve convergence. In
order to reduce the number of needed iterations, a dynamic ordering has been inves-
tigated in [28] in conjunction with a preprocessing step consisting of a preliminary
QR factorization with column pivoting [29]. This procedure has also been considered
for the block one-sided Jacobi algorithm for the singular value decomposition, e.g.
see [30].
is almost as fast as an arithmetic operation for obtaining all the eigenpairs of a modest
size matrix.
In addition to the higher cost of communications in Jacobi schemes, in general,
they also incur almost the same order of overall arithmetic operations. To illustrate
this, let us assume that n is even. Diagonalizing a matrix A ∈ Rn×n on an architecture
of p = n/2 processors is accomplished by a sequence of sweeps in which each sweep
requires (n − 1) parallel applications of p rotations. Therefore, each sweep costs
O(n 2 ) arithmetic operations. If, in addition, we assume that the number of sweeps
needed to obtain accurate eigenvalues to be O(log n), then the overall number of
arithmetic operations is O(n 2 log n). This estimation can even become as low as
O(n log n) by using O(n 2 ) processors. As seen in Sect. 8.1, one sweep of rotations
involves 6n 3 + O(n 2 ) arithmetic operations when taking advantage of symmetry,
and including accumulation of the rotations. This estimate must be compared to the
8n 3 + O(n 2 ) arithmetic operations needed to reduce the matrix A to the tridiagonal
form and to build the orthogonal matrix Q which realizes such tridiagonalization
(see next section).
A favorable situation for Jacobi schemes arises when one needs to investigate the
evolution of the eigenvalues of a sequence of slowly varying symmetric matrices
(A_k)_{k≥0} ⊂ R^{n×n}. Suppose A_k has the spectral decomposition A_k = U_k D_k U_k^⊤, where U_k is an orthogonal matrix (full set of eigenvectors) and D_k a diagonal matrix containing the eigenvalues. Hence, if the quantity ‖A_{k+1} − A_k‖/‖A_k‖ is small, the matrix U_k^⊤ A_{k+1} U_k is nearly diagonal; a Jacobi scheme applied to it quickly produces an orthogonal matrix V_{k+1} that diagonalizes it, therefore yielding the spectral decomposition A_{k+1} = U_{k+1} D_{k+1} U_{k+1}^⊤ with U_{k+1} = U_k V_{k+1}. When one sweep only is sufficient to get convergence, Jacobi schemes are competitive.
Obtaining all the eigenpairs of a symmetric matrix A can be achieved by the following
two steps: (i) obtaining a symmetric tridiagonal matrix T which is orthogonally
similar to A: T = U^⊤ A U, where U is an orthogonal matrix; and (ii) obtaining the spectral factorization of the resulting tridiagonal matrix T, i.e. D = V^⊤ T V, and computing the eigenvectors of A by the back transformation Q = U V.
v := α A u;
w := v − ½ α (u^⊤ v) u;
A := A − (u w^⊤ + w u^⊤).
By exploiting symmetry, the rank-2 update (that appears in the last step) results in computational savings [32].
The tridiagonal matrix T is obtained by repeating the process successively on
columns 2 to (n − 2). The total procedure involves 4n 3 /3 + O(n 2 ) arithmetic oper-
ations. Assembling the matrix U = H1 · · · Hn−2 requires 4n 3 /3 + O(n 2 ) additional
operations. The benefit of the BLAS2 variant over that of the original BLAS1 based
scheme is illustrated in [33].
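A sketch (ours) of one step of the reduction just described follows; it is organized around the rank-2 update v = αAu, w = v − ½α(u^⊤v)u, A ← A − (uw^⊤ + wu^⊤), with the standard choices α = 2/(u^⊤u) and a Householder vector u that zeroes the current column below the subdiagonal (these choices are assumptions, since they are not spelled out in the excerpt above).

```python
import numpy as np

def tridiag_step(A, k):
    """One step of Householder tridiagonalization (in place, A symmetric)."""
    n = A.shape[0]
    x = A[k + 1:, k].copy()
    norm_x = np.linalg.norm(x)
    if norm_x == 0.0:
        return
    u = np.zeros(n)
    u[k + 1:] = x
    u[k + 1] += np.sign(x[0]) * norm_x if x[0] != 0 else norm_x
    alpha = 2.0 / (u @ u)
    v = alpha * (A @ u)                     # one symmetric matrix-vector product
    w = v - 0.5 * alpha * (u @ v) * u
    A -= np.outer(u, w) + np.outer(w, u)    # symmetric rank-2 update

rng = np.random.default_rng(8)
B = rng.standard_normal((6, 6)); A = B + B.T
eig_before = np.sort(np.linalg.eigvalsh(A))
for k in range(A.shape[0] - 2):
    tridiag_step(A, k)
print(np.allclose(np.triu(A, 2), 0, atol=1e-10))                 # tridiagonal now
print(np.allclose(eig_before, np.sort(np.linalg.eigvalsh(A))))   # spectrum preserved
```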
As indicated in Chap. 7, successive applications of Householder reductions can
be done in blocks (block Householder transformations) which allows the use of
BLAS3 [34]. Such reduction is implemented in routine DSYTRD of LAPACK [35]
which also takes advantage of the symmetry of the process. A parallel version of the
algorithm is implemented in routine PDSYTRD of ScaLAPACK [36].
A_0 = A and Q_0 = I; for k ≥ 0:
   (Q_{k+1}, R_{k+1}) = QR-factorization of A_k,
   A_{k+1} = R_{k+1} Q_{k+1}.    (8.31)

T_0 = T and Q_0 = I; for k ≥ 0:
   (Q_{k+1}, R_{k+1}) = QR-factorization of (T_k − μ_k I),
   T_{k+1} = R_{k+1} Q_{k+1} + μ_k I.    (8.32)
ρ = β_m/θ,
T_1 = T_{1:m,1:m} − ρ e_m e_m^⊤,   and   T_2 = T_{m+1:n,m+1:n} − ρ θ^2 e_{m+1} e_{m+1}^⊤,
where the parameter θ is an arbitrary nonzero real number. A strategy for determining a safe partitioning is given in [38].
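The tearing just described can be checked numerically: T equals diag(T_1, T_2) plus a rank-one correction built from e_m and θe_{m+1}. The sketch below (ours) constructs both sides for a random symmetric tridiagonal matrix; the identification of the torn coupling element with T[m−1, m] (0-based) is our indexing convention.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m = 8, 4                      # tear T between rows m and m+1 (1-based)
d = rng.standard_normal(n)
off = rng.standard_normal(n - 1)
T = np.diag(d) + np.diag(off, 1) + np.diag(off, -1)

theta = 1.0                      # any nonzero value is admissible
beta = T[m - 1, m]               # coupling element removed by the tearing
rho = beta / theta

T1 = T[:m, :m].copy();  T1[-1, -1] -= rho               # T_{1:m,1:m} - rho e_m e_m^T
T2 = T[m:, m:].copy();  T2[0, 0]  -= rho * theta**2     # shifted leading entry

v = np.zeros(n)                  # v = e_m + theta * e_{m+1}
v[m - 1], v[m] = 1.0, theta

T_rebuilt = np.block([[T1, np.zeros((m, n - m))],
                      [np.zeros((n - m, m)), T2]]) + rho * np.outer(v, v)
print(np.allclose(T_rebuilt, T))          # True: the rank-one tearing is exact
```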
Assume that the full spectral decompositions of T_1 and T_2 are already available: Q_i^⊤ T_i Q_i = Λ_i, where Λ_i is a diagonal matrix and Q_i an orthogonal matrix, for i = 1, 2. Let
Q̃ = \begin{pmatrix} Q_1 & 0 \\ 0 & Q_2 \end{pmatrix}   and   Λ̃ = \begin{pmatrix} Λ_1 & 0 \\ 0 & Λ_2 \end{pmatrix}.
Since λ is not a diagonal entry of Λ̃, the eigenvalues of J are the zeros of the function f(λ) = det(I + ρ(Λ̃ − λI)^{−1} z z^⊤), or the roots of

1 + ρ \sum_{k=1}^{n} \frac{ζ_k^2}{λ̃_k − λ} = 0.    (8.33)
Following the early work in [39], or the TREPS procedure in [40], one can use
the Sturm sequence properties to enable the computation of all the eigenvalues of a symmetric tridiagonal matrix
T = [β_i, α_i, β_{i+1}].
Let
p_n(λ) = det(T − λI).
Then the sequence of the principal minors of the matrix can be built using the fol-
lowing recursion:
p_0(λ) = 1,
p_1(λ) = α_1 − λ,    (8.35)
p_i(λ) = (α_i − λ) p_{i−1}(λ) − β_i^2 p_{i−2}(λ),   i = 2, . . . , n.
Let
q_i(λ) = \frac{p_i(λ)}{p_{i−1}(λ)},   i = 1, . . . , n.
The second order linear recurrence (8.35) is then replaced by the nonlinear recurrence
q_1(λ) = α_1 − λ,   q_i(λ) = α_i − λ − \frac{β_i^2}{q_{i−1}(λ)},   i = 2, . . . , n.    (8.36)
Here, the number of eigenvalues that are smaller than λ is equal to the number of negative terms in the sequence (q_i(λ))_{i=1,...,n}. It can easily be proved that q_i(λ) is the ith diagonal element of D in the factorization L D L^⊤ of T − λI. Therefore, for any i = 1, . . . , n,
p_i(λ) = \prod_{j=1}^{i} q_j(λ).
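A sketch (ours) of the eigenvalue count based on the nonlinear recurrence (8.36), together with a simple bisection that uses it, is given below; by Sylvester's law of inertia the count equals the number of eigenvalues of T smaller than λ. The Gerschgorin starting interval and the guard against a zero pivot are our choices.

```python
import numpy as np

def count_smaller(alpha, beta, lam, tiny=1e-300):
    """Number of eigenvalues of T = tridiag(beta, alpha, beta) below lam,
    using the recurrence (8.36) for the L D L^T pivots q_i(lam)."""
    count, q = 0, alpha[0] - lam
    if q < 0.0:
        count += 1
    for i in range(1, len(alpha)):
        if q == 0.0:
            q = tiny                       # guard against division by zero
        q = alpha[i] - lam - beta[i - 1] ** 2 / q
        if q < 0.0:
            count += 1
    return count

def kth_eigenvalue(alpha, beta, k, tol=1e-12):
    """Return the k-th smallest eigenvalue (k = 1, ..., n) by bisection."""
    r = 2.0 * np.max(np.abs(beta)) if len(beta) else 0.0
    lo, hi = np.min(alpha) - r, np.max(alpha) + r      # Gerschgorin bounds
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if count_smaller(alpha, beta, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(10)
alpha = rng.standard_normal(12)
beta = rng.standard_normal(11)
T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
exact = np.sort(np.linalg.eigvalsh(T))
print(abs(kth_eigenvalue(alpha, beta, 3) - exact[2]) < 1e-10)     # True
```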
Given an initial interval, we can find all the eigenvalues lying in it by repeated
bisection or multisection of the interval. This partitioning process can be performed
until we obtain each eigenvalue to a given accuracy. On the other hand, we can stop
the process once we have isolated each eigenvalue. In the latter case, the eigenvalues
may be extracted using a faster method. Several methods are available for extracting
an isolated eigenvalue:
• Bisection (linear convergence);
• Newton’s method (quadratic convergence);
• The ZEROIN scheme [42], which is based on a combination of the secant and bisection methods (convergence of order (√5 + 1)/2).
Ostrowski [43] defines an efficiency index which links the amount of computation
to be done at each step and the order of the convergence. The respective indices of the
three methods are 1, 1.414 and 1.618. This index, however, is not the only aspect to
be considered here. Both ZEROIN and Newton methods require the use of the linear
recurrence (8.35) in order to obtain the value of det(T − λI ), and its first derivative,
for a given λ. Hence, if the possibility of over- or underflow is small, selecting the
ZEROIN method is recommended, otherwise selecting the bisection method is the
safer option. After the computation of an eigenvalue, the corresponding eigenvector
can be found by inverse iteration [3], which is normally a very fast process that often
requires no more than one iteration to achieve convergence to a low relative residual.
When some eigenvalues are computationally coincident, the isolation process
actually performs “isolation of clusters”, where a cluster is defined as a single eigen-
value or a number of computationally coincident eigenvalues. If such a cluster of
coincident eigenvalues is isolated, the extraction stage is skipped since convergence
has been achieved.
The whole computation consists of the following five steps:
1. Isolation by partitioning;
2. Extraction of a cluster by bisection or by the ZEROIN method;
3. Computation of the eigenvectors of the cluster by inverse iteration;
4. Grouping of close eigenvalues;
5. Orthogonalization of the corresponding groups of vectors by the modified Gram-
Schmidt process.
The method TREPS (standing for Tridiagonal Eigenvalue Parallel Solver by Sturm
sequences) is listed as Algorithm 8.4 and follows this strategy.
The Partitioning Process
Parallelism in this process is obviously achieved by performing simultaneously the
computation of several Sturm sequences. However, there are several ways for achiev-
ing this. Two options are:
• Performing bisection on several intervals, or
• Partitioning of one interval into several subintervals.
A multisection of order k splits the interval [δ, γ ] into k+1 subintervals [μi , μi+1 ],
where μi = δ + i((γ − δ)/(k + 1)) for i = 0, . . . , k + 1. If interval I contains only
one eigenvalue, approximating it with an absolute error ε will require
n_k = \left\lceil \log_2\!\left( \frac{γ − δ}{2ε} \right) \Big/ \log_2(k + 1) \right\rceil
multisection steps of order k. Hence, for the extraction of an isolated eigenvalue, performing bisections on several intervals in parallel is preferable to a single multisection of higher order. On the other hand, during the isolation step, the effi-
ciency of multisectioning is higher because: (i) multisectioning creates more tasks
than bisection, and (ii) often, one interval contains far more than one eigenvalue.
A reasonable strategy for choosing bisection or multisectioning is outlined in [44].
The algorithm of Sect. 3.2.2 may be used, however, to capitalize on the vectorization
possible in evaluating the linear recurrence (8.35). For a vector length k < n/2 the
total number of arithmetic operations in the parallel algorithm is roughly 10n +
11k, compared to only 4n for the uniprocessor algorithm (we do not consider the
operations needed to compute βi2 since these quantities can be provided by the user),
resulting in an arithmetic redundancy which varies between 2.5 and 4. This algorithm,
therefore, is efficient only when vector operations are at least 4 times faster than their
sequential counterparts.
Computation of the Eigenvectors and Orthonormalization
The computation of an eigenvector can be started as soon as the corresponding eigen-
value is computed. It is obtained by inverse iteration (see Algorithm 8.5). When the
References
1. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
2. Stewart, G.W., Sun, J.: Matrix Perturbation Theory. Academic Press, Boston (1990)
3. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
4. Parlett, B.: The Symmetric Eigenvalue Problem. SIAM (1998)
5. Hestenes, M.: Inversion of matrices by biorthogonalization and related results. J. Soc. Ind.
Appl. Math. 6(1), 51–90 (1958). doi:10.1137/0106005, http://epubs.siam.org/doi/abs/10.1137/
0106005
6. Forsythe, G.E., Henrici, P.: The cyclic Jacobi method for computing the principal values of a
complex matrix (January 1960)
7. Demmel, J., Veselic, K., Physik, L.M., Hagen, F.: Jacobi’s method is more accurate than QR.
SIAM J. Matrix Anal. Appl 13, 1204–1245 (1992)
8. Schönhage, A.: Zur konvergenz des Jacobi-verfahrens. Numer. Math. 3, 374–380 (1961)
9. Wilkinson, J.H.: Note on the quadratic convergence of the cyclic Jacobi process. Numer. Math.
4, 296–300 (1962)
10. Sameh, A.: On Jacobi and Jacobi-like algorithms for a parallel computer. Math. Comput. 25,
579–590 (1971)
11. Luk, F., Park, H.: A proof of convergence for two parallel Jacobi SVD algorithms. IEEE Trans.
Comp. (to appear)
12. Luk, F., Park, H.: On the equivalence and convergence of parallel Jacobi SVD algorithms. IEEE
Trans. Comp. 38(6), 806–811 (1989)
13. Henrici, P.: On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing
eigenvalues of Hermitian matrices. Soc. Ind. Appl. Math. 6, 144–162 (1958)
14. Brent, R., Luk, F.: The solution of singular-value and symmetric eigenvalue problems on
multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6(1), 69–84 (1985)
15. Sameh, A.: Solving the linear least-squares problem on a linear array of processors. In: L.
Snyder, D. Gannon, L.H. Jamieson, H.J. Siegel (eds.) Algorithmically Specialized Parallel
Computers, pp. 191–200. Academic Press (1985)
16. Kaiser, H.: The JK method: a procedure for finding the eigenvectors and eigenvalues of a real
symmetric matrix. Computers 15, 271–273 (1972)
17. Nash, J.: A one-sided transformation method for the singular value decomposition and algebraic
eigenproblem. Computers 18(1), 74–76 (1975)
18. Luk, F.: Computing the singular value decomposition on the Illiac IV. ACM Trans. Math. Sftw.
6(4), 524–539 (1980)
19. Brent, R., Luk, F., van Loan, C.: Computation of the singular value decomposition using mesh
connected processors. VLSI Comput. Syst. 1(3), 242–270 (1985)
20. Berry, M., Sameh, A.: An overview of parallel algorithms for the singular value and symmetric
eigenvalue problems. J. Comp. Appl. Math. 27, 191–213 (1989)
21. Charlier, J., Vanbegin, M., van Dooren, P.: On efficient implementations of Kogbetlianz’s
algorithm for computing the singular value decomposition. Numer. Math. 52, 279–300 (1988)
22. Paige, C., van Dooren, P.: On the quadratic convergence of Kogbetliantz’s algorithm for com-
puting the singular value decomposition. Numer. Linear Algebra Appl. 77, 301–313 (1986)
23. Charlier, J., van Dooren, P.: On Kogbetliantz’s SVD algorithm in the presence of clusters.
Numer. Linear Algebra Appl. 95, 135–160 (1987)
24. Wilkinson, J.: Almost diagonal matrices with multiple or close eigenvalues. Numer. Linear
Algebra Appl. 1, 1–12 (1968)
25. Luk, F.T., Park, H.: On parallel Jacobi orderings. SIAM J. Sci. Stat. Comput. 10(1), 18–26
(1989)
26. Bečka, M., Vajteršic, M.: Block-Jacobi SVD algorithms for distributed memory systems I:
hypercubes and rings. Parallel Algorithms Appl. 13, 265–287 (1999)
27. Bečka, M., Vajteršic, M.: Block-Jacobi SVD algorithms for distributed memory systems II:
meshes. Parallel Algorithms Appl. 14, 37–56 (1999)
28. Bečka, M., Okša, G., Vajteršic, M.: Dynamic ordering for a parallel block-Jacobi SVD algo-
rithm. Parallel Comput. 28(2), 243–262 (2002). doi:10.1016/S0167-8191(01)00138-7, http://
dx.doi.org/10.1016/S0167-8191(01)00138-7
29. Okša, G., Vajteršic, M.: Efficient pre-processing in the parallel block-Jacobi SVD algo-
rithm. Parallel Comput. 32(2), 166–176 (2006). doi:10.1016/j.parco.2005.06.006, http://www.
sciencedirect.com/science/article/pii/S0167819105001341
30. Bečka, M., Okša, G., Vajteršic, M.: Parallel Block-Jacobi SVD methods. In: M. Berry, K.
Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, F. Saied (eds.) High-Performance
Scientific Computing, pp. 185–197. Springer, London (2012), http://dx.doi.org/10.1007/978-
1-4471-2437-5_1
31. Dongarra, J., Croz, J.D., Hammarling, S., Hanson, R.: An extended set of FORTRAN basic
linear algebra subprograms. ACM Trans. Math. Softw. 14(1), 1–17 (1988)
32. Dongarra, J.J., Kaufman, L., Hammarling, S.: Squeezing the most out of eigenvalue solvers on
high-performance computers. Linear Algebra Appl. 77, 113–136 (1986)
33. Gallivan, K.A., Plemmons, R.J., Sameh, A.H.: Parallel algorithms for dense linear algebra
computations. SIAM Rev. 32(1), 54–135 (1990). http://dx.doi.org/10.1137/1032002
34. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
35. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
36. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). http://www.netlib.org/scalapack
37. Cuppen, J.: A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer.
Math. 36, 177–195 (1981)
38. Dongarra, J.J., Sorensen, D.C.: A fully parallel algorithm for the symmetric eigenvalue problem.
SIAM J. Sci. Stat. Comput. 8(2), s139–s154 (1987)
39. Kuck, D., Sameh, A.: Parallel computation of eigenvalues of real matrices. In: Information
Processing ’71, pp. 1266–1272. North-Holland (1972)
40. Lo, S.S., Philippe, B., Sameh, A.: A multiprocessor algorithm for the symmetric tridiagonal
eigenvalue problem. SIAM J. Sci. Stat. Comput. 8, S155–S165 (1987)
41. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford University Press, New York (1965)
42. Forsythe, G., Malcom, M., Moler, C.: Computer Methods for Mathematical Computation.
Prentice-Hall, New Jersey (1977)
43. Ostrowski, A.: Solution of Equations and Systems of Equations. Academic Press, New York
(1966)
44. Bernstein, H., Goldstein, M.: Parallel implementation of bisection for the calculation of eigen-
values of tridiagonal symmetric matrices. Computing 37, 85–91 (1986)
45. Garbow, B.S., Boyle, J.M., Dongarra, J.J., Moler, C.B.: Matrix Eigensystem Routines—
EISPACK Guide Extension. Springer, Heidelberg (1977)
46. Demmel, J.W., Dhillon, I., Ren, H.: On the correctness of some bisection-like parallel eigen-
value algorithms in floating point arithmetic. Electron. Trans. Numer. Anal. pp. 116–149 (1995)
47. Dhillon, D., Parlett, B., Vömel, C.: The design and implementation of the MRRR algorithm.
ACM Trans. Math. Softw. 32, 533–560 (2006)
48. Ralha, R.: One-sided reduction to bidiagonal form. Linear Algebra Appl. 358(1–3), 219–238
(2003)
49. Bosner, N., Barlow, J.L.: Block and parallel versions of one-sided bidiagonalization. SIAM J.
Matrix Anal. Appl. 29, 927–953 (2007)
Part III
Sparse Matrix Computations
Chapter 9
Iterative Schemes for Large Linear Systems
9.1 An Example
Au = g, (9.2)
in which A_j = [γ_i^{(j)}, α_i^{(j)}, β_i^{(j)}], j = 1, 2, . . . , n, is tridiagonal of order n, B_j = diag(μ_1^{(j)}, μ_2^{(j)}, . . . , μ_n^{(j)}), and C_j = diag(ν_1^{(j)}, ν_2^{(j)}, . . . , ν_n^{(j)}). Correspondingly, we write
u^⊤ = (u_1^⊤, u_2^⊤, . . . , u_n^⊤),
and
If we assume that a(x, y), c(x, y) > 0, and f(x, y) ≥ 0 on the unit square, then α_i^{(j)} > 0, and, provided that
0 < h < \min_{i,j} \left( \frac{2 a_{i±1/2, j}}{|d_{ij}|}, \frac{2 c_{i±1/2, j}}{|e_{ij}|} \right),
where d_{ij} and e_{ij} are the values of d(ih, jh) and e(ih, jh), respectively, while a_{i±1/2, j} and c_{i±1/2, j} denote the values of a(x, y) and c(x, y) on the staggered grid. Thus, we have
β_i^{(j)}, γ_i^{(j)}, μ_i^{(j)}, and ν_i^{(j)} < 0.    (9.4)
Furthermore,
α_i^{(j)} ≥ |β_i^{(j)} + γ_i^{(j)} + μ_i^{(j)} + ν_i^{(j)}|.    (9.5)
With the above assumptions it can be shown that the linear system (9.2) has a unique
solution [13]. Before we show this, however, we would like to point out that the above
block-tridiagonal system is a special one with particular properties. Nevertheless, we will
use this system to illustrate some of the basic iterative methods for solving sparse
linear systems. We will explore the properties of such a system by first presenting
the following preliminaries. These and their proofs can be found in many textbooks,
see for instance the classical treatises [13–15].
Definition 9.1 A square matrix B is irreducible if there exists no permutation matrix Q for which
Q^⊤ B Q = \begin{pmatrix} B_{11} & B_{12} \\ 0 & B_{22} \end{pmatrix},
where B_{11} and B_{22} are square blocks.
It is clear then from Theorem 9.2 that the linear system (9.2) has a unique solution
u, which can be obtained by an iterative or a direct linear system solver. We explore
first some basic iterative methods based on classical options for matrix splitting.
A = M − N, (9.6)
Mu = N u + g. (9.7)
Mu k+1 = N u k + g, k ≥ 0, (9.8)
with u 0 chosen arbitrarily. In order to determine the condition necessary for the
convergence of the iteration (9.8), let δu k = u k − u and subtract (9.7) from (9.8) to
obtain the relation
M δu k+1 = N δu k , k ≥ 0. (9.9)
Thus,
δu k = H k δu 0 , (9.10)
Simple examples of regular splittings of A are the classical Jacobi and Gauss-Seidel iterative methods. For example, if we express A as
A = D − L − U,
where D = diag(α_1^{(1)}, . . . , α_n^{(1)}; . . . ; α_1^{(n)}, . . . , α_n^{(n)}), and −L, −U are the strictly lower and upper triangular parts of A, respectively, then the iteration matrices of the Jacobi and Gauss-Seidel schemes are given by
H_J = D^{−1}(L + U)
and
H_{G.S.} = (D − L)^{−1} U,
respectively. The Jacobi scheme can thus be written as
u_{k+1} = H_J u_k + b,   k ≥ 0,    (9.12)
where b = D^{−1} g.
where D_R and D_B are diagonal matrices, each of order n^2/2, and each row of E_R (or E_B) contains no more than 4 nonzero elements. The subscripts (or superscripts) B, R denote the quantities associated with the black and red points, respectively. One can also show that
\begin{pmatrix} D_R & E_R \\ E_B & D_B \end{pmatrix} = P^⊤ A P,
where A is given by (9.3), and P is a permutation matrix. Hence, the iterative point-
Jacobi scheme can be written as
\begin{pmatrix} D_R & 0 \\ 0 & D_B \end{pmatrix} \begin{pmatrix} u_{k+1}^{(R)} \\ u_{k+1}^{(B)} \end{pmatrix} = \begin{pmatrix} 0 & −E_R \\ −E_B & 0 \end{pmatrix} \begin{pmatrix} u_k^{(R)} \\ u_k^{(B)} \end{pmatrix} + \begin{pmatrix} g_R \\ g_B \end{pmatrix},   k ≥ 0.
u_{2k+1}^{(B)} = −\bar{E}_B u_{2k}^{(R)} + \bar{g}_B,
u_{2k+2}^{(R)} = −\bar{E}_R u_{2k+1}^{(B)} + \bar{g}_R,   k = 0, 1, 2, . . .    (9.14)
where
\bar{E}_B = D_B^{−1} E_B,   \bar{E}_R = D_R^{−1} E_R,
\bar{g}_B = D_B^{−1} g_B,   \bar{g}_R = D_R^{−1} g_R.    (9.15)

u_j^{(R)} = −\bar{E}_R u_j^{(B)} + \bar{g}_R.
Clearly, the point Jacobi iteration using the red-black ordering of the mesh points
exhibits a higher degree of parallelism than using the natural ordering.
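To illustrate the two alternating half-steps of (9.14), here is a sketch (ours) of red–black point Jacobi for the model Poisson problem, i.e. the special case of (9.1) with a = c = 1 and d = e = f = 0; all unknowns of one color are updated at once, and within a color the updates are independent (and therefore trivially parallel or vectorizable).

```python
import numpy as np

def redblack_jacobi_poisson(F, sweeps=2000):
    """Red-black point iteration for -Laplace(u) = f, u = 0 on the boundary,
    on a uniform (n+2) x (n+2) grid; F holds h^2 * f at the interior nodes."""
    n = F.shape[0]
    U = np.zeros((n + 2, n + 2))                   # includes the zero boundary
    I, J = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing="ij")
    red = ((I + J) % 2 == 0)
    black = ~red
    for _ in range(sweeps):
        for color in (red, black):
            nbr = U[:-2, 1:-1] + U[2:, 1:-1] + U[1:-1, :-2] + U[1:-1, 2:]
            new = 0.25 * (nbr + F)                 # D^{-1}(g - E u) at every node
            U[1:-1, 1:-1][color] = new[color]      # overwrite one color only
    return U[1:-1, 1:-1]

n, h = 31, 1.0 / 32
x = np.arange(1, n + 1) * h
X, Y = np.meshgrid(x, x, indexing="ij")
F = h * h * 2 * np.pi**2 * np.sin(np.pi * X) * np.sin(np.pi * Y)
U = redblack_jacobi_poisson(F)
print(np.max(np.abs(U - np.sin(np.pi * X) * np.sin(np.pi * Y))))   # ~ O(h^2)
```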
Employing the natural ordering of the uniform mesh, the iterative scheme is given by
Here again the triangular system to be solved has a special structure, and an approximation of the action of (D − L)^{−1} on a column vector can be obtained effectively on a parallel architecture, e.g. via use of the Neumann series. Using the
point red-black ordering of the mesh, and applying the point Gauss-Seidel splitting
to (9.13), we obtain the iteration
\begin{pmatrix} u_{k+1}^{(R)} \\ u_{k+1}^{(B)} \end{pmatrix} = \begin{pmatrix} D_R^{−1} & 0 \\ −D_B^{−1} E_B D_R^{−1} & D_B^{−1} \end{pmatrix} \left[ \begin{pmatrix} 0 & −E_R \\ 0 & 0 \end{pmatrix} \begin{pmatrix} u_k^{(R)} \\ u_k^{(B)} \end{pmatrix} + \begin{pmatrix} g_R \\ g_B \end{pmatrix} \right],   k ≥ 0.
Simplifying, we get
u_{k+1}^{(R)} = −\bar{E}_R u_k^{(B)} + \bar{g}_R,
and    (9.17)
u_{k+1}^{(B)} = −\bar{E}_B u_{k+1}^{(R)} + \bar{g}_B,
where u_0^{(B)} is chosen arbitrarily, and \bar{E}_R, \bar{E}_B, \bar{g}_B, \bar{g}_R are obtained in parallel via a preprocessing stage as given in (9.15). Taking into consideration, however, that [13]
ρ(H_{G.S.}) = ρ^2(H_J),    (9.18)
where
H_J = \begin{pmatrix} D_R^{−1} & 0 \\ 0 & D_B^{−1} \end{pmatrix} \begin{pmatrix} 0 & −E_R \\ −E_B & 0 \end{pmatrix},   H_{G.S.} = \begin{pmatrix} D_R & 0 \\ E_B & D_B \end{pmatrix}^{−1} \begin{pmatrix} 0 & −E_R \\ 0 & 0 \end{pmatrix},
we see that the red-black point-Jacobi scheme (9.14) and the red-black point-Gauss-
Seidel scheme (9.17) are virtually equivalent regarding degree of parallelism and for
solving (9.13) to a given relative residual.
Here we consider the so-called line red-black ordering [15], where one row of mesh points is colored red while the rows immediately below and above it are colored black. In this case the linear system (9.2) is replaced by A′u′ = g′, or
\begin{pmatrix} T_R & F_R \\ F_B & T_B \end{pmatrix} \begin{pmatrix} v^{(R)} \\ v^{(B)} \end{pmatrix} = \begin{pmatrix} f_R \\ f_B \end{pmatrix},    (9.19)
in which
A′ = Q^⊤ A Q,
v_{k+1}^{(R)} = T_R^{−1}(−F_R v_k^{(B)} + f_R),
v_{k+1}^{(B)} = T_B^{−1}(−F_B v_k^{(R)} + f_B),   k ≥ 0,    (9.20)

v_{2k+1}^{(B)} = T_B^{−1}(−F_B v_{2k}^{(R)} + f_B),
v_{2k+2}^{(R)} = T_R^{−1}(−F_R v_{2k+1}^{(B)} + f_R),   k ≥ 0,    (9.21)

with v_0^{(R)} chosen arbitrarily. For efficient implementation of (9.21), we start the pre-
processing stage by obtaining the factorizations A j = D j L j U j , j = 1, 2, . . . , n,
where D j is diagonal, and L j , U j are unit lower and unit upper bidiagonal matrices,
respectively. Since each A j is diagonally dominant, the above factorizations are
obtained by Gaussian elimination without pivoting. Note that processor i handles
the factorization of A2i−1 and A2i . Using an obvious notation, we write
TR = D R L R U R , and TB = D B L B U B , (9.22)
where
G_R = D_R^{−1} F_R,   G_B = D_B^{−1} F_B,    (9.24)
h_R = D_R^{−1} f_R,   h_B = D_B^{−1} f_B,    (9.25)
with v_0^{(R)} chosen arbitrarily. Thus, if we also compute G_R, h_R, and G_B, h_B in the
pre-processing stage, each iteration (9.23) has ample degree of parallelism that can
be further enhanced if solving triangular systems involving each L j is achieved via
any of the parallel schemes outlined in Chap. 3. Considering the ample parallelism
inherent in each iteration of the red-black point Jacobi scheme using n/2 multicore
nodes, it is natural to ask why should one consider the line red-black ordering. The
answer is essentially provided by the following theorem.
Theorem 9.4 ([13]) Let A be that matrix given in (9.2). Also let A = M1 − N1 =
M2 − N2 be two regular splittings of A. If N2 ≥ N1 ≥ 0, with neither N1 nor
(N2 − N1 ) being the null matrix, then
and
TR 0 0 −FR
A = − ,
0 TB −FB 0
= MJ − NJ
A = M_J − N_J = S M′_J S^⊤ − S N′_J S^⊤,
H_J = M_J^{−1} N_J   and   H′_J = (M′_J)^{−1} N′_J,    (9.26)
i.e., the line red-black Jacobi scheme converges in fewer iterations than the point red-black Jacobi scheme. This advantage, however, quickly diminishes as n increases, i.e., as h = 1/(n + 1) → 0, ρ(H′_J) → ρ(H_J). For example, in the case of the Poisson equation, i.e., when a(x, y) = c(x, y) = 1, d(x, y) = e(x, y) = 0, and f(x, y) = f, we have
\frac{ρ(H′_J)}{ρ(H_J)} = \frac{1}{2 − \cos πh}.
with v_0^{(B)} chosen arbitrarily. If, in the pre-processing stage, T_R and T_B are factored as shown in (9.22), the iteration (9.27) reduces to solving the following linear systems
where G and h are as given in (9.24) and (9.25). Consequently, each iteration (9.28) can be performed with a high degree of parallelism using (n/2) nodes, assuming that G_R, G_B, h_R and h_B have already been computed in a preprocessing stage. Similarly, such parallelism can be further enhanced if we use one of the parallel schemes in Chap. 3 for solving each triangular system involving L_j, j = 1, 2, . . . , n.
9.2 Classical Splitting Methods 287
If d(x, y) = e(x, y) = 0 in (9.1), then the linear system (9.2) is symmetric. Since A
( j)
is nonsingular with αi > 0, i, j = 1, 2, . . . , n, then by virtue of Gerschgorin’s the-
orem, e.g., see [16] and (9.5), the system (9.2) is positive-definite. Consequently, the
linear systems in (9.13) and (9.19) are positive-definite. In particular, the matrices TR
and TB in (9.19) are also positive-definite. In what follows, we consider two classical
acceleration schemes: (i) the cyclic Chebyshev semi-iterative method e.g. see [13,
17] for accelerating the convergence of line-Jacobi, and (ii) Young’s successive over-
relaxation, e.g. see [13, 15] for accelerating the convergence of line Gauss-Seidel,
without adversely affecting the parallelism inherent in these two original iterations.
In this section, we do not discuss the notion of multisplitting where the coefficient
matrix A has, say k possible splittings of the form M j − N j , j = 1, 2, ...., k. On
a parallel architecture k iterations as in Eq. (9.8) can proceed in parallel but, under
some conditions, certain combinations of such iterations can converge faster than
any based on one splitting. For example, see [5, 18–20].
The Cyclic Chebyshev Semi-Iterative Scheme
For the symmetric positive definite case, each A j in (9.2) is symmetric positive
definite tridiagonal matrix, and C j+1 = B
j . Hence the red-black line-Jacobi iteration
(9.20) reduces to,
(R)
(R)
TR 0 vk+1 0 −FR vk fR
= + .
0 TB (B)
vk+1 −FR 0 (B)
vk fB
−1
I K bR
x= . (9.30)
K I bB
k
yk = ψ j (K )x j , k≥0 (9.31)
j=0
with the coefficients ψ j (K ) chosen such that yk approaches the solution x faster than
(R) (B)
the iterates x
j = (x j , x j ) produced by (9.29). Again from (9.29) it is clear
that if x0 is chosen as x, then x j = x for j > 0. Hence, if yk is to converge to x we
must require that
k
ψ j (K ) = 1, k ≥ 0.
j=0
where
k
0 −K
H= , qk (H ) = ψ j (K )H j ,
−K 0
j=0
and δy0 = δx0 . Note that qk (I ) = I . If all the eigenvalues of H were known
beforehand, it would have been possible to construct the polynomials qk (H ), k ≥ 1,
such that the exact solution (ignoring roundoff errors) is obtained after a finite number
of iterations. Since this is rarely the case, and since
τk (ξ/ρ)
q̂k (ξ ) = , −ρ ≤ ξ ≤ ρ (9.32)
τk (1/ρ)
τk (ξ ) = cos(k cos−1 ξ ), |ξ | ≤ 1,
= cosh(k cosh−1 ξ ), |ξ | > 1
with τ0 (ξ ) = 1, and τ1 (ξ ) = ξ . Hence, δyk = q̂k (H )δy0 , and from (9.29), (9.32),
and (9.33), we obtain the iteration
2 τk (1/ρ)
ωk+1 = ρ τ
k+1 (1/ρ) (9.35)
1/ 1 − ρ4 ωk ,
2
= k = 2, 3, . . .
where
θ = loge (ρ −1 + ρ −2 − 1).
Such reduction in errors is quite superior to the line-Jacobi iterative scheme (9.21)
in which
δvk ≤ ρ k δv0 .
(B) (R)
z1 = TB−1 ( f B − FR z 0 ),
(R) (B) (R) (9.37)
Δz 2k = TR−1 ( f R − FR z 2k−1 ) − z 2k−2 ,
and
(B) (R)
Δz 2k+1 = TB−1 ( f B − FR z 2k ) − z 2k−1 .
Assuming that we have a good estimate of ρ, the spectral radius of H J , and hence
have available the array ω j , in a preprocessing stage, together with a Cholesky
factorization of each tridiagonal matrix A j = L j D j L j , then each iteration (9.37)
can be performed with almost perfect parallelism. This is assured if each pair A2i−1
and A2i , as well as B2i−1 and B2i are stored in the local memory of the ith multicore
node i, i = 1, 2, . . . , n/2. Note that each matrix-vector multiplication involving FR
or FR , as well as solving tridiagonal systems involving TR or TB in each iteration
(9.37), will be performed with very low communication overhead. If we have no prior
knowledge of ρ, then we either have to deal with evaluating the largest eigenvalue
in modulus of the generalized eigenvalue problem
0 FR TR 0
u=λ u
FR 0 0 TB
before the beginning of the iterations, or use an adaptive procedure for estimating
ρ during the iterations (9.36). In the latter, early iterations can be performed with
nonoptimal acceleration parameters ω j in which ρ 2 is replaced by
(R) (R)
Δz 2k FR TB−1 FR Δz 2k
ρ2k
2
= ,
(R) (R)
Δz 2k TR Δz 2k
and convergence of (9.38) is assured. The following theorem shows how large can
ω be and still assuring that ρ(Hω ) < 1, where
Hω = Mω−1 Nω ,
in which
1
Mω = ω TR 0 , (9.40)
FR ω1 TB
and
1
ω − 1 TR −FR
Nω = . (9.41)
ω − 1 TB
1
0
i.e.,
I 0 (1 − ω)I −ωTR−1 FR
Hω = . (9.42)
−ωTB−1 FR I 0 (1 − ω)I
Proof The proof follows directly from the fact that the matrix
S=Mω + Nω
ω − 1 TR
2
0
=
ω − 1 TB
2
0
where
TR−1 0 0 −FR
H J = .
0 TB−1 −FR 0
Then
(1 − ω − λ)2
TB−1 FR TR−1 FR u 2 = u2.
λω2
(λ + ω − 1)2
μ2 = (9.43)
λω2
λ2 + [2(ω − 1) − ω2 μ2 ]λ + (ω − 1)2 = 0.
1 2 2
λ+ω−1= μ ω . (9.44)
2
From (9.43) and (9.44) ρ(Hω ) is a minimum when
9.2 Classical Splitting Methods 293
ρ 4 (H J )ω4
ρ 2 (H J ) = 1 ,
2 (H )ω2
4ω2 2ρ J −ω+1
i.e.,
ρ 2 (H J )ω2 − 4ω + 4 = 0.
Taking the smaller root of the above quadratic, we obtain the optimal value of ω,
2
ω0 = ,
1+ 1 − ρ 2 (H J )
Using the factorization (9.22) of TR and TB , then similar to the line Gauss-Seidel
iteration (9.23), the line-SOR iteration is given by Algorithm 9.2.5 where ω0 is the
(B)
optimal acceleration parameter, v0 is chosen arbitrarily, and G, h are as given in
(9.23). Assuming that G R , G B , and h R , h B are computed in a preprocessing stage
each iteration exhibits ample parallelism as before provided ω0 is known. If the
optimal parameter ω0 is not known beforehand, early iterations may be performed
with the non-optimal parameters
2
ωj = ,
1 + 1 − μ2j
294 9 Iterative Schemes for Large Linear Systems
where
s −1
j FR T R FR s j
μ2j =
s
j TB s j
(B) (B)
in which s j = v j − v j−1 , and as j increases μ j approaches ρ(H J ), see [21].
x = x0 + A−1 r0 , (9.46)
where pk−1 ∈ Pk−1 is the set of all polynomials of degree no greater than k − 1.
In the following two sections, we consider two such methods. In Sect. 9.3.1 we
consider the use of an explicit polynomial for symmetric positive definite linear
systems. In Sect. 9.3.2 we consider the more general case where the linear system
could be nonsymmetric and the polynomial pk−1 is defined implicitly. This results
in an iterative scheme characterized by the way the vector pk−1 (A)r0 is selected.
This class of methods is referred to as Krylov subspace methods.
In this section we consider the classical Chebyshev method (or Stiefel iteration
[23]), and one of its generalizations—the block Stiefel algorithm [24]—for solving
symmetric positive definite linear systems of the form,
Ax = f, (9.48)
where A is of order n. Further, for the sake of illustration, let us assume that A has
a spectral radius less than 1. Also, let
0 < ν = μ1 μ2 · · · μn = μ < 1,
9.3 Polynomial Methods 295
where ν and μ are the smallest and largest eigenvalues of A, respectively. Then, the
classical Chebyshev iteration (or Stiefel iteration [23]) for solving the above linear
system is given by,
1. Δx j = ω j r j + (γ ω j − 1)Δx j−1 ,
2. x j+1 = x j + Δx j , (9.49)
3. r j+1 = f − Ax j+1 .
2 μ+ν
α= , β=
μ−ν μ−ν
and
−1
1
ω j = γ − 2 ω j−1 , j ≥1
4α
with ω0 = 2/γ .
This iterative scheme produces residuals r j that satisfy the relation
r j = P j (A)r0 (9.50)
τ j (β − αλ)
P j (λ) = , ν λ μ, (9.51)
τ j (β)
As a result
r j 2
[τ j (β)]−1 .
r0 2
296 9 Iterative Schemes for Large Linear Systems
s
n
rj = ηi P j (μi )z i + ηi P j (μi )z i = r j + r j . (9.52)
i=1 i=s+1
r j |2
[τ j (β)]−1 , (9.53)
r0 2
r j 2 τ j (β − αμ1 )
.
r0 2 τ j (β)
The basic strategy of the block Stiefel algorithm is to annihilate the contributions of
the eigenvectors z 1 , z 2 , . . . , z s to the residuals r j so that eventually r j 2 approaches
zero as ζ j = 1/τ j [(μn +μs+1 )/(μn −μs+1 )] rather than ψ j = 1/τ j [(μn +μ1 )/(μn −
μ1 )] as in the classical Stiefel iteration [25]. Let Z = [z 1 , z 2 , . . . , z s ] be the ortho-
normal matrix consisting of the s-smallest eigenvectors. Then, from the fact that
r j = −A(x j − x) a projection process [26] produces the improved iterate
x̂ j = x j + Z (Z AZ )−1 Z r j ,
s
x̂+1 = x+1 + μi (z ir+1 )z i . (9.54)
i=1
While it is reasonable to expect that the optimal parameters μ and ν(μn μ <
1, μs ν < μs+1 ) are known, for example, as a result of previously solving
problem (9.48) with a different right-hand side using the CG algorithm, it may not
be reasonable to assume that Z is known a priori. In this case, the projection step
(9.54) may be performed as follows.
Let k = − s + 1, where is determined as before, i.e., so that τk (β) is large
enough to assure that rk 2 is negligible compared to rk 2 . Now, from (9.50) and
(9.52)
rk Pk (A)Z y,
where y = (η1 , η2 , . . . , ηs ), or
rk Z wk
in which
Consequently,
Let
R = Q U (9.55)
Q Z Θ
x̂+1 = x+1 + Q (Q −1
AQ ) Q r+1 (9.56)
has the desired property that Z r̂+1 0. Note that the eigenvalues of Q
AQ are
good approximations of μ1 , . . . , μs .
The projection stage consists of the six steps shown below.
1. The modified Gram-Schmidt factorization R = Q U , as described in Algo-
rithm 7.3, where
298 9 Iterative Schemes for Large Linear Systems
⎡ ⎤
p11 p12 · · · p1s
(1) (1)
R = [r−s+1 , . . . , r ], Q = [q−s+1 , . . . , q ], and U = ⎣ p22 · · · p2s ⎦ .
pss
Remark 9.1 On a parallel computing platform the block Stiefel iteration takes more
advantage of parallelism than the classical Chebyshev scheme. Further, on a platform
of many multicore nodes (peta- or exa-scale architectures), the block Stiefel itera-
tion could be quite scalable and consume far less time than the conjugate gradient
algorithm (CG) for obtaining a solution with a given level of the relative residual.
This is due to the fact that the block Stiefel scheme avoids the repeated fan-in and
fan-out operations (inner products) needed in each CG iteration.
9.3 Polynomial Methods 299
Modern algorithms for the iterative solution of linear systems involving large sparse
matrices are frequently based on approximations in Krylov subspaces. They allow
for implicit polynomial approximations in classical linear algebra problems, namely:
(i) computation of eigenvalues and corresponding eigenvectors, (ii) solving linear
systems, and (iii) matrix function evaluation. To make such a claim more precise,
we consider the problem of solving the linear system:
Ax = f, (9.57)
Proposition 9.1 With the previous notations, we have the following assertions:
• The sequence of the Krylov subspaces is nested:
• There exists η ≥ 0 such that the previous sequence is increasing for k ≤ η (i.e.
dim (Kk (A, r0 )) = k), and is constant for k ≥ η (i.e. Kk (A, r0 ) = Kη (A, r0 )).
The subspace Kη (A, r0 ) is invariant with respect to A. When A is nonsingular,
the solution of (9.57) satisfies x ∈ x0 + Kη (A, r0 ).
• If Ak+1 r0 ∈ Kk (A, r0 ), then Kk (A, r0 ) is an invariant subspace of A.
Working with the Krylov subspace Kk (A, r0 ) usually requires the knowledge of a
basis. As we will point out later, the canonical basis {r0 , Ar0 , A2 r0 , . . . , Ak−1 r0 } is
ill-conditioned and hence not appropriate for direct use. The most robust approach
300 9 Iterative Schemes for Large Linear Systems
Theorem 9.7 (Arnoldi identities) The matrices Vm+1 and Ĥm satisfy the following
relation:
where
• em ∈ Rm is the last canonical vector of Rm ,
• Hm ∈ Rm×m is the matrix obtained by deleting the last row of Ĥm .
The upper Hessenberg matrix Hm corresponds to the projection of the restriction of
the operator A on the Krylov subspace with basis Vm :
where the matrix Wm+1 ∈ Rn×(m+1) has orthonormal columns with Wm+1 e1 =
r0 /r0 , and where Rm+1 ∈ R(m+1)×(m+1) is upper triangular, then Z m = Wm Rm ,
where Rm is the leading m × m principal submatrix of Rm+1 , and Wm consists of the
first m columns of Wm+1 . Consequently, Z m = Wm Rm is an orthogonal factorization.
Substituting the QR factorizations of Z m+1 and Z m into (9.63) yields
−1
AWm = Wm+1 Ĝ m , Ĝ m = Rm+1 T̂m Rm , (9.65)
Proposition 9.2 Assume that m steps of the Arnoldi process can be applied to the
matrix A ∈ Rn×n with an initial vector r0 ∈ Rn without breakdown yielding the
decomposition (9.58). Let
AWm = Wm+1 Ĝ m
Proof The proposition is a consequence of the Implicit Q Theorem (e.g. see [16]).
max Z m+1 y
y=1
cond(Z m+1 ) = , (9.66)
min Z m+1 y
y=1
then, when the basis Z m+1 is ill-conditioned the recovery fails since cond(Z m+1 ) =
cond(Rm+1 ). This is exactly the situation with the canonical basis. In the next section,
we propose techniques for choosing the scalars αk , βk , and γk in (9.62) in order to
limit the growth of the condition number of Z k as k increases.
Limiting the Growth of the Condition Number of Krylov Subspace Bases
The goal is therefore to define more appropriate recurrences of the type (9.61). We
explore two options: (i) using orthogonal polynomials, and (ii) defining a sequence
304 9 Iterative Schemes for Large Linear Systems
of shifts in the recursion (9.61). A comparison of the two approaches with respect
to controling the condition number is given in [35]. Specifically, both options are
compared by examining their effect on the convergence speed of GMRES.
The algorithms corresponding to these options require some knowledge of the
spectrum of the matrix A, i.e. Λ(A). Note that, Λ(A) need not be determined to
high accuracy. Applying, Algorithm 9.3 with a small basis dimension m 0 ≤ m, the
convex hull of Λ(A) is estimated by the eigenvalues of the Hessenberg matrix Hm 0 .
( j)
Since the Krylov bases are repeatedly built from a sequence of initial vectors r0 ,
the whole computation is not made much more expensive by such an estimation of
Λ(A). Moreover, at each restart, the convex hull may be updated by considering the
convex hull of the union of the previous estimates and the Ritz values obtained from
the last basis generation.
In the remainder of this section, we assume that A ∈ Cn×n . We shall indicate how
to maintain real arithmetic operations when the matrix A is real.
Chebyshev Polynomial Bases
The short generation recurrences of Chebyshev polynomials of the first kind are
ideal for our objective of creating Krylov subspace bases. As outlined before, such
recurrence is given by
Generating such Krylov bases has been discussed in several papers, e.g. see [35–37].
In what follows, we adopt the presentation given in [35]. Introducing the family of
ellipses in C,
E (ρ) := eiθ + ρ −2 e−iθ : −π < θ ≤ π , ρ ≥ 1,
where E (ρ) has foci at ±c with c = 2ρ −1 , with semi-major and semi-minor axes
given, respectively, by α = 1 + ρ12 and β = 1 − ρ12 , in which
α+β
ρ= . (9.68)
c
When ρ grows from 1 to ∞, the ellipse E (ρ) evolves from the interval [−2, +2] to
the unit circle.
Consider the scaled Chebyshev polynomials
(ρ) 1 ρ
Ck (z) := Tk ( z), k = 0, 1, 2, . . . .
ρk 2
and
(ρ) 1 ikθ
Ck (eiθ + ρ −2 e−iθ ) = e + ρ −2k e−ikθ , (9.69)
2
(ρ)
it follows that when z ∈ E (ρ) we have Ck (z) ∈ 21 E (ρ ) . Consequently, it can be
2k
(ρ)
proved that, for any ρ, the scaled Chebyshev polynomials (Ck ) are quite well con-
ditioned on E (ρ) for the uniform norm [35, Theorem 3.1]. This result is independent
of translation, rotation, and scaling of the ellipse, provided that the standard Cheby-
shev polynomials Tk are translated, rotated, and scaled accordingly. Let E (c1 , c2 , τ )
denote the ellipse with foci c1 and c2 and semi-major axis of length τ . This ellipse
can then be mapped by a similarity transformation φ(z) = μz + ν onto the ellipse
E (ρ) = E (− 2ρ1
, 2ρ
1
, 1 + ρ12 ) with a suitable value of ρ ≥ 1 by translation, rotation,
and scaling:
⎡
m = c1 +c
2 , is the center point of E (c1 , c2 , τ ),
2
⎢c |c2 −c1 |
= √ 2 , is the semi-focal distance of E (c1 , c2 , τ ),
⎢
⎢
⎢s = τ 2 − c2 , is the semi-minor axis of E (c1 , c2 , τ ),
⎢ (9.70)
⎢ρ = τ +s
c ,
⎢
⎣μ = ρ(c22−m) ,
ν = −μm.
The three last expressions are obtained from (9.68) and from the equalities φ(m) = 0
and φ(c2 ) = ρ2 .
It is now easy to define polynomials Sk (z) which are well conditioned on the
ellipse E (c1 , c2 , τ ), by defining for k ≥ 0
(ρ)
Sk (z) = Ck (φ(z)). (9.71)
Since
(ρ)
Sk+1 (z) = Ck+1 (φ(z))
1 ρ
= k+1 Tk+1 ( φ(z))
ρ 2
1 ρ ρ
= k+1 ρφ(z) Tk ( φ(z)) − Tk−1 ( φ(z)) ,
ρ 2 2
306 9 Iterative Schemes for Large Linear Systems
form a basis for the Krylov subspace Km+1 (A, r0 ), with m being a small integer.
Then the matrix T̂m , which satisfies (9.63), is given by,
⎛ ⎞
− 2ν
μ μ
2
⎜ 1
−μν 1 ⎟
⎜ μρ 2 μ ⎟
⎜ ⎟
⎜ 1
− μν 1 ⎟
⎜ μρ 2 μ ⎟
⎜ .. .. ⎟
⎜ . . ⎟
⎟ ∈ C(m+1)×m .
1
T̂m = ⎜ μρ 2 (9.75)
⎜ ⎟
⎜ .. ⎟
⎜ . − μν μ1 ⎟
⎜ ⎟
⎜ 1
− μν ⎟
⎝ μρ 2 ⎠
1
μρ 2
The remaining question concerns how the ellipse is selected for a given operator
A ∈ Cn×n . By extension of the special situation when A is normal, one selects the
smallest ellipse which contains the spectrum Λ(A) of A, see [35] and Sect. 9.3.2.
A straightforward application of the technique presented here is described in
Algorithm 9.4.
−1 .
in which Ťm = Dm+1 T̂m Dm
Newton Polynomial Bases
This approach has been developed in [39] from an earlier result [40], and its imple-
mentation on parallel architectures is given in [32]. More recently, an improvement
of this approach has been proposed in [35].
Recalling the notations of Sect. 9.3.2, we consider the recurrence (9.61) with the
additional condition that γk = 0 for k ≥ 1. This condition reduces the matrix T̂m
(introduced in (9.62)) to the bidiagonal form. Denoting this bidiagonal matrix by
B̂m , the corresponding polynomials pk (z) can then be generated by the following
recurrence
z k+1 = ηk (A − βk I )z k , k = 1, 2, . . . , m. (9.78)
Therefore
with
⎛ ⎞
β1
⎜ α1 β2 ⎟
⎜ ⎟
⎜ α2 β3 ⎟
⎜ ⎟
⎜ . . ⎟
⎜
B̂m = ⎜ α3 . ⎟ ∈ C(m+1)×m . (9.80)
⎟
⎜ .. ⎟
⎜ . β ⎟
⎜ m−1 ⎟
⎝ αm−1 βm ⎠
αm
Definition 9.5 (Leja points) Let S be a compact set in C, such that (C ∪ {∞})\S
is connected and possesses a Green’s function. Let ζ1 ∈ S be arbitrary and let ζ j
for j = 2, 3, 4, . . . , satisfy
"
k "
k
|ζk+1 − ζ j | = max |z − ζ j |, ζk+1 ∈ S , k = 1, 2, 3, . . . . (9.81)
z∈S
j=1 j=1
In [39], the set S is chosen to be Λ(Hm ), the spectrum of the upper Hessenberg
matrix Hm generated in Algorithm 9.3. These eigenvalues of Λ(Hm ) are the Ritz
values of A corresponding to the Krylov subspace Km (A, r0 ). They are sorted with
respect to the Leja ordering, i.e., they are ordered to satisfy (9.81) with S = Λ(Hm ),
and are used as the nodes βk in the Newton polynomials (9.78).
In [35], this idea is extended to handle more elaborate convex sets S containing
the Ritz values. This allows starting the process with a modest integer m 0 to determine
the convex hull of Λ(Hm 0 ). From this set, an infinite Leja sequence for which S =
Λ(Hm ). Note that the sequence has at most m terms. Moreover, when an algorithm
builds a sequence of Krylov subspaces, the compact set S can be updated by S :=
co (S ∪ Λ(Hm )) at every restart.
The general pattern of a Newton-Krylov procedure is given by Algorithm 9.5.
When the matrix A is real, we can choose the set S to be symmetric with respect
to the real axis. In such a situation, the nonreal shifts βk are supposed to appear in
conjugate complex pairs. The Leja ordering is adapted to keep consecutive elements
of the same pair. Under this assumption, the recurrence (which appears in Algo-
rithm 9.6) involves real operations. The above bidiagonal matrix B̂m becomes the
tridiagonal matrix T̂m since at each conjugate pair of shifts an entry appears on the
superdiagonal.
References 309
References
1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings
of 7th International Conference World Wide Web, pp. 107–117. Elsevier Science Publishers
B.V, Brisbane (1998)
2. Langville, A., Meyer, C.: Google’s PageRank and Beyond: The Science of Search Engine
Rankings. Princeton University Press, Princeton (2006)
3. Gleich, D., Zhukov, L., Berkhin., P.: Fast parallel PageRank: a linear system approach. Technical
report, Yahoo Corporate (2004)
4. Gleich, D., Gray, A., Greif, C., Lau, T.: An inner-outer iteration for computing PageRank.
SIAM J. Sci. Comput. 32(1), 349–371 (2010)
5. Bahi, J., Contassot-Vivier, S., Couturier, R.: Parallel Iterative Algorithms. Chapman &
Hall/CRC, Boca Raton (2008)
6. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation. Prentice Hall, Engle-
wood Cliffs (1989)
7. Kollias, G., Gallopoulos, E., Szyld, D.: Asynchronous iterative computations with web infor-
mation retrieval structures: The PageRank case. In: PARCO, pp. 309–316 (2005)
8. Ishii, H., Tempo, R.: Distributed randomized algorithms for the PageRank computation. IEEE
Trans. Autom. Control 55(9), 1987–2002 (2010)
9. Kalnay, E., Takacs, L.: A simple atmospheric model on the sphere with 100% parallelism.
Advances in Computer Methods for Partial Differential Equations IV (1981). http://ntrs.nasa.
gov/archive/nasa/casi.ntrs.nasa.gov/19820017675.pdf. Also published as NASA Technical
Memorandum No. 83907, Laboratory for Atmospheric Sciences, Research Review 1980–1981,
pp. 89–95, Goddard Sapce Flight Center, Maryland, December 1981
10. Gallopoulos, E.: Fluid dynamics modeling. In: Potter, J.L. (ed.) The Massively Parallel Proces-
sor, pp. 85–101. The MIT Press, Cambridge (1985)
11. Gallopoulos, E., McEwan, S.D.: Numerical experiments with the massively parallel processor.
In: Proceedings of the 1983 International Conference on Parallel Processing, August 1983,
pp. 29–35 (1983)
310 9 Iterative Schemes for Large Linear Systems
12. Potter, J. (ed.): The Massively Parallel Processor. MIT Press, Cambridge (1985)
13. Varga, R.S.: Matrix Iterative Analysis. Prentice Hall Inc., Englewood Cliffs (1962)
14. Wachspress, E.L.: Iterative Solution of Elliptic Systems. Prentice-Hall Inc., Englewood Cliffs
(1966)
15. Young, D.: Iterative Solution of Large Linear Systems. Academic Press, New York (1971)
16. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
17. Golub, G.H., Varga, R.S.: Chebychev semi-iterative methods, successive overrelaxation itera-
tive methods, and second order Richardson iterative methods: part I. Numer. Math. 3, 147–156
(1961)
18. O’Leary, D., White, R.: Multi-splittings of matrices and parallel solution of linear systems.
SIAM J. Algebra Discret. Method 6, 630–640 (1985)
19. Neumann, M., Plemmons, R.: Convergence of parallel multisplitting iterative methods for
M-matrices. Linear Algebra Appl. 88–89, 559–573 (1987)
20. Szyld, D.B., Jones, M.T.: Two-stage and multisplitting methods for the parallel solution of
linear systems. SIAM J. Matrix Anal. Appl. 13, 671–679 (1992)
21. Hageman, L., Young, D.: Applied Iterative Methods. Academic Press, New York (1981)
22. Keller, H.: On the solution of singular and semidefinite linear systems by iteration. J. Soc.
Indus. Appl. Math. 2(2), 281–290 (1965)
23. Stiefel, E.L.: Kernel polynomials in linear algebra and their numerical approximations. U.S.
Natl. Bur. Stand. Appl. Math. Ser. 49, 1–22 (1958)
24. Saad, Y., Sameh, A., Saylor, P.: Solving elliptic difference equations on a linear array of
processors. SIAM J. Sci. Stat. Comput. 6(4), 1049–1063 (1985)
25. Rutishauser, H.: Refined iterative methods for computation of the solution and the eigenvalues
of self-adjoint boundary value problems. In: Engli, M., Ginsburg, T., Rutishauser, H., Seidel,
E. (eds.) Theory of Gradient Methods. Springer (1959)
26. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
27. Dongarra, J., Duff, I., Sorensen, D., van der Vorst, H.: Numerical Linear Algebra for High-
Performance Computers. SIAM, Philadelphia (1998)
28. Meurant, G.: Computer Solution of Large Linear Systems. Studies in Mathematics and its
Applications. Elsevier Science (1999). http://books.google.fr/books?id=fSqfb5a3WrwC
29. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
30. van der Vorst, H.A.: Iterative Krylov Methods for Large Linear Systems. Cambridge University
Press, Cambridge (2003). http://dx.doi.org/10.1017/CBO9780511615115
31. Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving non-
symmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)
32. Sidje, R.B.: Alternatives for parallel Krylov subspace basis computation. Numer. Linear Alge-
bra Appl. 305–331 (1997)
33. Nuentsa Wakam, D., Erhel, J.: Parallelism and robustness in GMRES with the Newton basis
and the deflated restarting. Electron. Trans. Linear Algebra (ETNA) 40, 381–406 (2013)
34. Nuentsa Wakam, D., Atenekeng-Kahou, G.A.: Parallel GMRES with a multiplicative Schwarz
preconditioner. J. ARIMA 14, 81–88 (2010)
35. Philippe, B., Reichel, L.: On the generation of Krylov subspace bases. Appl. Numer. Math.
(APNUM) 62(9), 1171–1186 (2012)
36. Joubert, W.D., Carey, G.F.: Parallelizable restarted iterative methods for nonsymmetric linear
systems. Part I: theory. Int. J. Comput. Math. 44, 243–267 (1992)
37. Joubert, W.D., Carey, G.F.: Parallelizable restarted iterative methods for nonsymmetric linear
systems. Part II: parallel implementation. Int. J. Comput. Math. 44, 269–290 (1992)
38. Parlett, B.: The Symmetric Eigenvalue Problem. SIAM, Philadelphia (1998)
39. Bai, Z., Hu, D., Reichel, L.: A Newton basis GMRES implementation. IMA J. Numer. Anal.
14, 563–581 (1994)
40. Reichel, L.: Newton interpolation at Leja points. BIT 30, 332–346 (1990)
Chapter 10
Preconditioners
The tearing-based solver described in Sect. 5.4 is applicable for handling systems
involving such sparse preconditioner M. The only exception here is that each linear
system corresponding to an overlapped diagonal block M j can be solved using a
sparse direct solver. Such a direct solver, however, should have the following capa-
bility: given a system M j Y j = G j in which G j has only nonzero elements at the top
and bottom m rows, then the solver should be capable of obtaining the top and bottom
m rows of the solution Y j much faster than computing all of Y j . The sparse direct
solver PARDISO possesses such a feature [6]. This will allow solving the balance
system, and hence the original system, much faster and with a high degree of parallel
scalability.
Most sparse nonsymmetric linear system solvers either require: (i) storage and
computation that grow excessively as the number of iterations increases, (ii) spe-
cial spectral properties of the coefficient matrix A to assure convergence or, (iii) a
symmetrization process that could result in potentially disastrously ill-conditioned
problems. One group of methods which avoids these difficulties is accelerated row
projection (RP) algorithms. This class of sparse system solvers start by partitioning
the coefficient matrix A of the linear system Ax = f of order n, into m block rows:
A = (A1 , A2 , . . . , Am ), (10.1)
and partition the vector f accordingly. A row projection (RP) method is any
algorithm which requires the computation of the orthogonal projections Pi x =
Ai (Ai Ai )−1 Ai x of a vector x onto R(Ai ), i = 1, 2, . . . , m. Note that the nonsin-
gularity of A implies that Ai has full column rank and so (Ai Ai )−1 exists.
In this section we present two such methods, bearing the names of their inventors,
and describe their properties. The first (Kaczmarz) has an iteration matrix formed as
the product of orthogonal projectors, while the second RP method (Cimmino) has an
iteration matrix formed as the sum of orthogonal projectors. Conjugate gradient (CG)
acceleration is used for both. Most importantly, we show the underlying relationship
between RP methods and the CG scheme applied to the normal equations. This, in
turn, provides an explanation for the behavior of RP methods, a basis for comparing
them, and a guide for their effective use.
Possibly the most important implementation issue for RP methods is that of choos-
ing the row partitioning which defines the projectors. An approach for banded systems
yields scalable parallel algorithms that require only a few extra vectors of storage,
10.2 Row Projection Methods for Large Nonsymmetric Linear Systems 313
and allows for accurate computations involving the applications of the necessary pro-
jections. Numerous numerical experiments show that these algorithms have superior
robustness and can be quite competitive with other solvers of sparse nonsymmetric
linear systems.
RP schemes have also been attracting the attention of researchers because of their
robustness in solving overdetermined and difficult systems that appear in applications
and because of the many interesting variations in their implementation, including
parallel asynchronous versions; see e.g. [7–12].
Here ρik is the ith component of the residual rk = f − Axk and ai is the ith
row of the matrix A. For each k, Step 2 consists of n projections, one for each row
of A. Kaczmarz’s method was recognized as a projection in [17, 18]. It converges
for any system of linear equations with nonzero rows, even when it is singular and
inconsistent, e.g. see [19, 20], as well as [21–23]. The method has been used under
the name (unconstrained) ART (Algebraic Reconstruction Techniques) in the area
of image reconstruction [24–28].
The idea of projection methods was further generalized by in [29, 30] to include
the method of steepest descent, Gauss-Seidel, and other relaxation schemes.
In order to illustrate this idea, let the error and the residual at the kth step be
defined, respectively, as
314 10 Preconditioners
δxk = x − xk , (10.2)
and
rk = f − Axk . (10.3)
Then a method of projection is one in which at each step k, the error δxk is resolved
into two components, one of which is required to lie in a subspace selected at that
step, and the other is δxk+1 , which is required to be less than δxk in some norm. The
subspace is selected by choosing a matrix Yk whose columns are linearly independent.
Equivalently,
where u k is a vector (or a scalar if Yk has one column) to be selected at the kth step
so that
where · is some vector norm. The method of projection depends on the choice of
the matrix Yk in (10.4) and the vector norm in (10.5).
If we consider ellipsoidal norms, i.e.
where G is a positive definite matrix, then u k is selected such that δxk+1 is mini-
mized yielding,
and
Since we do not know δxk , the process can be made feasible if we require that,
in which the matrix Vk will have to be determined at each iteration. This will allow
u k to be expressed in terms of rk ,
G = I. (10.11)
and
z 1 = xk , z 2 = z 1 + A1 (A −1
1 A1 ) ( f 1 − A1 z 1 ),
−1 (10.16)
z 3 = z 2 + A2 (A2 A2 ) ( f 2 − A2 z 2 ), xk+1 = z 3 .
A A y = f, A y = x. (10.17)
Symmetrized m-Partition
Variations of (10.16) include the block-row Jacobi, the block row JOR and the block-
row SOR methods [31–34]. For each of these methods, we also have the correspond-
ing block-column method, obtained by partitioning the matrix in (10.14) by columns
instead of rows. Note that if A had been partitioned into n parts, each part being a
row of A, then (10.16) would be equivalent to the Kaczmarz method. In general, for
m partitions, the method of successive projections yields the iteration
where
When A is nonsingular and 0 < ω < 2, the eigenvalues of the symmetric matrix
(I − Q(ω)) lie in the interval (0,1] and so the conjugate gradient (CG) may be used
to solve
(I − Q(ω))x = c. (10.21)
Note that iteration, (10.19) is equivalent to that of using the block symmetric suc-
cessive overrelaxation (SSOR) method to solve the linear system
A A y = f
(10.22)
x = A y
in which the blocking is that induced by the row partitioning of A. This gives a simple
expression for the right-hand side c = T (ω) f where,
The first implementation issue is the choice of ω in (10.21). Normally, the ‘opti-
mal’ ω is defined as the ωmin that minimizes the spectral radius of Q(ω). Later, we
show that ωmin = 1 for the case in which A is partitioned into two block rows, i.e.,
m = 2, see also [14]. This is no longer true for m > 3, as can be seen by considering
⎛ ⎞ ⎛ ⎞
100 A1
A = ⎝ 1 1 0 ⎠ = ⎝ A
2
⎠.
101 A
3
For Q(ω) defined in (10.20), it can be shown that the spectral radii of Q(1), and
Q(0.9) satisfy,
√
ρ(Q(1)) = (7 + 17)/16 = 0.69519 ± 10−5 ,
ρ(Q(0.9)) ≤ 0.68611 ± 10−5 < ρ(Q(1)).
Hence, ωmin = 1.
However, taking ω = 1 is recommended as we explain it in the following. First,
however, we state two important facts.
Proposition 10.1 At least rank(A1 ) of the eigenvalues of Q(1) are zero.
Proof Using the definition Pi = Ai (Ai Ai )−1 Ai = Ai Ai+ , (10.23) can be expanded
to show that the ith block column of T (1) is given by
⎡ ⎤
i−1
m
i+1
(I − P j ) ⎣ I + (I − P j ) (I − P j )⎦ (Ai )+ . (10.24)
j=1 j=i j=m−1
The first product above should be interpreted as I when i = 1 and the third product
should be interpreted as I when i = m − 1 and 0 when i = m, so that the first
summand in forming T (1) f is
[I + (I − P1 ) · · · (I − Pm ) · · · (I − P2 )](A +
1 ) f1 .
Since A +
1 (I − P1 ) = 0, then A1 x 1 = A1 (A1 ) f 1 = f 1 . The succeeding iterates
xk are obtained by adding elements in R(Q(1)) ⊆ R(I − P1 ) = N (A 1 ) to x 1 .
318 10 Preconditioners
Definition 10.1 Considering the previous notations, for solving the system Ax = f ,
the system resulting from the choice Q = Q(1) and T = T (1) is :
(I − Q)x = c, (10.25)
Where Q and T are as given by (10.20) and (10.23), respectively, with ω = 1, and
c = T f . The corresponding solver given by Algorithm 10.2 is referred to as KACZ,
the symmetrized version of the Kaczmarz method.
à = (A1 (A −1 −1 −1
1 A1 ) , A2 (A2 A2 ) , . . . , Am (Am Am ) ) (10.26)
we obtain
(P1 + P2 + · · · + Pm )x = Ã f. (10.27)
This system can also be derived as a block Jacobi method applied to the system
(10.22); see [37]. For nonsingular A, this system is symmetric positive definite and
can be solved via the CG algorithm. The advantage of this approach is that the
projections can be computed in parallel and then added.
In 1938, Cimmino [22, 38] first proposed an iteration related to (10.27), and since
then it has been examined by several others [21, 31, 32, 37, 39–42]. Later, we will
show how each individual projection can, for a wide class of problems, be computed
in parallel creating a solver with two levels of parallelism.
Although KACZ can be derived as a block SSOR, and the Cimmino scheme as a block
Jacobi method for (10.22), a more instructive comparison can be made with CGNE—
the conjugate gradient method applied to the normal equations A Ax = A f .
All three methods consist of the CG method applied to a system with coefficient
matrix W W , where W is shown for each of the three methods in Table 10.1.
Intuitively an ill-conditioned matrix A is one in which some linear combination of
rows yields approximately the zero vector. For a block row partitioned matrix near
linear dependence may occur within a block, that is, some linear combination of the
rows within a particular block is approximately zero, or across the blocks, that is, the
Table 10.1 Comparison of system matrices for three row projection methods
Method W
CGNE (A1 , A2 , . . . , Am )
Cimmino (Q 1 , Q 2 , . . . , Q m )
m−1
(P1 , (I − P1 )P2 , (I − P1 )(I − P2 )P3 , . . . , (I − Pi )Pm )
KACZ i=1
320 10 Preconditioners
linear combination must draw on rows from more than one block row Ai . Now let
Ai = Q i Ui be the orthogonal decomposition of Ai in which the columns of Q i are
orthonormal. Examining the matrices W shows that CGNE acts on A A, in which
near linear dependence could occur both from within and across blocks. Cimmino,
however, replaces each Ai with the orthonormal matrix Q i . In other words, Cimmino
avoids forming linear dependence within each block, but remains subject to linear
dependence formed across the blocks.
Similar to Cimmino, KACZ also replaces each Ai with the orthonormal matrix
Q i , but goes a step further since Pi (I − Pi ) = 0.
Several implications follow from this heuristic argument. We make two practical
observations. First, we note that the KACZ system matrix has a more favorable
eigenvalue distribution than that of Cimmino in the sense that KACZ has fewer small
eigenvalues and many more near the maximal one. Similarly, the Cimmino system
matrix is better conditioned than that of CGNE. Second, we note that RP methods
will require fewer iterations for matrices A where the near linear dependence arises
primarily from within a block row rather than across block rows. A third observation
is that one should keep the number of block rows small. The reason is twofold: (i)
partial orthogonalization across blocks in the matrix W of Table 10.1 becomes less
effective as more block rows appear; (ii) the preconditioner becomes less effective
(i.e. the condition number of the preconditioned system increases) when the number
of blocks increases. Further explanation of the benefit of keeping the number of block
rows small is seen from the case in which m = n, i.e. having n blocks. In such a
case, the ability to form a near linear dependence occurs only across rows where the
outer CG acceleration method has to deal with it.
10.2.4 CG Acceleration
Although the CG algorithm can be applied directly to the RP systems, special prop-
erties allow a reduction in the amount of work required by KACZ. CG acceleration
for RP methods was proposed in [35], and considered in [14, 36]. The reason that a
reduction in work is possible, and assuring that A
1 x k = f 1 is satisfied in every CG
outer iteration for accelerating KACZ follows from:
Theorem 10.1 Suppose that the CG algorithm is applied to the KACZ system
(10.25). Also, let rk = c − (I − Q)xk be the residual, and dk the search direc-
tion. If x0 = c is chosen as the starting vector, then rk , dk ∈ R(I − P1 ) for all k.
Proof
r0 = c − (I − Q)c = Qc
= (I − P1 )(I − P2 ) · · · (I − Pm ) · · · (I − P2 )(I − P1 )c ∈ R(I − P1 ).
Since d0 = r0 , the same is true for d0 . Suppose now that the theorem holds for step
(k − 1). Then dk−1 = (I − P1 )dk−1 and so
10.2 Row Projection Methods for Large Nonsymmetric Linear Systems 321
wk ≡ (I − Q)dk−1
= (I − P1 )dk−1 − (I − P1 )(I − P2 ) · · · (I − Pm ) · · · (I − P2 )(I − P1 )dk−1
∈ R(I − P1 ).
A
1 x k = A1 (x k−1 + αk dk−1 ) = A1 x k−1 = · · · = A1 x 0 = f 1 .
If the matrix A is partitioned into two block rows (i.e. m = 2), a complete eigen-
analysis of RP methods is possible using the concept of the angles θk between the
two subspaces L i = R(Ai ), i = 1, 2. The definition presented here follows [43], but
for convenience L 1 and L 2 are assumed to have the same dimension. The smallest
angle θ1 ∈ [0, π/2] between L 1 and L 2 is defined by
Let u 1 and v1 be the attainment vectors; then for k = 2, 3, . . . , n/2 the remaining
angles between the two subspaces are defined as
C = diag(c1 , c2 , . . . , cn/2 )
S = diag(s1 , s2 , . . . , sn/2 )
(10.28)
I = C 2 + S2
1 ≥ c1 ≥ c2 ≥ · · · ≥ cn/2 ≥ 0.
In the above theorem ck = cos θk and sk = sin θk , where the angles θk are as defined
above. Now consider the nonsymmetric RP iteration matrix Q u = (I − ω P1 )(I −
ω P2 ). Letting α = 1 − ω, using the above expressions of P1 and P2 , we get
α 2 C −αS
Q u = U1 U2 .
αS C
Hence,
α 2 C 2 + αS 2 (1 − α)C S
U2 Q u U2 = . (10.29)
α(1 − α)C S C 2 + αS 2
Since each of the four blocks is diagonal, U2 Q u U2 has the same eigenvalues as
the scalar 2 × 2 principal submatrices of the permuted matrix (inverse odd-even
permutation on the two sides). The eigenvalues are given by,
1
(1 − α) ci + 2α ± |1 − α|ci (1 − α) ci + 4α ,
2 2 2 2 (10.30)
2
Q(ω) = (I −
ω2P1 )(I2 − ω P 2 ) (I − ω P1 )
2
where each block is n/2 × n/2, with C and S satisfying (10.28) and D =
I 0 C
diag(d1 , d2 , . . . , dn/2 ). Then P1 = and P2 = (C S), and the eigenval-
0 0 S
ues of [I − Q(1)] are {s12 , s22 , . . . , sn/2
2 , 1} while those of A are [c ± c2 + 4s d ]/2.
i i i i
Clearly, A is nonsingular provided that each si and di is nonzero. If the si ’s are close
to 1 while the di ’s are close to 0, then [I − Q(1)] has eigenvalues that are clustered
near 1, while A has singular values close to both 0 and 1. Hence, A is badly condi-
tioned while [I − Q(1)] is well-conditioned. Conversely if the di ’s are large while
the si ’s are close to 0 in such a way that di si is near 1, then A is well-conditioned
while [I − Q(1)] is badly conditioned. Hence, the conditioning of A and its induced
RP system matrix may be significantly different.
We have already stated that the eigenvalue distribution for the KACZ systems is
better for CG acceleration than that of Cimmino, which in turn is better than that of
CGNE (CG applied to A Ax = A f ). For m = 2, the eigenvalues of KACZ are
{s12 , s22 , . . . , sn/2
2 , 1} while those of Cimmino are easily seen to be {1 − c , . . . , 1 −
1
cn/2 , 1 + cn/2 , . . . , 1 + c1 }, verifying that KACZ has better eigenvalue distribution
than that of Cimmino. The next theorem shows that in terms of condition numbers
this heuristic argument is valid when m = 2.
Proof Without loss of generality, suppose that R(A1 ) and R(A2 ) have dimension
n/2. Let U1 = (G 1 , G 2 ) and U2 = (H1 , H2 ) be the matrices defined in Theorem
10.2 so that P1 = G 1 G
1 , P2 = H1 H1 , and G 1 H1 = C. Set X = G 1 A1 and
Y = H1 A2 so that A1 = P1 A1 = G 1 G 1 A1 = G 1 X and A2 = H1 Y . It is easily
verified that the eigenvalues of (P1 + P2 ) are 1±ci , corresponding to the eigenvectors
gi ± h i where G 1 = (g1 , g2 , . . . , gn/2 ) and H1 = (h 1 , h 2 , . . . , h n/2 ). Furthermore,
G 1 g1 = H1 h 1 = e1 and G 1 h 1 = H1 g1 = c1 e1 , where e1 is the first unit vector. Then
A A = A1 A
1 + A2 A2 = G 1 X X G 1 + H1 Y Y H1 , so (g1 + h 1 ) (A A)(g1 +
2
h 1 ) = (1+c1 ) e1 (A A)e1 , and (g1 −h 1 ) (A A)(g1 −h 1 ) = (1−c1 ) e1 (A A)e1 .
2
1 + c1
and λmin (A A) ≤ (1 − c1 )(e1 (A A)e1 )/2. Hence κ(A A) ≥ = κ(P1 +
1 − c1
P2 ), proving the first inequality.
To prove the second inequality, we see that from the CS Decomposition the eigen-
values of (I − (I − P1 )(I − P2 )(I − P1 )) are given by: {s12 , s22 , . . . , sn/2
2 , 1}, where
si2 = 1 − ci2 are the square of the canonical sines with the eigenvalue 1 being of
multiplicity n/2. Consequently,
1
κ(I − (I − P1 )(I − P2 )(I − P1 )) = ,
s12
and
κ(P1 + P2 ) 1 + c1
= × s12 = (1 + c1 )2 .
κ(I − (I − P1 )(I − P2 )(I − P1 )) 1 − c1
The first criterion for a row partitioning strategy is that the projections Pi x =
Ai (Ai Ai )−1 Ai x must be efficiently computable. One way to achieve this is through
parallelism: if Ai is the direct sum of blocks C j for j ∈ Si , then Pi is block-diagonal
[14]. The computation of Pi x can then be done by assigning each block of Pi to a dif-
ferent multicore node of a multiprocessor. The second criterion is storage efficiency.
10.2 Row Projection Methods for Large Nonsymmetric Linear Systems 325
The additional storage should not exceed O(n), that is, a few extra vectors, the num-
ber of which must not grow with increasing problem size n. The third criterion is
that the condition number of the subproblems induced by the projections should be
kept under control, i.e. monitored. The need for this is made clear by considering
the case when m = 1, i.e. when A is partitioned into a single block of rows. In
this case, KACZ simply solves the normal equations A Ax = A b, an approach
that can fail if A is severely ill-conditioned. More generally when m > 1, computing
y = Pi x requires solving a system of the form Ai Ai v = w. Hence the accuracy with
which the action of Pi can be computed depends on the condition number of Ai Ai
(i.e., κ(Ai Ai )), requiring the estimation of an upper bound of κ(Ai Ai ). The fourth
criterion is that the number of partitions m, i.e. the number of projectors should be
kept as small as possible, and should not depend on n. One reason has already been
outlined in Proposition 10.1.
In summary, row partitioning should allow parallelism in the computations,
require at most O(n) storage, should yield well conditioned subproblems, with the
number of partitions m being a small constant. We should note that all four goals can
be achieved simultaneously for an important class of linear systems, namely banded
systems.
For general linear systems, algorithm KACZ (Algorithm 10.2) can exploit parallelism
only at one level, that within each projection: given a vector u obtain v = (I − P j )u,
where P j = A j (Aj A j )−1 Aj . This matrix-vector multiplication is essentially the
solution of the least-squares problem
min u − A j w2
v
where each G i , Hi , Ji is square of order q, for example, then using suitable row
permutations, one can enhance parallelism within each projection by extracting block
rows in which each block consists of independent submatrices.
326 10 Preconditioners
Case 1: m = 2,
⎛ ⎞
G 1 H1
⎜ J2 G 2 H2 ⎟
⎜ ⎟
⎜ J G H ⎟
⎜ 5 5 5 ⎟
⎜ J G H ⎟
⎜
π1 A = ⎜ 6 6 6 ⎟
J3 G 3 H3 ⎟
⎜ ⎟
⎜ J4 G 4 H4 ⎟
⎜ ⎟
⎝ J7 G 7 H7 ⎠
J8 G 8
Case 2: m = 3,
⎛ ⎞
G 1 H1
⎜ J4 G 4 H4 ⎟
⎜ ⎟
⎜ J7 G 7 H7 ⎟
⎜ ⎟
⎜ J2 G 2 H2 ⎟
π2 A = ⎜
⎜
⎟
⎟
⎜ J5 G 5 H5 ⎟
⎜ J8 G 8 ⎟
⎜ ⎟
⎝ J3 G 3 H3 ⎠
J6 G 6 H6
Such permutations have two benefits. First, they introduce an outer level of paral-
lelism in each single projection. Second, the size of each independent linear least-
squares problem being solved is much smaller, leading to reduction in time and
memory requirements.
The Cimmino row projection scheme (Cimmino) is capable of exploiting paral-
lelism at two levels. For m = 3 in the above example, we have
(i) An outer level of parallel projections on all block rows using 8 nodes, one node
for each independent block, and
(ii) An inner level of parallelism in which one performs each projection (i.e., solv-
ing a linear least-squares problem) as efficiently as possible on the many cores
available on each node of the parallel architecture.
robust than its additive counterpart. By using Newton Krylov bases, as introduced
in Sect. 9.3.2, the resulting method does not suffer from the classical bottleneck of
excessive internode communications.
In order to simplify the notation, the sets Wi are assumed to be intervals of integers.
This is not a restriction, since it is always possible to define a new numbering of
the unknowns which satisfies this constraint. Following this definition, a “domain”
decomposition can be considered as resulting from a graph partitioner but with poten-
tial overlap between domains. It can be observed that such a decomposition does
not necessarily exist (e.g. when A is dense matrix). For the rest of our discussion,
we will assume that a graph partitioner has been applied resulting in p intervals
Wi = wi + [1 : m i ] whose union is W , W = [1 : n]. The submatrix of A corre-
sponding to Wi × Wi is denoted by Ai . We shall denote by Ii ∈ Rn×n the diagonal
matrix, or the sub-identity matrix, whose diagonal elements are set to one if the corre-
sponding node belongs to Wi and set to zero otherwise. In effect, Ii is the orthogonal
projector onto the subspace L i corresponding to the unknowns numbered by Wi . We
still denote by Ai the extension of block Ai to the whole space, in other words,
328 10 Preconditioners
Ai = Ii AIi , (10.33)
For example, from Fig. 10.1 , we see that unlike the tearing method, the whole overlap
block C j belongs to A j as well as A j+1 .
Also, let
where I¯i = I − Ii is the complement sub-identity matrix. For the sake of simplifying
the presentation of what follows, we assume that all the matrices Āi , for i = 1, . . . , p
are nonsingular. Hence, the generalized inverse Ai+ of Ai is given by Ai+ = Ii Āi−1 =
Āi−1 Ii .
Proposition 10.3 For any domain decomposition as given in Definition 10.2 the
following property holds:
Proof Let (k, l) ∈ Wi × W j such that ak,l = 0. Since (k, l) ∈ P, there exists m ∈
{1 . . . n} such that k ∈ Wm and l ∈ Wm ; therefore Wi ∩ Wm = ∅ and W j ∩ Wm = ∅.
Consequently, from Definition 10.2, |i − m| ≤ 1 and | j − m| ≤ 1, which implies
|i − j| ≤ 2.
Definition 10.3 The domain decomposition is with a weak overlap if and only if the
following is true:
The set of unknowns which represents the overlap is defined by the set of integers
Ji = Wi ∩ Wi+1 , i = 1, . . . , p − 1, with the size of the overlap being si . Similar to
(10.33) and (10.34), we define
Ci = Oi AOi , (10.35)
and
Example 10.1 In the following matrix, it can be seen that the decomposition is of a
weak overlap if A2b,t = 0, and At,b
2 = 0, i.e. when A is block-tridiagonal.
⎛ ⎞
A1m,m A1m,b
⎜ b,m ⎟
⎜ A1 C1 At,m At,b ⎟
⎜ 2 2 ⎟
A=⎜ m,t
A2 A2 m,m m,b
A2 ⎟ (10.37)
⎜ t,m ⎟
⎝ b,t
A2 A2 b,m
C 2 A3 ⎠
At,m
3 A3m,m
The goal of Multiplicative Schwarz methods is to iteratively solve the linear system
Ax = f (10.38)
Convergence of this iteration to the solution x of the system (10.38) is proven for
M-matrices and SPD matrices (eg. see [46]).
Embedding in a System of Larger Dimension
If the subdomains do not overlap, it can be shown [47] that the Multiplicative Schwarz
is equivalent to a Block Gauss-Seidel method applied on an extended system. In this
section, following [48], we present an extended system which embeds the original
system (10.38) into a larger one with no overlapping between subdomains.
For that purpose, we define the prolongation mapping and the restriction map-
ping. We assume for the whole section that the set of indices defining the domains
are intervals. As mentioned before, this does not limit the scope of the study since a
preliminary symmetric permutation of the matrix, corresponding to the same renum-
bering of the unknowns and the equations, can always end up with such a system.
For any vector x ∈ Rn , we consider the set of overlapping subvectors x (i) ∈ Rm i
for i = 1, . . . , p, where x (i) is the subvector of⎛x corresponding
⎞ to the indices Wi .
x (i,t)
This vector can also be partitioned into x (i) = ⎝ x (i,m) ⎠ accordingly to the indices
x (i,b)
of the overlapping blocks Ci−1 and Ci (with the obvious convention that x (1,t) and
x ( p,b) are zero-length vectors).
Definition 10.4 The prolongation mapping which injects Rn into a space Rm where
p p−1
m = i=1 m i = n + i=1 si is defined as follows :
D : Rn → Rm
x →
x,
where
x is obtained from vector x by duplicating all the blocks of entries cor-
responding
⎛ (1) ⎞to overlapping blocks: therefore, according to the previous notation,
x
⎜ ⎟
x = ⎝ ... ⎠.
x ( p)
The restriction mapping consists of projecting a vector
x ∈ Rm onto Rn , which
consists of deleting the subvectors corresponding to the first appearance of each
overlapping blocks
P : Rm → Rn
x → x.
Embedding the original system in a larger one is done for instance in [47, 48].
We present here a special case. In order to avoid a tedious formal presentation of
the augmented system, we present its construction on an example which is generic
10.3 Multiplicative Schwarz Preconditioner with GMRES 331
Remark 10.1 The following properties are straightforward consequences of the pre-
vious definitions:
1. Ax = P AD x,
2.
The subspace J = R(D) ⊂ Rm is an invariant subspace of A,
3. PD = In and DP is a projection onto J ,
4. x).
∀x, y ∈ Rn , (y = Ax ⇔ D y = AD
This can be illustrated by diagram (10.43):
A
Rn → Rn
D ↓ ↑ P (10.43)
A
Rm → Rm .
332 10 Preconditioners
One iteration of the Block Multiplicative Schwarz method on the original system
(10.38) corresponds to one Block-Seidel iteration on the enhanced system
x = D f,
A (10.44)
where the diagonal blocks are the blocks defined by the p subdomains. More pre-
the block lower triangular part of A,
cisely, denoting by P the iteration defined in
Algorithm 10.3 can be expressed as follows :
⎧
⎪
⎪
xk = D xk ,
⎨
rk = Drk ,
−1 (10.45)
⎪
⎪
x =xk + P rk ,
⎩ k+1
xk+1 = P xk+1 .
= P
To prove it, let us partition A − N , where N is the strictly upper block-
Matrices P
triangular part of (− A). and N
are partitioned by blocks accordingly to
the domain definition. One iteration of the Block Gauss-Seidel method can then be
expressed by
xk+1 =
−1
xk + P rk .
The resulting block-triangular system is solved successively for each diagonal block.
xk and
To derive the iteration, we partition xk+1 accordingly. At the first step, and
assuming xk,0 = xk and
rk,0 =
rk , we obtain
xk,0 + A−1
xk+1,1 =
1 rk,0 , (10.46)
which is identical to the first step of the Multiplicative Schwarz xk,1 = xk,0 + A−1
1 +
rk,0 . The ith step (i = 2, . . . , p)
xk+1,i =
xk,i−1 +
Ai−1 i,1:i−1
fi − P xk+1,1:i−1 − Ai i,i+1: p
xk,i + N xk,i+1: p , (10.47)
P −1
Rn → Rn
D ↓ ↑ P
−1
P
Rm → Rm
10.3 Multiplicative Schwarz Preconditioner with GMRES 333
D,
N = PN (10.48)
−1 −1 ¯ −1
P −1 = A¯p C̄ p−1 Ā−1
p−1 C̄ p−2 · · · Ā2 C 1 Ā1 (10.50)
Proof In [49], two proofs are proposed. We present here the proof by induction.
Since xk+1 = xk + P −1 rk and since xk+1 is obtained as result of successive steps
as defined in Algorithm 10.3, for i = 1, . . . , p: xk,i+1 = xk,i + Ai+rk,i , we shall
prove by reverse induction on i = p − 1, . . . , 0 that:
−1
xk, p = xk,i + A¯p C̄ p−1 · · · C̄i+1 Āi+1
−1
Ii+1: p rk,i (10.51)
The last transformation was possible since the supports of A p , and of C p−1 , C p−1 ,
. . . , Ci+1 are disjoint from domain i. Let us transform the matrix expression:
B = Ai+ + Ai+1
+ −1
Ii+1: p − Āi+1 Ii+1: p A Ai+
−1 +
= Āi+1 (Ai+1 Ii + Ii+1: p Āi − Ii+1: p AIi ) Āi−1
which proves that relation (10.51) is valid for i − 1. This ends the proof.
and A p = B p .
Proposition 10.4 The matrix N , is defined by the multiplicative Schwarz splitting
A = P − N . By embedding it in N as expressed in (10.48), block Ni, j is the upper
i, j as referred by the block structure of N
part of block N . The blocks can be expressed
as follows:
10.3 Multiplicative Schwarz Preconditioner with GMRES 335
⎧
⎨ Ni, j = G i · · · G j−1 B j , when j > i + 1
Ni,i+1 = G i Bi+1 − [Fi , 0], (10.52)
⎩
Ni, j = 0 other wise,
Therefore, the rank of row block of Ni is limited by the rank of factor [Fi 0] which
p−1
cannot exceed si . This implies r ≤ i=1 si .
0 0
A1
Nonzero blocks
5 5
A2
C1
10 A3 10
C2
15 15
0 5 10 15 0 5 10 15
A N
Fig. 10.3 Expression of the block multiplicative Schwarz splitting for a block-tridiagonal matrix.
Left Pattern of A; Right Pattern of N where A = P − N is the corresponding splitting
336 10 Preconditioners
Proposition 10.5 With the previous notations, if rank(N ) = r < n, then any Krylov
method, as it has just been defined, reaches the exact solution in at most r + 1
iterations.
For a general nonsingular matrix, this result is applicable to the methods BiCG and
QMR, preconditioned by the Multiplicative Schwarz method. In exact arithmetic, the
number of iterations cannot exceed the total dimension s of the overlap by more than
1. The same result applies to GMRES(m) when m is greater than s.
10.3 Multiplicative Schwarz Preconditioner with GMRES 337
is carried out recursively through the domains, whereas our explicit formulation (see
Theorem 10.4) decouples the two computations. The computation of the residual is
therefore more easily implemented in parallel since it is included in the recursion.
Another advantage of the explicit formulation arises when it is used as a precondi-
tioner of a Krylov method. In such a case, the user is supposed to provide a code
for the procedure x → P −1 x. Since the method computes the residual, the classical
algorithm implies calculation of the residual twice.
The advantage is even higher when considering the a priori construction of a
nonorthogonal basis of the Krylov subspace. For this purpose, the basis is built by
a recursion of the type of (9.61) but where the operator A is now replaced by either
P −1 A or A P −1 . The corresponding algorithm is one of the following: 9.4, 9.5,
or 9.6. Here, we assume that we are using the Newton-Arnoldi iteration (i.e. the
Algorithm 9.5) with a left preconditionner.
Next, we consider the parallel implementation of the above scheme on a linear
array of processors P(q), q = 1, . . . , p. We assume that the matrix A and the
vectors involved are distributed as indicated in Sect. 2.4.1. In order to get rid of any
global communication, the normalization of w, must be postponed. For a clearer
presentation, we skip for now the normalizing factors but we will show later how to
incorporate these normalization factors without harming parallel scalability.
Given a vector z 1 , building a Newton-Krylov basis can be expressed by the
loop:
do k = 1 : m,
z k+1 = P −1 A z k − λk+1 z k ;
end
At each iteration, this loop involves multiplying A by a vector (MV as defined in
Sect. 2.4.1), followed by solving a linear system involving the preconditioner P.
As shown earlier, Sect. 2.4.1, the MV kernel is implemented via Algorithm 2.11.
Solving systems involving the preconditioner P corresponds to the application of
one Block Multiplicative Schwarz iteration, which can be expressed from the explicit
formulation of P −1 as given by Theorem 10.4, and implemented by Algorithm 10.4.
Algorithms 2.11 and 10.4 can be concatenated into one. Running the resulting
program on each processor defines the flow illustrated in Fig. 10.4. The efficiency
of this approach is analyzed in [50]. It shows that if there is a good load balance
across all the subdomains and if τ denotes the maximum number of steps necessary
to compute a subvector vq , then the number of steps to compute one entire vector is
given by T = pτ , and consequently, to compute m vectors of the basis the number
of parallel steps is
T p = ( p − 3 + 3m)τ. (10.54)
Fig. 10.4 Pipelined construction of the Newton-Krylov basis corresponding to the block multiplicative Schwarz preconditioning. The recursion z_{k+1} = α_k(P^{-1}A − λ_k I)z_k, k = 1 : m, proceeds across the domains; the stars illustrate the wavefront of the flow of the computation (efficiency 1/3 for one domain per processor)
While the efficiency E_p = 1/(3 + (p − 3)/m) grows with m, it is asymptotically limited to 1/3. Assuming that the block-diagonal decomposition has weak overlap as given by Definition 10.3, then by using (10.53) we can show that the asymptotic efficiency can reach 1/2.
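For instance, with p = 16 processors and m = 32 basis vectors, E_p = 1/(3 + 13/32) ≈ 0.29, already close to the asymptotic limit of 1/3.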
If we do not skip the computation of the normalizing coefficients αk , the actual
computation should be as follows:
do k = 1 : m,
z k+1 = αk (P −1 A z k − λk+1 z k );
end
where α_k = 1/‖P^{-1}A z_k − λ_{k+1} z_k‖. When a slice of the vector z_{k+1} has been computed on
processor Pq , the corresponding norm can be computed and added to the norms of
the previous slices which have been received from the previous processor Pq−1 . Then
the result is sent to the next processor P_{q+1}. When reaching the last processor
Pp , the norm of the entire vector is obtained, and can be sent back to all processors in
order to update the subvectors. This procedure avoids global communication which
could harm parallel scalability. The entries of the tridiagonal matrix Tˆm , introduced
in (9.62), must be updated accordingly.
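A simplified mpi4py sketch of this ring accumulation is given below; it captures only the pipelined reduction of the squared norm and the final send-back, not its overlap with the computation of the next basis vector, and the variable names are placeholders.

from mpi4py import MPI
import numpy as np

def pipelined_norm(z_local, comm=MPI.COMM_WORLD):
    # Accumulate ||z||^2 along the linear array of processors P_1, ..., P_p,
    # then return the norm to every processor so each can scale its slice.
    q, p = comm.Get_rank(), comm.Get_size()
    partial = float(np.dot(z_local, z_local))        # squared norm of the local slice
    if q > 0:
        partial += comm.recv(source=q - 1, tag=7)    # running sum from P_{q-1}
    if q < p - 1:
        comm.send(partial, dest=q + 1, tag=7)        # forward it to P_{q+1}
    norm = comm.bcast(np.sqrt(partial) if q == p - 1 else None, root=p - 1)
    return z_local / norm, norm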
A full implementation of the GMRES method preconditioned with the block mul-
tiplicative Schwarz iteration is coded in GPREMS following the PETSc formulations.
A description of the method is given in [50] where the pipeline for building the New-
ton basis is described. The flow of the computation is illustrated by Fig. 10.5. A new
version is available in a more complete set where deflation is also incorporated [51].
Deflation is of importance here as it limits the increase of the number of GMRES
iterations for a large number of subdomains.
Fig. 10.5 Flow of the computation v_{k+1} = σ_k P^{-1}(A − λ_k I)v_k, showing the communications for Ax, for P^{-1}y, and for the consistency of the computed vector in the overlapped region (courtesy of the authors of [50])
References
1. Axelsson, O., Barker, V.A.: Finite Element Solution of Boundary Value Problems. Academic
Press Inc., Orlando (1984)
2. Meurant, G.: Computer Solution of Large Linear Systems. Studies in Mathematics and
its Applications. Elsevier Science, North-Holland (1999). http://books.google.fr/books?id=
fSqfb5a3WrwC
3. Chen, K.: Matrix Preconditioning Techniques and Applications. Cambridge University Press,
Cambridge (2005)
4. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
5. van der Vorst, H.A.: Iterative Krylov Methods for Large Linear Systems. Cambridge University
Press, Cambridge (2003). http://dx.doi.org/10.1017/CBO9780511615115
6. Schenk, O., Gärtner, K.: Solving unsymmetric sparse systems of linear equations with pardiso.
Future Gener. Comput. Syst. 20(3), 475–487 (2004)
7. Censor, Y., Gordon, D., Gordon, R.: Component averaging: an efficient iterative parallel algo-
rithm for large and sparse unstructured problems. Parallel Comput. 27(6), 777–808 (2001)
8. Gordon, D., Gordon, R.: Component-averaged row projections: a robust, block-parallel scheme
for sparse linear systems. SIAM J. Sci. Stat. Comput. 27(3), 1092–1117 (2005)
9. Zouzias, A., Freris, N.: Randomized extended Kaczmarz for solving least squares. SIAM J.
Matrix Anal. Appl. 34(2), 773–793 (2013)
10. Liu, J., Wright, S., Sridhar, S.: An Asynchronous Parallel Randomized Kaczmarz Algorithm.
CoRR (2014). arXiv:1201.3120 [math.NA]
11. Popa, C.: Least-squares solution of overdetermined inconsistent linear systems using Kacz-
marz’s relaxation. Int. J. Comput. Math. 55(1–2), 79–89 (1995)
12. Popa, C.: Extensions of block-projections methods with relaxation parameters to inconsistent
and rank-deficient least-squares problems. BIT Numer. Math. 38(1), 151–176 (1998)
13. Bodewig, E.: Matrix Calculus. North-Holland, Amsterdam (1959)
14. Kamath, C., Sameh, A.: A projection method for solving nonsymmetric linear systems on
multiprocessors. Parallel Comput. 9, 291–312 (1988/1989)
15. Tewarson, R.: Projection methods for solving sparse linear systems. Comput. J. 12, 77–80
(1969)
16. Tompkins, R.: Methods of steep descent. In: E. Beckenbach (ed.) Modern Mathematics for the
Engineer. McGraw-Hill, New York (1956). Chapter 18
17. Gastinel, N.: Procédé itératif pour la résolution numérique d’un système d’équations linéaires.
Comptes Rendus Hebd. Séances Acad. Sci. (CRAS) 246, 2571–2574 (1958)
18. Gastinel, N.: Linear Numerical Analysis. Academic Press, Paris (1966). Translated from the
original French text Analyse Numerique Lineaire
19. Tanabe, K.: A projection method for solving a singular system of linear equations. Numer.
Math. 17, 203–214 (1971)
20. Tanabe, K.: Characterization of linear stationary iterative processes for solving a singular
system of linear equations. Numer. Math. 22, 349–359 (1974)
21. Ansorge, R.: Connections between the Cimmino-methods and the Kaczmarz-methods for the
solution of singular and regular systems of equations. Computing 33, 367–375 (1984)
22. Cimmino, G.: Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari. Ric. Sci.
Progr. Tech. Econ. Naz. 9, 326–333 (1938)
23. Dyer, J.: Acceleration of the convergence of the Kaczmarz method and iterated homogeneous
transformations. Ph.D. thesis, University of California, Los Angeles (1965)
24. Gordon, R., Bender, R., Herman, G.: Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography. J. Theor. Biol. 29,
471–481 (1970)
25. Gordon, R., Herman, G.: Three-dimensional reconstruction from projections, a review of algo-
rithms. Int. Rev. Cytol. 38, 111–115 (1974)
26. Trummer, M.: A note on the ART of relaxation. Computing 33, 349–352 (1984)
27. Natterer, F.: Numerical methods in tomography. Acta Numerica 8, 107–141 (1999)
The simplest method for computing the dominant eigenvalue (i.e. the eigenvalue of
largest modulus) and its corresponding eigenvector is the Power Method, listed in
Algorithm 11.1. The power method converges to the leading eigenpair for almost all
initial iterates x0 if the matrix has a unique eigenvalue of maximum modulus. For
symmetric matrices, this implies that the dominant eigenvalue λ is simple and −λ
is not an eigenvalue. This method can be easily implemented on parallel architec-
tures given an efficient parallel sparse matrix-vector multiplication kernel MV (see
Sect. 2.4.1).
The sequence of iterates created by Algorithm 11.1 converges to ±u_1. The method converges linearly with the convergence factor given by max{|λ_2/λ_1|, |λ_n/λ_1|}.
Note, however, that the power method could exhibit very slow convergence. As
an illustration, we consider the Poisson matrix introduced in Eq. (6.64).
For order n = 900 (5-point discretization of the Laplace operator on the unit square
with a 30 × 30 uniform grid), the power method has a convergence factor of 0.996.
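For reference, a minimal dense sketch of the power iteration (the structure of Algorithm 11.1, not its exact listing) is given below; the tolerance and iteration limit are arbitrary choices.

import numpy as np

def power_method(A, x0, tol=1e-8, maxit=5000):
    # Approximate the dominant eigenpair of a symmetric matrix A.
    x = x0 / np.linalg.norm(x0)
    lam = 0.0
    for _ in range(maxit):
        y = A @ x                       # MV kernel: the only operation involving A
        lam = x @ y                     # Rayleigh quotient estimate of lambda_1
        if np.linalg.norm(y - lam * x) <= tol * abs(lam):
            break
        x = y / np.linalg.norm(y)       # normalize the new iterate
    return lam, x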
The most straightforward way for simultaneously improving the convergence of
the power method and enhancing parallel scalability is to generalize it by iterating
on a block of vectors instead of a single vector. A direct implementation of the power
method on each vector of the block will not work since every column of the iterated
block will converge to the same dominant eigenvector. It is therefore necessary to
maintain strong linear independence across the columns. For that purpose, the block
will need to be orthonormalized in every iteration. The simplest implementation of
this method, called Simultaneous Iteration, for the standard symmetric eigenvalue
problem is given by Algorithm 11.2. Simultaneous Iteration was originally introduced
as Treppen-iteration (cf. [2–4]) and later extended for non-Hermitian matrices (cf.
[5, 6]).
For s = 1, this algorithm mimics Algorithm 11.1 exactly.
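A minimal sketch of this block scheme (re-orthonormalization by a QR factorization after every multiplication, followed by a Rayleigh-Ritz projection) is shown below; it mirrors the structure of Algorithm 11.2 but is not its exact listing.

import numpy as np

def simultaneous_iteration(A, X0, iters=200):
    # Iterate on a block of s vectors; orthonormalize in every iteration so that
    # the columns do not all collapse onto the dominant eigenvector.
    Q, _ = np.linalg.qr(X0)
    for _ in range(iters):
        Q, _ = np.linalg.qr(A @ Q)      # block MV followed by re-orthonormalization
    H = Q.T @ (A @ Q)                   # s x s projection of A
    theta, Y = np.linalg.eigh(H)        # Ritz values and vectors
    return theta, Q @ Y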
where Pk is the orthogonal projector onto the subspace spanned by the columns of
X k which is created by Algorithm 11.2.
Proof See e.g. [7] for the special case of a diagonalizable matrix.
In the previous section, we explored methods for computing a few dominant eigen-
pairs. In this section, we outline techniques that allow the computation of those
eigenpairs belonging to the interior of the spectrum of a symmetric matrix (or op-
erator) A ∈ Rn×n . These techniques consist of considering a transformed operator
which has the same invariant subspaces as the original operator but with a rearranged
spectrum.
Deflation
Let us order the eigenvalues of A as: |λ1 | ≥ · · · ≥ |λ p | > |λ p+1 | ≥ · · · ≥ |λn |.
We assume that an orthonormal basis V = (v1 , . . . , v p ) ∈ Rn× p of the invariant
subspace V corresponding to the eigenvalues {λ1 , · · · , λ p } is already known. Let
P = I − V V^⊤ be the orthogonal projector onto V^⊥ (the orthogonal complement
of V ).
Proposition 11.1 The invariant subspaces of the operator B = PAP are the same
as those of A, and the spectrum of B is given by,
Using the power method or simultaneous iteration on the operator B provides the
dominant eigenvalues of B which are now λ p+1 , λ p+2 , . . .. To implement the matrix-
vector multiplication involving the matrix B, it is necessary to perform the operation
Pv for any vector v. We point out that it is not recommended to use the expression Px = x − V(V^⊤x), since this corresponds to the Classical Gram-Schmidt method (CGS), which is not numerically reliable unless applied twice (see Sect. 7.4). It is better to implement the Modified Gram-Schmidt version (MGS) via the operator Pv = \prod_{i=1}^{p}(I − u_i u_i^⊤)v. Since B = PAP, a multiplication of B by a vector
implies two applications of the projector P to this vector. However, we can ignore
the pre-multiplication of A by P since the iterates xk in Algorithm 11.1, and X k in
Algorithm 11.2 satisfy, respectively, the relationships P xk = xk and P X k = X k .
Therefore we can use only the operator B̃ = PA in both the power and simultaneous
iteration methods. Hence, this procedure—known as the deflation process—increases
the number of arithmetic operations in each iteration by 4pn or 4spn, respectively,
on a uniprocessor.
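A possible realization of the deflated operator B̃ = PA, with P applied in the MGS-like product form described above, is sketched below; the matrix U is assumed to hold the already computed orthonormal eigenvectors u_1, . . . , u_p as its columns.

import numpy as np

def apply_deflated(A, U, v):
    # Compute P(A v) with P = prod_{i=1}^{p} (I - u_i u_i^T), one factor at a time.
    w = A @ v
    for i in range(U.shape[1]):
        u = U[:, i]
        w = w - u * (u @ w)             # subtract the component along u_i
    return w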
In principle, by applying the above technique, it should be possible to determine
any part of the spectrum. There are two obstacles, however, which make this approach
not viable for determining eigenvalues that are very small in magnitude: (i) the num-
ber of arithmetic operations increases significantly; and (ii) the storage required for
the eigenvectors u 1 , . . . , u p could become too large. Next, we consider two alternate
strategies for a spectral transformation that could avoid these two disadvantages.
Shift-and-Invert Transformation
This shift-and-invert strategy transforms the eigenvalues of A which lie near zero
into the largest eigenvalues of A^{-1} and therefore allows efficient use of the power or
the simultaneous iteration methods for determining eigenvalues of A lying close to
the origin. This can be easily extended for the computation of the eigenvalues in the
neighborhood of any given value μ.
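A common realization of this strategy factors A − μI once and then runs the power iteration on its inverse, as sketched below with SciPy's sparse LU; the names and parameters are illustrative only.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def shift_invert_iteration(A, mu, x0, iters=100):
    # Power iteration on (A - mu*I)^{-1}: converges to the eigenvalue of A closest to mu.
    n = A.shape[0]
    lu = spla.splu((A - mu * sp.identity(n, format='csc')).tocsc())  # factor once
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        y = lu.solve(x)                 # each step applies (A - mu I)^{-1}
        x = y / np.linalg.norm(y)
    return x @ (A @ x), x               # Rayleigh quotient gives the eigenvalue of A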
c = \min_{k=1,2} \frac{2}{(c_k − a)(c_k − b)},                    (11.3)

if λ ∈ [a, b], then 1 ≤ p(λ) ≤ 1 + g, where g = \frac{c(b − a)^2}{4}, else −1 ≤ p(λ) ≤ 1.

‖u − x_i^{(k)}‖ = O(q^k), where                    (11.4)

q = \max_{μ ∈ Λ(A), μ ∉ [a,b]} \frac{|T_m(p(μ))|}{|T_m(p(λ))|}.                    (11.5)
Note that the intervals [a, b] and [c1 , c2 ] should be chosen so as to avoid obtaining
a very small g which could lead to very slow convergence.
The scheme for obtaining the eigenpairs of a tridiagonal matrix that lie in a given
interval, TREPS (Algorithm 8.4), introduced in Sect. 8.2.3, can be applied to any
sparse symmetric matrix A with minor modifications as described below.
Let μ ∈ R be located somewhere inside the spectrum of the sparse symmetric
matrix A, but not one of its eigenvalues. Hence, A − μI is indefinite. Computing the symmetric factorization P^⊤(A − μI)P = LDL^⊤ via the stable algorithm introduced in [9], where D is a symmetric block-diagonal matrix with blocks of order 1 or 2, P is a permutation matrix, and L is a lower triangular matrix with unit
diagonal, then according to the Sylvester Law of Inertia, e.g. see [10], the number
of negative eigenvalues of D is equal to the number of eigenvalues of A which are
smaller than μ. This allows one to iteratively partition a given interval for extracting
the eigenvalues belonging to this interval. On a parallel computer, the strategy of
TREPS can be followed as described in Sect. 8.2.3. The eigenvector corresponding
to a computed eigenvalue λ can be obtained by inverse iteration using the last LDL^⊤
factorization utilized in determining λ.
An efficient sparse LDL^⊤ factorization scheme by Iain Duff et al. [11] is implemented as routine MA57 in the Harwell Subroutine Library [12] (it also exists in
MATLAB as procedure ldl).
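The counting step can be illustrated with the dense LDL^⊤ routine available in SciPy, which plays the role of MA57/ldl in this sketch; the inertia is read off the 1 × 1 and 2 × 2 blocks of D.

import numpy as np
from scipy.linalg import ldl

def count_eigs_below(A, mu):
    # Number of eigenvalues of the symmetric matrix A smaller than mu,
    # from the inertia of the LDL^T factorization of A - mu*I (Sylvester's law).
    L, D, perm = ldl(A - mu * np.eye(A.shape[0]))
    count, i, n = 0, 0, D.shape[0]
    while i < n:
        if i + 1 < n and D[i, i + 1] != 0.0:     # 2x2 block
            count += int(np.sum(np.linalg.eigvalsh(D[i:i+2, i:i+2]) < 0))
            i += 2
        else:                                    # 1x1 block
            count += int(D[i, i] < 0)
            i += 1
    return count

Bisection on μ with this count isolates the eigenvalues contained in any given subinterval, exactly as in the TREPS strategy.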
If the matrix A ∈ Rn×n is banded with a narrow semi-bandwidth d, two options
are viable for applying Sturm sequences:
1. The matrix A is first tridiagonalized via the orthogonal similarity transformation
T = Q^⊤ A Q using parallel schemes such as those outlined in either [13] or [14],
which consume O(n 2 d) arithmetic operations on a uniprocessor.
2. The TREPS strategy is used through the band-LU factorization of A −μI with no
pivoting to determine the Sturm sequences, e.g. see [15]. Note that the calculation
of one sequence on a uniprocessor consumes O(nd 2 ) arithmetic operations.
The first option is usually more efficient on parallel architectures since the reduction
to the tridiagonal form is performed only once, while in the second option we will
need to obtain the LU-factorization of A − μI for different values of μ.
In handling very large symmetric eigenvalue problems on uniprocessors, Sturm
sequences are often abandoned as inefficient compared to the Lanczos method. How-
ever, on parallel architectures the Sturm sequence approach offers great advantages.
This is due mainly to three factors: (i) recent improvements in parallel factorization schemes for large sparse matrices on a multicore node; (ii) the ability of the Sturm sequence approach to determine exactly the number of eigenvalues lying in a given subinterval; and (iii) the use of many multicore nodes, one node per subinterval.
Theorem 11.4 (Lanczos identities) The matrices V_{m+1} and T̂_m satisfy the following relations:
V_{m+1}^⊤ V_{m+1} = I_{m+1},                    (11.7)
and AV_m = V_{m+1} T̂_m.                    (11.8)
Many arithmetic operations can be skipped in the Lanczos process when compared
to that of Arnoldi: since the entries of the matrix located above the first superdiagonal
are zeros, orthogonality between most of the pairs of vectors is mathematically guar-
anteed. The procedure is implemented as shown in Algorithm 11.5. The matrix T̂m
is usually stored in the form of the two vectors (α1 , . . . , αm ) and (0, β2 , . . . , βm+1 ).
In this algorithm, it is easy to see that it is possible to proceed without storing the
basis Vm since only vk−1 and vk are needed to perform iteration k. This interesting
feature allows us to use large values for m. Theoretically (i.e. in exact arithmetic), at
some m ≤ n the entry βm+1 becomes zero. This implies that Vm is an orthonormal
basis of an invariant subspace and that all the eigenvalues of Tm are eigenvalues of
A. Thus, the Lanczos process terminates with Tm being irreducible (since (βk )k=2:m
are nonzero) with all its eigenvalues simple.
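The three-term recurrence that underlies Algorithm 11.5 can be sketched as follows; only the two most recent basis vectors are kept, and the coefficients are returned in the two vectors described above.

import numpy as np

def lanczos(A, v, m):
    # Build the entries (alpha_1, ..., alpha_m) and (0, beta_2, ..., beta_{m+1}) of T^hat_m.
    alpha, beta = np.zeros(m), np.zeros(m + 1)
    v_prev, v_curr = np.zeros_like(v), v / np.linalg.norm(v)
    for k in range(m):
        w = A @ v_curr - beta[k] * v_prev
        alpha[k] = v_curr @ w
        w -= alpha[k] * v_curr
        beta[k + 1] = np.linalg.norm(w)
        if beta[k + 1] == 0.0:           # invariant subspace found: exact termination
            break
        v_prev, v_curr = v_curr, w / beta[k + 1]
    return alpha, beta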
The Effect of Roundoff Errors
Unfortunately, in floating-point arithmetic the picture is not as straightforward: the
orthonormality expressed in (11.7) is no longer assured. Instead, the relation (11.8)
is now replaced by
Given an initial matrix V_1 ∈ R^{n×p} with orthonormal columns, the block Lanczos algorithm reduces A to the block-tridiagonal matrix
T = \begin{pmatrix} G_1 & R_1^⊤ & & & \\ R_1 & G_2 & R_2^⊤ & & \\ & \ddots & \ddots & \ddots & \\ & & R_{k-2} & G_{k-1} & R_{k-1}^⊤ \\ & & & R_{k-1} & G_k \end{pmatrix}                    (11.11)

and produces blocks V_1, . . . , V_k satisfying
AV = V T + Z (0, . . . , 0, I p ) (11.12)
where
V = (V1 , V2 , . . . , Vk ). (11.13)
Let us assume that the tridiagonalization is given by (11.9). For any eigenpair (μ, y)
of T_m, where y = (γ_1, . . . , γ_m)^⊤ is of unit 2-norm, the vector x = V_m y is called the Ritz vector, which satisfies

|μ − λ| ≤ \frac{‖Ax − μx‖}{‖x‖},                    (11.16)
The convergence of the Ritz values to the eigenvalues of A has been investigated
extensively (e.g. see the early references [19, 20]), and can be characterized by the
following theorem.
Theorem 11.5 (Eigenvalue convergence) Let T_m be the symmetric tridiagonal matrix built by Algorithm 11.5 from the initial vector v. Let the eigenvalues of A and T_m be respectively denoted by (λ_i)_{i=1:n} and (μ_i^{(m)})_{i=1:m} and labeled in decreasing order. The difference between the ith exact and approximate eigenvalues, λ_i and μ_i^{(m)}, respectively, satisfies the following inequality:

0 ≤ λ_i − μ_i^{(m)} ≤ (λ_1 − λ_n) \left( \frac{κ_i^{(m)} \tan∠(v, u_i)}{T_{m−i}(1 + 2γ_i)} \right)^2,                    (11.18)

where u_i is the eigenvector corresponding to λ_i, γ_i = \frac{λ_i − λ_{i+1}}{λ_{i+1} − λ_n}, and κ_i^{(m)} is given by

κ_1^{(m)} = 1, and κ_i^{(m)} = \prod_{j=1}^{i−1} \frac{μ_j^{(m)} − λ_n}{μ_j^{(m)} − λ_i}, for i > 1,                    (11.19)
where ‖x‖ ≥ σ_min(V_m) can be very small. Several authors have discussed how accurately the eigenvalues of T_m can approximate the eigenvalues of A in finite precision; for example, see the historical overview in [21]. The framework of the theory, established in [20, 22], proves that loss of orthogonality appears when V_m includes a good approximation of an eigenvector of A and that, by continuing the process, new copies of the same eigenvalue can be regenerated.
A scheme for discarding “spurious” eigenvalues by computing the eigenvalues of
the tridiagonal matrix T̃m obtained from Tm by deleting its first row and first column
was developed in [23]. As a result, any common eigenvalue of Tm and T̃m is deemed
spurious.
It is therefore possible to compute a given part of the spectrum Λ(A). For in-
stance, if the sought after eigenvalues are those that belong to an interval [a, b],
Algorithm 11.8 provides those estimates with their corresponding eigenvectors with-
out storing the basis Vm . Once an eigenvector y of Tm is computed, the Ritz vector
x = Vm y needs to be computed. To overcome the lack of availability of Vm , however,
a second pass of the Lanczos process is carried out with the same initial vector v, accumulating the products in Step 6 of Algorithm 11.8 on the fly.
Algorithm 11.8 exhibits a high level of parallelism as outlined below:
• Steps 2 and 3: Computing the eigenvalues of a tridiagonal matrix via multisec-
tioning and bisections using Sturm sequences (TREPS method),
• Step 5: Inverse iterations generate q independent tasks, and
• Steps 6 and 7: Using parallel sparse matrix-vector (and multivector) multipli-
cation kernels, as well as accumulating dense matrix-vector products on the fly,
and
• Step 7: Computing simultaneously the norms of the q columns of W .
Since all the desired eigenpairs do not converge at the same iteration, Algorithm 11.9 can be accelerated by incorporating a deflation procedure that stores the converged vectors and applies the rest of the algorithm to PA, instead of A, where P is the orthogonal projector that maintains the orthogonality of the next basis with respect to the converged eigenvectors.
Parallel scalability of Algorithm 11.9 can be further enhanced by using a block ver-
sion of the Lanczos eigensolver based on Algorithm 11.7. Once the block-tridiagonal
matrix Tm is obtained, any standard algorithm can be employed to find all its eigen-
pairs. For convergence results for the block scheme analogous to Theorem 11.5, see
[17, 18]. An alternative which involves fewer arithmetic operations consists of or-
thogonalizing the newly computed Lanczos block against the (typically few) con-
verged Ritz vectors. This scheme is known as the block Lanczos scheme with selec-
tive orthogonalization. The practical aspects of enforcing orthogonality in either of
these ways are discussed in [28–30]. The IRAM strategy was adapted for the block
Lanczos scheme in [31].
Ax = λx, (11.21)
which are suitable for implementation on parallel architectures with high efficiency.
In several computational science and engineering applications there is the need for
developing efficient parallel algorithms for approximating the extreme eigenpairs of
the series of slightly perturbed eigenvalue problems,
where
A(S_i) = A + B S_i B^⊤                    (11.23)
We are now in a position to discuss how to choose the starting block V1 for the block
Lanczos reduction of A(Si ) in order to take advantage of the fact that A has already
been reduced to block tridiagonal form via the algorithm described above. Recalling
that
A(S_i) = A + B S_i B^⊤,
where 1 ≤ i ≤ m, p ≪ n, and the matrix B has full column rank, the idea of the approach is rather simple. We take as a starting block the matrix V_1 given by the orthogonal factorization of B, i.e. B = V_1 R_0, where R_0 ∈ R^{p×p} is upper triangular.
Then, the original matrix A is reduced to the block tridiagonal matrix T via the
block Lanczos algorithm.
In the following, we show that this choice of the starting Lanczos block leads to a
reduced block tridiagonal form of the perturbed matrices A(Si ) which is only a rank
p perturbation of the original block tridiagonal matrix T. Let V contain the Lanczos blocks V_1, V_2, . . . , V_k generated by the algorithm. From the orthogonality property of these matrices, we have

V_i^⊤ V_j = \begin{cases} I_p & for i = j \\ 0 & for i ≠ j \end{cases}                    (11.24)
Since
V_1 = V (I_p, 0, . . . , 0)^⊤,                    (11.26)
then,
V^⊤ E_i V = \begin{pmatrix} R_0^⊤ S_i R_0 & & & \\ & 0 & & \\ & & \ddots & \\ & & & 0 \end{pmatrix},                    (11.27)
and,
where
T(S_i) = \begin{pmatrix} G̃_1 & R_1^⊤ & & & \\ R_1 & G_2 & R_2^⊤ & & \\ & \ddots & \ddots & \ddots & \\ & & R_{k-2} & G_{k-1} & R_{k-1}^⊤ \\ & & & R_{k-1} & G_k \end{pmatrix}                    (11.29)
with
G̃_1 = G_1 + R_0^⊤ S_i R_0,
i.e. the matrix V^⊤ A(S_i) V is of the same structure as V^⊤ A V. It is then clear that the advantage of choosing such a set of starting vectors is that the block Lanczos algorithm needs to be applied only to the matrix A to yield the block tridiagonal matrix T. Once T is formed, all block tridiagonal matrices T(S_i) can be easily obtained by the addition of the term R_0^⊤ S_i R_0 to the first diagonal block G_1 of T. The matrix V
is independent of Si and, hence, remains the same for all A(Si ). Consequently, the
computational savings will be significant for large-scale engineering computations
which require many small modifications (or reanalyses) of the structure.
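Assuming T has been assembled once (for instance as a dense block-tridiagonal matrix) and R_0 is the triangular factor from the QR factorization of B, each perturbed reduction T(S_i) is obtained by a single update of the leading block, as in the sketch below (the names are illustrative):

import numpy as np

def perturbed_reductions(T, R0, S_list):
    # T = V^T A V from the block Lanczos run started with B = V_1 R_0.
    # T(S_i) differs from T only in its leading p x p block: G_1 + R0^T S_i R0.
    p = R0.shape[0]
    out = []
    for S in S_list:
        Ti = T.copy()
        Ti[:p, :p] += R0.T @ S @ R0
        out.append(Ti)
    return out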
The starting set of vectors discussed in the previous section is useful for handling
A(Si )x = λx when the largest eigenvalues of A(Si ) are required. For many engi-
neering or scientific applications, however, only a few of the smallest (in magnitude)
eigenvalues and their corresponding eigenvectors are desired. In such a case, one usu-
ally considers the shift and invert procedure instead of working with A(Si )x = λx
directly. In other words, to seek the eigenvalue near zero, one considers instead the problem A(S_i)^{-1} x = \frac{1}{λ} x. Accordingly, to be able to take advantage of the nature of
the perturbations, we must choose an appropriate set of starting vectors.
It is not clear at this point how the starting vectors for (A + B S_i B^⊤)^{-1} x = μx, where μ = \frac{1}{λ}, can be properly chosen so as to yield a block tridiagonal structure
analogous to that shown in (11.29). However, if we assume that both A and Si are
nonsingular, the Woodbury formula yields,
Ṽ^⊤ A^{-1} Ṽ = T̃                    (11.31)
via the block Lanczos scheme, we choose the first block Ṽ_1 of Ṽ as the orthonormal
matrix resulting from the orthogonal factorization
Note that T̃ (Si ) is a block tridiagonal matrix identical to T̃ in (11.31), except for the
first diagonal block. Furthermore, note that as before, Ṽ is independent of Si and
hence, remains constant for 1 ≤ i ≤ m.
In this section, we address the extension of our approach to the perturbed generalized
eigenvalue problems of type
and
where K (Si ) and M(Si ) are assumed to be symmetric positive definite for all i,
1 ≤ i ≤ m. In structural mechanics, K (Si ) and M(Si ) are referred to as stiffness
and mass matrices, respectively.
Let K = L_K L_K^⊤ and M = L_M L_M^⊤ be the Cholesky factorizations of K and M, respectively. Then the generalized eigenvalue problems (11.35) and (11.36) are reduced to the standard form

(K̃ + B̃ S_i B̃^⊤) y = λ y                    (11.37)

and

(M̂ + B̂ S_i B̂^⊤) z = λ^{-1} z                    (11.38)

where

K̃ = L_M^{-1} K L_M^{-⊤},  B̃ = L_M^{-1} B,  and  y = L_M^⊤ x                    (11.39)

and

M̂ = L_K^{-1} M L_K^{-⊤},  B̂ = L_K^{-1} B,  and  z = L_K^⊤ x.                    (11.40)
Now, both problems can be treated as discussed above. If one seeks those eigenpairs
closest to zero in (11.35), then we need only obtain the block tridiagonal form asso-
ciated with K̃ −1 once the relevant information about the starting orthonormal block
is obtained from the orthogonal factorization for B̃. Similarly, in (11.36), one needs
the block tridiagonal form associated with M̂ based on the orthogonal factorization
of B̂.
11.3.4 Remarks
Typical examples of the class of problems described in the preceding sections arise
in the dynamic analysis of modified structures. A frequently encountered problem is
how to take into account, in analysis and design, changes introduced after the initial
structural dynamic analysis has been completed. Typically, the solution process is
of an iterative nature and consists of repeated modifications to either the stiffness
or the mass of the structure in order to fine-tune the constraint conditions. Clearly,
the number of iterations depends on the complexity of the problem, together with
the nature and number of constraints. Even though these modifications may be only
slight, a complete reanalysis of the new eigenvalues and eigenvectors of the modified
eigensystem is often necessary. This can drive the computational cost of the entire
process up dramatically especially for large scale structures.
The question then is how information obtained from the initial/previous analysis
can be readily exploited to derive the response of the new modified structure without
extensive additional computations. To illustrate the usefulness of our approach in this
respect, we present some numerical applications from the free vibrations analysis of
an undamped cantilever beam using finite elements. Without loss of generality, we
consider only modifications to the system stiffness matrix.
We assume that the beam is uniform along its span and that it is composed of
a linear, homogeneous, isotropic, elastic material. Further, the beam is assumed
to be slender, i.e. deformation perpendicular to the beam axis is due primarily to
bending (flexing), and shear deformation perpendicular to the beam axis can be
neglected; shear deformation and rotational inertia effects only become important
when analyzing deep beams at low frequencies or slender beams at high frequencies.
Subsequently, we consider only deformations normal to the undeformed beam axis
and we are only interested in the few lowest natural frequencies.
The beam possesses an additional support at its free end by a spring (assumed
to be massless) with various stiffness coefficients αi , i = 1, 2, . . . , m, as shown in
Fig. 11.1. The beam is assumed to have length L = 3.0, a distributed mass m̄ per
unit length, and a flexural rigidity E I .
Fig. 11.1 Cantilever beam of length L = 3.0 with flexural rigidity EI = 10^7 and distributed mass m̄ = 390, supported at its free end by a spring of stiffness α_i
First, we discretize the beam, without spring support, using one-dimensional solid beam finite elements, each of length ℓ = 0.3. Using a lumped mass approach, we get a diagonal mass matrix M, while the stiffness matrix K is block-tridiagonal of the form

K = \frac{EI}{ℓ^3} \begin{pmatrix} A_s & C & & & \\ C & A_s & C & & \\ & \ddots & \ddots & \ddots & \\ & & C & A_s & C \\ & & & C & A_r \end{pmatrix}
in which each As , Ar , and C is of order 2. Note that both M and K are symmetric
positive definite and each is of order 20. Including the spring with stiffness αi , we
obtain the perturbed stiffness matrix
where b = e19 (the 19th column of I20 ). In other words, K (αi ) is a rank-1 perturbation
of K . Hence, the generalized eigenvalue problems for this discretization are given
by
which can be reduced to the standard form by symmetrically scaling K (αi ) using
the diagonal mass matrix M, i.e.
in which y = M 1/2 x.
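For this small example, forming K(α_i) and reducing the problem to standard form amounts to a single diagonal update and a diagonal scaling, as in the following sketch (K and the lumped-mass diagonal are placeholders; the spring acts on the 19th degree of freedom, i.e. zero-based index 18):

import numpy as np

def scaled_perturbed_stiffness(K, M_diag, alpha, j=18):
    # K(alpha) = K + alpha * e_j e_j^T; return M^{-1/2} K(alpha) M^{-1/2},
    # whose eigenvalues are those of K(alpha) x = lambda M x (with y = M^{1/2} x).
    K_alpha = K.copy()
    K_alpha[j, j] += alpha                    # rank-1 update touches one diagonal entry
    d = 1.0 / np.sqrt(M_diag)                 # M^{-1/2} for a diagonal mass matrix
    return (d[:, None] * K_alpha) * d[None, :]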
Second, we consider modeling the beam as a three-dimensional object using
regular hexahedral elements, see Fig. 11.2. This element possesses eight nodes, one
at each corner, with each having three degrees of freedom, namely, the components
of displacement u, v, and w along the directions of the x, y, and z axes, respectively.
Fig. 11.2 Regular hexahedral finite element with eight corner nodes, each having displacement components u, v, and w along the x, y, and z axes
In the case of the hexahedral element [32, 33], the element stiffness and mass
matrices possess the SAS property (see Chap. 6) with respect to some reflection
matrix and, hence, they can each be recursively decomposed into eight submatrices.
Because of these properties of the hexahedral element (i.e. three levels of symmet-
ric and antisymmetric decomposability), if SAS ordering of the nodes is employed,
then the system stiffness matrix K , of order n, satisfies the relation
PKP = K                    (11.43)
Q^⊤ K Q = diag(K_1, . . . , K_4)                    (11.44)
in which Q is an orthogonal matrix that can be easily constructed from the permu-
tation submatrices that constitute P. The matrices K i are each of order n/4 and of
a much smaller bandwidth. Recall that in obtaining the smallest eigenpairs, we need to solve systems of the form
K z = g.                    (11.45)
It is evident that in each step of the block Lanczos algorithm, we have four inde-
pendent systems that can be solved in parallel, with each system of a much smaller
bandwidth than that of the stiffness matrix K .
In 1975, an algorithm for solving the symmetric eigenvalue problem, called David-
son’s method [34], emerged from the computational chemistry community. This
successful eigensolver was later generalized and its convergence proved in [35, 36].
The Davidson method can be viewed as a modification of Newton’s method ap-
plied to the system that arises from treating the symmetric eigenvalue problem as a
constrained optimization problem involving the Rayleigh quotient. This method is
a precursor to other methods that appeared later such as Trace Minimization [37,
38], and Jacobi-Davidson [39] which are discussed in Sect. 11.5. These approaches
essentially take the viewpoint that the eigenvalue problem is a nonlinear system of
equations and attempt to find a good way to correct a given approximate eigenpair.
In practice, this requires solving a correction equation which updates the current
approximate eigenvector, in a subspace that is orthogonal to it.
All versions of the Davidson method may be regarded as various forms of precondi-
tioning the basic Lanczos method. In order to illustrate this point, let us consider a
symmetric matrix A ∈ Rn×n . Both the Lanczos and Davidson algorithms generate,
at some iteration k, an orthonormal basis V_k = (v_1, . . . , v_k) of a k-dimensional subspace V_k of R^n and the symmetric interaction matrix H_k = V_k^⊤ A V_k ∈ R^{k×k}. In
the Lanczos algorithm, Vk is the Krylov subspace Vk = Kk (A, v), but for Davidson
methods, this is not the case. In the Lanczos method, the goal is to obtain Vk such
that some eigenvalues of Hk are good approximations of some eigenvalues of A.
In other words, if (λ, y) is an eigenpair of Hk , then it is expected that the Ritz pair
(λ, x), where x = Vk y, is a good approximation of an eigenpair of A. Note that this
occurs only for some eigenpairs of Hk and only at convergence.
Davidson methods differ from the Lanczos scheme in the definition of the new
direction which will be added to the subspace Vk to obtain Vk+1 . In Lanczos schemes,
Vk+1 is the basis of the Krylov subspace Kk+1 (A, v). In other words, if reorthog-
onalization is not considered (see Sect. 11.2), the vector vk+1 is computed by a
three term recurrence. In Davidson methods, however, the local improvement of the
direction of the Ritz vector towards the sought after eigenvector is obtained by a
quasi-Newton step in which the next vector is obtained by reorthogonalization with
respect to Vk . Moreover, in this case, the matrix Hk is no longer tridiagonal. There-
fore, one iteration of Davidson methods involves more arithmetic operations than the
basic Lanczos scheme, and is at least as expensive as the Lanczos scheme with full re-
orthogonalization. Note also, that Davidson methods require the storage of the basis
Vk thus limiting the maximum value kmax in order to control storage requirements.
Consequently, Davidson methods are implemented with periodic restarts.
To compute the smallest eigenvalue of a symmetric matrix A, the template of the
basic Davidson method is given by Algorithm 11.10. In this template, the operator
Ck of Step 12 represents any type of correction, to be specified later in Sect. 11.4.3.
Each Davidson method is characterized by this correction step.
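As an illustration of the template (and not of the exact listing of Algorithm 11.10), the following sketch computes the smallest eigenvalue with Davidson's original diagonal correction C_k = (diag(A) − λI)^{-1}; restarting is omitted for brevity, and A is treated as a dense symmetric matrix.

import numpy as np

def davidson_smallest(A, v0, kmax=30, tol=1e-8):
    dA = np.diag(A)
    V = (v0 / np.linalg.norm(v0)).reshape(-1, 1)
    lam, x = 0.0, V[:, 0]
    for _ in range(kmax):
        H = V.T @ (A @ V)                     # interaction matrix H_k = V_k^T A V_k
        theta, Y = np.linalg.eigh(H)
        lam, x = theta[0], V @ Y[:, 0]        # smallest Ritz pair
        r = A @ x - lam * x
        if np.linalg.norm(r) <= tol:
            break
        denom = dA - lam
        denom[np.abs(denom) < 1e-10] = 1e-10  # guard the diagonal correction
        t = r / denom                         # correction step
        t -= V @ (V.T @ t)                    # orthogonalize against the current basis
        if np.linalg.norm(t) < 1e-14:
            break
        V = np.column_stack([V, t / np.linalg.norm(t)])
    return lam, x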
11.4.2 Convergence
In this section, we present a general convergence result for the Davidson methods
as well as the Lanczos eigensolver with reorthogonalization (Algorithm 11.9). For a
more general context, we consider the block version of these algorithms. Creating
Algorithm 11.11 as the block version of Algorithm 11.10, we generalize Step 11 so
that the corrections Ck,i can be independently chosen. Further, when some, but not
all, eigenpairs have converged, a deflation process is considered by appending the
converged eigenvectors as the leading columns of Vk .
then limk→∞ λk,i is an eigenvalue of A and the elements in {xk,i }k≥1 yield the
corresponding eigenvectors.
This theorem provides a convergence proof of the single vector version of the David-
son method expressed in Algorithm 11.10, as well as the LANCZOS2 method since
in this case Ck,i = I , even for the block version.
To motivate the distinct approaches for the correction Step 12 in Algorithm 11.10,
let us assume that a vector x of unit norm approximates an unknown eigenvector
(x + y) of A, where y is chosen orthogonal to x. The quantity λ = ρ(x) (where
ρ(x) = x^⊤ A x denotes the Rayleigh quotient of x) approximates the eigenvalue λ + δ which corresponds to the eigenvector x + y: λ + δ = ρ(x + y) = \frac{(x+y)^⊤ A (x+y)}{‖x+y‖^2}.
The quality of the initial approximation is measured by the norm of the residual
r = Ax − λx. Since the residual r = (I − x x^⊤)Ax is orthogonal to the vector x,
then if we denote by θ the angle ∠(x, x + y), and t is the orthogonal projection of
x onto x + y and z = x − t, then we have:
Proposition 11.2 Using the previous notations, the correction δ to the approxi-
mation λ of an eigenvalue, and the orthogonal correction y for the corresponding
approximate eigenvector x, satisfy:
(A − λI) y = −r + δ(x + y),   y ⊥ x.                    (11.49)

Moreover,

λ + δ = ρ(t) = (1 + tan^2 θ) t^⊤ A t = (1 + tan^2 θ)(x − z)^⊤ A (x − z) = (1 + tan^2 θ)(λ − 2 x^⊤ A z + z^⊤ A z).
Unfortunately, this system has no solution except the trivial solution when λ is an
eigenvalue. Three options are now considered to overcome this difficulty.
First Option: Using a Preconditioner:
The first approach consists of replacing Problem (11.52) by
(M − λI )y = −r, (11.53)
(A − λI )y = −r, (11.54)
provokes an error which is in the direction of the desired eigenvector. The iterative
solver, however, must be capable of handling symmetric indefinite systems.
Third Approach: Using Projections
By considering the orthogonal projector P = I − V_k V_k^⊤ onto V_k^⊥, the system to be solved becomes

P(A − λI) y = −r,   P y = y.                    (11.55)

Note that the system matrix can also be expressed as P(A − λI)P since Py = y. This system is then solved via an iterative scheme that accommodates symmetric indefiniteness. The Jacobi-Davidson method [39] follows this approach.
The study [40] compares the second approach, which can be seen as the Rayleigh quotient method, with the Newton-Grassmann method, which corresponds to this third approach. The study concludes that the two correction schemes have comparable behavior.
behavior. In [40], the authors also provide a stopping criterion for controlling the
inner iterations of an iterative solver for the correction vectors.
This approach is studied in the section devoted to the Trace Minimization method
as an eigensolver of the generalized symmetric eigenvalue problem (see Sect. 11.5).
Also, in this section we describe the similarity between the Jacobi-Davidson method
and the method that preceded it by almost two decades: the Trace Minimization
method.
Ax = λBx, (11.56)
where A and B are n × n real symmetric matrices with B being positive definite,
arises in many applications, most notably in structural mechanics [41, 42] and plasma
physics [43, 44]. Usually, A and B are large, sparse, and only a few of the eigenvalues
and the associated eigenvectors are desired. Because of the size of the problem, meth-
ods that rely only on operations like matrix-vector multiplications, inner products,
and vector updates are usually considered.
Many methods fall into this category (see, for example [45, 46]). The basic idea in
all of these methods is building a sequence of subspaces that, in the limit, contain the
desired eigenvectors. Most of the early methods iterate on a single vector, i.e. using
one-dimensional subspaces, to compute one eigenpair at a time. If, however, several
eigenpairs are needed, a deflation technique is frequently used. Another alternative
is to use block analogs of the single vector methods to obtain several eigenpairs
It follows that, for any B-orthonormal basis X_{k+1} of the new subspace span{X_k − Δ_k}, we have
tr(X_{k+1}^⊤ A X_{k+1}) < tr(X_k^⊤ A X_k),
lem [60–62] require solving a linear system of the form Bx = b at each iteration
step, or factorizing matrices of the form (A − σ B) during each iteration. Davidson’s
method, which can be regarded as a preconditioned Lanczos method, was intended to
be a practical method for standard eigenvalue problems in quantum chemistry where
the matrices involved are diagonally dominant. In the past two decades, Davidson’s
method has gone through a series of significant improvements [35, 36, 63–65]. A
development is the Jacobi-Davidson method [39], published in 1996, which is a
variant of Davidson’s original scheme and the well-known Newton’s method. The
Jacobi-Davidson algorithm for the symmetric eigenvalue problem may be regarded
as a generalization of the trace minimization scheme (which was published 15 years
earlier) that uses expanding subspaces. Both utilize an idea that dates back to Ja-
cobi [66]. As we will see later, the current Jacobi-Davidson scheme can be further
improved by the techniques developed in the trace minimization method.
Throughout this section, the eigenpairs of (11.56) are denoted by (x_i, λ_i), 1 ≤ i ≤ n, with the eigenvalues arranged in ascending order.
(A − μB)x = (λ − μ)Bx
Theorem 11.7 (Beckenbach and Bellman [67], Sameh and Wisniewski [37]). Let A and B be as given in Problem (11.56), and let X* be the set of all n × p matrices X for which X^⊤ B X = I_p, 1 ≤ p ≤ n. Then

\min_{X ∈ X*} tr(X^⊤ A X) = \sum_{i=1}^{p} λ_i.                    (11.58)
X^⊤AX / X^⊤BX is called the generalized Rayleigh quotient. Most of the early methods that compute a few of the smallest eigenvalues are devised explicitly or implicitly by reducing the generalized Rayleigh quotient step by step. A simple example is the simultaneous iteration scheme for a positive definite matrix A where the current approximation X_k is updated by (11.58). It can be shown by the Courant-Fischer theorem [1] and the Kantorovich inequality [68, 69] that

λ_i ≤ λ_i\!\left(\frac{X_{k+1}^⊤ A X_{k+1}}{X_{k+1}^⊤ B X_{k+1}}\right) ≤ λ_i\!\left(\frac{X_k^⊤ A X_k}{X_k^⊤ B X_k}\right),   1 ≤ i ≤ p.                    (11.59)
and
Z_{k+1}^⊤ B Z_{k+1} = I_s + Δ_k^⊤ B Δ_k,                    (11.63)
for any B-orthonormal basis X k+1 of the subspace span{Z k+1 }. The equality in
(11.64) holds only when Δk = 0, i.e. X k spans an eigenspace of (11.56) (see
Theorem 11.10 for details).
Using Lagrange multipliers, the solution of the minimization problem (11.61) can
be obtained by solving the saddle-point problem
\begin{pmatrix} A & B X_k \\ X_k^⊤ B & 0 \end{pmatrix} \begin{pmatrix} Δ_k \\ L_k \end{pmatrix} = \begin{pmatrix} A X_k \\ 0 \end{pmatrix},                    (11.65)
where L k represents the Lagrange multipliers. Several methods may be used to solve
(11.65) using either direct methods via the Schur complement or via preconditioned
iterative schemes, e.g. see the detailed survey [70]. In [37], (11.65) is further reduced
to solving the following positive-semidefinite system
From now on, we will refer to the linear system (11.66) in step 8 as the inner
system(s). It is easy to see that the exact solution of the inner system is
thus the subspace spanned by X k − Δk is the same subspace spanned by A−1 BXk . In
other words, if the inner system (11.66) is solved exactly at each iteration step, the
trace minimization algorithm above is mathematically equivalent to simultaneous iteration.
Theorem 11.8 ([1, 4, 37]) Let A and B be positive definite and let s ≥ p be the block size such that the eigenvalues of Problem (11.56) satisfy 0 < λ_1 ≤ λ_2 ≤ · · · ≤ λ_s < λ_{s+1} ≤ · · · ≤ λ_n. Let also the initial iterate X_0 be chosen such that it has linearly independent columns and is not deficient in any eigen-component associated with the p smallest eigenvalues. Then the ith column of X_k, denoted by x_{k,i}, converges to the eigenvector x_i corresponding to λ_i for i = 1, 2, . . . , p with an asymptotic rate of convergence bounded by λ_i/λ_{s+1}. More specifically, at each step, the error
The main difference between the trace minimization algorithm and simultaneous
iteration is in step 8. If both (11.60) and (11.66) are solved via the CG scheme exactly,
the performance of either algorithm is comparable in terms of time consumed, as
observed in practice. The additional cost in performing the projection P at each CG
step (once rather than twice) is not high because the block size s is usually small,
i.e. s ≪ n. This additional cost is sometimes compensated for by the fact that PAP, when it is restricted to the subspace {v ∈ R^n | Pv = v}, is better conditioned than A
as will be seen in the following theorem.
Theorem 11.9 Let A and B be as given in Theorem 11.8 and P be given as in (11.66),
and let ν_i, μ_i, 1 ≤ i ≤ n, be the eigenvalues of A and PAP arranged in ascending
order, respectively. Then, we have
In practice, however, the inner systems (11.66) are always solved approximately,
particularly for large problems. Note that the error (11.68) in the ith column of X k
is reduced asymptotically by a factor of (λi /λs+1 )2 at each iteration step. Thus, we
should not expect high accuracy in the early Ritz vectors even if the inner systems are
solved to machine precision. Further, convergence of the trace minimization scheme
is guaranteed if a constant relative residual tolerance is used for the inner system
(11.66) in each outer iteration.
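Assuming P denotes the orthogonal projector onto the orthogonal complement of range(BX_k), built here from a QR factorization of BX_k (an assumption of this sketch, not the TraceMIN code itself), one inner solve with a constant relative-residual factor γ can be sketched as follows:

import numpy as np

def tracemin_inner_cg(A, BX, x, gamma=0.1, maxit=50):
    # Approximately solve (P A P) d = P A x by CG with zero initial guess;
    # the zero start keeps every iterate in range(P), so B-orthogonality to X_k holds.
    Q, _ = np.linalg.qr(BX)
    P = lambda v: v - Q @ (Q.T @ v)
    op = lambda v: P(A @ P(v))
    b = P(A @ x)
    d, r = np.zeros_like(b), b.copy()
    p, rho = r.copy(), b @ b
    tol = gamma * np.linalg.norm(b)
    for _ in range(maxit):
        if np.sqrt(rho) <= tol:                # stop once the residual is reduced by gamma
            break
        q = op(p)
        alpha = rho / (p @ q)
        d, r = d + alpha * p, r - alpha * q
        rho_new = r @ r
        p, rho = r + (rho_new / rho) * p, rho_new
    return d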
A Convergence Result
We prove convergence of the trace minimization algorithm under the assumption
that the inner systems in (11.66) are solved inexactly. We assume that, for each i, 1 ≤ i ≤ s, the ith inner system in (11.66) is solved approximately by the CG scheme with zero as the initial iterate such that the 2-norm of the residual is reduced by a factor γ < 1. The computed correction matrix will be denoted by Δ_k^c = (d_{k,1}^c, d_{k,2}^c, . . . , d_{k,s}^c) to distinguish it from the exact solution Δ_k = (d_{k,1}, d_{k,2}, . . . , d_{k,s}) of (11.66).
We begin the convergence proof with two lemmas. We first show that, in each
iteration, the columns of X_k − Δ_k^c are linearly independent, and the sequence {X_k}_{k≥0} in the trace minimization algorithm is well-defined. In the second, we show that the computed correction matrix Δ_k^c satisfies
This assures that, no matter how prematurely the CG process is terminated, tr(X_k^⊤ A X_k) always forms a decreasing sequence bounded from below by \sum_{i=1}^{s} λ_i.
where x_{k,i} is the ith column of X_k and P is the projector in (11.66). As a consequence, for each i, d_{k,i}^c is B-orthogonal to X_k, i.e. X_k^⊤ B d_{k,i}^c = 0. Thus the matrix
Z_{k+1}^⊤ B Z_{k+1} = I_s + (Δ_k^c)^⊤ B Δ_k^c
Lemma 11.3 Suppose that the inner systems in (11.66) are solved by the CG scheme with zero as the initial iterate. Then, for each i, (x_{k,i} − d_{k,i}^{(ℓ)})^⊤ A (x_{k,i} − d_{k,i}^{(ℓ)}) decreases monotonically with respect to the step ℓ of the CG scheme.
for which PΔ_k = Δ_k. For each i, 1 ≤ i ≤ s, the intermediate d_{k,i}^{(ℓ)} in the CG process also satisfies P d_{k,i}^{(ℓ)} = d_{k,i}^{(ℓ)}. Thus, it follows that
(d_{k,i} − d_{k,i}^{(ℓ)})^⊤ PAP (d_{k,i} − d_{k,i}^{(ℓ)}) = (d_{k,i} − d_{k,i}^{(ℓ)})^⊤ A (d_{k,i} − d_{k,i}^{(ℓ)})
= (x_{k,i} − d_{k,i}^{(ℓ)})^⊤ A (x_{k,i} − d_{k,i}^{(ℓ)}) − e_i^⊤ (X_k^⊤ B A^{-1} B X_k)^{-1} e_i.
Since the CG process minimizes the PAP-norm of the error d_{k,i} − d_{k,i}^{(ℓ)} on the expanding Krylov subspace [37], both (d_{k,i} − d_{k,i}^{(ℓ)})^⊤ PAP (d_{k,i} − d_{k,i}^{(ℓ)}) and (x_{k,i} − d_{k,i}^{(ℓ)})^⊤ A (x_{k,i} − d_{k,i}^{(ℓ)}) decrease monotonically.
Theorem 11.10 Let X_k, Δ_k^c, and Z_{k+1} be as given in Lemma 11.2. Then lim_{k→∞} Δ_k^c = 0.
Z_{k+1}^⊤ B Z_{k+1} = I_s + (Δ_k^c)^⊤ B Δ_k^c ≡ I_s + T_k.

Denoting by z_i^{(k+1)} the diagonal elements of the matrix U_{k+1}^⊤ Z_{k+1}^⊤ A Z_{k+1} U_{k+1}, it follows that

tr(X_{k+1}^⊤ A X_{k+1}) = tr(D_{k+1}^{-1} U_{k+1}^⊤ Z_{k+1}^⊤ A Z_{k+1} U_{k+1} D_{k+1}^{-1})
  = \frac{z_1^{(k+1)}}{δ_1^{(k+1)}} + \frac{z_2^{(k+1)}}{δ_2^{(k+1)}} + · · · + \frac{z_s^{(k+1)}}{δ_s^{(k+1)}}
  ≤ z_1^{(k+1)} + z_2^{(k+1)} + · · · + z_s^{(k+1)}
  = tr(Z_{k+1}^⊤ A Z_{k+1})
  ≤ tr(X_k^⊤ A X_k),

and

z_1^{(k+1)} + z_2^{(k+1)} + · · · + z_s^{(k+1)},   k = 1, 2, . . .

z_i^{(k+1)} ≥ λ_1(U_{k+1}^⊤ Z_{k+1}^⊤ A Z_{k+1} U_{k+1})
  = λ_1(Z_{k+1}^⊤ A Z_{k+1})
  = \min_{y ≠ 0} \frac{y^⊤ Z_{k+1}^⊤ A Z_{k+1} y}{y^⊤ y}
  = \min_{y ≠ 0} \frac{y^⊤ Z_{k+1}^⊤ A Z_{k+1} y}{y^⊤ Z_{k+1}^⊤ B Z_{k+1} y} · \frac{y^⊤ Z_{k+1}^⊤ B Z_{k+1} y}{y^⊤ y}
  ≥ \min_{y ≠ 0} \frac{y^⊤ Z_{k+1}^⊤ A Z_{k+1} y}{y^⊤ Z_{k+1}^⊤ B Z_{k+1} y}
  ≥ λ_1(A, B)
  > 0.

Hence, we have

λ_i(T_k) → 0,   i = 1, 2, . . . , s,
Theorem 11.11 If for each i, 1 ≤ i ≤ s, the CG process for the ith inner system (11.66),
(PAP) d_{k,i} = PA x_{k,i},   d_{k,i}^⊤ B X_k = 0,
is terminated such that the 2-norm of the residual is reduced by a factor γ < 1, i.e.
‖PA x_{k,i} − (PAP) d_{k,i}^c‖_2 ≤ γ ‖PA x_{k,i}‖_2,                    (11.69)

‖PA x_{k,i}‖_2 − ‖PA d_{k,i}^c‖_2 ≤ γ ‖PA x_{k,i}‖_2,
and consequently
‖PA x_{k,i}‖_2 ≤ \frac{1}{1 − γ} ‖PA d_{k,i}^c‖_2.
Randomization
Condition (11.69) in Theorem 11.11 is not essential because the constant γ can
be arbitrarily close to 1. The only deficiency in Theorem 11.11 is that it does not
establish ordered convergence in the sense that the ith column of X k converges to the
ith eigenvector of the problem. This is called unstable convergence in [4]. In practice,
roundoff errors turn unstable convergence into delayed stable convergence. In [4], a
randomization technique to prevent unstable convergence in simultaneous iteration
was introduced. Such an approach can be incorporated into the trace minimization
algorithm as well: After step 8 of Algorithm 11.13, we append a random vector to
X k and perform the Ritz processes 3 and 4 on the augmented subspace of dimension
s + 1. The extra Ritz pair is discarded after step 4.
Randomization slightly improves the convergence of the first s Ritz pairs [47].
Since it incurs additional cost, it should be used only in the first few steps when a
Ritz pair is about to converge.
Terminating the CG Process
Theorem 11.11 gives a sufficient condition for the convergence of the trace minimiza-
tion algorithm. However, the asymptotic rate of convergence of the trace minimiza-
tion algorithm will be affected by the premature termination of the CG processes.
The algorithm behaves differently when the inner systems are solved inexactly. It
is not clear how the parameter γ should be chosen to avoid performing excessive
CG iterations while maintaining the asymptotic rate of convergence. In [37], the CG
process is terminated by a heuristic stopping strategy.
Let d_{k,i}^{(ℓ)} be the approximate solution at the ℓth step of the CG process for the ith column of X_k and d_{k,i} the exact solution; then the heuristic stopping strategy in [37] can be outlined as follows:
1. From Theorem 11.8, it is reasonable to terminate the CG process for the ith column of Δ_k when the error
ε_{k,i}^{(ℓ)} = \left( (d_{k,i} − d_{k,i}^{(ℓ)})^⊤ A (d_{k,i} − d_{k,i}^{(ℓ)}) \right)^{1/2},
For problems in which the desired eigenvalues are poorly separated from the remain-
ing part of the spectrum, the algorithm converges slowly. Like other inverse iteration
schemes, the trace minimization algorithm can be accelerated by shifting. Actually,
the formulation of the trace minimization algorithm makes it easier to incorporate
shifts. For example, if the Ritz pairs (x_i, θ_i), 1 ≤ i ≤ i_0, have been accepted as eigenpairs and θ_{i_0} < θ_{i_0+1}, then θ_{i_0} can be used as a shift parameter for computing
subsequent eigenpairs. Due to the deflation effect, the linear systems
are consistent and can still be solved by the CG scheme. Moreover, the trace reduc-
tion property still holds. In the following, we introduce two more efficient shifting
techniques which improve further the performance of the trace minimization algo-
rithm.
Single Shift
We know from Sect. 11.5.1 that global convergence of the trace minimization al-
gorithm follows from the monotonic reduction of the trace, which in turn depends
on the positive definiteness of A. A simple and robust shifting strategy would be
finding a scalar σ close to λ_1 from below and replacing A by (A − σB) in step 8 of Algorithm 11.13. After the first eigenvector has converged, find another σ close to λ_2 from below, and continue until all the desired eigenvectors are obtained. If both A and B are explicitly available, it is not hard to find a σ satisfying σ ≤ λ_1.
Since

Q^⊤ A Q = \begin{pmatrix} Θ_k & X_k^⊤ A Y_k \\ Y_k^⊤ A X_k & Y_k^⊤ A Y_k \end{pmatrix} = \begin{pmatrix} Θ_k & C_k^⊤ \\ C_k & Y_k^⊤ A Y_k \end{pmatrix},                    (11.71)

min{θ_1, λ_1(Y_k^⊤ A Y_k)} − ‖C_k‖_2.

Similarly [1], it is easy to show that ‖C_k‖_2 = ‖R_k‖_{B^{-1}}, in which R_k = AX_k − BX_kΘ_k is the residual matrix. If

θ_{k,1} ≤ λ_1(Y_k^⊤ A Y_k),                    (11.72)

we get

This heuristic bound for the smallest eigenvalue suggests the following shifting strategy (we denote −∞ by λ_0). If the first i_0 ≥ 0 eigenvalues have converged, use σ = max{λ_{i_0}, θ_{k,i_0+1} − ‖r_{k,i_0+1}‖_{B^{-1}}} as the shift parameter. If θ_{k,i_0+1} lies in a cluster, replace the B^{-1}-norm of r_{k,i_0+1} by the B^{-1}-norm of the residual matrix corresponding to the cluster containing θ_{k,i_0+1}.
Multiple Dynamic Shifts
In [37], the trace minimization algorithm is accelerated with a more aggressive shift-
ing strategy. At the beginning of the algorithm, a single shift is used for all the
We know from the Courant-Fischer theorem that the targeted eigenvalue λ_i is always below the Ritz value θ_{k,i}. Further, from Theorem 11.12, if θ_{k,i} is already very close to the targeted eigenvalue λ_i, then λ_i must lie in the interval [θ_{k,i} − ‖r_{k,i}‖_{B^{-1}}, θ_{k,i}]. This observation leads to the following shifting strategy for the trace minimization algorithm. At step k of the outer iteration, the shift parameters σ_{k,i}, 1 ≤ i ≤ s, are determined by the following rules (here, λ_0 = −∞ and the subscript k is dropped for the sake of simplicity):
1. If the first i_0, i_0 ≥ 0, eigenvalues have converged, choose
σ_{k,i_0+1} = \begin{cases} θ_{i_0+1} & if θ_{i_0+1} + ‖r_{i_0+1}‖_{B^{-1}} ≤ θ_{i_0+2} − ‖r_{i_0+2}‖_{B^{-1}}, \\ \max\{θ_{i_0+1} − ‖r_{i_0+1}‖_{B^{-1}}, λ_{i_0}\} & otherwise. \end{cases}
2. For any other column j, i_0 + 1 < j ≤ p, choose the largest θ_l such that
θ_l < θ_j − ‖r_j‖_{B^{-1}}
The shifting strategies described in the previous section improve the performance
of the trace minimization algorithm considerably. Although the randomization tech-
nique, the shifting strategy, and the roundoff error actually make the algorithm sur-
prisingly robust for a variety of problems, further measures to guard against unstable
convergence are necessary for problems in which the desired eigenvalues are clus-
tered. A natural way to maintain stable convergence is by using expanding subspaces
with which the trace reduction property is automatically maintained.
The best-known method that utilizes expanding subspaces is that of Lanczos. It
uses the Krylov subspaces to compute an approximation of the desired eigenpairs,
usually the largest. This idea was adopted by Davidson, in combination with the si-
multaneous coordinate relaxation method, to obtain what he called the “compromise
method” [34], known as Davidson's method today. In this section, we generalize the
trace minimization algorithm described in the previous sections by casting it into the
framework of the Davidson method. We start with the Jacobi-Davidson method, ex-
plore its connections to the trace minimization method, and develop a Davidson-type
trace minimization algorithm.
The Jacobi-Davidson Method
As mentioned in Sect. 11.5, the Jacobi-Davidson scheme is a modification of the
Davidson method. It uses the same ideas presented in the trace minimization method
to compute a correction term to a previously computed Ritz pair, but with a different
objective. In the Jacobi-Davidson method, for a given Ritz pair (x_i, θ_i) with x_i^⊤ B x_i =
1, a correction vector di is sought such that
where ri = Axi − θi Bxi is the residual vector associated with the Ritz pair (xi , θi ).
Note that replacing ri with Axi does not affect di . In [39, 55], the Ritz value θi is
used in place of σi at each step.
The block Jacobi-Davidson algorithm, described in [55], may be outlined as shown
in Algorithm 11.14 which can be regarded as a trace minimization scheme with
expanding subspaces.
The performance of the block Jacobi-Davidson algorithm depends on how good
the initial guess is and how efficiently and accurately the inner system (11.78) is
solved. If the right-hand side of (11.78) is taken as the approximate solution to the
inner system (11.78), the algorithm is reduced to the Lanczos method. If the inner
Here, ModGS_B stands for the Gram-Schmidt process with reorthogonalization [73] with respect to B-inner products, i.e. (x, y) = x^⊤ B y; in Algorithm 11.14 it is used to form V_{k+1} = ModGS_B(X_k, Δ_k).
All these problems can be partially solved by the techniques developed in the
trace minimization method, i.e. the multiple dynamic shifting strategy, the implicit
deflation technique (dk,i is required to be B-orthogonal to all the Ritz vectors obtained
in the previous iteration step), and the dynamic stopping strategy. We call the modified
algorithm a Davidson-type trace minimization algorithm [38].
The Davidson-Type Trace Minimization Algorithm
Let s ≥ p be the block size, and let m ≥ s be a given integer that limits the dimension of the subspaces. The Davidson-type trace minimization algorithm is given by Algorithm 11.15. The orthogonality requirement d_i^{(k)} ⊥ BX_k is essential in the original
trace minimization algorithm for maintaining the trace reduction property (11.64). In
the current algorithm, it appears primarily as an implicit deflation technique. A more
efficient approach is to require d_i^{(k)} to be B-orthogonal only to “good” Ritz vectors.
The number of outer iterations realized by this scheme is decreased compared to
the trace minimization algorithm in Sect. 11.5.3, and compared to the block Jacobi-
Davidson algorithm. In the block Jacobi-Davidson algorithm, the number of outer
iterations cannot be reduced further when the number of iterations for the inner sys-
tems is increased. However, in the Davidson-type trace minimization algorithm, the
number of outer iterations decreases steadily with increasing the number of iterations
for the inner systems. Note that reducing the number of outer iterations enhances the
efficiency of implementation on parallel architectures.
B-orthonormalization
This may be achieved via the eigendecomposition of matrices of the form W^⊤BW or W^⊤AW to obtain a section of the generalized eigenvalue problem, i.e. to obtain a matrix V (with s columns) for which V^⊤AV is diagonal and V^⊤BV is the identity matrix of order s. This algorithm requires one call to a sparse matrix-multivector
multiplication kernel, one global reduction operation to compute H , one call to a
multithreaded dense eigensolver, one call to a multithreaded dense matrix-matrix
multiplication routine, and s calls to a multithreaded vector scaling routine. Note
that only two internode communication operations are necessary. In addition, one
can take full advantage of the multicore architecture of each node.
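For concreteness, a minimal dense sketch of such a section step is given below (Python/NumPy; the basis W, with s columns, is a hypothetical input). It realizes the eigendecomposition-based B-orthonormalization by calling a dense generalized symmetric eigensolver on the small projected pencil, which is one of several equivalent ways to organize the computation described above; in a distributed implementation, forming the projected matrices corresponds to the single global reduction mentioned earlier.

import numpy as np
from scipy.linalg import eigh

def section(W, A, B):
    # Project the pencil (A, B) onto the current basis W (n x s).
    HA = W.T @ (A @ W)
    HB = W.T @ (B @ W)
    # Dense generalized symmetric eigensolver: Z satisfies Z.T @ HB @ Z = I.
    theta, Z = eigh(HA, HB)
    V = W @ Z          # V.T @ A @ V = diag(theta), V.T @ B @ V = I_s
    return V, theta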
Linear System Solver
TraceMIN does not require a very small relative residual in solving the saddle point problem; hence we may use a modest stopping criterion when solving the linear
system (11.64). Further, one can simultaneously solve the s independent linear sys-
tems via a single call to the CG scheme. The main advantage of such a procedure is
that one can group the internode communications, thereby reducing the associated
cost of the CG algorithm as compared with solving the systems one at a time. For
instance, in the sparse matrix-vector multiplication and the global reduction schemes
outlined above, grouping the communications results in fewer MPI communication
operations.
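A schematic of this grouping, assuming a user-supplied routine apply_A that multiplies A by a block of vectors, is sketched below in Python/NumPy. All s inner products per iteration are formed by one batched reduction (np.sum over rows), which in an MPI implementation would translate into a single all-reduce instead of s separate ones; per-column convergence monitoring is kept deliberately simple.

import numpy as np

def grouped_cg(apply_A, B, tol=1e-8, maxit=500):
    # Solve A X = B for s right-hand sides simultaneously (B is n x s).
    # Assumes no column reaches an exactly zero residual mid-iteration.
    X = np.zeros_like(B)
    R = B.copy()
    P = R.copy()
    rho = np.sum(R * R, axis=0)              # s dot products, one batched reduction
    bnorm = np.sqrt(np.sum(B * B, axis=0))
    for _ in range(maxit):
        AP = apply_A(P)                       # one sparse matrix-multivector product
        alpha = rho / np.sum(P * AP, axis=0)  # batched reduction
        X += P * alpha
        R -= AP * alpha
        rho_new = np.sum(R * R, axis=0)       # batched reduction
        if np.all(np.sqrt(rho_new) <= tol * bnorm):
            break
        P = R + P * (rho_new / rho)
        rho = rho_new
    return X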
We now turn our attention to TraceMIN_2, which efficiently computes a larger
number of eigenpairs corresponding to any interval in the spectrum.
TraceMIN_2: Trace Minimization with Multisectioning
In this implementation of TraceMIN, similar to our implementation of TREPS (see
Sect. 11.1.2), we use multisectioning to divide the interval under consideration into
a number of smaller subintervals. Our hybrid MPI+OpenMP approach assigns dis-
tinct subsets of the subintervals to different nodes. Unlike the implementation of
TraceMIN_1, we assume that each node has a local memory capable of storing all elements of A and B. For each subinterval $[a_i, a_{i+1}]$, we compute the $LDL^\top$ factorization of $A - a_i B$ and of $A - a_{i+1} B$ to determine the inertia at each endpoint of the interval. This yields the number of eigenvalues of $Ax = \lambda Bx$ that lie in the subinterval $[a_i, a_{i+1}]$. TraceMIN_1 is then used on each node, taking
full advantage of as many cores as possible, to compute the lowest eigenpairs of
(A − σ B)x = (λ − σ )Bx, in which σ is the mid-point of each corresponding subin-
terval. Thus, TraceMIN_2 has two levels of parallelism: multiple nodes working
concurrently on different sets of subintervals, with each node using multiple cores
for implementing the TraceMIN_1 iterations on a shared memory architecture.
Note that this algorithm requires no internode communication after the intervals are
selected, which means it scales extremely well. We employ a recursive scheme to
divide the interval into subintervals. We select a variable $n_e$, denoting the maximum number of eigenvalues allowed to belong to any subinterval. Consequently, any interval containing more than $n_e$ eigenvalues is divided in half. This process is repeated
until we have many subintervals, each of which contains at most $n_e$
eigenvalues. We then assign the subintervals amongst the nodes so that each node
is in charge of roughly the same number of subintervals. Note that TraceMIN_2
requires only one internode communication since the intervals can be subdivided in
parallel.
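A serial, dense sketch of this subdivision step is shown below (Python/SciPy); it uses the $LDL^\top$ factorization to obtain the inertia of $A - \sigma B$, the same mechanism TraceMIN_2 relies on, although in practice the factorizations are sparse and the resulting subintervals are distributed among the nodes.

import numpy as np
from scipy.linalg import ldl

def count_eigs_below(A, B, sigma):
    # Inertia of A - sigma*B: the number of negative eigenvalues of the
    # block-diagonal factor D equals the number of eigenvalues of
    # A x = lambda B x below sigma (assumes sigma is not itself an eigenvalue).
    _, D, _ = ldl(A - sigma * B)
    return int(np.sum(np.linalg.eigvalsh(D) < 0))

def subdivide(A, B, a, b, n_e):
    # Recursively halve [a, b] until each piece holds at most n_e eigenvalues.
    count = count_eigs_below(A, B, b) - count_eigs_below(A, B, a)
    if count <= n_e:
        return [(a, b, count)]
    mid = 0.5 * (a + b)
    return subdivide(A, B, a, mid, n_e) + subdivide(A, B, mid, b, n_e)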
11.6.1 Basics
Algorithms for computing a few of the singular triplets of large sparse matrices consti-
tute a powerful computational tool that has a significant impact on numerous science
and engineering applications. The singular-value decomposition (partial or complete)
is used in data analysis, model reduction, matrix rank estimation, canonical corre-
lation analysis, information retrieval, seismic reflection tomography, and real-time
signal processing.
In what follows, similar to the basic material we introduced in Chap. 8 about the
singular-value decomposition, we introduce the notations we use in this chapter,
together with additional basic facts related to the singular-value problem.
The numerical accuracy of the ith approximate singular triplet $(\tilde u_i, \tilde\sigma_i, \tilde v_i)$ of the real matrix A can be determined via the eigensystem of the 2-cyclic matrix
$$A_{\mathrm{aug}} = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}, \qquad (11.81)$$
see Theorem 8.1, and is measured by the norm of the residual vector $r_i$ given by
$$\|r_i\|_2 = \left\| A_{\mathrm{aug}} \begin{pmatrix} \tilde u_i \\ \tilde v_i \end{pmatrix} - \tilde\sigma_i \begin{pmatrix} \tilde u_i \\ \tilde v_i \end{pmatrix} \right\|_2 \Big/ \left( \|\tilde u_i\|_2^2 + \|\tilde v_i\|_2^2 \right)^{1/2}, \qquad (11.82)$$
where $r = \min(m, n)$. If $\sigma_i$ is the ith nonzero singular value of A corresponding to the right singular vector $v_i$, then the corresponding left singular vector, $u_i$, is obtained as $u_i = \frac{1}{\sigma_i} A v_i$. Similarly, if $\sigma_i$ is the ith nonzero singular value of A corresponding to the left singular vector $u_i$, then the corresponding right singular vector, $v_i$, is obtained as $v_i = \frac{1}{\sigma_i} A^\top u_i$.
As stated in Chap. 8, computing the SVD of A via the eigensystems of either $A^\top A$ or $A A^\top$ may be adequate for determining the largest singular triplets of A, but some loss of accuracy is observed for the smallest singular triplets, e.g. see also [23]. In fact, if the smallest singular value of A is smaller than $\sqrt{u}\,\|A\|$, where u is the unit roundoff, then it will be computed as a zero eigenvalue of $A^\top A$ (or $A A^\top$). Thus, it is preferable to compute the smallest singular values of A via the eigendecomposition of the augmented matrix
$$A_{\mathrm{aug}} = \begin{pmatrix} 0 & A \\ A^\top & 0 \end{pmatrix}.$$
Note that, whereas the squares of the smallest and largest singular values of A are the lower and upper bounds of the spectrum of $A^\top A$ or $A A^\top$, the smallest singular values of A lie in the middle of the spectrum of $A_{\mathrm{aug}}$ in (11.81). Further, similar to (11.82), the norms of the residuals corresponding to the ith eigenpairs of $A^\top A$ and of $A A^\top$ are defined analogously.
When A is a square nonsingular matrix, it may be advantageous (in certain cases) to compute the needed few largest singular triplets of $A^{-1}$, whose singular values are $\frac{1}{\sigma_n} \ge \cdots \ge \frac{1}{\sigma_1}$.
This approach has the drawback of needing to solve linear systems involving the
matrix A. If a suitable parallel direct sparse solver is available, this strategy provides a
robust algorithm. The resulting computational scheme becomes of value in enhancing
parallelism for the subspace method described below. This approach can also be
extended to rectangular matrices of full column rank if the upper triangular matrix
R in the orthogonal factorization $A = QR$ is available, since A and R have the same singular values.
If $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$ and $\tilde\sigma_1 \ge \tilde\sigma_2 \ge \cdots \ge \tilde\sigma_n$ denote, respectively, the singular values of A and of a perturbation $A + E$, then $|\tilde\sigma_i - \sigma_i| \le \|E\|_2$ for $i = 1, \ldots, n$.
When applied to the smallest singular value, this result yields the following.
Proposition 11.4 The relative condition number of the smallest singular value of a nonsingular matrix A is equal to $\kappa_2(A) = \frac{\sigma_1}{\sigma_n}$.
This means that the smallest singular value of an ill-conditioned matrix cannot be
computed with high accuracy even with a backward-stable algorithm.
In [76] it is shown that for some special class of matrices, e.g. tall and narrow sparse
matrices, an accurate computation of the smallest singular value may be obtained
via a combination of a parallel orthogonal factorization scheme of A, with column
pivoting, and a one-sided Jacobi algorithm (see Chap. 8).
Since the nonzero singular values are roots of a polynomial (e.g. roots of the characteristic polynomial of the augmented matrix), they are, when simple, differentiable with respect to the entries of the matrix. More precisely, one can state
that:
Theorem 11.14 Let σ be a nonzero simple singular value of the matrix A = (αi j )
with u = (μi ) and v = (νi ) being the corresponding normalized left and right
singular vectors. Then, the singular value is differentiable with respect to the matrix A, with
$$\frac{\partial \sigma}{\partial \alpha_{ij}} = \mu_i \nu_j, \qquad \forall\, i, j = 1, \ldots, n.$$
The effect of a perturbation of the matrix on the singular vectors can be more
significant than that on the singular values. The sensitivity of the singular vectors
depends upon the distribution of the singular values. When a simple singular value
is not well separated from the rest, the corresponding left and right singular vec-
tors are poorly determined. This is made precise by the theorem below, see [75],
which we state here without proof. Let $A \in \mathbb{R}^{n \times m}$ ($n \ge m$) have the singular value decomposition
$$U^\top A V = \begin{pmatrix} \Sigma \\ 0 \end{pmatrix}.$$
Next, we present a selection of parallel algorithms for computing the extreme sin-
gular values and the corresponding vectors of a large sparse matrix A ∈ Rm×n . In
particular, we present the simultaneous iteration method and two Lanczos schemes for computing a few of the largest singular values and the corresponding singular vectors of A. Our parallel algorithms of choice for computing a few of the smallest
singular triplets, however, are the trace minimization and the Davidson schemes.
Subspace iteration, presented in Sect. 11.1, can be used to obtain the largest singular
triplets via obtaining the dominant eigenpairs of the symmetric matrix $G = \tilde A_{\mathrm{aug}}$, where
$$\tilde A_{\mathrm{aug}} = \begin{pmatrix} \gamma I_m & A \\ A^\top & \gamma I_n \end{pmatrix}, \qquad (11.83)$$
in which the shift parameter γ is chosen to assure that G is positive definite. This method generates the sequence
$$Z_k = G^k Z_0.$$
Here, $\xi_1 = 0.04$ and $\xi_2 = 4$. The polynomial degree of the current iteration is then taken to be $q = q_{\mathrm{new}}$. It can easily be shown that the strategy in (11.84) ensures that
$$\left| T_q\!\left( \frac{\theta_1}{\theta_s} \right) \right| = \cosh\!\left( q \,\operatorname{arccosh} \frac{\theta_1}{\theta_s} \right) \le \cosh(8) < 1500.$$
Although this has been successful for RITZIT, we can generate several variations of the polynomial-accelerated subspace iteration scheme SISVD using a more flexible bound. Specifically, we can consider an adaptive strategy for selecting the degree q
in which ξ1 and ξ2 are treated as control parameters for determining the frequency
and the degree of polynomial acceleration, respectively. In other words, large (small)
values of ξ1 , inhibit (invoke) polynomial acceleration, and large (small) values of
ξ2 yield larger (smaller) polynomial degrees when acceleration is selected. Corre-
spondingly, the number of matrix-vector multiplications will increase with ξ2 and
the total number of iterations may well increase with ξ1 . Controlling the parameters,
ξ1 and ξ2 , allows us to monitor the method’s complexity so as to maintain an optimal
balance between the dominating kernels (specifically, sparse matrix-vector multi-
plication, orthogonalization, and spectral decomposition). We demonstrate the use
of such controls in the polynomial acceleration-based trace minimization method for
computing a few of the smallest singular triplets in Sect. 11.6.4.
392 11 Large Symmetric Eigenvalue Problems
Note that the largest singular triplets of A will be obtained with much higher accuracy
than their smallest counterparts using the Lanczos method.
The Single-Vector Lanczos Method
Using Algorithms 11.5 or 11.6 for the tridiagonalization of Aaug , which is refer-
enced only through matrix-vector multiplications, we generate elements of the cor-
responding tridiagonal matrices to be used by the associated Lanczos eigensolvers:
Algorithms 11.8 or 11.9, respectively. We denote this method by LASVD.
The Block Lanczos Method
As mentioned earlier, one could use the block version of the Lanczos method as an alternative to the single-vector scheme LASVD. The resulting block version BLSVD
uses block three-term recurrence relations which require sparse matrix-tall dense
matrix multiplications, dense matrix multiplications, and dense matrix orthogonal
factorizations. These are primitives that achieve higher performance on parallel ar-
chitectures. In addition, this block version of the Lanczos algorithm is more robust
for eigenvalue problems with multiple or clustered eigenvalues. Again, we consider
the standard eigenvalue problem involving the 2-cyclic matrix Aaug . Exploiting the
structure of the matrix Aaug , we can obtain an alternative form for the Lanczos re-
cursion (11.9). If we apply the Lanczos recursion given by (11.9) to $A_{\mathrm{aug}}$ with a starting vector $\tilde u = (u^\top, 0)^\top$ such that $\|\tilde u\|_2 = 1$, then all the diagonal entries of the real symmetric tridiagonal Lanczos matrices generated are identically zero. In fact, this Lanczos recursion, for $i = 1, 2, \ldots, k$, reduces to a coupled two-term (bidiagonalization) recurrence, whose block version is obtained via the correspondence $u_i \leftrightarrow U_i$, $v_i \leftrightarrow V_i$, where $U_i$ and $V_i$ are matrices of order $m \times b$ and $n \times b$, respectively, with b being the
current block size. The matrix $J_k$ is now a block-upper bidiagonal matrix of order bk,
$$J_k \equiv \begin{pmatrix} S_1 & R_1 & & & \\ & S_2 & R_2 & & \\ & & \ddots & \ddots & \\ & & & \ddots & R_{k-1} \\ & & & & S_k \end{pmatrix}. \qquad (11.89)$$
The orthogonalization and the subsequent diagonalization processes offer only poor data locality and limited parallelism. For this reason, one should adopt instead the single-vector Lanczos bidiagonalization recursion given by (11.86) and (11.87) as the strategy of choice for reducing the upper block-bidiagonal matrix $J_k$ to the bidiagonal form $C_k$, i.e.
$$J_k \hat Q = \hat P C_k, \qquad J_k^\top \hat P = \hat Q C_k^\top, \qquad (11.90)$$
or, columnwise,
$$J_k^\top p_j = \alpha_j q_j + \beta_{j-1} q_{j-1}, \qquad J_k q_j = \alpha_j p_j + \beta_j p_{j+1}. \qquad (11.91)$$
Applying the block Lanczos recursion [10] in Algorithm 11.17 for computing the eigenpairs of the $n \times n$ symmetric positive definite matrix $A^\top A$, the tridiagonalization
of Hk via an inner Lanczos recursion follows from a simple application of (11.9).
Analogous to the reduction of Jk in (11.89), computation of the eigenpairs of the
resulting tridiagonal matrix can be performed via a Jacobi or a QR-based symmetric
eigensolver.
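For illustration, the following Python/NumPy sketch carries out k steps of the single-vector Lanczos (Golub-Kahan) bidiagonalization with full reorthogonalization for clarity; it is a schematic of this kind of recursion, not an exact transcription of (11.86)-(11.87). The singular values of the small bidiagonal matrix $B_k$ approximate the largest singular values of A, and the matrices U and V carry the corresponding subspaces.

import numpy as np

def lanczos_bidiag(A, k, seed=0):
    m, n = A.shape
    U = np.zeros((m, k)); V = np.zeros((n, k))
    alphas = np.zeros(k); betas = np.zeros(max(k - 1, 0))
    v = np.random.default_rng(seed).standard_normal(n)
    v /= np.linalg.norm(v)
    u = np.zeros(m); beta = 0.0
    for j in range(k):
        V[:, j] = v
        u = A @ v - beta * u
        u -= U[:, :j] @ (U[:, :j].T @ u)        # full reorthogonalization
        alpha = np.linalg.norm(u); u /= alpha    # assumes no breakdown
        U[:, j] = u; alphas[j] = alpha
        if j < k - 1:
            v = A.T @ u - alpha * v
            v -= V[:, :j + 1] @ (V[:, :j + 1].T @ v)
            beta = np.linalg.norm(v); v /= beta
            betas[j] = beta
    Bk = np.diag(alphas) + np.diag(betas, 1)     # upper bidiagonal
    return U, Bk, V

The approximate singular triplets are then recovered from the (small) SVD of $B_k$, e.g. via np.linalg.svd.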
Another candidate subspace method for the SVD of sparse matrices is based upon the
trace minimization algorithm, discussed earlier in this chapter, for the generalized
eigenvalue problem
H x = λGx, (11.93)
where H and G are symmetric, with G being positive definite. In order to compute the t smallest singular triplets of an $m \times n$ matrix A, we could set $H = A^\top A$ and $G = I_n$. If $\mathcal{Y}$ is defined as the set of all $n \times p$ matrices Y for which $Y^\top Y = I_p$, where $p = 2t$, then as illustrated before, we have
$$\min_{Y \in \mathcal{Y}} \operatorname{trace}(Y^\top H Y) = \sum_{i=1}^{p} \tilde\sigma_{n-i+1}^{2}, \qquad (11.94)$$
The minimizing Y is characterized in terms of the standard eigenvalue problem
$$H z = \lambda z, \qquad (11.95)$$
i.e.
$$Y^\top H Y = \tilde\Sigma, \qquad Y^\top Y = I_p, \qquad (11.96)$$
where $\tilde\Sigma = \operatorname{diag}(\tilde\sigma_n^2, \tilde\sigma_{n-1}^2, \ldots, \tilde\sigma_{n-p+1}^2)$, and
our trace minimization scheme TRSVD generates the sequence Yk , whose first t
columns converge to the t left singular vectors corresponding to the t-smallest sin-
gular values of A.
Polynomial Acceleration Techniques for TRSVD
Convergence of the trace minimization algorithm TRSVD can be accelerated via
either a shifting strategy as discussed earlier in Sect. 11.5.3, via a Chebyshev accel-
eration strategy as illustrated for subspace iterations SISVD, or via a combination
of both strategies. For the time being we focus first on Chebyshev acceleration.
Hence, in order to dampen the unwanted (i.e. largest) singular values of A, we need to consider applying the trace minimization scheme to the generalized eigenvalue problem
$$x = \frac{1}{P_q(\lambda)}\, P_q(H)\, x, \qquad (11.97)$$
where $P_q(x) = T_q(x) + \varepsilon$, and $T_q(x)$ is the Chebyshev polynomial of degree q with ε chosen so that $P_q(H)$ is (symmetric) positive definite. The appropriate quadratic
minimization problem here can be expressed as
$$\min_{d_j^{(k)}} \left( y_j^{(k)} - d_j^{(k)} \right)^\top \left( y_j^{(k)} - d_j^{(k)} \right) \quad \text{subject to} \quad Y_k^\top P_q(H)\, d_j^{(k)} = 0, \quad j = 1, 2, \ldots, p,$$
leading to the saddle point systems
$$\begin{pmatrix} I & P_q(H) Y_k \\ Y_k^\top P_q(H) & 0 \end{pmatrix} \begin{pmatrix} d_j^{(k)} \\ l \end{pmatrix} = \begin{pmatrix} y_j^{(k)} \\ 0 \end{pmatrix}, \quad j = 1, 2, \ldots, p, \qquad (11.99)$$
which are much easier to solve. It can be shown that the updated eigenvector approximation, $y_j^{(k+1)}$, is determined by
$$y_j^{(k+1)} = y_j^{(k)} - d_j^{(k)} = P_q(H) Y_k \left[ Y_k^\top P_q^2(H) Y_k \right]^{-1} Y_k^\top P_q(H)\, y_j^{(k)}.$$
Thus, we may not need to use an iterative solver for determining $Y_{k+1}$ since the matrix $[Y_k^\top P_q^2(H) Y_k]^{-1}$ is of relatively small order p. Using the orthogonal factorization $P_q(H) Y_k = \hat Q \hat R$, we have $Y_{k+1} = \hat Q \hat Q^\top Y_k$.
Turning to the Ritz shifting strategy, we consider
$$(H - \nu_j^{(k)} I)\, z_j = (\lambda_j - \nu_j^{(k)})\, z_j, \quad j = 1, 2, \ldots, s, \qquad (11.100)$$
where $\nu_j^{(k)} = (\tilde\sigma_{n-j+1}^{(k)})^2$ is the j-th approximate eigenvalue at the k-th iteration of TRSVD, with $(\lambda_j, z_j)$ being an exact eigenpair of H. In other words, we simply use our most recent approximations to the eigenvalues of H from our k-th section within TRSVD as Ritz shifts. As was shown by Wilkinson in [83], the Rayleigh quotient iteration associated with (11.100) will ultimately achieve cubic convergence to the square of an exact singular value of A, $\sigma_{n-j+1}^2$, provided $\nu_j^{(k)}$ is sufficiently close to $\sigma_{n-j+1}^2$. However, since $\nu_j^{(k+1)} < \nu_j^{(k)}$ for all k, i.e. we approximate the eigenvalues of H from above, the matrix $H - \nu_j^{(k)} I$ is not positive definite, and hence an appropriate linear system solver must be adopted. Algorithm 11.18
outlines the basic steps of TRSVD that appropriately utilize polynomial (Chebyshev)
acceleration prior to using Ritz shifts. It is important to note that once a shifting has
been invoked (Steps 8–15) we abandon the use of Chebyshev polynomials Pq (H )
and solve the resulting saddle-point problems using appropriate solvers that take
into account that the (1, 1) block could be indefinite. The context switch from either
non-accelerated (or polynomial-accelerated) trace minimization iterations to trace
minimization iterations with Ritz shifts, is accomplished by monitoring the reduction
of the residuals in (11.82) for isolated eigenvalues ($r_j^{(k)}$) or clusters of eigenvalues ($R_j^{(k)}$).
Algorithm 11.18 TRSVD: trace minimization with Chebyshev acceleration and Ritz
shifts.
1: Choose an initial $n \times p$ subspace iterate $Y_0 = [y_1^{(0)}, y_2^{(0)}, \ldots, y_p^{(0)}]$;
2: Form a section, i.e. determine $Y_0$ such that $Y_0^\top P_q(H) Y_0 = I_p$ and $Y_0^\top Y_0 = \Gamma_0$, where $\Gamma_0$ is diagonal;
3: do k = 0, 1, 2, . . . until convergence,
4:   Determine the approximate singular values $\Sigma_k = \mathrm{diag}(\tilde\sigma_n^{(k)}, \cdots, \tilde\sigma_{n-p+1}^{(k)})$ from the Ritz values of H corresponding to the columns of $Y_k$;
5:   $R_k = H Y_k - Y_k \Sigma_k^2$; //Compute residuals
6:   Analyze the current approximate spectrum (Gerschgorin disks determine $n_c$ groups $G_\ell$ of eigenvalues);
7:   //Invoke Ritz shifting strategy ([37])
8:   do $\ell = 1 : n_c$,
9:     if $G_\ell = \{\tilde\sigma_{n-j+1}^{(0)}\}$ includes a unique eigenvalue, then
10:      Shift is selected if $\|r_j^{(k)}\|_2 \le \eta \|r_j^{(k_0)}\|_2$, where $\eta \in [10^{-3}, 10^{0}]$ and $k_0 < k$;
11:    else
12:      //$G_\ell$ is a cluster of c eigenvalues
         Shift is selected if $\|R^{(k)}\|_F \le \eta \|R^{(k_0)}\|_F$, where $R^{(k)} \equiv \{r_j^{(k)}, \ldots, r_{j+c-1}^{(k)}\}$ and $k_0 < k$;
13:    end if
14:    Disable polynomial acceleration if shifting is selected;
15:  end
16:  //Deflation: Reduce subspace dimension, p, by the number of the H-eigenpairs accepted;
17:  Adjust the polynomial degree q for $P_q(H)$ in iteration k + 1 (if needed);
18:  Update subspace iterate $Y_{k+1} = Y_k - \Delta_k$ as in (11.99) or for the shifted problem;
19: end
The smallest singular value of a matrix A may be obtained by applying one of the various versions of the Davidson methods to obtain the smallest eigenvalue of the matrix $C = A^\top A$, or to obtain the innermost positive eigenvalue of the 2-cyclic augmented matrix in (11.81). We assume that one has the basic kernels for matrix-vector multiplications using either A or $A^\top$. Multiplying $A^\top$ by a vector is often considered a drawback. Thus, whenever possible, the so-called “transpose-free” methods should be used. Even though one can avoid such a drawback when dealing with the interaction matrix $H_k = V_k^\top C V_k = (A V_k)^\top A V_k$, we still have to compute the residuals corresponding to the Ritz pairs, which do involve multiplication of the transpose of a matrix by a vector.
For the regular single-vector Davidson method, the correction vector is obtained
by approximately solving the system
$$A^\top A\, t_k = r_k. \qquad (11.101)$$
Obtaining an exact solution of (11.101) would yield the Lanczos algorithm applied to $C^{-1}$. Once the Ritz value approaches the square of the sought-after smallest singular value, it is recommended that we solve (11.101) without any shifts; the benefit is that
we deal with a fixed symmetric positive definite system matrix.
The approximate solution of (11.101) can be obtained by performing a fixed number of iterations of the Conjugate Gradient scheme, or by solving an approximate linear system $M t_k = r_k$ with a direct method, where M is obtained from an approximate factorization of A or $A^\top A$ [84]:
Incomplete LU factorization of A: Here, $M = U^{-1} L^{-1} L^{-\top} U^{-\top}$, where L and U are the factors of an incomplete LU factorization of A, in which one drops entries of the reduced matrix that fall below a given threshold. This version of the Davidson method is called DAVIDLU.
Incomplete QR factorization of A: Here, $M = R^{-1} R^{-\top}$, where R is the upper triangular factor of an incomplete QR factorization of A. This version of the Davidson method is called DAVIDQR.
Incomplete Cholesky of $A^\top A$: Here, $M = L^{-\top} L^{-1}$, where L is the lower triangular factor of an incomplete Cholesky factorization of the normal equations. This version of the Davidson method is called DAVIDIC.
Even though the construction of any of the above approximate factorizations may
fail, experiments presented in [84] show the effectiveness of the above three precon-
ditioners whenever they exist.
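As an illustration of the first option, the sketch below builds a DAVIDLU-style operator from SciPy's incomplete LU factorization and applies $M = U^{-1} L^{-1} L^{-\top} U^{-\top}$ to a vector. It assumes A is sparse and square, interprets M as an approximation to $(A^\top A)^{-1}$ to be applied to a residual, and uses the drop tolerance in the role of the threshold mentioned above.

import scipy.sparse as sp
import scipy.sparse.linalg as spla

def davidlu_operator(A, drop_tol=1e-3, fill_factor=10.0):
    # Incomplete factorization A ~ L U (SuperLU handles permutations internally).
    ilu = spla.spilu(sp.csc_matrix(A), drop_tol=drop_tol, fill_factor=fill_factor)
    n = A.shape[0]

    def apply_M(r):
        # M r = U^{-1} L^{-1} L^{-T} U^{-T} r, an approximation of (A^T A)^{-1} r.
        y = ilu.solve(r, trans='T')   # y = (L U)^{-T} r
        return ilu.solve(y)           # (L U)^{-1} y

    return spla.LinearOperator((n, n), matvec=apply_M, dtype=A.dtype)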
The corresponding method is expressed by Algorithm 11.19. At step 11, the preconditioner C is defined by one of the methods DAVIDLU, DAVIDIC, or DAVIDQR. Steps 3, 4, and 13 must be implemented in such a way that redundant computations are skipped.
Similar to trace minimization, the Jacobi-Davidson method can be used directly on
the matrix A A to compute the smallest eigenvalue and the corresponding eigenvec-
tor. In [85] the Jacobi-Davidson method has been adapted for obtaining the singular
values of A by considering the eigenvalue problem corresponding to the 2-cyclic
augmented matrix.
Algorithm 11.19 Computing smallest singular values by the block Davidson method.
Having determined approximate singular values $\tilde\sigma_i$ and their corresponding right singular vectors $\tilde v_i$ to a user-specified tolerance, the corresponding left singular vectors can be obtained via the scaling $\tilde u_i = A \tilde v_i / \tilde\sigma_i$, and the residual norm of the triplet is then estimated from $\hat r_i / \tilde\sigma_i$, where $\hat r_i$ is the residual given in (11.102) for the symmetric eigenvalue problem for $A^\top A$ or $(\gamma^2 I_n - A^\top A)$. Scaling by $\tilde\sigma_i$ can easily lead to significant loss of accuracy in estimating the singular triplet residual norm, $\|r_i\|_2$, especially when $\tilde\sigma_i$ approaches the machine unit roundoff.
One remedy is to refine the initial approximation of the left singular vectors, corresponding to the few computed singular values and right singular vectors, via the recursion
$$\begin{pmatrix} \gamma I_n & A^\top \\ A & \gamma I_m \end{pmatrix} \begin{pmatrix} \tilde v_i^{(k+1)} \\ \tilde u_i^{(k+1)} \end{pmatrix} = (\gamma + \tilde\sigma_i) \begin{pmatrix} \tilde v_i^{(k)} \\ \tilde u_i^{(k)} \end{pmatrix}, \qquad (11.107)$$
where $\{\tilde\sigma_i, \tilde u_i, \tilde v_i\}$ is the ith computed smallest singular triplet. By applying block Gaussian elimination to (11.107) we obtain a more convenient form (reduced system) of the recursion
$$\begin{pmatrix} \gamma I_n & A^\top \\ 0 & \gamma I_m - \frac{1}{\gamma} A A^\top \end{pmatrix} \begin{pmatrix} \tilde v_i^{(k+1)} \\ \tilde u_i^{(k+1)} \end{pmatrix} = (\gamma + \tilde\sigma_i) \begin{pmatrix} \tilde v_i^{(k)} \\ \tilde u_i^{(k)} - \frac{1}{\gamma} A \tilde v_i^{(k)} \end{pmatrix}. \qquad (11.108)$$
Our iterative refinement strategy for an approximate singular triplet of A, $\{\tilde\sigma_i, \tilde u_i, \tilde v_i\}$, is then defined by the last m equations of (11.108), i.e.
$$\left( \gamma I_m - \frac{1}{\gamma} A A^\top \right) \tilde u_i^{(k+1)} = (\gamma + \tilde\sigma_i) \left( \tilde u_i^{(k)} - \frac{1}{\gamma} A \tilde v_i \right), \qquad (11.109)$$
where the superscript k is dropped from $\tilde v_i$ since we refine only our left singular vector approximation, $\tilde u_i^{(0)}$. If $\tilde u_i^{(0)} \equiv u_i$ from (11.105), then (11.109) can be rewritten as
$$\left( \gamma I_m - \frac{1}{\gamma} A A^\top \right) \tilde u_i^{(k+1)} = (\gamma - \tilde\sigma_i^2/\gamma)\, \tilde u_i^{(k)}. \qquad (11.110)$$
The iterations of the refinement
scheme in Algorithm 11.20 terminate once the norms of the residuals of all p ap-
proximate singular triplets (ri 2 ) fall below a user-specified tolerance or after kmax
iterations.
Algorithm 11.20 Refinement procedure for the left singular vector approximations
obtained via scaling.
Input: $A \in \mathbb{R}^{m \times n}$, p approximate singular values $\Sigma = \mathrm{diag}(\tilde\sigma_1, \cdots, \tilde\sigma_p)$ and their corresponding approximate right singular vectors $V = [\tilde v_1, \cdots, \tilde v_p]$.
1: $U_0 = A V \Sigma^{-1}$; //By definition: $U_k = [\tilde u_1^{(k)}, \cdots, \tilde u_p^{(k)}]$
2: do j = 1 : p,
3:   k = 0;
4:   while $\|A \tilde v_j - \tilde\sigma_j \tilde u_j^{(k)}\| > \tau$,
5:     k := k + 1;
6:     Solve $\left( \gamma I_m - \frac{1}{\gamma} A A^\top \right) \tilde u_j^{(k+1)} = (\gamma - \tilde\sigma_j^2/\gamma)\, \tilde u_j^{(k)}$; //See (11.110)
7:     Set $\tilde u_j^{(k+1)} = \tilde u_j^{(k+1)} / \|\tilde u_j^{(k+1)}\|_2$;
8:   end while
9: end
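A dense, single-node reading of Algorithm 11.20 is sketched below in Python/NumPy; it assumes $\gamma > \|A\|_2$ so that the system matrix in (11.110) is symmetric positive definite, and it uses a direct solve in place of whatever parallel solver would be employed in practice.

import numpy as np

def refine_left_vectors(A, sigmas, V, gamma, tol=1e-10, kmax=50):
    sigmas = np.asarray(sigmas, dtype=float)
    m = A.shape[0]
    K = gamma * np.eye(m) - (A @ A.T) / gamma   # fixed SPD matrix of (11.110)
    U = A @ V / sigmas                          # u_j^(0) = A v_j / sigma_j
    for j, sigma in enumerate(sigmas):
        u = U[:, j]
        for _ in range(kmax):
            if np.linalg.norm(A @ V[:, j] - sigma * u) <= tol:
                break
            u = np.linalg.solve(K, (gamma - sigma**2 / gamma) * u)
            u /= np.linalg.norm(u)
        U[:, j] = u
    return U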
References
1. Parlett, B.N.: The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs (1980)
2. Bauer, F.: Das verfahren der treppeniteration und verwandte verfahren zur losung algebraischer
eigenwertprobleme. ZAMP 8, 214–235 (1957)
3. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford University Press, New York (1965)
4. Rutishauser, H.: Simultaneous iteration method for symmetric matrices. Numer. Math. 16,
205–223 (1970)
5. Stewart, G.W.: Simultaneous iterations for computing invariant subspaces of non-Hermitian
matrices. Numer. Math. 25, 123–136 (1976)
6. Stewart, W.J., Jennings, A.: Algorithm 570: LOPSI: a simultaneous iteration method for real
matrices [F2]. ACM Trans. Math. Softw. 7(2), 230–232 (1981). doi:10.1145/355945.355952
7. Saad, Y.: Numerical Methods for Large Eigenvalue Problems. Halstead Press, New York (1992)
8. Sameh, H., Lermit, J., Noh, K.: On the intermediate eigenvalues of symmetric sparse matrices.
BIT 185–191 (1975)
9. Bunch, J., Kaufman, L.: Some stable methods for calculating inertia and solving symmetric
linear systems. Math. Comput. 31, 162–179 (1977)
10. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins University Press,
Baltimore (2013)
11. Duff, I., Gould, N.I.M., Reid, J.K., Scott, J.A., Turner, K.: The factorization of sparse symmetric
indefinite matrices. IMA J. Numer. Anal. 11, 181–204 (1991)
12. Duff, I.: MA57-a code for the solution of sparse symmetric definite and indefinite systems.
ACM TOMS 118–144 (2004)
13. Kalamboukis, T.: Tridiagonalization of band symmetric matrices for vector computers. Com-
put. Math. Appl. 19, 29–34 (1990)
14. Lang, B.: A parallel algorithm for reducing symmetric banded matrices to tridiagonal form.
SIAM J. Sci. Comput. 14(6), 1320–1338 (1993). doi:10.1137/0914078
15. Philippe, B., Vital, B.: Parallel implementations for solving generalized eigenvalue problems
with symmetric sparse matrices. Appl. Numer. Math. 12, 391–402 (1993)
16. Carey, C., Chen, H.C., Golub, G., Sameh, A.: A new approach for solving symmetric eigenvalue
problems. Comput. Sys. Eng. 3(6), 671–679 (1992)
17. Golub, G., Underwood, R.: The block Lanczos method for computing eigenvalues. In: Rice, J.
(ed.) Mathematical Software III, pp. 364–377. Academic Press, New York (1977)
18. Underwood, R.: An iterative block Lanczos method for the solution of large sparse symmetric
eigenproblems. Technical Report STAN-CS-75-496, Computer Science, Stanford University,
Stanford (1975)
19. Kaniel, S.: Estimates for some computational techniques in linear algebra. Math. Comput. 20,
369–378 (1966)
20. Paige, C.: The computation of eigenvalues and eigenvectors of very large sparse matrices. Ph.D.
thesis, London University, London (1971)
21. Meurant, G.: The Lanczos and Conjugate Gradient Algorithms: from Theory to Finite Precision
Computations (Software, Environments, and Tools). SIAM, Philadelphia (2006)
22. Paige, C.C.: Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigen-
problem. Linear Algebra Appl. 34, 235–258 (1980)
23. Cullum, J.K., Willoughby, R.A.: Lanczos Algorithms for Large Symmetric Eigenvalue Com-
putations. SIAM, Philadelphia (2002)
24. Lehoucq, R., Sorensen, D.: Deflation techniques for an implicitly restarted Arnoldi iteration.
SIAM J. Matrix Anal. Appl. 17, 789–821 (1996)
25. Lehoucq, R., Sorensen, D., Yang, C.: ARPACK User’s Guide: Solution of Large-Scale Eigen-
value Problems With Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia (1998)
26. Calvetti, D., Reichel, L., Sorensen, D.C.: An implicitly restarted Lanczos method for large
symmetric eigenvalue problems. Electron. Trans. Numer. Anal. 2, 1–21 (1994)
27. Sorensen, D.: Implicit application of polynomial filters in a k-step Arnoldi method. SIAM J.
Matrix Anal. Appl. 13, 357–385 (1992)
28. Lewis, J.G.: Algorithms for sparse matrix eigenvalue problems. Technical Report STAN-CS-
77-595, Department of Computer Science, Stanford University, Palo Alto (1977)
29. Ruhe, A.: Implementation aspects of band Lanczos algorithms for computation of eigenvalues
of large sparse symmetric matrices. Math. Comput. 33, 680–687 (1979)
30. Scott, D.: Block Lanczos software for symmetric eigenvalue problems. Technical Report
ORNL/CSD-48, Oak Ridge National Laboratory, Oak Ridge (1979)
31. Baglama, J., Calvetti, D., Reichel, L.: IRBL: an implicitly restarted block Lanczos method for
large-scale Hermitian eigenproblems. SIAM J. Sci. Comput. 24(5), 1650–1677 (2003)
32. Chen, H.C., Sameh, A.: Numerical linear algebra algorithms on the Cedar system. In: Noor,
A. (ed.) Parallel Computations and Their Impact on Mechanics, Applied Mechanics Division,
vol. 86, pp. 101–125. American Society of Mechanical Engineers (1987)
33. Chen, H.C.: The SAS domain decomposition method. Ph.D. thesis, University of Illinois at
Urbana-Champaign (1988)
34. Davidson, E.: The iterative calculation of a few of the lowest eigenvalues and corresponding eigenvectors of large real-symmetric matrices. J. Comput. Phys. 17, 87–94 (1975)
35. Morgan, R., Scott, D.: Generalizations of Davidson’s method for computing eigenvalues of
sparse symmetric matrices. SIAM J. Sci. Stat. Comput. 7, 817–825 (1986)
36. Crouzeix, M., Philippe, B., Sadkane, M.: The Davidson method. SIAM J. Sci. Comput. 15,
62–76 (1994)
37. Sameh, A.H., Wisniewski, J.A.: A trace minimization algorithm for the generalized eigenvalue
problem. SIAM J. Numer. Anal. 19(6), 1243–1259 (1982)
38. Sameh, A., Tong, Z.: The trace minimization method for the symmetric generalized eigenvalue
problem. J. Comput. Appl. Math. 123, 155–170 (2000)
39. Sleijpen, G., van der Vorst, H.: A Jacobi-Davidson iteration method for linear eigenvalue
problems. SIAM J. Matrix Anal. Appl. 17, 401–425 (1996)
40. Simoncini, V., Eldén, L.: Inexact Rayleigh quotient-type methods for eigenvalue computations.
BIT Numer. Math. 42(1), 159–182 (2002). doi:10.1023/A:1021930421106
41. Bathe, K., Wilson, E.: Large eigenvalue problems in dynamic analysis. ASCE J. Eng. Mech.
Div. 98, 1471–1485 (1972)
42. Bathe, K., Wilson, E.: Solution methods for eigenvalue problems in structural mechanics. Int.
J. Numer. Methods Eng. 6, 213–226 (1973)
43. Grimm, R., Greene, J., Johnson, J.: Computation of the magnetohydrodynamic spectrum in axisymmetric toroidal confinement systems. Methods Comput. Phys. 16 (1976)
44. Gruber, R.: Finite hybrid elements to compute the ideal magnetohydrodynamic spectrum of an
axisymmetric plasma. J. Comput. Phys. 26, 379–389 (1978)
45. Stewart, G.: A bibliographical tour of the large, sparse generalized eigenvalue problems. In:
Bunch, J., Rose, D. (eds.) Sparse Matrix Computations, pp. 113–130. Academic Press, New
York (1976)
46. van der Vorst, H., Golub, G.: One hundred and fifty years old and still alive: eigenproblems. In:
Duff, I., Watson, G. (eds.) The State of the Art in Numerical Analysis, pp. 93–119. Clarendon
Press, Oxford (1997)
47. Rutishauser, H.: Computational aspects of F.L. Bauer’s simultaneous iteration method. Nu-
merische Mathematik 13(1), 4–13 (1969). doi:10.1007/BF02165269
48. Clint, M., Jennings, A.: The evaluation of eigenvalues and eigenvectors of real symmetric matrices by simultaneous iteration. Comput. J. 13, 76–80 (1970)
49. Levin, A.: On a method for the solution of a partial eigenvalue problem. J. Comput. Math.
Math. Phys. 5, 206–212 (1965)
50. Stewart, G.: Accelerating the orthogonal iteration for the eigenvalues of a Hermitian matrix.
Numer. Math. 13, 362–376 (1969)
51. Sakurai, T., Sugiura, H.: A projection method for generalized eigenvalue problems using nu-
merical integration. J. Comput. Appl. Math. 159, 119–128 (2003)
52. Tang, P., Polizzi, E.: FEAST as a subspace iteration eigensolver accelerated by approximate
spectral projection. SIAM J. Matrix Anal. Appl. 35(2), 354–390 (2014)
53. Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential
and integral operators. J. Res. Natl. Bur. Stand. 45, 225–280 (1950)
54. Fokkema, D.R., Sleijpen, G.A.G., van der Vorst, H.A.: Jacobi-Davidson style QR and QZ
algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput. 20(1), 94–125 (1998)
55. Sleijpen, G., Booten, A., Fokkema, D., van der Vorst, H.: Jacobi-Davidson type methods for
generalized eigenproblems and polynomial eigenproblems. BIT 36, 595–633 (1996)
56. Cullum, J., Willoughby, R.: Lanczos and the computation in specified intervals of the spectrum
of large, sparse, real symmetric matrices. In: Duff, I., Stewart, G. (eds.) Proceedings of the
Sparse Matrix 1978. SIAM (1979)
57. Parlett, B., Scott, D.: The Lanczos algorithm with selective orthogonalization. Math. Comput.
33, 217–238 (1979)
58. Simon, H.: The Lanczos algorithm with partial reorthogonalization. Math. Comput. 42, 115–
142 (1984)
59. Cullum, J., Willoughby, R.: Computing eigenvalues of very large symmetric matrices—an
implementation of a Lanczos algorithm with no reorthogonalization. J. Comput. Phys. 44,
329–358 (1984)
60. Ericsson, T., Ruhe, A.: The spectral transformation Lanczos method for the solution of large
sparse generalized symmetric eigenvalue problems. Math. Comput. 35, 1251–1268 (1980)
61. Grimes, R., Lewis, J., Simon, H.: A shifted block Lanczos algorithm for solving sparse sym-
metric generalized Eigenproblems. SIAM J. Matrix Anal. Appl. 15, 228–272 (1994)
62. Kalamboukis, T.: A Lanczos-type algorithm for the generalized eigenvalue problem Ax = λBx.
J. Comput. Phys. 53, 82–89 (1984)
63. Liu, B.: The simultaneous expansion for the solution of several of the lowest eigenvalues and
corresponding eigenvectors of large real-symmetric matrices. In: Moler, C., Shavitt, I. (eds.)
Numerical Algorithms in Chemistry: Algebraic Method, pp. 49–53. University of California,
Lawrence Berkeley Laboratory (1978)
64. Stathopoulos, A., Saad, Y., Fischer, C.: Robust preconditioning of large, sparse, symmetric
eigenvalue problems. J. Comput. Appl. Math. 197–215 (1995)
65. Wu, K.: Preconditioning techniques for large eigenvalue problems. Ph.D. thesis, University of
Minnesota (1997)
66. Jacobi, C.: Über ein leichtes Verfahren, die in der Theorie der Säcularstörungen vorkommenden Gleichungen numerisch aufzulösen. J. für die reine und angewandte Mathematik (Crelle’s Journal) 30, 51–94 (1846)
67. Beckenbach, E., Bellman, R.: Inequalities. Springer, New York (1965)
68. Kantorovic̆, L.: Functional analysis and applied mathematics (Russian). Uspekhi Mat. Nauk.
3, 9–185 (1948)
69. Newman, M.: Kantorovich’s inequality. J. Res. Natl. Bur. Stand. B. Math. Math. Phys. 64B(1),
33–34 (1959). http://nvlpubs.nist.gov/nistpubs/jres/64B/jresv64Bn1p33_A1b.pdf
70. Benzi, M., Golub, G., Liesen, J.: Numerical solution of saddle point problems. Acta Numerica 14, 1–137 (2005)
71. Elman, H., Silvester, D., Wathen, A.: Performance and analysis of Saddle-Point preconditioners
for the discrete steady-state Navier-Stokes equations. Numer. Math. 90, 641–664 (2002)
72. Paige, C.C., Saunders, M.A.: Solution of sparse indefinite systems of linear equations. SIAM
J. Numer. Anal. 12(4), 617–629 (1975)
73. Daniel, J., Gragg, W., Kaufman, L., Stewart, G.: Reorthogonalization and stable algorithms for
updating the Gram-Schmidt QR factorization. Math. Comput. 136, 772–795 (1976)
74. Sun, J.G.: Condition number and backward error for the generalized singular value decompo-
sition. SIAM J. Matrix Anal. Appl. 22(2), 323–341 (2000)
75. Stewart, G.W., Sun, J.-G.: Matrix Perturbation Theory. Academic Press, Boston (1990)
76. Demmel, J., Gu, M., Eisenstat, S., Slapničar, I., Veselić, K., Drmač, Z.: Computing the singular value decomposition with high relative accuracy. Linear Algebra Appl. 299(1–3), 21–80 (1999)
77. Sun, J.: A note on simple non-zero singular values. J. Comput. Math. 6(3), 258–266 (1988)
78. Berry, M., Sameh, A.: An overview of parallel algorithms for the singular value and symmetric
eigenvalue problems. J. Comput. Appl. Math. 27, 191–213 (1989)
79. Dongarra, J., Sorensen, D.C.: A fully parallel algorithm for the symmetric eigenvalue problem.
SIAM J. Sci. Stat. Comput. 8(2), s139–s154 (1987)
80. Golub, G., Reinsch, C.: Singular Value Decomposition and Least Squares Solutions. Springer
(1971)
81. Golub, G., Luk, F., Overton, M.: A block Lanczos method for computing the singular values
and corresponding singular vectors of a matrix. ACM Trans. Math. Softw. 7, 149–169 (1981)
82. Berry, M.: Large scale singular value decomposition. Int. J. Supercomput. Appl. 6, 13–49
(1992)
83. Wilkinson, J.: Inverse Iteration in Theory and in Practice. Academic Press (1972)
84. Philippe, B., Sadkane, M.: Computation of the fundamental singular subspace of a large matrix.
Linear Algebra Appl. 257, 77–104 (1997)
85. Hochstenbach, M.: A Jacobi-Davidson type SVD method. SIAM J. Sci. Comput. 23(2), 606–
628 (2001)
Part IV
Matrix Functions and Characteristics
Chapter 12
Matrix Functions and the Determinant
[26] are important references regarding the theoretical background and the design of
high quality algorithms for problems of this type on uniprocessors. Reference [27]
is a useful survey of software for matrix functions. Finally, it is worth noting that the
numerical evaluation of matrix functions is an extremely active area of research, and
many interesting developments are currently under consideration or yet to come.
It is well known from theory that if A is diagonalizable with eigenvalues $\{\lambda_i\}_{i=1}^n$ and if the scalar function $f(\zeta)$ is such that $f(\lambda_i)$ exists for $i = 1, \ldots, n$ and $Q^{-1} A Q = \Lambda$ is the matrix of eigenvalues, then the matrix function is defined as $f(A) = Q f(\Lambda) Q^{-1}$. If A is non-diagonalizable, and $Q^{-1} A Q = \mathrm{diag}(J_1, \ldots, J_p)$ is its Jordan canonical form, then $f(A)$ is defined as $f(A) = Q\, \mathrm{diag}(f(J_1), \ldots, f(J_p))\, Q^{-1}$, assuming that for each eigenvalue $\lambda_i$ the derivatives $\{f(\lambda_i), f^{(1)}(\lambda_i), \ldots, \frac{f^{(n_i-1)}(\lambda_i)}{(n_i-1)!}\}$ exist, where $n_i$ is the size of the largest Jordan block containing $\lambda_i$, and $f(J_i)$ is the Toeplitz upper triangular matrix with first row the vector
$$\left( f(\lambda_i), f^{(1)}(\lambda_i), \ldots, \frac{f^{(m_i-1)}(\lambda_i)}{(m_i-1)!} \right).$$
The general template we consider is
$$D + C f(A) B, \qquad (12.2)$$
where f is the function under consideration defined on the spectrum of the square
matrix A and B, C, D are of compatible shapes. A large variety of matrix computa-
tions, including most of the BLAS, matrix powers and inversion, linear systems with
multiple right-hand sides and bilinear forms can be cast as in (12.2). We focus on
rational functions (that is ratios of polynomials) and the matrix exponential. Observe
that the case of functions with a linear denominator amounts to matrix inversion or
solving linear systems, which have been addressed earlier in this book.
Polynomials and rational functions are primary tools in the practical approxima-
tion of scalar functions. Moreover, as is well known from theory, matrix functions
can also be defined by means of a unique polynomial (the Hermite interpolating poly-
nomial) that depends on the underlying matrix and its degree is at most that of the
minimal polynomial. Even though, in general, it is impractical to use it, its existence
provides a strong motivation for using polynomials. Finally, rational functions are
sometimes even more effective than polynomials in approximating scalar functions.
Since their manipulation in a matrix setting involves the solution of linear systems,
an interesting question, that we also discuss, is how to manipulate matrix rational
functions efficiently in a parallel environment.
We have already encountered matrix rational functions in this book. Recall our
discussion of algorithms BCR and EES in Sect. 6.4, where we used ratios of Cheby-
shev polynomials, and described the advantage of using partial fractions in a parallel
setting. In this section we extend this approach to more general rational functions.
Specifically, we consider the case where f in (12.2) is a rational function denoted by
$$r(\zeta) = \frac{q(\zeta)}{p(\zeta)}, \qquad (12.3)$$
with denominator
$$p(\zeta) = \prod_{j=1}^{d} (\zeta - \tau_j), \quad \text{where } d = \deg p.$$
Two cases of the template (12.2) amount to computing one of the following: the matrix $X = r(A)$ itself, or the vector $x = r(A)b$ for a given vector b. The vector x can be computed by first evaluating $r(A)$ and then multiplying by b. As with matrix inversion, we can approximate x without first computing $r(A)$ explicitly.
Algorithm 12.1 Computing $x = (p(A))^{-1} b$ when $p(\zeta) = \prod_{j=1}^{d} (\zeta - \tau_j)$ with $\tau_j$ mutually distinct.
Input: $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$ and values $\tau_j$ distinct from the eigenvalues of A.
Output: Solution $(p(A))^{-1} b$.
1: $x_0 = b$
2: do j = 1 : d
3:   solve $(A - \tau_j I) x_j = x_{j-1}$
4: end
5: return $x = x_d$
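A direct dense rendering of Algorithm 12.1 might look as follows (Python/NumPy); complex shifts are handled simply by working in complex arithmetic.

import numpy as np

def p_inv_times_b(A, taus, b):
    # x = (p(A))^{-1} b, p(z) = prod_j (z - tau_j), via d successive shifted solves.
    n = A.shape[0]
    x = np.asarray(b, dtype=complex)
    for tau in taus:
        x = np.linalg.solve(A - tau * np.eye(n), x)
    return x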
Algorithm 12.2 Computing $X = (p(A))^{-1}$ when $p(\zeta) = \prod_{j=1}^{d} (\zeta - \tau_j)$ with $\tau_j$ mutually distinct.
Input: $A \in \mathbb{R}^{n \times n}$ and values $\tau_j$ distinct from the eigenvalues of A.
Output: $(p(A))^{-1}$.
1: $x_i^{(0)} = e_i$ for $i = 1, \ldots, n$ and $X = (x_1^{(0)}, \ldots, x_n^{(0)})$.
2: do j = 1 : d
3:   doall i = 1 : n
4:     solve $(A - \tau_j I) x_i^{(j)} = x_i^{(j-1)}$
5:   end
6: end
7: return $X = (x_1^{(d)}, \ldots, x_n^{(d)})$
Consider, for example, a polynomial of degree 7,
$$p(\zeta) = \sum_{j=0}^{7} \pi_j \zeta^j.$$
This can be written as a cubic polynomial in $\zeta^2$, with coefficients that are linear in ζ,
$$p(\zeta) = \sum_{j=0}^{3} (\pi_{2j} + \pi_{2j+1}\zeta)\, \zeta^{2j},$$
or as a linear polynomial in $\zeta^4$, with coefficients that are cubic in ζ,
$$p(\zeta) = \sum_{j=0}^{1} (\pi_{4j} + \pi_{4j+1}\zeta + \pi_{4j+2}\zeta^2 + \pi_{4j+3}\zeta^3)\, \zeta^{4j} = p^{(1)}_3(\zeta) + \zeta^4\, \hat p^{(1)}_3(\zeta), \qquad (12.5)$$
where we use the subscript to explicitly denote the maximum degree of the respective polynomials. Note that the process can be applied recursively on each term $p^{(1)}_j$ and $\hat p^{(1)}_j$. We can express this entirely in matrix form as follows. Define the matrices of order 2n
$$M_j = \begin{pmatrix} A & 0 \\ \pi_j I & I \end{pmatrix}, \quad j = 0, 1, \ldots, 2^k - 1, \qquad (12.6)$$
and
$$M_{2^{k-1}-1} M_{2^{k-1}-2} \cdots M_0 = \begin{pmatrix} A^{2^{k-1}} & 0 \\ p^{(1)}_{2^{k-1}-1}(A) & I \end{pmatrix}, \qquad (12.7)$$
where the terms $p^{(1)}_{2^{k-1}-1}$, $\hat p^{(1)}_{2^{k-1}-1}$ are as in decomposition (12.5). The connection
of the polynomial multiply-and-add approach (12.5) with the product form (12.7)
using the terms M j defined in (12.6), motivates the design of algorithms for the
parallel computation of the matrix polynomial based on a parallel fan-in approach,
e.g. as that proposed to solve triangular systems in Sect. 3.2.1 (Algorithm 3.2). With
no limit on the number of processors, such a scheme would take log d stages each
consisting of a matrix multiply (squaring) and a matrix multiply-and-add. See also
[37] for interesting connections between these approaches. For a limited number of
processors, the Paterson-Stockmeyer algorithm can be applied in each processor to
evaluate a low degree polynomial corresponding to the multiplication of some of
the terms M j . Subsequently, the fan-in approach can be applied to combine these
intermediate results.
We also need to note that the aforementioned methods require more storage com-
pared to the standard Horner scheme.
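The serial building block alluded to above, the Paterson-Stockmeyer scheme, can be sketched as follows (Python/NumPy): the powers I, A, ..., A^s are formed once and the polynomial is then evaluated by a Horner-like recurrence in A^s over blocks of consecutive coefficients; the choice s ≈ √d balances the two kinds of matrix multiplications, and it is onto such per-processor blocks that a parallel fan-in combination can be grafted.

import numpy as np

def paterson_stockmeyer(coeffs, A, s=None):
    # Evaluate p(A) = sum_j coeffs[j] * A**j.
    n = A.shape[0]
    d = len(coeffs) - 1
    s = s or max(1, int(np.ceil(np.sqrt(d + 1))))
    powers = [np.eye(n)]
    for _ in range(s):
        powers.append(powers[-1] @ A)          # I, A, ..., A^s
    As = powers[s]
    blocks = [coeffs[i:i + s] for i in range(0, d + 1, s)]
    P = np.zeros((n, n))
    for block in reversed(blocks):             # Horner recurrence in A^s
        P = P @ As
        P += sum(c * powers[j] for j, c in enumerate(block))
    return P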
For example, to solve
$$(A - \tau I) x = (A - \sigma I) b, \qquad (12.8)$$
the Laurent expansion for the rational function around the pole τ can be written as
$$\frac{\zeta - \sigma}{\zeta - \tau} = (\tau - \sigma) \frac{1}{\zeta - \tau} + 1,$$
so that
$$x = b + (\tau - \sigma)(A - \tau I)^{-1} b;$$
cf. [39]. Recall that we have already made use of partial fractions to enable the
parallel evaluation of some special rational functions in the rapid elliptic solvers of
Sect. 6.4. The following result is well known; see e.g. [40].
Theorem 12.1 Let p, q be polynomials such that $\deg q \le \deg p$, and let $\tau_1, \ldots, \tau_d$ be the distinct zeros of p, with multiplicities $\mu_1, \ldots, \mu_d$, respectively, and distinct from the roots of q. Then at each $\tau_j$, the rational function $r(\zeta) = \frac{q(\zeta)}{p(\zeta)}$ has a pole of order at most $\mu_j$, and the partial fraction representation of r is given by
$$r(\zeta) = \rho_0 + \sum_{i=1}^{d} \pi_i(\zeta), \quad \text{with} \quad \pi_i(\zeta) = \sum_{j=1}^{\mu_i} \gamma_{i,j}\, (\zeta - \tau_i)^{-j}.$$
When all the poles are simple, this yields
$$r(A) = \rho_0 I + \sum_{j=1}^{d} \rho_j (A - \tau_j I)^{-1} \quad \text{with} \quad \rho_j = \frac{q(\tau_j)}{p'(\tau_j)}, \qquad (12.9)$$
which exhibits ample opportunities for parallel evaluation. One way to compute r (A)
is to first evaluate the terms $(A - \tau_j I)^{-1}$ for $j = 1, \ldots, d$ simultaneously, followed by the weighted summation. To obtain the vector $r(A)b$, we can first solve the d systems $(A - \tau_j I) x_j = b$ simultaneously, followed by computing $x = \rho_0 b + \sum_{j=1}^{d} \rho_j x_j$. The latter can also be written as an MV operation,
Algorithm 12.3 Computing $x = (p(A))^{-1} q(A) b$ when $p(\zeta) = \prod_{j=1}^{d} (\zeta - \tau_j)$ and the roots $\tau_j$ are mutually distinct.
Input: $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, values $\{\tau_1, \ldots, \tau_d\}$ (roots of p).
Output: Solution $x = (p(A))^{-1} q(A) b$.
1: doall j = 1 : d
2:   compute coefficient $\rho_j = \frac{q(\tau_j)}{p'(\tau_j)}$
3:   solve $(A - \tau_j I) x_j = b$
4: end
5: set $\rho_0 = \lim_{|\zeta| \to \infty} \frac{q(\zeta)}{p(\zeta)}$, $r = (\rho_1, \ldots, \rho_d)^\top$, $X = (x_1, \ldots, x_d)$
6: compute $x = \rho_0 b + X r$ //it is important to compute this step in parallel
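A compact dense sketch of Algorithm 12.3 is given below (Python/NumPy; q is a callable evaluating the numerator at a pole). The d shifted solves inside the list comprehension are mutually independent and are precisely what would be distributed over d processors, followed by the final multivector combination.

import numpy as np

def rational_matvec(A, b, taus, q, rho0=0.0):
    n = A.shape[0]
    taus = np.asarray(taus, dtype=complex)
    diff = taus[:, None] - taus[None, :]
    np.fill_diagonal(diff, 1.0)
    rhos = np.array([q(t) for t in taus]) / diff.prod(axis=1)   # rho_j = q(tau_j)/p'(tau_j)
    X = np.column_stack([np.linalg.solve(A - t * np.eye(n), b) for t in taus])
    # For a real rational function (conjugate pole pairs) the imaginary
    # parts cancel in exact arithmetic, so the caller may take the real part.
    return rho0 * b + X @ rhos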
The partial fraction coefficients are computed only once for any given function,
thus under the last of the standard assumptions listed in Remark 12.1, the cost of line
2 is not taken into account; cf. [40] for a variety of methods. On d processors, each
can undertake the solution of one linear system (line 3 of Algorithm 12.3). Except
for line 6, where one needs to compute the linear combination of the d solution
vectors, the computations are totally independent. Thus, the cost of this algorithm
on d processors is equal to a (possibly complex) linear system solve per processor
and a dense MV. For general dense matrices of size n, the cost of solving the linear
systems dominates. For structured matrices, the cost of each system solve could be
comparable or even less than that of a single dense sequential MV. For tridiagonals
of order n, for example, the cost of a single sequential MV, in line 6, dominates
the O(n) cost of the linear system solve in line 3. Thus, it is important to consider
using a parallel algorithm for the dense MV. If the number of processors is smaller
than d then more systems can be assigned to each processor, whereas if it is larger,
more processors can participate in the solution of each system using some parallel
algorithm.
It is straightforward to modify Algorithm 12.3 to compute r (A) directly. Line 3
becomes an inversion while line 6 a sum of d inverses, costing O(n 2 d). For structured
matrices for which the inversion can be accomplished at a cost smaller than O(n 3 ),
the cost of the final step can become comparable or even larger, assuming that the
inverses have no special structure.
Next, we consider in greater detail the solution of the linear systems in Algo-
rithm 12.3. Notice that all matrices involved are simple diagonal shifts of the form
(A − τ j I ). One possible preprocessing step is to transform A to an orthogonally
similar upper Hessenberg or (if symmetric) tridiagonal form; cf. [41] for an early
use of this observation, [42] for its application in computing the matrix exponential
and [43] for the case of reduction to tridiagonal form when the matrix is symmetric.
This preprocessing, incorporated in Algorithm 12.4, leads to considerable savings
since the reduction step is performed only once while the Hessenberg linear systems
in line 5 require only O(n 2 ) operations. The cost drops to O(n) when the matrix is
symmetric since the reduced system becomes tridiagonal. In both cases, the dominant
cost is the reduction in line 1; e.g. see [44, 45] for parallel solvers of such linear
systems. In Chap. 13 we will make use of the same preprocessing step for computing
the matrix pseudospectrum.
Algorithm 12.4 Computing $x = (p(A))^{-1} q(A) b$ when $p(\zeta) = \prod_{j=1}^{d} (\zeta - \tau_j)$ and the roots $\tau_j$ are mutually distinct.
Input: $A \in \mathbb{R}^{n \times n}$, $b \in \mathbb{R}^n$, values $\{\tau_1, \ldots, \tau_d\}$ (roots of p).
Output: Solution $x = (p(A))^{-1} q(A) b$.
1: $[Q, H] = \mathrm{hess}(A)$ //Q obtained as product of Householder transformations so that $Q^\top A Q = H$ is upper Hessenberg
2: compute $\hat b = Q^\top b$ //exploit the product form
3: doall j = 1 : d
4:   compute coefficient $\rho_j = \frac{q(\tau_j)}{p'(\tau_j)}$
5:   solve $(H - \tau_j I) \hat x_j = \hat b$
6:   compute $x_j = Q \hat x_j$ //exploit the product form
7: end
8: set $\rho_0 = \lim_{|\zeta| \to \infty} \frac{q(\zeta)}{p(\zeta)}$, $r = (\rho_1, \ldots, \rho_d)^\top$, $X = Q(\hat x_1, \ldots, \hat x_d)$
9: compute and return $x = \rho_0 b + X r$ //it is important to compute this step in parallel
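The preprocessing of Algorithm 12.4 can be mimicked with SciPy's Hessenberg reduction, as sketched below; a general dense solver is used for the shifted Hessenberg systems purely for brevity, whereas a structure-aware solver would bring the cost per shift down to O(n^2).

import numpy as np
from scipy.linalg import hessenberg

def rational_matvec_hessenberg(A, b, taus, rhos, rho0=0.0):
    H, Q = hessenberg(A, calc_q=True)   # Q^T A Q = H (upper Hessenberg), done once
    bhat = Q.T @ b
    n = H.shape[0]
    Xhat = np.column_stack([np.linalg.solve(H - t * np.eye(n), bhat) for t in taus])
    return rho0 * b + Q @ (Xhat @ np.asarray(rhos))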
Note that when the rational function is real, any complex poles appear in conjugate
pairs. Then the partial fraction coefficients also appear in conjugate pairs which makes
possible significant savings. For example, if (ρ, τ ) and (ρ̄, τ̄ ) are two such conjugate
pairs, then
$$\rho(A - \tau I)^{-1} + \bar\rho(A - \bar\tau I)^{-1} = 2\,\Re\!\left( \rho(A - \tau I)^{-1} \right). \qquad (12.11)$$
For example, if $\omega_1, \ldots, \omega_d$ denote the d-th roots of unity, so that $p(\zeta) = \zeta^d - 1 = \prod_{k=1}^{d} (\zeta - \omega_k)$, then
$$p'(\zeta) = \sum_{j=1}^{d} \prod_{\substack{k=1 \\ k \ne j}}^{d} (\zeta - \omega_k),$$
so that
$$\frac{p'(\zeta)}{p(\zeta)} = \frac{d\,\zeta^{d-1}}{\zeta^d - 1} = \sum_{j=1}^{d} \frac{1}{\zeta - \omega_j},$$
and therefore
$$\zeta^d = \frac{1}{1 - \dfrac{d}{\zeta y}}, \quad \text{where} \quad y = \sum_{j=1}^{d} \frac{1}{\zeta - \omega_j}.$$
On d processors, the total cost of the algorithm is log d parallel additions for the
reduction, one parallel division and a small number of additional operations. At first
sight, the result is somewhat surprising; why should one go through this process to
compute powers in O(log d) parallel additions rather than apply repeated squaring
to achieve the same in O(log d) (serial) multiplications? The reason is that the com-
putational model in [46] was of a time when the cost of (scalar) multiplication was
higher than addition. Today, this assumption holds if we consider matrix operations.
In fact, [46] briefly mentions the possibility of using partial fraction expansions with
matrix arguments. Note that similar tradeoffs drive other fast algorithms, such as the
3M method for complex matrix multiplication, and Strassen’s method; cf. Sect. 2.2.2.
Partial fraction expansions are very convenient for introducing parallelism but one
must be alert to roundoff effects in floating-point arithmetic. In particular, there are
cases where the partial fraction expansion contains terms that are large and of mixed
sign. If the computed result is small relative to the summands, it could be polluted
by the effect of catastrophic cancellation. For instance, building on an example from [47], the partial fraction expansion of $1/p(\zeta)$ for $p(\zeta) = (\zeta - \alpha)(\zeta - \delta)(\zeta + \delta)$ is
$$\frac{1}{p(\zeta)} = \frac{1}{\alpha^2 - \delta^2}\, \frac{1}{\zeta - \alpha} - \frac{1}{2\delta(\alpha - \delta)}\, \frac{1}{\zeta - \delta} + \frac{1}{2\delta(\alpha + \delta)}\, \frac{1}{\zeta + \delta}.$$
When $|\delta|$ is very small and $|\alpha| \gg |\delta|$, the $O(\frac{1}{\delta})$ factors in the last two terms lead to
catastrophic cancellation.
In this example, the danger of catastrophic cancellation is evident from the pres-
ence of two nearby poles that trigger the generation of large partial fraction coeffi-
cients of different sign. As shown in [47], however, cancellation
can also be caused
by the distribution of the poles. For instance, if $p(\zeta) = \prod_{j=1}^{d} (\zeta - \frac{j}{d})$ then
$$\frac{1}{p(\zeta)} = \sum_{j=1}^{d} \frac{\rho_j}{\zeta - \frac{j}{d}}, \qquad \rho_j = \frac{d^{\,d-1}}{\prod_{k=1, k \ne j}^{d} (j - k)}. \qquad (12.13)$$
When d = 20, $\rho_{20} = -\rho_1 = 20^{19}/19! \approx 4.3 \times 10^7$. Finally note that even if
no catastrophic cancellation takes place, multiplications with large partial fraction
coefficients will magnify any errors that are already present in each partial fraction
term.
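The growth predicted by (12.13) is easy to verify numerically; the short check below computes the coefficients $\rho_j = 1/p'(j/d)$ for d = 20 and prints the largest magnitude, which is about $4.3 \times 10^7$, as noted above.

import numpy as np

def pf_coeffs(poles):
    poles = np.asarray(poles, dtype=float)
    diff = poles[:, None] - poles[None, :]
    np.fill_diagonal(diff, 1.0)
    return 1.0 / diff.prod(axis=1)       # rho_j = 1 / prod_{k != j} (tau_j - tau_k)

d = 20
rho = pf_coeffs([j / d for j in range(1, d + 1)])
print("max |rho_j| =", np.abs(rho).max())  # approximately 4.3e+07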
In order to prevent large coefficients that can lead to cancellation effects when
evaluating partial fraction expansions, we follow [47] and consider hybrid represen-
tations for the rational polynomial, standing between partial fractions and the product
form. Specifically, we use a modification of the incomplete partial fraction decompo-
sition (IPF for short) that was devised in [48] to facilitate computing partial fraction
coefficients corresponding to denominators with quadratic factors which also helps
one avoid complex arithmetic; see also [40]. For example, if $p(\zeta) = (\zeta - \alpha)(\zeta^2 + \delta^2)$ is real with one real root α and two purely imaginary roots, $\pm\iota\delta$, then an IPF that avoids complex coefficients is
$$\frac{1}{p(\zeta)} = \frac{1}{\alpha^2 + \delta^2} \left( \frac{1}{\zeta - \alpha} - \frac{\zeta + \alpha}{\zeta^2 + \delta^2} \right).$$
The coefficients appearing in this IPF are $\pm(\alpha^2 + \delta^2)^{-1}$ and $-\alpha(\alpha^2 + \delta^2)^{-1}$. They
are all real and, fortuitously, bounded for very small values of δ.
In general, assuming that the rational function $r = q/p$ satisfies the standard assumptions and that we have a factorization of the denominator $p(\zeta) = \prod_{l=1}^{\mu} p_l(\zeta)$ into non-trivial polynomial factors $p_1, \ldots, p_\mu$, then an IPF based on these factors is
$$r(\zeta) = \sum_{l=1}^{\mu} \frac{h_l(\zeta)}{p_l(\zeta)}, \qquad (12.14)$$
where the polynomials h l are such that deg h l ≤ deg pl and their coefficients can
be computed using algorithms from [40]. It must be noted that the characterization
“incomplete” does not mean that it is approximate but rather that it is not fully
developed as a sum of d terms with linear denominators or powers thereof and thus
does not fully reveal some of the distinct poles.
where $q_1, \ldots, q_\mu$ are suitable polynomials with $\deg q_l \le \deg p_l$ and each term
ql / pl is expanded into its partial fractions.
The goal is to construct this decomposition and in turn the partial fraction expan-
sions so that none of the partial fraction coefficients exceeds in absolute value some
selected threshold, τ , and the number of terms, μ, is as small as possible. As in [47]
we expand each term above and obtain the partial fraction representation
$$r(\zeta) = \sum_{l=1}^{\mu} \left( \rho_0^{(l)} + \sum_{j=1}^{k_l} \frac{\rho_j^{(l)}}{\zeta - \tau_{j,l}} \right), \qquad \sum_{l=1}^{\mu} k_l = \deg p,$$
where the poles have been re-indexed from {τ1 , . . . , τd } to {τ1,1 , . . . , τk1 ,1 , τ1,2 , . . . ,
τkμ ,μ }. One idea is to perform a brute force search for an IPF that returns coefficients
that are not too large. For instance, for the rational function in Eq. (12.13) above, the
partial fraction coefficients when d = 6 are
$$(\rho_1, \ldots, \rho_6) = (-64.8,\ 324,\ -648,\ 648,\ -324,\ 64.8). \qquad (12.16)$$
One could instead take as the first factor of an IPF, for example, one of the polynomials
$$\prod_{\substack{j=1 \\ j \ne 3}}^{6} \left( \zeta - \frac{j}{6} \right) \qquad \text{or} \qquad \prod_{\substack{j=1 \\ j \ne 4}}^{6} \left( \zeta - \frac{j}{6} \right),$$
i.e. leave one of the poles out of the first factor and treat it separately; cf. Tables 12.1 and 12.2.
Table 12.1 Partial fraction coefficients for $\frac{1}{p_{1,k}(\zeta)}$ when $p_{1,k}(\zeta) = \prod_{j=1, j \ne k}^{6} (\zeta - \frac{j}{6})$, for the incomplete partial fraction expansion of $\frac{1}{p_{1,k}(\zeta)}\,\frac{1}{\zeta - \frac{k}{6}}$
k         1        2        3        4        5        6
       54.0     43.2     32.4     21.6     10.8     54.0
     −216.0   −162.0   −108.0    −54.0   −108.0   −216.0
      324.0    216.0    108.0    108.0    216.0    324.0
     −216.0   −108.0    −54.0   −108.0   −162.0   −216.0
       54.0     10.8     21.6     32.4     43.2     54.0
From the above we conclude that when μ = 2, there exists an IPF with maximum
coefficient equal to 9, which is a significant reduction from the maximum value of
648, found in Eq. (12.16) for the usual partial fractions (μ = 1).
If the rational function is of the form 1/ p, an exhaustive search like the above
can identify the IPF decomposition that will minimize the coefficients. For the parallel setting, one has to consider the tradeoff between the size of the coefficients, determined by τ, and the level of parallelism, determined by μ. For a rational function with numerator and denominator degrees (0, d) the total number of cases that must be examined is equal to the total number of k-partitions of d elements, for $k = 1, \ldots, d$. For a given k, the number of partitions is equal to the Stirling number of the second kind, S(d, k), while their sum is the Bell number B(d). These grow rapidly; for instance, B(6) = 203 whereas B(10) = 115,975 [49]. Therefore, as d grows
have to address the computation of appropriate factorizations that are practical when
the numerator is not trivial. One option is to start the search and stop as soon as an IPF
with coefficients smaller than some selected threshold, say τ , has been computed.
We next consider an effective approach for computing IPF, proposed in [47] and
denoted by IPF(τ ), where the value τ is an upper bound for the coefficients that
is set by the user. When τ = 0 no decomposition is applied and when τ = ∞
(or sufficiently large) it is the usual decomposition with all d terms. We outline the
application of the method for rational functions with numerator 1 and denominator
degree n. The partial fraction coefficients in this case are
$$\left( \frac{dp}{dz}(\tau_i) \right)^{-1}, \quad \text{where} \quad \frac{dp}{dz}(\tau_i) = \prod_{\substack{j=1 \\ j \ne i}}^{d} (\tau_i - \tau_j), \quad i = 1, \ldots, d.$$
The poles are taken in the Leja ordering, i.e. $\tau_{1,1}$ is a point of T of smallest modulus and, for $k = 2, 3, \ldots$, the next pole satisfies
$$\prod_{j=1}^{k-1} |\tau_{k,1} - \tau_{j,1}| = \max_{\theta \in T} \prod_{j=1}^{k-1} |\theta - \tau_{j,1}|, \qquad \tau_{k,1} \in T.$$
Table 12.2 Partial fraction coefficients for $\frac{1}{p_{1,(k,i)}(\zeta)}$ when $p_{1,(k,i)}(\zeta) = \prod_{j=1, j \ne k,i}^{6} (\zeta - \frac{j}{6})$, for the incomplete partial fraction expansion of $\frac{1}{p_{1,(k,i)}(\zeta)}\,\frac{1}{(\zeta - \frac{k}{6})(\zeta - \frac{i}{6})}$
(k,i)   (1,2)   (1,3)   (1,4)   (1,5)   (1,6)   (2,3)   (2,4)   (2,5)   (2,6)   (3,4)   (3,5)   (3,6)   (4,5)
         36.0    27.0    18.0     9.0    36.0    21.6    14.4     7.2    27.0    10.8     5.4    18.0     3.6
       −108.0   −72.0   −36.0   −54.0  −108.0   −54.0   −27.0   −36.0   −72.0   −18.0   −18.0   −36.0   −36.0
        108.0    54.0    36.0    72.0   108.0    36.0    18.0    36.0    54.0    18.0    27.0    36.0    54.0
        −36.0    −9.0   −18.0   −27.0   −36.0    −3.6    −5.4    −7.2    −9.0   −10.8   −14.4   −18.0   −21.6
          6.0     3.0     2.0     1.5     1.2     6.0     3.0     2.0     1.5     6.0     3.0     2.0     6.0
         −6.0    −3.0    −2.0    −1.5    −1.2    −6.0    −3.0    −2.0    −1.5    −6.0    −3.0    −2.0    −6.0
The pairs (k, i) in the header row indicate the poles left out to form the first factor. The partial fraction coefficients for the second factor $\left( (\zeta - \frac{k}{6})(\zeta - \frac{i}{6}) \right)^{-1}$ are listed in the bottom two rows. Since these are smaller than the coefficients for the first factor, we need not be concerned about their values.
The idea is then to compute the coefficients of the partial fraction decomposition of $\left( \prod_{j=1}^{k} (\zeta - \tau_{j,1}) \right)^{-1}$ for increasing values of k until one or more coefficients exceed the threshold τ. Assume this happens for some value of $k = k_1 + 1$. Then set as first factor of the sought IPF(τ) the term $(p_1(\zeta))^{-1}$, where
$$p_1(\zeta) = \prod_{j=1}^{k_1} (\zeta - \tau_{j,1}),$$
and denote the corresponding coefficients by $\rho_j^{(1)}$, $j = 1, \ldots, k_1$. Next, remove the poles $\{\tau_{j,1}\}_{j=1}^{k_1}$ from T, perform Leja ordering of the remaining ones, and repeat the procedure to form the next factor. On termination, the IPF(τ) representation is
$$\prod_{j=1}^{d} (\zeta - \tau_j)^{-1} = \sum_{l=1}^{\mu} \left( \sum_{j=1}^{k_l} \frac{\rho_j^{(l)}}{\zeta - \tau_{j,l}} \right).$$
It is also known that if
$$r_d(\zeta) = \frac{1}{p_d(\zeta)}, \quad \text{where} \quad p_d(\zeta) = \prod_{j=1}^{d} (\zeta - \tau_j^{(d)}),$$
and for each d the poles $\tau_j^{(d)}$ are distinct and are the first d Leja points from a set T that is compact in $\mathbb{C}$, whose complement is connected and regular for the Dirichlet problem, and if we denote by $\hat\rho_j^{(d)}$ the partial fraction coefficients for any other arbitrary set of pairwise distinct points from T, then
$$\prod_{j=1}^{d} |\rho_j^{(d)}| \le \chi^{-d(d-1)}, \qquad \lim_{d \to \infty} \left( \prod_{j=1}^{d} |\rho_j^{(d)}| \right)^{\frac{1}{d(d-1)}} = \chi^{-1} \le \lim_{d \to \infty} \left( \prod_{j=1}^{d} |\hat\rho_j^{(d)}| \right)^{\frac{1}{d(d-1)}}.$$
Algorithm 12.6 Computing the IPF(τ) representation of (p(ζ))^{-1} when p(ζ) = ∏_{j=1}^{d}(ζ − τ_j) and the roots τ_j are mutually distinct
Input: T = {τ_j}_{j=1}^{d}, τ ≥ 0;
Output: ∪_{l=1}^{μ} {τ_{j,l}}_{j=1}^{k_l}, ∪_{l=1}^{μ} {ρ_j^{(l)}}, where ∑_{l=1}^{μ} k_l = d;
1: j = 1, μ = 1, k = 0;
2: while j ≤ d
3:    k = k + 1; // j − 1 poles already selected and k − 1 poles in the present factor of the IPF(τ) representation
4:    if k = 1 then
5:       choose τ_{1,μ} ∈ T such that |τ_{1,μ}| = min_{t∈T} |t|;
6:       ρ_1^{(μ)} = 1; j = j + 1;
7:    else
8:       choose τ_{k,μ} ∈ T such that ∏_{l=1}^{k−1} |τ_{k,μ} − τ_{l,μ}| = max_{t∈T} ∏_{l=1}^{k−1} |t − τ_{l,μ}|;
9:       do l = 1 : k − 1
10:         ρ̃_l^{(μ)} = ρ_l^{(μ)} (τ_{l,μ} − τ_{k,μ})^{-1};
11:      end
12:      ρ̃_k^{(μ)} = ∏_{l=1}^{k−1} (τ_{k,μ} − τ_{l,μ})^{-1};
13:      if max_{1≤l≤k} |ρ̃_l^{(μ)}| ≤ τ then
14:         do l = 1 : k
15:            ρ_l^{(μ)} = ρ̃_l^{(μ)};
16:         end
17:         j = j + 1
18:      else
19:         T = T \ {τ_{l,μ}}_{l=1}^{k−1}; k_μ = k − 1; k = 0; μ = μ + 1; // begin a new factor
20:      end if
21:   end if
22: end while
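For illustration, the following minimal Python sketch of Algorithm 12.6 is our own; it omits the bookkeeping of the indices j, k, μ and simply returns the factors as (poles, coefficients) pairs.

import numpy as np

def ipf(poles, tau):
    """Sketch of Algorithm 12.6: split the partial fraction expansion of 1/p(zeta),
    p(zeta) = prod_j (zeta - tau_j) with distinct roots, into factors whose
    coefficients stay below the threshold tau."""
    T = list(np.asarray(poles, dtype=complex))
    factors, cur_poles, cur_rho = [], [], []
    while T or cur_poles:
        if not T:                       # all poles consumed: close the last factor
            factors.append((np.array(cur_poles), np.array(cur_rho)))
            break
        if not cur_poles:               # start a factor with the pole of smallest modulus
            i = int(np.argmin(np.abs(T)))
            cur_poles, cur_rho = [T.pop(i)], [1.0 + 0j]
            continue
        # Leja step: candidate maximizing the product of distances to the current poles
        prods = [np.prod(np.abs(t - np.array(cur_poles))) for t in T]
        i = int(np.argmax(prods))
        cand = T[i]
        new_rho = [r / (p - cand) for r, p in zip(cur_rho, cur_poles)]
        new_rho.append(1.0 / np.prod(cand - np.array(cur_poles)))
        if max(abs(r) for r in new_rho) <= tau:
            cur_poles.append(T.pop(i))  # accept the pole into the current factor
            cur_rho = new_rho
        else:                           # threshold exceeded: close factor, start a new one
            factors.append((np.array(cur_poles), np.array(cur_rho)))
            cur_poles, cur_rho = [], []
    return factors

# Example: 1/p(zeta) with poles j/6; a modest threshold forces several factors.
print([len(p) for p, _ in ipf([j / 6 for j in range(1, 7)], tau=50.0)])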
We next extend our discussion to include cases in which the underlying matrix is so large that iterative methods become necessary. It is then only feasible to compute f (A)B, where B consists of one or a few vectors, rather than f (A). A typical case arises in solving
large systems of differential equations, e.g. those that occur after spatial discretization
of initial value problems of parabolic type. Then, the function in the template (12.2)
is related to the exponential; a large number of exponential integrators, methods
suitable for these problems, have been developed; cf. [2] for a survey. See also [51–
55]. The discussion that follows focuses on the computation of exp(A)b.
The two primary tools used for the effective approximation of exp(A)b when A is large are the partial fraction representation of rational approximations to the exponential and Krylov projection methods. The early presentations [56–60] of the combination of these principles also contained proposals for computing the exponential in parallel. This was a direct consequence of the fact that partial fractions and the Arnoldi procedure provide opportunities for large and medium grain parallelism. These tools are also applicable to more general matrix functions. Moreover, they are of interest independently of parallel processing.
For this reason, the theoretical underpinnings of such methods for the exponential and
other matrix functions have been the subject of extensive research; see e.g. [61–65].
One class of rational approximations consists of the Padé approximants. Their numerator and denominator polynomials are known analytically; cf. [66, 67]. Padé approximants, like Taylor series, are designed to provide good local approximations, e.g. near 0. Moreover, the roots of the numerator and denominator are simple and the two polynomials have no common roots. For better approximation, the identity exp(A) = (exp(Ah))^{1/h} is used to cluster the eigenvalues close to the origin. The other possibility is to construct a
rational Chebyshev approximation on some compact set in the complex plane. The
Chebyshev rational approximation is primarily applicable to matrices that are sym-
metric negative definite and has to be computed numerically; a standard reference
for the power form coefficients of the numerator and denominator polynomials with deg p = deg q (“diagonal approximations”) ranging from 1 to 30 is [68]. Reference [58] describes parallel algorithms based on the Padé and Chebyshev rational approximations, illustrating their effectiveness and the advantages of the Chebyshev approximation [68–70] for negative definite matrices. An alternative to the delicate computations involved in the Chebyshev rational approximation is to use the Carathéodory-Fejér (CF) method to obtain good rational approximations followed by their partial fraction representation; cf. [5, 26].
Recall that any complex poles appear in conjugate pairs. These can be grouped as suggested in Eq. (12.11) to halve the number of solves with (A − τ I ), where τ is one member of a conjugate pair. In the common case that A is Hermitian and τ
is complex, (A − τ I ) is complex symmetric and there exist Lanczos and CG type
iterative methods that take advantage of this special structure; cf. [71–73]. Another
possibility is to combine terms corresponding to conjugate shifts and take advantage
of the fact that
$$(A - \tau I)(A - \bar{\tau} I) = A^2 - 2\Re(\tau)A + |\tau|^2 I = A(A - 2\Re(\tau) I) + |\tau|^2 I$$
is a real matrix. Then, MV operations in a Krylov method would only involve real
arithmetic.
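A small sketch of this device, assuming A and the vector are real (the function name is ours):

import numpy as np

def apply_conjugate_pair(A, tau, x):
    """Apply (A - tau I)(A - conj(tau) I) = A(A - 2 Re(tau) I) + |tau|^2 I to a
    real vector x using only real matrix-vector products (A assumed real)."""
    y = A @ x - 2.0 * tau.real * x          # (A - 2 Re(tau) I) x
    return A @ y + (abs(tau) ** 2) * x      # A y + |tau|^2 x

# quick check against the complex factored form
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); x = rng.standard_normal(5); tau = 1.0 + 2.0j
ref = (A - tau * np.eye(5)) @ ((A - np.conj(tau) * np.eye(5)) @ x)
assert np.allclose(apply_conjugate_pair(A, tau, x), ref.real)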
Consider now the solution of the linear systems corresponding to the partial fraction terms. As noted in the previous subsection, if one were able to use a direct method, it would be possible to lower redundancy by first reducing the matrix to Hessenberg form. When full reduction is prohibitive because of the size of A, one can apply a “partial” approach using the Arnoldi process to build an orthonormal basis, Vν , for the Krylov subspace Kν (A, b), where ν ≪ n.
The parallelism in Krylov methods was discussed in Sect. 9.3.2 of Chap. 9. Here,
however, it also holds that
$$V_\nu^*(A - \tau I)V_\nu = H_\nu - \tau I_\nu$$
for any τ and so the same basis reduces not only A but all shifted matrices to Hes-
senberg form [74, 75]. This is the well known shift invariance property of Krylov
subspaces that can be expressed by Kν (A, b) = Kν (A − τ I, b); cf. [76]. This property implies that methods such as FOM and GMRES can proceed by first computing the basis via Arnoldi, then solving the (different) small shifted systems to obtain the solution of each partial fraction term with the same basis, and finally combining these using the partial fraction coefficients. For example, if FOM can be used, then, setting β = ‖b‖, an approximation to r(A)b from Kν (A, b) is
$$\tilde{x} = \beta V_\nu \sum_{j=1}^{d}\rho_j (H_\nu - \tau_j I)^{-1} e_1 = \beta V_\nu\, r(H_\nu) e_1, \qquad(12.17)$$
on the condition that all matrices (Hν − τ j I ) are invertible. The above approach can
also be extended to handle restarting; cf. [77] when solving the shifted systems. It
is also possible to solve the multiply shifted systems by Lanczos based approaches
including BiCGSTAB, QMR and transpose free QMR; cf. [78, 79].
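The following Python sketch is our own and carries no breakdown or restarting safeguards; it illustrates (12.17): a single Arnoldi pass produces Vν and Hν, after which each partial fraction term requires only a small shifted Hessenberg solve, and replacing r(Hν) by exp(Hν) yields the Krylov approximation of the exponential discussed next.

import numpy as np
from scipy.linalg import solve, expm

def arnoldi(A, b, nu):
    """Plain Arnoldi: V (n x nu) with orthonormal columns and the nu x nu upper
    Hessenberg H with V^* A V = H (no breakdown handling)."""
    n = len(b)
    V = np.zeros((n, nu), dtype=complex)
    H = np.zeros((nu, nu), dtype=complex)
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(nu):
        w = A @ V[:, j]
        for i in range(j + 1):                 # modified Gram-Schmidt
            H[i, j] = np.vdot(V[:, i], w)
            w = w - H[i, j] * V[:, i]
        if j + 1 < nu:
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
    return V, H

def rational_fom(A, b, poles, rho, rho0=0.0, nu=30):
    """Approximate r(A) b for r(z) = rho0 + sum_j rho_j/(z - tau_j) as in (12.17):
    one Arnoldi basis serves every shifted system (shift invariance)."""
    V, H = arnoldi(A, b, nu)
    beta = np.linalg.norm(b)
    e1 = np.zeros(nu, dtype=complex); e1[0] = 1.0
    y = rho0 * e1
    for tau, r in zip(poles, rho):             # small nu x nu shifted solves
        y = y + r * solve(H - tau * np.eye(nu), e1)
    return beta * (V @ y)

def krylov_expm(A, b, nu=30):
    """The (12.18) variant: replace r(H_nu) by exp(H_nu)."""
    V, H = arnoldi(A, b, nu)
    return np.linalg.norm(b) * (V @ expm(H)[:, 0])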
Consider now (12.17). By construction, if r(Hν) exists, it must also provide a rational approximation to the exponential. In other words, we can write
$$\exp(A)b \approx \beta V_\nu \exp(H_\nu) e_1. \qquad(12.18)$$
This is the Krylov approach for approximating the exponential, cf. [57, 58, 80, 81]. Note that the only difference between (12.17) and (12.18) is that r(Hν) was replaced by exp(Hν). It can be shown that the term βVν exp(Hν)e1 is a polynomial approximation of exp(A)b with a polynomial of degree ν − 1 which interpolates the exponential, in the Hermite sense, on the set of eigenvalues of Hν; cf. [82]. In fact, for certain classes of matrices, this type of approximation is extremely accurate; cf. [61] and references therein. Moreover, Eq. (12.18) provides a framework for approximating general matrix functions using Krylov subspaces. We refer to [83]
for generalizations and their parallel implementations.
Consider, for example, using transpose free QMR. There is a multiply shifted version of TFQMR [78] in which each iteration consists of computations on the common Krylov information shared between all systems, and of computations that build or update data specific to each term. As described in [56], at each iteration of multiply shifted TFQMR, the dimension of the underlying Krylov subspace increases by 2. The necessary computations are of two types: one set advances the dimension of the Krylov subspace, is independent of the total number of systems, and consists of 2 MVs, 4 dot products and 6 vector updates. The other set consists of computations specific to each term, namely 9 vector updates and a few scalar ones, that can be conducted in parallel.
One possibility is to stop the iterations when an accurate approximation has been
obtained for all systems. Finally, there needs to be an MV operation to combine the
partial solutions as in (12.10). Because the roots τ j are likely to be complex, we
expect that this BLAS2 operation will contain complex elements.
An important characteristic of this approach is that it has both large grain parallelism, because of the partial fraction decomposition, and medium grain parallelism, because of the shared computations with A and vectors of that size. Recall that the shared computations occur because of our desire to reduce redundancy by exploiting the shift invariance of Krylov subspaces.
The exploitation of the multiple shifts reduces redundancy but also curtails par-
allelism. Under some conditions (e.g. high communication costs) this might not be
desirable. A more general approach that provides more flexibility over the amount
of large and medium grain parallelism was proposed in [56]. The idea is to organize
the partial fraction terms into groups, and to express the rational function as a double
sum, say
$$r(A) = \rho_0 I + \sum_{l=1}^{k}\sum_{j\in I_l}\rho_j^{(l)}(A - \zeta_j I)^{-1},$$
where the index sets I_l, l = 1, . . . , k, are a partition of {1, 2, . . . , deg p} and the sets of coefficients {ρ_j^{(1)}}, . . . , {ρ_j^{(k)}} are a partition of the set {ρ_1, . . . , ρ_d}. Then we
can build a hybrid scheme in which the inner sum is constructed using the multi-
ply shifted approach, but the k components of the outer sum are treated completely
independently. The extreme cases are k = deg p, in which all systems are treated
independently, and k = 1, in which all systems are solved with a single instance of
the multiply shifted approach. This flexible hybrid approach was used on the Cedar vector multiprocessor but is also useful in the context of architectures with hierarchical parallelism, which are emerging as a dominant paradigm in high performance computing.
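A sketch of this grouping in Python follows; the partitioning into contiguous slices is our arbitrary choice, and any inner multiply shifted solver (for instance the rational_fom sketch above) can be plugged in, one group per processor or task.

import numpy as np

def hybrid_groups(poles, rho, k):
    """Partition the partial fraction terms into k index sets I_1,...,I_k
    (here simply contiguous slices); each group gets its own multiply
    shifted solver and can be assigned to a different processor."""
    idx = np.array_split(np.arange(len(poles)), k)
    return [(np.asarray(poles)[I], np.asarray(rho)[I]) for I in idx]

def hybrid_rational_eval(A, b, poles, rho, rho0, k, inner_solver):
    """r(A) b = rho0 b + sum over groups; the outer sum is embarrassingly
    parallel, the inner sums exploit shift sharing."""
    parts = [inner_solver(A, b, p, r) for p, r in hybrid_groups(poles, rho, k)]
    return rho0 * b + np.sum(parts, axis=0)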
In the more general case where we need to compute exp(A)B, where B is a block
of vectors that replaces b in (12.9), there is a need to solve a set of shifted systems
for each column of B. This involves ample parallelism, but also replication of work,
given what we know about the Krylov invariance to shifting. As before, it is natural
to seek techniques (e.g. [84–86]) that will reduce the redundancy. The final choice,
of course, will depend on the problem and on the characteristics of the underlying
computational platform.
We conclude by recalling from Sect. 12.1.3 that the use of partial fractions requires
care to avoid catastrophic cancellations and loss of accuracy. Table 12.3 indicates
how large the partial fraction coefficients of some Padé approximants become as
the degrees of the numerator and denominator increase. To avoid problems, we
can apply incomplete partial fractions as suggested in Sect. 12.1.3. For example, in
Table 12.4 we present results with the IPF(τ ) algorithm that extends (12.6) to the
case of rational functions that are the quotient of polynomials of equal degree. The
algorithm was applied to evaluate the partial fraction representations of diagonal Padé
approximations for the matrix exponential applied on a vector, that is r_{d,d}(−Aδ)b as approximations for exp(−Aδ)b, with $A = \frac{1}{h^2}\,\mathrm{tridiag}[-1, 2, -1]$ of order N, using h = 1/(N + 1) with N = 998. The right-hand side is $b = \sum_{j=1}^{N}\frac{1}{j}v_j$, where the v_j are the eigenvectors of A, ordered so that v_j corresponds to the eigenvalue $\lambda_j = \frac{4}{h^2}\sin^2\frac{j\pi}{2(N+1)}$. Under
“components” we show the groups that were formed by the application of the IPF algorithm. For example, when d = 24, the component set for τ = 10^8 indicates that
to keep the partial fraction coefficients below that value, the rational function was written as a product of two terms, one consisting of a sum of 16 elements, the other of a sum of 8. The resulting relative error in the final solution is approximately 10^{−9}. If instead one were to use the full partial fraction as a single sum of all 24 terms, the error would be approximately 10^{−2.6}.
12.2 Determinants
$$\det A = \rho K^n, \qquad(12.19)$$
in which |ρ| = 1 and $\ln K = \frac{1}{n}\sum_{i=1}^{n}\ln|\nu_{ii}|$.
For a general dense matrix, the complexity of the computation is O(n^3), and O(n^2) for a Hessenberg matrix. For the latter, a variation of the above procedure is the use of Hyman’s method (see [87] and references therein). In the present section, we consider techniques that are more suitable for large structured sparse matrices.
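A Python sketch of the device in (12.19) is given below; it is our own, uses a dense LU factorization where a sparse one would be used for the matrices of interest here, and accumulates logarithms so that neither overflow nor underflow occurs.

import numpy as np
from scipy.linalg import lu_factor

def log_det(A):
    """det A = rho * K^n as in (12.19), with |rho| = 1 and
    ln K = (1/n) sum ln|u_ii|, the u_ii being the diagonal entries of the LU
    factorization."""
    n = A.shape[0]
    lu, piv = lu_factor(A)
    d = np.diag(lu)
    # sign of the row permutation: one transposition per index with piv[i] != i
    perm_sign = -1.0 if np.sum(piv != np.arange(n)) % 2 else 1.0
    rho = perm_sign * np.prod(d / np.abs(d))      # unimodular factor
    lnK = np.mean(np.log(np.abs(d)))              # ln K
    return rho, lnK

# det A = rho * exp(n * lnK); for a 400 x 400 random matrix the determinant
# itself would already be far outside double precision range.
A = np.random.default_rng(1).standard_normal((400, 400))
rho, lnK = log_det(A)
print(rho, lnK)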
where for i = 1, . . . , q − 1 the blocks Ai+1,i and Ai,i+1 are coupling matrices
defined by:
$$A_{i,i+1} = \begin{pmatrix} 0 & 0 \\ B_i & 0 \end{pmatrix}, \qquad A_{i+1,i} = \begin{pmatrix} 0 & C_{i+1} \\ 0 & 0 \end{pmatrix};$$
Consider the Spike factorization scheme in Sect. 5.2.1, where, without loss of
generality, we assume that D = diag(A1 , A2 , . . . , Aq ) is nonsingular, and for i =
1, . . . , q, we have the factorizations Pi Ai Q i = L i Ui of the sparse diagonal blocks
Ai . Let P = diag(P1 , P2 , . . . , Pq ), Q = diag(Q 1 , Q 2 , . . . , Q q ) and
S = D −1 A = In + T,
where
and
$$
T=\begin{pmatrix}
0_{s_1} &         & V_1     &         &         &         &         &         &         &             &         \\
        & 0_{r_2} & V_1^b   &         &         &         &         &         &         &             &         \\
        & W_2^t   & 0_{l_1} &         &         & V_2^t   &         &         &         &             &         \\
        & W_2     &         & 0_{s_2} &         & V_2     &         &         &         &             &         \\
        & W_2^b   &         &         & 0_{r_3} & V_2^b   &         &         &         &             &         \\
        &         &         &         & W_3^t   & 0_{l_2} &         &         & V_3^t   &             &         \\
        &         &         &         & W_3     &         & 0_{s_3} &         & V_3     &             &         \\
        &         &         &         & W_3^b   &         &         & 0_{r_4} & V_3^b   &             &         \\
        &         &         &         &         &         &         &         & \ddots  & V_{q-1}^t   &         \\
        &         &         &         &         &         &         &         &         & V_{q-1}     &         \\
        &         &         &         &         &         &         &         & 0_{r_q} & V_{q-1}^b   &         \\
        &         &         &         &         &         &         &         & W_q^t   & 0_{l_{q-1}} &         \\
        &         &         &         &         &         &         &         & W_q     &             & 0_{s_q}
\end{pmatrix},
$$
and where the right and left spikes, respectively, are given by [91]:
$$V_1 \in \mathbb{C}^{n_1\times l_1} \equiv \begin{pmatrix} V_1 \\ V_1^b \end{pmatrix}\!\begin{matrix} s_1 \\ r_2 \end{matrix} \qquad(12.21)$$
$$= (A_1)^{-1}\begin{pmatrix} 0 \\ B_1 \end{pmatrix} = Q_1 U_1^{-1} L_1^{-1} P_1 \begin{pmatrix} 0 \\ B_1 \end{pmatrix}, \qquad(12.22)$$
and for i = 2, . . . , q − 1,
$$V_i \in \mathbb{C}^{n_i\times l_i} \equiv \begin{pmatrix} V_i^t \\ V_i \\ V_i^b \end{pmatrix}\!\begin{matrix} l_{i-1} \\ s_i \\ r_{i+1} \end{matrix} \qquad(12.23)$$
$$= (A_i)^{-1}\begin{pmatrix} 0 \\ B_i \end{pmatrix} = Q_i U_i^{-1} L_i^{-1} P_i \begin{pmatrix} 0 \\ B_i \end{pmatrix}, \qquad(12.24)$$
$$W_i \in \mathbb{C}^{n_i\times r_i} \equiv \begin{pmatrix} W_i^t \\ W_i \\ W_i^b \end{pmatrix}\!\begin{matrix} l_{i-1} \\ s_i \\ r_{i+1} \end{matrix} \qquad(12.25)$$
$$= (A_i)^{-1}\begin{pmatrix} C_i \\ 0 \end{pmatrix} = Q_i U_i^{-1} L_i^{-1} P_i \begin{pmatrix} C_i \\ 0 \end{pmatrix}, \qquad(12.26)$$
and
$$W_q \in \mathbb{C}^{n_q\times r_q} \equiv \begin{pmatrix} W_q^t \\ W_q \end{pmatrix}\!\begin{matrix} l_{q-1} \\ s_q \end{matrix} \qquad(12.27)$$
$$= (A_q)^{-1}\begin{pmatrix} C_q \\ 0 \end{pmatrix} = Q_q U_q^{-1} L_q^{-1} P_q \begin{pmatrix} C_q \\ 0 \end{pmatrix}. \qquad(12.28)$$
where R(z) = (zI − A)^{-1} is the resolvent. Also, let Φ_z(h) = det(I + hR(z)); then
$$\int_{z}^{z+h}\frac{f'(\zeta)}{f(\zeta)}\,d\zeta = \ln(\Phi_z(h)) = \ln|\Phi_z(h)| + i\,\arg(\Phi_z(h)).$$
The following lemma determines the stepsize control which guarantees proper integration. The branch (i.e. a determination arg_0 of the argument) to be followed along the integration process is fixed by selecting an origin z_0 ∈ Γ and by insuring that
$$\Phi_z(s)\notin(-\infty, 0], \qquad \forall\, s\in[0,h];$$
this holds provided that
$$|h| < \frac{1}{|\Phi_z'(0)|}. \qquad(12.36)$$
The derivative Φ_z'(0) can be estimated in several ways, e.g. see [21, 97–100]. The most straightforward procedure, but not the most efficient, consists of approximating the derivative with the ratio
$$\Phi_z'(0) \approx \frac{\Phi_z(s) - 1}{s},$$
where s = αh with an appropriately small α. Therefore, the computation imposes an
additional LU factorization for evaluating the quantity Φz (s). This approach doubles
the computational effort when compared to computing only one determinant per
vertex as needed for the integration.
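The following Python sketch (ours) evaluates Φ_z(h) through logarithms of determinants, as in Sect. 12.2, and applies the stepsize test; in an actual implementation the factorization at z would of course be reused for consecutive steps along the contour.

import numpy as np
from scipy.linalg import lu_factor

def log_det(M):
    """Return (ln|det M|, arg det M) via an LU factorization."""
    lu, piv = lu_factor(M)
    d = np.diag(lu)
    sign = -1.0 if np.sum(piv != np.arange(M.shape[0])) % 2 else 1.0
    return np.sum(np.log(np.abs(d))), np.angle(sign * np.prod(d / np.abs(d)))

def phi(A, z, h):
    """Phi_z(h) = det(I + h R(z)) = det((z+h)I - A) / det(zI - A), formed from
    logarithms of the two determinants."""
    n = A.shape[0]
    la, aa = log_det((z + h) * np.eye(n) - A)
    lb, ab = log_det(z * np.eye(n) - A)
    return np.exp((la - lb) + 1j * (aa - ab))

def step_is_safe(A, z, h, alpha=0.01):
    """Stepsize control (12.36): approximate Phi'_z(0) by (Phi_z(s) - 1)/s with
    s = alpha*h (one extra factorization) and check |h| < 1/|Phi'_z(0)|."""
    s = alpha * h
    dphi = (phi(A, z, s) - 1.0) / s
    return abs(h) < 1.0 / abs(dphi)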
References
1. Varga, R.: Matrix Iterative Analysis. Springer Series in Computational Mathematics, 2nd edn.
Springer, Berlin (2000)
2. Hochbruck, M., Ostermann, A.: Exponential integrators. Acta Numer. 19, 209–286 (2010).
doi:10.1017/S0962492910000048
3. Sidje, R.: EXPOKIT: software package for computing matrix exponentials. ACM Trans. Math.
Softw. 24(1), 130–156 (1998)
4. Berland, H., Skaflestad, B., Wright, W.M.: EXPINT—A MATLAB package for exponential
integrators. ACM Trans. Math. Softw. 33(1) (2007). doi:10.1145/1206040.1206044. http://
doi.acm.org/10.1145/1206040.1206044
5. Schmelzer, T., Trefethen, L.N.: Evaluating matrix functions for exponential integrators via
Carathéodory-Fejér approximation and contour integrals. ETNA 29, 1–18 (2007)
6. Skaflestad, B., Wright, W.: The scaling and modified squaring method for matrix functions
related to the exponential. Appl. Numer. Math. 59, 783–799 (2009)
7. Festinger, L.: The analysis of sociograms using matrix algebra. Hum. Relat. 2, 153–158 (1949)
8. Katz, L.: A new status index derived from sociometric analysis. Psychometrika 18(1), 39–43
(1953)
9. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Pro-
ceedings of 7th International Conference World Wide Web, pp. 107–117. Elsevier Science
Publishers B.V., Brisbane (1998)
10. Kleinberg, J.: Authoritative sources in a hyperlinked environment. J. ACM 46, 604–632 (1999)
11. Langville, A., Meyer, C.: Google’s PageRank and Beyond: The Science of Search Engine
Rankings. Princeton University Press, Princeton (2006)
12. Bonchi, F., Esfandiar, P., Gleich, D., Greif, C., Lakshmanan, L.: Fast matrix computations for
pairwise and columnwise commute times and Katz scores. Internet Math. 8(2011), 73–112
(2011). http://projecteuclid.org/euclid.im/1338512314
13. Estrada, E., Higham, D.: Network properties revealed through matrix functions. SIAM Rev. 52(4), 696–714 (2010)
14. Fenu, C., Martin, D., Reichel, L., Rodriguez, G.: Network analysis via partial spectral factor-
ization and Gauss quadrature. SIAM J. Sci. Comput. 35, A2046–A2068 (2013)
15. Estrada, E., Rodríguez-Velázquez, J.: Subgraph centrality and clustering in complex hyper-
networks. Phys. A: Stat. Mech. Appl. 364, 581–594 (2006). http://www.sciencedirect.com/
science/article/B6TVG-4HYD5P6-3/1/69cba5c107b2310f15a391fe982df305
16. Benzi, M., Estrada, E., Klymko, C.: Ranking hubs and authorities using matrix functions.
Linear Algebra Appl. 438, 2447–2474 (2013)
17. Estrada, E., Hatano, N.: Communicability Graph and Community Structures in Complex
Networks. CoRR arXiv:abs/0905.4103 (2009)
18. Baeza-Yates, R., Boldi, P., Castillo, C.: Generic damping functions for propagating importance
in link-based ranking. J. Internet Math. 3(4), 445–478 (2006)
19. Kollias, G., Gallopoulos, E., Grama, A.: Surfing the network for ranking by multidamping.
IEEE TKDE (2013). http://www.computer.org/csdl/trans/tk/preprint/06412669-abs.html
20. Kollias, G., Gallopoulos, E.: Multidamping simulation framework for link-based ranking. In:
A. Frommer, M. Mahoney, D. Szyld (eds.) Web Information Retrieval and Linear Algebra
Algorithms, no. 07071 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und
Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany (2007). http://drops.
dagstuhl.de/opus/volltexte/2007/1060
21. Bai, Z., Fahey, G., Golub, G.: Some large-scale matrix computation problems. J. Com-
put. Appl. Math. 74(1–2), 71–89 (1996). doi:10.1016/0377-0427(96)00018-0. http://www.
sciencedirect.com/science/article/pii/0377042796000180
22. Bekas, C., Curioni, A., Fedulova, I.: Low-cost data uncertainty quantification. Concurr. Com-
put.: Pract. Exp. (2011). doi:10.1002/cpe.1770. http://dx.doi.org/10.1002/cpe.1770
23. Stathopoulos, A., Laeuchli, J., Orginos, K.: Hierarchical probing for estimating the trace of
the matrix inverse on toroidal lattices. SIAM J. Sci. Comput. 35, S299–S322 (2013). http://
epubs.siam.org/doi/abs/10.1137/S089547980036869X
24. Higham, N.: Functions of Matrices: Theory and Computation. SIAM, Philadelphia (2008)
25. Golub, G., Meurant, G.: Matrices, Moments and Quadrature with Applications. Princeton
University Press, Princeton (2010)
26. Trefethen, L.: Approximation Theory and Approximation Practice. SIAM, Philadelphia
(2013)
27. Higham, N., Deadman, E.: A catalogue of software for matrix functions. version 1.0. Technical
Report 2014.8, Manchester Institute for Mathematical Sciences, School of Mathematics, The
University of Manchester (2014)
28. Estrin, G.: Organization of computer systems: the fixed plus variable structure computer. In:
Proceedings Western Joint IRE-AIEE-ACM Computer Conference, pp. 33–40. ACM, New
York (1960)
29. Lakshmivarahan, S., Dhall, S.K.: Analysis and Design of Parallel Algorithms: Arithmetic and
Matrix Problems. McGraw-Hill Publishing, New York (1990)
30. Maruyama, K.: On the parallel evaluation of polynomials. IEEE Trans. Comput. C-22(1)
(1973)
31. Munro, I., Paterson, M.: Optimal algorithms for parallel polynomial evaluation. J. Comput.
Syst. Sci. 7, 189–198 (1973)
32. Pan, V.: Complexity of computations with matrices and polynomials. SIAM Rev. 34(2), 255–
262 (1992)
33. Reynolds, G.: Investigation of different methods of fast polynomial evaluation. Master’s thesis,
EPCC, The University of Edinburgh (2010)
34. Eberly, W.: Very fast parallel polynomial arithmetic. SIAM J. Comput. 18(5), 955–976 (1989)
35. Paterson, M., Stockmeyer, L.: On the number of nonscalar multiplications necessary to eval-
uate polynomials. SIAM J. Comput. 2, 60–66 (1973)
36. Alonso, P., Boratto, M., Peinado, J., Ibáñez, J., Sastre, J.: On the evaluation of matrix poly-
nomials using several GPGPUs. Technical Report, Department of Information Systems and
Computation, Universitat Politécnica de Valéncia (2014)
37. Bernstein, D.: Fast multiplication and its applications. In: Buhler, J., Stevenhagen, P. (eds.)
Algorithmic number theory: lattices, number fields, curves and cryptography, Mathematical
Sciences Research Institute Publications (Book 44), pp. 325–384. Cambridge University Press
(2008)
38. Trefethen, L., Weideman, J., Schmelzer, T.: Talbot quadratures and rational approximations.
BIT Numer. Math. 46, 653–670 (2006)
39. Swarztrauber, P.N.: A direct method for the discrete solution of separable elliptic equations.
SIAM J. Numer. Anal. 11(6), 1136–1150 (1974)
40. Henrici, P.: Applied and Computational Complex Analysis. Wiley, New York (1974)
41. Enright, W.H.: Improving the efficiency of matrix operations in the numerical solution of stiff
differential equations. ACM TOMS 4(2), 127–136 (1978)
42. Choi, C.H., Laub, A.J.: Improving the efficiency of matrix operations in the numerical solution
of large implicit systems of linear differential equations. Int. J. Control 46(3), 991–1008 (1987)
43. Lu, Y.: Computing a matrix function for exponential integrators. J. Comput. Appl. Math.
161(1), 203–216 (2003)
44. Ltaief, H., Kurzak, J., Dongarra., J.: Parallel block Hessenberg reduction using algorithms-
by-tiles for multicore architectures revisited. Technical Report. Innovative Computing Labo-
ratory, University of Tennessee (2008)
45. Quintana-Ortí, G., van de Geijn, R.: Improving the performance of reduction to Hessenberg
form. ACM Trans. Math. Softw. 32, 180–194 (2006)
46. Kung, H.: New algorithms and lower bounds for the parallel evaluation of certain rational
expressions and recurrences. J. Assoc. Comput. Mach. 23(2), 252–261 (1976)
47. Calvetti, D., Gallopoulos, E., Reichel, L.: Incomplete partial fractions for parallel evaluation
of rational matrix functions. J. Comput. Appl. Math. 59, 349–380 (1995)
48. Henrici, P.: An algorithm for the incomplete partial fraction decomposition of a rational
function into partial fractions. Z. Angew. Math. Phys. 22, 751–755 (1971)
49. Graham, R., Knuth, D., Patashnik, O.: Concrete Mathematics. Addison-Wesley, Reading
(1989)
50. Reichel, L.: The ordering of tridiagonal matrices in the cyclic reduction method for Poisson’s
equation. Numer. Math. 56(2/3), 215–228 (1989)
51. Butcher, J., Chartier, P.: Parallel general linear methods for stiff ordinary dif-
ferential and differential algebraic equations. Appl. Numer. Math. 17(3), 213–222
(1995). doi:10.1016/0168-9274(95)00029-T. http://www.sciencedirect.com/science/article/
pii/016892749500029T. Special Issue on Numerical Methods for Ordinary Differential Equa-
tions
52. Chartier, P., Philippe, B.: A parallel shooting technique for solving dissipative ODE’s. Com-
puting 51(3–4), 209–236 (1993). doi:10.1007/BF02238534
53. Chartier, P.: L-stable parallel one-block methods for ordinary differential equations. SIAM J.
Numer. Anal. 31(2), 552–571 (1994). doi:10.1137/0731030
54. Gander, M., Vandewalle, S.: Analysis of the parareal time-parallel time-integration method.
SIAM J. Sci. Comput. 29(2), 556–578 (2007)
55. Maday, Y., Turinici, G.: A parareal in time procedure for the control of partial differential
equation. C. R. Math. Acad. Sci. Paris 335(4), 387–392 (2002)
56. Baldwin, C., Freund, R., Gallopoulos, E.: A parallel iterative method for exponential propa-
gation. In: D. Bailey, R. Schreiber, J. Gilbert, M. Mascagni, H. Simon, V. Torczon, L. Watson
(eds.) Proceedings of Seventh SIAM Conference on Parallel Processing for Scientific Com-
puting, pp. 534–539. SIAM, Philadelphia (1995). Also CSRD Report No. 1380
57. Gallopoulos, E., Saad, Y.: Efficient solution of parabolic equations by Krylov approximation
methods. SIAM J. Sci. Stat. Comput. 13(5), 1236–1264 (1992)
58. Gallopoulos, E., Saad, Y.: On the parallel solution of parabolic equations. In: Proceedings of
the 1989 International Conference on Supercomputing, pp. 17–28. Herakleion, Greece (1989)
59. Gallopoulos, E., Saad, Y.: Efficient parallel solution of parabolic equations: implicit methods
on the Cedar multicluster. In: J. Dongarra, P. Messina, D.C. Sorensen, R.G. Voigt (eds.)
Proceedings of Fourth SIAM Conference Parallel Processing for Scientific Computing, pp.
251–256. SIAM, (1990) Chicago, December 1989
60. Sidje, R.: Algorithmes parallèles pour le calcul des exponentielles de matrices de grandes
tailles. Ph.D. thesis, Université de Rennes I (1994)
61. Lopez, L., Simoncini, V.: Analysis of projection methods for rational function approximation
to the matrix exponential. SIAM J. Numer. Anal. 44(2), 613–635 (2006)
62. Popolizio, M., Simoncini, V.: Acceleration techniques for approximating the matrix exponen-
tial operator. SIAM J. Matrix Anal. Appl. 30, 657–683 (2008)
63. Frommer, A., Simoncini, V.: Matrix functions. In: Schilders, W., van der Vorst, H.A., Rommes,
J. (eds.) Model Order Reduction: Theory, Research Aspects and Applications, pp. 275–303.
Springer, Berlin (2008)
64. van den Eshof, J., Hochbruck, M.: Preconditioning Lanczos approximations to the matrix
exponential. SIAM J. Sci. Comput. 27, 1438–1457 (2006)
65. Gu, C., Zheng, L.: Computation of matrix functions with deflated restarting. J. Comput. Appl.
Math. 237(1), 223–233 (2013). doi:10.1016/j.cam.2012.07.020. http://www.sciencedirect.
com/science/article/pii/S037704271200310X
66. Varga, R.S.: On higher order stable implicit methods for solving parabolic partial differential
equations. J. Math. Phys. 40, 220–231 (1961)
67. Baker Jr, G., Graves-Morris, P.: Padé Approximants. Part I: Basic Theory. Addison Wesley,
Reading (1991)
68. Carpenter, A.J., Ruttan, A., Varga, R.S.: Extended numerical computations on the 1/9 conjec-
ture in rational approximation theory. In: Graves-Morris, P.R., Saff, E.B., Varga, R.S. (eds.)
Rational Approximation and Interpolation. Lecture Notes in Mathematics, vol. 1105, pp.
383–411. Springer, Berlin (1984)
69. Cody, W.J., Meinardus, G., Varga, R.S.: Chebyshev rational approximations to e−x in [0, +∞)
and applications to heat-conduction problems. J. Approx. Theory 2(1), 50–65 (1969)
70. Cavendish, J.C., Culham, W.E., Varga, R.S.: A comparison of Crank-Nicolson and Chebyshev
rational methods for numerically solving linear parabolic equations. J. Comput. Phys. 10,
354–368 (1972)
71. Freund, R.W.: Conjugate gradient-type methods for linear systems with complex symmetric
coefficient matrices. SIAM J. Sci. Stat. Comput. 13(1), 425–448 (1992)
72. Axelsson, O., Kucherov, A.: Real valued iterative methods for solving complex symmetric
linear systems. Numer. Linear Algebra Appl. 7(4), 197–218 (2000). doi:10.1002/1099-1506(200005)7:4<197::AID-NLA194>3.0.CO;2-S
73. Howle, V., Vavasis, S.: An iterative method for solving complex-symmetric systems arising
in electrical power modeling. SIAM J. Matrix Anal. Appl. 26, 1150–1178 (2005)
74. Datta, B.N., Saad, Y.: Arnoldi methods for large Sylvester-like observer matrix equations,
and an associated algorithm for partial spectrum assignment. Linear Algebra Appl. 154–156,
225–244 (1991)
75. Gear, C.W., Saad, Y.: Iterative solution of linear equations in ODE codes. SIAM J. Sci. Stat.
Comput. 4, 583–601 (1983)
76. Parlett, B.N.: The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs (1980)
77. Simoncini, V.: Restarted full orthogonalization method for shifted linear systems. BIT Numer.
Math. 43(2), 459–466 (2003)
78. Freund, R.: Solution of shifted linear systems by quasi-minimal residual iterations. In: Reichel,
L., Ruttan, A., Varga, R. (eds.) Numerical Linear Algebra, pp. 101–121. W. de Gruyter, Berlin
(1993)
79. Frommer, A.: BiCGStab(ℓ) for families of shifted linear systems. Computing 70(2), 87–109
(2003)
80. Druskin, V., Knizhnerman, L.: Two polynomial methods of calculating matrix functions of
symmetric matrices. U.S.S.R. Comput. Math. Math. Phys. 29, 112–121 (1989)
81. Friesner, R.A., Tuckerman, L.S., Dornblaser, B.C., Russo, T.V.: A method for exponential
propagation of large systems of stiff nonlinear differential equations. J. Sci. Comput. 4(4),
327–354 (1989)
82. Saad, Y.: Analysis of some Krylov subspace approximations to the matrix exponential oper-
ator. SIAM J. Numer. Anal. 29, 209–228 (1992)
83. Güttel, S.: Rational Krylov methods for operator functions. Ph.D. thesis, Technische Universität Bergakademie Freiberg (2010)
84. Simoncini, V., Gallopoulos, E.: A hybrid block GMRES method for nonsymmetric systems
with multiple right-hand sides. J. Comput. Appl. Math. 66, 457–469 (1996)
85. Darnell, D., Morgan, R.B., Wilcox, W.: Deflated GMRES for systems with multiple shifts
and multiple right-hand sides. Linear Algebra Appl. 429, 2415–2434 (2008)
86. Soodhalter, K., Szyld, D., Xue, F.: Krylov subspace recycling for sequences of shifted linear
systems. Appl. Numer. Math. 81, 105–118 (2014)
87. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
88. MUMPS: A parallel sparse direct solver. http://graal.ens-lyon.fr/MUMPS/
89. SuperLU (Supernodal LU). http://crd-legacy.lbl.gov/~xiaoye/SuperLU/
90. Kamgnia, E., Nguenang, L.B.: Some efficient methods for computing the determinant of large
sparse matrices. ARIMA J. 17, 73–92 (2014). http://www.inria.fr/arima/
91. Polizzi, E., Sameh, A.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel
Comput. 32, 177–194 (2006)
92. Bertrand, O., Philippe, B.: Counting the eigenvalues surrounded by a closed curve. Sib. J. Ind.
Math. 4, 73–94 (2001)
93. Kamgnia, E., Philippe, B.: Counting eigenvalues in domains of the complex field. Electron.
Trans. Numer. Anal. 40, 1–16 (2013)
94. Rudin, W.: Real and Complex Analysis. McGraw Hill, New York (1970)
95. Silverman, R.A.: Introductory Complex Analysis. Dover Publications, Inc., New York (1972)
96. Bindel, D.: Bounds and error estimates for nonlinear eigenvalue problems. Berkeley Applied
Math Seminar (2008). http://www.cims.nyu.edu/~dbindel/present/berkeley-oct08.pdf
97. Maeda, Y., Futamura, Y., Sakurai, T.: Stochastic estimation method of eigenvalue density for
nonlinear eigenvalue problem on the complex plane. J. SIAM Lett. 3, 61–64 (2011)
98. Bai, Z., Golub, G.H.: Bounds for the trace of the inverse and the determinant of symmetric
positive definite matrices. Ann. Numer. Math. 4, 29–38 (1997)
99. Duff, I., Erisman, A., Reid, J.: Direct Methods for Sparse Matrices. Oxford University Press
Inc., New York (1989)
100. Golub, G.H., Meurant, G.: Matrices, Moments and Quadrature with Applications. Princeton
University Press, Princeton (2009)
Chapter 13
Computing the Matrix Pseudospectrum
Fig. 13.1 Illustrations of pseudospectra for matrix grcar of order n = 50. The left frame was computed using function ps from the Matrix Computation Toolbox that is based on relation (13.1) and shows the eigenvalues of matrices A + E_j for random perturbations E_j ∈ C^{50×50}, j = 1, . . . , 10, where ‖E_j‖ ≤ 10^{−3}. The frame on the right was computed using the EigTool package and is based on relation (13.2); it shows the level curves defined by {z : s(z) ≤ ε} for ε = 10^{−1} down to 10^{−10}
For brevity, we also write s(z) to denote σmin (A−z I ). Based on this characterization,
for any ε > 0, we can classify any point z to be interior to Λε (A) if s(z) < ε or to
be exterior when s(z) > ε. By convention, a point of ∂Λε (A) is regarded as an interior point.
The third characterization is based on the resolvent of A:
GRID is simple to implement and offers parallelism at several levels: large grain, since the singular value computations at each point are independent, as well as medium and fine grain parallelism when computing each s(z_k). The sequential cost is typically modeled by
$$|\Omega_h|\, C_{\sigma_{\min}}, \qquad(13.4)$$
where |Ω_h| denotes the number of nodes of Ω_h and C_{σ_min} is the average cost for computing s(z). The total sequential cost becomes rapidly prohibitive as the size of
computing s(z). The total sequential cost becomes rapidly prohibitive as the size of
A and/or the number of nodes increase. Given that the cost of computing s(z) is at
least O(n 2 ) and even O(n 3 ) for dense matrix methods, and that a typical mesh could
easily contain O(104 ) points, the cost can be very high even for matrices of moderate
size. For large enough |Ωh | relative to the number of processors p, GRID can be
implemented with almost perfect speedup by simple static assignment of |Ωh |/ p
computations of s(z) per processor.
Load balancing also becomes an issue when the cost of computing s(z) varies
a lot across the domain. This could happen when iterative methods, such as those
described in Chap. 11, Sects. 11.6.4 and 11.6.5, are used to compute s(z).
Another obstacle is that the smallest singular value of a given matrix is usually the hardest to compute (much more so than computing the largest); moreover, this computation has to be repeated for as many points as necessary, depending on the resolution of Ω. Not surprisingly, therefore, as the size of the matrix and/or the number of mesh points increase, the cost of the straightforward algorithm above also becomes prohibitive. Advances over GRID for computing pseudospectra are based on some type of dimensionality reduction on the domain or on the matrix that lowers the cost of the factors in the cost model (13.4). In order to handle large matrices, it is necessary to construct algorithms that combine or blend the two approaches.
Despite the potential for near perfect speedup on a parallel system, Algorithm 13.1
(GRID) entails much redundant computation. Specifically, σmin is computed over
different matrices, but all of them are shifts of one and the same matrix. In a sequen-
tial algorithm, it is preferable to first apply a unitary similarity transformation, say
Q ∗ AQ = T , that reduces the matrix to complex upper triangular (via Schur factor-
ization) or upper Hessenberg form. In both cases, the singular values remain intact,
thus σmin (A − z I ) = σmin (T − z I ). The gains from this preprocessing are substan-
tial in the sequential case: the total cost lowers from |Ωh |O(n 3 ) to that of the initial
reduction, which is O(n 3 ), plus |Ωh |O(n 2 ). The cost of the reduction is amortized
as the number of gridpoints increases.
We illustrate these ideas in Algorithm 13.2 (GRID_fact ) which reduces A to
Hessenberg or triangular form and then approximates each s(z k ) from the (square
root of the) smallest eigenvalue of (z k I − T )∗ (z k I − T ). This is done using inverse
Lanczos iteration, exploiting the fact that the cost at each iteration is quadratic since
the systems to be solved at each step are Hessenberg or triangular.
For a parallel implementation, we need to decide upon the following:
1. How to evaluate the preprocessing steps (lines 2–5 of Algorithm 13.2). On a parallel system, the reduction step can be replicated on one or more processors and the reduced matrix then distributed to all processors participating in computing the values σ_min(T − z_j I). The overall cost will be that for the preprocessing plus (|Ω_h|/p) O(n^2) when p ≤ |Ω_h|. The cost of the reduction is amortized as the number of gridpoints increases.
Algorithm 13.2 GRID_fact: computes Λε (A) using Def. 13.2 and factorization.
Input: A ∈ Rn×n , l ≥ 1 positive values in decreasing order E = {ε1 , ε2 , ..., εl }, logical variable
schur set to 1 to apply Schur factorization.
Output: plot of ∂Λε (A) for all ε ∈ E
1: Ωh as in line 1 of Algorithm 13.1.
2: T = hess(A) //reduce A to Hessenberg form
3: if schur == 1 then
4:    T = schur(T, 'complex') // compute the complex Schur factor T
5: end if
6: doall z_k ∈ Ω_h
7:    compute s(z_k) = sqrt(λ_min((z_k I − T)^*(z_k I − T))) // using inverse Lanczos iteration and exploiting the structure of T
8: end
9: Plot the l contours of s(z k ) for all values of E .
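A compact Python sketch of GRID_fact (our own; dense SciPy kernels stand in for the parallel building blocks, and a fixed number of inverse iterations replaces a properly monitored inverse Lanczos iteration):

import numpy as np
from scipy.linalg import schur, solve_triangular

def sigma_min_triangular(T, z, iters=20, rng=np.random.default_rng(0)):
    """Approximate s(z) = sigma_min(zI - T) for upper triangular T by inverse
    iteration on (zI - T)^*(zI - T); each step costs two O(n^2) triangular solves."""
    n = T.shape[0]
    M = z * np.eye(n) - T
    v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    for _ in range(iters):
        v = solve_triangular(M, solve_triangular(M, v, trans='C'))
        v /= np.linalg.norm(v)
    return np.linalg.norm(M @ v)

def grid_fact(A, xs, ys):
    """Sketch of GRID_fact: one Schur reduction, then independent ("doall")
    evaluations of s(z) over the grid; contours of the result give the boundaries."""
    T, _ = schur(A.astype(complex), output='complex')   # zI - T stays upper triangular
    return np.array([[sigma_min_triangular(T, x + 1j * y) for x in xs] for y in ys])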
We next consider an approach whose goal is to reduce the number of gridpoints where s(z) needs to be evaluated. It turns out that it is possible to rapidly “thin the grid” by stripping disk-shaped regions from Ω. Recall that for any given ε and gridpoint z, algorithm GRID uses the value of s(z) to classify z as lying outside or inside Λε (A). In that sense, GRID makes pointwise use of the information it computes at each z. A much more effective idea is based on the fact that at any point where the minimum singular value s(z) is evaluated, given any value of ε, either the point lies in the pseudospectrum or on its boundary, or a disk can be constructed whose points are guaranteed to be exterior to Λε (A). The following result is key.
Proposition 13.2 ([2]) If s(z) = r > ε then
$$D^{\circ}(z, r - \varepsilon)\cap\Lambda_\varepsilon(A) = \emptyset,$$
where D°(z, ρ) denotes the open disk centered at z with radius ρ.
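A few lines of Python suffice to convey the exclusion mechanism (our own sketch; the sweep order is left implicit and s stands for any σ_min kernel, e.g. the sigma_min_triangular sketch above):

def mog(points, s, eps):
    """Exclusion idea of Proposition 13.2: whenever s(z) = r > eps, every gridpoint
    inside the open disk D(z, r - eps) is exterior to the eps-pseudospectrum and is
    discarded without computing its own s value."""
    status = {}                                  # z -> s(z) for points actually evaluated
    active = list(points)
    while active:
        z = active.pop()
        r = s(z)                                 # the expensive minimum-singular-value call
        status[z] = r
        if r > eps:                              # exclude the whole disk of radius r - eps
            active = [w for w in active if abs(w - z) >= r - eps]
    return status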
Figure 13.2 (from [4]) illustrates the application of MOG for matrix triangle (a Toeplitz matrix with symbol $z^{-1} + \frac{1}{4}z^2$ from [5]) of order 32. MOG required only 676 computations of s(z) to approximate Λε (A) for ε = 10^{−1}, compared to the 2500 needed for GRID.
Unlike GRID, the preferred sequence of exclusions in MOG cannot be determined beforehand but depends on the order in which the gridpoints are swept, since Ω_h is modified in the course of the computation. This renders a parallel implementation more challenging. Consider for instance a static allocation policy of gridpoints, e.g. block partitioning. Then a single computation within one processor might exclude many points allocated to this and possibly other processors. In other words, if s(z) is computed, then it is likely that the values s(z + Δz) at nearby points will not be needed. This runs counter to the spatial locality of reference principle in computer science and can cause load imbalance.
To handle this problem, it was proposed in [4] to create a structure accessible to
all processors that contains information about the status of each gridpoint. Points are
classified in three categories: active, if s(z) still needs to be computed; inactive, if they have been excluded; and fixed, if s(z) ≤ ε. This structure must be updated as
the computation proceeds. The pseudospectrum is available when there are no more
active points.
If the objective is to compute Λε (A) for ε_1 > · · · > ε_s, we can apply MOG to estimate the enclosing Λ_{ε_1}(A) followed by GRID for all remaining points. An alternative is to take advantage of the fact that each evaluation of s(z) yields the radii s(z) − ε_1 < · · · < s(z) − ε_s. Discounting negative and zero values, these define a sequence of concentric disks that do not contain any point of the pseudospectra for the corresponding values of ε. For example, the disk centered at z with radius σ_min(A − zI) − ε_j has no intersection with Λ_{ε_j}(A). This requires careful bookkeeping, but its parallel implementation would greatly reduce the need to use GRID.
One approach that has the potential for dramatic dimensionality reduction on the
domain is to trace the individual pseudospectral boundaries using some type of path
following. We first review the original path following idea for pseudospectra and
then examine three methods that enable parallelization. Interestingly, parallelization
also helps to improve the numerical robustness of path following.
To trace the curve, we can use predictor-corrector techniques that require differential information from the curve, specifically tangents and normals. The following important result shows that such differential information at any point z is available if, together with σ_min, one computes the corresponding singular vectors (that is, the minimum singular triplet {σ_min, u_min, v_min}). It is useful here to note that Λε (A) can be defined implicitly by the equation
$$g(x, y) \equiv \sigma_{\min}((x + iy)I - A) = \varepsilon \qquad(13.5)$$
(here we identify the complex plane C with R^2). The key result is the following, which
is a generalization of Theorem 11.14 to the complex plane [6]:
Theorem 13.1 Let z = x + i y ∈ C \ Λ(A). Then g(x, y) is real analytic in a
neighborhood of (x, y), if σmin ((x + i y)I − A) is a simple singular value. The
gradient of g(x, y) is equal to
$$\nabla g(x, y) = \big(\Re(v_{\min}^* u_{\min}),\ \Im(v_{\min}^* u_{\min})\big) = v_{\min}^* u_{\min},$$
where u min and vmin denote the left and right singular vectors corresponding to σmin .
From the value of the gradient at any point of the curve, one can make a small predic-
tion step along the tangent followed by a correction step to return to a subsequent point
on the curve. Initially, the method needs to locate one point of each non-intersecting
boundary ∂Λε (A). The path following approach was proposed in [7]. We denote it
by PF and list it as Algorithm 13.4, incorporating the initial reduction to Hessenberg
form.
Algorithm 13.4 PF: computing ∂Λε (A) by predictor-corrector path following [7].
Input: A ∈ Rn×n , value ε > 0.
Output: plot of the ∂Λε (A)
1: Transform A to upper Hessenberg form and set k = 0.
2: Find initial point z 0 ∈ ∂Λε (A)
3: repeat
4: Determine rk ∈ C, |rk | = 1, steplength τk and set z̃ k = z k−1 + τk rk .
5: Correct along dk ∈ C, |dk | = 1 by setting z k = z̃ k + θk dk where θk is some steplength.
6: until termination
Fig. 13.4 Cobra: position of the pivot (z_{k−1}^{piv}), initial prediction (z̃_k), support (z_k^{sup}), first order predictors (ζ_{j,0}) and corrected points (z_k^j). (A proper scale would show that h ≪ H)
It lends itself to parallel processing while being more robust than the original path following. The name Cobra is meant to evoke that snake’s spread neck.
Iteration k is illustrated in Fig. 13.4. Upon entering the repeat loop in line 4, it is assumed that there is a point, z_{k−1}^{piv}, available that is positioned close to the curve being traced. The loop consists of three steps. In the first prediction-correction step (lines 4–5), just like Algorithm PF, a support point, z_k^{sup}, is computed first using a small stepsize h. In the second prediction-correction step (lines 7–8), z_{k−1}^{piv} and z_k^{sup} determine a prediction direction d_k, and m equidistant points, ζ_{i,0} ∈ d_k, i = 1, . . . , m, are selected in the direction of d_k. Then h = |z̃_k − z_{k−1}^{piv}| is the stepsize of the method; note that we can interpret H = |z_{k−1}^{piv} − ζ_{m,0}| as the length of the “neck” of the cobra. Then each ζ_{i,0} ∈ d_k, i = 1, . . . , m, is corrected to obtain z_k^i ∈ ∂Λε (A), i = 1, . . . , m. This is implemented using only one step of Newton iteration on σ_min(A − zI) − ε = 0. All corrections are independent and can be performed in parallel. In the third step (line 10), the next pivot, z_k^{piv}, is selected using some suitable criterion.
The correction phase is implemented with Newton iteration along the direction of
steepest ascent dk = ∇g(x̃k , ỹk ), where x̃k , ỹk are the real and imaginary parts of z̃ k .
Theorem 13.1 provides the formula for the computation of the gradient of s(z) − ε.
The various parameters needed to implement the main steps of the procedure can be
found in [7].
In the correction phase of this and other path following schemes, Newton’s method
is applied to solve the nonlinear equation g(x, y) = ε for some ε > 0 (cf. (13.5)).
Therefore, we need to compute ∇g(x, y). From Theorem 13.1, this costs only 1 DOT
operation.
Cobra offers large-grain parallelism in its second step. The number of gridpoints on the cobra “neck” determines the amount of large-grain parallelism that is entirely due to path following. If the number of points m on the “neck” is equal to or a multiple of the number of processors, then the maximum speedup due to path following of Cobra over PF is expected to be m/2. A large number of processors would favor making m large, and thus the method is well suited for obtaining ∂Λε (A) at very fine resolutions. The “neck” length H, of course, cannot be arbitrarily large, and once there are enough points to adequately represent ∂Λε (A), any further increase brings little additional benefit.
Algorithm 13.5 Cobra: algorithm to compute ∂Λε (A) using parallel path following
[48]. After initialization (line 1), it consists of three stages: i) prediction-correction
(lines 4-5), ii) prediction-correction (lines 7-8), and iii) pivot selection (line 10).
Input: A ∈ Rn×n , value ε > 0, number m of gridpoints for parallel path following.
Output: pseudospectrum boundary ∂Λε (A)
1: Transform A to upper Hessenberg form and set k = 0.
2: Find initial point z 0 ∈ ∂Λε (A)
3: repeat
4: Set k = k + 1 and predict z̃ k .
5:    Correct using Newton and compute z_k^{sup}.
6:    doall j = 1, ..., m
7:       Predict ζ_{j,0}.
8:       Correct using Newton and compute z_k^{j}.
9:    end
10:   Determine next pivot z_k^{piv}.
11: until termination
$$F(T) = R(p(T), \theta), \quad\text{with } \theta = \tfrac{\pi}{3} \text{ if } p(T) \text{ is interior, else } \theta = -\tfrac{\pi}{3}. \qquad(13.6)$$
In Fig. 13.6, four situations are listed, depending on the type of the two points p(T )
and p(F(T )).
Definition 13.3 For any given T ∈ O L we define the F−orbit of T to be the set
O(T ) = {Tn , n ∈ Z} ⊂ O L , where Tn ≡ F n (T ).
(Fig. 13.6: the four situations for consecutive triangles T_i and T_{i+1} = F(T_i), depending on the type of the points p(T_i) and p(T_{i+1}), shown relative to Λ_σ(A), Γ_σ(A) and Υ_σ(A).)
Remark 13.1 When the matrix A and the interior point z_0 are real, and when the inclination θ is zero, the orbit is symmetric with respect to the real axis. The computation of the orbit can be stopped as soon as a new interval is real; the entire orbit is then obtained by symmetry. This halves the computational effort.
Remark 13.2 In Algorithm 13.6, one may stop at Step 19, when the orbit is determined. From the chain of triangles, a polygonal line of exterior points is then obtained which closely follows the polygonal line that would have been built with the full computation. This reduces the number of computations significantly.
To illustrate the result obtained by PAT, consider matrix grcar(100) and two exper-
iments. In the two cases, the initial interior point is automatically determined in the
neighborhood of the point z = 0.8 − 1.5i. The results are reported in Table 13.1 and in Fig. 13.7.
Parallelism in PAT, as in the previous algorithms, arises at two levels: (i) within the kernel computing s(z); (ii) across computations of s(z) at different values z = z_1, z_2, . . .. The main part of the computation is done in the last loop (line 19) of
Algorithm 13.6 that is devoted to the computation of points of the curve ∂Λε (A) from
a list of intervals that intersect the curve. As indicated by the doall instruction, the
Table 13.1 PAT: number of triangles in two orbits for matrix grcar(100)

   ε          τ      N
   10^{−12}   0.05   146
   10^{−6}    0.2    174
iterations are independent. For a parallel version of the method, two processes can
be created: P_1, dedicated to STEP I and STEP II, and P_2, dedicated to STEP III. Process P_1 produces intervals which are sent to P_2. Process P_2 consumes the received intervals, each interval corresponding to an independent task. This approach allows STEP III to start before STEP II terminates.
The parallel version of PAT, termed PPAT, applies this approach (see [10]
for details). To increase the initial production of intervals intersecting the curve,
a technique for starting the orbit simultaneously in several places is implemented
(as noted earlier, Cobra could also benefit from a similar technique). When A is large and sparse, PPAT computes s(z) by means of a Lanczos procedure applied to
$$\begin{pmatrix} O & R^{-1} \\ R^{-*} & O \end{pmatrix},$$
where R is the upper triangular factor of the sparse QR factorization of the matrix (A − zI). The two procedures, namely the QR factorization and the triangular system solves at each iteration, are also parallelized.
PF, Cobra, PAT and PPAT construct ∂Λε (A) for any given ε. It turns out that once
an initial boundary has been constructed, a path following approach can be applied,
yielding an effective way for computing additional boundaries for smaller values of
ε that is rich in large grain parallelism.
The idea is, instead of tracing the curve along tangent directions, to follow the normals in directions along which s(z) decreases. This was implemented as an algorithm called the Pseudospectrum Descent Method (PsDM for short). Essentially, given enough gridpoints on the outermost pseudospectrum curve, the algorithm creates a front that steps along the normals at each point, towards the direction of decrease of s(z), until the next sought inner boundary is reached. The process can be repeated to compute
pseudospectral contours. Figure 13.8 shows the results from the application of PsDM
to matrix kahan(100) (available via the MATLAB gallery; see also [11]). The
plot shows (i) the trajectories of the points undergoing the steepest descent and (ii)
the corresponding level curves. The intersections are the actual points computed by PsDM.
To be more specific, assume that an initial contour ∂Λε (A) is available in the
form of some approximation (e.g. piecewise linear) based on N points z k previously
computed using Cobra (with only slight modifications, the same idea applies if PAT
or PPAT are used to approximate the first contour). In order to obtain a new set of
points that define an inner level curve, we proceed in two steps:
Step 1: Start from z k and compute an intermediate point w̃k by a single modified
Newton step towards a steepest descent direction dk obtained earlier.
Step 2: Correct w̃k to wk using a Newton step along the direction lk of steepest
descent at w̃k .
Figure 13.9 (from [12]) illustrates the basic idea using only one initial point. Applying
one Newton step at z k requires ∇s(z k ). To avoid this extra cost, we apply the same
idea as in PF and Cobra and use the fact that
$$\nabla s(\tilde{x}_k + i\tilde{y}_k) = \big(\Re(g_{\min}^* q_{\min}),\ \Im(g_{\min}^* q_{\min})\big),$$
where g_min and q_min are the corresponding left and right singular vectors. This quantity is available from the original path following procedure. In essence we approximated the gradient based at z_k with the gradient based at z̃_k, so that
$$z_k = \tilde{z}_k - \frac{s(\tilde{x}_k + i\tilde{y}_k) - \varepsilon}{q_{\min}^*\, g_{\min}}, \qquad(13.7)$$
where δ < ε and ∂Λδ (A) is the new pseudospectrum boundary we are seeking. Once w̃_k is available, we perform a second Newton step that yields w_k:
$$w_k = \tilde{w}_k - \frac{s(\tilde{w}_k) - \delta}{u_{\min}^*\, v_{\min}}, \qquad(13.9)$$
where the triplet used is associated with w̃k . These steps can be applied indepen-
dently to all N points. This is one sweep of PsDM; we list it as Algorithm 13.7
(PsDM_sweep).
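The following Python sketch (ours) conveys one such sweep; the predictor of Step 1 is taken here to be a Newton-like step with the stored gradient, and a dense SVD stands in for the iterative triplet computation that would be used for large matrices.

import numpy as np

def min_triplet(M):
    """Minimum singular triplet (sigma, u, v) of M via a dense SVD."""
    U, S, Vh = np.linalg.svd(M)
    return S[-1], U[:, -1], Vh[-1, :].conj()

def psdm_sweep(T, zs, grads, eps, delta):
    """One PsDM sweep: move N points from the eps-contour to the delta-contour
    (delta < eps). Step 1 reuses the gradients already available at the old points;
    Step 2 is a single Newton correction with a freshly computed triplet, whose
    gradient is returned for the next sweep."""
    n = T.shape[0]
    new_zs, new_grads = [], []
    for z, g in zip(zs, grads):                 # doall: the points are independent
        w_tilde = z - (eps - delta) / g         # predictor towards the inner contour
        s, u, v = min_triplet(w_tilde * np.eye(n) - T)
        grad = np.vdot(u, v)                    # u^* v, cf. Eq. (13.9)
        new_zs.append(w_tilde - (s - delta) / grad)
        new_grads.append(grad)
    return new_zs, new_grads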
Assume now that the new points computed after PsDM_sweep define satisfactory
approximations of a nearby contour ∂Λδ (A), where δ < ε. We can continue in this
manner to approximate boundaries of the pseudospectrum nested in ∂Λδ (A). As
noted earlier, the application of PsDM_sweep uses ∇s(x̃k + i ỹk ), i.e. the triplet
13.2 Dimensionality Reduction on the Domain: Methods Based on Path Following 457
at z̃ k that was available from the original computation of ∂Λε (A) with Cobra or
PF. Observe now that as the sweep proceeds to compute ∂Λδ̃ (A) from ∂Λδ (A) for
δ̃ < δ, it also computes the corresponding minimum singular triplet at w̃k . Therefore
enough derivative information is available for PsDM_sweep to be reapplied with
starting points computed via the previous application of PsDM_sweep and it is not
necessary to run PF again. Using this idea repeatedly we obtain the PsDM method
listed as Algorithm 13.8 and illustrated in Fig. 13.10.
Observe that each sweep can be interpreted as a map that takes as input Nin
points approximating Λδi (A) and produces Nout points approximating Λδi+1 (A)
(δi+1 < δi ). In order not to burden the discussion, we restrict ourselves to the case
that Nin = Nout . Nevertheless, this is not an optimal strategy. Specifically, since for a
given collection of ε’s, the corresponding pseudospectral boundaries are nested sets
on the complex plane, when ε > δ > 0, the area of Λδ (A) is likely to be smaller than
the area of Λε (A); similarly, the length of the boundary is likely to change; it would
typically become smaller, unless there is separation and creation of disconnected
components whose total perimeter exceeds that of the original curve, in which case
it might increase.
The cost of computing the intermediate points w̃k , k = 1, . . . , N is small since
the derivatives at z̃ k , k = 1, . . . , N have already been computed by PF or a previous
application of PsDM_sweep. Furthermore, we have assumed that σmin (z k I − A) = ε
for all k, since the points z k approximate ∂Λε (A). On the other hand, computing
the final points wk , k = 1, . . . , N requires N triplet evaluations. Let Cσmin denote
the average cost for computing the triplet. We can then approximate the cost of
PsDM_sweep by T1 = N Cσmin . The target points can be computed independently.
On a system with p processors, we can assign the computation of at most ⌈N/p⌉ target points to each processor; one sweep will then proceed with no need for synchronization and communication and its total cost is approximated by T_p = (N/p) C_{σ_min}.
Some additional characteristics of PsDM that are worth noting (cf. the references in Sect. 13.4) are the following:
• It can be shown that the local error induced by one sweep of PsDM stepping from
one contour to the next is bounded by a multiple of the square of the stepsize of
the sweep. This factor depends on the analytic and geometric characteristics of the
pseudospectrum.
• A good implementation of PsDM must incorporate a scheme to monitor any signif-
icant length reduction (more typical) or increase (if there is separation and creation
of two or more disconnected components) between boundaries as ε changes and
make the number of points computed per boundary by PsDM adapt accordingly.
• PsDM can capture disconnected components of the pseudospectrum lying inside
the initial boundary computed via PF.
As the number of points, N , on each curve is expected to be large, PsDM is much
more scalable than the other path following methods and offers many more oppor-
tunities for parallelism while avoiding the redundancy of GRID. On the other hand,
PsDM complements these methods: We expect that one would first run Cobra, PAT
or PPAT to obtain the first boundary using the limited parallelism of these methods,
and then would apply PsDM that is almost embarrassingly parallel. In particular, each
sweep can be split into as many tasks as the number of points it handles and each
task can proceed independently, most of its work being triplet computations. Fur-
thermore, when the step does not involve adaptation, no communication is necessary
between sweeps.
Regarding load balancing, as in GRID, when iterative methods need to be used, the
cost of computing the minimum singular triplets can vary a lot between points on the
same boundary and not as much between neighboring points. In the absence of any
other a priori information on how the cost of computing the triplets varies between
points, numerical experiments (cf. the references) indicate that a cyclic partitioning
strategy is more effective than block partitioning.
When the matrix is very large, the preprocessing by complete reduction of the matrix into Hessenberg or upper triangular form, recommended in Sect. 13.1.2, is no longer feasible. Instead, we seek some form of dimensionality reduction on the matrix itself that provides a good local approximation of the pseudospectrum at much lower cost.
EigTool obtains the augmented upper Hessenberg matrix H̃m = Hm+1,m and ρ Ritz
values as approximations of the eigenvalues of interest. A grid Ωh is selected over
the Ritz values and algorithm GRID is applied. Denoting by I˜ the identity matrix
Im augmented by a zero row, the computation of σmin ( H̃m − z I˜) at every gridpoint
can be done economically by exploiting the Hessenberg structure. In particular, the
singular values of H̃m − z I˜ are the same as those of the square upper triangular
factor of its “thin Q R” decomposition and these can be computed fast by inverse
iteration or inverse Lanczos iteration. A parallel algorithm for approximating the pseudospectrum locally, as EigTool does, can be formulated by slightly modifying the algorithm in [14, Fig. 2]. We call it LAPSAR and list it as Algorithm 13.9. In addition
to the parallelism made explicit in the loop in lines 3–6, we assume that the call in
line 1 is a parallel implementation of implicitly restarted Arnoldi.
The scheme approximates ρ eigenvalues, where ρ < m (e.g. the EigTool default
is m = 2ρ), using implicitly restarted Arnoldi [15], and then uses the additional
information obtained from this run to compute the pseudospectrum corresponding
to those eigenvalues by using the augmented Hessenberg matrix H̃m .
It is worth noting that at any step m of the Arnoldi process, the following result
holds [16, Lemma 2.1]:
$(A + \Delta)\, W_m = W_m H_m, \quad \text{where} \quad \Delta = -h_{m+1,m}\, w_{m+1} w_m^{*} \qquad (13.10)$
This means that the eigenvalues of Hm are also eigenvalues of a specific perturbation
of A. Therefore, whenever |h_{m+1,m}| is smaller than ε, the eigenvalues of Hm also
lie in the ε-pseudospectrum of A, since they correspond to this specific perturbation.
Of course, one must be careful in pushing this approach further: using the pseudospectrum
of Hm in order to approximate the pseudospectrum of A entails a double approximation.
Even though, when |h_{m+1,m}| < ε, the eigenvalues of Hm (that is, its 0-pseudospectrum)
are contained in Λε(A), this is not necessarily true for Λε(Hm).
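Relation (13.10) is easy to check numerically. The sketch below (our own verification, using a plain modified Gram-Schmidt Arnoldi routine and a real test matrix, so that w_m^* = w_m^T) confirms that H_m satisfies (13.10) exactly, so that its eigenvalues are eigenvalues of the rank-one perturbation A + Δ, whose norm is |h_{m+1,m}|.

import numpy as np

def arnoldi(A, v0, m):
    # modified Gram-Schmidt Arnoldi factorization: A W_m = W_{m+1} H_{m+1,m}
    n = A.shape[0]
    W = np.zeros((n, m + 1)); H = np.zeros((m + 1, m))
    W[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ W[:, j]
        for i in range(j + 1):
            H[i, j] = W[:, i] @ w
            w -= H[i, j] * W[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        W[:, j + 1] = w / H[j + 1, j]
    return W, H

rng = np.random.default_rng(0)
n, m = 200, 20
A = rng.standard_normal((n, n))
W, H = arnoldi(A, rng.standard_normal(n), m)
Wm, wm, wmp1, Hm = W[:, :m], W[:, m - 1], W[:, m], H[:m, :m]
Delta = -H[m, m - 1] * np.outer(wmp1, wm)                 # Delta = -h_{m+1,m} w_{m+1} w_m^T
print(np.linalg.norm((A + Delta) @ Wm - Wm @ Hm))         # ~ machine precision: (13.10) holds
ev_H, ev_P = np.linalg.eigvals(Hm), np.linalg.eigvals(A + Delta)
print(max(min(abs(ev_P - lam)) for lam in ev_H))          # each eigenvalue of H_m is one of A + Delta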
If we write $W_{m+1} = (W_m,\ w_{m+1})$ and $H_{m+1,m} = \begin{pmatrix} H_{m,m} \\ h_{m+1,m}\, e_m^* \end{pmatrix}$, and abbreviate $G_{z,m}(A) = G_z(A, W_{m+1}, W_m^*)$, it can be shown that
$\frac{1}{\sigma_{\min}(\tilde H_m - z \tilde I)} \;\le\; \|G_{z,m}(A)\| \;\le\; \frac{1}{\sigma_{\min}(\tilde H_m - z \tilde I)} + \|G_z(A, W_{m+1}, W_m^*)\, \tilde u\| \qquad (13.11)$

and

$\frac{1}{\sigma_{\min}(\tilde H_m - z \tilde I)} \;\le\; \|G_{z,m}(A)\| \;\le\; \frac{1}{\sigma_{\min}(A - z I)} = R(z).$

Moreover, with $\phi_z = W_m^*(A - zI)^{-1} w_{m+1}$,

$G_{z,m}(A) = \big( (I - h_{m+1,m}\, \phi_z e_m^*)(H_{m,m} - zI)^{-1},\ \phi_z \big). \qquad (13.12)$
Therefore, to compute G z,m (A) for the values of z dictated by the underlying
domain method (e.g. all gridpoints in Ωh for straightforward GRID) we must compute
φz , solve m Hessenberg systems of size m, and then compute the norm of G z,m (A).
The last two steps are straightforward. Obtaining the term (A − zI)^{-1} w_{m+1}, on the
other hand, requires solving one linear system per shift z; moreover, these systems all share the same right-hand side,
wm+1 . From the shift invariance property of Krylov subspaces, that is Kd (A, r ) =
Kd (A − z I, r ), it follows that if a Krylov subspace method, e.g. GMRES, is utilized
then the basis to use to approximate all solutions will be the same and so the Arnoldi
process needs to run only once. In particular, if we denote by Ŵd+1 the orthogonal basis
for the Krylov subspace Kd (A, wm+1 ) and by Fd+1,d the corresponding (d + 1) × d
upper Hessenberg matrix, then
$(A - zI)\,\hat W_d = \hat W_{d+1}\, (F_{d+1,d} - z \tilde I_d),$ where $\tilde I_d = (I_d,\ 0)^*$ denotes $I_d$ augmented by a zero row. We refer to Sect. 9.3.2 of Chap. 9 regarding parallel implementations
of the Arnoldi iteration, and highlight in Algorithm 13.10 one implementation of the key
steps of the parallel transfer function approach, based on GMRES for solving the multiply
shifted systems; we call it TR.
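Before turning to the structure of TR, the following sketch (our own illustration; the arnoldi helper and the test problem are hypothetical) isolates the shift-invariance idea: a single Arnoldi run on K_d(A, w_{m+1}) produces Ŵ_{d+1} and F_{d+1,d}, after which the GMRES approximation of (A − zI)^{-1} w_{m+1} at every gridpoint z reduces to a small (d + 1) × d least-squares problem that requires no further access to A.

import numpy as np

def arnoldi(A, v0, d):
    # modified Gram-Schmidt Arnoldi factorization: A W_d = W_{d+1} F_{d+1,d}
    n = A.shape[0]
    W = np.zeros((n, d + 1), dtype=complex); F = np.zeros((d + 1, d), dtype=complex)
    beta = np.linalg.norm(v0); W[:, 0] = v0 / beta
    for j in range(d):
        w = A @ W[:, j]
        for i in range(j + 1):
            F[i, j] = np.conj(W[:, i]) @ w
            w -= F[i, j] * W[:, i]
        F[j + 1, j] = np.linalg.norm(w)
        W[:, j + 1] = w / F[j + 1, j]
    return W, F, beta

rng = np.random.default_rng(1)
n, d = 300, 60
A = rng.standard_normal((n, n)) / np.sqrt(n)
b = rng.standard_normal(n)                        # plays the role of w_{m+1}
W, F, beta = arnoldi(A, b.astype(complex), d)     # one Arnoldi run, reused for every shift
I_tilde = np.vstack([np.eye(d), np.zeros((1, d))])
e1 = np.zeros(d + 1, dtype=complex); e1[0] = beta
for z in (1.4 + 0.3j, -1.2 + 0.5j):               # two sample gridpoints
    y = np.linalg.lstsq(F - z * I_tilde, e1, rcond=None)[0]   # small GMRES least-squares problem
    x = W[:, :d] @ y                              # approximates (A - zI)^{-1} w_{m+1}
    print(z, np.linalg.norm((A - z * np.eye(n)) @ x - b))     # residual of the shifted system

After the basis has been built, every gridpoint involves only the small matrices F and Ĩ_d together with products against the stored basis, which is why the per-gridpoint part of TR consists of dense computations with limited communication.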
TR consists of two parts, the first of which applies Arnoldi processes to obtain
orthogonal bases for the Krylov subspaces used for the transfer function and for the
solution of the size n system. One advantage of the method is that after the first
part is completed and the basis matrices are computed, we only need to repeat the
second part for every gridpoint. TR is organized around a row-wise data partition:
each processor is assigned a block of rows of A and of the orthogonal bases. At each
step, the vector that is to be orthogonalized to become the new element of the basis Wm
submatrices $W_m^{(i)}$, $\hat W_d^{(i)}$ and on the column dimension of Y and could be decided at
runtime.
Another observation is that the part of Algorithm 13.10 after line 7 involves only
dense matrix computations and limited communication. Remember also that at the
end of the first part, each processor has a copy of the upper Hessenberg matrices.
Hence, we can use BLAS3 to exploit the memory hierarchy available in each proces-
sor. Furthermore, if M ≥ p, the workload is expected to be evenly divided amongst
processors.
Observe that TR can also be applied in stages, to approximate the pseudospectrum
for subsets of Ωh at a time. This might be preferable when the size of the matrix
and number of gridpoints is very large, in which case the direct application of TR
to the entire grid becomes very demanding in memory and communication. Another
possibility is to combine the transfer function approach with one of the methods
we described earlier that utilizes dimensionality reduction on the domain. For very
large problems, it is possible that the basis Ŵd+1 cannot be made large enough to
provide an adequate approximation for (A − z I )−1 wm+1 at one or more values of z.
Therefore, restarting becomes necessary, using as starting vectors the residuals corresponding
to each shift (excluding those systems that have already been solved to the desired
tolerance). Unfortunately, the residuals are in general different, and the shift invariance
property is no longer readily applicable. Cures for this problem have been proposed both
in the context of GMRES [17] and of FOM (the Full Orthogonalization Method) [18].
Another possibility is to use short-recurrence methods, e.g. QMR [19] or BiCGStab [20].
If FOM is used, for example, it was shown that the residuals of all the shifted systems
are collinear; cf. [21].
13.4 Notes
The seminal reference on pseudospectra and their computation is [1]. The initial
reduction of A to Schur or Hessenberg form prior to computing its pseudospectra
and the idea of continuation were proposed in [22]. A first version of a Schur-based
GRID algorithm written in MATLAB was proposed in [23]. The EigTool package
was initially developed in [24]. The queue approach for the GRID algorithm and its
variants was proposed in [25] and further extended in [4], which also showed several
examples where imbalance could result from simple static work allocation. The idea
of inclusion-exclusion and the MOG algorithm (Algorithm 13.3) were first presented in [2]. The
parallel Modified Grid Method was described in [4] and experiments on a cluster of
single-processor PCs over Ethernet running the Cornell Multitasking Toolbox [26]
demonstrated the potential of this parallelization approach. MOG was extended for
matrix polynomials in [27], also using the idea of “inclusion disks”. These, combined
with the concentric exclusion disks we mentioned earlier, provide the possibility for
a parallel MOG-like algorithm that constructs the pseudospectrum for several values
of ε. The path following techniques for the pseudospectrum originate from the early
work in [7], which demonstrated impressive savings over GRID. Cobra was
devised and proposed in [8]. Advancing by triangles was proposed in [9, 10]. The
Pseudospectrum Descent Method (PsDM) was described in [12]. The programs were
written in Fortran 90 with the MPI library and tested on an 8-processor SGI Origin
200 system. The method shares much with an original idea described in [28] for the
independent computation of eigenvalues.
The transfer function approach and its parallel implementation were described in
[16, 29]. Finally, parallel software tools for constructing pseudospectra based on the
algorithms described in this chapter can be found in [30, 31].
References
1. Trefethen, L., Embree, M.: Spectra and Pseudospectra. Princeton University Press, Princeton
(2005)
2. Koutis, I., Gallopoulos, E.: Exclusion regions and fast estimation of pseudospectra (2000).
Submitted for publication (ETNA)
3. Braconnier, T., McCoy, R., Toumazou, V.: Using the field of values for pseudospectra genera-
tion. Technical Report TR/PA/97/28, CERFACS, Toulouse (1997)
4. Bekas, C., Kokiopoulou, E., Koutis, I., Gallopoulos, E.: Towards the effective parallel compu-
tation of matrix pseudospectra. In: Proceedings of the 15th ACM International Conference on
Supercomputing (ICS’01), pp. 260–269. Sorrento (2001)
5. Reichel, L., Trefethen, L.N.: Eigenvalues and pseudo-eigenvalues of Toeplitz matrices. Linear
Algebra Appl. 162–164, 153–185 (1992)
6. Sun, J.: Eigenvalues and eigenvectors of a matrix dependent on several parameters. J. Comput.
Math. 3(4), 351–364 (1985)
7. Brühl, M.: A curve tracing algorithm for computing the pseudospectrum. BIT 36(3), 441–454
(1996)
8. Bekas, C., Gallopoulos, E.: Cobra: parallel path following for computing the matrix
pseudospectrum. Parallel Comput. 27(14), 1879–1896 (2001)
9. Mezher, D., Philippe, B.: PAT—a reliable path following algorithm. Numer. Algorithms 29(1),
131–152 (2002)
10. Mezher, D., Philippe, B.: Parallel computation of the pseudospectrum of large matrices. Parallel
Comput. 28(2), 199–221 (2002)
11. Higham, N.: The Matrix Computation Toolbox. Technical Report, Manchester Centre for Com-
putational Mathematics (2002). http://www.ma.man.ac.uk/~higham/mctoolbox
12. Bekas, C., Gallopoulos, E.: Parallel computation of pseudospectra by fast descent. Parallel
Comput. 28(2), 223–242 (2002)
13. Toh, K.C., Trefethen, L.: Calculation of pseudospectra by the Arnoldi iteration. SIAM J. Sci.
Comput. 17(1), 1–15 (1996)
14. Wright, T., Trefethen, L.N.: Large-scale computation of pseudospectra using ARPACK and
Eigs. SIAM J. Sci. Comput. 23(2), 591–605 (2001)
15. Lehoucq, R., Sorensen, D., Yang, C.: ARPACK User’s Guide: Solution of Large-Scale Eigen-
value Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia (1998)
16. Simoncini, V., Gallopoulos, E.: Transfer functions and resolvent norm approximation of large
matrices. Electron. Trans. Numer. Anal. (ETNA) 7, 190–201 (1998). http://etna.mcs.kent.edu/
vol.7.1998/pp190-201.dir/pp190-201.html
17. Frommer, A., Glässner, U.: Restarted GMRES for shifted linear systems. SIAM J. Sci. Comput.
19(1), 15–26 (1998)
18. Simoncini, V.: Restarted full orthogonalization method for shifted linear systems. BIT Numer.
Math. 43(2), 459–466 (2003)
19. Freund, R.: Solution of shifted linear systems by quasi-minimal residual iterations. In: Reichel,
L., Ruttan, A., Varga, R. (eds.) Numerical Linear Algebra, pp. 101–121. W. de Gruyter, Berlin
(1993)
20. Frommer, A.: BiCGStab(ℓ) for families of shifted linear systems. Computing 70(2), 87–109
(2003)
21. Simoncini, V.: Restarted full orthogonalization method for shifted linear systems. BIT Numer.
Math. 43(2), 459–466 (2003)
22. Lui, S.: Computation of pseudospectra with continuation. SIAM J. Sci. Comput. 18(2), 565–573
(1997)
23. Trefethen, L.: Computation of pseudospectra. Acta Numerica 1999, vol. 8, pp. 247–295. Cam-
bridge University Press, Cambridge (1999)
24. Wright, T.: Eigtool: a graphical tool for nonsymmetric eigenproblems (2002). http://web.
comlab.ox.ac.uk/pseudospectra/eigtool. (At the Oxford University Computing Laboratory site)
25. Frayssé, V., Giraud, L., Toumazou, V.: Parallel computation of spectral portraits on the Meiko
CS2. In: Liddel, H., et al. (eds.) LNCS: High-Performance Computing and Networking, vol.
1067, pp. 312–318. Springer, Berlin (1996)
26. Zollweg, J., Verma, A.: The Cornell Multitask Toolbox. http://www.tc.cornell.edu/Services/
Software/CMTM/
27. Fatouros, S., Psarrakos, P.: An improved grid method for the computation of the pseudospectra
of matrix polynomials. Math. Comput. Model. 49, 55–65 (2009)
28. Koutis, I.: Spectrum through pseudospectrum. http://arxiv.org/abs/math.NA/0701368
29. Bekas, C., Kokiopoulou, E., Gallopoulos, E., Simoncini, V.: Parallel computation of
pseudospectra using transfer functions on a MATLAB-MPI cluster platform. In: Recent
Advances in Parallel Virtual Machine and Message Passing Interface, Proceedings of the 9th
European PVM/MPI Users’ Group Meeting. LNCS, vol. 2474. Springer, Berlin (2002)
30. Bekas, C., Kokiopoulou, E., Gallopoulos, E.: The design of a distributed MATLAB-based
environment for computing pseudospectra. Future Gener. Comput. Syst. 21(6), 930–941 (2005)
31. Mezher, D.: A graphical tool for driving the parallel computation of pseudospectra. In: Pro-
ceedings of the 15th ACM International Conference on Supercomputing (ICS’01), pp. 270–276.
Sorrento (2001)
Index
   with partial pivoting, 99
   without pivoting, 99
LU/UL strategy, 104
M
Marching algorithms, 129, 130, 150, 214, 277
Matrix
   banded, 91, 95, 105, 115, 116, 129, 149, 154, 159, 311, 384, 394
   bidiagonal, 177, 251, 272, 284, 307, 308, 393
   circulant, 177, 180, 182–185, 188
   diagonal, 132, 216, 217, 252, 260, 261, 266, 281, 327, 328, 390
   diagonal form, 33, 38, 96, 115, 251, 260, 263, 307, 349, 357, 358, 360, 393, 394, 416, 429
   diagonalizable, 199, 345, 410
   diagonally dominant, 81, 85–87, 99, 100, 104, 115, 119, 125, 127, 133, 134, 137–139, 144, 191, 279, 284, 370
   Hermitian, 249, 343, 344, 431
   Hessenberg, 231, 300–304, 308, 429, 458–462
   indefinite, 349
   irreducible, 40, 128, 142–144, 148, 150, 156, 157, 279, 351
   Jacobian, 238
   Jordan canonical form, 410
   M-matrix, 280, 330
   multiplication, 20, 22, 24, 30, 60, 80, 87, 345, 384, 385, 395, 413, 418, 462
   non-diagonalizable, 410
   nonnegative, 80, 133
   nonsingular, 38, 79, 83, 120, 123, 229, 299, 336, 387, 388
   norm, 439
   orthogonal, 120, 151, 194, 196, 197, 202, 228, 229, 242, 243, 246, 250, 251, 256, 262, 263, 266, 272, 297, 322, 363, 375, 391
   orthonormal, 320, 358, 379
   permutation, 37, 40, 192, 194, 197, 202, 282, 286, 349, 362, 428
   positive definite, 314
   positive semidefinite, 41
   pseudospectrum, 417
   reduction, 356, 393, 394, 458, 463
   Schur form, 443, 444, 463
   skew-symmetric, 173
   spectral radius, 109, 280, 288, 290, 294, 316–318, 322, 323
   spectrum, 109, 249, 267, 268, 304, 306, 308, 318, 322, 345–347, 349, 354, 378, 385, 387, 410
   Spike, 96, 101, 153, 431
   sub-identity, 327, 328
   symmetric indefinite, 392
   symmetric positive definite (SPD), 118, 119, 123–126, 133, 149, 298, 391, 394
   Toeplitz, 62–72, 127, 165, 166, 176–191, 198, 201, 212, 215, 218, 410, 440, 444
   triangular, 40, 49, 53, 57, 58, 60, 62, 64, 66, 80, 81, 107, 130, 134, 142, 147, 148, 152, 166, 189, 231, 237, 240, 243, 244, 259–261, 300, 349, 387, 393, 410, 428, 431
   tridiagonal, 98, 115, 127, 132–136, 141, 143, 144, 148, 150, 152, 154, 156, 157, 159, 197, 198, 200, 201, 203, 204, 207–209, 215, 219, 262–265, 267, 268, 278, 280, 287, 290, 302, 306, 308, 309, 311, 339, 349, 350, 352–354, 356–359, 392, 394
   unitary, 249
   zero, 120, 129, 134, 135
Matrix decomposition, 201–204, 210, 211, 213, 215, 216, 218, 219
   MD-Fourier, 215, 216
Matrix reordering, 37, 38, 42, 129, 311
   minimum-degree, 37
   nested dissection, 37
   reverse Cuthill-McKee, 37–39, 91
   spectral, 38–43, 91
   weighted spectral, 39, 40, 42
Matrix splitting-based paracr, 134, 136
   incomplete, 137
Matrix-free methods, 31
Maximum product on diagonal algorithm, see MPD algorithm
Maximum traversal, 40
MD-Fourier, see Matrix decomposition
Memory
   distributed, 9, 10, 19, 28, 33, 96, 169, 218, 235, 236, 261, 262, 384, 429
   global, 9, 239
   hierarchical, 261
   local, 13, 24, 36, 290, 338, 385
   shared, 9, 218, 219, 261, 262
Minimum residual iteration, 380
MOG, see Grid methods
MPD algorithm, 40
Multiplicative Schwarz algorithm, 329, 332
   as preconditioner, 333
   splitting, 334
MUMPS, see Numerical libraries and environments
N
Neville, see Divided differences
Newton basis, 339
Newton form, 172, 174
Newton Grassmann method, 368
Newton interpolating polynomial, 174
Newton polynomial basis, 307
Newton's method, 238
   iteration, 37, 75, 447, 448
   step, 364, 455
Newton-Arnoldi iteration, 337
Newton-Krylov procedure, 308, 327, 337
Nonlinear systems, 238
Norm
   ∞-norm, 167
   2-norm, 130, 143, 148, 228, 242, 244, 299, 352, 366, 374, 376, 384, 400, 439
   ellipsoidal norm, 314
   Frobenius norm, 37–39, 243, 245, 253, 311
   matrix norm, 439
   vector norm, 314
Numerical libraries and environments
   ARPACK, 355, 459
   EISPACK
      TINVIT, 272
   GotoBLAS, 20
   Harwell Subroutine Library, 349
   LAPACK, 100, 104, 265, 267, 272, 384
      DAXPY, 170
      DGEMV, 232
      DGER, 232
      DSTEMR, 272
      DSYTRD, 264
   MATLAB, 30, 168, 232, 349, 454, 459, 463
      Matrix Computation Toolbox, 440
   MUMPS, 429, 431
   PARDISO solver, 312
   PELLPACK, 218
   ScaLAPACK, 92, 137, 149, 259
      PDGEQRF, 259
      PDSYTRD, 264
   SuperLU, 429, 431
   WSMP, 431
Numerical quadrature, 414
O
Ordering
   Leja, 308, 421, 423
   red-black, 281–283, 285, 286
Orthogonal factorization
   Givens orthogonal, 259
   sparse, 107
P
paracr, 136
Parallel computation, 13, 141, 267, 424
Parallel cyclic reduction, see paracr
Partial fractions
   coefficients, 206, 208, 210, 215, 411, 416, 417, 419–421, 423, 426–428
   expansion, 415, 417–420, 423, 426
   incomplete
      IPF, 419–421, 423, 424, 427
   representation, 205, 206, 208, 210, 213, 215, 414, 415, 420, 424, 425, 427
   solving linear systems, 210
PAT, see Path following methods
Path following methods
   Cobra, 448–450, 454, 455, 457
   PAT, 450, 452–455
   PF, 447–450, 454, 455, 457, 458
   PPAT, 454, 455
PDGESVD, see Singular value decomposition
Perron-Frobenius theorem, 279
PF, see Path following methods
Pivoting strategies
   column, 60, 154, 228, 242, 243, 262, 388
   complete, 60
   diagonal, 132, 134, 154
   incremental, 84
   pairwise, 82–85, 91
   partial, 84, 85, 87, 91, 92, 99, 100, 148, 154, 188, 428
Poisson solver, 219
Polyalgorithm, 149
Polynomial
   Chebyshev, 50, 199, 200, 205, 214, 289, 295, 304, 305, 348, 354, 390, 391, 411
   evaluation, 69
      Horner's rule, 69, 414
      Paterson-Stockmeyer algorithm, 413, 414
Power method, 343, 344, 346
powform algorithm, 170, 171
PPAT, see Path following methods
prefix_opt, see Prefix computation
pr2pw algorithm, 169–172
Preconditioned iterative solver, 111
Preconditioning, 18, 99, 113, 115, 311
   Krylov subspace schemes, 38
   left and right, 122, 154, 336
Prefix
   parallel matrix prefix, 142–144, 146–148
   parallel prefix, 142–144, 146, 148, 175, 176
Prefix computation
   prefix, 174, 175
   prefix_opt, 174, 175
Programming
   Single Instruction Multiple Threads (SIMT), 6
   Multiple Instruction Multiple Data (MIMD), 9–11, 218
   Single Instruction Multiple Data (SIMD), 5, 6, 10, 13, 218, 277
   Single Program Multiple Data organization (SPMD), 11
PsDM, see Pseudospectrum methods
Pseudospectrum computation
   LAPSAR, 459
   PsDM, 454, 456–458
   TR, 461–463
Q
QMR, 426, 463
QR algorithm, 265
   iterations, 197, 265
QR factorization, 60, 147, 154, 227, 228, 231, 237, 242, 243, 246, 262, 265, 272, 302, 303, 399, 454
   by Householder transformations, 242
   incomplete, 399
   solving linear systems, 227
   sparse, 454
   thin, 459
   via Givens rotations, 154
   with column pivoting, 60, 262
Quadratic equation, 155
R
Rapid elliptic solvers, 165, 170, 198, 409, 415
   BCR, 205, 209–211, 219, 411
   CFT, 204, 219
   CORF, 207–211
   EES, 215–217
   FACR, 211, 219
Rayleigh quotient, 259, 363, 366, 371, 397
   generalized, 371, 382
   matrix, 390
Rayleigh quotient iteration, 347
Rayleigh quotient method, 368
Rayleigh-Ritz procedure, 390
rd_pref, see Tridiagonal solver
Recursive doubling, 127
RES, see Rapid elliptic solvers
RODDEC, 237
Row projection algorithms
   block, 311
   Cimmino, 326
   Kaczmarz method
      classical, 326
      symmetrized, 319–321, 323, 325
Rutishauser's
   Chebyshev acceleration, 390
   RITZIT, 390, 391
   subspace iteration, 391
S
SAS decomposition, 165, 195, 196, 362
ScaLAPACK, see Numerical libraries and environments
Scatter and gather procedure, 33
Schur complement, 132, 155, 219, 372
Schur decomposition, 442, 443
Schwarz preconditioning, 336
Sherman-Morrison-Woodbury formula, 127, 154, 183, 190
simpleSolve_PF algorithm, 206
Simultaneous iteration method, 271, 345–348, 369, 371–373, 377, 389, 390
Singular value decomposition (SVD), 259–262, 386, 387, 393, 395
   block Lanczos, 394
   BLSVD, 392, 394
   hybrid scheme, 261
   sparse, 401
SOR, 293, 315, 316, 318, 319, 322
   block symmetric (SSOR), 316
SP_Givens method, 150
Sparse direct solver, 312, 387, 431
Sparse inner product, 32
Sparse matrices
   computations, 17, 33
   fill-in, 31
   graph representation, 31
   reordering, 36
   storage, 31–33
   systems, 18, 37
T
TFQMR, 426
TR, see Pseudospectrum methods
Trace minimization
   TRSVD, 396–398
   TraceMIN, 384, 385
W
Weighted bandwidth reduction, 39
Z
ZEROIN method, 269