
Scientific Computation

Efstratios Gallopoulos
Bernard Philippe
Ahmed H. Sameh

Parallelism
in Matrix
Computations

Parallelism in Matrix Computations

Scientific Computation

Editorial Board

J.-J. Chattot, Davis, CA, USA


P. Colella, Berkeley, CA, USA
R. Glowinski, Houston, TX, USA
M.Y. Hussaini, Tallahassee, FL, USA
P. Joly, Le Chesnay, France
D.I. Meiron, Pasadena, CA, USA
O. Pironneau, Paris, France
A. Quarteroni, Lausanne, Switzerland
and Politecnico of Milan, Milan, Italy
M. Rappaz, Lausanne, Switzerland
R. Rosner, Chicago, IL, USA
P. Sagaut, Paris, France
J.H. Seinfeld, Pasadena, CA, USA
A. Szepessy, Stockholm, Sweden
M.F. Wheeler, Austin, TX, USA

More information about this series at http://www.springer.com/series/718

Efstratios Gallopoulos
Bernard Philippe
Ahmed H. Sameh

Parallelism in Matrix Computations

Efstratios Gallopoulos
Computer Engineering
and Informatics Department
University of Patras
Patras
Greece

Ahmed H. Sameh
Department of Computer Science
Purdue University
West Lafayette, IN
USA

Bernard Philippe
INRIA/IRISA
Campus de Beaulieu
Rennes Cedex
France

ISSN 1434-8322 ISSN 2198-2589 (electronic)


Scientific Computation
ISBN 978-94-017-7187-0 ISBN 978-94-017-7188-7 (eBook)
DOI 10.1007/978-94-017-7188-7

Library of Congress Control Number: 2015941347

Springer Dordrecht Heidelberg New York London


© Springer Science+Business Media Dordrecht 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.

Printed on acid-free paper

Springer Science+Business Media B.V. Dordrecht is part of Springer Science+Business Media


(www.springer.com)

To the memory of Daniel L. Slotnick,
parallel processing pioneer

Preface

Computing instruments were developed to facilitate fast calculations, especially in
computational science and engineering applications. The fact that this is not just a
matter of building a hardware device and its system software was already hinted at
by Charles Babbage, when he wrote in the mid-nineteenth century, As soon as an
Analytical Engine exists, it will necessarily guide the future course of science.
Whenever any result is sought by its aid, the question will then arise—By what
course of calculation can these results be arrived at by the machine in the shortest
time? [1]. This question points to one of the principal challenges for parallel
computing. In fact, in the aforementioned reference, Babbage did consider the
advantage of parallel processing and the perfect speedup that could be obtained
when adding numbers if no carries were generated. He wrote If this could be
accomplished it would render additions and subtractions with numbers having ten,
twenty, fifty or any number of figures as rapid as those operations are with single
figures. He was also well aware of the limitations, in this case the dependencies
caused by the carries. A little more than half a century after Babbage, in 1922, an
extraordinary idea was sketched by Lewis Fry Richardson. In his treatise Weather
Prediction by Numerical Process he described his “forecast-factory” fantasy to
speed up calculations by means of parallel processing performed by humans
[3, Chap. 11, p. 219]. Following the development of the first electronic computer, in
the early 1950s, scientists and engineers proposed that one way to achieve higher
performance was to build a computing platform consisting of many interconnected
von Neumann uniprocessors that can cooperate in handling what were the large
computational problems of that era. This idea appeared simple and natural, and
quickly attracted the attention of university-, government-, and industrial-research
laboratories. Forty years after Richardson’s treatise, the designers of the first
parallel computer prototype ever built introduced their design in 1962 as follows: The
Simultaneous Operation Linked Ordinal Modular Network (SOLOMON), a parallel
network computer, is a new system involving the interconnections and program-
ming, under the supervision of a central control unit, of many identical processing
elements (as few or as many as a given problem requires), in an arrangement that
can simulate directly the problem being solved. It is remarkable how this


introductory paragraph underlines the generality and adaptive character of the
design, despite the fact that neither the prototype nor subsequent designs went as
far. These authors stated further that this architecture shows great promise in aiding
progress in certain critical applications that rely on common mathematical
denominators that are dominated by matrix computations.
Soon after that, the field of parallel processing came into existence starting with
the development of the ILLIAC-IV at the University of Illinois at Urbana-
Champaign led by Daniel L. Slotnick (1931–1985), who was one of the principal
designers of the SOLOMON computer. The design and building of parallel com-
puting platforms, together with developing the underlying system software as well
as the associated numerical libraries, emerged as important research topics. Now,
four decades after the introduction of the ILLIAC-IV, parallel computing resources
ranging from multicore systems (which are found in most modern desktops and
laptops) to massively parallel platforms are within easy reach of most computa-
tional scientists and engineers. In essence, parallel computing has evolved from an
exotic technology to a widely available commodity. Harnessing this power to the
maximum level possible, however, remains the subject of ongoing research efforts.
Massively parallel computing platforms now consist of thousands of nodes
cooperating via sophisticated interconnection networks with several layers of
hierarchical memories. Each node in such platforms is often a multicore architec-
ture. Peak performance of these platforms has reached the petascale level, in
terms of the number of floating point operations completed in one second, and will
soon reach the exascale level. These rapid hardware technological advances,
however, have not been matched by system or application software developments.
Since the late 1960s, different parallel architectures have come and gone in a
relatively short time, resulting in a lack of stable and sustainable parallel software
infrastructure. In fact, present day researchers involved in the design of parallel
algorithms and development of system software for a given parallel architecture
often rediscover work that has been done by others decades earlier. Such lack of
stability in parallel software and algorithm development was pointed out by
George Cybenko and David Kuck as early as 1992 in [2].
Libraries of efficient parallel algorithms and their underlying kernels are needed
for enhancing the realizable performance of various computational science and
engineering (CSE) applications on current multicore and petascale computing
platforms. Developing robust parallel algorithms, together with their theoretical
underpinnings is the focus of this book. More specifically, we focus exclusively on
those algorithms relevant to dense and sparse matrix computations which govern
the performance of many CSE applications. The important role of matrix compu-
tations was recognized in the early days of digital computers. In fact, after the
introduction of the Automatic Computing Engine (ACE), Alan Turing included
solving linear systems and matrix multiplication as two of the computational
challenges for this computing platform. Also, in what must be one of the first
references to sparse and structured matrix computations, he observed that even
though the storage capacities available then could not handle dense linear systems


of order larger than 50, in practice one can handle much larger systems: The
majority of problems have very degenerate matrices and we do not need to store
anything like as much as … since the coefficients in these equations are very
systematic and mostly zero. The computational challenges we face today are cer-
tainly different in scale than those above but they are surprisingly similar in their
dependence on matrix computations and numerical linear algebra. In the early
1980s, during the building of the experimental parallel computing platform “Cedar”, led
by David Kuck at the University of Illinois at Urbana-Champaign, a table was
compiled that identified the common computational bottlenecks of major science
and engineering applications, and the parallel algorithms that need to be designed,
together with their underlying kernels, in order to achieve high performance.
Among the algorithms listed, matrix computations are the most prominent.
A similar list was created by UC Berkeley in 2009. Among Berkeley’s 13 parallel
algorithmic methods that capture patterns of computation and communication,
which are called “dwarfs”, the top two are matrix computation-based. Not only are
matrix computations, and especially sparse matrix computations, essential in
advancing science and engineering disciplines such as computational mechanics,
electromagnetics, nanoelectronics among others, but they are also essential for
manipulation of the large graphs that arise in social networks, sensor networks, data
mining, and machine learning just to list a few. Thus, we conclude that realizing
high performance in dense and sparse matrix computations on parallel computing
platforms is central to many applications and hence justifies our focus.
Our goal in this book is therefore to provide researchers and practitioners with
the basic principles necessary to design efficient parallel algorithms for dense and
sparse matrix computations. In fact, for each fundamental matrix computation
problem such as solving banded linear systems, for example, we present a family of
algorithms. The “optimal” choice of a member of this family will depend on the
linear system and the architecture of the parallel computing platform under con-
sideration. Clearly, however, executing a computation on a parallel platform
requires the combination of many steps ranging from: (i) the search for an “optimal”
parallel algorithm that minimizes the required arithmetic operations, memory ref-
erences and interprocessor communications, to (ii) its implementation on the
underlying platform. The latter step depends on the specific architectural charac-
teristics of the parallel computing platform. Since these architectural characteristics
are still evolving rapidly, we will refrain in this book from exposing fine imple-
mentation details for each parallel algorithm. Rather, we focus on algorithm
robustness and opportunities for parallelism in general. In other words, even though
our approach is geared towards numerically reliable algorithms that lend themselves
to practical implementation on parallel computing platforms that are currently
available, we will also present classes of algorithms that expose the theoretical
limitations of parallelism if one were not constrained by the number of cores/
processors, or the cost of memory references or interprocessor communications.


In summary, this book is intended to be both a research monograph and an
advanced graduate textbook for a course dealing with parallel algorithms in matrix
computations or numerical linear algebra. It is assumed that the reader has general,
but not extensive, knowledge of: numerical linear algebra, parallel architectures,
and parallel programming paradigms. This book consists of four parts for a total of
13 chapters. Part I is an introduction to parallel programming paradigms and
primitives for dense and sparse matrix computations. Part II is devoted to dense
matrix computations such as solving linear systems, linear least squares and alge-
braic eigenvalue problems. Part II also deals with parallel algorithms for special
matrices such as banded, Vandermonde, Toeplitz, and block Toeplitz. Part III deals
with sparse matrix computations: (a) iterative parallel linear system solvers with
emphasis on scalable preconditioners, (b) schemes for obtaining a few of the extreme
or interior eigenpairs of symmetric eigenvalue problems, and (c) schemes for obtaining
a few of the singular triplets. Finally, Part IV discusses parallel algorithms for
computing matrix functions and the matrix pseudospectrum.

Acknowledgments

We wish to thank all of our current and previous collaborators who have been,
directly or indirectly, involved with topics discussed in this book. We thank
especially: Guy-Antoine Atenekeng-Kahou, Costas Bekas, Michael Berry, Olivier
Bertrand, Randy Bramley, Daniela Calvetti, Peter Cappello, Philippe Chartier,
Michel Crouzeix, George Cybenko, Ömer Eğecioğlu, Jocelyne Erhel, Roland
Freund, Kyle Gallivan, Ananth Grama, Joseph Grcar, Elias Houstis, William Jalby,
Vassilis Kalantzis, Emmanuel Kamgnia, Alicia Klinvex, Çetin Koç, Efi
Kokiopoulou, George Kollias, Erricos Kontoghiorghes, Alex Kouris, Ioannis
Koutis, David Kuck, Jacques Lenfant, Murat Manguoğlu, Dani Mezher, Carl
Christian Mikkelsen, Maxim Naumov, Antonio Navarra, Louis Bernard Nguenang,
Nikos Nikoloutsakos, David Padua, Eric Polizzi, Lothar Reichel, Yousef Saad,
Miloud Sadkane, Vivek Sarin, Olaf Schenk, Roger Blaise Sidje, Valeria Simoncini,
Aleksandros Sobczyk, Danny Sorensen, Andreas Stathopoulos, Daniel Szyld,
Maurice Tchuente, Tayfun Tezduyar, John Tsitsiklis, Marian Vajteršic, Panayot
Vassilevski, Ioannis Venetis, Brigitte Vital, Harry Wijshoff, Christos Zaroliagis,
Dimitris Zeimpekis, Zahari Zlatev, and Yao Zhu. Any errors and omissions, of
course, are entirely our responsibility.
In addition we wish to express our gratitude to Yousuff Hussaini who encour-
aged us to have our book published by Springer, to Connie Ermel who typed a
major part of the first draft, to Eugenia-Maria Kontopoulou for her help in preparing
the index, and to our Springer contacts, Kirsten Theunissen and Aldo Rampioni.
Finally, we would like to acknowledge the remarkable contributions of the late
Gene Golub—a mentor and a friend—from whom we learned a lot about matrix
computations. Further, we wish to pay our respect to the memory of our late
collaborators and friends: Theodore Papatheodorou, and John Wisniewski.

Last, but not least, we would like to thank our families, especially our spouses,
Aristoula, Elisabeth and Marilyn, for their patience during the time it took us to
produce this book.

Patras Efstratios Gallopoulos


Rennes Bernard Philippe
West Lafayette Ahmed H. Sameh
January 2015

References

1. Babbage, C.: Passages From the Life of a Philosopher. Longman, Green, Longman, Roberts &
Green, London (1864)
2. Cybenko, G., Kuck, D.: Revolution or Evolution, IEEE Spectrum, 29(9), 39–41 (1992)
3. Richardson, L.F.: Weather Prediction by Numerical Process. Cambridge University Press,
Cambridge (1922). (Reprinted by Dover Publications, 1965)
Contents

Part I Basics

1 Parallel Programming Paradigms . . . . . . . . . . . . . . . . . . . . .... 3


1.1 Computational Models. . . . . . . . . . . . . . . . . . . . . . . . . .... 3
1.1.1 Performance Metrics . . . . . . . . . . . . . . . . . . . . .... 3
1.1.2 Single Instruction Multiple Data Architectures
and Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.3 Multiple Instruction Multiple Data Architectures . . . . . 9
1.1.4 Hierarchical Architectures . . . . . . . . . . . . . . . . . . . . 10
1.2 Principles of Parallel Programming . . . . . . . . . . . . . . . . . . . 11
1.2.1 From Amdahl’s Law to Scalability . . . . . . . . . . . . . . 12
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Fundamental Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.1 Vector Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Higher Level BLAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 Dense Matrix Multiplication . . . . . . . . . . . . . . . . . . . 20
2.2.2 Lowering Complexity via the Strassen Algorithm . . . . 22
2.2.3 Accelerating the Multiplication
of Complex Matrices . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 General Organization for Dense Matrix Factorizations . . . . . . . 25
2.3.1 Fan-Out and Fan-In Versions . . . . . . . . . . . . . . . . . . 25
2.3.2 Parallelism in the Fan-Out Version . . . . . . . . . . . . . . 26
2.3.3 Data Allocation for Distributed Memory. . . . . . . . . . . 28
2.3.4 Block Versions and Numerical Libraries . . . . . . . . . . 29
2.4 Sparse Matrix Computations. . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Sparse Matrix Storage and Matrix-Vector
Multiplication Schemes . . . . . . . . . . . . . . . . . . .... 31
2.4.2 Matrix Reordering Schemes . . . . . . . . . . . . . . . .... 36
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... 43


Part II Dense and Special Matrix Computations

3 Recurrences and Triangular Systems . . . . . . . . . . . . . . . . . . . . . 49


3.1 Definitions and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Linear Recurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Dense Triangular Systems . . . . . . . . . . . . . . . . . . . . 52
3.2.2 Banded Triangular Systems . . . . . . . . . . . . . . . . . . . 57
3.2.3 Stability of Triangular System Solvers . . . . . . . . . . . . 59
3.2.4 Toeplitz Triangular Systems . . . . . . . . . . . . . . . . . . . 61
3.3 Implementations for a Given Number of Processors . . . . . . . . 66
3.4 Nonlinear Recurrences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4 General Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79


4.1 Gaussian Elimination. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2 Pairwise Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 Block LU Factorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.1 Approximate Block Factorization . . . . . . . . . . . . . . . 85
4.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5 Banded Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91


5.1 LU-based Schemes with Partial Pivoting . . . . . . . . . . . . . . . . 91
5.2 The Spike Family of Algorithms. . . . . . . . . . . . . . . . . . . . . . 94
5.2.1 The Spike Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2.2 Spike: A Polyalgorithm . . . . . . . . . . . . . . . . . . . . . . 99
5.2.3 The Non-diagonally Dominant Case . . . . . . . . . . . . . 100
5.2.4 The Diagonally Dominant Case . . . . . . . . . . . . . . . . 104
5.3 The Spike-Balance Scheme . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4 A Tearing-Based Banded Solver . . . . . . . . . . . . . . . . . . . . . . 115
5.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4.2 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.4.3 The Balance System . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.4 The Hybrid Solver of the Balance System . . . . . . . . . 124
5.5 Tridiagonal Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5.1 Solving by Marching . . . . . . . . . . . . . . . . . . . . . . . 128
5.5.2 Cyclic Reduction and Parallel Cyclic Reduction . . . . . 130
5.5.3 LDU Factorization by Recurrence Linearization . . . . . 139
5.5.4 Recursive Doubling . . . . . . . . . . . . . . . . . . . . . . . . 143
5.5.5 Solving by Givens Rotations . . . . . . . . . . . . . . . . . . 144
5.5.6 Partitioning and Hybrids . . . . . . . . . . . . . . . . . . . . . 149
5.5.7 Using Determinants and Other Special Forms . . . . . . . 155
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

6 Special Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165


6.1 Vandermonde Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
6.1.1 Vandermonde Matrix Inversion. . . . . . . . . . . . . . . . . 170
6.1.2 Solving Vandermonde Systems and Parallel Prefix . . . 172
6.1.3 A Brief Excursion into Parallel Prefix . . . . . . . . . . . . 174
6.2 Banded Toeplitz Linear Systems Solvers . . . . . . . . . . . . . . . . 176
6.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.2.2 Computational Schemes . . . . . . . . . . . . . . . . . . . . . . 182
6.3 Symmetric and Antisymmetric Decomposition (SAS) . . . . . . . 192
6.3.1 Reflexive Matrices as Preconditioners . . . . . . . . . . . . 194
6.3.2 Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . . 196
6.4 Rapid Elliptic Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
6.4.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
6.4.2 Mathematical and Algorithmic Infrastructure . . . . . . . 199
6.4.3 Matrix Decomposition . . . . . . . . . . . . . . . . . . . . . . . 201
6.4.4 Complete Fourier Transform . . . . . . . . . . . . . . . . . . 203
6.4.5 Block Cyclic Reduction . . . . . . . . . . . . . . . . . . . . . . 205
6.4.6 Fourier Analysis-Cyclic Reduction . . . . . . . . . . . . . . 210
6.4.7 Sparse Selection and Marching . . . . . . . . . . . . . . . . 211
6.4.8 Poisson Inverse in Partial Fraction Representation . . . . 214
6.4.9 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220

7 Orthogonal Factorization and Linear Least Squares Problems . . . 227


7.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.2 QR Factorization via Givens Rotations . . . . . . . . . . . . . . . . . 228
7.3 QR Factorization via Householder Reductions . . . . . . . . . . . . 232
7.4 Gram-Schmidt Orthogonalization . . . . . . . . . . . . . . . . . . . . . 233
7.5 Normal Equations Versus Orthogonal Reductions . . . . . . . . . . 235
7.6 Hybrid Algorithms When m >> n . . . . . . . . . . . . . . . . . . . . . 236
7.7 Orthogonal Factorization of Block Angular Matrices . . . . . . . . 237
7.8 Rank-Deficient Linear Least Squares Problems . . . . . . . . . . . . 242
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246

8 The Symmetric Eigenvalue and Singular-Value Problems . . . . ... 249


8.1 The Jacobi Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . ... 251
8.1.1 The Two-Sided Jacobi Scheme for the Symmetric
Standard Eigenvalue Problem . . . . . . . . . . . . . . . ... 251
8.1.2 The One-Sided Jacobi Scheme for the Singular
Value Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
8.1.3 The Householder-Jacobi Scheme . . . . . . . . . . . . . . . . 259
8.1.4 Block Jacobi Algorithms . . . . . . . . . . . . . . . . . . . . . 261
8.1.5 Efficiency of Parallel Jacobi Methods . . . . . . . . . . . . 262

8.2 Tridiagonalization-Based Schemes. . . . . . . . . . . . . . . . ..... 263


8.2.1 Tridiagonalization of a Symmetric Matrix . . . . ..... 264
8.2.2 The QR Algorithm: A Divide-and-Conquer
Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
8.2.3 Sturm Sequences: A Multisectioning Approach . . . . . . 267
8.3 Bidiagonalization via Householder Reduction . . . . . . . . . . . . . 272
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273

Part III Sparse Matrix Computations

9 Iterative Schemes for Large Linear Systems . . . . . . . . . . . . . . . . 277


9.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
9.2 Classical Splitting Methods . . . . . . . . . . . . . . . . . . . . . . . . . 280
9.2.1 Point Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
9.2.2 Point Gauss-Seidel . . . . . . . . . . . . . . . . . . . . . . . . . 282
9.2.3 Line Jacobi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
9.2.4 Line Gauss-Seidel . . . . . . . . . . . . . . . . . . . . . . . . . . 286
9.2.5 The Symmetric Positive Definite Case . . . . . . . . . . . . 287
9.3 Polynomial Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 294
9.3.1 Chebyshev Acceleration. . . . . . . . . . . . . . . . . . . . . . 294
9.3.2 Krylov Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309

10 Preconditioners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 311


10.1 A Tearing-Based Solver for Generalized
Banded Preconditioners . . . . . . . . . . . . . . . . . . . . . . . . . ... 312
10.2 Row Projection Methods for Large Nonsymmetric
Linear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 312
10.2.1 The Kaczmarz Scheme . . . . . . . . . . . . . . . . . . . ... 313
10.2.2 The Cimmino Scheme . . . . . . . . . . . . . . . . . . . . ... 319
10.2.3 Connection Between RP Systems
and the Normal Equations . . . . . . . . . . . . . . . . . . . . 319
10.2.4 CG Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.2.5 The 2-Partitions Case . . . . . . . . . . . . . . . . . . . . . . . 321
10.2.6 Row Partitioning Goals . . . . . . . . . . . . . . . . . . . . . . 324
10.2.7 Row Projection Methods and Banded Systems . . . . . . 325
10.3 Multiplicative Schwarz Preconditioner with GMRES . . . . . . . . 326
10.3.1 Algebraic Domain Decomposition
of a Sparse Matrix . . . . . . . . . . . . . . . . . . . . . . ... 327
10.3.2 Block Multiplicative Schwarz . . . . . . . . . . . . . . . ... 329
10.3.3 Block Multiplicative Schwarz as a Preconditioner
for Krylov Methods. . . . . . . . . . . . . . . . . . . . . . ... 336
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 340

11 Large Symmetric Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . 343


11.1 Computing Dominant Eigenpairs and Spectral
Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
11.1.1 Spectral Transformations . . . . . . . . . . . . . . . . . . . . . 345
11.1.2 Use of Sturm Sequences . . . . . . . . . . . . . . . . . . . . . 349
11.2 The Lanczos Method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
11.2.1 The Lanczos Tridiagonalization . . . . . . . . . . . . . . . . 350
11.2.2 The Lanczos Eigensolver . . . . . . . . . . . . . . . . . . . . . 352
11.3 A Block Lanczos Approach for Solving Symmetric Perturbed
Standard Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . 356
11.3.1 Starting Vectors for A(S_i)x = λx . . . . . . . . . . . 356
11.3.2 Starting Vectors for A(S_i)^{-1}x = μx . . . . . . . . . . 358
11.3.3 Extension to the Perturbed Symmetric Generalized
Eigenvalue Problems . . . . . . . . . . . . . . . . . . . . . . . . 359
11.3.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
11.4 The Davidson Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
11.4.1 General Framework . . . . . . . . . . . . . . . . . . . . . . . . . 363
11.4.2 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
11.4.3 Types of Correction Steps . . . . . . . . . . . . . . . . . . . . 366
11.5 The Trace Minimization Method for the Symmetric
Generalized Eigenvalue Problem . . . . . . . . . . . . . . . . . . . . . . 368
11.5.1 Derivation of the Trace Minimization Algorithm . . . . . 370
11.5.2 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . 373
11.5.3 Acceleration Techniques . . . . . . . . . . . . . . . . . . . . . 378
11.5.4 A Davidson-Type Extension . . . . . . . . . . . . . . . . . . . 381
11.5.5 Implementations of TRACEMIN . . . . . . . . . . . . . . . . . 384
11.6 The Sparse Singular-Value Problem . . . . . . . . . . . . . . . . . . . 386
11.6.1 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 386
11.6.2 Subspace Iteration for Computing the Largest
Singular Triplets . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
11.6.3 The Lanczos Method for Computing a Few
of the Largest Singular Triplets. . . . . . . . . . . . . . . . . 392
11.6.4 The Trace Minimization Method for Computing
the Smallest Singular Triplets . . . . . . . . . . . . . . . . . . 395
11.6.5 Davidson Methods for the Computation
of the Smallest Singular Values . . . . . . . . . . . . . . . . 398
11.6.6 Refinement of Left Singular Vectors . . . . . . . . . . . . . 399
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402

Part IV Matrix Functions and Characteristics

12 Matrix Functions and the Determinant . . . . . . . . . . . . . . ...... 409


12.1 Matrix Functions. . . . . . . . . . . . . . . . . . . . . . . . . . . ...... 410
12.1.1 Methods Based on the Product Form
of the Denominator . . . . . . . . . . . . . . . . . . . . . . . . . 412
12.1.2 Methods Based on Partial Fractions. . . . . . . . . . . . . . 414
12.1.3 Partial Fractions in Finite Precision . . . . . . . . . . . . . . 418
12.1.4 Iterative Methods and the Matrix Exponential. . . . . . . 424
12.2 Determinants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
12.2.1 Determinant of a Block-Tridiagonal Matrix . . . . . . . . 429
12.2.2 Counting Eigenvalues with Determinants . . . . . . . . . . 431
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434

13 Computing the Matrix Pseudospectrum . . . . . . . . . . . . . . . . . . . 439


13.1 Grid Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
13.1.1 Limitations of the Basic Approach . . . . . . . . . . . . . . 440
13.1.2 Dense Matrix Reduction . . . . . . . . . . . . . . . . . . . . . 442
13.1.3 Thinning the Grid: The Modified GRID Method . . . . . 443
13.2 Dimensionality Reduction on the Domain: Methods
Based on Path Following . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
13.2.1 Path Following by Tangents . . . . . . . . . . . . . . . . . . . 446
13.2.2 Path Following by Triangles. . . . . . . . . . . . . . . . . . . 450
13.2.3 Descending the Pseudospectrum . . . . . . . . . . . . . . . . 454
13.3 Dimensionality Reduction on the Matrix: Methods
Based on Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 458
13.3.1 An EIGTOOL Approach for Large Matrices . . . . . . . . . 459
13.3.2 Transfer Function Approach . . . . . . . . . . . . . . . . . . . 460
13.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
References. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 464

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467
List of Figures

Figure 1.1 SIMD architecture . . . . . . . . . . . . . . . . . . . . . . . . . ..... 5


Figure 1.2 Computational rate and efficiency of a vector
operation on p = 16 PEs of a SIMD architecture . . . . ..... 6
Figure 1.3 Floating-point adder pipeline . . . . . . . . . . . . . . . . . . ..... 7
Figure 1.4 Computational rate of a pipelined operation . . . . . . . ..... 8
Figure 1.5 Computational rate of a pipelined operation with vector
registers (N = 64) . . . . . . . . . . . . . . . . . . . . . . . . . ..... 8
Figure 1.6 MIMD architecture with shared memory. . . . . . . . . . ..... 9
Figure 1.7 MIMD architecture with distributed memory . . . . . . ..... 10
Figure 2.1 Efficiencies of the doall approach with a sequential
outer loop in MGS. . . . . . . . . . . . . . . . . . . . . . . . . ..... 28
Figure 2.2 Partition of a block-diagonal matrix with overlapping
blocks; vector v is decomposed in overlapping slices . ..... 35
Figure 2.3 Elementary blocks for the MV kernel . . . . . . . . . . . . ..... 35
Figure 2.4 Common structure of programs in time-dependent
simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Figure 2.5 Reordering to a narrow-banded matrix . . . . . . . . . . . . . . . . 38
Figure 2.6 Reordering to a medium-banded matrix . . . . . . . . . . . . . . . 39
Figure 2.7 Reordering to a wide-banded matrix. . . . . . . . . . . . . . . . . . 39
Figure 3.1 Sequential solution of lower triangular system
Lx = f using CSWEEP (column-sweep Algorithm 3.1) . ..... 54
Figure 3.2 Computation of solution of lower triangular system Lx
= f using the fan-in approach of DTS (Algorithm 3.2) ..... 55
Figure 3.3 Computation of the terms of the vector (3.21) . . . . . . ..... 65
Figure 3.4 Unit lower triangular matrix in triangular system (3.25)
with n = 16, p = 4 and m = 1 . . . . . . . . . . . . . . . . . ..... 70
Figure 3.5 Sequential method, ξ_{k+1} = f(ξ_k), for approximating
the fixed point α . . . . . . . . . . . . . . . . . . . . . . . . . . ..... 76
Figure 3.6 Function g used to compute the fixed point of f. . . . . ..... 77
Figure 4.1 Annihilation in Gaussian elimination with pairwise
pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..... 83


Figure 4.2 Annihilation in block factorization without pivoting . . . . ... 86


Figure 4.3 Concurrency in block factorization . . . . . . . . . . . . . . . . ... 86
Figure 5.1 Illustrating limited parallelism when Gaussian
elimination is applied on a system of bandwidth 9 . . . . . ... 92
Figure 5.2 Original banded matrix A ∈ R^{18×18} . . . . . . . . . 93
Figure 5.3 Banded matrix after the row permutations A_1 = P_0 A_0,
as in SCALAPACK . . . . . . . . . 93
Figure 5.4 Banded matrix after the row and column permutations
A_2 = A_1 P_1, as in SCALAPACK . . . . . . . . . 93
Figure 5.5 Spike partitioning of the matrix A and block
of right-hand sides F with p = 4 . . . . . . . . . 96
Figure 5.6 The Spike matrix with 4 partitions . . . . . . . . . . . . . . . . . . . 97
Figure 5.7 The Spike-balance scheme for two block rows. . . . . . . . . . . 105
Figure 5.8 The Spike-balance scheme for p block rows . . . . . . . . . . . . 109
Figure 5.9 Nonzero structure of A^(j) as PARACR progresses
from j = 1 (upper left) to j = 4 (lower right) . . . . . . . . . 137
Figure 5.10 Nonzero structure of A^(3) ∈ R^{7×7} (left)
and A^(4) ∈ R^{15×15} (right). In both panels, the unknown
corresponding to the middle equation is obtained using
the middle value of each matrix enclosed by the
rectangle with double border. The next set of computed
unknowns (2 of them) correspond to the diagonal
elements enclosed by the simple rectangle, the next set
of computed unknowns (2^2 of them) correspond to the
diagonal elements enclosed by the dotted rectangles.
For A^(4), the final set of 2^3 unknowns correspond to the
encircled elements . . . . . . . . . 140
Figure 6.1 Regions in the (σ, δ)-plane of positive definiteness
and diagonal dominance of pentadiagonal Toeplitz
matrices (1, σ, δ, σ, 1) . . . . . . . . . 191
Figure 6.2 Prismatic bar with one end fixed and the other
elastically supported . . . . . . . . . . . . . . . . . . . . . . . . . . ... 195
Figure 7.1 Ordering for the Givens rotations: all the entries
of label k are annihilated by Qk . . . . . . . . . . . . . . . . . . . . . 230
Figure 7.2 Parallel annihilation when m >> n: a two-step procedure . . . . 236
Figure 7.3 Geographical partition of a geodetic network. . . . . . . . . . . . 239
Figure 8.1 Levels and associated tasks method Divide-and-Conquer . . . 267
Figure 10.1 A matrix domain decomposition with block overlaps . . . . . . 327
Figure 10.2 Illustration of (10.49). Legend I = Ii and I + 1 = Ii+1 . . . . . . 333
Figure 10.3 Expression of the block multiplicative Schwarz split-
ting for a block-tridiagonal matrix. Left Pattern of A;
Right Pattern of N where A = P − N is the
corresponding splitting . . . . . . . . . . . . . . . . . . . . . . . . ... 335

Figure 10.4 Pipelined construction of the Newton-Krylov basis


corresponding to the block multiplicative Schwarz
preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 337
Figure 10.5 Flow of the computation v_{k+1} = σ_k P^{-1}(A − λ_k I)v_k
(courtesy of the authors of [50]) . . . . . . . . . . . . . . . . . . . . 339
Figure 11.1 A cantilever beam with additional spring support. . . . . . . . . 361
Figure 11.2 The regular hexahedral finite element . . . . . . . . . . . . . . . . . 362
Figure 11.3 A 3-D cantilever beam with additional spring supports . . . . . 362
Figure 12.1 Application of EIGENCNT on a random matrix
of order 5. The eigenvalues are indicated by the stars.
The polygonal line is defined by the 10 points
with circles; the other points of the line are
automatically introduced to ensure the conditions
as specified in [93] . . . . . . . . . . . . . . . . . . . . . . . . . . ... 433
Figure 13.1 Illustrations of pseudospectra for matrix grcar of
order n = 50. The left frame was computed using
function ps from the Matrix Computation Toolbox
that is based on relation 13.1 and shows the eigenvalues
of matrices A + E_j for random perturbations
E_j ∈ C^{50×50}, j = 1, ..., 10 where ‖E_j‖ ≤ 10^{-3}. The
frame on the right was computed using the EIGTOOL
package and is based on relation 13.2; it shows the
level curves defined by {z : s(z) ≤ ε} for ε = 10^{-1}
down to 10^{-10} . . . . . . . . . 440
Figure 13.2 Using MOG to compute the pseudospectrum of
triangle(32) for ε = 1e − 1 (from [4]) . . . . . . . . . . . ... 444
Figure 13.3 Method MOG: outcome from stages 1 (dots ‘•’) and 2
(circle ‘o’) for matrix grcar(1000) on 50 × 50 grid
(from [4]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ... 445
Figure 13.4 COBRA: Position of pivot (Z^{piv}_{k-1}), initial prediction (z̃_k),
support (Z^{sup}_k), first order predictors (ζ_{j,0}) and
corrected points (Z^j_k). (A proper scale would show
that h << H) . . . . . . . . . 448
Figure 13.5 PAT: The lattice and the equilateral triangles . . . . . . . . ... 450
Figure 13.6 PAT: Illustrations of four situations for transformation F ... 451
Figure 13.7 PAT: Two orbits for the matrix grcar(100);
eigenvalues are plotted by (red) ‘*’, the two
chains of triangles are drawn in black (ε = 10^{-12})
and blue (ε = 10^{-6}) . . . . . . . . . 454

Figure 13.8 Pseudospectrum contours and trajectories of points


computed by PsDM for pseps, ε = 10^{-1}, ..., 10^{-3} for
matrix kahan(100). Arrows show the directions used
in preparing the outermost curve with path following
and the directions used in marching from the outer to
the inner curves with PsDM . . . . . . . . . . . . . . . . . . . . . . . 455
Figure 13.9 Computing ∂Λδ (A), δ < ε . . . . . . . . . . . . . . . . . . . . . . . . . 456
Figure 13.10 Pseudospectrum descent process for a single point . . . . . . . . 457
List of Tables

Table 1.1 Loops considered in this book . . . . . . . . . . . . . . . . . . . . .. 12


Table 2.1 Elementary factorization procedures; MATLAB index
notation used for submatrices . . . . . . . . . . . . . . . . . . . . . .. 26
Table 2.2 Benefit of pipelining the outer loop in MGS
(QR factorization) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 27
Table 3.1 Summary of bounds for parallel steps, efficiency,
redundancy and processor count in algorithms for solving
linear recurrences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 72
Table 3.2 Summary of linear recurrence bounds using a limited
number, p, of processors, where m < p << n and m and
n are as defined in Table 3.1 . . . . . . . . . . . . . . . . . . . . . .. 73
Table 5.1 Steps of a parallel prefix matrix product algorithm
to compute the recurrence (5.93). . . . . . . . . . . . . . . . . . . .. 142
Table 6.1 Parallel arithmetic operations, number of processors, and
overall operation counts for Algorithms 6.9, 2 and 3 . . . . . .. 192
Table 8.1 Annihilation scheme for 2JAC . . . . . . . . . . . . . . . . . . . . .. 253
Table 10.1 Comparison of system matrices for three row projection
methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 319
Table 12.1 Partial fraction coefficients for p^{-1}_{1,k}(ζ) when p_{1,k}(ζ) =
∏_{j=1, j≠k}^{6} (ζ − j/6), for the incomplete partial fraction
expansion of p^{-1}_{1,k}(ζ) (ζ − ξ_k)^{-1} . . . . . . . . . 420
Table 12.2 Partial fraction coefficients for p^{-1}_{1,(k,i)}(ζ) when p_{1,(k,i)}(ζ) =
∏_{j=1, j≠k,i}^{6} (ζ − j/6), for the incomplete partial fraction
expansion of p^{-1}_{1,(k,i)}(ζ) (ζ − ξ_k)^{-1}(ζ − ξ_i)^{-1} . . . . . . . . . 422
Table 12.3 Base 10 logarithm of partial fraction coefficient of largest . . . 428
Table 12.4 Components and base 10 logarithm of maximum relative. . . . 428
Table 13.1 PAT: number of triangles in two orbits for matrix
grcar(100). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453

List of Algorithms

Algorithm 2.1 Fan-out version for factorization schemes . . . . . . . .... 25


Algorithm 2.2 Fan-in version for factorization schemes . . . . . . . . .... 25
Algorithm 2.3 do/doall fan-out version for factorization schemes . .... 26
Algorithm 2.4 doacross/doall version for factorization schemes
(fan-out.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... 28
Algorithm 2.5 Message passing fan-out version for factorization
schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Algorithm 2.6 CRS-type MV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Algorithm 2.7 CCS-type MV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Algorithm 2.8 COO-type MV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Algorithm 2.9 MV: w = w + Av . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Algorithm 2.10 MTV: w = w + A⊤v . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Algorithm 2.11 Scalable MV multiplication w = w + Av . . . . . . . . . . . . 36
Algorithm 3.1 CSWEEP: Column-sweep method for unit lower
triangular system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Algorithm 3.2 DTS: Triangular solver based on a fan-in approach . . . . 57
Algorithm 3.3 BBTS: Block banded triangular solver . . . . . . . . . . . . . 60
Algorithm 3.4 TTS: Triangular Toeplitz solver . . . . . . . . . . . . . . . . . . 63
Algorithm 3.5 BTS: Banded Toeplitz triangular solver . . . . . . . . . . . . 66
Algorithm 5.1 Basic stages of the Spike algorithm . . . . . . . . . . . . . . . 95
Algorithm 5.2 Domain Decomposition Conjugate Gradient
(DDCG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... 126
Algorithm 5.3 CR: cyclic reduction tridiagonal solver . . . . . . . . . .... 133
Algorithm 5.4 PARACR: matrix splitting-based paracr (using
transformation 5.88). Operators ⊙, ⊘ denote
elementwise multiplication and division
respectively . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... 136
Algorithm 5.5 TRID_LDU: LDU factorization of tridiagonal matrix . .... 142
Algorithm 5.6 RD_PREF: tridiagonal system solver using rd and
matrix parallel prefix . . . . . . . . . . . . . . . . . . . . . .... 144


Algorithm 5.7 PARGIV: tridiagonal system solver using Givens


rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..... 148
Algorithm 6.1 PR2PW: Conversion of polynomial from product
form to power form . . . . . . . . . . . . . . . . . . . . . ..... 169
Algorithm 6.2 POWFORM: Conversion from full product
(roots and leading coefficient) (6.5) to power form
(coefficients) using the explicit formula (6.8) for
the transforms.. . . . . . . . . . . . . . . . . . . . . . . . . ..... 171
Algorithm 6.3 IVAND: computing the Vandermonde inverse . . . . ..... 171
Algorithm 6.4 NEVILLE: computing the divided differences by the
Neville method. . . . . . . . . . . . . . . . . . . . . . . . . ..... 172
Algorithm 6.5 DD_PPREFIX: computing divided difference
coefficients by sums of rationals and
parallel prefix . . . . . . . . . . . . . . . . . . . . . . . . . ..... 174
Algorithm 6.6 DD2PW: converting divided differences
to power form coefficients . . . . . . . . . . . . . . . . ..... 174
Algorithm 6.7 PREFIX: prefix computation . . . . . . . . . . . . . . . . . ..... 175
Algorithm 6.8 PREFIX_OPT: prefix computation by odd-even
reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . ..... 175
Algorithm 6.9 A banded Toeplitz solver for nonsymmetric
systems with nonsingular associated
circulant matrices . . . . . . . . . . . . . . . . . . . . . . . ..... 184
Algorithm 6.10 A Banded Toeplitz Solver for Symmetric
Positive Definite Systems . . . . . . . . . . . . . . . . . ..... 184
Algorithm 6.11 A Banded Toeplitz Solver for Symmetric
Positive Definite Systems with Positive
Definite Associated Circulant Matrices . . . . . . . . ..... 188
Algorithm 6.12 MD-FOURIER: matrix decomposition
method for the discrete Poisson system . . . . . . . . . 203
Algorithm 6.13 CFT: complete Fourier transform method
for the discrete Poisson system . . . . . . . . . 204
Algorithm 6.14 SIMPLESOLVE_PF: solving ∏_{j=1}^{d} (T − ρ_j I) x = b
for mutually distinct values ρ_j from partial
fraction expansions . . . . . . . . . 206
Algorithm 6.15 CORF: Block cyclic reduction for the discrete
Poisson system . . . . . . . . . . . . . . . . . . . . . . . . ..... 207
Algorithm 6.16 BCR: Block cyclic reduction with Buneman
stabilization for the discrete Poisson system . . . . ..... 209
Algorithm 6.17 EES: Explicit Elliptic Solver for the discrete
Poisson system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
Algorithm 7.1 QR by Givens rotations . . . . . . . . . . . . . . . . . . . . . . . 231
Algorithm 7.2 CGS: classical Gram-Schmidt . . . . . . . . . . . . . . . . . . . 234
Algorithm 7.3 MGS: modified Gram-Schmidt . . . . . . . . . . . . . . . . . . 234
Algorithm 7.4 B2GS: block Gram-Schmidt . . . . . . . . . . . . . . . . . . . . 235

Algorithm 7.5 Rank-revealing QR for a triangular matrix. . . . . . . . . . . 244


Algorithm 8.1 2JAC: two-sided Jacobi scheme . . . . . . . . . . . . . . . . . . 255
Algorithm 8.2 1JAC: one-sided Jacobi for rank-revealing SVD . . . . . . 258
Algorithm 8.3 QJAC: SVD of a tall matrix (m >> n) . . . . . . . . . 260
Algorithm 8.4 TREPS: tridiagonal eigenvalue parallel solver by
Sturm sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Algorithm 8.5 Computation of an eigenvector by inverse iteration . . . . 271
Algorithm 9.1 Line SOR iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
Algorithm 9.2 Block Stiefel iterations . . . . . . . . . . . . . . . . . . . . . . . . 298
Algorithm 9.3 Arnoldi procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Algorithm 9.4 Chebyshev-Krylov procedure. . . . . . . . . . . . . . . . . . . . 306
Algorithm 9.5 Newton-Krylov procedure . . . . . . . . . . . . . . . . . . . . . . 308
Algorithm 9.6 Real Newton-Krylov procedure . . . . . . . . . . . . . . . . . . 309
Algorithm 10.1 Kaczmarz method (classical version) . . . . . . . . . . . . . . 313
Algorithm 10.2 KACZ: Kaczmarz method (symmetrized version) . . . . . . 318
Algorithm 10.3 Single block multiplicative Schwarz step. . . . . . . . . . . . 329
Algorithm 10.4 Pipelined multiplicative Schwarz iteration
w := P^{-1}v: program for processor p(q) . . . . . . . . . 338
Algorithm 11.1 Power method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Algorithm 11.2 Simultaneous iteration method . . . . . . . . . . . . . . . . . . . 345
Algorithm 11.3 Shift-and-invert method . . . . . . . . . . . . . . . . . . . . . . . 347
Algorithm 11.4 Computing intermediate eigenvalues. . . . . . . . . . . . . . . 348
Algorithm 11.5 Lanczos procedure (no reorthogonalization). . . . . . . . . . 351
Algorithm 11.6 Lanczos procedure (with reorthogonalization). . . . . . . . . 352
Algorithm 11.7 Block-Lanczos procedure (with full reorth.). . . . . . . . . . 353
Algorithm 11.8 LANCZOS1: Two-pass Lanczos eigensolver . . . . . . . . . 355
Algorithm 11.9 LANCZOS2: Iterative Lanczos eigensolver . . . . . . . . . . 355
Algorithm 11.10 Generic Davidson method . . . . . . . . . . . . . . . . . . . . . . 364
Algorithm 11.11 Generic block Davidson method . . . . . . . . . . . . . . . . . 365
Algorithm 11.12 Simultaneous iteration for the generalized
eigenvalue problem . . . . . . . . . . . . . . . . . . . . . . . . . . 371
Algorithm 11.13 The basic trace minimization algorithm. . . . . . . . . . . . . 372
Algorithm 11.14 The block Jacobi-Davidson algorithm . . . . . . . . . . . . . . 382
Algorithm 11.15 The Davidson-type trace minimization algorithm . . . . . . 383
Algorithm 11.16 SISVD: inner iteration of the subspace iteration as
implemented in Rutishauser’s RITZIT . . . . . . . . . .... 390
Algorithm 11.17 BLSVD: hybrid Lanczos outer iteration
(Formation of symmetric block-tridiagonal
matrix Hk ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... 395
Algorithm 11.18 TRSVD: trace minimization with
Chebyshev acceleration and Ritz shifts . . . . . . . . . .... 398
Algorithm 11.19 Computing smallest singular values by the
block Davidson method . . . . . . . . . . . . . . . . . . . .... 400

Algorithm 11.20 Refinement procedure for the left singular


vector approximations obtained via scaling . . . . . ..... 402
Algorithm 12.1 Computing x = (p(A))^{-1}b when p(ζ) =
∏_{j=1}^{d} (ζ − τ_j) with τ_j mutually distinct . . . . . . . . . 412
Algorithm 12.2 Computing x = (p(A))^{-1} when p(ζ) =
∏_{j=1}^{d} (ζ − τ_j) with τ_j mutually distinct . . . . . . . . . 412
Algorithm 12.3 Computing x = (p(A))^{-1}q(A)b when p(ζ) =
∏_{j=1}^{d} (ζ − τ_j) and the roots τ_j are mutually
distinct . . . . . . . . . 416
Algorithm 12.4 Compute x = (p(A))^{-1}q(A)b when p(ζ) =
∏_{j=1}^{d} (ζ − τ_j) and the roots τ_j are mutually
distinct . . . . . . . . . 417
Algorithm 12.5 Computing ζd using partial fractions . . . . . . . . . . ..... 418
Algorithm 12.6 Computing the IPF(τ) representation
of (p(ζ))^{-1} when p(ζ) = ∏_{j=1}^{d} (ζ − τ_j)
and the roots τ_j are mutually distinct . . . . . . . . . 424
Algorithm 12.7 EIGENCNT: counting eigenvalues
surrounded by a curve . . . . . . . . . . . . . . . . . . . ..... 434
Algorithm 13.1 GRID: Computing Λε(A) based on Def. (13.2). . . ..... 441
Algorithm 13.2 GRID_FACT: computes Λε(A) using Def. 13.2
and factorization . . . . . . . . . . . . . . . . . . . . . . . ..... 443
Algorithm 13.3 MoG: Computing Λε(A) based
on inclusion-exclusion [2] . . . . . . . . . . . . . . . . . ..... 444
Algorithm 13.4 PF: computing ∂Λε(A) by predictor-corrector
path following [7] . . . . . . . . . . . . . . . . . . . . . . ..... 447
Algorithm 13.5 COBRA: algorithm to compute ∂Λ ε(A) using parallel
path following [48]. After initialization (line 1), it
consists of three stages: i) prediction-correction
(lines 4-5), ii) prediction-correction (lines 7-8), and
iii) pivot selection (line 10) . . . . . . . . . . . . . . . . ..... 449
Algorithm 13.6 PAT: path-following method to determine
Λε(A) with triangles . . . . . . . . . . . . . . . . . . . . . ..... 452
Algorithm 13.7 PsDM_sweep: single sweep of method PsDM . . ..... 456
Algorithm 13.8 PsDM: pseudospectrum descent method . . . . . . . ..... 457
Algorithm 13.9 LAPSAR: Local approximation of pseudospectrum
around Ritz values obtained from ARPACK. . . . . ..... 460
Algorithm 13.10 TR: approximating the ε-pseudospectrum
Λε(A) using definition 13.3 on p processors. . . . . ..... 462
Notations

i_1 : i_s : i_2   MATLAB-type colon notation
;   MATLAB-type start of a new matrix row
⌊·⌋ and ⌈·⌉   floor and ceiling functions
log   logarithm (base-2) unless mentioned otherwise
R, C   real number field, complex number field
ξ̄   conjugate of ξ
A^T, A^H   transpose of A, Hermitian transpose of A ∈ C^{n×n}
⊙, ⊘   Hadamard (element-by-element) multiplication and division
⊗, ⊕   Kronecker product, Kronecker sum
0 ∈ R^{n×n}, 0_n   zero matrix (size implied by the context or n × n)
0_{m,n}   zero matrix of size m × n
I ∈ R^{n×n}, I_n   identity matrix (size implied by the context or n × n)
e_j or e_j^{(n)} ∈ R^n   jth column of I or I_n
e or e^{(n)}   column vector of 1's (size implied or n)
J_n or J   order n matrix (e_2, e_3, ..., e_n, 0); subscript omitted when implied by context
diag(x)   diagonal matrix with x along the diagonal when it is a vector
diag(X)   diagonal matrix of diagonal elements of X when X is a matrix
diag(ξ_1, ..., ξ_n)   same as diag(x) where x = (ξ_1, ..., ξ_n)^T
diag(A_1, ..., A_n)   the block diagonal matrix with diagonal blocks the matrices A_j
det(X)   determinant of X
tr(X)   trace of X
tril(X), triu(X)   the strictly lower (resp. upper) triangular section of X
[α_{i,i−1}, α_{i,i}, α_{i,i+1}] or [α_{i,i−1}, α_{i,i}, α_{i,i+1}]_{1:n}   tridiagonal matrix with α_{i,i−1}, α_{i,i}, α_{i,i+1} in row i (order implied or n)
[β, α, γ] or [β, α, γ]_n   tridiagonal Toeplitz matrix with β, α, γ along the subdiagonal, diagonal and supradiagonal respectively (size implied or n × n)
[β_i, α_i, γ_{i+1}], [β_i, α_i, γ_{i+1}]_{1:n} or [β_i, α_i, γ_{i+1}]_n   tridiagonal matrix (implied order or n) with β_i, α_i, γ_{i+1} in row i
T_n   Chebyshev polynomial of the 1st kind and degree n
T̂_n   Chebyshev polynomial of the 1st kind and degree n, scaled
Û_n   Chebyshev polynomial of the 2nd kind and degree n, shifted

Based on the above,

[α_{i,i−1}, α_{i,i}, α_{i,i+1}]_{1:n} =
\begin{pmatrix}
α_{1,1} & α_{1,2} &         &         &           \\
α_{2,1} & α_{2,2} & α_{2,3} &         &           \\
        & \ddots  & \ddots  & \ddots  &           \\
        &         & \ddots  & \ddots  & α_{n−1,n} \\
        &         &         & α_{n,n−1} & α_{n,n}
\end{pmatrix}
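
As a concrete illustration of the bracket notation above, here is a small sketch of our own (not part of the book's notation list): it builds the tridiagonal Toeplitz matrix [β, α, γ]_n numerically. The helper name tridiag_toeplitz and the use of numpy are assumptions made only for this example.

import numpy as np

def tridiag_toeplitz(beta, alpha, gamma, n):
    # The n-by-n tridiagonal Toeplitz matrix [beta, alpha, gamma]_n:
    # beta on the subdiagonal, alpha on the diagonal, gamma on the superdiagonal.
    return (np.diag(np.full(n - 1, beta), -1)
            + np.diag(np.full(n, alpha))
            + np.diag(np.full(n - 1, gamma), 1))

# e.g. [-1, 2, -1]_4, the familiar second-difference matrix
print(tridiag_toeplitz(-1.0, 2.0, -1.0, 4))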

Algorithms are frequently expressed in pseudocode, using MATLAB constructs.
In most cases, the “Householder notation” is followed, that is, matrices are denoted
by upper case romans, vectors by lower case romans, and scalars by Greek letters.
There are some notable exceptions: indices and function names are also denoted by
lower case romans (e.g. i, j, k and f, p, q). So the following disclaimer is necessary:
whenever we felt that following the Householder notation conventions would
compromise readability, they were briefly abandoned.
Part I
Basics
Chapter 1
Parallel Programming Paradigms

In this chapter, we briefly present the main concepts in parallel computing. This is
an attempt to make more precise some definitions that will be used throughout this
book rather than a survey on the topic. Interested readers are referred to one of the
many books available on the subject, e.g. [1–8].

1.1 Computational Models

In order to describe algorithms that are general so as to be useful in a variety of


generations of parallel computers, we need to define computational models that are
abstract enough so as to encapsulate the main features of the underlying architec-
ture, without obscuring the algorithm description. Such models are helpful in the
choice of the programming paradigm as well as designing algorithms that enable
faithful performance assessment. It is worth mentioning that creating a computa-
tional model that helps in designing efficient algorithms can be vastly different from
models targeting the design of compilers and other systems software.
In this chapter we present an outline of different models of parallel architectures
and the appropriate way of using them.

1.1.1 Performance Metrics

It is often useful to evaluate the potential performance of a parallel algorithm by


assuming that there is an unlimited number of processors, and that interprocessor
communication is instantaneous. Such an approach is of value, as it informs the
algorithm designer about the maximum expected benefit (such as total runtime and
resource usage) that can be achieved by parallel programs implementing these algo-
rithms. Moreover, this approach can steer the designer to search for algorithms that
achieve better performance by taking alternative approaches.

In what follows:
• p denotes the number of processors.
• T p denotes the number of steps of the parallel algorithm on p processors. Each
step is assumed to consist of (i) the interprocessor communication and memory
references, and (ii) the arithmetic operations performed immediately after by the
active processors. Thus, for example, as will be shown in Sect. 2.1 of Chap. 2, an
inner product of two n-sized vectors, requires at least T p = O(log n) steps on
p = n processors.
• O p denotes the number of arithmetic operations required by the parallel algorithm
when run on p processors.
• R p denotes arithmetic redundancy. This is the ratio of the total number of arith-
metic operations, O p , in the parallel algorithm, over the least number of arithmetic
operations, O1 , required by the sequential algorithm. Specifically

R p = O p /O1 .

Since interprocessor communication, or accessing data from various levels of


the memory hierarchy in a multiprocessor, is much more time consuming than
arithmetic operations, efficient parallel algorithms often attempt to reduce such
communication at the cost of introducing more arithmetic operations than those
needed by the best sequential scheme. Therefore, the arithmetic redundancy is
often higher than 1. Note also that O1 and T1 are the same.
• S p and E p , respectively denote the speedup and efficiency of the parallel algorithm.
These are defined by

S_p = T_1/T_p,  and  E_p = S_p/p.

The study of speedups is central in most performance evaluation studies undertaken


for parallel programs. Speedups and efficiencies are also directly related to the
notion of scalability, discussed at the end of the current chapter.
• V p denotes the computational rate, defined by

V_p = O_p/T_p.

Based on these definitions, V_1 = O_1/T_1 = 1.


• Whenever referenced as above, it would be assumed that the size of the problem is
readily available. Sometimes, in our discussions, it would be desirable to express
these metrics as functions of the problem size. In these cases, we write T p (n),
O p (n), where n is the problem size.
It is important to note that in order to avoid notation clutter, when we present the
number of steps, speedup, number of arithmetic operations and number of processors
that turn out to be rational functions in the problem size, it must be understood that we are referring to an integer value. Typically this is the ceiling function of the fraction.

Fig. 1.1 SIMD architecture: a single control unit (CU) drives the processing elements (PE), which access the memory banks (MB) through an interconnection network

1.1.2 Single Instruction Multiple Data Architectures and Pipelining

The earliest computers constructed aiming at high performance via the use of paral-
lelism were of the Single Instruction Multiple Data (SIMD) type, according to the
standard classification of parallel architectures [9]. These are characterized by the
fact that several processing elements (PE) are controlled by a single control unit
(CU); see Fig. 1.1, where MB denotes the memory banks.
For executing a program on such an architecture,
• the CU runs the program and sends: (i) the characteristics of the vectors to the
memory, (ii) the permutation to be applied to these vectors by the interconnection
network in order to achieve the desired alignment of the operands, and (iii) the
instructions to be executed by the PEs,
• the memory delivers the operands which are permuted through the interconnection
network,
• the processors perform the operation, and
• the results are sent back via the interconnection network to be correctly stored in
the memory.
The computational rate of the operation depends on the vector length. Denoting the
duration of the operation on one slice of N ≤ p components by t, the parallel steps
needed to perform the operation on a vector of length n is given by
 
T_p = ⌈n/p⌉ t.

Thus, the computational rate realized via the use of this SIMD architecture is given
by,

Fig. 1.2 Computational rate and efficiency of a vector operation on p = 16 PEs of a SIMD architecture (V_p(n) plotted against the vector length n, normalized by r∞)

V_p = n/T_p,

with an asymptotic computational rate of r∞ = p/t, which is reached when the


vector length is a multiple of the number of processors p (see Fig. 1.2).
Large scale SIMD architectures have totally disappeared. Current CPUs, on the
other hand, offer instructions for SIMD processing while graphics processing units
(GPUs) utilize related concepts, such as SIMT (Single Instruction stream Multiple
Threads); cf. [6, 10].
Vector operations can also be realized using the principles of pipelining. As in a factory assembly line, where the fabrication of an item is split into elementary steps and the items under construction travel from one station to the next, once the chain is entirely filled a finished result emerges at every step. Consider, as an example, a pipeline for a vector floating-point addition: the addition of two vectors of floating-point numbers,

ci = ai + bi , for i = 1, . . . , n,

is typically decomposed into the four following stages:


1. comparing exponents,
2. aligning the mantissas,
3. adding the mantissas,
4. normalizing the resulting floating-point number.
This pipeline scheme is depicted in Fig. 1.3.
In order to describe the performance of a pipeline, we introduce the following
auxiliary quantities:
• s denotes the number of stages of the pipeline;
• τ denotes the time consumed by each stage (which is assumed to be uniform);
• t0 = (s − 1)τ denotes the start-up time required to fill the pipeline, so that the first
result is delivered after time t0 + τ .

Fig. 1.3 Floating-point adder pipeline: successive operand pairs (x_i, y_i) traverse the stages compare exponents (C), align mantissas (A), add mantissas (S), and normalize (N); once the pipeline is full, one sum is delivered every τ

Thus, the elapsed time corresponding to the addition of two vectors of length n is
given by,

T p = t0 + nτ

which ensures a computational rate of

V_p = n/T_p = n/(t_0 + nτ).    (1.1)

Figure 1.4 depicts the computational rate realized with respect to the vector length. The asymptotic computational rate, r∞ = 1/τ, is not reached for finite vector lengths. Half of that asymptotic computational rate, however, is reached for a vector of length n_{1/2} = t_0/τ = s − 1. The two numbers (r∞ and n_{1/2}) characterize the pipeline performance since it is easy to see that

V_p = r∞ (1 − n_{1/2}/(n + n_{1/2})).

See [7, 11] for more details on this useful metric.
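To make the (r∞, n_{1/2}) characterization concrete, the small C routine below (a sketch; the stage count s and stage time τ are hypothetical inputs) evaluates the model rate of Eq. (1.1):

#include <stdio.h>

/* Model rate of a pipelined vector operation, Eq. (1.1):
   V(n) = n / (t0 + n*tau), with start-up time t0 = (s-1)*tau.        */
double pipeline_rate(double n, int s, double tau) {
    double t0 = (s - 1) * tau;          /* time to fill the pipeline   */
    return n / (t0 + n * tau);          /* results per unit time       */
}

int main(void) {
    int s = 4;          /* number of stages (hypothetical)             */
    double tau = 1.0;   /* time per stage (hypothetical unit)          */
    printf("r_inf = %g, n_1/2 = %g\n", 1.0 / tau, (double)(s - 1));
    for (int n = 1; n <= 128; n *= 2)
        printf("n = %4d  V = %.3f\n", n, pipeline_rate(n, s, tau));
    return 0;
}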


Most of the pipelines load their operands from vector registers and not directly
from the memory. With a vector register of length N (typically N = 64 or 128), vector

Fig. 1.4 Computational rate of a pipelined operation (V_p(n) plotted against the vector length n, normalized by r∞)

operands of length n must be partitioned into slices each of length N . This approach
favors operations on short vectors since it decreases the start-up time. By assuming
that vector registers are ready with slices of the operands so that an operation is
immediately performed on a given slice once the operation on a previous slice is
completed, then the rate due to pipelining can be expressed as,
V_p(n) = n/(k t_0 + nτ),

which is an adaptation of (1.1), where k = ⌈n/N⌉ denotes the number of slices of the
operands (the last slice might be shorter than N ). The corresponding behavior of the
speedup is depicted in Fig. 1.5 in which r∞ is equal to 1/τ .
Even higher performance can be achieved when the underlying system allows
chaining, that is letting one pipeline deliver its results directly to another one that
performs the subsequent computation; cf. [6].

Fig. 1.5 Computational rate of a pipelined operation with vector registers (N = 64)

For illustration, let us consider the following vector operation that is quite common
in a variety of numerical algorithms (see Sect. 2.1):

di = ai + bi ci , for i = 1, . . . , n.

Assuming that the time for each stage of the two pipelines (which performs the
multiplication and the addition) is equal, by chaining the pipelines, the results are
still obtained at the same rate. Therefore, the speedup is doubled since two operations
are performed at the same time. It is worth noting that in several cases pipelining and
chaining are used to increase the performance of elementary operations. In particular,
the scalar operation a + bc is implemented as a single instruction, often called Fused
Multiply-Add (FMA). This also delivers results with smaller roundoff than if one
were to implement it as two separate operations; cf. [12].
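As an illustration, the sketch below uses the C standard library function fma() to express d_i = a_i + b_i c_i with a single fused multiply-add per component; it only demonstrates the operation, not a particular vendor's pipeline.

#include <math.h>   /* fma(x, y, z) computes x*y + z with a single rounding */
#include <stdio.h>

/* d[i] = a[i] + b[i]*c[i]; on hardware with FMA units each iteration
   maps to one fused instruction.                                        */
void fused_update(int n, const double *a, const double *b,
                  const double *c, double *d) {
    for (int i = 0; i < n; i++)
        d[i] = fma(b[i], c[i], a[i]);   /* b[i]*c[i] + a[i] */
}

int main(void) {
    double a[3] = {1, 2, 3}, b[3] = {4, 5, 6}, c[3] = {7, 8, 9}, d[3];
    fused_update(3, a, b, c, d);
    printf("%g %g %g\n", d[0], d[1], d[2]);   /* 29 42 57 */
    return 0;
}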
Pipelining can also be applied to the interconnection networks between memories and PEs, as well as to those connecting the processors of a multiprocessor, to enhance memory bandwidth.

1.1.3 Multiple Instruction Multiple Data Architectures

Next, we consider architectures belonging to the Multiple Instruction Multiple Data


(MIMD) class. In such a class, each of the PEs has its own CU and all the PEs are
able to run distinct tasks in parallel. This class is subdivided into two major subclasses:
the shared memory, and the distributed memory architectures. In the first type, all the
PEs exchange their data through a global memory while in the second type, the PEs
exchange their data with messages sent through an interconnection network. These
two MIMD organizations are depicted in Figs. 1.6 and 1.7, respectively.
By expressing parallelism at the loop level, a shared memory architecture allows
a programming style that is closer to that of programming for a uniprocessor. In
such a context, exclusion mechanisms are used when instructions for reading and writing must be synchronized to operate in parallel on the same variables.

Fig. 1.6 MIMD architecture with shared memory: the CPUs access a common memory through an interconnection network

Fig. 1.7 MIMD architecture with distributed memory: each CPU has its own local memory (M), and data are exchanged as messages through an interconnection network

Most of


MIMD computing platforms with a massive number of processors belong to the
distributed memory class. In such a class, the computing platform consists of many
nodes in which each node is a multicore shared-memory architecture with its own
memory. The nodes are interconnected via physical links. An important issue
concerns the topology of the underlying connection network. The most powerful
network is the crossbar switch which connects any node to all the rest. Unfortunately,
for p nodes, it necessitates p( p − 1)/2 links, which makes it infeasible for large p.
Other topologies have been investigated for connecting a large number of nodes.
Since these topologies do not connect each node to all the rest, they imply routing
procedures that permit all possible exchanges of data. An important characteristic
of a network is its diameter, i.e. the maximum distance between two nodes (the
distance between two nodes is the minimum length of all paths connecting them).
For a crossbar network, the diameter is minimum, i.e. d = 1, while for a linear
array the diameter is maximum, i.e. d = p − 1. For other interconnections such as
rings, hypercubes and tori, the respective diameters satisfy 1 < d < p − 1. Clearly,
all other things being equal, the smaller the diameter, the faster the interprocessor
communication.
The usual way of modeling the time spent in sending a vector of length n between
two processors is by β + nτc , where β is the latency and τc the time for sending a
single element of the vector, which is independent of the latency.
More elaborate models of parallel architectures can be found in [13–15].

1.1.4 Hierarchical Architectures

Many machines today adopt a hierarchical design in the sense that both the process-
ing and memory systems are organized in layers, each having specific characteristics.
Regarding the processing hierarchy, it consists of units capable of scalar processing,
vector and SIMD processing, multiple cores each able to process one or multiple
threads, the multiple cores organized into interconnected clusters. The memory hierarchy consists of short and long registers, various levels of cache memory local
to each processor, and memory shared between processors in a cluster. In many
respects, it can be argued that most parallel systems are of this type or simplified
versions thereof.

1.2 Principles of Parallel Programming

We present here a simplified picture of parallel programming by introducing two


main standards for MIMD programming: MPI and OpenMP. For a more complete
description of these two paradigms, the reader is referred to the extensive litera-
ture on parallel programming such as [16, 17] and the general books on parallel
processing referenced at the beginning of the current chapter. The main difference
between these two programming paradigms could be stated simply as follows: MPI
is a programming model for MIMD architectures with the necessary interprocessor
communications whereas in OpenMP, programs are written using directives (prag-
mas) in the source code to steer the compiler in the restructuring of loops (a key finding
in the early 1970s was that loops rather than arithmetic expressions must be the key
target for increasing the speedup of parallel programs) and other program constructs.
MIMD Programming
Since an MIMD architecture is a collection of p independent processors, the straight-
forward MIMD approach would consider p distinct programs to be run on the proces-
sors with a mechanism for exchanging information. The most common version of
such an approach is to consider the Single Program Multiple Data organization
(SPMD) [18]. This implies, in turn, a common but independently run program on
each processor. At first sight, this seems to limit the ability of having distinct tasks
executed on the processors. If the first instruction of the common program, however,
is a switch among all the possible tasks depending on the processor identity, it is then
clear that SPMD mimics the original MIMD approach.
In this model it is possible to simulate a parallel architecture of q processes using
a smaller number of processors.
The most common software system for message passing programming is the Mes-
sage Passing Interface (MPI). This is a standard for the syntax and semantics of a
set of basic library routines for implementing message passing on the underlying
computer platform; cf. [19, 20]. Writing programs in MPI allows program porta-
bility between different distributed computing systems. MPI can also be used on a
single multicore node. In message passing, there are several types of communica-
tions defined, blocking for synchronous and non-blocking for asynchronous; cf. [17].
Moreover, the type of communication can be point-to-point or global.
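A minimal SPMD skeleton in C with MPI is sketched below; the task names are hypothetical placeholders, the point being that the initial branch on the process rank is what lets a single program mimic MIMD execution.

#include <mpi.h>
#include <stdio.h>

/* SPMD: one common program; the first action is a switch on the
   process identity (rank).                                          */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("process 0 of %d: could run task A\n", size);
    else
        printf("process %d of %d: could run task B\n", rank, size);

    MPI_Finalize();
    return 0;
}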
Expressing Parallelism at the Loop Level
A second option in parallel programming is by means of an application programming
interface where parallelism is expressed by directives that mainly apply to loops, in

Table 1.1 Loops considered in this book

which each iteration is independent of the rest. These directives steer the compiler in
its restructuring of the program. The most prominent programming paradigm of this
category is called OpenMP (Open Multiprocessing Application Program Interface),
e.g. see [21, 22]. In such a paradigm, the tasks are implemented via threads.
In a given program, directives allow the user to define parallel regions which
invoke fork and join mechanisms at the beginning and the end of the region, with the
ability to define the shared variables of the region. Other directives allow specifying
parallelism through loops. For parallel loops, the programmer specifies whether the
distribution of the iterations through the tasks (threads) is static or dynamic. Static
distribution lowers the overhead cost whereas dynamic allocation adapts better to
irregular task loads. Several techniques are provided for synchronizing threads which
include locks, barriers, and critical sections.
In this book, when necessary for the sake of illustration, we consider three types
of loops which are shown in Table 1.1:
(a) do in which the iterations are run sequentially,
(b) doall in which the iterations are run independently and possibly simultaneously,
(c) doacross in which the iterations are pipelined.
Table 1.1 illustrates these three cases. The doacross loop enables the use of pipelining
between iterations, so that an iteration can start before the previous one has been
completed. For this reason, it also depends on the use of synchronization mechanisms
(such as wait and post) to control the execution of the iterations.
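The following C fragment sketches how a doall loop and a reduction are typically expressed with OpenMP directives; the vector length and schedule choice are illustrative only, and a doacross-style pipeline would additionally require OpenMP's synchronization constructs (not shown).

#include <stdio.h>
#define N 1000

int main(void) {
    static double x[N], y[N];
    double alpha = 2.0, s = 0.0;
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 1.0; }

    /* doall: independent iterations; the schedule clause selects a
       static or dynamic distribution of iterations among the threads. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        y[i] = y[i] + alpha * x[i];

    /* a reduction clause handles the recurrence hidden in an inner product */
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < N; i++)
        s += x[i] * y[i];

    printf("s = %g\n", s);
    return 0;
}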
It is also worth noting that frequently, it is best to write programs that utilize both
of the above paradigms, that is MPI and OpenMP. This hybrid mode is natural in
order to take advantage of the hierarchical structure of high performance computer
systems.

1.2.1 From Amdahl’s Law to Scalability

It was observed quite early in the history of parallel computing that there is a limit
to the available parallelism in a given fixed-size computation. The limit is governed

by the percentage of the inherently sequential portion of the computation and in


turn limits the speedup and efficiency that can be achieved by the program. This
observation is known as Amdahl’s law and can be expressed as follows:

Proposition 1.1 (Amdahl's law) [23] Let O_p be the number of operations of a parallel program implemented on p processors. If the portion f_p (0 < f_p < 1) of the O_p operations is inherently sequential, then the speedup is bounded by S_p < 1/f_p.

Proof From the definitions of T_1 and T_p it is obvious that T_p ≥ (f_p + (1 − f_p)/p) T_1; the above upper bound immediately follows.
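The bound is easy to tabulate; the small C sketch below (with a hypothetical sequential fraction of 5%) shows how quickly the speedup saturates as p grows.

#include <stdio.h>

/* Amdahl bound: S_p <= 1/(f + (1-f)/p) < 1/f, where f is the
   inherently sequential fraction of the operations.              */
double amdahl_bound(double f, int p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    double f = 0.05;                          /* hypothetical value */
    for (int p = 2; p <= 1024; p *= 4)
        printf("p = %5d   S_p <= %.2f\n", p, amdahl_bound(f, p));
    printf("limit 1/f = %.2f\n", 1.0 / f);
    return 0;
}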

This simple result gives a rule-of-thumb for the maximum speedup and efficiency
that can be expected of a given algorithm. The limits to parallel processing implied
by the law but also the limits of the law’s basic assumptions have been discussed
extensively in the literature, by designers of parallel systems and algorithms; for
example, cf. [24–31]. We make the following remarks.

Remark 1.1 Even for a fully parallel program, there is a loss of efficiency due to
the overhead (interprocessor communication time, memory references, and parallel
management) which usually implies that S p < p.

Remark 1.2 An opposite situation may occur when the observed speedup is “super-
linear”, i.e. S p > p. This happens, for example, when the data set is too large for the
local memory of one processor, whereas storing it across the p processor memories
becomes possible. In fact, the ability to manipulate larger datasets is an important
advantage of parallel processing that is becoming especially relevant in the context
of data intensive computations.

Remark 1.3 The speedup bound in Amdahl’s law refers strictly to the performance of
a program running in single-user mode on a parallel computer, using one or all of the
processors of the system. This was a reasonable assumption at the time of large-scale
SIMD systems, but today one can argue that it is too strong of a simplification. For
instance, it does not consider the effect of systems that can handle multiple parallel
computations across the processors. It also does not capture the possibility of parallel
systems consisting of heterogeneous processors with different performance.

Remark 1.4 The aspect of Amdahl’s law that has been under criticism by various
researchers is the assumption that the program whose speedup is evaluated is solving a
problem of fixed size. As was observed by Gustafson in [32], the problem size should
be allowed to vary with the number of processors. This meant that the parallel fraction
of the algorithm is not constant as the number of processors varies. We discuss this
issue in greater detail next.

The performance evaluation of parallel programs usually relies on the graph of


p → S p , that is that of the “processor to speedup” mapping. A graph close to
the straight line p → p corresponds to a program whose performance (in terms
of its speedup and efficiency) is maintained as the number of processors increases.

Assuming that this holds independently of the size of the problem, the program is
characterized as strongly scalable.
In many cases, the speedup graph exhibits a two-stage behavior: there is some
threshold value, say p̃, such that S p increases linearly as long as p̃ ≥ p, whereas for
p > p̃, S p stagnates or even decreases. This is the sign that for p > p̃, the overhead,
i.e. the time spent in managing parallelism, becomes too high when the size of the
problem is too small for that number of processors.
The above performance constraint can be partially or fully removed, if we allow
the size of the problem to increase with the number of processors. Intuitively, it is
not surprising to expect that as the computer system becomes more powerful, so
would the size of the problem to be solved. This also means that the fraction of the
computations that are computed in parallel does not remain constant as the number
of processors increases. The notion of weak scalability is then relevant.

Definition 1.1 (Weak Scalability) Consider a program capable of solving a family of


problems depending on a size parameter n p associated to the number p of processors
on which the program is run. Let O p (n p ) = p O1 (n 1 ), where O p (n p ), O1 (n 1 ) are
the total number of operations necessary for the parallel and sequential programs of
size n 1 and n p respectively, assuming that there is no arithmetic redundancy. Also
let T p (n p ) denote the steps of the parallel algorithm on p processors for a problem
of size n p . Then a program is characterized as being weakly scalable if the sequence
T p (n p ) is constant.

Naturally, weak scalability is easier to achieve than strong scalability, which seeks
constant efficiency for a problem of fixed size. On the other hand, even when selecting the largest possible problem size that can be run on one processor, the problem may still be too small to be run efficiently on a large number of processors.
In conclusion, we note that in order to investigate the scalability potential of a
program, it is well worth analyzing the graph of the mapping p → (1 − f p )O p , that
is “processors to the total number of operations that can be performed in parallel”
(assuming no redundancy). Defining as problem size, O1 , that is the total number
of operations that are needed to solve the problem, one question is how fast should
the problem size increase in order to keep the efficiency constant as the number of
processors increases. An appropriate rate, sometimes called isoefficiency [33], could
indeed be neither of the two extremes, namely the constant problem size of Amdahl
and the linear increase suggested in [32]; cf. [24] for a discussion.

References

1. Arbenz, P., Petersen, W.: Introduction to Parallel Computing. Oxford University Press (2004)
2. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation. Prentice Hall, Engle-
wood Cliffs (1989)
3. Culler, D., Singh, J., Gupta, A.: Parallel Computer Architecture: A Hardware/Software
Approach. Morgan Kaufmann, San Francisco (1998)

4. Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and
Analysis of Algorithms, 2nd edn. Addison-Wesley (2003)
5. Casanova, H., Legrand, A., Robert, Y.: Parallel Algorithms. Chapman & Hall/CRC Press (2008)
6. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach. Elsevier Science
& Technology (2011)
7. Hockney, R., Jesshope, C.: Parallel Computers 2: Architecture, Programming and Algorithms,
2nd edn. Adam Hilger (1988)
8. Tchuente, M.: Parallel Computation on Regular Arrays. Algorithms and Architectures for
Advanced Scientific Computing. Manchester University Press (1991)
9. Flynn, M.: Some computer organizations and their effectiveness. IEEE Trans. Comput. C-21,
948–960 (1972)
10. Jeffers, J., Reinders, J.: Intel Xeon Phi Coprocessor High Performance Programming, 1st edn.
Morgan Kaufmann Publishers Inc., San Francisco (2013)
11. Hockney, R.: The Science of Computer Benchmarking. SIAM, Philadelphia (1996)
12. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
13. Karp, R., Sahay, A., Santos, E., Schauser, K.: Optimal broadcast and summation in the logP
model. In: Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Archi-
tectures SPAA’93, pp. 142–153. ACM Press, Velen (1993). http://doi.acm.org/10.1145/165231.
165250
14. Culler, D., Karp, R., Patterson, D., Sahay, A., Schauser, K.E., Santos, E., Subramonian, R., von Eicken, T.: LogP: towards a realistic model of parallel computation. In: Principles, Practice of Parallel
Programming, pp. 1–12 (1993). http://citeseer.ist.psu.edu/culler93logp.html
15. Pjesivac-Grbovic, J., Angskun, T., Bosilca, G., Fagg, G.E., Gabriel, E., Dongarra, J.: Performance
analysis of MPI collective operations. In: Fourth International Workshop on Performance
Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS’05).
Denver (2005). (Submitted)
16. Breshears, C.: The Art of Concurrency - A Thread Monkey’s Guide to Writing Parallel Appli-
cations. O’Reilly (2009)
17. Rauber, T., Rünger, G.: Parallel Programming—for Multicore and Cluster Systems. Springer
(2010)
18. Darema, F.: The SPMD model: past, present and future. In: Recent Advances in Parallel
Virtual Machine and Message Passing Interface. LNCS, vol. 2131/2001, p. 1. Springer, Berlin
(2001)
19. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message
Passing Interface. MIT Press, Cambridge (1994)
20. Snir, M., Otto, S., Huss-Lederman, S., Walker, D., Dongarra, J.: MPI: The Complete Reference
(1995). http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
21. Chapman, B., Jost, G., Pas, R.: Using OpenMP: Portable Shared Memory Parallel Program-
ming. The MIT Press, Cambridge (2007)
22. OpenMP Architecture Review Board: OpenMP Application Program Interface (Version 3.1).
(2011). http://www.openmp.org/mp-documents/
23. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing
capabilities. Proc. AFIPS Spring Jt. Comput. Conf. 31, 483–485 (1967)
24. Juurlink, B., Meenderinck, C.: Amdahl’s law for predicting the future of multicores considered
harmful. SIGARCH Comput. Archit. News 40(2), 1–9 (2012). doi:10.1145/2234336.2234338.
http://doi.acm.org/10.1145/2234336.2234338
25. Hill, M., Marty, M.: Amdahl’s law in the multicore era. In: HPCA, p. 187. IEEE Computer
Society (2008)
26. Sun, X.H., Chen, Y.: Reevaluating Amdahl’s law in the multicore era. J. Parallel Distrib.
Comput. 70(2), 183–188 (2010)
27. Flatt, H., Kennedy, K.: Performance of parallel processors. Parallel Comput. 12(1), 1–20
(1989). doi:10.1016/0167-8191(89)90003-3. http://www.sciencedirect.com/science/article/
pii/0167819189900033

28. Kuck, D.: High Performance Computing: Challenges for Future Systems. Oxford University
Press, New York (1996)
29. Kuck, D.: What do users of parallel computer systems really need? Int. J. Parallel Program.
22(1), 99–127 (1994). doi:10.1007/BF02577794. http://dx.doi.org/10.1007/BF02577794
30. Kumar, V., Gupta, A.: Analyzing scalability of parallel algorithms and architectures. J. Parallel
Distrib. Comput. 22(3), 379–391 (1994)
31. Worley, P.H.: The effect of time constraints on scaled speedup. SIAM J. Sci. Stat. Comput.
11(5), 838–858 (1990)
32. Gustafson, J.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)
33. Grama, A., Gupta, A., Kumar, V.: Isoefficiency: measuring the scalability of parallel algorithms
and architectures. IEEE Parallel Distrib. Technol. 12–21 (1993)
Chapter 2
Fundamental Kernels

In this chapter we discuss the fundamental operations that are the building blocks
of dense and sparse matrix computations. They are termed kernels because in most
cases they account for most of the computational effort. Because of this, their imple-
mentation directly impacts the overall efficiency of the computation. They occur
often at the lowest level where parallelism is expressed.
Most basic kernels are of the form C = C + AB, where A, B and C can be
matrix, vector and possibly scalar operands of appropriate dimensions. For dense
matrices, the community has converged into a standard application programming
interface, termed Basic Linear Algebra Subroutines (BLAS) that have specific syn-
tax and semantics. The set is organized into three separate levels of instructions. The
first part of this chapter describes these sets. It then considers several basic sparse
matrix operations that are essential for the implementation of algorithms presented
in future chapters. In this chapter we frequently make explicit reference to communi-
cation costs, on account of the well known growing discrepancy, in the performance
characteristics of computer systems, between the rate of performing computations
(typically measured by a base unit of the form flops per second) and the rate of
moving data (typically measured by a base unit of the form words per second).

2.1 Vector Operations

Operations on vectors are known as Level_1 Basic Linear Algebraic Subroutines


(BLAS1) [1]. The two most frequent vector operations are the _AXPY and the
_DOT:
_AXPY: given x, y ∈ Rn and α ∈ R, the instruction updates vector y by:
y = y + αx.
_DOT: given x, y ∈ Rn , the instruction computes the inner product of the two
vectors: s = x^T y.
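In practice these primitives are invoked through a BLAS implementation; a minimal C sketch using the CBLAS interface (assuming such a library is available and linked) is:

#include <stdio.h>
#include <cblas.h>   /* C interface to BLAS1: cblas_daxpy, cblas_ddot */

int main(void) {
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    double alpha = 2.0;

    cblas_daxpy(4, alpha, x, 1, y, 1);        /* _AXPY: y = y + alpha*x */
    double s = cblas_ddot(4, x, 1, y, 1);     /* _DOT:  s = x'y         */

    printf("y = %g %g %g %g,  s = %g\n", y[0], y[1], y[2], y[3], s);
    return 0;
}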


A common feature of these instructions is that the minimal amount of data that needs to be read (loaded) from memory and then stored back in order for the operation to take
place is O(n). Moreover, the number of computations required on a uniprocessor
is also O(n). Therefore, the ratio of instructions to load from and store to memory
relative to purely arithmetic operations is O(1).
With p = n processors, the _AXPY primitive requires 2 steps which yields
a perfect speedup, n. The _DOT primitive involves a reduction with the sum of n
numbers to obtain a scalar. We assume temporarily, for the sake of clarity, that n = 2m .
At the first step, each processor computes the product of two components, and the
(0)
result can be expressed as the vector (si )1:n . This computation is then followed by
(k−1)
m steps such that at each step k the vector (si )1:2m−k+1 is transformed into the
(k) (k)
vector (si )1:2m−k by computing in parallel si = s2i−1 (k−1) + s2i (k−1) , for i =
(m)
1, . . . , 2m−k , with the final result being the scalar s1 . Therefore, the inner product
consumes T p = m + 1 = (1 + log n) steps, with a speedup of S p = 2n/(1 + log n)
and an efficiency of E p = 2/(1 + log n).
On vector processors, these procedures can obtain high performance, especially
for the _AXPY primitive which allows chaining of the pipelines for multiplication
and addition.
Implementing these instructions on parallel architectures is not a difficult task. It
is realized by splitting the vectors in slices of the same length, with each processor
performing the operation on its own subvectors. For the _DOT operation, there is an
additional summation of all the partial results to obtain the final scalar. Following
that, this result has to be broadcast to all the processors. These final steps entail extra
costs for data movement and synchronization, especially for computer systems with
distributed memory and a large number of processors.
We analyze this issue in greater detail next, departing on this occasion from the
assumptions made in Chap. 1 and taking explicitly into account the communication
costs in evaluating T p . The message is that inner products are harmful to parallel
performance of many algorithms.
Inner Products Inhibit Parallel Scalability
A major part of this book deals with parallel algorithms for solving large sparse linear
systems of equations using preconditioned iterative schemes. The most effective
classes of these methods are dominated by a combination of a “global” inner product,
that is applied on vectors distributed across all the processors, followed by fan-
out operations. As we will show, the overheads involved in such operations cause
inefficiency and less than optimal speedup.
To illustrate this point, we consider such a combination in the form of the following
primitive for vectors u, v, w of size n that appears often in many computations:

w = w − (u^T v) u.    (2.1)

We assume that the vectors are stored in a consistent way to perform the operations
on the components (each processor stores slices of components of the two vectors

with identical indices). The _DOT primitive involves a reduction and therefore an all-
to-one (fan-in) communication. Since the result of a dot product is usually needed
by all processors in the sequel, the communication actually becomes an all-to-all
(fan-out) procedure.
To evaluate the weak scalability on p processors (see Definition 1.1) by taking
communication into account (as mentioned earlier, we depart here from our usual
definition of T p ), let us assume that n = pq. The number of steps required by the prim-
itive (2.1) is 4q −1. Assuming no overlap between communication and computation,
the cost on p processors, T_p(pq), is the sum of the computational and communication costs, T_p^cal(pq) and T_p^com(pq), respectively. For the all-to-all communication, T_p^com(pq) is given by T_p^com(pq) = K p^γ, in which 1 < γ ≤ 2, with the constant K depending on the interconnection network technology. The computational load, which is well balanced, is given by T_p^cal(pq) = 4q − 1, resulting in T_p(pq) = (4q − 1) + K p^γ, which increases with the number of processors. In fact, once T_p^com(pq) dominates T_p^cal(pq), the total cost T_p(pq) increases almost quadratically with the number of processors.


This fact is of crucial importance in parallel implementation of inner products. It
makes clear that an important goal of a designer of parallel algorithms on distributed
memory architectures is to avoid distributed _DOT primitives as they are detrimental
to parallel scalability. Moreover, because of the frequent occurrence and prominence
of inner products in most numerical linear algebra algorithms, the aforementioned
deleterious effects on the _DOT performance can be far reaching.
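To make the pattern explicit, the sketch below implements primitive (2.1) with MPI on vectors distributed by slices; the slice length is a hypothetical constant, and the MPI_Allreduce call is the all-to-all reduction whose cost grows with the number of processors.

#include <mpi.h>

#define NLOC 1000   /* local slice length q, so that n = p*q (hypothetical) */

/* Primitive (2.1), w = w - (u'v) u, on distributed vectors: local partial
   dot products are combined by an all-to-all reduction.                   */
int main(int argc, char **argv) {
    static double u[NLOC], v[NLOC], w[NLOC];
    MPI_Init(&argc, &argv);
    for (int i = 0; i < NLOC; i++) { u[i] = 1e-3; v[i] = 1.0; w[i] = 2.0; }

    double local = 0.0, dot = 0.0;
    for (int i = 0; i < NLOC; i++) local += u[i] * v[i];    /* 2q - 1 ops */
    MPI_Allreduce(&local, &dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    for (int i = 0; i < NLOC; i++) w[i] -= dot * u[i];      /* 2q ops     */

    MPI_Finalize();
    return 0;
}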

2.2 Higher Level BLAS

In order to increase efficiency, vector operations are often packed into a global task
of higher level. This occurs for the multiplication of a matrix by a vector which in
a code is usually expressed by a doubly nested loop. Classic kernels of this type are
gathered into the set known as Level_2 Basic Linear Algebraic Subroutines (BLAS2)
[2]. The most common operations of this type, assuming general matrices, are
_GEMV : given x ∈ R^n, y ∈ R^m and A ∈ R^{m×n}, this performs the matrix-vector multiplication and accumulation y = y + Ax. It is also possible to multiply a (row) vector by the matrix, and to scale the result before accumulating.
_TRSV : given b ∈ R^n and A ∈ R^{n×n} upper or lower triangular, this solves the triangular system Ax = b.
_GER : given a scalar α, x ∈ R^n, y ∈ R^m and A ∈ R^{m×n}, this performs the rank-one update A = A + αyx^T.
A common feature of these instructions is that the smallest amount of data that needs to be read from memory and then stored back in order for the operation to take place when m = n is O(n^2). Moreover, the number of computations required on a uniprocessor is also O(n^2). Therefore, the ratio of instructions to load from and store to memory relative to purely arithmetic ones is


O(1). Typically, the constants involved are a little smaller than those for the BLAS1 instructions. On the other hand, these kernels are of far more interest in terms of efficiency.
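For reference, a hedged sketch of _GEMV and _GER calls through the CBLAS interface (assuming a column-major layout and an available CBLAS library) looks as follows.

#include <cblas.h>   /* BLAS2 C interface: cblas_dgemv, cblas_dger */

int main(void) {
    /* A is 2x3, stored column by column */
    double A[6] = {1, 4,   2, 5,   3, 6};
    double x[3] = {1, 1, 1}, y[2] = {0, 0};

    /* _GEMV: y = 1.0*y + 1.0*A*x */
    cblas_dgemv(CblasColMajor, CblasNoTrans, 2, 3, 1.0, A, 2, x, 1, 1.0, y, 1);

    /* _GER rank-one update: A = A + 0.5*y*x' (y has m = 2 entries,
       x has n = 3 entries, matching the dimensions of A)            */
    cblas_dger(CblasColMajor, 2, 3, 0.5, y, 1, x, 1, A, 2);

    return 0;
}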
Although of interest, the efficiencies realized by these kernels are easily surpassed
by those of (BLAS3) [3], where one additional loop level is considered, e.g. matrix
multiplication, and rank-k updates (k > 1). The next section of this chapter is devoted
to matrix-matrix multiplications.
The set of BLAS is designed for a uniprocessor and used in parallel programs in
the sequential mode. Thus, an efficient implementation of the BLAS is of the utmost
importance to enable high performance. Versions of the BLAS that are especially
fine-tuned for certain types of processors are available (e.g. the Intel Math Kernel
Library [4] or the open source set GotoBLAS [5]). Alternately, one can create a
parametrized BLAS set which can be tuned on any processor by an automatic code
optimizer, e.g. ATLAS [6, 7]. Yet, it is hard to outperform well designed methods
that are based on accurate architectural models and domain expertise; cf. [8, 9].

2.2.1 Dense Matrix Multiplication

Given matrices A, B and C of sizes n 1 × n 2 , n 2 × n 3 and n 1 × n 3 respectively, the


general operation, denoted by _GEMM, is C = C + AB.
Properties of this primitive include:
• The computation involves three nested loops that may be permuted and split; such
a feature provides great flexibility in adapting the computation for vector and/or
parallel architectures.
• High performance implementations are based on the potential for high data local-
ity that is evident from the relation between the lower bound on the number of
data moved (O(n 2 )) to arithmetic operations, O(n 3 ) for the classical schemes
and O(n 2+μ ) for some μ > 0 for the “superfast” schemes described later in this
section.

Hence, in dense matrix multiplication, it is possible to reuse data stored in cache


memory.
Because of these advantages of dense matrix multiplication over lower level
BLAS, there has been a concerted effort by researchers for a long time now (see
e.g. [10]) to (re)formulate many algorithms in scientific computing to be based on
dense matrix multiplications (such as _GEMM and variants).
A Data Management Scheme for Dense Matrix Multiplications
We discuss an implementation strategy for _GEMM:

C = C + AB. (2.2)

We adopt our discussion from [11], where the authors consider the situation of a
cluster of p processors with a common cache and count loads and stores in their
evaluation. We simplify that discussion and outline the implementation strategy for
a uniprocessor equipped with a cache memory that is characterized by fast access.
The purpose is to highlight some critical design decisions that will have to be faced by
the sequential as well as by the parallel algorithm designer. We assume that reading
one floating-point word of the type used in the multiplication from cache can be
accomplished in one clock period. Since the storage capacity of a cache memory is
limited, the goal of a code developer is to reuse, as much as possible, data stored in
the cache memory.
Let M be the storage capacity of the cache memory and let us assume that matrices
A, B and C are, respectively, n 1 × n 2 , n 2 × n 3 and n 1 × n 3 matrices. Partitioning
these matrices into blocks of sizes m 1 × m 2 , m 2 × m 3 and m 1 × m 3 , respectively,
where n i = m i ki for all i = 1, 2, 3, our goal is then to estimate the block sizes m i
which maximize data reuse under the cache size constraint.
Instruction (2.2) can be expressed as the nested loop,

do i = 1 : k1 ,
do k = 1 : k2 ,
do j = 1 : k3 ,
Ci j = Ci j + Aik × Bk j ;
end
end
end

where Ci j , Aik and Bk j are, respectively, blocks of C, A and B, with subscripts


denoting here the appropriate block indices. The innermost loop refers to the identical
block Aik in all its iterations. To put it in cache, its dimensions must satisfy m 1 m 2 ≤
M. Actually, the blocks Ci j and Bk j must also reside in the cache and the condition
becomes
m 1 m 3 + m 1 m 2 + m 2 m 3 ≤ M. (2.3)

Further, since the blocks are obviously smaller than the original matrices, we need
the additional constraints:

1 ≤ m i ≤ n i for i = 1, 2, 3. (2.4)

Evaluating the volume of the data moves using the number of data loads necessary for
the whole procedure and assuming that the constraints (2.3) and (2.4) are satisfied,
we observe that
• all the blocks of the matrix A are loaded only once;
• the blocks of the matrix B are loaded k1 times;
• the blocks of the matrix C are loaded k2 times.

Thus the total amount of loads is given by:


 
L = n_1 n_2 + n_1 n_2 n_3 (1/m_1 + 1/m_2).    (2.5)

Choosing m 3 = 1, and hence k3 = n 3 , (the multiplications of the blocks are


performed by columns) and neglecting for simplicity the necessary storage of the
columns of the blocks C_ij and B_kj, the values of m_1 and m_2, which minimize 1/m_1 + 1/m_2
under the previous constraints are obtained as follows:

if n_1 n_2 ≤ M then
   m_1 = n_1 and m_2 = n_2;
else if n_2 ≤ √M then
   m_1 = M/n_2 and m_2 = n_2;
else if n_1 ≤ √M then
   m_1 = n_1 and m_2 = M/n_1;
else
   m_1 = √M and m_2 = √M;
end if

In practice, M should be slightly smaller than the total cache volume to allow for
storing the neglected vectors. With this parameter adjustment, at the innermost level,
the block multiplication involves 2m 1 m 2 operations and m 1 +m 2 loads as long as Aik
resides in the cache. This indicates why the matrix multiplication can be a compute
bound program.
The above discussion reveals the important decisions that need to be made by the code developer, and leads to a scheme that is very similar to the parallel multiplication. In [11], the authors consider the situation of a cluster of p processors with a common cache and count loads and stores in their evaluation. However, the final decision tree is similar to the one presented here.
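A sequential C sketch of the resulting blocked scheme is given below; it assumes column-major storage and block sizes m_1, m_2 chosen by the decision tree above, with m_3 = 1 (the blocks of B and C are processed column by column).

#include <stddef.h>

/* Blocked C = C + A*B (column-major, leading dimensions equal to the
   row counts), with an m1 x m2 block of A kept in cache while it is
   reused across all columns of B (i.e. m3 = 1).                       */
void gemm_blocked(size_t n1, size_t n2, size_t n3,
                  const double *A, const double *B, double *C,
                  size_t m1, size_t m2) {
    for (size_t ib = 0; ib < n1; ib += m1)
        for (size_t kb = 0; kb < n2; kb += m2) {
            size_t iend = ib + m1 < n1 ? ib + m1 : n1;
            size_t kend = kb + m2 < n2 ? kb + m2 : n2;
            for (size_t j = 0; j < n3; j++)        /* one column of B and C */
                for (size_t k = kb; k < kend; k++) {
                    double bkj = B[k + j * n2];
                    for (size_t i = ib; i < iend; i++)
                        C[i + j * n1] += A[i + k * n1] * bkj;
                }
        }
}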

2.2.2 Lowering Complexity via the Strassen Algorithm

The classical multiplication algorithms implementing the operation (2.2) for dense
matrices use 2n 3 operations. We next describe the scheme proposed by Strassen
which reduces the number of operations in the procedure [12]: assuming that n is
even, the operands can be decomposed in 2 × 2 matrices of n2 × n2 blocks:
     
C11 C12 A11 A12 B11 B12
= . (2.6)
C21 C22 A21 A22 B21 B22

Then, the multiplication can be performed by the following operations on the blocks

P1 = (A11 + A22 )(B11 + B22 ), C11 = P1 + P4 − P5 + P7 ,


P2 = (A21 + A22 )B11 , C12 = P3 + P5 ,
P3 = A11 (B12 − B22 ), C21 = P2 + P4 ,
P4 = A22 (B21 − B11 ), C22 = P1 + P3 − P2 + P6 . (2.7)
P5 = (A11 + A12 )B22 ,
P6 = (A21 − A11 )(B11 + B12 ),
P7 = (A12 − A22 )(B21 + B22 ),

The computation of Pk and Ci j is referred as one Strassen step. This procedure


involves 7 block multiplications and 18 block additions of blocks, instead of 8 block
multiplications and 4 block additions as the case in the classical algorithm. Since the
complexity of the multiplications is O(n 3 ) whereas for an addition it is only O(n 2 ),
the Strassen approach is beneficial for large enough n. This approach was improved
in [13] by the following sequence of 7 block multiplications and 15 block additions.
It is implemented in the so-called Strassen-Winograd procedure (as expressed in
[14]):

T0 = A11,          S0 = B11,          Q0 = T0 S0,    U1 = Q0 + Q3,
T1 = A12,          S1 = B21,          Q1 = T1 S1,    U2 = U1 + Q4,
T2 = A21 + A22,    S2 = B12 − B11,    Q2 = T2 S2,    U3 = U1 + Q2,
T3 = T2 − A11,     S3 = B22 − S2,     Q3 = T3 S3,    C11 = Q0 + Q1,    (2.8)
T4 = A11 − A21,    S4 = B22 − B12,    Q4 = T4 S4,    C12 = U3 + Q5,
T5 = A12 − T3,     S5 = B22,          Q5 = T5 S5,    C21 = U2 − Q6,
T6 = A22,          S6 = S3 − B21,     Q6 = T6 S6,    C22 = U2 + Q2.

Clearly, (2.7) and (2.8) are still valid for rectangular blocks. If n = 2^γ, the approach can be repeated for implementing the multiplications of the blocks. If it is recursively applied down to 2 × 2 blocks, the total complexity of the process becomes O(n^{ω0}), where ω0 = log 7. More generally, if the process is iteratively applied until we get blocks of order m ≤ n_0, the total number of operations is

T(n) = c_s n^{ω0} − 5n^2,    (2.9)

with c_s = (2n_0 + 4)/n_0^{ω0−2}, which achieves its minimum for n_0 = 8; cf. [14].
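For illustration, a compact sequential C sketch of the recursive scheme with cutoff n_0 is given below; it assumes n is a power of two, uses formulas (2.7), stores matrices contiguously in row-major order, and favors clarity over memory economy (no error checking on the allocations).

#include <stdlib.h>
#include <string.h>

static void add(int n, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < n * n; i++) Z[i] = X[i] + Y[i];
}
static void sub(int n, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < n * n; i++) Z[i] = X[i] - Y[i];
}
static void classical(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
/* copy quadrant (qi,qj) of the n x n matrix X into the h x h matrix Q */
static void get(int n, const double *X, int qi, int qj, double *Q) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(Q + i * h, X + (qi * h + i) * n + qj * h, h * sizeof(double));
}
static void put_add(int n, double *X, int qi, int qj, const double *Q) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++)
            X[(qi * h + i) * n + qj * h + j] += Q[i * h + j];
}

/* C = C + A*B by one Strassen step per level, classical below n0 */
void strassen(int n, const double *A, const double *B, double *C, int n0) {
    if (n <= n0) { classical(n, A, B, C); return; }
    int h = n / 2;
    size_t sz = (size_t)h * h * sizeof(double);
    double *A11=malloc(sz), *A12=malloc(sz), *A21=malloc(sz), *A22=malloc(sz);
    double *B11=malloc(sz), *B12=malloc(sz), *B21=malloc(sz), *B22=malloc(sz);
    double *T=malloc(sz), *S=malloc(sz), *P[7];
    for (int i = 0; i < 7; i++) P[i] = calloc((size_t)h * h, sizeof(double));
    get(n,A,0,0,A11); get(n,A,0,1,A12); get(n,A,1,0,A21); get(n,A,1,1,A22);
    get(n,B,0,0,B11); get(n,B,0,1,B12); get(n,B,1,0,B21); get(n,B,1,1,B22);

    add(h,A11,A22,T); add(h,B11,B22,S); strassen(h,T,S,P[0],n0);   /* P1 */
    add(h,A21,A22,T);                   strassen(h,T,B11,P[1],n0); /* P2 */
    sub(h,B12,B22,S);                   strassen(h,A11,S,P[2],n0); /* P3 */
    sub(h,B21,B11,S);                   strassen(h,A22,S,P[3],n0); /* P4 */
    add(h,A11,A12,T);                   strassen(h,T,B22,P[4],n0); /* P5 */
    sub(h,A21,A11,T); add(h,B11,B12,S); strassen(h,T,S,P[5],n0);   /* P6 */
    sub(h,A12,A22,T); add(h,B21,B22,S); strassen(h,T,S,P[6],n0);   /* P7 */

    /* C11 += P1+P4-P5+P7, C12 += P3+P5, C21 += P2+P4, C22 += P1+P3-P2+P6 */
    add(h,P[0],P[3],T); sub(h,T,P[4],T); add(h,T,P[6],T); put_add(n,C,0,0,T);
    add(h,P[2],P[4],T);                                   put_add(n,C,0,1,T);
    add(h,P[1],P[3],T);                                   put_add(n,C,1,0,T);
    add(h,P[0],P[2],T); sub(h,T,P[1],T); add(h,T,P[5],T); put_add(n,C,1,1,T);

    free(A11); free(A12); free(A21); free(A22);
    free(B11); free(B12); free(B21); free(B22);
    free(T); free(S); for (int i = 0; i < 7; i++) free(P[i]);
}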
The numerical stability of the above methods has been considered by several
authors. In [15], it is shown that the rounding errors in the Strassen algorithm can
be worse than those in the classical algorithm for multiplying two matrices, with the
situation somewhat worse in Winograd’s algorithm. However, ref. [15] indicates that
it is possible to get a fast and stable version of _GEMM by incorporating in it steps
from the Strassen or Winograd-type algorithms.
Both the Strassen algorithm (2.7), and the Winograd version (2.8), can be imple-
mented on parallel architectures. In particular, the seven block multiplications are
independent, as well as most of the block additions. Moreover, each of these opera-
tions has yet another inner level of parallelism.

A parallel implementation must allow for the recursive application of several


Strassen steps while maintaining good data locality. The Communication-Avoiding
Parallel Strassen (CAPS) algorithm, proposed in [14, 16], achieves this aim. In CAPS, the Strassen steps are implemented by combining two strategies: all
the processors cooperate in computing the blocks Pk and Ci j whenever the local
memories are not large enough to store the blocks. The remaining Strassen steps
consist of block operations that are executed independently on seven sets of proces-
sors. The latter minimizes the communications but needs extra memory. CAPS is
asymptotically optimal with respect to computational cost and data communication.
Theorem 2.1 ([14]) CAPS has computational cost Θ(n^{ω0}/p) and requires bandwidth Θ(max(n^{ω0}/(p M^{ω0/2−1}), log p)), assuming p processors, each with local memory of size M words.
Experiments in [17] show that CAPS uses less communication than some commu-
nication optimal classical algorithms and much less than previous implementations
of the Strassen algorithm. As a result, it can outperform both classical algorithms
for large sized problems, because it requires fewer operations, as well as for small
problems, because of lower communication costs.

2.2.3 Accelerating the Multiplication of Complex Matrices

Savings may be realized in multiplying two complex matrices, e.g. see [18]. Let
A = A_1 + iA_2 and B = B_1 + iB_2 be two complex matrices, where A_j, B_j ∈ R^{n×n} for j = 1, 2. The real and imaginary parts C_1 and C_2 of the matrix C = AB can
be obtained using only three multiplications of real matrices (and not four as in the
classical expression):

T1 = A1 B1 , C1 = T1 − T2 ,
(2.10)
T2 = A2 B2 , C2 = (A1 + A2 )(B1 + B2 ) − T1 − T2 .

The savings are realized through the way the imaginary part C2 is computed. Unfor-
tunately, the above formulation may suffer from catastrophic cancellations, [18].
For large n, there is a 25 % benefit in arithmetic operations over the conven-
tional approach. Although remarkable, this benefit does not lower the complexity
which remains the same, i.e. O(n^3). To push this advantage further, one may use Strassen's approach in the three matrix multiplications above to realize O(n^{ω0}) arithmetic operations.
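A straightforward C sketch of this three-multiplication scheme is shown below; the product routine mul() is the classical one and could be replaced by any _GEMM or Strassen-based kernel.

#include <stdlib.h>

/* classical real product Z = Z + X*Y (row-major, n x n) */
static void mul(int n, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                Z[i * n + j] += X[i * n + k] * Y[k * n + j];
}

/* C = (A1 + i A2)(B1 + i B2) via Eq. (2.10): three real products only */
void cmatmul_3m(int n, const double *A1, const double *A2,
                const double *B1, const double *B2,
                double *C1, double *C2) {
    int nn = n * n;
    double *T1 = calloc(nn, sizeof *T1), *T2 = calloc(nn, sizeof *T2);
    double *T3 = calloc(nn, sizeof *T3);
    double *SA = malloc(nn * sizeof *SA), *SB = malloc(nn * sizeof *SB);
    mul(n, A1, B1, T1);                               /* T1 = A1*B1        */
    mul(n, A2, B2, T2);                               /* T2 = A2*B2        */
    for (int i = 0; i < nn; i++) { SA[i] = A1[i] + A2[i]; SB[i] = B1[i] + B2[i]; }
    mul(n, SA, SB, T3);                               /* (A1+A2)(B1+B2)    */
    for (int i = 0; i < nn; i++) {
        C1[i] = T1[i] - T2[i];                        /* real part         */
        C2[i] = T3[i] - T1[i] - T2[i];                /* imaginary part    */
    }
    free(T1); free(T2); free(T3); free(SA); free(SB);
}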
Parallelism is achieved at several levels:
• All the matrix operations are additions and multiplications. They can be imple-
mented with full efficiency. In addition, the multiplication can be realized through
the Strassen algorithm as implemented in CAPS, see Sect. 2.2.2.
• The three matrix multiplications are independent, once the two additions are per-
formed.

2.3 General Organization for Dense Matrix Factorizations

In this section, we describe the usual techniques for expressing parallelism in the
factorization schemes (i.e. the algorithms that compute any of the well-known decom-
positions such as LU, Cholesky, or QR). More specific factorizations are included in
the ensuing chapters of the book.

2.3.1 Fan-Out and Fan-In Versions

Factorization schemes can be based on one of two basic templates: the fan-out tem-
plate (see Algorithm 2.1) and the fan-in version (see Algorithm 2.2). Each of these
templates involves two basic procedures which we generically call compute( j) and
update( j, k). The two versions, however, differ only by a single loop interchange.

Algorithm 2.1 Fan-out version for factorization schemes.


do j = 1 : n,
compute( j) ;
do k = j + 1 : n,
update( j, k) ;
end
end

Algorithm 2.2 Fan-in version for factorization schemes.


do k = 1 : n,
do j = 1 : k − 1,
update( j, k) ;
end
compute(k) ;
end

The above implementations are also respectively named as the right-looking and
the left-looking versions. The exact definitions of the basic procedures, when applied
to a given matrix A, are displayed in Table 2.1 together with their arithmetic complex-
ities on a uniprocessor. They are based on a column oriented organization. For the
analysis of loop dependencies, it is important to consider that column j is unchanged
by task update( j, k) whereas column k is overwritten by the same task; column j is
overwritten by task compute( j).
The two versions are based on vector operations (i.e. BLAS1). It can be seen,
however, that for a given j, the inner loop of the fan-out algorithm is a rank-one
update (i.e. BLAS2), with a special feature for the Cholesky factorization, where
only the lower triangular part of A is updated.

Table 2.1 Elementary factorization procedures; MATLAB index notation used for submatrices

Cholesky on A ∈ R^{n×n}:
  C: A(j:n, j) = A(j:n, j)/√A(j, j)
  U: A(k:n, k) = A(k:n, k) − A(k:n, j) A(k, j)
  Complexity: (1/3)n^3 + O(n^2)

LU on A ∈ R^{n×n} (no pivoting):
  C: A(j+1:n, j) = A(j+1:n, j)/A(j, j)
  U: A(j+1:n, k) = A(j+1:n, k) − A(j+1:n, j) A(j, k)
  Complexity: (2/3)n^3 + O(n^2)

QR on A ∈ R^{m×n} (Householder):
  C: u = house(A(j:m, j)) and β = 2/‖u‖^2
  U: A(j:m, k) = A(j:m, k) − βu(u^T A(j:m, k))
  Complexity: 2n^2(m − n/3) + O(mn)

MGS on A ∈ R^{m×n} (Modified Gram-Schmidt):
  C: A(1:m, j) = A(1:m, j)/‖A(1:m, j)‖
  U: A(1:m, k) = A(1:m, k) − A(1:m, j)(A(1:m, j)^T A(1:m, k))
  Complexity: 2mn^2 + O(mn)

Notation: C denotes compute(j); U denotes update(j, k)
v = house(u): computes the Householder vector (see [19])
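As an illustration of these tasks, a sequential C sketch of the fan-out Cholesky factorization is given below (column-major storage, lower triangle only, no blocking); the inner loop over k consists of independent tasks and is the one that can be run as a doall.

#include <math.h>

/* compute(j): scale column j by the square root of the pivot */
static void compute_j(int n, double *A, int j) {
    double d = sqrt(A[j + j * n]);
    for (int i = j; i < n; i++) A[i + j * n] /= d;
}
/* update(j,k), k > j: column k is overwritten, column j is unchanged */
static void update_jk(int n, double *A, int j, int k) {
    double akj = A[k + j * n];
    for (int i = k; i < n; i++)
        A[i + k * n] -= A[i + j * n] * akj;
}

void cholesky_fanout(int n, double *A) {
    for (int j = 0; j < n; j++) {
        compute_j(n, A, j);
        for (int k = j + 1; k < n; k++)   /* independent: a doall loop */
            update_jk(n, A, j, k);
    }
}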

2.3.2 Parallelism in the Fan-Out Version

In the fan-out version, the inner loop (loop k) of Algorithm 2.1 involves independent
iterations whereas in the fan-in version, the inner loop (loop j) of Algorithm 2.2
must be sequential because of a recursion on vector k.
The inner loop of Algorithm 2.1 can be expressed as a doall loop. The resulting
algorithm is referred to as Algorithm 2.3.

Algorithm 2.3 do/doall fan-out version for factorization schemes.


do j = 1 : n,
compute( j) ;
doall k = j + 1 : n,
update( j, k) ;
end
end

At the outer iteration j, there are n − j independent tasks with identical cost.
When the outer loop is regarded as a sequential one, idle processors will result at
the end of most of the outer iterations. Let p be the number of processors used,
and for the sake of simplicity, let n = pq + 1 and assume that the time spent
by one processor in executing task compute( j) or task update( j, k) is the same
which is taken as the time unit. Note that this last assumption is valid only for
the Gram-Schmidt orthogonalization, since for the other algorithms, the cost of
task compute( j) and task update( j, k) are proportional to n − j or even smaller
for the Cholesky factorization. A simple computation shows that the sequential
process consumes T_1 = n(n + 1)/2 steps, whereas the parallel process on p processors consumes T_p = 1 + p Σ_{i=2}^{q+1} i = pq(q + 3)/2 + 1 = (n − 1)(n − 1 + 3p)/(2p) + 1 steps.

Table 2.2 Benefit of pipelining the outer loop in MGS (QR factorization); notation: C(j) = compute(j), U(j,k) = update(j,k)

(a) Sequential outer loop:
 1: C(1)
 2: U(1,2) U(1,3) U(1,4) U(1,5)
 3: U(1,6) U(1,7) U(1,8) U(1,9)
 4: C(2)
 5: U(2,3) U(2,4) U(2,5) U(2,6)
 6: U(2,7) U(2,8) U(2,9)
 7: C(3)
 8: U(3,4) U(3,5) U(3,6) U(3,7)
 9: U(3,8) U(3,9)
10: C(4)
11: U(4,5) U(4,6) U(4,7) U(4,8)
12: U(4,9)
13: C(5)
14: U(5,6) U(5,7) U(5,8) U(5,9)
15: C(6)
16: U(6,7) U(6,8) U(6,9)
17: C(7)
18: U(7,8) U(7,9)
19: C(8)
20: U(8,9)
21: C(9)

(b) doacross outer loop:
 1: C(1)
 2: U(1,2) U(1,3) U(1,4) U(1,5)
 3: C(2) U(1,6) U(1,7) U(1,8)
 4: U(1,9) U(2,3) U(2,4) U(2,5)
 5: C(3) U(2,6) U(2,7) U(2,8)
 6: U(2,9) U(3,4) U(3,5) U(3,6)
 7: C(4) U(3,7) U(3,8) U(3,9)
 8: U(4,5) U(4,6) U(4,7) U(4,8)
 9: C(5) U(4,9)
10: U(5,6) U(5,7) U(5,8) U(5,9)
11: C(6)
12: U(6,7) U(6,8) U(6,9)
13: C(7)
14: U(7,8) U(7,9)
15: C(8)
16: U(8,9)
17: C(9)

For instance, for n = 9 and p = 4, the parallel calculation is performed in 21 steps (see Table 2.2a) whereas the sequential algorithm requires 45 steps. In Fig. 2.1, the efficiency E_p = n(n + 1)/((n − 1)(n − 1 + 3p) + 2p) is displayed for p = 4, 8, 16, 32, 64 processors when dealing with vectors of length n, where 100 ≤ n ≤ 1000.
The efficiency study above is for the Modified Gram-Schmidt (MGS) algorithm.
Even though the analysis for other factorizations is more complicated, the general
behavior of the corresponding efficiency curves with respect to the vector length,
does not change.
The next question to be addressed is whether the iterations of the outer loop can
be pipelined so that they can be implemented utilizing the doacross.
At step j, for k = j + 1, . . . , n, task update( j, k) may start as soon as
task compute( j) is completed but compute( j) may start as soon as all the tasks
update(l, j), for l = 1, . . . , j − 1 are completed. Maintaining the serial execution
of tasks update(l, j) for l = 1, . . . , j − 1 is equivalent to guaranteeing that any task
update( j, k) cannot start before completion of update( j −1, k). The resulting scheme
is listed as Algorithm 2.4.
In our particular example, scheduling of the elementary tasks is displayed in
Table 2.2b. Comparing with the non-pipelined scheme, we clearly see that the proces-
sors are fully utilized whenever the number of remaining vectors is large enough. On
the other hand, the end of the process is identical for the two strategies. Therefore,
pipelining the outer loop is beneficial except when p is much smaller than n.

Fig. 2.1 Efficiencies of the doall approach with a sequential outer loop in MGS, for p = 4, 8, 16, 32, 64, as a function of the vector length

Algorithm 2.4 doacross/doall version for factorization schemes (fan-out).


doacross j = 1 : n,
if j > 1, then
wait( j) ;
end if
compute( j) ;
doall k = j + 1 : n,
update( j, k) ;
if k = j + 1, then
post( j + 1) ;
end if
end
end

2.3.3 Data Allocation for Distributed Memory

The previous analysis is valid for shared or distributed memory architectures. How-
ever, for distributed memory systems we need to discuss the data allocation. As an
illustration consider a ring of p processors, numbered from 0 to p − 1, on which r
consecutive columns of A are stored in a round-robin mode. By denoting j̃ = j − 1,
column j is stored on processor t when j̃ = r(pv + t) + s, with 0 ≤ s < r and
0 ≤ t < p.
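As a small illustration (our own sketch, not code from the text), this block round-robin mapping can be expressed as follows, where owner(j, r, p) returns the index of the processor storing column j:

# block round-robin column-to-processor mapping (a sketch; 0-based processor numbers)
def owner(j, r, p):
    jt = j - 1              # zero-based column index, j~ in the text
    block = jt // r         # index of the block of r consecutive columns
    return block % p        # blocks are dealt round-robin to the ring of p processors

# example: n = 10 columns, blocks of r = 2 columns, p = 3 processors
print([owner(j, 2, 3) for j in range(1, 11)])   # -> [0, 0, 1, 1, 2, 2, 0, 0, 1, 1]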
As soon as column j is ready, it is broadcast to the rest of the processors so they
can start tasks update( j, k) for the columns k which they own. This implements the
doacross/doall strategy of the fan-out approach, listed as Algorithm 2.5.
To reduce the number of messages, one may transfer only blocks of r consecutive
vectors when they are all ready to be used (i.e. the corresponding compute( j)
tasks are completed). The drawback of this option is that it lengthens the periods
during which some processors are idle. Therefore, the block size r must be chosen so as to
obtain a better trade-off between using as many processors as possible and reduc-
ing communication cost. Clearly, the optimum value is architecture dependent as it
depends on the smallest efficient task granularity.

Algorithm 2.5 Message passing fan-out version for factorization schemes.
Input: processor #q owns the set Cq of columns of A.
do j = 1 : n,
   if j ∈ Cq , then
      compute( j) ;
      sendtoall( j) ;
   else
      receive( j) ;
   end if
   do (k ∈ Cq ) & (k > j),
      update( j, k) ;
   end
end
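As an illustration of how Algorithm 2.5 maps onto message passing, the following mpi4py sketch (our own assumption, specialized to MGS with one column per block, r = 1) implements sendtoall/receive as a broadcast from the owner of column j; here compute(j) normalizes a column and update(j, k) orthogonalizes column k against it:

# hypothetical mpi4py sketch of the fan-out scheme of Algorithm 2.5, specialized to MGS
from mpi4py import MPI
import numpy as np

def owner(j, p):                          # round-robin, one column per block (r = 1)
    return (j - 1) % p

def mgs_fanout(A, comm):
    p, rank = comm.Get_size(), comm.Get_rank()
    m, n = A.shape
    mine = {j: A[:, j - 1].copy() for j in range(1, n + 1) if owner(j, p) == rank}
    for j in range(1, n + 1):
        if owner(j, p) == rank:
            q_j = mine[j] / np.linalg.norm(mine[j])   # compute(j)
            mine[j] = q_j
        else:
            q_j = np.empty(m, dtype='d')
        comm.Bcast(q_j, root=owner(j, p))             # sendtoall(j) / receive(j)
        for k in [k for k in mine if k > j]:
            mine[k] -= (q_j @ mine[k]) * q_j          # update(j, k)
    return mine                                       # each rank keeps its own Q columns

comm = MPI.COMM_WORLD
A = np.random.default_rng(0).standard_normal((8, 6))  # same seed, hence same A, on every rank
mgs_fanout(A, comm)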

The discussion above could easily be extended to the case of a torus configuration
where each processor of the previous ring is replaced by a ring of q processors. Every
column of the matrix A is now distributed into slices on the corresponding ring in
a round-robin mode. This, in turn, implies global communication in each ring of q
processors.

2.3.4 Block Versions and Numerical Libraries

We have already seen that it is useful to block consecutive columns of A. Actually,
there is a benefit in doing so even on a uniprocessor. In Algorithms 2.1 and 2.2,
tasks compute( j) and update( j, k) can be redefined to operate on a block of vectors
rather than on a single vector. In that case, indices j and k would refer to blocks
of r consecutive columns. In Table 2.1 the scalar multiplications correspond now to
matrix multiplications involving BLAS3 procedures. It can be shown that for all the
above mentioned factorizations, task compute( j) becomes the task performing the
original factorization scheme on the corresponding block; cf. [19]. Task update( j, k)
remains formally the same but involving blocks of vectors (rank-r update) instead
of individual vectors (rank-1 update).
The resulting block algorithms are mathematically equivalent to their vector coun-
terparts but they may have different numerical behavior, especially for the Gram-
Schmidt algorithm. This will be discussed in detail in Chap. 7.
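As a minimal sketch of such a block version (our own illustration; the function names and the choice of MGS are assumptions, not code from [19]), compute now factors a block of r columns and update becomes a rank-r update performed with matrix-matrix products:

import numpy as np

def compute_block(Aj):                       # compute(j): MGS on one block of r columns
    Q = Aj.copy()
    for i in range(Q.shape[1]):
        Q[:, i] /= np.linalg.norm(Q[:, i])
        Q[:, i + 1:] -= np.outer(Q[:, i], Q[:, i] @ Q[:, i + 1:])
    return Q

def update_block(Qj, Ak):                    # update(j, k): rank-r update, two GEMMs
    return Ak - Qj @ (Qj.T @ Ak)

def block_mgs(A, r):
    A = A.astype(float).copy()
    blocks = [slice(s, min(s + r, A.shape[1])) for s in range(0, A.shape[1], r)]
    for j, bj in enumerate(blocks):
        A[:, bj] = compute_block(A[:, bj])
        for bk in blocks[j + 1:]:
            A[:, bk] = update_block(A[:, bj], A[:, bk])
    return A                                 # columns now form an (approximately) orthonormal Q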


Well-designed block algorithms for matrix multiplication and rank-k updates on
hierarchical machines with multiple levels of memory and parallelism are of critical
importance for designing solvers for the problems considered in this chapter that
achieve high performance and scalability. The library LAPACK [20], which solves
the classic matrix problems, is a case in point, being based on BLAS3, as is
its parallel version ScaLAPACK [21]:
• LAPACK: This is the main reference for a software library for numerical linear
algebra. It provides routines for solving systems of linear equations and linear least
squares, eigenvalue problems, and singular value decomposition. The involved
matrices can be stored as dense matrices or band matrices. The procedures are
based on BLAS3 and are proved to be backward stable. LAPACK was originally
written in FORTRAN 77, but moved to Fortran 90 in version 3.2 (2008).
• ScaLAPACK: This library can be seen as the parallel version of the LAPACK
library for distributed memory architectures. It is based on the Message Passing
Interface standard MPI [22]. Matrices and vectors are distributed over a process grid
in a two-dimensional block-cyclic fashion. The library is often chosen as
the reference against which any newly developed procedure is compared.
In fact, many users became fully aware of these gains even when using high-
level problem solving environments like MATLAB (cf. [23]). As early works on the
subject had shown (we consider it rewarding for the reader to consider the pioneering
analyses in [11, 24]), the task of designing primitives is far from simple, if one desires
to provide a design that closely resembles the target computer model. The task
becomes more difficult as the complexity of the computer architectures increases.
It becomes even harder when the target is to build methods that can deliver high
performance for a spectrum of computer architectures.

2.4 Sparse Matrix Computations

Most large scale matrix computations in computational science and engineering
involve sparse matrices, that is matrices with relatively few nonzero elements, e.g.
nnz = O(n), for square matrices of order n. See [25] for instances of matrices from
a large variety of applications.
For example, in numerical simulations governed by partial differential equations
approximated using finite difference or finite elements, the number of nonzero entries
per row is related to the topology of the underlying finite element or finite differ-
ence grid. In two-dimensional problems discretized by a 5-point finite difference
discretization scheme, the number of nonzeros is about nnz = 5n and the density of
the resulting sparse matrix (i.e. the ratio between nonzeros entries and all entries) is
d ≈ n5 , where n is the matrix order.
Methods designed for dense matrix computations are rarely suitable for sparse
matrices since they quickly destroy the sparsity of the original matrix leading to
the need of storing a much larger number of nonzeros. However, with the avail-
ability of large memory capacities in new architectures, factorization methods (LU
and QR) exist that control fill-in and manage the needed extra storage. We do not
present such algorithms in this book but refer the reader to existing literature, e.g. see
[26–28]. Another option is to use matrix-free methods in which the sparse matrix is
not generated explicitly but used as an operator through the matrix-vector multipli-
cation kernel.
To make feasible large scale computations that involve sparse matrices, they are
encoded in some suitable sparse matrix storage format in which only nonzero ele-
ments of the matrix are stored together with sufficient information regarding their
row and column location to access them in the course of operations.

2.4.1 Sparse Matrix Storage and Matrix-Vector Multiplication Schemes

Let A = (αi j ) ∈ Rn×n be a sparse matrix, and nnz the number of nonzero entries
in A.
Definition 2.1 (Graph of a sparse matrix) The graph of the matrix is given by the
pair of nodes and edges (< 1 : n >, G) where G is characterized by

((i, j) ∈ G) iff α_ij ≠ 0.

The adjacency matrix C of the matrix A is C = (γ_ij) ∈ R^{n×n} such that
γ_ij = 1 if (i, j) ∈ G, otherwise γ_ij = 0.
The most common sparse storage schemes are presented below together with their
associated kernels: MV for matrix-vector multiplication and MTV for the multipli-
cation by the transpose. For a complete description and some additional storage types
see [29].
Compressed Row Sparse Storage (CRS)
All the nonzero entries are successively stored, row by row, in a one-dimensional
array a of length nnz . Column indices are stored in the same order in a vector ja of
the same length nnz . Since the entries are stored row by row, it is sufficient to define a
third vector ia to store the indices of the beginning of each row in a. By convention,
the vector is extended by one entry: ia_{n+1} = nnz + 1. Therefore, when scanning
the vector a, for k = 1, . . . , nnz, the corresponding row index i and column index j are
obtained from the following:

a_k = α_{ij}  ⇔  ( j = ja_k  and  ia_i ≤ k < ia_{i+1} ).    (2.11)
The corresponding MV kernel is given by Algorithm 2.6. The inner loop implements
a sparse inner product through a so-called gather procedure.

Algorithm 2.6 CRS-type MV.
Input: CRS storage (a, ja, ia) of A ∈ R^{n×n} as defined in (2.11) ; v, w ∈ R^n.
Output: w = w + Av.
1: do i = 1 : n,
2:    do k = ia_i : ia_{i+1} − 1,
3:       w_i = w_i + a_k v_{ja_k} ; // Gather
4:    end
5: end
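For concreteness, a plain Python version of this kernel (a sketch of ours, with 0-based indices; SciPy is used only to build a test matrix, whose CSR arrays data/indices/indptr play the roles of a/ja/ia) is:

# CRS matrix-vector product w <- w + A v (0-based indices)
import numpy as np
import scipy.sparse as sp

def crs_mv(a, ja, ia, v, w):
    n = len(ia) - 1
    for i in range(n):                       # loop over rows
        for k in range(ia[i], ia[i + 1]):    # nonzeros of row i
            w[i] += a[k] * v[ja[k]]          # gather from v
    return w

A = sp.random(6, 6, density=0.3, format='csr', random_state=0)
v, w = np.arange(6, dtype=float), np.zeros(6)
crs_mv(A.data, A.indices, A.indptr, v, w)
print(np.allclose(w, A @ v))                 # True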

Compressed Column Sparse Storage (CCS)


This storage is the dual of CRS: it corresponds to storing A^T via a CRS format.
Therefore, the nonzero entries are successively stored, column by column, in a vector
a of length nnz . Row indices are stored in the same order in a vector ia of length
nnz . The third vector ja stores the indices of the beginning of each column in a. By
convention, the vector is extended by one entry: ja_{n+1} = nnz + 1. Thus (2.11) is
replaced by

a_k = α_{ij}  ⇔  ( i = ia_k  and  ja_j ≤ k < ja_{j+1} ).    (2.12)

The corresponding MV kernel is given by Algorithm 2.7. The inner loop implements
a sparse _AXPY through a so-called scatter procedure.

Algorithm 2.7 CCS-type MV.
Input: CCS storage (a, ja, ia) of A ∈ R^{n×n} as defined in (2.12) ; v, w ∈ R^n.
Output: w = w + Av.
1: do j = 1 : n,
2:    do k = ja_j : ja_{j+1} − 1,
3:       w_{ia_k} = w_{ia_k} + a_k v_j ; // Scatter
4:    end
5: end

Compressed Storage by Coordinates (COO)


In this storage, no special order of the entries is assumed. Therefore three vectors a,
ia and ja, each of length nnz, are used, satisfying

a_k = α_{ij}  ⇔  ( i = ia_k  and  j = ja_k ).    (2.13)

The corresponding MV kernel is given by Algorithm 2.8. It involves both the scatter
and gather procedures.
Algorithm 2.8 COO-type MV.
Input: COO storage (a, ja, ia) of A ∈ R^{n×n} as defined in (2.13) ; v, w ∈ R^n.
Output: w = w + Av.
1: do k = 1 : nnz,
2:    w_{ia_k} = w_{ia_k} + a_k v_{ja_k} ;
3: end

MTV Kernel and Other Storages


When A is stored in one of the above mentioned compressed storage formats, the
MTV kernel

w = w + A^T v,

is expressed for a CRS-stored matrix by Algorithm 2.7 and for a CCS-stored one by
Algorithm 2.6. For a COO-stored matrix, the algorithm is obtained by interchanging the
roles of the arrays ia and ja in Algorithm 2.8.
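For example, the following sketch (ours) applies the scatter loop of Algorithm 2.7 to the CRS arrays of A, thereby computing w = w + A^T v:

# w <- w + A^T v for a CRS-stored A (0-based indices)
import numpy as np
import scipy.sparse as sp

def crs_mtv(a, ja, ia, v, w):
    n = len(ia) - 1
    for i in range(n):                       # row i of A contributes to rows ja[k] of A^T v
        for k in range(ia[i], ia[i + 1]):
            w[ja[k]] += a[k] * v[i]          # scatter into w
    return w

A = sp.random(6, 6, density=0.3, format='csr', random_state=1)
v, w = np.arange(6, dtype=float), np.zeros(6)
crs_mtv(A.data, A.indices, A.indptr, v, w)
print(np.allclose(w, A.T @ v))               # True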
Nowadays, the scatter-gather procedures (see step 3 in Algorithm 2.6 and step 3
in Algorithm 2.7) are pipelined on the architectures allowing vector computations.
However, their startup time is often large (i.e. the order of magnitude of n_{1/2}—as defined
in Sect. 1.1.2—is in the hundreds; if, in MV, A were a dense matrix, n_{1/2} would be in the
tens). The vector lengths in Algorithms 2.6 and 2.7 are determined by the number of
nonzero entries per row or per column. They often are so small that the computations
are run at sequential computational rates. There have been many attempts to define
sparse storage formats that favor larger vector lengths (e.g. see the jagged diagonal
format mentioned in [30, 31]).
An efficient storage format which combines the advantages of dense and sparse
matrix computations attempts to define a square block structure of a sparse matrix in
which most of the blocks are empty. The nonempty blocks are stored in any of the
above formats, e.g. CRS, or in regular dense storage, depending on the sparsity
density. Such a sparse storage format is called either Block Compressed Row Storage
(BCRS), where the sparse nonempty blocks are stored using the CRS format,
or Block Compressed Column Storage (BCCS), where the sparse nonempty blocks
are stored using the CCS format.
Basic Implementation on Distributed Memory Architecture
Let us consider the implementation of w = w + Av and w = w + A^T v on a distributed
memory parallel architecture with p processors where A ∈ Rn×n and v, w ∈ Rn . The
first stage consists of partitioning the matrix and allocating respective parts to the
local processor memories. Each processor Pq with q = 1, . . . , p, receives a block
of rows of A and the corresponding slices of the vectors v and w:
P_1:        [ A_{1,1}  A_{1,2}  · · ·  A_{1,p} ]        [ v_1 ]        [ w_1 ]
P_2:        [ A_{2,1}  A_{2,2}  · · ·  A_{2,p} ]        [ v_2 ]        [ w_2 ]
 :     A =  [   :        :               :    ]   v =  [  :  ]   w =  [  :  ]
P_p:        [ A_{p,1}  A_{p,2}  · · ·  A_{p,p} ]        [ v_p ]        [ w_p ]
With the blocks A_{q,j} (j = 1, . . . , p) residing on processor P_q, this partition
determines the necessary communications for performing the multiplications. All
the blocks are sparse matrices, with some being empty. To implement the kernels
w = w + Av and w = w + A^T v, the communication graph is defined by the sets
R(q) and C(q) which, respectively, include the list of indices of the nonempty blocks
A_{q,j} of the block row q and A_{j,q} of the block column q (j = 1, . . . , p). The two
implementations are given by Algorithms 2.9 and 2.10, respectively.

Algorithm 2.9 MV: w = w + Av
Input: q : processor number.
   R(q) : list of indices of the nonempty blocks of row q.
   C(q) : list of indices of the nonempty blocks of column q.
1: do j ∈ C(q),
2:    send v_q to processor P_j ;
3: end
4: compute w_q = w_q + A_{q,q} v_q .
5: do j ∈ R(q),
6:    receive v_j from processor P_j ;
7:    compute w_q = w_q + A_{q,j} v_j ;
8: end

Algorithm 2.10 MTV: w = w + A^T v
Input: q : processor number.
   R(q) : list of indices of the nonempty blocks of row q.
   C(q) : list of indices of the nonempty blocks of column q.
1: do j ∈ R(q),
2:    compute t_j = A_{q,j}^T v_q ;
3:    send t_j to processor P_j ;
4: end
5: compute w_q = w_q + A_{q,q}^T v_q ;
6: do j ∈ C(q),
7:    receive u_j from processor P_j ;
8:    compute w_q = w_q + u_j ;
9: end

The efficiencies of the two procedures MV and MTV are often quite different,
depending on the chosen sparse storage format.
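The communication sets R(q) and C(q) can be derived directly from the block sparsity pattern of the distributed matrix; a small serial sketch (ours, with a uniform block-row partition) is:

import numpy as np
import scipy.sparse as sp

def comm_sets(A, p):
    """For each processor q, list the off-diagonal nonempty blocks in its block row/column."""
    n = A.shape[0]
    bounds = np.linspace(0, n, p + 1).astype(int)
    R = [[] for _ in range(p)]
    C = [[] for _ in range(p)]
    for q in range(p):
        for j in range(p):
            block = A[bounds[q]:bounds[q + 1], bounds[j]:bounds[j + 1]]
            if j != q and block.nnz > 0:
                R[q].append(j)      # processor q needs v_j to form A_{q,j} v_j
                C[j].append(q)      # processor j must send v_j to processor q
    return R, C

A = sp.random(12, 12, density=0.1, format='csr', random_state=0)
R, C = comm_sets(A, p=3)
print(R, C)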
Fig. 2.2 Partition of a block-diagonal matrix with overlapping blocks A_1, . . . , A_p; the vector v is decomposed into correspondingly overlapping slices v_1, . . . , v_p

Scalable Implementation for an Overlapped Block-Diagonal Structure


If the sparse matrix A ∈ Rn×n can be reordered to result in p overlapping diagonal
blocks denoted by Aq , q = 1, . . . , p as shown in Fig. 2.2, then a matrix-vector
multiplication primitive can be designed with high parallel scalability. Let block Aq
be stored in the memory of the processor Pq and the vectors v, w ∈ Rn stored
accordingly. It is therefore necessary to maintain the consistency between the two
copies of the components corresponding to the overlaps.
To perform the MV computation w = w + Av, the matrix A may be considered
as a sum of p blocks Bq (q = 1, . . . , p) where B p = A p and all earlier blocks Bq
are the same as A_q but with the elements of the lower-right submatrix corresponding to the
overlap replaced by zeros (see Fig. 2.3).
Let vector v_q be the subvector of v corresponding to the q-th block row indices. For
2 ≤ q ≤ p − 1, the vector v_q is partitioned into v_q = ((v_q^1)^T, (v_q^2)^T, (v_q^3)^T)^T, according
to the overlap with the neighboring blocks. The first and the last subvectors are
partitioned as v_1 = ((v_1^2)^T, (v_1^3)^T)^T and v_p = ((v_p^1)^T, (v_p^2)^T)^T.
Denoting by B̄_q and v̄_q the prolongation by zeros of B_q and v_q to the full order n,
the operation w + Av = w + Σ_{q=1}^{p} B̄_q v̄_q can be performed via Algorithm 2.11.
After completion, the vector w is correctly updated and distributed on the processors
with consistent subvectors w_q (i.e. w_{q−1}^3 = w_q^1 for q = 2, . . . , p).

Fig. 2.3 Elementary blocks for the MV kernel
Algorithm 2.11 Scalable MV multiplication w = w + Av.
Input: q: processor number.
   In the local memory: B_q, v_q = [(v_q^1)^T, (v_q^2)^T, (v_q^3)^T]^T, and w_q = [(w_q^1)^T, (w_q^2)^T, (w_q^3)^T]^T.
Output: w = w + Av.
1: z_q = B_q v_q ;
2: if q < p, then
3:    send z_q^3 to processor P_{q+1} ;
4: end if
5: if q > 1, then
6:    send z_q^1 to processor P_{q−1} ;
7: end if
8: w_q = w_q + z_q ;
9: if q < p, then
10:    receive t from processor P_{q+1} ;
11:    w_q^3 = w_q^3 + t ;
12: end if
13: if q > 1, then
14:    receive t from processor P_{q−1} ;
15:    w_q^1 = w_q^1 + t ;
16: end if

This algorithm does not involve global communications and it can be implemented on a linear
array of processors in which every processor only exchanges information with its
two immediate neighbors: each processor receives one message from each of its
neighbors and it sends back one message to each.
Proposition 2.1 Algorithm 2.11 which implements the MV kernel for a sparse
matrix with an overlapped block-diagonal structure on a ring of p processors is
weakly scalable as long as the number of nonzero entries of each block and the
overlap sizes are independent of the number of processors p.
Proof Let T_BMV be the bound on the number of steps for the MV kernel of each
individual diagonal block, and ℓ the maximum overlap size; then on p processors
the number of steps is given by

T_p ≤ T_BMV + 4(β + ℓ τ_c),

where β is the latency for a message and τc the time for sending a word to an
immediate neighbouring node regardless of the latency. Since T p is independent of
p, weak scalability is assured.
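A serial NumPy sketch (ours) of the decomposition behind Algorithm 2.11 is given below: each stored diagonal block A_q is turned into B_q by zeroing its lower-right overlap (except for the last block), and the overlap parts of the local products play the role of the exchanged messages:

import numpy as np

def overlapped_mv(blocks, starts, ov, v, w):
    """w <- w + A v, where blocks[q] = A_q is the q-th overlapping diagonal block of A."""
    w = w.copy()
    for q, (Aq, s) in enumerate(zip(blocks, starts)):
        Bq = Aq.copy()
        if q < len(blocks) - 1:
            Bq[-ov:, -ov:] = 0.0             # B_q: zero the overlap shared with A_{q+1}
        nq = Bq.shape[0]
        w[s:s + nq] += Bq @ v[s:s + nq]      # local product; overlap parts are the "messages"
    return w

# build a matrix whose nonzeros lie in p overlapping diagonal blocks and check the result
p, nb, ov = 3, 5, 2
n = p * nb - (p - 1) * ov
starts = [q * (nb - ov) for q in range(p)]
rng = np.random.default_rng(3)
A = np.zeros((n, n))
mask = np.zeros((n, n), dtype=bool)
for s in starts:
    mask[s:s + nb, s:s + nb] = True
A[mask] = rng.standard_normal(mask.sum())
blocks = [A[s:s + nb, s:s + nb].copy() for s in starts]   # what processor P_q stores
v, w = rng.standard_normal(n), rng.standard_normal(n)
print(np.allclose(overlapped_mv(blocks, starts, ov, v, w), w + A @ v))   # True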

2.4.2 Matrix Reordering Schemes

Algorithms for reordering sparse matrices play a vital role in enhancing the parallel
scalability of various sparse matrix algorithms and their underlying primitives, e.g.,
see [32, 33].
Early reordering algorithms such as minimum-degree and nested dissection have
been developed for reducing fill-in in sequential direct methods for solving sparse
symmetric positive definite linear systems, e.g., see [26, 34, 35]. Similarly, algo-
rithms such as reverse Cuthill-McKee (RCM), e.g., see [36, 37], have been used
for reducing the envelope (variable band or profile) of sparse matrices in order to:
(i) enhance the efficiency of uniprocessor direct factorization schemes, (ii) reduce
the cost of sparse matrix-vector multiplications in iterative methods such as the con-
jugate gradients method (CG), for example, and (iii) obtain preconditioners for the
PCG scheme based on incomplete factorization [38, 39]. In this section, we
describe a reordering scheme that not only reduces the profile of sparse matrices,
but also brings as many of the heaviest (larger magnitude) off-diagonal elements as
possible close to the diagonal. For solving sparse linear systems, for example, one
aims at realizing an overall cost, with reordering, that is much less than that without
reordering. In fact, in many time-dependent computational science and engineering
applications, this is possible. In such applications, the relevant nested computational
loop occurs as shown in Fig. 2.4.
The outer-most loop deals with time-step t, followed by solving a nonlinear set
of equations using a variant of Newton’s method, with the inner-most loop dealing
with solving a linear system in each Newton iteration to a relatively modest relative
residual ηk . Further, it is often the case that it is sufficient to realize the benefits of
reordering by continuing to use the permutation matrices obtained at time step t for several
subsequent time steps. This results not only in amortization of the cost of reordering,
but also in reducing the total cost of solving all the linear systems arising in such an
application.
With such a reordering, we aim to obtain a matrix C = PAQ, where A = (αi j )
is the original sparse matrix, P and Q are permutation matrices, such that C can be
split as C = B + E, with the most important requirements being that: (i) the sparse
matrix E contains far fewer nonzero elements than A, and is of a much lower rank,
and (ii) the central band B is a “generalized-banded” matrix with a Frobenius norm
that is a substantial fraction of that of A.

Fig. 2.4 Common structure of programs in time-dependent simulations
Fig. 2.5 Reordering to a narrow-banded matrix

Hence, depending on the original matrix A, the matrix B can be extracted as:
(a) “narrow-banded” of bandwidth β much smaller than the order n of the matrix
A, i.e., β = 10^{−4} n for n ≥ 10^6, for example (the most fortunate situation), see
Fig. 2.5,
(b) “medium-banded”, i.e., of the block-tridiagonal form [H, G, J ], in which the
elements of the off-diagonal blocks H and J are all zero except for their small
upper-right and lower-left corners, respectively, see Fig. 2.6, or
(c) “wide-banded”, i.e., consisting of overlapped diagonal blocks, in which each
diagonal block is a sparse matrix, see Fig. 2.7.
The motivation for desiring such a reordering scheme is three-fold. First, B can
be used as a preconditioner of a Krylov subspace method when solving a linear
system Ax = f of order n. Since E is of a rank p much less than n, the precondi-
tioned Krylov subspace scheme will converge quickly. In exact arithmetic, the Krylov
subspace method will converge in exactly p iterations. In floating-point arithmetic,
however, this translates into the method achieving small relative residuals in less
than p iterations. Second, since we require the diagonal of B to be zero-free with the
product of its entries maximized, and that the Frobenius norm of B is close to that of
A, this will enhance the possibility that B is nonsingular, or close to a nonsingular
matrix. Third, multiplying C by a vector can be implemented on a parallel archi-
tecture with higher efficiency by splitting the operation into two parts: multiplying
the “generalized-banded” matrix B by a vector, and a low-rank sparse matrix E by
a vector. The former, e.g. v = Bu, can be achieved with high parallel scalability on
distributed-memory architectures requiring only nearest neighbor communication,
e.g. see Sect. 2.4.1 for the scalable parallel implementation of an overlapped block
diagonal matrix-vector multiplication scheme. The latter, e.g. w = Eu, however,
incurs much less irregular addressing penalty compared to y = Au since E contains
far fewer nonzero entries than A.
Since A is nonsymmetric, in general, we could reduce its profile by using RCM
(i.e. via symmetric permutations only) applied to (|A| + |A^T|) [40], or by using
the spectral reordering introduced in [41]; see also [42]. However, this will neither
Fig. 2.6 Reordering to a medium-banded matrix

Fig. 2.7 Reordering to a wide-banded matrix

realize a zero-free diagonal, nor ensure bringing the heaviest off-diagonal elements
close to the diagonal. Consequently, RCM alone will not realize a central “band”
B with its Frobenius norm satisfying ‖B‖_F ≥ (1 − ε) ‖A‖_F. In order to solve
this weighted bandwidth reduction problem, we use a weighted spectral reordering
technique which is a generalization of spectral reordering. To alleviate the shortcom-
ings of using only symmetric permutations, and assuming that the matrix A is not
structurally singular, this weighted spectral reordering will need to be coupled with

www.allitebooks.com
40 2 Fundamental Kernels

a nonsymmetric ordering technique such as the maximum traversal algorithm [43]
to guarantee a zero-free diagonal, and to maximize the magnitude of the product
of the diagonal elements, via the MPD algorithm (Maximum Product on Diagonal
algorithm) [44, 45]. Such a procedure is implemented in the Harwell Subroutine
Library [46] as (HSL-MC64).
Thus, the resulting algorithm, which we refer to as WSO (Weighted Spectral
Ordering), consists of three stages:
Stage 1: Nonsymmetric Permutations
Here, we obtain a permutation matrix Q that maximizes the product of the absolute
values of the diagonal entries of QA, [44, 45]. This is achieved by a maximum
traversal search followed by a scaling procedure resulting in diagonal entries of
absolute values equal to 1, and all off-diagonal elements with magnitudes less than
or equal to 1. After applying this stage, a linear system Ax = f , becomes of the
form,
(Q D2 AD1 )(D1−1 x) = (Q D2 f ) (2.14)

in which each D j , j = 1, 2 is a diagonal scaling matrix.


Stage 2: Checking for Irreducibility
In this stage, we need to detect whether the sparse matrix under consideration is
irreducible, i.e., whether the corresponding graph has one strongly connected com-
ponent. This is achieved via Tarjan’s strongly connected component algorithm [47],
see also related schemes in [48], or [49]. If the matrix is reducible, we apply the
weighted spectral reordering simultaneously on each strongly connected component
(i.e. on each sparse diagonal block of the resulting upper block triangular matrix).
For the rest of this section, we assume that the sparse matrix A under considera-
tion is irreducible, i.e., the corresponding graph has only one strongly connected
component.
Stage 3: The Weighted Spectral Reordering Scheme [50]
As in other traditional reordering algorithms, we wish to minimize the half-bandwidth
of a matrix A which is given by,

BW (A) = max |i − j|, (2.15)


i, j:αi j =0

i.e., to minimize the maximum distance of a nonzero entry from the main diagonal.
Let us assume for the time being that A is a symmetric matrix, and that we aim at
extracting a central band B = (βi j ) of minimum bandwidth such that, for a given
tolerance ε,

( Σ_{i,j} |α_ij − β_ij| ) / ( Σ_{i,j} |α_ij| )  ≤  ε,    (2.16)
and

β_ij = α_ij if |i − j| ≤ k,  β_ij = 0 otherwise.    (2.17)
The idea behind this formulation is that if a significant part of the matrix is packed
into a central band B, then the rest of the nonzero entries can be dropped to obtain an
effective preconditioner. In order to find a heuristic solution to the weighted band-
width reduction problem, we use a generalization of spectral reordering. Spectral
reordering is a linear algebraic technique that is commonly used to obtain approx-
imate solutions to various intractable graph optimization problems [51]. It has also
been successfully applied to the bandwidth and envelope reduction problems for
sparse matrices [41]. The core idea of spectral reordering is to compute a vector
x = (ξi ) that minimizes

σ A (x) = (ξi − ξ j )2 , (2.18)
i, j:αi j =0

subject to ‖x‖_2 = 1 and x^T e = 0. As mentioned above we assume that the matrix


A is real and symmetric. The vector x that minimizes σ A (x) under these constraints
provides a mapping of the rows (and columns) of matrix A to a one-dimensional
Euclidean space, such that pairs of rows that correspond to nonzeros are located as
close as possible to each other. Consequently, the ordering of the entries of the vector
x provides an ordering of the matrix that significantly reduces the bandwidth.
Fiedler [52] first showed that the optimal solution to this problem is given by the
eigenvector corresponding to the second smallest eigenvalue of the Laplacian matrix
L = (λi j ) of A,
λ_ij = −1 if i ≠ j ∧ α_ij ≠ 0,
λ_ii = |{ j : α_ij ≠ 0 }|.    (2.19)

Note that the matrix L is positive semidefinite, and the smallest eigenvalue of this
matrix is equal to zero. The eigenvector x that minimizes σ A (x) = x  L x, such that
x2 = 1 and x  e = 0, is the eigenvector corresponding to the second smallest
eigenvalue of the Laplacian, i.e. the symmetric eigenvalue problem

Lx = λx, (2.20)

and is known as the Fiedler vector. The Fiedler vector of a sparse matrix can be
computed efficiently using any of the eigensolvers discussed in Chap. 11, see also
[53].
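As an illustration, a SciPy sketch (ours; the function name and the use of a dense eigensolver for this small example are assumptions) of computing such a Fiedler-vector ordering from the weighted Laplacian of |A| + |A^T| is:

import numpy as np
import scipy.sparse as sp

def weighted_spectral_permutation(A):
    W = (abs(A) + abs(A.T)).tocsr()              # symmetrized weights |A| + |A^T|
    W = W - sp.diags(W.diagonal())               # keep the off-diagonal weights only
    L = sp.diags(np.asarray(W.sum(axis=1)).ravel()) - W   # weighted Laplacian, cf. (2.22)
    # small example, dense eigensolver; large problems call for the solvers of Chap. 11
    vals, vecs = np.linalg.eigh(L.toarray())
    fiedler = vecs[:, 1]                         # eigenvector of the second smallest eigenvalue
    return np.argsort(fiedler)                   # order rows/columns by their Fiedler entries

A = sp.random(100, 100, density=0.05, format='csr', random_state=2) + sp.eye(100, format='csr')
perm = weighted_spectral_permutation(A)
B = A[perm, :][:, perm]                          # symmetrically reordered matrix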
While spectral reordering is shown to be effective in bandwidth reduction, the
classical approach described above ignores the magnitude of nonzeros in the matrix.
Therefore, it is not directly applicable to the weighted bandwidth reduction problem.
However, Fiedler’s result can be directly generalized to the weighted case [54]. More
precisely, the eigenvector x that corresponds to the second smallest eigenvalue of the
weighted Laplacian L minimizes

σ̄_A(x) = x^T L x = Σ_{i,j} |α_ij| (ξ_i − ξ_j)²,    (2.21)

where L is defined as

λ_ij = −|α_ij| if i ≠ j,
λ_ii = Σ_j |α_ij|.    (2.22)

We now show how weighted spectral reordering can be used to obtain a continuous
approximation to the weighted bandwidth reduction problem. For this purpose, we
first define the relative bandweight of a specified band of the matrix as follows:

w_k(A) = ( Σ_{i,j: |i−j|<k} |α_ij| ) / ( Σ_{i,j} |α_ij| ).    (2.23)

In other words, the bandweight of a matrix A, with respect to an integer k, is equal
to the fraction of the total magnitude of entries that are encapsulated in a band of
half-width k.
For a given α, 0 ≤ α ≤ 1, we define the α-bandwidth as the smallest half-bandwidth
that encapsulates a fraction α of the total matrix weight, i.e.,

BW_α(A) = min_{k: w_k(A) ≥ α} k.    (2.24)

Observe that α-bandwidth is a generalization of half-bandwidth, i.e., when α = 1,


the α-bandwidth is equal to the half-bandwidth of the matrix. Now, for a given
vector x = (ξ1 , ξ2 , ......., ξn ) ∈ Rn , define an injective permutation function π :
{1, 2, . . . , n} → {1, 2, . . . , n}, such that, for 1 ≤ i, j ≤ n, ξπi ≤ ξπ j iff i ≤ j. Here,
n denotes the number of rows (columns) of the matrix A. Moreover, for a fixed k,
define the function δk (i, j) : {1, 2, . . . , n}×{1, 2, . . . , n} → {0, 1}, which quantizes
the difference between πi and π j with respect to k, i.e.,

δ_k(i, j) = 0 if |π_i − π_j| ≤ k, and δ_k(i, j) = 1 otherwise.    (2.25)

Let Ā be the matrix obtained by reordering the rows and columns of A according to
π , i.e.,
Ā(πi , π j ) = αi j for 1 ≤ i, j ≤ n. (2.26)

Then δ_k(i, j) = 0 indicates that α_ij is inside a band of half-width k in the matrix
Ā, while δ_k(i, j) = 1 indicates that it is outside the band. Defining


σ̂_k(A) = Σ_{i,j} |α_ij| δ_k(i, j),    (2.27)

then,

σ̂_k(A) = (1 − w_k(Ā)) Σ_{i,j} |α_ij|.    (2.28)

Therefore, for a fixed α, the α-bandwidth of the matrix Ā is equal to the smallest
k that satisfies σ̂_k(A) / Σ_{i,j} |α_ij| ≤ 1 − α.
Note that the problem of minimizing σ̄_A(x) is a continuous relaxation of the
problem of minimizing σ̂_k(A) for a given k. Therefore, the Fiedler vector of the
weighted Laplacian L provides a good basis for reordering A to minimize σ̂_k(A).
Consequently, for a fixed ε, this vector provides a heuristic solution to the problem
of finding a reordered matrix Ā = (ᾱi j ) with minimum (1 − ε)-bandwidth. Once the
matrix is obtained, we extract the central band B as follows:

B = { β_ij = ᾱ_ij if |i − j| ≤ BW_{1−ε}(Ā), otherwise β_ij = 0 }.    (2.29)

Clearly, B satisfies (2.16) and is of minimal bandwidth.
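The quantities involved in this construction are easy to state in code; the following helper functions (ours) compute the bandweight (2.23), the α-bandwidth (2.24), and the central band B of (2.29) for a given (reordered) matrix:

import numpy as np

def bandweight(A, k):
    """w_k(A): fraction of the total magnitude lying within the band |i - j| < k, cf. (2.23)."""
    A = np.abs(np.asarray(A, dtype=float))
    n = A.shape[0]
    d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return A[d < k].sum() / A.sum()

def alpha_bandwidth(A, alpha):
    """BW_alpha(A): the smallest k with w_k(A) >= alpha, cf. (2.24)."""
    n = np.asarray(A).shape[0]
    return next(k for k in range(n + 1) if bandweight(A, k) >= alpha)

def central_band(A, eps):
    """Extract B as in (2.29): keep entries with |i - j| <= BW_{1-eps}(A), zero the rest."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    k = alpha_bandwidth(A, 1.0 - eps)
    d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return np.where(d <= k, A, 0.0)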


Note that spectral reordering is defined specifically for symmetric matrices, and
the resulting permutation is symmetric as well. Since our main focus here concerns
general nonsymmetric matrices, we apply spectral reordering to nonsymmetric matri-
ces by computing the Laplacian matrix of |A| + |A^T| instead of |A|. We note also
that this formulation results in a symmetric permutation for a nonsymmetric matrix,
which may be considered overconstrained.
Once the Fiedler vector yields the permutation P, we obtain the matrix C as

C = P Q D_2 A D_1 P^T,    (2.30)

and the linear system Ax = f takes the final form

(P Q D_2 A D_1 P^T)(P D_1^{−1} x) = (P Q D_2 f).    (2.31)

References

1. Lawson, C., Hanson, R., Kincaid, D., Krogh, F.: Basic linear algebra subprograms for Fortran
usage. ACM Trans. Math. Softw. 5(3), 308–323 (1979)
2. Dongarra, J., Croz, J.D., Hammarling, S., Hanson, R.: An extended set of FORTRAN basic
linear algebra subprograms. ACM Trans. Math. Softw. 14(1), 1–17 (1988)
3. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
4. Intel company: Intel Math Kernel Library. http://software.intel.com/en-us/intel-mkl
5. Texas Advanced Computing Center, University of Texas: GotoBLAS2. https://www.tacc.utexas.edu/tacc-software/gotoblas2
6. Netlib Repository at UTK and ORNL: Automatically Tuned Linear Algebra Software
(ATLAS). http://www.netlib.org/atlas/
7. Whaley, R., Dongarra, J.: Automatically tuned linear algebra software. In: Proceedings of 1998
ACM/IEEE Conference on Supercomputing, Supercomputing’98, pp. 1–27. IEEE Computer
Society, Washington (1998). http://dl.acm.org/citation.cfm?id=509058.509096
8. Yotov, K., Li, X., Ren, G., Garzarán, M., Padua, D., Pingali, K., Stodghill, P.: Is search really
necessary to generate high-performance BLAS? Proc. IEEE 93(2), 358–386 (2005). doi:10.
1109/JPROC.2004.840444
9. Goto, K., van de Geijn, R.: Anatomy of high-performance matrix multiplication. ACM Trans.
Math. Softw. 34(3), 12:1–12:25 (2008). doi:10.1145/1356052.1356053. http://doi.acm.org/10.
1145/1356052.1356053
10. Gallivan, K.A., Plemmons, R.J., Sameh, A.H.: Parallel algorithms for dense linear algebra
computations. SIAM Rev. 32(1), 54–135 (1990). doi:http://dx.doi.org/10.1137/1032002
11. Gallivan, K., Jalby, W., Meier, U.: The use of BLAS3 in linear algebra on a parallel processor
with a hierarchical memory. SIAM J. Sci. Stat. Comput. 8(6), 1079–1084 (1987)
12. Strassen, V.: Gaussian elimination is not optimal. Numerische Mathematik 13, 354–356 (1969)
13. Winograd, S.: On multiplication of 2 × 2 matrices. Linear Algebra Appl. 4(4), 381–388 (1971)
14. Ballard, G., Demmel, J., Holtz, O., Lipshitz, B., Schwartz, O.: Communication-optimal parallel
algorithm for Strassen matrix multiplication. Technical report UCB/EECS-2012-32, EECS
Department, University of California, Berkeley (2012). http://www.eecs.berkeley.edu/Pubs/
TechRpts/2012/EECS-2012-32.html
15. Higham, N.J.: Exploiting fast matrix multiplication within the level 3 BLAS. ACM Trans.
Math. Softw. 16(4), 352–368 (1990)
16. Ballard, G., Demmel, J., Holtz, O., Schwartz, O.: Graph expansion and communication costs of
fast matrix multiplication. J. ACM 59(6), 32:1–32:23 (2012). doi:10.1145/2395116.2395121.
http://doi.acm.org/10.1145/2395116.2395121
17. Lipshitz, B., Ballard, G., Demmel, J., Schwartz, O.: Communication-avoiding parallel Strassen:
implementation and performance. In: Proceedings of the International Conference on High Per-
formance Computing, Networking, Storage and Analysis, SC’12, pp. 101:1–101:11. IEEE
Computer Society Press, Los Alamitos (2012). http://dl.acm.org/citation.cfm?id=2388996.
2389133
18. Higham, N.J.: Stability of a method for multiplying complex matrices with three real matrix
multiplications. SIAM J. Matrix Anal. Appl. 13(3), 681–687 (1992)
19. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
20. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
21. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). http://www.netlib.org/scalapack
22. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message
Passing Interface. MIT Press, Cambridge (1994)
23. Moler, C.: MATLAB incorporates LAPACK. Mathworks Newsletter (2000). http://www.
mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html
24. Gallivan, K., Jalby, W., Meier, U., Sameh, A.: The impact of hierarchical memory systems on
linear algebra algorithm design. Int. J. Supercomput. Appl. 2(1) (1988)
25. Davis, T., Hu, Y.: The University of Florida Sparse Matrix Collection. ACM Trans. Math.
Softw. 38(1), 1:1–1:25 (2011). http://doi.acm.org/10.1145/2049662.2049663
26. Duff, I., Erisman, A., Reid, J.: Direct Methods for Sparse Matrices. Oxford University Press
Inc., New York (1989)
27. Davis, T.: Direct Methods for Sparse Linear Systems. SIAM, Philadelphia (2006)
28. Zlatev, Z.: Computational Methods for General Sparse Matrices, vol. 65. Kluwer Academic
Publishers, Dordrecht (1991)
29. Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., van der Vorst, H.: Templates for the Solution of
Algebraic Eigenvalue Problems: A Practical Guide. SIAM, Philadelphia (2000)
30. Melhem, R.: Toward efficient implementation of preconditioned conjugate gradient methods
on vector supercomputers. Int. J. Supercomput. Appl. 1(1), 70–98 (1987)
31. Philippe, B., Saad, Y.: Solving large sparse eigenvalue problems on supercomputers. Technical
report RIACS TR 88.38, NASA Ames Research Center (1988)
32. Schenk, O.: Combinatorial Scientific Computing. CRC Press, Switzerland (2012)
33. Kepner, J., Gilbert, J.: Graph Algorithms in the Language of Linear Algebra. SIAM, Philadel-
phia (2011)
34. George, J., Liu, J.: Computer Solutions of Large Sparse Positive Definite Systems. Prentice
Hall (1981)
35. Pissanetzky, S.: Sparse Matrix Technology. Academic Press, New York (1984)
36. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proceedings
of 24th National Conference Association Computer Machinery, pp. 157–172. ACM Publica-
tions, New York (1969)
37. Liu, W., Sherman, A.: Comparative analysis of the Cuthill-McKee and the reverse Cuthill-
McKee ordering algorithms for sparse matrices. SIAM J. Numer. Anal. 13, 198–213 (1976)
38. D’Azevedo, E.F., Forsyth, P.A., Tang, W.P.: Ordering methods for preconditioned conjugate
gradient methods applied to unstructured grid problems. SIAM J. Matrix Anal. 13(3), 944–961
(1992)
39. Duff, I., Meurant, G.: The effect of ordering on preconditioned conjugate gradients. BIT 29,
635–657 (1989)
40. Reid, J., Scott, J.: Reducing the total bandwidth of a sparse unsymmetric matrix. SIAM J.
Matrix Anal. Appl. 28(3), 805–821 (2005)
41. Barnard, S., Pothen, A., Simon, H.: A spectral algorithm for envelope reduction of sparse
matrices. Numer. Linear Algebra Appl. 2, 317–334 (1995)
42. Spielman, D., Teng, S.: Spectral partitioning works: planar graphs and finite element meshes.
Numer. Linear Algebra Appl. 421, 284–305 (2007)
43. Duff, I.: On algorithms for obtaining a maximum transversal. ACM Trans. Math. Softw. 7,
315–330 (1981)
44. Duff, I., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse matrix.
SIAM J. Matrix Anal. Appl. 22, 973–996 (2001)
45. Duff, I., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal
of sparse matrices. SIAM J. Matrix Anal. Appl. 20, 889–901 (1999)
46. The HSL mathematical software library. See http://www.hsl.rl.ac.uk/index.html
47. Tarjan, R.: Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160
(1972)
48. Cheriyan, J., Mehlhorn, K.: Algorithms for dense graphs and networks on the random access
computer. Algorithmica 15, 521–549 (1996)
49. Dijkstra, E.: A Discipline of Programming, Chapter 25. Prentice Hall, Englewood Cliffs (1976)
50. Manguoğlu, M., Koyutürk, M., Sameh, A., Grama, A.: Weighted matrix ordering and parallel
banded preconditioners for iterative linear system solvers. SIAM J. Sci. Comput. 32(3), 1201–
1206 (2010)
51. Hendrickson, B., Leland, R.: An improved spectral graph partitioning algorithm for mapping
parallel computations. SIAM J. Sci. Comput. 16(2), 452–469 (1995). http://citeseer.nj.nec.
com/hendrickson95improved.html
52. Fiedler, M.: Algebraic connectivity of graphs. Czechoslovak Math. J. 23, 298–305 (1973)
53. Kruyt, N.: A conjugate gradient method for the spectral partitioning of graphs. Parallel Comput.
22, 1493–1502 (1997)
54. Chan, P., Schlag, M., Zien, J.: Spectral k-way ratio-cut partitioning and clustering. IEEE Trans.
CAD-Integr. Circuits Syst. 13, 1088–1096 (1994)
Part II
Dense and Special Matrix Computations
Chapter 3
Recurrences and Triangular Systems

A recurrence relation is a rule that defines each element of a sequence in terms of
the preceding elements; it forms one of the most basic tools in discrete and compu-
tational mathematics with myriad applications in scientific computing, ranging from
numerical linear algebra, the numerical solution of ordinary and partial differential
equations and orthogonal polynomials, to dependence analysis in restructuring com-
pilers and logic design. In fact, almost every computational task relies on recursive
techniques and hence involves recurrence relations. Therefore, recurrence solvers are
computational kernels and need to be implemented as efficient primitives on vari-
ous architectures. Throughout this book we will encounter many linear and nonlinear
recurrences that can be solved using techniques from this chapter. Linear recurrences
can usually be expressed in the form of mostly banded lower triangular linear sys-
tems, so much of the discussion is devoted to parallel algorithms for solving Lx = f ,
where L is a lower triangular matrix.

3.1 Definitions and Examples

For illustration, we list a few simple examples from computational mathematics;
cf. [1].

Example 3.1 Consider the problem of evaluating the definite integral


ψ_k = ∫_0^1 ξ^k / (ξ + 8) dξ,   for k = 0, 1, 2, . . . , n,

for some integer n. It is not difficult to verify that ψ_k + 8ψ_{k−1} = 1/k, which is a linear
recurrence for evaluating ψ_k, k ≥ 1, with ψ_0 = log_e(9/8).


Example 3.2 We wish to use a finite-difference method for solving the second-
order differential equation ψ″ = φ(ξ, ψ) with the initial conditions ψ(0) = α and
ψ′(0) = γ. Replacing the derivatives ψ′ and ψ″ by the differences

ψ′(ξ_k) ≈ (ψ_{k+1} − ψ_{k−1}) / (2h),
ψ″(ξ_k) ≈ (ψ_{k+1} − 2ψ_k + ψ_{k−1}) / h²,
where h = ξk+1 − ξk , k ≥ 0, we obtain the linear recurrence relation

ψ_{k+1} − 2ψ_k + ψ_{k−1} = h² φ(ξ_k, ψ_k)    (3.1)

with the initial values


ψ0 = α, (3.2)

ψ1 − ψ−1 = 2hγ . (3.3)

The undefined value ψ−1 can be eliminated using (3.1) (with k = 0), and (3.2).
Hence, the linear recurrence (3.1) can be started with ψ_0 = α and
ψ_1 = ψ_0 + hγ + (1/2) h² φ(ξ_0, ψ_0).

Example 3.3 ([2]) Chebyshev polynomials of the first kind

Tk (ξ ) = cos(k cos−1 ξ ), k = 0, 1, 2, . . . ,

satisfy the important three-term recurrence relation

Tk+1 (ξ ) − 2ξ Tk (ξ ) + Tk−1 (ξ ) = 0

for k ≥ 1, with the starting values T0 (ξ ) = 1 and T1 (ξ ) = ξ .


Example 3.4 Newton’s method for obtaining a root α of a single nonlinear equation
φ(ξ ) = 0 is given by the nonlinear recurrence involving the first derivative

ξ_{k+1} = ξ_k − φ(ξ_k)/φ′(ξ_k)

with a given initial approximation

ξ0 = β.

See also [3] for examples of recurrences in logic design and restructuring com-
pilers.
Stability Issues
It is well known that computations with recurrence relations are prone to error growth;
at each step, computations are performed and new errors generated by operating on
past data that may already be contaminated with errors [1]. This error propagation
and gradual accumulation could be catastrophic. For example, consider the linear
recurrence from Example 3.1,

ψ_k + 8ψ_{k−1} = 1/k

with ψ0 = loge (9/8). Using three decimals (with rounding) throughout the eval-
uation of ψi , i ≥ 0, and taking ψ0  0.118, the recurrence yields ψ1  0.056,
ψ2  0.052, and ψ3  −0.083. The reason for obtaining a negative ψ3 (note that
ψk > 0 for all values of k) is that the initial error δ0 in ψ0 has been highly magnified
in ψ3 . In fact, even if we use exact arithmetic in evaluating ψ1 , ψ2 , ψ3 , . . . the initial
rounding error δ0 propagates such that the error δk in ψk is given by (−8)k δ0 . There-
fore, since |δ0 | ≤ 0.5×10−3 we get |δ3 | ≤ 0.256; a very high error, given that the true
value of ψ3 (to three decimals) is 0.028. Note that this numerical instability cannot be
eliminated by using higher precision (six decimals, say). Such wrong results will only
be postponed, and will show up at a later stage. The relation between the available
precision and the highest subscript k of a reasonably accurate ψk , however, will play
an important role in constructing stable parallel algorithms for handling recurrence
relations in general. Such numerical instability may be avoided by observing that the
definite integral of Example 3.1 decreases as the value of k increases. Hence, if we
assume that ψ5  0 (say) and evaluate the recurrence ψk + 8ψk−1 = k1 backwards,
we should obtain reasonably accurate values for at least ψ0 and ψ1 , since in this
case δk−1 = (−1/8)δk . Performing the calculations to three decimals, we obtain
ψ4  0.025, ψ3  0.028, ψ2  0.038, ψ1  0.058, and ψ0  0.118.
In the remainder of this section we discuss algorithms that are particularly suitable
for evaluating linear and certain nonlinear recurrence relations on parallel architec-
tures.

3.2 Linear Recurrences

Consider the linear recurrence system,

ξ_1 = φ_1,
ξ_i = φ_i − Σ_{j=k}^{i−1} λ_ij ξ_j,    (3.4)

where i = 2, . . . , n and k = max{1, i − m}. In matrix notation, (3.4) may be
written as

x = f − L̂ x,    (3.5)

where x = (ξ_1, ξ_2, . . . , ξ_n)^T, f = (φ_1, φ_2, . . . , φ_n)^T, and for m = 3 (say) L̂ is of the form

       [ 0                                       ]
       [ λ_21     0                              ]
       [ λ_31    λ_32     0                      ]
L̂ =   [ λ_41    λ_42    λ_43     0              ] .
       [         λ_52    λ_53    λ_54    0       ]
       [             .       .       .      .    ]
       [         λ_{n,n−3}  λ_{n,n−2}  λ_{n,n−1}  0 ]

Denoting (I + L̂) by L, where I is the identity matrix of order n, (3.5) may be
written as

L x = f.    (3.6)

Here L is unit lower triangular with bandwidth m + 1 i.e., λii = 1, and λij = 0 for
i − j > m. The sequential algorithm given by (3.4), forward substitution, requires
2mn + O(m 2 ) arithmetic operations.
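A direct (sequential) implementation of this forward substitution, used below as a reference point, might look as follows (a sketch of ours, with 0-based indices):

import numpy as np

def forward_substitution(L, f, m):
    """Solve L x = f, L unit lower triangular of order n with bandwidth m + 1."""
    x = np.array(f, dtype=float)
    n = x.size
    for i in range(n):
        for j in range(max(0, i - m), i):
            x[i] -= L[i, j] * x[j]           # x_i = phi_i - sum_j lambda_ij x_j, cf. (3.4)
    return x

rng = np.random.default_rng(1)
n, m = 8, 3
L = np.eye(n) + np.tril(np.triu(rng.standard_normal((n, n)), -m), -1)   # unit, banded
f = rng.standard_normal(n)
print(np.allclose(forward_substitution(L, f, m), np.linalg.solve(L, f)))   # True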
We present parallel algorithms for solving the unit lower triangular system (3.6)
for the following cases:
(i) dense systems (m = n − 1), denoted by R<n>,
(ii) banded systems (m  n), denoted by R<n,m>, and
(iii) Toeplitz systems, denoted by R̂<n>, and R̂<n,m>, where λij = λ̃i− j .
The kernels used as the basic building blocks of these algorithms will be those dense
BLAS primitives discussed in Chap. 2.

3.2.1 Dense Triangular Systems

Letting m = n − 1, then (3.6) may be written as

[ 1                                  ] [ ξ_1 ]   [ φ_1 ]
[ λ_21  1                            ] [ ξ_2 ]   [ φ_2 ]
[ λ_31  λ_32  1                      ] [ ξ_3 ] = [ φ_3 ]    (3.7)
[ λ_41  λ_42  λ_43  1                ] [ ξ_4 ]   [ φ_4 ]
[   :     :     :    :   .           ] [  :  ]   [  :  ]
[ λ_n1  λ_n2  λ_n3  λ_n4 ... λ_{n,n−1}  1 ] [ ξ_n ]   [ φ_n ]
Clearly, ξ_1 = φ_1, so the right-hand side of (3.7) is purified by f − ξ_1 L e_1, creating
a new right-hand side for the following lower triangular system of order (n − 1):

[ 1                             ] [ ξ_2 ]   [ φ_2^{(1)} ]
[ λ_32  1                       ] [ ξ_3 ]   [ φ_3^{(1)} ]
[ λ_42  λ_43  1                 ] [ ξ_4 ] = [ φ_4^{(1)} ] ,
[   :     :    .   .            ] [  :  ]   [    :      ]
[ λ_n2  λ_n3  λ_n4 ... 1        ] [ ξ_n ]   [ φ_n^{(1)} ]

Here φ_i^{(1)} = φ_i − φ_1 λ_{i1}, i = 2, 3, . . . , n. The process may be repeated to obtain the
rest of the components of the solution vector. Assuming we have (n − 1) processors,
this algorithm requires 2(n − 1) parallel steps with no arithmetic redundancy. This
method is often referred to as the column-sweep algorithm; we list it as Algorithm 3.1
(CSweep). It is straightforward to show that the cost becomes 3(n − 1) parallel
operations for non-unit triangular systems.
Algorithm 3.1 CSweep: Column-sweep method for unit lower triangular system
Input: Lower triangular matrix L of order n with unit diagonal, right-hand side f
Output: Solution of L x = f
1: set φ_j^{(0)} = φ_j, j = 1, . . . , n // that is, f^{(0)} = f
2: do i = 1 : n
3:    ξ_i = φ_i^{(i−1)}
4:    doall j = i + 1 : n
5:       φ_j^{(i)} = φ_j^{(i−1)} − φ_i^{(i−1)} λ_{j,i} // compute f^{(i)} = N_i^{−1} f^{(i−1)}
6:    end
7: end

The column-sweep algorithm can be modified, however, to solve (3.6) in fewer parallel steps but with higher arithmetic
redundancy.
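In NumPy-like notation (a sketch of ours), each step of CSweep performs one data-parallel update of the remaining right-hand side entries:

import numpy as np

def csweep(L, f):
    """Column-sweep solution of L x = f for a dense unit lower triangular L."""
    x = np.array(f, dtype=float)
    for i in range(x.size):
        x[i + 1:] -= x[i] * L[i + 1:, i]     # the doall over j = i+1, ..., n
    return x

rng = np.random.default_rng(0)
n = 6
L = np.eye(n) + np.tril(rng.standard_normal((n, n)), -1)
f = rng.standard_normal(n)
print(np.allclose(csweep(L, f), np.linalg.solve(L, f)))   # True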

Theorem 3.1 ([4, 5]) The triangular system of equations Lx = f, where L is a unit
lower triangular matrix of order n, can be solved in T_p = (1/2) log² n + (3/2) log n parallel steps using
no more than p = (15/1024) n³ + O(n²) processors, yielding an arithmetic redundancy
of R_p = O(n).

Proof To motivate the proof, we consider the column-sweep algorithm observing
that the coefficient matrix in (3.7), L, may be factored as the product

L = N1 N2 N3 . . . Nn−1 ,

where

       [ 1                                 ]
       [    .                              ]
       [       1                           ]
N_j =  [       λ_{j+1,j}   1               ]    (3.8)
       [          :            .           ]
       [       λ_{n,j}              1      ]

Hence, the solution x of Lx = f is given by

x = N_{n−1}^{−1} N_{n−2}^{−1} · · · N_2^{−1} N_1^{−1} f,    (3.9)

where the inverse of N j is trivially obtained by reversing the signs of λij in (3.8).
Forming the product (3.9) as shown in Fig. 3.1, we obtain the column-sweep algo-
rithm which requires 2(n − 1) parallel steps to compute the solution vector x, given
(n − 1) processors.
Assuming that n is a power of 2 and utilizing a fan-in approach to compute

M_{n−1}^{(0)} · · · M_1^{(0)} f

in parallel, the number of stages in Fig. 3.1 can be reduced from n − 1 to log n, as
shown in Fig. 3.2, where M_i^{(0)} = N_i^{−1}. This is the approach used in Algorithm 3.2
(DTS). To derive the total costs, it is important to consider the structure of the terms
in the tree of Fig. 3.2.

Fig. 3.1 Sequential solution of lower triangular system Lx = f using CSweep (column-sweep
Algorithm 3.1): the product N_{n−1}^{−1} N_{n−2}^{−1} · · · N_2^{−1} N_1^{−1} f is applied from right to left
Fig. 3.2 Computation of the solution of lower triangular system Lx = f using the fan-in approach of
DTS (Algorithm 3.2): level 0 holds M_{n−1}^{(0)}, . . . , M_1^{(0)} and f^{(0)}; pairwise products form the
M_k^{(j)} and f^{(j)} of the subsequent levels, until f^{(log n)} ≡ x

The initial terms M_i^{(0)}, i = 1, . . . , n − 1 can each be computed
using one parallel division of length n − i and a sign reversal. This requires at most
2 steps using at most n(n + 1)/2 processors.
The next important observation, that we show below, is that each M_k^{(j)} has a
maximum of 1 + 2^j elements in any given row. Therefore,

M_k^{(j+1)} = M_{2k+1}^{(j)} M_{2k}^{(j)}    (3.10)
f^{(j+1)} = M_1^{(j)} f^{(j)}    (3.11)

can be computed using independent inner products of vectors each of 1 + 2^j components
at most. Therefore, using enough processors (more on that later), all the
products for this stage can be accomplished in j + 2 parallel steps. Thus the total is
approximately


T_p = Σ_{j=0}^{log n − 1} (j + 2) + 2 = (1/2) log² n + (3/2) log n.

We now show our claim about the number of independent inner products in
the pairwise products occurring at each stage. It is convenient at each stage j =
1, . . . , log n, to partition each M_k^{(j)} as follows:

            [ I_q^{(j)}                     ]
M_k^{(j)} = [           L_k^{(j)}           ] ,    (3.12)
            [           W_k^{(j)}  I_r^{(j)} ]

where L_k^{(j)} is unit lower triangular of order s = 2^j, I_q^{(j)} and I_r^{(j)} are the identities
of order q = ks − 1 and r = (n + 1) − (k + 1)s, respectively, and W_k^{(j)} is of order r-by-s. For
j = 0,
L_k^{(0)} = 1, and W_k^{(0)} = −(λ_{k+1,k}, . . . , λ_{n,k})^T.

Observe that the number of nonzeros at the first stage j = 0, in each row of the matrices
M_k^{(0)}, is at most 2 = 2^0 + 1. Assume the result to be true at stage j. Partitioning W_k^{(j)}
as

W_k^{(j)} = [ U_k^{(j)} ]
           [ V_k^{(j)} ] ,

where U_k^{(j)} is a square matrix of order s, then from

M_k^{(j+1)} = M_{2k+1}^{(j)} M_{2k}^{(j)},

we obtain

L_k^{(j+1)} = [ L_{2k}^{(j)}                       0            ]
              [ L_{2k+1}^{(j)} U_{2k}^{(j)}   L_{2k+1}^{(j)}    ] ,    (3.13)

and

W_k^{(j+1)} = ( W_{2k+1}^{(j)} U_{2k}^{(j)} + V_{2k}^{(j)} ,  W_{2k+1}^{(j)} ).    (3.14)

From (3.12) to (3.14) it follows that the maximum number of nonzeros in each row
of M_k^{(j+1)} is 2^{j+1} + 1. Also, if we partition f^{(j)} as

f^{(j)} = [ g_1^{(j)} ]
          [ g_2^{(j)} ]
          [ g_3^{(j)} ]

in which g_1^{(j)} is of order (s − 1), and g_2^{(j)} is of order s, then

g_1^{(j+1)} = g_1^{(j)},
g_2^{(j+1)} = L_1^{(j)} g_2^{(j)}, and
g_3^{(j+1)} = W_1^{(j)} g_2^{(j)} + g_3^{(j)}

with the first two partitions constituting the first (2s − 1) elements of the solution.
We next estimate the number of processors to accomplish DTS. The terms L_{2k+1}^{(j)} U_{2k}^{(j)}
in (3.13) and W_{2k+1}^{(j)} U_{2k}^{(j)} in (3.14) can be computed simultaneously. Each column
of the former requires s inner products of s pairs of vectors of size 1, . . . , s. Therefore,
each column requires Σ_{i=1}^{s} i = s(s + 1)/2 processors. Moreover, the term
W_{2k+1}^{(j)} U_{2k}^{(j)} necessitates sr inner products of length s, where, as noted above, at this
stage, r = (n + 1) − (k + 1)s. The total number of processors for this product therefore
is s²r. The total becomes s²(s + 1)/2 + s²r and, substituting for r, we obtain that the
number of processors necessary is

p_M^{(j+1)}(k) = (s²/2)(2n + 3) − (s³/2)(2k + 1).

The remaining matrix addition in (3.14) can be performed with fewer processors.
Similarly, the evaluation of f^{(j+1)} requires p_f^{(j+1)} = (1/2)(2n + 3)s − (3/2)s² processors.
Therefore, the total number of processors necessary for stage j + 1, where j =
0, 1, . . . , log n − 2, is

p^{(j+1)} = Σ_{k=1}^{n/(2s)−1} p_M^{(j+1)}(k) + p_f^{(j+1)},

so the number of processors necessary for the algorithm is

max_j { p^{(j+1)}, n(n + 1)/2 },

which is computed to be the value for p in the statement of the theorem.

Algorithm 3.2 DTS: Triangular solver based on a fan-in approach
Input: Lower triangular matrix L of order n = 2^μ and right-hand side f
Output: Solution x of L x = f
1: f^{(0)} = f, M_i^{(0)} = N_i^{−1}, i = 1, . . . , n − 1, where N_j is as in Eq. (3.8)
2: do j = 0 : μ − 1
3:    f^{(j+1)} = M_1^{(j)} f^{(j)}
4:    doall k = 1 : (n/2^{j+1}) − 1
5:       M_k^{(j+1)} = M_{2k+1}^{(j)} M_{2k}^{(j)}
6:    end
7: end
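A compact NumPy sketch (ours) of the fan-in idea is given below; the M_k^{(j)} are stored as full matrices purely for clarity, whereas the analysis above exploits their sparsity:

import numpy as np

def dts(L, f):
    """Fan-in solution of L x = f, L unit lower triangular of order n = 2**mu."""
    n = f.size
    M = []
    for i in range(n - 1):                   # M_i^(0) = N_i^{-1}
        Ni_inv = np.eye(n)
        Ni_inv[i + 1:, i] = -L[i + 1:, i]
        M.append(Ni_inv)
    x = f.astype(float)
    while M:
        x = M[0] @ x                                            # f^(j+1) = M_1^(j) f^(j)
        M = [M[2 * k] @ M[2 * k - 1] for k in range(1, (len(M) + 1) // 2)]
        # the pairwise products M_{2k+1}^(j) M_{2k}^(j) form the next level
    return x

rng = np.random.default_rng(2)
n = 8
L = np.eye(n) + np.tril(rng.standard_normal((n, n)), -1)
f = rng.standard_normal(n)
print(np.allclose(dts(L, f), np.linalg.solve(L, f)))            # True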

3.2.2 Banded Triangular Systems

In many practical applications, the order m of the linear recurrence (3.6) is much less
than n. For such a case, algorithm DTS for dense triangular systems can be modified
to yield the result in fewer parallel steps.

Theorem 3.2 ([5]) Let L be a banded unit lower triangular matrix of order n and
bandwidth m + 1, where m ≤ n/2, and λik = 0 for i − k > m. Then the system
L x = f can be solved in less than T p = (2 + log m) log n parallel steps using fewer
than p = m(m + 1)n/2 processors.

Proof The matrix L and the vector f can be written in the form

     [ L_1                              ]        [ f_1     ]
     [ R_1  L_2                         ]        [ f_2     ]
L =  [      R_2  L_3                    ] ,  f = [ f_3     ] ,
     [            .      .              ]        [  :      ]
     [         R_{n/m−1}  L_{n/m}       ]        [ f_{n/m} ]

where L_i and R_i are m×m unit lower triangular and upper triangular matrices, respectively.
Premultiplying both sides of Lx = f by the matrix D = diag(L_1^{−1}, . . . , L_{n/m}^{−1}),
we obtain the system L^{(0)} x = f^{(0)}, where

          [ I_m                                  ]
          [ G_1^{(0)}  I_m                       ]
L^{(0)} = [            G_2^{(0)}  I_m            ] ,
          [                  .       .           ]
          [            G_{n/m−1}^{(0)}  I_m      ]

L_1 f_1^{(0)} = f_1, and
L_i (G_{i−1}^{(0)}, f_i^{(0)}) = (R_{i−1}, f_i),  i = 2, 3, . . . , n/m.    (3.15)

From Theorem 3.1, we can show that solving the systems in (3.15) requires T^{(0)} =
(1/2) log² m + (3/2) log m parallel steps, using p^{(0)} = (21/128) m² n + O(mn) processors. Now
we form matrices D^{(j)}, j = 0, 1, . . . , log(n/(2m)), such that if

L ( j+1) = D ( j) L ( j) and f ( j+1) = D ( j) f ( j) ,

then L (μ) = I and x = f (μ) , where μ ≡ log(n/m). Each matrix L ( j) is of the form
          [ I_r                                  ]
          [ G_1^{(j)}  I_r                       ]
L^{(j)} = [            G_2^{(j)}  I_r            ] ,
          [                  .       .           ]
          [            G_{n/r−1}^{(j)}  I_r      ]

where r = 2^j · m. Therefore, D^{(j)} = diag( (L_1^{(j)})^{−1}, . . . , (L_{n/2r}^{(j)})^{−1} ), i = 1, 2, . . . , n/(2r),
in which

(L_i^{(j)})^{−1} = [ I_r                       ]
                   [ −G_{2i−1}^{(j)}     I_r   ] .

Hence, for stage j + 1, we have

G_i^{(j+1)} = [ 0   G_{2i}^{(j)}                     ]
              [ 0   −G_{2i+1}^{(j)} · G_{2i}^{(j)}   ] ,   i = 1, 2, . . . , n/(2r) − 1,

and

f_i^{(j+1)} = [ f_{2i−1}^{(j)}                                   ]
              [ −G_{2i−1}^{(j)} · f_{2i−1}^{(j)} + f_{2i}^{(j)}  ] ,   i = 1, 2, . . . , n/(2r).

Observing that all except the last m columns of each matrix G_i^{(j)} are zero, then
G_{2i+1}^{(j)} G_{2i}^{(j)} and G_{2i−1}^{(j)} f_{2i−1}^{(j)}, for all i, can be evaluated simultaneously in 1 + log m
parallel arithmetic operations using p′ = (1/2) m(m + 1)n − r m² processors. In one final
subtraction, we evaluate f_i^{(j+1)} and G_i^{(j+1)}, for all i, using p″ = p′/m processors.
Therefore, T(j + 1) = 2 + log m parallel steps using p^{(j+1)} = max{p′, p″} =
(1/2) m(m + 1)n − r m² processors. The total number of parallel steps is thus given by

T_p = Σ_{j=0}^{log(n/m)} T(j) = (2 + log m) log n − (1/2)(log m)(1 + log m)    (3.16)

with p = max_j { p^{(j)} } processors. For m = n/2, p ≡ p^{(0)}; otherwise

p ≡ p^{(1)} = (1/2) m(m + 1)n − m³    (3.17)

processors.

We call the corresponding scheme BBTS and list it as Algorithm 3.3.

Corollary 3.1 Algorithm 3.3 (BBTS) as described in Theorem 3.2 requires O_p =
m² n log(n/2m) + O(mn log n) arithmetic operations, resulting in an arithmetic
redundancy of R_p = O(m log n) over the sequential algorithm.

3.2.3 Stability of Triangular System Solvers

It is well known that substitution algorithms for solving triangular systems, includ-
ing CSweep, are backward stable: in particular, the computed solution x̃, satisfies a
relation of the form ‖b − L x̃‖_∞ = O(‖L‖_∞ ‖x̃‖_∞ u); cf. [6]. Not only that, but in many
cases the forward error is much smaller than what is predicted by the usual upper
bound involving the condition number (either based on normwise or componentwise
analysis) and the backward error. For some classes of matrices, it is even possi-
ble to prove that the theoretical forward error bound does not depend on a condition
number. This is the case, for example, for the lower triangular matrix arising from an

Algorithm 3.3 BBTS: Block banded triangular solver
Input: Banded lower triangular matrix L s.t. n/m = 2^ν and vector f
Output: Solution x of L x = f
1: solve L_1 f_1^{(0)} = f_1 ;
2: doall i = 2 : 2^ν ,
3:    solve L_i (G_{i−1}^{(0)}, f_i^{(0)}) = (R_{i−1}, f_i)
4: end
5: do k = 1 : ν − 1,
6:    f_1^{(k)} = [ f_1^{(k−1)} ; f_2^{(k−1)} − G_1^{(k−1)} f_1^{(k−1)} ]
7:    doall j = 1 : 2^{ν−k} − 1,
8:       G_j^{(k)} = [ 0 , G_{2j}^{(k−1)} ; 0 , −G_{2j+1}^{(k−1)} G_{2j}^{(k−1)} ]
9:       f_{j+1}^{(k)} = [ f_{2j+1}^{(k−1)} ; f_{2j+2}^{(k−1)} − G_{2j+1}^{(k−1)} f_{2j+1}^{(k−1)} ]
10:   end
11: end
12: f_1^{(ν)} = [ f_1^{(ν−1)} ; f_2^{(ν−1)} − G_1^{(ν−1)} f_1^{(ν−1)} ]
13: x = f_1^{(ν)}

LU factorization with partial or complete pivoting strategies, and the upper triangular
matrix resulting from the QR factorization with column pivoting [6]. This is also
the case for some triangular matrices arising in the course of parallel factorization
algorithms; see for example [7].
The upper bounds obtained for the error of the parallel triangular solvers are less
satisfactory; cf. [5, 8]. This is due to the anticipated error accumulation in the (loga-
rithmic length) stages consisting of matrix multiplications building the intermediate
values in DTS. Bounds for the residual were first obtained in [5]. These were later
improved in [8]. In particular, the residual corresponding to the computed solution,
x̃, of DTS satisfies

b − L x̃ ∞ d̃n (|L||L −1 |)2 |L||x| ∞ u + O(u2 ), (3.18)

and the forward error


3.2 Linear Recurrences 61

|x − x̃| ≤ dn [M(L)]−1 |b|u + O(u2 ),

where M(L) is the matrix with values |λi,i | on the diagonal, and −|λi, j | in the
off-diagonal positions, with dn and d̃n constants of the order n log n. When L is an
M-matrix and b ≥ 0, DTS can be shown to be componentwise backward stable to
first order; cf. [8].

3.2.4 Toeplitz Triangular Systems

Toeplitz triangular systems, in which λij = λ̃i− j , for i > j, arise frequently in
practice. The algorithms presented in the previous two sections do not take advan-
tage of this special structure of L. Efficient schemes for solving Toeplitz triangular
systems require essentially the same number of parallel arithmetic operations as in
the general case, but need fewer processors, O(n 2 ) rather than O(n 3 ) processors for
dense systems, and O(mn) rather than O(m 2 n) processors for banded systems. The
solution of more general banded Toeplitz systems is discussed in Sect. 6.2 of Chap. 6.
To pave the way for a concise presentation of the algorithms for Toeplitz systems,
we present the following fundamental lemma.
Lemma 3.1 ([9]) If L is Toeplitz, then L −1 is also Toeplitz, where
⎛ ⎞
1
⎜ λ1 1 ⎟
⎜ ⎟
⎜ λ2 λ1 1 ⎟
L=⎜ ⎟.
⎜ .. . . . . . . ⎟
⎝ . . . . ⎠
λn−1 · · · λ2 λ1 1

Proof The matrix L may be written as

L = I + λ1 J + λ2 J 2 + . . . + λn−1 J n−1 , (3.19)

where we recall that in our notation in this book, J is the matrix


0
1
J= 0
0 1 0
Observing that J2 = (e3 , e4 , . . . , en , 0, 0), where ei is the ith column of the identity,
then J 3 = (e4 , . . . , en , 0, 0, 0), . . . , J n−1 = (en , 0, . . . , 0), and J n ≡ 0n , then from
(3.19) we see that J L = L J . Therefore, solving for the ith column of L −1 , i.e.,
solving the system L xi = ei , we have J L xi = J ei = ei+1 , or

L(J xi ) = ei+1 ,
62 3 Recurrences and Triangular Systems

where we have used the fact that L and J commute. Therefore, xi+1 = J xi , and L −1
can be written as
L −1 = (x1 , J x1 , J 2 x1 , . . . , J n−1 x1 ),

where L x1 = e1 . Now, if x1 = (ξ1 , ξ2 , . . . , ξn ) , then


⎛ ⎞ ⎛ ⎞ ⎛ ⎞
0 0 0
⎜ ⎟
ξ1 ⎜ ⎟ 0 ⎜ ⎟ 0
⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎜ ⎟.. ⎜ ⎟ ξ1 ⎜ ⎟ 0
⎜ ⎟ . ⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟ .. ⎜ ⎟ ξ1

J x1 = ⎜ ⎟ ⎜ ⎟ . ⎜ ⎟, . . . , etc.,
⎟, J x 1 = ⎜ ⎟, J x 1 = ⎜
2 3
⎟ ..
⎜ ⎟ ⎜ ⎟ ⎜ ⎟ .
⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎜ . ⎟ ⎜ . ⎟ ⎜ ⎟ ..
.
⎝ . ⎠ .
⎝ . ⎠ ⎝ ⎠ .
ξn−1 ξn−2 ξn−3

establishing that L −1 is a Toeplitz matrix.

Theorem 3.3 Let L be a dense Toeplitz unit lower triangular matrix of order n.
Then the system L x = f can be solved in T p = log2 n + 2 log n − 1 parallel steps,
using no more than p = n 2 /4 processors.

Proof From Lemma 3.1, we see that the first column of L −1 determines L −1
uniquely. Using this observation, consider a leading principal submatrix of the
Toeplitz matrix L, 
L1 0
,
G1 L1

where L 1 (Toeplitz), G 1 are of order q = 2 j , and we assume that L −1


1 e1 is known.
The first column of the inverse of this principal submatrix is given by

L −1
1 e1
−L 1 G 1 L −1
−1
1 e1

and can be computed in 2(1 + log q) = 2( j + 1) parallel steps using q 2 processors.


Note that the only computation involved is that of obtaining −L −1 −1
1 G 1 L 1 e1 , where
−1
L 1 e1 is given. Starting with a leading submatrix of order 4, i.e., for j = 1, we have

1 0
L −1 = ,
1 −λ1 1

and doubling the size every stage, we obtain the inverse of the leading principal
submatrix M of order n/2 in
3.2 Linear Recurrences 63

n
log
4
2( j + 1) = log2 n − log n − 2
j=1

parallel steps with (n 2 /16) processors. Thus, the solution of a Toeplitz system L x =
f , or   
M 0 x1 f
= 1 ,
N M x2 f2

is given by
x1 = M −1 f 1 ,

and
x2 = M −1 f 2 − M −1 N M −1 f 1 .

Since we have already obtained M −1 (actually the first column of the Toeplitz matrix
M −1 ), x1 and x2 can be computed in (1 + 3 log n) parallel arithmetic operations
using no more than n 2 /4 processors. Hence, the total parallel arithmetic operations
for solving a Toeplitz system of linear equations is (log2 n + 2 log n − 1) using n 2 /4
processors.

Let now n = 2ν , Le1 = (1, λ1 , λ2 , ..., λn−1 ) , and let G k be a square Toeplitz of
order 2k with its first and last columns given by

G k e1 = (λ2k , λ2k +1 , . . . , λ2k+1 −1 ) ,

and
G k e2k = (λ1 , λ2 , . . . , λ2k ) .

Based on this discussion, we construct the triangular Toeplitz solver TTS that is
listed as Algorithm 3.4.

Algorithm 3.4 TTS: Triangular Toeplitz solver


Input: n = 2ν ; L ∈ Rn×n dense Toeplitz unit lower triangular with Le1 = [1, λ1 , · · · , λn−1 ] and
right-hand side f ∈ Rn .
Output: Solution x of L x = f
(1) −1 1
1: g = L 1 e1 = ;
−λ1
2: do k = 1 : ν − 2, 
g (k)
3: g (k+1) = L −1 e = (k) ;
−L −1
k+1 1
k Gk g
4: end
// g (ν−1) determine uniquely L −1 ν−1

L −1
ν−1 f1
5: x =
L −1 −1 −1
ν−1 f 2 − L ν−1 G ν−1 L ν−1 f 1
64 3 Recurrences and Triangular Systems

If L is banded, Algorithm 3.4 is slightly altered as illustrated in the proof of the


following theorem.

Theorem 3.4 Let L be a Toeplitz banded unit lower triangular matrix of order n
and bandwidth (m + 1) ≥ 3. Then the system L x = f can be solved in less than
(3 + 2 log m) log n parallel steps using no more than 3mn/4 processors.

Proof Let L and f be partitioned as


⎛ ⎞ ⎛ (0) ⎞
L0 f1
⎜ R0 L 0 ⎟ ⎜ (0) ⎟
⎜ ⎟ ⎜
⎜ ⎟ f 2 ⎟
L=⎜ R0 L 0 ⎟, f = ⎜
⎜ .. ⎟
⎟. (3.20)
⎜ . . . . ⎟ ⎝ . ⎠
⎝ . . ⎠
(0)
R0 L 0 f n/m

At the jth stage of this algorithm, we consider the leading principal submatrix L j
of order 2r = m2 j , 
L j−1 0
Lj = ,
R j−1 L j−1

where L j−1 is Toeplitz, and 


0 R0
R j−1 = .
00

The corresponding 2r components of the right-hand side are given by




( j−1)
( j) f 2i−1
fi = ( j−1) ,
f 2i

( j)T ( j)T ( j)T


where f  = ( f 1 , f2 , . . . , f (n/2r ) ). Now, obtain the column vectors

( j) ( j)
xi = L −1
j f i , i = 1, 2, . . . , n/2r,

or

( j−1)

( j−1)
x2i−1 L −1
j−1 f 2i−1
( j−1) = ( j−1) ( j−1) . (3.21)
y2i L −1
j−1 f 2i − L −1 −1
j−1 R j−1 L j−1 f 2i−1

( j)
Note that the first 2r components of the solution vector x are given by x1 . Assuming
( j−1) ( j−1)
that we have already obtained L −1 −1
j−1 f 2i−1 and L j−1 f 2i from stage ( j − 1), then
(3.21) may be computed as shown below in Fig. 3.3, in T  = (3 + 2 log m) parallel
 
steps, using p  = mr − 21 m 2 − 21 m (n/2r ) processors. This is possible only if we
have L −1 −1
j−1 explicitly, i.e., the first column of L j−1 .
3.2 Linear Recurrences 65

( j−1)
L−1
j−1
( j−1)
f2i L−1
j−1 R j−1 L−1
j−1 f2i−1

− ·

·

Fig. 3.3 Computation of the terms of the vector (3.21)

Assuming that the first column of L −1 −1


j−1 , i.e. L j−1 e1 , is available from stage j − 1,
then we can compute the first column of L −1
j simultaneously with (3.21). Since the
first column of L −1
j can be written as


L −1
j−1 e1
,
−L −1 −1
j−1 R j−1 L j−1 e1

we see that it requires T  = (2 + 2 log m) parallel steps using p  = mr − 21 m 2 − 21 m


processors. Hence, stage j requires T ( j) = max{T  , T  } = 3 + 2 log m, parallel
(0)
steps using p j = p  + p  processors. Starting with stage 0, we compute L −1 0 fi ,
i = 1, 2, . . . , n/m, and L −1
0 e1 in T
(0) = log2 m + 2 log m − 1 parallel steps using

m(n + m)/4 processors [5]. After stage ν, where ν = log(n/2m), the last n/2
components of the solution are obtained by solving the system

(ν)
(ν)
Lν 0 x1 f1
(ν) = (ν) .
Rν L ν x2 f2

Again, the solution of this system is obtained in T (ν+1) = (3 + 2 log2 m) parallel


steps using m(n − m − 1)/2 processors. Consequently, the total number of parallel
steps required for this process is given by


log(n/m)
Tp = T (k) = (3 + 2 log m) log n − (log2 m + log m + 1).
k=0

The maximum number of processors p used in any given stage, therefore, does not
exceed 3mn/4. For n m the number of parallel steps required is less than twice
that of the general-purpose banded solver while the number of processors used is
reduced roughly by a factor of 2m/3.
66 3 Recurrences and Triangular Systems

3.3 Implementations for a Given Number of Processors

In Sects. 3.2.1–3.2.4, we presented algorithms that require the least known number of
parallel steps, usually at the expense of a rather large number of processors, especially
for dense triangular systems. Throughout this section, we present alternative schemes
that achieve the least known number of parallel steps for a given number of processors.
First, we consider banded systems of order n and bandwidth (m +1), i.e., R n, m,
and assume that the number of available processors p satisfies the inequality, m <
p  n. Clearly, if p = m we can use the column-sweep algorithm to solve the
triangular system in 2(n − 1) steps. Our main goal here is to develop a more suitable
algorithm for p > m that requires O(m 2 n/ p) parallel steps given only p > m
processors.

Algorithm 3.5 BTS: Banded Toeplitz triangular solver


Input: Banded unit Toeplitz lower triangular matrix L ∈ Rn×n , integer m s.t. n/m = 2ν and
vector f ∈ Rn as defined by (3.20)
Output: Solution x of L x = f
1: g (0) = L −1 0 e1 (using Algorithm 3.4) ;
2: doall i = 1 : n/m
(0) (0)
3: xi = L −1 0 fi ;
4: end
5: do k = 1 : ν, 
(k) g (k−1)
6: g = −1 ;
−L k−1 Rk−1 g (k−1)
// g (k) determines uniquely L −1 k
7: doall i =
1 : 2ν−k ,
(k−1)
x2i−1
8: xi(k) = (k−1) (k−1)
x2i − L −1
k−1 Rk−1 x 2i−1
9: end
10: end
//x1(ν) is the solution of L x = f , and g (ν) is the first column of L −1 .

In the following, we present two algorithms; one for obtaining all the elements of
the solution of L x = f , and one for obtaining only the last m components of x.
Theorem 3.5 Given p processors m < p  n, a unit lower triangular system of n
equations with bandwidth (m+1) can be solved in

n−m
T p = 2(m − 1) + τ (3.22)
p( p + m − 1)

parallel steps, where



(2m 2 + 3m) p − (m/2)(2m 2 + 3m + 5)
τ = max (3.23)
2m(m + 1) p − 2m
3.3 Implementations for a Given Number of Processors 67

Proof To motivate the approach, consider the following underdetermined system,


L̂ ẑ = g, ⎛ ⎞
⎛ ⎞ z0 ⎛ ⎞
R0 L 1 ⎜ z1 ⎟ g1
⎜ R1 L 2 ⎟ ⎜ ⎟ ⎜ g2 ⎟
⎜ ⎟ ⎜ z2 ⎟ ⎜ ⎟
⎜ R2 L 3 ⎟ ⎜ ⎟ ⎜ g3 ⎟
⎜ ⎟ ⎜ z 3 ⎟ = ⎜ ⎟, (3.24)
⎜ .. .. ⎟ ⎜ ⎟ ⎜ .. ⎟
⎝ . . ⎠ ⎜ .. ⎟ ⎝ . ⎠
⎝ . ⎠
R p−1 Lp gp
zp

of s = q + p( p − 1) equations with q = mp, where z 0 is given. Each L i is unit


lower triangular of bandwidth m + 1, L 1 is of order q, L i (i > 1) is of order p, and
Ri (0 ≤ i ≤ p − 1) is upper triangular containing nonzero elements only on its top
m superdiagonals.
System (3.24) can be expressed as
⎛ ⎞⎛ ⎞ ⎛ ⎞
Imp z1 h1
⎜ G1 I p ⎟ ⎜ z2 ⎟ ⎜ h2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ G2 I p ⎟ ⎜ z3 ⎟ ⎜ h3 ⎟
⎜ ⎟ ⎜ ⎟ = ⎜ ⎟, (3.25)
⎜ .. .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎝ . . ⎠⎝ . ⎠ ⎝ . ⎠
G p−1 I p zp hp

where h i and G i are obtained by solving the systems

L 1 h 1 = g1 − R0 z 0 , (3.26)

L i (G i−1 , h i ) = (Ri−1 , gi ), i = 2, 3, . . . , p. (3.27)

Using one processor, the systems in (3.26) can be solved in

τ1 = 2m 2 p

parallel steps while each of the ( p − 1) systems in (3.27) can be solved sequentially
in m
τ2 = (2m 2 + m) p − (2m 2 + 3m + 1),
2
parallel steps. Now the reason for choosing q = mp is clear; we need to make the
difference in parallel steps between solving (3.26) and any of the ( p − 1) systems
in (3.27) as small as practically possible. This will minimize the number of parallel
steps during which some of the processors remain idle. In fact, τ1 = τ2 for p =
(2m 2 + 3m + 1)/2. Assigning one processor to each of the p systems in (3.27), they
can be solved simultaneously in τ3 = max{τ1 , τ2 } parallel steps. From (3.25), the
solution vector z is given by
68 3 Recurrences and Triangular Systems

z1 = h1,
z i = h i − G i−1 z i−1 , i = 2, 3, . . . , p.

Observing that only the last m columns of each G i are different from zero and
using the available p processors, each z i can be computed in 2m parallel steps. Thus,
z 2 , z 3 , . . . , z p are obtained in τ4 = 2m( p − 1) parallel steps, and the system L ẑ = g
in (3.24) is solved in

τ = τ3 + τ4
(2m 2 + 3m) p − (m/2)(2m 2 + 3m + 5) (3.28)
= max
2m(m + 1) p − 2m

parallel steps. Now, partitioning the unit lower triangular system L x = f of n


equations and bandwidth m + 1, in the form
⎛ ⎞⎛ ⎞ ⎛ ⎞
V0 x0 f0
⎜ U1 V1 ⎟ ⎜ x1 ⎟ ⎜ f1 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ U2 V2 ⎟ ⎜ x2 ⎟ ⎜ f2 ⎟
⎜ ⎟⎜ ⎟ = ⎜ ⎟,
⎜ .. .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎝ . . ⎠⎝ . ⎠ ⎝ . ⎠
Uk Vk xk fk

where V0 is of order m, Vi , i > 0, (with the possible exception of Vk ) is of order s,


i.e., k = (n − m)/s, and each Ui is of the same form as Ri . The solution vector x is
then given by
V0 x0 = f 0 , (3.29)

Vi xi = ( f i − Ui xi−1 ), i = 1, 2, 3, . . . , k. (3.30)

Solving (3.29) by the column-sweep method in 2(m − 1) parallel steps (note that
p > m), the k systems in (3.30) can then be solved one at a time using the algorithm
developed for solving (3.24). Consequently, using p processors, L x = f is solved
in
n−m
T p = 2(m − 1) + τ
p( p + m − 1)

parallel steps in which τ is given by (3.28).


Example 3.5 Let n = 1024, m = 2, and p = 16. From (3.28), we see that τ = 205
parallel steps. Thus, T p = 822 which is less than that required by the column-sweep
method, i.e., 2(n − 1) = 2046.
In general, this algorithm is guaranteed to require fewer parallel steps than the
column-sweep method for p ≥ 2m 2 .
Corollary 3.2 The algorithm in Theorem 3.5 requires O p = O(m 2 n) arithmetic
operations, resulting in an arithmetic redundancy R p = O(m) over the sequential
algorithm.
3.3 Implementations for a Given Number of Processors 69

In some applications, one is interested only in obtaining the last component of


the solution vector, e.g. polynomial evaluation via Horner’s rule. The algorithm in
Theorem 3.5 can be modified in this case to achieve some savings. In the following
theorem, we consider the special case of first-order linear recurrences.

Theorem 3.6 Consider the unit lower triangular system L x = f of order n and
bandwidth 2. Given p processors, 1 < p  n, we can obtain ξn , the last component
of x in
T p = 3(q − 1) + 2 log p

parallel steps, where q = (2n + 1)/(2 p + 1).

Proof Partition the system L x = f in the form


⎛ ⎞⎛ ⎞ ⎛ ⎞
L1 x1 f1
⎜ R1 L 2 ⎟ ⎜ x2 ⎟ ⎜ f2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ R2 L 3 ⎟ ⎜ x3 ⎟ ⎜ f3 ⎟
⎜ ⎟⎜ ⎟ = ⎜ ⎟, (3.31)
⎜ . .
.. .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎝ ⎠⎝ . ⎠ ⎝ . ⎠
R p−1 L p xp fp

where L 1 is of order r and L i , i > 1, of order q. Now, we would like to choose r


and q, so that the number of parallel steps for solving

L 1 x1 = f 1 , (3.32)

is roughly equal to that needed for solving

L i (G i−1 , h i ) = (Ri−1 , f i ). (3.33)

This can be achieved by choosing q = (2n + 1)/(2 p + 1), and r = n − ( p − 1)q.


Therefore, the number of parallel steps required for solving the systems (3.32) and
(3.33), or reducing (3.31) to the form of (3.25) is given by

σ  = max{2(r − 1), 3(q − 1)} = 3(q − 1).

Splitting the unit lower triangular system of order p whose elements are encircled in
Fig. 3.4, the p elements ξr , ξr +q , ξr +2q , . . . , ξn can be obtained using the algorithm
of Theorem 3.2 in 2 log p parallel steps using ( p − 1) processors. Hence, we can
obtain ξn in
T p = σ + 2 log p.

Similarly, we consider solving Toeplitz triangular systems using a limited number


of processors. The algorithms outlined in Theorems 3.5 and 3.6 can be modified to
take advantage of the Toeplitz structure. Such modifications are left as exercises. We
70 3 Recurrences and Triangular Systems

Fig. 3.4 Unit lower 1


triangular matrix in
triangular system (3.25) with 1
n = 16, p = 4 and m = 1 1
1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

will, however, consider an interesting special case; a parallel Horner’s rule for the
evaluation of polynomials.
Theorem 3.7 Given p processors, 1 < p  n, we can evaluate a polynomial of
degree n in T p = 2(k − 1) + 2 log( p − 1) parallel steps where k = (n + 1)/( p − 1).
Proof Consider the evaluation of the polynomial

Pn (ξ ) = α1 ξ n + α2 ξ n−1 + . . . + αn ξ + αn+1

at ξ = θ . On a uniprocessor this can be achieved via Horner’s rule

β1 = α1 ,
βi+1 = θβi + αi+1 i = 1, 2, . . . , n,

where βn+1 = Pn (θ ). This amounts to obtaining the last component of the solution
vector of the triangular system
⎛ ⎞⎛ ⎞ ⎛ ⎞
1 β1 α1
⎜ −θ 1 ⎟ ⎜ β2 ⎟ ⎜ α2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ −θ 1 ⎟ ⎜ β3 ⎟ ⎜ α3 ⎟
⎜ ⎟⎜ ⎟=⎜ ⎟. (3.34)
⎜ .. .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎝ . . ⎠ ⎝ . ⎠ ⎝ . ⎠
−θ 1 βn+1 αn+1

Again, partitioning (3.34) as


3.3 Implementations for a Given Number of Processors 71
⎛ ⎞ ⎛ ⎞
⎛ ⎞ β1 α1
L̂ ⎜ β2 ⎟ ⎜ ⎟
⎜ R̂ L ⎟⎜ ⎟ ⎜ α2 ⎟
⎜ ⎟ ⎜ β3 ⎟ ⎜ α3 ⎟
⎜ . . ⎟⎜ ⎟=⎜ ⎟,
⎝ R . . . . ⎠ ⎜ .. ⎟ ⎜ .. ⎟
⎝ . ⎠ ⎝ . ⎠
R L β α p−1
p−1

where L is of order k = (n + 1)/( p − 1), L̂ is of order j = (n + 1) − ( p − 2)k < k,


and R̂e j = Rek = −θ e1 . Premultiplying both sides by the block-diagonal matrix,
diag( L̂ −1 , . . . , L̂ −1 ) we get the system
⎛ ⎞⎛ ⎞ ⎛ ⎞
Ij b1 h1
⎜ Ĝ Ik ⎟ ⎜ b2 ⎟ ⎜ h 2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ G ⎟ ⎜ b3 ⎟ ⎜ h 3 ⎟
⎜ ⎟⎜ ⎟=⎜ ⎟, (3.35)
⎜ .. .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎝ . . ⎠ ⎝ . ⎠ ⎝ . ⎠
G Ik b p−1 h p−1

where,
L̂h 1 = a1 ,
(3.36)
Lh i = ai , i = 2, 3, . . . , p − 1,

and
L(Ĝe j ) = L(Gek ) = −θ e1 . (3.37)

Assigning one processor to each of the bidiagonal systems (3.36) and (3.37), we
obtain h i and Gek in 2(k − 1) parallel steps. Since L is Toeplitz, one can easily show
that
g = Ĝe j = Gek = (−θ, −θ 2 , . . . , −θ k ) .

In a manner similar to the algorithm of Theorem 3.6 we split from (3.35) a smaller
linear system of order ( p − 1)

⎛ ⎞ ⎛ b ⎞ ⎛ e h ⎞
1 j j 1
⎜ −θ k 1 ⎟⎜ b ⎟ ⎜ e h 2 ⎟
⎟⎜ ⎟ ⎜ k ⎟
j+k

⎜ −θ k 1 ⎟⎜ b j+2k ⎟ ⎜
⎟ = ⎜ ek h 3 ⎟.

⎜ ⎟⎜⎜ . ⎟ ⎜ . ⎟
(3.38)
⎝ · · ⎠⎝ . ⎠ ⎝ .. ⎠
.
−θ k 1 bn+1 
ek h p−1

From Theorem 3.2 we see that using ( p−2) processors, we obtain bn+1 in 2 log( p−1)
parallel steps. Thus the total number of parallel steps for evaluating a polynomial of
degree n using p  n processors is given by

T p = 2(k − 1) + 2 log( p − 1).


72 3 Recurrences and Triangular Systems

Table 3.1 Summary of bounds for parallel steps, efficiency, redundancy and processor count in
algorithms for solving linear recurrences
L Tp p Ep Rp
(proportional to)
2 log n + 2 log n 15/1024n 3 +O(n 2 )
1 2 3
Dense R<n> 1/(n log2 n) O(n)
log2 n + 2 log n − 1 n /4 1/ log2 n
Triangular R̂<n> 2 O(1)(=5/4)
Banded R<n,m> (2 + log m) log n + 21 m(m + 1)n − m 3 1/(m log m log n) O(m log n)
Triangular O(log2 m)
R̂<n,m> (3+2 log m) log n+ 3mn/4 1/(log m log n) log n
O(log 2 m)
Here n is the order of the unit triangular system L x = f and m + 1 is the bandwidth of L

Example 3.6 Let n = 1024 and p = 16. Hence, k = 69 and T p = 144. This is
roughly (1/14) of the number of arithmetic operations required by the sequential
scheme, which is very close to (1/ p).
The various results presented in this chapter so far are summarized in Tables 3.1 and
3.2. In both tables, we deal with the problem of solving a unit lower triangular system
of equations L x = f . Table 3.1 shows upper bounds for the number of parallel steps,
processors, and the corresponding arithmetic redundancy for the unlimited number
of processors case. In this table, we observe that when the algorithms take advantage
of the special Toeplitz structure to reduce the number of processors and arithmetic
redundancy, the number of parallel steps is increased. Table 3.2 gives upper bounds
on the parallel steps and the redundancy for solving banded unit lower triangular
systems given only a limited number of processors, p, m < p  n. It is also of
interest to note that in the general case, it hardly makes a difference in the number
of parallel steps whether we are seeking all or only the last m components of the
solution. In the Toeplitz case, on the other hand, the parallel steps are cut in half if
we seek only the last m components of the solution.

3.4 Nonlinear Recurrences

We have already seen in Sects. 3.1 and 3.2 that parallel algorithms for evaluating
linear recurrence relations can achieve significant reduction in the number of required
parallel steps compared to sequential schemes. For example, from Table 3.1 we see
that the speedup for evaluating short recurrences (small m) as a function of the
problem size behaves like S p = O(n/ log n). Therefore, if we let the problem size
vary, the speedup is unbounded with n. Even discounting the fact that efficiency goes
to 0, the result is quite impressive for recurrence computations that appear sequential
at first sight. As we see next, and has been known for a long time, speedups are far
more restricted for nonlinear recurrences.

ξk+r = f (k, ξk , ξk+1 , . . . , ξk+r −1 ). (3.39)


3.4 Nonlinear Recurrences 73

Table 3.2 Summary of linear recurrence bounds using a limited number, p, of processors, where
m < p  n and m and n are as defined in Table 3.1
Problem Tp Ep Rp
2
p+m−1/2 +
2m n
Solve the General case Solve for all proportional O(m)
p )
O( mn
banded unit or last m to 1/m
lower components of
triangular x
system
Lx = f
(m = 1) Solve 3n p + O(log p) 2/3 2
for last
component of
x
p + O(mp) 1/2
4mn
Toeplitz case Solve for all O(m)
components of
x
2mn
Solve for last p−m + 1 − m
p 1 + 1
p
m components O(m 2 log p)
of x
m = 1 (e.g., p−1 +
2n
1 − 1
p 1 + 1
p
evaluation of a O(log p)
polynomial of
degree n − 1)

The mathematics literature dealing with the above recurrence is extensive, see for
example [10, 11].
The most obvious method for speeding up the evaluation of (3.39) is through
linearization, if possible, and then using the algorithms of Sects. 3.2 and 3.3.

Example 3.7 ([10]) Consider the first order rational recurrence of degree one

αk ξk + βk
ξk+1 = k ≥ 0. (3.40)
γk + ξk

Using the simple change of variables


ωk+1
ξk = − γk , (3.41)
ωk

we obtain
1
ξk+1 (γk + ξk ) = (ωk+2 − γk+1 ωk+1 ),
ωk
74 3 Recurrences and Triangular Systems

and
1
αk ξk + βk = (αk ωk+1 + (βk − αk γk )ωk ).
ωk

Therefore, the nonlinear recurrence (3.40) is reduced to the linear one

ωk+2 + δk+1 ωk+1 + ζk ωk = 0, (3.42)

where
δk+1 = −(αk + γk+1 ), and
ζk = αk γk − βk .

If the initial condition of (3.40) is ξ0 = τ , where τ = −γ0 , the corresponding initial


conditions of (3.42) are ω0 = 1 and ω1 = γ0 + τ . It is interesting to note that
the number of parallel steps required to evaluate ξi , 1 ≤ i ≤ n, via (3.40) is O(n)
whether we use one or more processors. One, however, can obtain ξ1 , ξ2 , . . . , ξn via
(3.42) in O(log n) parallel steps using O(n) processors as we shall illustrate in the
following:
(i) ζk , δk+1 (0 ≤ k ≤ n − 1), and ω1 can be obtained in 2 parallel steps using
2n + 1 processors,
(ii) the triangular system of (n +2) equations, corresponding to (3.42) can be solved
in 2 log(n + 2) parallel steps using no more than 2n − 4 processors,
(iii) ξi , 1 ≤ i ≤ n, is obtained from (3.41) in 2 parallel steps employing 2n proces-
sors.
Hence, given (2n + 1) processors, we evaluate ξi , i = 1, 2, ..., n, in 4 + 2 log(n + 2)
parallel steps.

If γk = 0, (3.40) is a continued fraction iteration. The number of parallel steps to


evaluate ξ1 , ξ2 , . . . , ξn via (3.43) in this case reduces to 1 + 2 log(n + 2) assuming
that (2n − 4) processors are available.
In general, linearization of nonlinear recurrences should be approached with cau-
tion as it may result in an ill-conditioned linear recurrence, or even over- or underflow
in the underlying computations. Further, it may not lead to significant reduction in
the number of parallel steps. The following theorem and corollary from [12] show
that not much can be expected if one allows only algebraic transformations.

Theorem 3.8 ([12]) Let ξi+1 = f (ξi ) be a rational recurrence of order 1 and
degree d > 1. Then the speedup for the parallel evaluation of the final term, ξn , of
the recurrence on p processors is bounded as follows

O1 ( f (ξ ))
Sp ≤ .
log d

for all n and p, where O1 ( f (ξ )) is the number of arithmetic operations required to


evaluate f (ξ ) on a uniprocessor.
3.4 Nonlinear Recurrences 75

As an example, consider Newton’s iteration for approximating the square root of


a real number α
ξk+1 = (ξk2 + α)/2ξk . (3.43)

This is a first-order rational recurrence of degree 2. No algebraic change of variables


similar
 to that of Example 3.7 can reduce (3.43) to a linear recurrence. Here, f (ξ ) =
α
2 ξ + ξ , T1 ( f ) = 3, and log d = 1, thus yielding 3 as an upper bound for S p .
1

Corollary 3.3 ([12]) By using parallelism, the evaluation of an expression defined


by any nonlinear polynomial recurrence, such as ξi+1 = 2ξi2 ξi−1 + ξi−2 , can be
speeded up by at most a constant factor.

If we are not restricted to algebraic transformations of the variables one can, in


certain instances, achieve higher reduction in parallel steps.

Example 3.8 Consider the rational recurrence of order 2 and degree 2

α(ξk + ξk+1 )
ξk+2 = .
(ξk ξk+1 − α)

This is equivalent to

ξk ξk+1 ξk+2 = α(ξk + ξk+1 + ξk+2 ).



Setting ξk = α tan ψk , and using the trigonometric identity

tan ψk + tan ψk+1


tan(ψk + ψk+1 ) = ,
1 − tan ψk tan ψk+1

we obtain the linear recurrence

ψk + ψk+1 + ψk+2 = arcsin 0 = νπ

ν = 0, 1, 2, . . . . If we have efficient sequential algorithms for computing tan(β) or


arctan(β), for any parameter β, in τ parallel steps, then computing ξ2 , ξ3 , . . . , ξn ,
from the initial conditions (ξ0 , ξ1 ), may be outlined as follows:

(i) compute α in 2 parallel steps (say),
(ii) compute ψ0 and ψ1 using two processors in (1 + τ ) parallel steps,
(iii) the algorithm in Theorem 3.2 may be used to obtain ψ2 , ψ3 , . . . , ψn in −1 +
3 log(n + 1) parallel steps using less than 3n processors,
(iv) using (n − 1) processors, compute ξ2 , ξ3 , . . . , ξn in (1 + τ ) parallel steps.

Hence, the total number of parallel steps amounts to

T p = 2τ + 3 log 2(n + 1),


76 3 Recurrences and Triangular Systems

Fig. 3.5 Sequential method,


ξk+1 = f (ξk ), for
approximating the fixed
point α ψ=ξ

ψ = f (ξ)

ξ0 ξ1 ξ2
0
0

using p = 3n processors, whereas the sequential algorithm for handling the nonlinear
recurrence directly requires O1 = 5(n − 1) arithmetic operations.
Therefore, provided that τ  n, we can achieve speedup S p = O(n/ log n).
This can be further enhanced if we are seeking only ξn and have a parallel algorithm
for evaluating the tan and arctan functions. In many cases the first-order nonlinear
recurrences ξk+1 = f (ξk ) arise when one attempts to converge to a fixed point. The
goal in this case is not the evaluation of ξi for i ≥ 1 (given the initial iterate ξ0 ), but an
approximation of lim ξi . The classical sequential algorithm is illustrated in Fig. 3.5,
i→∞
where | f  (ξ )| < 1 in the neighborhood of the root α. In this case, α is called an
attractive fixed point. An efficient parallel algorithm, therefore, should not attempt
to linearize the nonlinear recurrence ξk+1 = f (ξk ), but rather seek an approximation
of α by other methods for obtaining roots of a nonlinear function.

We best illustrate this by an example.

Example 3.9 Consider the nonlinear recurrence

2ξk (ξk2 + 6)
ξk+1 = = f (ξk ).
3ξk2 + 2

Starting with ξ0 = 10 as an initial approximation to the attractive fixed point α =



10, f  (α) = 3/8 < 1, we seek ξn such that |ξn − α| ≤ 10−10 . By the theory of
functional iterations, to approximate α to within the above tolerance we need at least
25 function evaluations of f , each of which requires 6 arithmetic operations, hence
T1 ≥ 150.
From the nonlinear recurrence and its initial iterate, it is obvious that we are seek-
ing the positive root of the nonlinear equation g(ξ ) = ξ(ξ 2 − 10), see Fig. 3.6. In
contrast to Theorem 3.8, we will show that given enough processors we can appre-
ciably accelerate the evaluation of ξn . It is not difficult to obtain an interval (τ, τ +γ )
3.4 Nonlinear Recurrences 77

Fig. 3.6 Function g used to


compute the fixed point of f

g(ξ)


10/3
• 
α 4 ξ

in which g(ξ ) changes sign, i.e., α ∈ (τ, τ + γ ). Assuming we have p processors,


we describe a modification of the bisection method for obtaining the single root in
(τ, τ + γ ) (see also [13, 14]):
(i) divide the interval into ( p − 1) equal subintervals (θi , θi+1 ), 1 ≤ i ≤ p − 1,
(ii) evaluate the function g(ξ ) at the points θ j , 1 ≤ j ≤ p,
(iii) detect that interval with change of sign, i.e., sign(g(θi )) = sign(g(θi+1 )),
(iv) go back to (i).
This process is repeated until we obtain a subinterval in (iii) (θν , θν+1 ) of width ≤ 2ε,
where ε = 10−t is some prescribed tolerance. A reasonable approximation to α is
therefore given by (θν + θν+1 )/2. Clearly, the process may terminate at an earlier
time if for any θi , g(θi ) ≤ ε. We can obtain an upper bound on the parallel steps
required to locate an approximation to the root. If ν is the number of simultaneous
function evaluations such that
γ
≤ 2ε,
( p − 1)ν

then  
t + log10 (γ /2)
ν≤ ,
log10 ( p − 1)

and
T p ≤ 2 + 3ν.

(Note that one function evaluation of g requires 3 arithmetic operations.)


Let (τ, τ + γ ) = (2,4), i.e., γ = 2, p = 64, and t = 10 then ν ≤ 6 and T64 ≤ 20.
Thus T1 /T p = (150/20)  7.5 which is O(n/ log n) for n ≥ 54.
78 3 Recurrences and Triangular Systems

References

1. Gautschi, W.: Computational aspects of three-term recurrence relations. SIAM Rev. 9, 24–82
(1967)
2. Rivlin, T.: The Chebyshev Polynomials. Wiley-Interscience, New York (1974)
3. Kuck, D.: The Structure of Computers and Computations. Wiley, New Yok (1978)
4. Chen, S.C., Kuck, D.: Time and parallel processor bounds for linear recurrence systems. IEEE
Trans. Comput. C-24(7), 701–717 (1975)
5. Sameh, A., Brent, R.: Solving triangular systems on a parallel computer. SIAM J. Numer. Anal.
14(6), 1101–1113 (1977)
6. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
7. Sameh, A., Kuck, D.: A parallel QR algorithm for symmetric tridiagonal matrices. IEEE Trans.
Comput. 26(2), 147–153 (1977)
8. Higham, N.: Stability of parallel triangular system solvers. SIAM J. Sci. Comput. 16(2), 400–
413 (1995)
9. Lafon, J.: Base tensorielle des matrices de Hankel (ou de Toeplitz). Appl. Numer. Math. 23,
249–361 (1975)
10. Boole, G.: Calculus of Finite Differences. Chelsea Publishing Company, New York (1970)
11. Wimp, J.: Computation with Recurrence Relations. Pitman, Boston (1984)
12. Kung, H.: New algorithms and lower bounds for the parallel evaluation of certain rational
expressions and recurrences. J. Assoc. Comput. Mach. 23(2), 252–261 (1976)
13. Miranker, W.: Parallel methods for solving equations. Math. Comput. Simul. 20(2), 93–101
(1978). doi:10.1016/0378-4754(78)90032-0. http://www.sciencedirect.com/science/article/
pii/0378475478900320
14. Gal, S., Miranker, W.: Optimal sequential and parallel seach for finding a root. J. Combinatorial
Theory 23, 1–14 (1977)
Chapter 4
General Linear Systems

One of the most fundamental problems in matrix computations is solving linear


systems of the form,
Ax = f, (4.1)

where A is an n-by-n nonsingular matrix, and f ∈ Rn is the corresponding right


hand-side. Important related problems are solving for multiple right-hand sides, say
AX = F with F ∈ Rn×s , and computing the inverse A−1 .
When A is dense, i.e. when A does not possess special structure and contains
relatively few zero elements, the most common method for solving system (4.1) on
uniprocessors is some form of Gaussian elimination; cf. [1, 2]. Typically, obtaining
a factorization (or decomposition) of A into a product of simpler terms has been
the first step in the solution procedure. For example, see the historical remarks in
[3, 4], and the seminal early contributions in [5, 6]. It is of interest to note that matrix
decompositions are not only needed for solving linear systems, but are also needed
in areas such as data analytics; for example in revealing latent information in data,
e.g. see [7]. Over the last two decades, a considerable effort has been devoted to
the development of parallel algorithms for dense matrix computations. This effort
resulted in the development of numerical libraries such as LAPACK , PLASMA and
ScaLAPACK which are widely used on uniprocessors, multicore architectures, and
distributed memory architectures, respectively, e.g. see [8, 9]. A related, but inde-
pendent, effort has also resulted in the high performance libraries: BLIS, libFlame,
and Elemental e.g. see [10]. As the subject of direct dense linear system solvers
has been sufficiently treated elsewhere in the literature, we will briefly summarize
basic direct dense linear system solvers, offer variations that enhance their parallel
scalability, and propose an approximate factorization scheme that can be used in
conjunction with iterative schemes to realize an effective and scalable solver on par-
allel architectures. Note that such approximate factorization approach is frequently
justified when:

© Springer Science+Business Media Dordrecht 2016 79


E. Gallopoulos et al., Parallelism in Matrix Computations,
Scientific Computation, DOI 10.1007/978-94-017-7188-7_4
80 4 General Linear Systems

1. An exact factorization might be very expensive or prohibitive; for example if


the dense matrix is very large or when communication in the underlying parallel
architecture is very expensive.
2. An exact factorization might not exist if certain constraints hold, for example:
(i) in order to avoid pivoting, the matrix is subjected to additive modifications
of low rank and the factorization corresponds to a modified matrix, or (ii) when
the underlying matrix is nonnegative and it is stipulated that the factors are also
nonnegative (as in data analytics).
3. Even when an exact factorization exists, the high accuracy obtained in solving the
linear system via a direct method is not justified due to uncertainty in the input.
Therefore, in such cases (in large scale applications) the combination of an approx-
imate factorization that lends itself to efficient parallel processing in combination
with an iterative technique, ranging from basic iterative refinement to precondi-
tioned projection methods, can be a preferable alternative. It is also worth noting that
recently, researchers have been addressing the issue of designing communication
avoiding (CA) algorithms for these problems. This consists of building algorithms
that achieve as little communication overhead as possible relative to some known
lower bounds, without significantly increasing the number of arithmetic (floating-
point) operations or negatively affecting the numerical stability, e.g. see for example
[11–13], and the survey [14] that discusses many problems of numerical linear
algebra (including matrix multiplication, eigenvalue problems and Krylov subspace
methods for sparse linear systems) in the CA framework.

4.1 Gaussian Elimination

Given a vector w ∈ Rn with ω1 = 0, one can construct an elementary lower triangular


matrix M
⎛ ⎞
1
⎜μ21 1 ⎟ 
⎜ ⎟ 1 0
M =⎜ . .. ⎠ ⎟ = (4.2)
⎝ .. . m In−1
μn1 1

such that

Mw = ω1 e1 . (4.3)

This is accomplished by choosing

μ j1 = −(ω j /ω1 ).
4.1 Gaussian Elimination 81

α11 a 
If A ≡ A1 = is diagonally dominant, then α11 = 0 and B is also
b B
diagonally dominant. One can construct an elementary lower triangular matrix M1
such that A2 = M1 A1 is upper-triangular as far as its first column is concerned, i.e.,

α11 a 
A2 = (4.4)
0 C

where C = B + ma , i.e., C is a rank-1 modification of B. It is also well-known


that if A1 is diagonally dominant, then C will also be diagonally dominant. Thus,
the process can be repeated on the matrix C which is of order 1 less than A, and so
on in order to produce the factorization,

Mn−1 · · · M2 M1 A = U (4.5)

where U is upper triangular, and


⎛ ⎞
1
⎜ 1 ↓ ⎟
⎜ ⎟
⎜ .. ⎟
⎜ . 1 ⎟
⎜ ⎟
Mj = ⎜ . ⎟, (4.6)
⎜ .
μ j+1, j . ⎟
⎜ ⎟
⎜ .. ⎟
⎝ . ⎠
μn, j 1

i.e., A = LU, in which L = M1−1 M2−1 · · · Mn−1


−1
, or
⎛ ⎞
1
⎜ −μ21 1 ⎟
⎜ ⎟
⎜ ⎟
L = ⎜ −μ31 −μ32 1 ⎟. (4.7)
⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
−μn1 −μn,2 · · · −μn,n−1 1

If A is not diagonally dominant, α11 is not guaranteed to be nonzero. Hence a pivot-


ing strategy is required to assure that the elements μi j are less than 1 in magnitude
to enhance numerical stability. Adopting pivoting, the first step is to choose a per-
(1)
mutation P1 such that the first column of P1 A1 has as its first element (pivot) αk1
(1) (1)
for which |αk1 | = max |αi1 |. Consequently, the procedure yields,
1≤i≤n

Mn−1 Pn−1 · · · P2 M1 P1 A1 = U, (4.8)


82 4 General Linear Systems

or
M̂n−1 M̂n−2 · · · M̂1 (Pn−1 · · · P2 P1 )A = U,

where M̂ j is identical to M j , j = 1, 2, . . . , n − 2, but whose elements in the jth


column below the diagonal are shuffled, and M̂n−1 = Mn−1 . In other words, we
obtain the factorization
PA = L̂U (4.9)

where L̂ is unit-lower triangular, U is upper triangular, and P = Pn−1 · · · P2 P1 .


In this form of Gaussian elimination, the main operation in each step is the rank-1
update kernel in BLAS2
E = E + ma . (4.10)

Theorem 4.1 ([15]) Let A ∈ Rn×n be nonsingular. Using Gaussian elimination


without pivoting, the triangular factorization A = LU may be obtained in 3(n − 1)
parallel arithmetic operations using no more than n 2 processors.
Proof Constructing each M j ( j = 1, . . . , n − 1) using n − j processors can be done
in one parallel arithmetic operation. Applying the rank-1 update at step j can be
done in two parallel arithmetic operations using (n − j)2 processors. The theorem
follows easily.
It is well-known that a rank-k update kernel, for k > 1, has higher data-locality
than the BLAS2 kernel (4.10). Thus, to realize higher performance, one resorts to
using a block form of the Gaussian elimination procedure that utilizes rank-k updates
that are BLAS3 operations, see Sect. 2.2.

4.2 Pairwise Pivoting

As outlined above, using a pivoting strategy such as partial pivoting requires obtaining
the element of maximum modulus of a vector of order n (say). In a parallel imple-
mentation, that vector could be stored across p nodes. Identifying the pivotal row
will thus require accessing data along the entire matrix column, potentially giving
rise to excessive communication across processors. Not surprisingly, more aggres-
sive forms of pivoting that have the benefit of smaller potential error, are likely to
involve even more communication. To reduce these costs, one approach is to utilize
a pairwise pivoting strategy. Pairwise pivoting leads to a factorization of the form

A = S1−1 S2−1 · · · S2n−3


−1
U (4.11)

where each S j is a block-diagonal matrix in which each block is of order 2, and U


is upper triangular. It is important to note that this is no longer an LU factorization.
Specifically, the factorization may be outlined as follows. Let a  and b be two rows
of A:
4.2 Pairwise Pivoting 83

α1 α2 α3 · · · αn
,
β1 β2 β3 · · · βn

10
and let G be the 2×2 stabilized elementary transformation G = if |α1 | ≥ |β1 |
γ 1

01
with γ = −β1 /α1 , or G = if |β1 | > |α1 | with γ = −α1 /β1 . Hence, each S j

consists of 2×2 diagonal blocks of the form of G or the identity I2 . For a nonsingular
matrix A of order 8, say, the pattern of annihilation of the elements below the diagonal
is given in Fig. 4.1 where an ‘∗’ denotes a diagonal element, and an entry ‘k’ denotes
an element annihilated by Sk . Note that, for n even, Sn−1 annihilates (n/2) elements
simultaneously, with S1 and S2n−3 each annihilating only one element. Note also
that Sk−1 , for any k, is easily obtained since (G (k)
j )
−1 is either

 
1 0 −γ 1
or .
−γ 1 1 0

Consequently, solving a system of the form Ax = f via Gaussian elimination with


pairwise pivoting consists of:

(a) factorization:
(S2n−3 · · · S2 S1 )A = U̇ , or A = Ṡ −1 U̇ ,

(b) forward sweep:


v0 = f
(4.12)
v j = S j v j−1 , j = 1, 2, . . . , 2n − 3

(c) backward sweep: Solve


U x = v2n−3 (4.13)

via the column sweep scheme.


7 
6 8 
5 7 9 
4 6 8 10 
3 5 7 9 11 
2 4 6 8 10 12 
1 3 5 7 9 11 13 

Fig. 4.1 Annihilation in Gaussian elimination with pairwise pivoting


84 4 General Linear Systems

A detailed error analysis of this pivoting strategy was conducted in [16], and in a
much refined form in [17], which shows that

|A − S1−1 S2−2 · · · S2n−3


−1
U̇ | ≤ 2n−1 (2n−1 − 1)|A|u. (4.14)

This upper bound is 2n−1 larger than that obtained for the partial pivoting strategy.
Nevertheless, except in very special cases, extensive numerical experiments show
that the difference in quality (as measured by relative residuals) of the solutions
obtained by these two pivoting strategies is imperceptible.
Another pivoting approach that is related to pairwise pivoting is termed incre-
mental pivoting [18]. Incremental pivoting has also been designed so as to reduce
the communication costs in Gaussian elimination with partial pivoting.

4.3 Block LU Factorization

Block LU factorization may be described as follows. Let A be partitioned as



A11 A12
A = ( Ȧ, Ä) = , (4.15)
A21 A22

where Ȧ consists of ν-columns, v


n. First, we obtain the LU factorization of the
rectangular matrix Ȧ using the Gaussian elimination procedure with partial pivoting
described above in which the main operations depend on rank-1 updates,

L 11
Ṗ0 Ȧ = (U11 ).
L 21

Thus, Ṗ0 A may be written as


 
L 11 0 U11 U12
Ṗ0 A = , (4.16)
L 21 In−ν 0 B0

where U12 is obtained by solving

L 11 U12 = A12 (4.17)

for U12 , and obtaining

B0 = A22 − L 21 U12 (rank−ν update).

Now, the process is repeated for B0 , i.e., choosing a window Ḃ0 consisting of the
first ν columns of B0 , obtaining an LU factorization of Ṗ1 Ḃ0 followed by obtaining
ν additional rows and columns of the factorization (4.16)
4.3 Block LU Factorization 85
⎛ ⎞⎛ ⎞
 L 11 U
U11 U12
Iν 12
Ṗ0 A = ⎝ L 21 L 22 ⎠⎝ U22 U23 ⎠ .
Ṗ1
L 21 L 32 In−2ν B1

The factorization PA = LU is completed in (n/ν) stages.


Variants of this basic block LU factorization are presented in Chap. 5 of [19].

4.3.1 Approximate Block Factorization

If A were diagonally dominant or symmetric positive definite, the above block LU


factorization could be obtained without the need of a pivoting strategy. Our aim here
is to obtain an approximate block LU factorization of general nonsymmetric matrices
without incurring the communication and memory reference overhead due to partial
pivoting employed in the previous section. The resulting approximate factorization
can then be used as an effective preconditioner of a Krylov subspace scheme for
solving the system Ax = f .
To illustrate this approximate block factorization approach, we first assume that
A is diagonally dominant of order mq, i.e., A ≡ A0 = [Ai(0) j ], i, j = 1, 2, . . . , q,
in which each Ai j is of order m. Similar to the pairwise pivoting scheme outlined
previously, the block scheme requires (2q − 3) stages to produce the factorization,

A = Ŝ −1 Û (4.18)

in which Ŝ is the product of (2q − 3) matrices each of which consists of a maximum


of q/2 nested independent block-elementary transformation of the form

Im
,
−G Im

and Û is block-upper-triangular. Given two block rows:



Aii Ai,i+1 · · · Ai,q
,
A ji A j,i+1 · · · A j,q

we denote by [i, j] the transformation,


 
I Aii Ai,i+1 · · · Ai,q
−G i j I A j,i A j,i+1 · · · A j,q

Aii Ai,i+1 · · · Ai,q
= (4.19)
0 A j,i+1 · · · A j,q
86 4 General Linear Systems

in which,
G i j = A ji Aii−1 , and (4.20)

A j,k = A j,k − A ji Aii−1 Aik (4.21)

for k = i + 1, i + 2, . . . , q. Thus, the parallel annihilation scheme that produces the


factorization (4.18) is given in Fig. 4.2 (here each ‘’ denotes a matrix of order m
n)
illustrating the available parallelism, with the order of the [i, j] transformations (4.19)
shown in Fig. 4.3.
Note that, by assuming that A is diagonally dominant, any Aii or its update Aii
is also diagonally dominant assuring the existence of the factorization Aii = L i Ui
without pivoting, or the factorization Aii = Ŝi−1 Ûi .


1 
2 3 
3 4 5 
4 5 6 7 
5 6 7 8 9 
6 7 8 9 10 11 
7 8 9 10 11 12 13 

Fig. 4.2 Annihilation in block factorization without pivoting

Steps Nodes →
↓ 1 2 3 4 5 6 7 8
1 fact(A11)
2 [1, 2]
3 [1, 3] fact(A22)
4 [1, 4] [2, 3]
5 [1, 5] [2, 4] fact(A33)
6 [1, 6] [2, 5] [3, 4]
7 [1, 7] [2, 6] [3, 5] fact(A44)
8 [1, 8] [2, 7] [3, 6] [4, 5]
9 [2, 8] [3, 7] [4, 6] fact(A55)
10 [3, 8] [4, 7] [5, 6]
11 [4, 8] [5, 7] fact(A66)
12 [5, 8] [6, 7]
13 [6, 8] fact(A77)
14 [7, 8]
15 fact(A88)

Fig. 4.3 Concurrency in block factorization


4.4 Remarks 87

We consider next the general case, where the matrix A is not diagonally domi-
nant. One attractive possibility is to proceed with the factorization of each Aii without
partial pivoting but applying, whenever necessary, the procedure of “diagonal boost-
ing” which insures the nonsingularity of each diagonal block. Such a technique was
originally proposed for Gaussian elimination in [20].
Diagonal boosting is invoked whenever any diagonal pivot violates the following
condition:
|pivot| > 0u A j 1 ,

where 0u is a multiple of the unit roundoff, u. If the diagonal pivot does not satisfy
the above condition, its value is “boosted” as:

pivot = pivot + δA j 1 if pivot > 0,

pivot = pivot − δA j 1 if pivot < 0,

where δ is often taken as the square root of 0u .


In other words, we will obtain a block factorization of a matrix M, rather than the
original coefficient matrix A. The matrix M = Ŝ −1 Û will differ from A in norm by
a quantity that depends on δ and differ in rank by the number of times boosting was
invoked. A small difference in norm and rank will make M a good preconditioner
for a Krylov subspace method (such as those described in Chaps. 9 and 10). When
applied to solving Ax = f with this preconditioner, only few iterations will be needed
to achieve a reasonable relative residual.

4.4 Remarks

Well designed kernels for matrix multiplication and rank-k updates for hierarchical
machines with multiple levels of memory and parallelism are of critical importance
for the design of dense linear system solvers. ScaLAPACK based on such kernels in
BLAS3 is a case in point. The subroutines of this library achieve high performance
and parallel scalability. In fact, many users became fully aware of these gains even
when using high-level problem solving environments like MATLAB (cf. [21]). As
early work on the subject had shown (we consider it rewarding for the reader to
consider the pioneering analyses undertaken in [22, 23]), the task of designing kernels
with high parallel scalability is far from simple, if one desires to provide a design that
closely resembles the target computer model. The task becomes even more difficult as
the complexity of the computer architecture increases. It becomes even harder when
the target is to build methods that can deliver high performance across a spectrum of
computer architectures.
88 4 General Linear Systems

References

1. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
2. Stewart, G.: Matrix Algorithms. Vol. I. Basic Decompositions. SIAM, Philadelphia (1998)
3. Grcar, J.: John Von Neumann’s analysis of Gaussian elimination and the origins of modern
numerical analysis. SIAM Rev. 53(4), 607–682 (2011). doi:10.1137/080734716. http://dx.doi.
org/10.1137/080734716
4. Stewart, G.: The decompositional approach in matrix computations. IEEE Comput. Sci. Eng.
Mag. 50–59 (2000)
5. von Neumann, J., Goldstine, H.: Numerical inverting of matrices of high order. Bull. Am. Math.
Soc. 53, 1021–1099 (1947)
6. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
7. Skillicorn, D.: Understanding Complex Datasets: Data Mining using Matrix Decompositions.
Chapman Hall/CRC Press, Boca Raton (2007)
8. Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S.: Dense linear algebra on accelerated multicore
hardware. In: Berry, M., et al. (eds.) High-Performance Scientific Computing: Algorithms and
Applications, pp. 123–146. Springer, New York (2012)
9. Luszczek, P., Kurzak, J., Dongarra, J.: Looking back at dense linear algebrasoftware. J. Parallel
Distrib. Comput. 74(7), 2548–2560 (2014). DOI http://dx.doi.org/10.1016/j.jpdc.2013.10.005
10. Igual, F.D., Chan, E., Quintana-Ortí, E.S., Quintana-Ortí, G., van de Geijn, R.A., Zee, F.G.V.:
The FLAME approach: from dense linear algebra algorithms to high-performance multi-
accelerator implementations. J. Parallel Distrib. Comput. 72, 1134–1143 (2012)
11. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and
sequential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), 206–239 (2012). doi:10.
1137/080731992
12. Grigori, L., Demmel, J., Xiang, H.: CALU: a communication optimal LU factorization algo-
rithm. SIAM J. Matrix Anal. Appl. 32(4), 1317–1350 (2011). doi:10.1137/100788926. http://
dx.doi.org/10.1137/100788926
13. Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix
multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004). doi:10.1016/j.jpdc.2004.
03.021. http://dx.doi.org/10.1016/j.jpdc.2004.03.021
14. Ballard, G., Carson, E., Demmel, J., Hoemmen, M., Knight, N., Schwartz, O.: Communication
lower bounds and optimal algorithms for numerical linear algebra. Acta Numerica 23, 1–155
(2014)
15. Sameh, A.: Numerical parallel algorithms—a survey. In: Kuck, D., Lawrie, D., Sameh, A.
(eds.) High Speed Computer and Algorithm Optimization, pp. 207–228. Academic Press, New
York (1977)
16. Stern, J.: A fast Gaussian elimination scheme and automated roundoff error analysis on S.I.M.E.
machines. Ph.D. thesis, University of Illinois (1979)
17. Sorensen, D.: Analysis of pairwise pivoting in Gaussian elemination. IEEE Trans. Comput.
C-34(3), 274–278 (1985)
18. Quintana-Ortí, E.S., van de Geijn, R.A.: Updating an LU factorization with pivoting. ACM
Trans. Math. Softw. (TOMS) 35(2), 11 (2008)
19. Dongarra, J., Duff, I., Sorensen, D., van der Vorst, H.: Numerical Linear Algebra for High-
Performance Computers. SIAM, Philadelphia (1998)
20. Stewart, G.: Modifying pivot elements in Gaussian elimination. Math. Comput. 28(126), 537–
542 (1974)
References 89

21. Moler, C.: MATLAB incorporates LAPACK. Mathworks Newsletter (2000). http://www.
mathworks.com/company/newsletters/articles/matlab-incorporates-lapack.html
22. Gallivan, K., Jalby, W., Meier, U.: The use of BLAS3 in linear algebra on a parallel processor
with a hierarchical memory. SIAM J. Sci. Statist. Comput. 8(6), 1079–1084 (1987)
23. Gallivan, K., Jalby, W., Meier, U., Sameh, A.: The impact of hierarchical memory systems on
linear algebra algorithm design. Int. J. Supercomput. Appl. 2(1) (1988)
Chapter 5
Banded Linear Systems

We encounter banded linear systems in many areas of computational science and


engineering, including computational mechanics and nanoelectronics, to name but a
few. In finite-element analysis, the underlying large sparse linear systems can some-
times be reordered via graph manipulation schemes such as reverse Cuthill-McKee
or spectral reordering to result in low-rank perturbations of banded systems that are
dense or sparse within the band, and in which the width of the band is often but a small
fraction of the size of the overall problem; cf. Sect. 2.4.2. In such cases, a chosen
central band can be used as a preconditioner for an outer Krylov subspace iteration. It
is thus safe to say that the need to solve banded linear systems arises extremely often
in numerical computing. In this chapter, we present parallel algorithms for solving
narrow-banded linear systems: (a) direct methods that depend on Gaussian elimina-
tion with partial pivoting as well as pairwise pivoting strategies, (b) a class of robust
hybrid (direct/iterative) solvers for narrow-banded systems that do not depend on a
global LU-factorization of the coefficient matrix and the subsequent global forward
and backward sweeps, (c) a generalization of this class of algorithms for solving
systems involving wide-banded preconditioners via a tearing method, and, finally,
(d) direct methods specially designed for tridiagonal linear systems. Throughout this
rather long chapter, we denote the linear system(s) under consideration by AX = F,
where A is a nonsingular banded matrix of bandwidth 2m + 1, and F contains ν ≥ 1
right hand-sides. However if we want to restrict our discussion to a single right-hand
side, we will write Ax = f .

5.1 LU-based Schemes with Partial Pivoting

Using the classical Gaussian elimination scheme, with partial pivoting, outlined
in Chap. 4, for handling banded systems results in limited opportunity for paral-
lelism. This limitation becomes more pronounced the narrower the system’s band.
To illustrate this limitation, consider a banded system of bandwidth (2m + 1), shown
in Fig. 5.1 for m = 4.
© Springer Science+Business Media Dordrecht 2016 91
E. Gallopoulos et al., Parallelism in Matrix Computations,
Scientific Computation, DOI 10.1007/978-94-017-7188-7_5
92 5 Banded Linear Systems

Fig. 5.1 Illustrating limited x x x x x


parallelism when Gaussian x x x x x x
elimination is applied on a x x x x x x x
system of bandwidth 9
x x x x x x x x
x x x x x x x x x
x x x x x x x x x
x x x x x x x x x
x x x x x x x x x
x x x x x x x xx
x x x x x x xxx
x x x x x xxxx

The scheme will consider the triangularization of the leading (2m +1) × (2m+1)
window with partial pivoting, leading to possible additional fill-in. The process is re-
peated for the following window that slides down the diagonal, one diagonal element
at a time.
In view of such limitation, an alternative LU factorization considered in [1] was
adopted in the ScaLAPACK library; see also [2, 3]. For ease of illustration, we
consider the two-partitions case of a tridiagonal system Ax = f , of order 18 × 18,
initially setting A0 ≡ A and f 0 ≡ f ; see Fig. 5.2 for matrix A. After row permuta-
tions, we obtain A1 x = f 1 where A1 = P0 A0 , with P0 = [e18 , e1 , e2 , . . . , e17 ]
in which e j is the jth column of the identity I18 , see Fig. 5.3. Following this by the
column permutations, A2 = A1 P1 , with
⎛ ⎞
I7 0 0 0
⎜0 0 I2 0⎟
P1 = ⎜
⎝0

I7 0 0⎠
0 0 0 I2

we get, 
B1 0 E 1 F1
A2 = ,
0 B2 E 2 F2

see Fig. 5.4, where B1 , B2 ∈ R9×7 , and E j , F j ∈ R9×2 , j = 1, 2. Obtaining the


LU-factorization of B1 and B2 simultaneously, via partial pivoting,
 
R1 R2
P2 B1 = L 1 ; P3 B2 = L 2 ,
0 0

where L j is unit lower triangular of order 9, and R j is upper triangular of order 7,


then the linear system Ax = f is reduced to solving A2 y = f 1 , where y = P1 x,
which in turn is reduced to solving A3 y = g where
⎛ ⎞
R1 0 E 1 F1
⎜ 0 0 E 1 F1 ⎟
A3 = ⎜
⎝ 0

R2 E 2 F2 ⎠
0 0 E 2 F2
5.1 LU-based Schemes with Partial Pivoting 93

Fig. 5.2 Original banded


matrix A ∈ R18×18

Fig. 5.3 Banded matrix


after the row permutations
A1 = P0 A0 , as in
ScaLAPACK

Fig. 5.4 Banded matrix


after the row and column
permutations A2 = A1 P1 , as
in ScaLAPACK
94 5 Banded Linear Systems
 
L −1 0 P2 0
g= 1 f1,
0 L −1
2
0 P3

and 

E j F j
L −1 E j , Fj = ,
j E j F j

with E j , F j ∈ R7×2 , and E j , F j ∈ R2×2 . Using the permutation


⎛ ⎞
I7 0 0 0 0
⎜0 0 I7 0 0⎟
P4 = ⎜
⎝0
⎟,
I2 0 0 0⎠
0 0 0 0 I2

the system A3 y = g is reduced to the block-upper triangular system P4 A3 y = P4 g,


where ⎛ ⎞
R1 0 E 1 F1
⎜ 0 R2 E  F  ⎟
P4 A3 = ⎜ 2 2⎟
⎝ 0 0 E  F  ⎠ .
1 1
0 0 E 2 F2

Since A is nonsingular, the (4 × 4) matrix



E 1 F1
E 2 F2

is also nonsingular and hence y, and subsequently, the unique solution x can be
computed.
In Sects. 5.2 and 5.3, we present alternatives to this parallel direct solver where we
present various hybrid (direct/iterative) schemes that possess higher parallel scala-
bility. Clearly, these hybrid schemes are most suitable when the user does not require
obtaining solutions whose corresponding residuals have norms of the order of the
unit roundoff.

5.2 The Spike Family of Algorithms

Parallel banded linear system solvers have been considered by many authors
[1, 4–11]. We focus here on one of these solvers—the Spike algorithm which dates
back to the 1970s (the original algorithm created for solving tridiagonal systems on
parallel architectures [12] is discussed in detail in Sect. 5.5.5 later in this chapter).
Further developments and analysis are given in [13–18]. A distinct feature of the
Spike algorithm is that it avoids the poorly scalable parallel scheme of obtaining
the classical global LU factorization of the narrow-banded coefficient matrix. The
overarching strategy of the Spike scheme consists of two main phases:
5.2 The Spike Family of Algorithms 95

(i) the coefficient matrix is reordered or modified so as to consist, for example, of


several independent diagonal blocks interconnected somehow so as to allow the
extraction of an independent reduced system of a smaller size than that of the
original system,
(ii) once the reduced system is solved, the original problem decomposes into several
independent smaller problems facilitating almost perfect parallelism in retrieving
the rest of the solution vector.

First, we review the Spike algorithm and its features. The Spike algorithm consists
of the four stages listed in Algorithm 5.1.

Algorithm 5.1 Basic stages of the Spike algorithm


Input: Banded matrix and right-hand side(s)
Output: Solution vector(s)
//(a) Pre-processing:
1: partitioning of the original system and distributing it on different nodes, in which each node
consists of several or many cores,
2: factorization of each diagonal block and extraction of a reduced system of much smaller size;
//(b) Post-processing: solving the reduced system,
3: retrieving the overall solution.

Compared to the parallel banded LU-factorization schemes in the existing litera-


ture, the Spike algorithm reduces the required memory references and interprocessor
communications at the cost of performing more arithmetic operations. Such a trade-
off results in the good parallel scalability enjoyed by the Spike family of solvers.
The Spike algorithm also enables multilevel parallelism—parallelism across nodes,
and parallelism within each multi- or many-core node. Further, it is flexible in the
methods used for the pre- and post-processing stages, resulting in a family of solvers.
As such, the Spike algorithm has several built-in options that range from using it as
a pure direct solver to using it as a solver for banded preconditioners of an outer
iterative scheme.

5.2.1 The Spike Algorithm

For ease of illustration, we assume for the time being, that any block-diagonal part of
A is nonsingular. Later, however, this assumption will be removed, and the systems
AX = F are solved via a Krylov subspace method with preconditioners which are
low-rank perturbations of the matrix A. Solving systems involving such banded
preconditioners in each outer Krylov iteration will be achieved using a member of
the Spike family of algorithms.
96 5 Banded Linear Systems

Preprocessing Stage
This stage starts with partitioning the banded linear system into a block tridiagonal
form with p diagonal blocks A j ( j = 1, . . . , p), each of order n j = n/ p (assuming
that n ia an integer multiple of p), with coupling matrices (of order m  n) B j ( j =
1, . . . , p − 1), and C j ( j = 2, . . . , p), associated with the super- and sub-block
diagonals, respectively. Note that it is not a requirement that all the diagonal blocks
A j be of the same order. A straightforward implementation of the Spike algorithm
on a distributed memory architecture could be realized by choosing the number of
partitions p to be the same as the number of available multicore nodes q. In general,
however, p ≤ q. This stage concludes by computing the LU-factorization of each
diagonal block A j . Figure 5.5 illustrates the partitioning of the matrices A and F for
p = 4.
The Spike Factorization Stage
Based on our assumption above that each A j is nonsingular, the matrix A can be
factored as A = DS, where D is a block-diagonal matrix consisting only of the
factored diagonal blocks A j ,

D = diag(A1 , . . . , A p ),

and S is called the “Spike matrix”—shown in Fig. 5.6.


For a given partition j, we call V j ( j = 1, . . . , p − 1) and W j ( j = 2, . . . , p),
respectively, the right and the left spikes each of order n j × m.
The spikes V1 and W p are given by
 
0 Im
V1 = (A1 )−1 B1 , and W p = (A p )−1 C p. (5.1)
Im 0

The spikes V j and W j , j = 2, . . . , p − 1, may be generated by solving,

A1 F1 1
B
1
C
2 A2 F2
B 2
A = 2 F =
C
3 F3
A3 B 3
3
C
4 A4 F4 4

Fig. 5.5 Spike partitioning of the matrix A and block of right-hand sides F with p = 4
5.2 The Spike Family of Algorithms 97

*
I. .
. . V
. . 1 1
I *
*
. I. *
.
W . . . V
2 . . . 2 2
S = * I *
*
. I. *
.
W . . . V
3 . . . 3 3
* I *
*
. I.
W . .
4 . . 4
* I

Fig. 5.6 The Spike matrix with 4 partitions

⎞ ⎛
0 Cj
⎜ .. ⎟
⎜ . 0 ⎟

A j (V j , W j ) = ⎜ ⎟
. ⎟. (5.2)
⎝ 0 .. ⎠
Bj 0

Postprocessing Stage
Solving the system AX = F now consists of two steps:

(a) solve DG = F, (5.3)


(b) solve SX = G. (5.4)

The solution of the linear system DG = F in step (a), yields the modified right
hand-side matrix G needed for step (b). Assigning one partition to each node, step
(a) is performed with perfect parallelism. If we decouple the pre- and post-processing
stages, step (a) may be combined with the generation of the spikes in Eq. (5.2).
Let the spikes V j and W j be partitioned as follows:

(t)
⎞ ⎛ ⎛ (t) ⎞
Vj W
⎜  ⎟ ⎜ j ⎟
V j = ⎝ j ⎠ and W j = ⎝ W j ⎟
⎜ V ⎟ ⎜
⎠ (5.5)
(b) (b)
Vj Wj

where V j(t) , V j , V j(b) , and W (t)  (b)


j , W j , W j , are the top m, the middle (n j − 2m) and
the bottom m rows of V j and W j , respectively. In other words,

V j(b) = [0, Im ]V j ; W (t)


j = [Im 0]W j , (5.6)

and
(t) (b)
Vj = [Im 0]V j ; W j = [0 Im ]W j . (5.7)
98 5 Banded Linear Systems

Similarly, let the jth partitions of X and G, be given by



(t)
⎞ ⎛ (t) ⎞
Xj G
⎜  ⎟ ⎜ j ⎟
X j = ⎝ j ⎠ and G j = ⎝ G j ⎟
⎜ X ⎟ ⎜
⎠. (5.8)
(b) (b)
Xj Gj

Thus, in solving SX = G in step (b), we observe that the problem can be reduced
further by solving the reduced system,

Ŝ X̂ = Ĝ, (5.9)

where Ŝ is a block-tridiagonal matrix of order t = 2m( p − 1). While t may not be


small, in practice it is much smaller than n (the size of the original system). Each kth
diagonal block of Ŝ is given by,

(b)
Im Vk
(t) , (5.10)
Wk+1 Im

with the corresponding left and right off-diagonal blocks,


 (b)

Wk 0 0 0
and (t) , (5.11)
0 0 0 Vk+1

and the associated solution and right hand-side matrices


(b)
(b)

Xk Gk
(t)
and .
X k+1 G (t)
k+1

Finally, once the solution X̂ of the reduced system (5.9) is obtained, the global
solution X is reconstructed with perfect parallelism from X k(b) (k = 1, . . . , p − 1)
and X k(t) (k = 2, . . . , p) as follows:

⎪    (t)
⎨ X 1 = G 1 − V1 X 2 ,
(t) (b)
X j = G j − V j X j+1 − W j X j−1 , j = 2, . . . , p − 1,
 

⎩  (b)
X p = G p − W j X p−1 .

Remark 5.1 Note that the generation of the spikes is included in the factorization
step. In this way, the solver makes use of the spikes stored in memory thus allowing
solving the reduced system quickly and efficiently. Since in many applications one
has to solve many linear systems with the same coefficient matrix A but with different
5.2 The Spike Family of Algorithms 99

right hand-sides, optimization of the solver step is crucial for assuring high parallel
scalability of the Spike algorithm.

5.2.2 Spike: A Polyalgorithm

Several options are available for efficient implementation of the Spike algorithm on
parallel architectures. These choices depend on the properties of the linear system
as well as the parallel architecture at hand. More specifically, each of the following
three tasks in the Spike algorithm can be handled in several ways:
1. factorization of the diagonal blocks A j ,
2. computation of the spikes,
3. solution of the reduced system.
In the first task, each linear system associated with A j can be solved via: (i) a direct
method making use of the LU-factorization with partial pivoting, or the Cholesky
factorization if A j is symmetric positive definite, (ii) a direct method using an LU-
factorization without pivoting but with a diagonal boosting strategy, (iii) an iterative
method with a preconditioning strategy, or (iv) via an appropriate approximation of
the inverse of A j . If more than one node is associated with each partition, then linear
systems associated with each A j may be solved using the Spike algorithm creating
yet another level of parallelism. In the following, however, we consider only the case
in which each partition is associated with only one multicore node.
In the second task, the spikes can be computed either explicitly (fully or partially)
using Eq. (5.2), or implicitly—“on-the-fly”.
In the third task, the reduced system (5.9) can be solved via (i) a direct method
such as LU with partial pivoting, (ii) a “recursive” form of the Spike algorithm, (iii)
a preconditioned iterative scheme, or (iv) a “truncated” form of the Spike scheme,
which is ideal for diagonally dominant systems, as will be discussed later.
Note that an outer iterative method will be necessary to assure solutions with
acceptable relative residuals for the original linear system whenever we do not use
numerically stable direct methods for solving systems (5.3) and (5.9). In such a case,
the overall hybrid solver consists of an outer Krylov subspace iteration for solving
Ax = f , in which the Spike algorithm is used as a solver of systems involving a
banded preconditioner consisting of an approximate Spike factorization of A in each
outer iteration. In the remainder of this section, we describe several variants of the
Spike algorithm depending on: (a) whether the whole spikes are obtained explicitly,
or (b) whether we obtain only an approximation of the independent reduced system.
We describe these variants depending on whether the original banded system is
diagonally dominant.
100 5 Banded Linear Systems

5.2.3 The Non-diagonally Dominant Case

If it is known apriori that all the diagonal blocks A j are nonsingular, then the LU-
factorization of each block A j with partial pivoting is obtained using either the
relevant single core LAPACK routine [19], or its multithreaded counterpart on one
multicore node. Solving the resulting banded triangular system to generate the spikes
in (5.2), and update the right hand-sides in (5.3), may be realized using a BLAS3
based primitive [16], instead of BLAS2 based LAPACK primitive. In the absence of
any knowledge of the nonsingularity of the diagonal blocks A j , an LU-factorization
is performed on each diagonal block without pivoting, but with a diagonal boosting
strategy to overcome problems associated with very small pivots. Thus, we either ob-
tain an LU-factorization of a given diagonal block A j , or the factorization of a slightly
perturbed A j . Such a strategy circumvents difficulties resulting from computation-
ally singular diagonal blocks. If diagonal boosting is required for the factorization of
any diagonal block, an outer Krylov subspace iteration is used to solve Ax = f with
the preconditioner being M = D̂ Ŝ, where D̂ and Ŝ are the Spike factors resulting
after diagonal boosting. In this case, the Spike scheme reduces to solving systems
involving the preconditioner in each outer Krylov subspace iteration.
One natural way to solve the reduced system (5.9) in parallel is to make use of an
inner Krylov subspace iterations with a block Jacobi preconditioner obtained from
the diagonal blocks of the reduced system (5.10). For these non-diagonally dominant
systems, however, this preconditioner may not be effective in assuring a small relative
residual without a large number of outer iterations. This, in turn, will result in high
interprocessor communication cost. If the unit cost of interprocessor communication
is excessively high, the reduced system may be solved directly on a single node.
Such alternative, however, may have memory limitations if the size of the reduced
system is large. Instead, we propose the following parallel recursive scheme for solv-
ing the reduced system. This “Recursive” scheme involves successive applications
of the Spike algorithm resulting in better balance between the computational and
communication costs.
First, we dispense with the simple case of two partitions, p = 2. In this case, the
reduced system consists only of one diagonal block (5.10), extracted from the central
part of the system (5.4)

Im V1(b) X 1(b) G (b)
(t) (t) = 1
(t) , (5.12)
W2 I m X2 G2

that can be solved directly as follows:


(t) (b)
• Form H = Im − W2 V1 ,
(t) (t) (t) (b) (t)
• Solve H X 2 = G 2 − W2 G 1 to obtain X 2 ,
• Compute X 1(b) = G (b) (b) (t)
1 − V1 X 2 .
5.2 The Spike Family of Algorithms 101

Note that if we obtain the LU factorization of A1 , and the UL factorization of A2 , then


(b) (t)
the bottom tip V1 of the right spike, and and the top tip W2 of the left spike can
be obtained very economically without obtaining all of the spikes V1 and W2 . Once
X 1(b) and X 2(t) are computed, retrieving the rest of the solution of AX = F cannot be
obtained using Eq. (5.12) since the entire spikes are not computed explicitly. Instead,
we purify the right hand-side from the contributions of the coupling blocks B and
C, thus decoupling the system into two independent subsystems. This is followed by
solving these independent systems simultaneously using the previously computed
factorizations.
The Recursive Form of the Spike Algorithm
In this variant of the Spike algorithm we wish to solve the reduced system recursively
using the Spike algorithm. In order to proceed this way, however, we first need the
number of partitions to be a power of 2, i.e. p = 2d , and second we need to work with
a slightly modified reduced system. The independent reduced system that is suitable
for the first level of this recursion has a coefficient matrix S̃1 that consists of the matrix
Ŝ in (5.9), augmented at the top by the top m rows of the first partition, and augmented
at the bottom by the bottom m rows of the last partition. The structure of this new
reduced system, of order 2mp, remains block-tridiagonal, with each diagonal block
being the identity matrix of order 2m, and the corresponding off-diagonal blocks
associated with the kth diagonal block (also of order 2m) are given by

0 Wk(t) Vk(t) 0
(b)
for k = 2, . . . , p, and (b)
for k = 1, . . . , p − 1. (5.13)
0 Wk Vk 0

Let the spikes of the new reduced system at level 1 of the recursion be denoted by
Vk[1] and Wk[1] , where
(t)
(t)

Vk Wk
Vk[1] = and Wk[1] = . (5.14)
Vk(b) Wk(b)

In preparation for level 2 of the recursion of the Spike algorithm, we choose now
to partition the matrix S̃1 using p/2 partitions with diagonal blocks each of size 4m.
The matrix can then be factored as

S̃1 = D1 S̃2 ,

where S̃2 represents the new Spike matrix at level 2 composed of the spikes Vk[2] and
Wk[2] . For p = 4 partitions, these matrices are of the form
102 5 Banded Linear Systems

and

In general, at level i of the recursion, the spikes Vk[i] , and Wk[i] , with k ranging from
1 to p/(2i ), are of order 2i m × m. Thus, if the number of the original partitions p is
equal to 2d , the total number of recursion levels is d − 1 and the matrix S̃1 is given
by the product
S̃1 = D1 D2 . . . Dd−1 S̃d ,

where the matrix S̃d has only two spikes V1[d] and W2[d] . Thus, the reduced system
can be written as,
S̃d X̃ = B, (5.15)

where B is the modified right hand side given by

B = D −1 −1 −1
d−1 . . . D2 D1 G̃. (5.16)

If we assume that the spikes Vk[i] and Wk[i] , for all k, of the matrix S̃i are known at
a given level i, then we can compute the spikes Vk[i+1] and Wk[i+1] at level i + 1 as
follows:
STEP 1: Denoting the bottom and the top blocks of the spikes at the level i by

Vk[i](b) = [0 Im ]Vk[i] ; Wk[i](t) = [Im 0]Wk[i] ,


5.2 The Spike Family of Algorithms 103

and the middle block of 2m rows of the spikes at level i + 1 by



V̇k[i+1] Ẇk[i+1]
= [0 I2m 0]Vk[i+1] ; = [0 I2m 0]Wk[i+1] ,
V̈k[i+1] Ẅk[i+1]

one can form the following reduced systems:


[i](b)

Im V2k−1 V̇k[i+1] 0 p
= [i](t) , k = 1, 2, . . . , − 1, (5.17)
[i](t)
W2k Im V̈k[i+1] V2k 2i−1

and
[i](b)

Im V2k−1 Ẇk[i+1] [i](b)
W2k−1 p
= , k = 2, 3, . . . , . (5.18)
[i](t)
W2k Im Ẅk[i+1] 0 2i−1

These reduced systems are solved similarly to (5.12) to obtain the central portion
of all the spikes at level i + 1.
STEP 2: The rest of the spikes at level i + 1 are retrieved as follows:

[I2i m 0]Vk[i+1] = −V2k−1


[i]
V̈k[i+1] , [0 I2i m ]Vk[i+1] = V2k
[i] [i] [i+1]
− W2k V̇k , (5.19)

and

[I2i m 0]Wk[i+1] = W2k−1


[i] [i]
−V2k−1 Ẅk[i+1] , [0 I2i m ]Wk[i+1] = −W2k
[i] [i+1]
Ẇk . (5.20)

In order to compute the modified right hand-side B in (5.16), we need to solve


d − 1 block-diagonal systems of the form, Di G̃ i = G̃ i−1 , where G̃ 1 = D1−1 G̃). For
each diagonal block k, the reduced system is similar to that in (5.12), or in (5.17) and
(5.18), but the right hand-side is now defined as a function of G̃ i−1 . Once we get the
central portion of each G̃ i , associated with each block k in Di , the entire solution
is retrieved as in (5.19) and (5.20). Consequently, the linear system (5.15) involves
only one reduced system to solve and only one retrieval stage to get the solution X̃ ,
from which the overall solution X is retrieved via (5.12).
The Spike Algorithm on-the-Fly
One can completely avoid explicitly computing and storing all the spikes, and hence
the reduced system. Instead, the reduced system is solved using a Krylov subspace
method which requires only matrix-vector multiplication. Note that the 2m × 2m
diagonal and off-diagonal blocks are given by (5.10) and (5.11), respectively. Since
(t) (b) (t) (b)
V j , V j , W j+1 , and W j+1 , for j = 1, 2, . . . ., p − 1 are given by (5.6) and (5.7), in
which Vk and Wk are given by (5.1), each matrix-vector product exhibits abundant
parallelism in which the most time-consuming computations concerns solving sys-
tems of equations involving the diagonal blocks A j whose LU-factors are available.
104 5 Banded Linear Systems

5.2.4 The Diagonally Dominant Case

In this section, we describe a variant of the Spike algorithm—the truncated Spike


algorithm—that is aimed at handling diagonally dominant systems. This truncated
scheme generates an effective preconditioner for such diagonally dominant systems.
Further, it yields a modified reduced system that is block-diagonal rather than block-
tridiagonal. This, in turn, allows solving the reduced system with higher parallel
scalability. In addition, the truncated scheme facilitates different new options for the
factorization of each diagonal making it possible to compute the upper and lower
tips of each spike rather than the entire spike.
The Truncated Spike Algorithm
If matrix A is diagonally dominant, one can show that the magnitude of the elements
of the right spikes V j decay in magnitude from bottom to top, while the elements of
the left spikes W j decay in magnitude from top to bottom [20]. The decay is more
pronounced the higher the degree of diagonal dominance. Since the size of each A j
is much larger than the size of the coupling blocks B j and C j , the bottom blocks of
(b) (t)
the left spikes W j and the top blocks of the right spikes V j can be ignored. This
leads to an approximate block-diagonal reduced system in which each block is of
the form (5.10) leading to enhanced parallelism.
The Truncated Factorization Stage
Since the matrix is diagonally dominant, the LU factorization of each block A j can
be performed without pivoting via the relevant single core LAPACK routine or its
multicore counterpart. Using the resulting factors of A j , the bottom m × m tip of
the right spike V j in (5.1) is obtained using only the bottom m × m blocks of L and
U . Obtaining the upper m × m tip of the left spike W j still requires computing the
entire spike with complete forward and backward sweeps. However, if one computes
the UL-factorization of A j , without pivoting, the upper m × m tip of W j can be
similarly obtained using the top m × m blocks of the new U and L. Because of
the data locality-rich BLAS3, adopting such a LU/UL strategy on many parallel
architectures, results in superior performance compared to computing only the LU-
factorization for each diagonal block and generating the entire left spikes. Note that,
using an appropriate permutation, that reverses the order of the rows and columns
of each diagonal block, one can use the same LU-factorization scheme to obtain the
corresponding UL-factorization.
Approximation of the Factorization Stage
For high degree of diagonal dominance, an alternative to performing a UL-factori-
(t)
zation of the diagonal blocks A j is to approximate the upper tip of the left spike, W j ,
of order m, by inverting only the leading l ×l (l > m) diagonal block (top left corner)
of the banded block A j . Typically, we choose l = 2m to get a suitable approximation
of the m ×m left top corner of the inverse of A j . With such an approximation strategy,
Spike is used as a preconditioner that could yield low relative residuals after few outer
iterations of a Krylov subspace method.
5.2 The Spike Family of Algorithms 105

As outlined above for the two-partition case, once the reduced system is solved,
we purify the right hand-side from the contributions of the coupling blocks B j and
C j , thus decoupling the system into independent subsystems, one corresponding to
each diagonal block. Solving these independent systems simultaneously using the
previously computed LU- or UL-factorizations as shown below:
⎧ 

⎪ 0 (t)

⎪ A1 X 1 = F1 − B j X2 ,

⎪ I m



⎪  

0 (t) I (b)
A j X j = Fj − B j X j+1 − m C j X j−1 , j = 2, . . . , p − 1

⎪ I m 0



⎪ 



⎪ Im (b)
⎩ p p
A X = F p − C j X p−1 .
0

5.3 The Spike-Balance Scheme

We introduce yet another variant of the Spike scheme for solving banded systems
that depends on solving simultaneously several independent underdetermined linear
least squares problems under certain constraints [7].
Consider the nonsingular linear system

Ax = f, (5.21)

where A is an n×n banded matrix with bandwidth β = 2m+1. For ease of illustration
we consider first the case in which n is even, e.g. n = 2s, and A is partitioned into
two equal block rows (each of s rows), as shown in Fig. 5.7. The linear system can
be represented as

Fig. 5.7 The Spike-balance


scheme for two block rows
k
B1
A1

A2

C2
106 5 Banded Linear Systems
⎛ ⎞
 x1 
A1 B1 ⎝ ξ ⎠ = f1 , (5.22)
C 2 A2 f2
x2

where A1 and A2 are of order (s × (s − m)), and B1 , C2 are of order (s × 2m).


Denoting that portion of the solution vector which represents the unknowns com-
mon to both blocks by ξ , the above system gives rise to the following two underde-
termined systems,

x1
(A1 , B1 ) = f1, (5.23)
ξ

ξ̃
(C2 , A2 ) = f2 . (5.24)
x2

While these two underdetermined systems can be solved independently, they will
yield the solution of the global system (5.22), only if ξ = ξ̃ . Here, the vectors x1 , x2
are of order (s − m), and ξ , ξ̃ each of order 2m.
Let the matrices of the underdetermined systems be denoted as follows:

E 1 = (A1 , B1 ),
E 2 = (C2 , A2 ),

then the general form of the solution of the two underdetermined systems (5.23) and
(5.24) is given by
vi = pi + Z i yi , i = 1, 2,

where pi is a particular solution, Z i is a basis for N (E i ), and yi is an arbitrary


vector. Note the pi is of order s + m and Z i is of order (s + m) × m. Thus,
  
x1 p1,1 Z 1,1
= + y1 , (5.25)
ξ p1,2 Z 1,2
  
ξ̃ p2,1 Z 2,1
= + y2 , (5.26)
x2 p2,2 Z 2,2

where y1 and y2 are of order m.


Clearly ξ and ξ̃ are functions of y1 and y2 , respectively. Since the underdeter-
mined linear systems (5.23) and (5.24) are consistent under the assumption that the
submatrices E 1 and E 2 are of full row rank, enforcing the condition:

ξ(y1 ) = ξ̃ (y2 ),

i.e.
p1,2 + Z 1,2 y1 = p2,1 + Z 2,1 y2 .
5.3 The Spike-Balance Scheme 107

assures us of obtaining the global solution of (5.21) and yields the balance linear
system
My = g

of order 2m where M = (Z 1,2 , −Z 2,1 ), g = p2,1 − p1,2 , and whose solution


y = (y1 , y2 ) , satisfies the above condition.
Thus, the Spike-balance scheme for the two-block case consists of the following
steps:
1. Obtain the orthogonal factorization of E i , i = 1, 2 to determine Z i , and to allow
computing the particular solution pi = E i (E i E i )−1 f i in a numerically stable
way, thus solving the underdetermined systems (5.23) and (5.24).
2. Solve the balance system M y = g.
3. Back-substitute y in (5.25) and (5.26) to determine x1 , x2 , and ξ .
The first step requires the solution of underdetermined systems of the form

Bz = d,

where B is an (s × (s + m)) matrix with full row-rank. The general solution is given
by
z = B + d + PN (B) u,

where B + = B  (B B  )−1 is the generalized inverse of B, and PN (B) is the or-


thogonal projector onto N (B). Furthermore, the projector PN (B) can be expressed
as,
PN (B) = I − B + B.

The general solution z can be obtained using a variety of techniques, e.g., see
[21, 22]. Using sparse orthogonal factorization, e.g. see [23], we obtain the decom-
position, 
 R
B = (Q, Z ) ,
0

where R is an s ×s triangular matrix and Z is an (s +m)×m) matrix with orthonormal


columns. Since Z is a basis for N (B), the general solution is given as

z = z̄ + Z y,
where z̄ = Q R −T d.

Before presenting the solution scheme of the balance system M y = g, we show


first that it is nonsingular. Let,

Ri
E i = (Q i , Z i ) i = 1, 2,
0
108 5 Banded Linear Systems

where Ri is upper triangular, nonsingular of order s = n/2. Partitioning Q 1 and Q 2


as  
Q 11 Q 21
Q1 = , Q2 = ,
Q 12 Q 22

where Q 12 , Q 21 are of order 2m × s, and Q 11 , Q 22 are of order (s − m) × s, and


similarly for Z 1 and Z 2 ,
 
Z 11 Z 21
Z1 = , Z2 =
Z 12 Z 22

in which Z 12 , Z 21 are of order (2m × m), and Z 11 , Z 22 are of order (s − m) × m.


Now, consider  −T   
R1 0 Q 11 Q 12 0
à = A= ,
0 R2−T 0 Q 
21 Q 22

and 
 Im Q
12 Q 21
à à =  .
Q 21 Q 12 Im

Let, H = Q  
12 Q 21 . Then, since ( Ã Ã ) is symmetric positive definite,

λ( Ã Ã ) = 1 ± η > 0

in which λ(H  H ) = η2 . Consequently, −1 < η < 1. Consider next the structure of


à Ã, ⎛ ⎞
Q 11 Q  
11 Q 11 Q 12 0
⎜ ⎟
⎜  + Q Q Q Q ⎟ .
à à = ⎜ Q 12 Q  Q 12 Q 21 21 22 ⎟
⎝ 11 12 21 ⎠
0 Q 22 Q 
21 Q 22 Q 
22

Since à à is positive definite, then the diagonal block N = (Q 12 Q  


12 + Q 21 Q 21 )
is also positive definite. Further, since

Q i Q i = I − Z i Z i i = 1, 2,

we can easily verify that

 + Z Z  ),
N = 2I − (Z 12 Z 12 21 21
= 2I − M M . 
5.3 The Spike-Balance Scheme 109

But, the eigenvalues of N lie within the spectrum of à Ã, or à à , i.e.,

1 − ρ ≤ 2 − λ(MM ) ≤ 1 + ρ

in which ρ 2 < 1 is spectral radius of (H  H ). Hence,

0 < 1 − ρ ≤ λ(MM ) ≤ 1 + ρ.

In other words, M is nonsingular. Next, we consider the general case of Spike-balance


scheme with p partitions, i.e. n = s × p (e.g., see Fig. 5.8),
⎛ ⎞
⎛ ⎞ x1 ⎛ ⎞
A1 B1 ⎜ ξ1 ⎟ f1
⎜ C2 A2 B2 ⎟⎜ ⎟ ⎜ f2⎟
⎜ ⎟ ⎜ x2 ⎟ ⎜ ⎟
⎜ C A B ⎟ ⎜ ⎟ ⎜ f3⎟
⎜ 3 3 3 ⎟ ⎜ ξ2 ⎟ ⎜ ⎟
⎜ . .
.. .. ⎟ ⎜ ⎟=⎜ ..⎟ . (5.27)
⎜ ⎟ ⎜ .. ⎟ ⎜ .⎟
⎜ ⎟⎜ . ⎟ ⎜ ⎟
⎝ C p−1 A p−1 B p−1 ⎠⎜ ⎟ ⎝ f p−1 ⎠
⎝ ξ p−1 ⎠
Cp Ap fp
xp

With p equal block rows, the columns of A are divided into 2 p − 1 column blocks,
where the unknowns ξi , i = 1, . . . p − 1, are common to consecutive blocks of the

B
A 1
1

C
2 B
A 2
2

c
3 B
A 3
3

C B
p –1 p–1
A
p –1

C
p A
p

Fig. 5.8 The Spike-balance scheme for p block rows


110 5 Banded Linear Systems

matrix. Then, A1 and A p are of order s × (s − m), and each Ai , i = 2, 3, . . . ., p − 1


is of order s × (s − 2m), and Bi , Ci are of order s × 2m. These p block rows give
rise to the following set of underdetermined systems:

x1
(A1 B1 ) , (5.28)
ξ1
⎛ ⎞
ξ̃i−1
(Ci Ai Bi ) ⎝ xi ⎠ = f i , i = 2, . . . , p − 1, (5.29)
ξi

ξ̃ p−1
(C p A p) = f p. (5.30)
xp

Denoting the coefficient matrices of the above underdetermined systems as follows:

E 1 = (A1 , B1 ),
E i = (Ci , Ai , Bi ), i = 2, . . . , p − 1,
E p = (C p , A p ),

the general solution of the ith underdetermined system is given by

vi = pi + Z i yi ,

where pi is the particular solution, Z i is a basis for N (E i ), and yi is an arbitrary


vector. The solutions of the systems (5.28) through (5.30) are then given by
  
x1 p1,1
Z 1,1
= y1 , + (5.31)
ξ1 p1,2
Z 1,2
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
ξ̃i−1 pi,1 Z i,1
⎝ xi ⎠ = ⎝ pi,2 ⎠ + ⎝ Z i,2 ⎠ yi , i = 2, . . . , p − 1, (5.32)
ξi pi,3 Z i,3
  
ξ̃ p−1 p p,1 Z p,1
= + yp. (5.33)
xp p p,2 Z p,2

Here, each ξ and ξ̃ is of order 2m, and each yi , i = 2, 3, . . . .., p − 1 is of the same
order 2m, with y1 , y p being of order m. Since the submatrices E i are of full row-
rank, the above underdetermined linear systems are consistent, and we can enforce
the following conditions to insure obtaining the unique solution of the global system:

ξi (yi ) = ξ̃i−1 (yi+1 ), i = 1, . . . , p − 1.


5.3 The Spike-Balance Scheme 111

This gives rise to the system of equations

p1,2 + Z 1,2 y1 = p2,1 + Z 2,1 y2 ,


pi,3 + Z i,3 yi = pi+1,1 + Z i+1,1 yi+1 , i = 2, . . . , p − 1,

which yields the balance system


M y = g.

Here, M is of order ν = 2m( p − 1),


⎛ ⎞
Z 1,2 −Z 2,1
⎜ Z 2,3 −Z 3,1 ⎟
⎜ ⎟
⎜ Z 3,3 −Z 4,1 ⎟
M =⎜ ⎟, (5.34)
⎜ .. .. ⎟
⎝ . . ⎠
Z p−1,3 −Z p,1

with y consisting of the subvectors yi , i = 1, 2, . . . , p, and the right-hand side g


given by ⎛ ⎞
p2,1 − p1,2
⎜ p3,1 − p2,3 ⎟
⎜ ⎟
g=⎜ .. ⎟.
⎝ . ⎠
p p,1 − p p−1,3

Thus, the banded linear system (5.27) may be solved by simultaneously computing
the solutions of the underdetermined systems (5.28)–(5.30), followed by solving
the balance system M y = g. Next, we summarize the Spike-balance scheme for p
blocks.
The Spike-Balance Scheme

1. Simultaneously solve the underdetermined systems (5.28)–(5.30) to obtain the


particular solution pi and the bases Z i , i = 1, . . . , p.
2. Solve the balance system M y = g.
3. Simultaneously back-substitute the subvectors yi to determine all the subvectors
xi and ξi .

Solving the underdetermined systems, in Step 1, via orthogonal factorization of the


matrices E i has been outlined above. Later, we will consider a solution strategy that
does not require the complete orthogonal factorization of the E i ’s and storing the
relevant parts of the matrices Z i which constitute the bases of N (E i ).
In Step 2, we can solve the balance system in a number of ways. The matrix M
can be computed explicitly, as outlined above, and the balance system is handled
via a direct or preconditioned iterative solver. When using a direct method, it is
advantageous to exploit the block-bidiagonal structure of the balance system. Iterative
methods may be used when M is sufficiently large and storage is at a premium.
112 5 Banded Linear Systems

In fact, for large linear systems (5.21), and a relatively large number of partitions p,
e.g. when solving (5.21) on a cluster with large number of nodes, it is preferable not
to form the resulting large balance matrix M. Instead, we use an iterative scheme in
which the major kernels are: (i) matrix–vector multiplication, and (ii) solving systems
involving the preconditioner. From the above derivation of the balance system, we
observe the following:

r (y) = g − M y = ξ̃ (y) − ξ(y),

and computing the matrix-vector product M y is given by

M y = r (0) − r (y),

where r (0) = g. In this case, however, one needs to solve the underdetermined
systems in each iteration.
Note that conditioning the matrix M is critical for the rapid convergence of any
iterative method used for solving the balance system. The following theorem provides
an estimate of the condition number of M, κ(M).
Theorem 5.1 ([7]) The coefficient matrix M of the balance system has a condition
number which is at most equal to that of the coefficient matrix A of the original
banded system (5.21), i.e.,
κ(M) ≤ κ(A).

In order to form the coefficient matrix M of the balance system, we need to obtain the
matrices Z i , i = 1, 2, . . . ., p. Depending on the size of the system and the parallel
architecture at hand, however, the storage requirements could be excessive. Below,
we outline an alternate approach that does not form the balance system explicitly.
A Projection-Based Spike-Balance Scheme
In this approach, the matrix M is not computed explicitly, rather, the balance system
is available only implicitly in the form of a matrix-vector product, in which the matrix
under consideration is given by MM . As a result, an iterative scheme such as the
conjugate gradient method (CG) can be used to solve the system MM ŵ = g instead
of the balance system in step 2 of the Spike-balance scheme.
This algorithm is best illustrated by considering the 2-partition case (5.23) and
(5.24),

x1
E1 = f1,
ξ

ξ̃
E2 = f2 ,
x2
5.3 The Spike-Balance Scheme 113

where the general solution for these underdetermined systems are given by

x1
= p1 + (I − P1 )u 1 , (5.35)
ξ

ξ̃
= p2 + (I − P2 )u 2 , (5.36)
x2

in which the particular solutions pi are expressed as

pi = E i (E i E i )−1 f i , i = 1, 2,

and the projectors Pi are given by

Pi = E i (E i E i )−1 E i , i = 1, 2. (5.37)

Imposing the condition


ξ = ξ̃ ,

we obtain the identical balance system, but expressed differently as



u1
(N1 , N2 ) =g
u2

in which,
g = (0, Iˆ) p1 − ( Iˆ, 0) p2 ,
N1 = −(0, Iˆ)(I − P1 ),
N2 = ( Iˆ, 0)(I − P2 ).

Here, Iˆ ≡ I2β , and I ≡ I( n +β ) .


2
Let, u 1 = N1 w and u 2 = N2 w to obtain the symmetric positive definite system

MM w = g, (5.38)

where   ˆ
0 I
MM = (0, Iˆ)(I − P1 ) ˆ + ( Iˆ, 0)(I − P2 ) . (5.39)
I 0

The system (5.38) is solved using the CG scheme with a preconditioning strategy. In
each CG iteration, the multiplication of MM by a vector requires the simultaneous
multiplications (or projections),

c j = (I − P j )b j , j = 1, 2
114 5 Banded Linear Systems

forming an outer level of parallelism, e.g., using two nodes. Such a projection cor-
responds to computing the residuals of the least squares problems,

min b j − E j c j 2 , j = 1, 2.
cj

These residuals, in turn, can be computed using inner CG iterations with each iteration
involving matrix-vector multiplication of the form,

h i = E i E i vi , i = 1, 2

forming an inner level of parallelism, e.g., taking advantage of the multicore archi-
tecture of each node. Once the solution w of the balance system in (5.38) is obtained,
the right-hand sides of (5.35) and (5.36) are updated as follows:
 
x1 0
= p1 − (I − P1 ) ,
ξ w

and  
ξ̂ w
= p2 + (I − P2 ) .
x2 0

For the general p-partition case, the matrix MM in (5.38) becomes of the form,


p
MM = Mi Mi , (5.40)
i=1

in which each Mi Mi is actually a section of the projector

I − Pi = I − E i+ E i . (5.41)

More specifically,
  
− Iˆ 0 0 − Iˆ 0 0
M̃i M̃i = (I − Pi ) .
0 0 Iˆ 0 0 Iˆ

Hence, it can be seen that MM can be expressed as the sum of sections of the
projectors (I − Pi ), i = 1, . . . , p. As outlined above, for the 2-partition case, such a
form of MM allows for the exploitation of multilevel parallelism when performing
matrix-vector product in the conjugate gradient algorithm for solving the balance
system. Once the solution of the balance system is obtained, the individual particular
solutions for the p partitions are updated simultaneously as outlined above for the
2-partition case.
5.3 The Spike-Balance Scheme 115

While the projection-based approach has the advantage of replacing the orthogonal
factorization of the block rows E i with projections onto the null spaces of E i , leading
to significant savings in storage and computation, we are now faced with a problem
of solving balance systems in the form of the normal equations. Hence, it is essential
to adopt a preconditioning strategy (e.g. using the block-diagonal of MM as a
preconditioner) to achieve a solution for the balance system in few CG iterations.

5.4 A Tearing-Based Banded Solver

5.4.1 Introduction

Here we consider solving wide-banded linear systems that can be expressed as over-
lapping diagonal blocks in which each is a block-tridiagonal matrix. Our approach
in this tearing-based scheme is different from the Spike algorithm variants discussed
above. This scheme was first outlined in [24], where the study was restricted to diag-
onally dominant symmetric positive definite systems . This was later generalized to
nonsymmetric linear systems without the requirement of diagonal dominance, e.g.
see [25]. Later, in Chap. 10, we extend this tearing scheme for the case when we strive
to obtain a central band preconditioner that encapsulates as many nonzero elements
as possible.
First, we introduce the algorithm by showing how it tears the original block-
tridiagonal system and extract a smaller balance system to solve. Note that the
extracted balance system here is not identical to that described in the Spike-balance
scheme. Second, we analyze the conditions that guarantee the nonsingularity of the
balance system. Further, we show that if the original system is symmetric positive
definite and diagonally dominant then the smaller balance system is also symmet-
ric positive definite as well. Third, we discuss preconditioned iterative methods for
solving the balance system.

5.4.2 Partitioning

Consider the linear system


Ax = f , (5.42)

where A ∈ Rn×n is nonsingular and x, f ∈ Rn . Let A = [ai j ] be a banded matrix of


bandwidth 2τ + 1, i.e. ai j = 0 for |i − j| ≥ τ  n. Hence we can cast our banded
linear system (5.42) in the block-tridiagonal form. For clarity of the presentation,
we illustrate the partitioning and tearing scheme using three overlapped partitions
( p = 3). Also, without loss of generality, we assume that all the partitions, or
overlapped block-diagonal matrices, are of equal size s, and that all the overlaps are
of identical size τ . Generalization to the case of p > 3 partitions of different sizes
is straightforward.
116 5 Banded Linear Systems

Let the banded matrix A and vectors x and f be partitioned as


⎛ ⎞ ⎛ ⎞ ⎛ ⎞
A11 A12 x1 f1
⎜ A21 A22 A23 ⎟ ⎜ x2 ⎟ ⎜ f2 ⎟
⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎜ A32 A33 A34 ⎟ ⎜ x3 ⎟ ⎜ f3 ⎟
⎜ ⎟ ⎜ ⎟ ⎜ ⎟

A=⎜ ⎟ ⎜ ⎟ ⎜ ⎟
A43 A44 A45 ⎟ , x = ⎜ x4 ⎟ , and f = ⎜ f 4 ⎟ .
⎜ A54 A55 A56 ⎟ ⎜ ⎟ ⎜ f5 ⎟
⎜ ⎟ ⎜ x5 ⎟ ⎜ ⎟
⎝ A65 A66 A67 ⎠ ⎝ x6 ⎠ ⎝ f6 ⎠
A76 A77 x7 f7
(5.43)
Here, Ai j , xi and f i for i, j = 1, . . . , 7 are blocks of appropriate size. Let the
partitions of A, delineated by lines in the illustration of A in (5.43), be of the form
⎛ ⎞
A(k) (k)
11 A12
⎜ (k) (k) (k) ⎟
Ak = ⎜ ⎟
⎝ A21 A22 A23 ⎠ , for k = 1, 2, 3. (5.44)
A(k) (k)
32 A33

(k)
The blocks Aμν = Aη+μ,η+ν for η = 2(k − 1) and μ, ν = 1, 2, 3, except for the
overlaps between partitions. The overlaps consist of the top left blocks of the last and
middle partitions, and bottom right blocks of the first and middle partitions. For these
(k−1) (k)
blocks the following equality holds A33 + A11 = Aη+1,η+1 . The exact choice of
(k−1) (k)
splitting A33 into A33 and A11 will be discussed below.
Thus, we can rewrite the original system as a set of smaller linear systems, k =
1, 2, 3,
⎛ ⎞⎛ ⎞
A(k) (k) ⎛x1(k) ⎞
11 A12 (1 − αk−1 ) f η+1 − yk−1
⎜ (k) (k) (k) ⎟ ⎜ (k) ⎟
⎜ A A A ⎟⎜x ⎟ = ⎝ f η+2 ⎠, (5.45)
⎝ 21 22 23 ⎠ ⎝ 2 ⎠
α f
k η+3 + y
A(k) (k)
x3(k)
k
32 A33

where α0 = 0, α3 = 1, y0 = y3 = 0, and yζ , αζ for ζ = 1, 2 are yet to be specified.


Now, we need to choose the scaling parameters 0 ≤ α1 , α2 ≤ 1 and determine the
adjustment vector y  = (y1 , y2 ).
Notice that the parameters αζ , can be used for example to give more weight to
a part of the original right hand-side corresponding to a particular partition. For
simplicity, however, we set all the αζ ’s to 0.5.
The adjustment vector y is used to modify the right hand-sides of the smaller
systems in (5.45), such that their solutions match at the portions corresponding to
the overlaps. This is equivalent to determining the adjustment vector y so that

(ζ ) (ζ +1)
x3 = x1 for ζ = 1, 2. (5.46)
5.4 A Tearing-Based Banded Solver 117

If we assume that the partition Ak in (5.44) is selected to be nonsingular with,


⎛ (k) (k) (k)

B11 B12 B13
⎜ (k) (k) (k) ⎟
A−1 ⎜ ⎟
k = ⎝ B21 B22 B23 ⎠ ,
(k) (k) (k)
B31 B32 B33

then, we from (5.45) we have


⎛ ⎞ ⎛ (k) (k) (k) ⎞ ⎛ ⎞
x1(k) B11 B12 B13 (1 − αk−1 ) f η+1 − yk−1
⎜ (k) ⎟ ⎜ ⎟
⎝ x2 ⎠ = ⎜
(k) (k) (k) ⎟ ⎝ ⎠,
⎝ B21 B22 B23 ⎠ f η+2 (5.47)
x3
(k)
(k) (k) (k) αk f η+3 + yk
B31 B32 B33

i.e.
 (k) (k) (k) (k)
x1 = B11 ((1 − αk−1 ) f η+1 − yk−1 ) + B12 f η+2 + B13 (αk f η+3 + yk ),
(k) (k) (k) (k)
x3 = B31 ((1 − αk−1 ) f η+1 − yk−1 ) + B32 f η+2 + B33 (αk f η+3 + yk ).
(5.48)
Using (5.46) and (5.48) we obtain

(ζ ) (ζ +1) (ζ ) (ζ +1)
(B33 + B11 )yζ = gζ + B31 yζ −1 + B13 yζ +1 (5.49)

for ζ = 1, 2, where
⎛ ⎞
f η−1
⎜ fη ⎟
(ζ ) (ζ ) (ζ +1) (ζ ) (ζ +1) (ζ +1) ⎜ ⎟
gζ = ((αζ −1 −1)B31 , −B32 , (1−αζ )B11 −αζ B33 , B12 , αζ +1 B13 )⎜
⎜ f η+1 ⎟
⎟ . (5.50)
⎝ f η+2 ⎠
f η+3

Finally, letting g  = (g1 , g2 ), the adjustment vector y can be found by solving the
balance system
M y = g, (5.51)

where
(1) (2) (2)
B33 + B11 −B13
M= (2) (2) (3) . (5.52)
−B31 B33 + B11

Once y is obtained,we can solve the linear systems in (5.45) independently for
each k = 1, 2, 3. Next we focus our attention on solving (5.51). First we note that the
matrix M is not available explicitly, thus using a direct method to solve the balance
system (5.52) is not possible and we need to resort to iterative schemes that utilize
M implicitly for performing matrix-vector multiplications of the form z = M ∗ v.
For example, one can use Krylov subspace methods, e.g. CG or BiCGstab for the
118 5 Banded Linear Systems

symmetric positive definite and nonsymmetric cases, respectively. However, to use


such solvers we must be able to compute the initial residual rinit = g − M yinit , and
matrix-vector products of the form q = M p. Further, we need to determine those
conditions that A and its partitions Ak must satisfy in order to yield a nonsingular
balance system.
Note that for an arbitrary number p of partitions, i.e. p overlapped block-diagonal
matrices, the balance system is block-tridiagonal of the form
⎛ (1) (2) (2) ⎞
B33 + B11 −B13
⎜ (2) (2) (3) (3) ⎟
⎜ −B31 B33 + B11 −B13 ⎟
⎜ ⎟
⎜ .. ⎟
⎜ . ⎟ . (5.53)
⎜ ⎟
⎜ ( p−2)
−B31
( p−2) ( p−1)
+ B11 −B13
( p−1) ⎟
⎝ B33 ⎠
( p−1) ( p−1) ( p)
−B31 B33 + B11

5.4.3 The Balance System

Since the tearing approach that enables parallel scalability depends critically on how
effectively we can solve the balance system via an iterative method, we explore
further the characteristics of its coefficient matrix M.
The Symmetric Positive Definite Case
First, we assume that A is symmetric positive definite (SPD) and investigate the
conditions under which the balance system is also SPD.
Theorem 5.2 ([25]) If the partitions Ak in (5.44) are SPD for k = 1, . . . , p then
the balance system in (5.51) is also SPD .
Proof Without loss of generality, let p = 3. Since each Ak is SPD then A−1
k is also
SPDṄext, let 
 I 0 0
Q = , (5.54)
0 0 −I

then from (5.47), we obtain


(k) (k)

B11 −B13
Q 
A−1
k Q = (k) (k)
, (5.55)
−B31 B33

which is also symmetric positive definite. But, since the balance system coefficient
matrix M can be written as the sum
(2) (2)

(1)
B33 0 B11 −B13 0 0
M = M1 + M2 + M3 = + (2) (2)
+ (3) , (5.56)
0 0 −B31 B33 0 B11
5.4 A Tearing-Based Banded Solver 119

then for any nonzero vector z, we have z  M z > 0.



n
If, in addition, A is also diagonally dominant (d.d.), i.e., |ai j | < |aii | for
j=1, j =i
i = 1, . . . , n, then we can find a splitting that results in each Ak that is also SPD
and d.d., which, in turn, guarantees that M is SPD but not necessarily diagonally
dominant [25].
Theorem 5.3 ([25]) If A in (5.42) is SPD and d.d., then the partitions Ak in (5.44)
can be chosen such that they inherit the same properties. Further, the coefficient
matrix M, of the resulting balance system in (5.51), is also SPD.
Proof Since A is d.d., we only need to obtain a splitting that ensures the diagonal
dominance of the overlapping parts. Let ei be the ith column of the identity, e =
[1, . . . , 1] , diag(C) and offdiag(C) denote the diagonal and off-diagonal elements
of the matrix C respectively. Now let the elements of the diagonal matrices Hζ(1) =
(ζ,1) (2) (ζ +1,2)
[h ii ] and Hζ +1 = [h ii ] of appropriate sizes, be given by

(ζ,1) (ζ ) 1 (ζ ) (ζ +1)
h ii = ei |A32 | e + ei | offdiag (A33 + A11 )| e, and (5.57)
2
(ζ +1,2) (ζ +1) 1 (ζ ) (ζ +1)
h ii = ei |A12 | e + ei | offdiag (A33 + A11 )| e, (5.58)
2
(ζ,1) (ζ +1,2)
respectively. Note that h ii and h ii are the sum of absolute values of all the
off-diagonal elements, with elements in the overlap being halved, in the ith row
to the left and right of the diagonal, respectively. Next, let the difference between
the positive diagonal elements and the sums of absolute values of all off-diagonal
elements in the same row be given by

(ζ ) (ζ +1) (1) (2)


Dζ = diag(A33 + A11 ) − Hζ − Hζ +1 . (5.59)

Now, if

(ζ ) 1 1 (ζ ) (ζ +1)
A33 = Hζ(1) + Dζ + offdiag (A33 + A11 ),
2 2
(ζ +1) (2) 1 1 (ζ ) (ζ +1)
A11 = Hζ +1 + Dζ + offdiag (A33 + A11 ), (5.60)
2 2
(ζ ) (ζ +1)
it is easy to verify that A33 + A11 = A2ζ +1,2ζ +1 and each Ak , for k = 1, . . . , p,
is SPDd.d. Consequently, if (5.43) is SPD and d.d., so are the partitions Ak and by
Theorem 5.2, the balance system is guaranteed to be SPD.
The Nonsymmetric Case
Next, if A is nonsymmetric, we explore under which conditions will the balance
system (5.51) become nonsingular.
120 5 Banded Linear Systems

Theorem 5.4 ([25]) Let the matrix A in (5.42) be nonsingular with partitions Ak ,
k = 1, 2, . . . , p, in (5.44) that are also nonsingular. Then the coefficient matrix M
of the balance system in (5.51), is nonsingular.

Proof Again, let p = 3 and write A as


⎛ ⎞ ⎛ ⎞
A1 0s−τ
A=⎝ 0s−2τ ⎠+⎝ A2 ⎠. (5.61)
A3 0s−τ

Next, let ⎛ ⎞ ⎛ ⎞
A1 Is−τ
AL = ⎝ Is−2τ ⎠ , AR = ⎝ A2 ⎠, (5.62)
A3 Is−τ

and consider the nonsingular matrix C given by

C = A−1 A A−1
⎛L R ⎞⎛ ⎞ ⎛ ⎞⎛ ⎞
Is Is−τ A−1 0s−τ
⎠ . (5.63)
1
= ⎝ 0s−2τ ⎠⎝ A−1
2
⎠+⎝ Is−2τ ⎠⎝ Is
Is Is−τ A−1
3
0s−τ

where Is and 0s are the identity and zero matrices of order s, respectively.
Using (5.61), (5.63) and (5.47), we obtain
⎛ (1)

⎛ ⎞ 0τ 0 B13
Iτ ⎜ (1) ⎟
⎜ ⎟ ⎜ 0 0s−2τ B23 ⎟
⎜ Is−2τ ⎟ ⎜ ⎟
⎜ (2) (2) (2) ⎟ ⎜
⎜ (1) ⎟

⎜ B11 B12 B13 ⎟ ⎜ 0 0 B33 ⎟
⎜ ⎟ ⎜ ⎟
C =⎜
⎜ 0 0s−2τ 0 ⎟+⎜
⎟ ⎜ I s−2τ ⎟,
⎜ ⎟ ⎜ ⎟
⎜ (2) (2) (2) ⎟ ⎜ (3) ⎟
⎜ B31 B32 B33 ⎟ ⎜ B 0 0 ⎟
⎝ ⎠ ⎜
11

B21 0s−2τ 0 ⎟
(3)
Is−2τ
Iτ ⎝ ⎠
(3)
B31 0 0τ
⎛ (1)

Iτ 0 B13
⎜ (1) ⎟
⎜ 0 Is−2τ B23 ⎟
⎜ ⎟
⎜ (1) (2) (2) (2) ⎟
⎜0 0 B +B B B ⎟
⎜ 33 11 12 13 ⎟
⎜ ⎟
=⎜ 0 Is−2τ 0 ⎟, (5.64)
⎜ ⎟
⎜ (2) (2) (2)
+
(3) ⎟
⎜ B B B B 0 0 ⎟
⎜ 31 32 33 11

⎜ (3) ⎟
⎝ B21 I s−2τ 0 ⎠
(3)
B31 0 Iτ

where the zero matrices denoted by 0 without subscripts, are considered to be of the
appropriate sizes. Using the orthogonal matrix P, of order 3s − 2τ , given by
5.4 A Tearing-Based Banded Solver 121
⎛ ⎞

⎜ Is−2τ 0 ⎟
⎜ ⎟
⎜ 0 Is−2τ 0 ⎟
⎜ ⎟
P =⎜
⎜ 0 0 Is−2τ ⎟,
⎟ (5.65)
⎜ 0 0 Iτ ⎟
⎜ ⎟
⎝ Iτ 0 ⎠
−Iτ

we obtain
⎛ (1)

Iτ B13
⎜ ⎟
⎜ Is−2τ B23
(1) ⎟
⎜ ⎟
⎜ Is−2τ ⎟
⎜ ⎟
⎜ (3)
−B21 ⎟
⎜ Is−2τ ⎟
P C P = ⎜ ⎟. (5.66)
⎜ (3) ⎟
⎜ Iτ B31 ⎟
⎜ ⎟
⎜ ⎟
⎜ B12
(2) (1)
B33 + B11
(2)
−B13
(2) ⎟
⎝ ⎠
(2) (2) (2) (3)
−B32 −B31 B33 + B11

Rewriting (5.66) as the block 2 × 2 matrix,


⎛ ⎞
I3s−4τ Z 1
P C P = ⎝ Z  M
⎠, (5.67)
2

and considering the eigenvalue problem


⎛ ⎞
I3s−4τ Z 1  
⎝ ⎠ u1 = λ u1 , (5.68)
Z 2 M u2 u2

we see that premultiplying the first block row of (5.68) by Z 2 and noticing that
Z 2 Z 1 = 0, we obtain
(1 − λ)Z 2 u 1 = 0. (5.69)

Hence, either λ = 1 or Z 2 u 1 = 0. If λ = 1, then the second block row of (5.68)


yields
Mu 2 = λu 2 . (5.70)

Thus, the eigenvalues of C are either 1 or identical to those of the balance system,
i.e.,
λ(C) = λ(P  C P) ⊆ {1, λ(M)}. (5.71)

Hence, since C is nonsingular, the balance system is also nonsingular.


122 5 Banded Linear Systems

Since the size of the coefficient matrix M of the balance system is much smaller
than n, the above theorem indicates that A L and A R are effective left and right
preconditioners of system (5.42).
Next, we explore those conditions which guarantee that there exists a splitting of
the coefficient matrix in (5.42) resulting in nonsingular partitions Ak , and provide a
scheme for computing such a splitting.
Theorem 5.5 ([25]) Assume that the matrix A in (5.42) is nonsymmetric with a
positive definite symmetric part H = 21 (A + A ). Also let

⎛ ⎞ ⎛ ⎞
0(s−τ )×τ 0(s−τ )×τ
⎜ ⎟ ⎜ ⎟
⎜ A(2) ⎟
T
⎜ (2)
A11 ⎟ ⎜ ⎟
⎜ ⎟ ⎜
11

⎜ (2) ⎟ ⎜ A(2)T ⎟
⎜ A21 ⎟ ⎜ ⎟
⎜ ⎟ ⎜ 12

⎜ (3) ⎟ ⎜ (3)T ⎟
⎜ A11 ⎟ ⎜ A11 ⎟
⎜ ⎟ ⎜ ⎟
⎜ (3) ⎟ ⎜ ⎟
⎜ A21 ⎟ ⎜ (3)T
⎟.
B1 = ⎜ ⎟ and B2 = ⎜
A12

⎜ ⎟ ⎜ ⎟
⎜ .. ⎟ ⎜ .. ⎟
⎜ . ⎟ ⎜ . ⎟
⎜ ⎟ ⎜ ⎟
⎜ ( p) ⎟ ⎜ ⎟
⎜ A11 ⎟ ⎜ ( p)T

⎜ ⎟ ⎜
A11

⎜ ( p) ⎟ ⎜ ⎟
⎜ A21 ⎟ ⎜ ( p)T ⎟
⎝ ⎠ ⎝ A12 ⎠
0τ 0τ
(5.72)
Assuming that partial symmetry holds, i.e., B1 = B2 = B, then there exists a splitting
such that the partitions Ak in (5.44) for k = 1, . . . , p are nonsingular.
Proof Let p = 3, and let à be the block-diagonal matrix in which the blocks are the
partitions Ak .
⎛ ⎞
A(1) (1)
11 A12
⎜ (1) (1) (1) ⎟
⎜ A21 A22 A23 ⎟
⎜ ⎟
⎜ (1) (1) ⎟
⎜ A32 A33 ⎟
⎜ ⎟
⎜ (2) (2) ⎟
⎜ A11 A12 ⎟
⎜ ⎟
⎜ (2) (2) (2) ⎟

à = ⎜ A A A ⎟. (5.73)
21 22 23 ⎟
⎜ (2) (2) ⎟
⎜ A32 A33 ⎟
⎜ ⎟
⎜ (3) (3) ⎟
⎜ A11 A12 ⎟
⎜ ⎟
⎜ (3) (3) (3) ⎟
⎜ A21 A22 A23 ⎟
⎝ ⎠
(3) (3)
A32 A33
5.4 A Tearing-Based Banded Solver 123

Let V  be the nonsingular matrix of order 3m given by

(5.74)

then,
⎛ ⎞
A(1) (1)
11 A12
⎜ ⎟
⎜ A(1) A(1) (1) ⎟
⎜ 21 22 A23 ⎟
⎜ ⎟
⎜ A(1) (1) (2)
A(2) A(2) ⎟
⎜ 32 A33 + A11 ⎟
⎜ 12 11

⎜ (2) (2) (2) (2) ⎟
⎜ A21 A22 A23 A21 ⎟
⎜ ⎟
⎜ (2) (2) (3) (3) (3) ⎟
J  Ã J = ⎜

A32 A33 + A11 A12 A11 ⎟ .

⎜ (3) ⎟

⎜ A(3)
21 A(3) (3)
22 A23 A21 ⎟⎟
⎜ (3) (3) ⎟
⎜ A32 A33 ⎟
⎜ ⎟
⎜ ⎟
⎜ (2) (2) (2) ⎟
⎜ A11 A12 A11 ⎟
⎝ ⎠
A(3)
11 A(3)
12
(3)
A11
(5.75)
Writing (5.75) as a block 2 × 2 matrix, we have
⎛ ⎞
A B1
J  Ã J = ⎝ B  K ⎠ . (5.76)
2

(k) (k−1)
Using the splitting A11 = 21 (Aη+1,η+1 +A η+1,η+1 )+β I and A33 = 21 (Aη+1,η+1 −
Aη+1,η+1 ) − β I , we can choose β so as to ensure that B1 and B2 are of full rank and
K is SPD Thus, using Theorem 3.4 on p. 17 of [26], we conclude that J  Ã J is of
full rank, hence à has full rank and consequently the partitions Ak are nonsingular.

Note that the partial symmetry assumption B1 = B2 in the theorem above is not as
restrictive as it seems. Recalling that the original matrix is banded, it is easy to see
that the matrices A(k) (k)
12 and A21 are almost completely zero except for small parts in
their respective corners, which are of size no larger than the overlap. This condition
then can be viewed as a requirement of symmetry surrounding the overlaps.
124 5 Banded Linear Systems

Let us now focus on two special cases. First, if the matrix A is SPD the conditions
of Theorem 5.5 are immediately satisfied and we obtain the following.
Corollary 5.1 If the matrix A in (5.42) is SPD then there is a splitting (as described
in Theorem 5.5) such that the partitions Ak in (5.44) for k = 1, . . . , p are nonsin-
gular and consequently the coefficient matrix M of the balance system in (5.51), is
nonsingular.
Second, note that Theorem 5.3 still holds even if the symmetry requirement is
dropped. Combining the results of Theorems 5.3 and 5.4, without any requirement
of symmetry, we obtain the following.
Corollary 5.2 If the matrix A in (5.42) is d.d., then the partitions Ak in (5.44)
can be chosen such that they are also nonsingular and d.d. for k = 1, . . . , p and
consequently the coefficient matrix M, of the balance system in (5.51), is nonsingular.

5.4.4 The Hybrid Solver of the Balance System

Next, we show how one can compute the residual rinit needed to start a Krylov
subspace scheme for solving the balance system. Rewriting (5.47) as
⎛ ⎞ ⎛ ⎞ ⎛ ⎞
(k) (k) (k)
x1 h1 ȳ1
⎜ (k) ⎟ ⎜ (k) ⎟ ⎜ (k) ⎟
⎜ x ⎟ = ⎜ h ⎟ + ⎜ ȳ ⎟ , (5.77)
⎝ 2 ⎠ ⎝ 2 ⎠ ⎝ 2 ⎠
(k) (k) (k)
x3 h3 ȳ3

where,
⎛ ⎞ ⎛ ⎞ ⎛ (k) ⎞
h (k) (1 − αk−1 ) f η+1 ȳ1 ⎛ ⎞
1
⎜ (k) ⎟ −yk−1
⎜ h ⎟ = A−1 ⎜ ⎟ ⎜ ⎟
⎠,⎜
(k) ⎟ −1 ⎝
⎝ 2 ⎠ k ⎝ f η+2 ⎝ ȳ2 ⎠ = Ak 0 ⎠, (5.78)
(k) αk f η+3 (k) yk
h3 ȳ3

the residual can be written as


⎛ ⎞ ⎛ ⎞ ⎛ ⎞
x1(2) − x3(1) h (2) (1)
1 − h3 ȳ1(2) − ȳ3(1)
⎜ (3) ⎟ ⎜ (3) ⎟ ⎜ (3) ⎟
⎜ x1 − x3(2) ⎟ ⎜ h 1 − h (2) ⎟ ⎜ ȳ1 − ȳ3(2) ⎟
⎜ ⎟ ⎜ 3 ⎟ ⎜ ⎟
r = g − My = ⎜ .. ⎟=⎜ .. ⎟+⎜ .. ⎟,
⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎝ . ⎠ ⎝ . ⎠ ⎝ . ⎠
( p) ( p−1) ( p) ( p−1) ( p) ( p−1)
x1 − x3 h1 − h3 ȳ1 − ȳ3
(5.79)

where the second equality in (5.79) follows from the combination of (5.46), (5.48),
(5.50) and (5.52). Let the initial guess be yinit = 0, then we have
5.4 A Tearing-Based Banded Solver 125
⎛ (2) (1) ⎞
h1 − h3
⎜ (3) ⎟
⎜ h 1 − h (2) ⎟
⎜ 3 ⎟
rinit =g=⎜ . ⎟. (5.80)
⎜ .. ⎟
⎝ ⎠
( p) ( p−1)
h1 − h3

Thus, to compute the initial residual we must solve the p independent linear systems
ζ
(5.45) and subtract the bottom part of the solution vector of partition ζ , h 3 , from the
(ζ +1)
top part of the solution vector of partition ζ + 1, h 1 , for ζ = 1, . . . , p − 1.
Finally, to compute matrix-vector products of the form q = M p, we use (5.79)
and (5.80), to obtain
⎛ (1) (2) ⎞
ȳ3 − ȳ1
⎜ (2) ⎟
⎜ ȳ3 − ȳ1(3) ⎟
⎜ ⎟
M y = g − r = rinit − r = ⎜ .. ⎟. (5.81)
⎜ ⎟
⎝ . ⎠
( p−1) ( p)
ȳ3 − ȳ1

Hence, we can compute the matrix-vector products M p for any vector p in a fashion
similar to computing the initial residual using (5.81) and (5.78). The modified Krylov
subspace methods (e.g. CG, or BiCGstab) used to solve (5.51) are the standard ones
except that the initial residual and the matrix-vector products are computed using
(5.80) and (5.81), respectively. We call these solvers Domain-Decomposition-CG
(DDCG) and Domain-Decomposition-BiCGstab (DDBiCGstab). They are schemes
in which the solutions of the smaller independent linear systems in (5.45) are obtained
via a direct solver, while the adjustment vector y is obtained using an iterative method,
CG for SPD and BiCGStab for nonsymmetric linear systems. The outline of the
DDCG algorithm is shown in Algorithm 5.2. The usual steps of CG are omitted, but
the two modified steps are shown in detail. The outline of the DDBiCGstab scheme
is similar and hence ommitted.
The balance system is preconditioned using a block-diagonal matrix of the form,
⎛ (1) (2) ⎞
B̃33 + B̃11
⎜ .. ⎟
M̃ = ⎝ . ⎠, (5.82)
( p−1) ( p)
B̃33 + B̃11

(ζ ) (ζ )−1 (ζ ) (ζ +1) (ζ +1)−1 (ζ +1)


where B̃33 = A33 ≈ B33 and B̃11 = A11 ≈ B11 . Here, we are taking
advantage of the fact that the elements of the inverse of the banded balance system
decay as we move away from the diagonal, e.g., see [20]. Also such decay becomes
more pronounced as the degree of diagonal dominance of the banded balance system
becomes more pronounced. Using the Woodbury formula [22], we can write
126 5 Banded Linear Systems

Algorithm 5.2 Domain Decomposition Conjugate Gradient (DDCG)


Input: Banded SPD matrix A and right-hand side f
Output: Solution x of Ax = f
1: Tear the coefficient matrix A into partitions Ak for k = 1, . . . , p.
2: Distribute the partitions across p processors.
3: Perform the Cholesky factorization of partition Ak on processor k.
 = (y  , . . . , y 
4: Distribute the vector yinit init1 init p−1 ) across p − 1 processors, so that processor ζ
contains yinitζ for ζ = 1, . . . , p − 1 and the last processor is idle. All vectors in the modified
iterative method are distributed similarly.
//Perform the modified Conjugate Gradient:
5: Compute the initial residual rinit = g using (5.80), in other words, on processor ζ , we compute
(ζ +1) (ζ ) (ζ ) (ζ )
gζ = h̄ 1 − h̄ 3 , where h̄ 3 and h̄ 1 are computed on processor ζ by solving the first system
in (5.78) directly.
6: do i = 1, ..., until convergence
7: Standard CG steps
8: Compute the matrix-vector multiplication q = M p using (5.81), in other words, on processor
(ζ ) (ζ +1) (ζ ) (ζ )
ζ we compute qζ = ȳ3 − ȳ1 , where ȳ3 and ȳ1 are computed on processor ζ by
solving the second system in (5.78) directly.
9: Standard CG steps
10: end
11: Solve the smaller independent linear systems in (5.45) directly in parallel.

(ζ ) (ζ +1) −1 (ζ )−1 (ζ +1)−1 −1 (ζ ) (ζ ) (ζ +1) −1 (ζ +1)


( B̃33 + B̃11 ) = (A33 + A11 ) = A33 (A33 + A11 ) A11 ,
(ζ ) (ζ ) (ζ ) (ζ +1) −1 (ζ )
= A33 − A33 (A33 + A11 ) A33 ,
(5.83)
The last equality is preferred as it avoids extra internode communication, assuming
(ζ ) (ζ +1)
that the original overlapping block A33 + A11 is stored separately on node ζ .
Consequently, to precondition (5.51) we only need to perform matrix-vector products
(ζ ) (ζ )
involving A33 , and solve small linear systems with coefficient matrices (A33 +
(ζ +1)
A11 ).

5.5 Tridiagonal Systems

So far, we discussed algorithms for general banded systems; this section is devoted to
tridiagonal systems because their simple structure makes possible the use of special
purpose algorithms. Moreover, many applications and algorithms contain a tridiag-
onal system solver as a kernel, used directly or indirectly.
The parallel solution of linear systems of equations, Ax = f , with coefficient
matrix A that is point (rather than block) tridiagonal,
5.5 Tridiagonal Systems 127
⎛ ⎞
α1,1 α1,2
⎜α2,1 α2,2 α2,3 ⎟
⎜ ⎟
⎜ .. .. .. ⎟
A=⎜
⎜ . . . ⎟
⎟ (5.84)
⎜ . .. ⎟
⎝ .. . αn−1,n ⎠
αn,n−1 αn,n

and abbreviated as [αi,i−1 , αi,i , αi,i+1 ] when the dimension is known from the con-
text, has been the subject of many studies. Even though the methods can be extended
to handle multiple right-hand sides, here we focus on the case of only one, that we
denote by f . Because of their importance in applications, tridiagonal solvers have
been developed for practically every type of parallel computer system to date. The
classic monographs [27–29] discuss the topic extensively. The activity is contin-
uing, with refinements to algorithms and implementations in libraries for parallel
computer systems such as ScaLAPACK, and with implementations on different
computer models and novel architectures; see e.g. [30–32].
In some cases, the matrix is not quite tridiagonal but it differs from one only by
a low rank modification. It is then possible to express the solution (e.g. by means
of the Sherman-Morrison-Woodbury formula [22]) in a manner that involves the
solution of tridiagonal systems, possibly with multiple right-hand sides. The latter is
a special case of a class of problems that requires the solution of multiple systems,
all with the same or with different tridiagonal matrices. This provides the algorithm
designer with more opportunities for parallelism. For very large matrices, it might
be preferable to use a parallel algorithm for a single or just a few right-hand sides
so as to be able to maintain all required coefficients in fast memory; see for instance
remarks in [33, 34]. Sometimes the matrices have special properties such as Toeplitz
structure, diagonal dominance, or symmetric positive definiteness. Algorithms that
take advantage of these properties are more effective and sometimes safer in terms of
their roundoff error behavior. We also note that because the cost of standard Gaussian
elimination for tridiagonal systems is already linear, we do not expect impressive
speedups. In particular, observe that in a general tridiagonal system of order n, each
solution element, ξi , depends on the values of all inputs, that is 4n − 2 elements.
This is also evident by the fact that A−1 is dense (though data sparse, which could
be helpful in designing inexact solvers). Therefore, with unbounded parallelism, we
need O(log n) steps to compute each ξi . Without restricting the number of processors,
the best possible speedup of a tridiagonal solver is expected to be O( logn n ); see also
[35] for this fan-in based argument.
The best known and earliest parallel algorithms for tridiagonal systems are recur-
sive doubling, cyclic reduction and parallel cyclic reduction. We first review these
methods as well as some variants. We then discuss hybrid and divide-and-conquer
strategies that are more flexible as they offer hierarchical parallelism and ready adap-
tation for limited numbers of processors.
128 5 Banded Linear Systems

5.5.1 Solving by Marching

We first describe a very fast parallel but potentially unstable algorithm described in
[12, Algorithm III]. This was inspired by the shooting methods [36–38] and by the
marching methods described in [39, 40] and discussed also in Sect. 6.4.7 of Chap. 6
It is worth noting that the term “marching” has been used since the early days of
numerical methods; cf. [41]. In fact, a definition can be found in the same book by
Richardson [42, Chap. I, p. 2] that also contains the vision of the “human parallel
computer” that we discussed in the Preface.
In the sequel we will assume that the matrix is irreducible, thus all elements in
the super- and subdiagonal are nonzero; see also Definition 9.1 in Chap. 9. Otherwise
the matrix can be brought into block-upper triangular (block diagonal, if symmetric)
form with tridiagonal submatrices along the diagonal, possibly after row and column
permutations and solved using block back substitution, where the major computation
that has the leading cost at each step is the solution of a smaller tridiagonal system.
In practice, we assume that there exists a preprocessing stage which detects such
elements and if such are discovered, the system is reduced to smaller ones.
The key observation is the following: If we know the last element, ξn , of the solu-
tion x = (ξi )1:n , then the remaining elements can be computed from the recurrence

1
ξn−k−1 = (φn−k − αn−k,n−k ξn−k − αn−k,n−k+1 ξn−k+1 ). (5.85)
αn−k,n−k−1

for k = 0, . . . , n − 2, assuming that αn,n+1 = 0. Note that because of the assumption


of irreducibility, the denominator above is always nonzero. To compute ξn , consider
the reordered system in which the first equation is ordered last. Then, assuming that
n ≥ 3 the system becomes
  
R̃ b x̂ g
=
a 0 ξn φ1

where for simplicity we denote R̃ = A2:n,1:n−1 that is non-singular and banded upper
triangular with bandwidth 3, b = A2:n,n , a  = A1,1:n−1 , x̂ = x1:n−1 , and g = f 2:n .
Applying block LU factorization
  
R̃ b I 0 R̃ b
=  −1
a 0 a R̃ 1 0 −a  R̃ −1 b

it follows that

ξn = −(a  R̃ −1 b)−1 (φ1 − a  R̃ −1 g),


5.5 Tridiagonal Systems 129

and

x̂ = R̃ −1 g − ξn R̃ −1 b.

Observe that because A is tridiagonal,

a  = (α1,1 , α1,2 , 0, . . . , 0), b = (0, . . . , 0, αn−1,n , αn,n ) .

Taking into account that certain terms in the formulas for ξn and x̂ are repeated, the
leading cost is due to the solution of a linear system with coefficient matrix R̃ and
two right-hand sides ( f 2:n , A2:n,n ). From these results, ξ1 and x2:n can be obtained
with few parallel operations. If the algorithm of Theorem 3.2 is extended to solve
non-unit triangular systems with two right-hand sides, the overall parallel cost is
T p = 3 log n + O(1) operations on p = 4n + O(1) processors. The total number
of operations is O p = 11n + O(1), which is only slightly larger than Gaussian
elimination, and the efficiency E p = 12 11 log n .

Remark 5.2 The above methodology can also be applied to more general banded
matrices. In particular, any matrix of bandwidth (2m +1) (it is assumed that the upper
and lower half-bandwidths are equal) of order n can be transformed by reordering
the rows, into

R B
(5.86)
C 0

where R is of order n −m and upper triangular with bandwidth 2m +1, and the corner
zero matrix is of order m. This property has been used in the design of other banded
solvers, see e.g. [43]. As in Sect. “Notation”, let J = e1 e2 + · · · + en−1 en , and set
S = J + en e1 be the “circular shift matrix”. Then S m A is as (5.86). Indeed, this is
straightforward to show by probing its elements, e.g. ei S m Ae j = 0 if j < i < n −m
or if i + 2m < j < n − m. Therefore, instead of solving Ax = b, we examine the
equivalent system S m Ax = S m b and then use block LU factorization. We then solve
by exploiting the banded upper triangular structure of R and utilizing the parallel
algorithms described in Chap. 3.
The marching algorithm has the smallest parallel cost among the tridiagonal
solvers described in this section. On the other hand, it has been noted in the lit-
erature that marching methods are prone to considerable error growth that render
them unstable unless special precautions are taken. An analysis of this issue when
solving block-tridiagonal systems that arise from elliptic problems was conducted
in [40, 44]; cf. Sect. 6.4.7 in Chap. 6. We propose an explanation of the source of
instability that is applicable to the tridiagonal case. Specifically, the main kernel of
the algorithm is the solution of two banded triangular systems with the same coeffi-
cient matrix. The speed of the parallel marching algorithm is due to the fact that these
systems are solved using a parallel algorithm, such as those described in Chap. 3.
As is well known, serial substitution algorithms for solving triangular systems are
130 5 Banded Linear Systems

extremely stable and the actual forward error is frequently much smaller than what
a normwise or componentwise analysis would predict; cf. [45] and our discussion
in Sect. 3.2.3. Here, however, we are interested in the use of parallel algorithms
for solving the triangular systems. Unfortunately, normwise and componentwise
analyses show that their forward error bounds, in the general case, depend on the
cube of the condition number of the triangular matrix; cf. [46]. As was shown in
[47], the 2-norm condition number, κ2 (An ), of order-n triangular matrices An whose
nonzero entries are independent and normally√ distributed with mean 0 and vari-
ance 1 grows exponentially, in particular n κ2 (An ) → 2 almost surely. Therefore,
the condition is much worse than that for random dense matrices, where it grows
linearly. There is therefore the possibility of exponentially fast increasing condi-
tion number compounded by the error’s cubic dependence on it. Even though we
suspect that banded random matrices are better behaved, our experimental results
indicate that the increase in condition number is still rapid for triangular matrices
such as R̃ that have bandwidth 3.
It is also worth noting that even in the serial application of marching, in which
case the triangular systems are solved by stable back substitution, we cannot assume
any special conditions for the triangular matrix except that it is banded. In fact, in
experiments that we conducted with random matrices, the actual forward error also
increases rapidly as the dimension of the system grows. Therefore, when the problem
is large, the parallel marching algorithm is very likely to suffer from severe loss of
accuracy.

5.5.2 Cyclic Reduction and Parallel Cyclic Reduction

Cyclic reduction (cr) relies on the combination of groups of three equations, each
consisting of an even indexed one, say 2i, together with its immediate neighbors,
indexed (2i ± 1), and involve five unknowns. The equations are combined in order to
eliminate odd-indexed unknowns and produce one equation with three unknowns per
group. Thus, a reduced system with approximately half the unknowns is generated.
Assuming that the system size is n = 2k −1, the process is repeated for k steps, until a
single equation involving one unknown remains. After this is solved, 2, 22 , . . . , 2k−1
unknowns are computed and the system is fully solved. Consider, for instance, three
adjacent equations from (5.84).
⎛ ⎞
⎛ ⎞ ξi−2 ⎛ ⎞
αi−1,i−2 αi−1,i−1 αi−1,i ⎜ξi−1 ⎟ φi−1
⎜ ⎟
⎝ αi,i−1 αi,i αi,i+1 ⎠ ⎜ ξi ⎟ = ⎝ φi ⎠
⎜ ⎟
αi+1,i αi+1,i+1 αi+1,i+2 ⎝ξi+1 ⎠ φi+1
ξi+2
5.5 Tridiagonal Systems 131

under the assumption that unknowns indexed n or above and 0 or below are zero. If
both sides are multiplied by the row vector
 α αi,i+1

− αi−1,i−1
i,i−1
, 1, − αi+1,i+1

the result is the following equation:


αi,i−1 αi,i−1 αi,i+1
− αi−1,i−2 ξi−2 + (αi,i − αi−1,i − αi+1,i )ξi
αi−1,i−1 αi−1,i−1 αi+1,i+1
αi,i+1 αi,i−1 αi,i+1
− αi+1,i+2 ξi+2 = − φi−1 + φi − φi+1
αi+1,i+1 αi−1,i−1 αi+1,i+1

This involves only the unknowns ξi−2 , ξi , ξi+2 . Note that 12 floating-point operations
suffice to implement this transformation.
To simplify the description, unless mentioned otherwise we assume that cr is
applied on a system of size n = 2k − 1 for some k and that all the steps of the algo-
rithm can be implemented without encountering division by 0. These transformations
can be applied independently for i = 2, 2 × 2, . . . , 2 × (2k−1 − 1) to obtain a tridi-
agonal system that involves only the (even numbered) unknowns ξ2 , ξ4 , . . . , ξ2k−1 −2
and is of size 2k−1 −1 which is almost half the size of the previous one. If one were to
compute these unknowns by solving this smaller tridiagonal system then the remain-
ing ones can be recovered using substitution. Cyclic reduction proceeds recursively
by applying the same transformation to the smaller tridiagonal system until a single
scalar equation remains and the middle unknown, ξ2k−1 is readily obtained. From
then on, using substitutions, the remaining unknowns are computed.
The seminal paper [48] introduced odd-even reduction for block-tridiagonal sys-
tems with the special structure resulting from discretizing Poisson’s equation; we
discuss this in detail in Sect. 6.4. That paper also presented an algorithm named
“recursive cyclic reduction” for the multiple “point” tridiagonal systems that arise
when solving Poisson’s equation in 2 (or more) dimensions.
An key observation is that cr can be interpreted as Gaussian elimination with
diagonal pivoting applied on a system that is obtained from the original one after
renumbering the unknowns and equations so that those that are odd multiples of 20
are ordered first, then followed by the odd multiples of 21 , the odd multiples of 22 ,
etc.; cf. [49]. This equivalence is useful because it reveals that in order for cr to be
numerically reliable, it must be applied to matrices for which Gaussian elimination
without pivoting is applicable. See also [50–52].
The effect of the above reordering on the binary representation of the indices of the
unknown and right-hand side vectors, is an unshuffle permutation of the bit-permute-
complement (BPC) class; cf. [53, 54]. If Q denotes the matrix that implements the
unshuffle, then

Do B
Q  AQ = ,
C  De
132 5 Banded Linear Systems

where the diagonal blocks are diagonal matrices, specifically

Do = diag(α1,1 , α3,3 , . . . , α2k −1,2k −1 ), De = diag(α2,2 , α4,4 , . . . , α2k −2,2k −2 )

and the off-diagonal blocks are the rectangular matrices


⎛ ⎞ ⎛ ⎞
α1,2 α2,1 α2,3
⎜ ⎟
⎜α3,2 . . . ⎟  ⎜
⎜ α4,3 α4,5 ⎟

B=⎜

⎟,C
⎟ =⎜ .. .. ⎟.
⎝ .. ⎝ . . ⎠
. α2k −2,2k −2 ⎠
α2k −1,2k −2 α2k −2,2k −3 α2k −2,2k −1

Note that B and C are of size 2k−1 × (2k−1 − 1). Then we can write
 
 I2k−1 0 Do B
Q AQ = .
C  Do−1 I2k−1 −1 0 De − C  Do−1 B

Therefore the system can be solved by computing the subvectors containing the odd
and even numbered unknowns x (o) , x (e)

(De − C  Do−1 B)x (e) = b(e) − C  Do−1 f (0)


Do x (o) = f (o) − Bx (e) ,

where f (o) , f (e) are the subvectors containing the odd and even numbered elements
of the right-hand side. The crucial observation is that the Schur complement De −
C  Do−1 B is of almost half the size, 2k−1 − 1, and tridiagonal, because the term
C  Do−1 B is tridiagonal and De diagonal. The tridiagonal structure of C  Do−1 B is
due to the fact that C  and B are upper and lower bidiagonal respectively (albeit not
square). It also follows that cr is equivalent to Gaussian elimination with diagonal
pivoting on a reordered matrix. Specifically, the unknowns are eliminated in the
nested dissection order (cf. [55]) that is obtained by the repeated application of the
above scheme: First the odd-numbered unknowns (indexed by 2k − 1), followed by
those indexed 2(2k − 1), followed by those indexed 22 (2k − 1), and so on. The cost
to form the Schur complement tridiagonal matrix and corresponding right-hand side
is 12 operations per equation and so the cost for the first step is 12 × (2k−1 − 1)
operations. Applying the same method recursively on the tridiagonal system with
the Schur complement matrix for the odd unknowns one obtains after k − 1 steps
1 scalar equation for ξ2k−1 that is the unknown in middle position. From then on,
the remaining unknowns can be recovered using repeated applications of the second
equation above. The cost is 5 operations per computed unknown, independent for
each. On p = O(n) processors, the cost is approximately T p = 17 log n parallel
operations and O p = 17n − 12 log n in total, that is about twice the number of
operations needed by Gaussian elimination with diagonal pivoting. cr is listed as
Algorithm 5.3. The cr algorithm is a component in many other parallel tridiagonal
5.5 Tridiagonal Systems 133

Algorithm 5.3 cr: cyclic reduction tridiagonal solver


Input: A = [αi,i−1 , αi,i , αi,i+1 ] ∈ Rn×n and f ∈ Rn where n = 2k − 1.
Output: Solution x of Ax = f .
(0) (0) (0)
//It is assumed that αi,i−1 = αi,i−1 , αi,i = αi,i , αi,i+1 = αi,i+1 for i = 1 : n //Reduction
stage
1: do l = 1 : k − 1
2: doall i = 2l : 2l : n − 2l
(l−1) (l−1)
3: ρi = −αi,i−2l−1 /αi−2(l−1) , τi = −αi,i+2l−1 /αi+2(l−1)
(l) (l−1)
4: αi,i−2l = ρi αi,i−2l−1
(l) (l−1)
5: αi,i+2l = τi αi,i+2l−1
(l) (l−1) (l−1) (l−1)
6: αi,i = αi,i + ρi αi,i−2l−1 + τi αi,i+2l−1
(l) (l−1) (l−1) (l−1)
7: φi = φi + ρi φi−2l−1 + τi φi+2l−1
8: end
9: end
//Back substitution stage
10: do l = k : −1 : 1
11: doall i = 2l−1 : 2l−1 : n − 2l−1
12: ξi = (φi(l−1) − αi,i−2l−1 ξi−2l−1 − αi , i + 2l−1 ξi+2l−1 )/φi(l−1)
13: end
14: end

system solvers, e.g. [56–58]. A block extension of cr culminated in an important


class of RES algorithms described in Sect. 6.4.5. The numerical stability of cr is
analyzed in [59]. The complexity and stability results are summarized in the following
proposition.
Proposition 5.1 The cr method for solving Ax = f when A ∈ Rn×n is tridiagonal
on O(n) processors requires no more than T p = 17 log n + O(1) steps for a total
of O p = 17n + O(log n) operations. Moreover, if A is diagonally dominant by rows
or columns, then the computed solution satisfies

(A + δ A)x̃ = f, δ A ∞ ≤ 10(log n) A ∞ u

and the relative forward error satisfies the bound

x̃ − x ∞
≤ 10(log n)κ∞ (A)u
x̃ ∞

where κ(A) = A ∞ A−1 ∞ .


Note that even though the upper bound on the backward error depends on log n, its
effect was not seen in the experiments conducted in [59].
Since CR is equivalent to Guassian elimination with diagonal pivoting on a
recorded matrix, it can be used for tridiagonal matrices for which the latter method
is applicable These include symmetric positive definite matrices, M-matrices, to-
tally nonnegative matrices, and matrices that can be written as D1 AD2 where
134 5 Banded Linear Systems

|D1 | = |D2 | = I and A is of the aforementioned types. In those cases, it can


be proved that Gaussian elimination with diagonal pivoting to solve Ax = f suc-
ceeds and that the computed solution satisfies (A + δ A)x̂ = f where |δ A| ≤ 4u|A|
ignoring second order terms. For diagonally dominant matrices (by rows or columns)
the bound is multiplied by a small constant. See the relevant section in the treatise
[45, Sect. 9.6] and the relevant paper [60] for a more detailed discussion.
One cause of inefficiency in this algorithm are memory bank conflicts that can
arise in the course of the reduction and back substitution stages. We do not address
this issue here but point to detailed discussions in [34, 61]. Another inefficiency,
is that as the reduction proceeds as well as early in the back substitution stage, the
number of equations that can be processed independently is smaller than the number
of processors. At the last step of the reduction, for example, there is only one equation.
It has been observed that this inefficiency emerges as an important bottleneck after
bank conflicts are removed; see e.g. the discussion in [61] regarding cr performance
on GPUs.
To address this issue, another algorithm, termed paracr, has been designed (cf.
[27]) to maintain the parallelism throughout at the expense of some extra arithmetic
redundancy. Three equations are combined at a time, as in cr, however this com-
bination is applied to subsequent equations, say i − 1, i, i + 1, irrespective of the
parity of i, rather than only even ones. The eliminations that take place double the
distance of the coupling between unknowns: we assume this time, for simplicity,
that there are n = 2k equations, the distance is 1 at the first step (every unknown
ξi is involved in an equation with its immediate neighbors, say ξi±1 ), to 2 (every
unknown is involved in an equation with its neighbors at distance 2, say ξi±2 ), to
22 , etc. keeping note that when the distance is such to make i ± d be smaller than
1 or larger than n, the variables are 0. Therefore, in k = log n steps there will be n
equations, each connecting only one of the variables to the right-hand side.
As we explain next, each iteration of the algorithm can be elegantly described
in terms of operations with the matrices D, L , U that arise in the splitting A( j) =
D ( j) − L ( j) − U ( j) where D ( j) is diagonal and L ( j) , U ( j) are strictly lower and
strictly upper triangular respectively and of a very special form. For this reason, we
will refer to our formulation as matrix splitting-based paracr.
Definition 5.1 A lower (resp. upper) triangular matrix is called t-upper (resp. lower)
diagonal if its only non-zero elements are at the tth diagonal below (resp. above) the
main diagonal. A matrix that is the sum of a diagonal matrix, a t-upper diagonal and
a t-lower diagonal matrix is called t-tridiagonal.
Clearly, if t ≥ n, then the t-tridiagonal matrix will be diagonal and the t-diagonal
matrices will be zero. Note that t-tridiagonal matrices appear in the study of special
Fibonacci numbers (cf. [62]) and are a special case of “triadic matrices”, defined in
[63] to be matrices for which the number of non-zero off-diagonal elements in each
column is bounded by 2.

Lemma 5.1 The product of two t-lower (resp. upper) diagonal matrices of equal
size is 2t-lower (resp. upper) diagonal. Also if 2t > n then the product is the zero
5.5 Tridiagonal Systems 135

matrix. If L is t-lower diagonal and U is t-upper diagonal, then LU and UL are both
diagonal. Moreover the first t elements of the diagonal of LU are 0 and so are the
last t diagonal elements of UL.
Proof The nonzero structure of a 1-upper diagonal matrix is the same (without loss
of generality, we ignore the effect of zero values along the superdiagonal) with that of
J  , where as usual J  = e1 e2 +e2 e3 +· · ·+en−1 en is the 1-upper diagonal matrix
with all nonzero elements equal to 1. A similar result holds for 1-lower diagonal
matrices, except that we use J . Observe that (J  )2 = e1 e3 + · · · + en−2 en which
is 2-upper diagonal and in general

(J  )t = e1 et+1

+ · · · en−t en .

which is t-upper diagonal and has the same nonzero structure as any t-upper diagonal
U . If we multiply two t-upper diagonal matrices, the result has the same nonzero
structure as (J  )2t and thus will be 2t-upper diagonal. A similar argument holds for
the t-lower diagonal case, which proves the first part of the lemma. Note next that

(J  )t J t = (e1 et+1

+ · · · en−t en )(e1 et+1

+ · · · en−t en )
= e1 e1 + · · · + en−t en−t


which is a diagonal matrix with zeros in the last t elements of its diagonal. In the
same way, we can show that J t (J  )t is diagonal with zeros in the first t elements.
Corollary 5.3 Let A = D − L − U be a t-tridiagonal matrix for some nonnegative
integer t such that t < n, L (resp. U ) is t-lower (resp. upper) diagonal and all
diagonal elements of D are nonzero. Then (D + L + U )D −1 A is 2t-tridiagonal. If
2t > n then the result of all the above products is the zero matrix.
Proof After some algebraic simplifications we obtain

(I + L D −1 + U D −1 )A = D − (L D −1 U + U D −1 L) − (L D −1 L + U D −1 U ). (5.87)

Multiplications with D −1 have no effect on the nonzero structure of the results. Using
the previous lemma and the nonzero structure of U and L, it follows that UD−1 U
is 2t-upper diagonal, LD−1 L is 2t-lower diagonal. Also UD−1 L is diagonal with its
last t diagonal elements zero and LD−1 U is diagonal with its first t diagonal elements
zero. Therefore, in formula (5.87), the second right-hand side term (in parentheses)
is diagonal and the third is the sum of a 2t-lower diagonal and a 2t-upper diagonal
matrix, proving the claim.
Let now n = 2k and consider the sequence of transformations

(A( j+1) , f ( j+1) ) = (I + L ( j) (D ( j) )−1 + U ( j) (D ( j) )−1 )(A( j) , f ( j) ) (5.88)

for j = 1, . . . , k −1, where (A(1) , f (1) ) = (A, f ). Observe the structure of A( j) . The
initial A(1) is tridiagonal and so from Corollary 5.3, matrix A(2) is 2-tridiagonal, A(3)
136 5 Banded Linear Systems

is 22 -tridiagonal, and so on; finally A(k) is diagonal. Therefore (A, f ) is transformed


to (A(k) , f (k) ) with A(k) diagonal from which the solution can be computed in 1 vector
division. Figure 5.9 shows the nonzero structure of A(1) up to A(4) for A ∈ R16×16 .
Also note that at each step. the algorithm only needs the elements of A( j) which can
overwrite A.
We based the construction of paracr in terms of the splitting A = D− L −U , that
is central in classical iterative methods. In Algorithm 5.4 we describe the algorithm in
terms of vector operations. At each stage, the algorithm computes the three diagonals
of the new 2 j -tridiagonal matrix from the current ones and updates the right-hand
side. Observe that there appear to be at most 12 vector operations per iteration.
These affect vectors of size n − 2 j at steps j = 1 : k − 1. Finally, there is one
extra operation of length n. So the total number of operations is approximately
12n log n. All elementwise multiplications in lines 5–8 can be computed concurrently
if sufficiently many arithmetic units are available. In this way, there are only 8 parallel
operations in each iteration of paracr.

Proposition 5.2 The matrix splitting-based paracr algorithm can be implemented


in T p = 8 log n + O(1) parallel operations on p = O(n) processors.

Algorithm 5.4 paracr: matrix splitting-based paracr (using transformation 5.88).


Operators ,  denote elementwise multiplication and division respectively.
Input: A = [λi , δi , υi+1 ] ∈ Rn×n and f ∈ Rn where n = 2k .
Output: Solution x of Ax = f .
1: l = (λ2 , . . . , λn ) , d = (δ1 , . . . , δn ) , c = (υ2 , . . . , υn ) .
2: p = −l; q = −u;
3: do j = 1 : k
4: σ = 2 j−1 ;
5: p = l  d 1:n−σ ; q = u  d1+σ  :n
0ρ q  f σ +1:n
6: f = f + +
p  f 1:n−σ 0σ
 
0σ q l
7: d = d − −
pu 0σ
8: l = pσ +1:n−σ  l1:n−2σ ; u = q1:n−2σ  u 1+σ :n−σ
9: end
10: x = f  d

It is frequently the case that the dominance of the diagonal terms becomes more
pronounced as cr and paracr progress. One can thus consider truncated forms of the
above algorithms to compute approximate solutions. This idea was explored in [51]
where it was called incomplete cyclic reduction, in the context of cyclic reduction for
block-tridiagonal matrices. The idea is to stop the reduction stage before the log n
steps, and instead of solving the reduced system exactly, obtain an approximation
followed by the necessary back substitution steps. It is worth noting that terminating
the reduction phase early alleviates or avoids completely the loss of parallelism of cr.
5.5 Tridiagonal Systems 137

Fig. 5.9 Nonzero structure 0 0


of A( j) as paracr progresses
from j = 1 (upper left) to
5 5
j = 4 (lower right)

10 10

15 15

0 5 10 15 0 5 10 15
nz = 44 nz = 40

0 0

5 5

10 10

15 15

0 5 10 15 0 5 10 15
nz = 32 nz = 16

Incomplete point and block cyclic reduction were studied in detail in [64, 65];
it was shown that if the matrix is diagonally dominant by rows then if the row
dominance factor (revealing the degree of diagonal dominance) of A, defined by
⎧ ⎫
⎨ 1  ⎬
rdf(A) := max |αi, j |
i ⎩ |αi,i | ⎭
j =i

is sufficiently smaller than 1, then incomplete cyclic reduction becomes a viable


alternative to the ScaLAPACK algorithms for tridiagonal and banded systems.
We next show the feasibility of an incomplete matrix splitting-based paracr
algorithm for matrices that are strictly diagonally dominant by rows based on the
previously established formulation for paracr and a result from [64]. We illustrate
this idea by considering the first step of paracr. Recall that

A(2) = (D − LD−1 U − UD−1 L) − (LD−1 L + UD−1 U ).

is 2-tridiagonal (and thus triadic). The first term in parentheses provides the diagonal
elements and the second term the off-diagonal ones. Next note that A ∞ = |A| ∞
and the terms L D −1 L and U D −1 U have their nonzeros at different position. There-
fore, the row dominance factor of A(2) is equal to

rdf(A(2) ) = (D − L D −1 U − U D −1 L)−1 (L D −1 L + U D −1 U ) ∞ .
138 5 Banded Linear Systems

The element in diagonal position i of D − LD−1 U − UD−1 L is


αi,i−1 αi,i+1
αi,i − αi−1,i − αi+1,i .
αi−1,i−1 αi+1,i+1

Also the sum of the magnitudes of the off-diagonal elements at row i is


αi,i−1 αi,i+1
|αi−1,i−2 | + |αi+1,i+2 |.
αi−1,i−1 αi+1,i+1

We follow the convention that elements whose indices are 0 or n + 1 are 0.


Then the row dominance factor of A(2) is
⎧ αi−1,i−2 αi+1,i+2 ⎫

⎨ |αi,i−1 α | + |αi,i+1 | ⎪⎬
αi+1,i+1
rdf(A(2) ) = max
i−1,i−1
i ⎪ α α
⎩ |αi,i − i,i−1 αi−1,i − i,i+1 αi+1,i | ⎪ ⎭
αi−1,i−1 αi+1,i+1
⎧ α αi,i+1 αi+1,i+2 ⎫
i,i−1 αi−1,i−2

⎨ | αi,i α |+| | ⎪
i−1,i−1 αi,i αi+1,i+1 ⎬
= max α α α α
i ⎪⎩ |1 − i,i−1 i−1,i − i,i+1 i+1,i | ⎪ ⎭
αi,i αi−1,i−1 αi,i αi+1,i+1
⎧ αi,i−1 αi−1,i−2 αi,i+1 αi+1,i+2 ⎫

⎨ | α |+| | ⎪⎬
i,i αi−1,i−1 αi,i αi+1,i+1
≤ max α α α α
i ⎪⎩ |1 − | i,i−1 i−1,i | − | i,i+1 i+1,i || ⎪ ⎭
αi,i αi−1,i−1 αi,i αi+1,i+1

For row i set


αi,i−1 αi,i+1 αi−1,i−2 αi+1,i+2
ψ̂i = | |, ψi = | |, ζi = | |, ηi = | |,
αi,i αi,i αi−1,i−1 αi+1,i+1
αi−1,i αi+1,i
θ̂i = | |, θi = | |.
αi−1,i−1 αi+1,i+1

and let ε = rdf(A). Then,

ψ̂i + ψi ≤ ε, ζi + θ̂i ≤ ε, ηi + θi ≤ ε (5.89)

and because of strict row diagonal dominance, ψ̂i θ̂i + ψi θi < 1. Therefore if we
compute for i = 1, . . . , n the maximum values of the function

ψ̂ζi + ψηi
gi (ψ̂, ψ) =
1 − ψ̂ θ̂i − ψθi

over (ψ, ψ̂) assuming that conditions such as (5.89) hold, then the row dominance
factor of A(2) is less than or equal to the maximum of these values. It was shown in
5.5 Tridiagonal Systems 139

[64] that if conditions such as (5.89) hold then

ψ̂ζi + ψηi
≤ ε2 . (5.90)
1 − ψ̂ θ̂i − ψθi

Therefore, rdf(A(2) ) ≤ ε2 . The following proposition can be used to establish a


termination criterion for incomplete matrix splitting-based paracr. We omit the
details of the proof.
Proposition 5.3 Let A be tridiagonal and strictly diagonally dominant by rows with
dominance factor ε. Then

rdf(A( j+1) ) ≤ ε2 , j = 1, . . . , log n − 2.


j

We also observe that it is possible to use the matrix splitting framework in order
to describe cr. We sketch the basic idea, assuming this time that n = 2k − 1. At
each step j = 1, . . . , k − 1, of the reduction phase the following transformation is
implemented:

( j) ( j)
(A( j+1) , f ( j+1) ) = (I + L e (D ( j) )−1 + Ue (D ( j) )−1 )(A( j) , f ( j) ), (5.91)

where initially (A(1) , f (1) ) = (A, f ). We assume that the steps of the algorithm can
be brought to completion without division by zero. Let A( j) = D ( j) − L ( j) − U ( j)
denote the splitting of A( j) into its diagonal and strictly lower and upper triangular
( j) ( j)
parts. From L ( j) and U ( j) we extract the strictly triangular matrices L e and Ue
( j) (
following the rule that they contain the values of L and U at locations (i2 , i2 −
j) j j

2 j−1 ) and (i2 j −2 j−1 , i2 j ) respectively for i = 1, . . . , 2k− j −1 and zero everywhere
else. Then at the end of the reduction phase, row 2k−1 (at the middle) of matrix A(k−1)
will only have its diagonal (middle) element nonzero and the unknown ξ2k−1 can be
computed in one division. This is the first step of the back substitution phase which
consists of k steps. At step j = 1, . . . , k, there is a vector division of length 2 j−1
to compute the unknowns indexed by (1, 3, . . . , 2 j − 1) · 2k− j and 2 j _AXPY,
BLAS1, operations on vectors of length 2k− j − 1 for the updates. The panels in
Fig. 5.10 illustrate the matrix structures that result after k steps of cyclic reduction
(k = 3 in the left and k = 4 in the right panel).
The previous procedure can be extended to block-tridiagonal systems. An analysis
similar to ours for the case of block-tridiagonal systems and block cyclic reduction
was described in [66]; cf. Sect. 6.4.

5.5.3 LDU Factorization by Recurrence Linearization

Assume that all leading principal submatrices of A are nonsingular. Then there exists
a factorization A = L DU where D = diag(δ1 , . . . , δn ) is diagonal and
140 5 Banded Linear Systems

0 0

1 2

2 4

3 6

4 8

5 10

6 12

7 14
8 16
0 2 4 6 8 0 5 10 15
nz = 15 nz = 37

Fig. 5.10 Nonzero structure of A(3) ∈ R7×7 (left) and A(4) ∈ R15×15 (right). In both panels, the
unknown corresponding to the middle equation is obtained using the middle value of each matrix
enclosed by the rectangle with double border. The next set of computed unknowns (2 of them)
correspond to the diagonal elements enclosed by the simple rectangle, the next set of computed
unknowns (22 of them) correspond to the diagonal elements enclosed by the dotted rectangles. For
A(4) , the final set of 23 unknowns correspond to the encircled elements

⎛ ⎞ ⎛ ⎞
1 1 υ2
⎜λ2 ⎟ ⎜ .. .. ⎟
⎜ ⎟ ⎜ . . ⎟
L=⎜ . . ⎟ ,U = ⎜ ⎟.
⎝ .. .. ⎠ ⎝ υn ⎠
λn 1 1

We can write the matrix A as a sum of the rank-1 matrices formed by the columns
of L and the rows of U multiplied by the corresponding diagonal element of D.

A = δ1 L :,1 U1,: + · · · + δn L :,n Un,:

Observing the equalities along the diagonal, the following recurrence of degree 1
holds:
αi−1,i αi,i−1
δi = αi,i − , i = 2 : n. (5.92)
δi−1

This recurrence is linearized as described in Sect. 3.4. Specifically, using new vari-
ables τi and setting δi = τi /τi−1 with τ0 = 1, τ1 = α1,1 it follows that

τi = αi,i τi−1 − αi−1,i αi,i−1 τi−2 , τ0 = 1, τ1 = α1,1 . (5.93)


5.5 Tridiagonal Systems 141

The values τ1 , . . . , τn satisfy the system


⎛ ⎞⎛ ⎞ ⎛ ⎞
1 τ0 1
⎜ −α1,1 1 ⎟ ⎜τ1 ⎟ ⎜0⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜α1,2 α2,1 −α2,2 1 ⎟ ⎜τ2 ⎟ ⎜0⎟
⎜ ⎟⎜ ⎟ = ⎜ ⎟ (5.94)
⎜ .. .. .. ⎟ ⎜ .. ⎟ ⎜ .. ⎟
⎝ . . . ⎠ ⎝ . ⎠ ⎝.⎠
αn−1,n αn,n−1 −αn,n 1 τn 0

The right-hand side is the unit vector, thus the solution is the first column of the
inverse of the coefficient matrix, that is lower triangular with bandwidth 3. This
specific system, as was shown in [18, Lemma 2], can be solved in 2 log n + O(1)
steps using at most 2n processors. It is worth noting that this is faster than the
approximate cost of 3 log n steps predicted by Theorem 3.2 and this is due to the unit
vector in the right-hand side. Once the τi ’s are available, the elements of D, L and
U are computable in O(1) parallel steps using n processors. Specifically
αi+1,i αi,i+1
δi = τi /τi−1 , λi+1 = , υi = .
di di

In theory, therefore, if the factorization exists, we can compute the factors L , D, U


in 2 log n + O(1) steps using at most 2n processors. To solve a linear system, it only
remains to perform the backward and forward substitutions and one parallel division.
Using the same theorem, each of the bidiagonal systems can be solved in 2 log n steps
on n − 1 processors. We thus obtain the following result for the parallel computation
of the LDU factorization and its application to solving tridiagonal systems.
Proposition 5.4 [67] If a tridiagonal matrix A ∈ Rn×n is non-singular and so are
all its leading principal submatrices then all the elements of its LDU factorization
can be computed on p = 2n processors using T p = 2 log n + O(1) parallel steps.
For such an A, the linear system Ax = f can thus be solved using 6 log n + O(1)
steps.
Algorithm 5.5 shows the steps for the LDU factorization. In practice, the above
algorithm must be used with care. In particular, there is the possibility of overflow or
underflow because τi = δi δi−1 · · · δ1 . For example, when δi = i making τi = i! [67].
It is worth noting that the τi ’s can also be computed by rewriting relations (5.93) as



αi,i
2 1


τi τi−1 = τi−1 τi−2 , τ1 τ0 = α1,1 1
−αi−1,i αi,i−1 0

This is a vector recurrence for the row vector t (i) = (τi , τi−1 ) that we can write as

t (i) = t (i−1) Si ,
142 5 Banded Linear Systems

Algorithm 5.5 trid_ldu: LDU factorization of tridiagonal matrix.


Input: A = [αi,i−1 , αi,i , αi,i+1 ] ∈ Rn×n irreducible and such that all leading principal submatrices
are nonsingular.
Output: The elements δ1:n , λ2:n and υ2:n of the diagonal D, lower bidiagonal L and upper bidiagonal
U.
//Construct triangular matrix C = [γi, j ] ∈ R(n+1)×(n+1)
(n+1) (n+1) 
1: C = e1 (e1 ) , γn+1,n−1 = αn−1,n αn,n−1
2: doall i = 1 : n
3: γi+1,i+1 = 1, γi+1,i = −αi,i
4: end
//Solve using parallel algorithm for banded lower triangular systems
(n+1)
5: Solve Ct = e1 where C ∈ R(n+1)×(n+1)
//Compute the unknown elements of L , D, U
6: doall i = 1 : n
7: δi = ti+1 ti
8: end
9: doall i = 1 : n − 1
α αi,i+1
10: λi+1 = i+1,i δi , υi+1 = δi
11: end

where Si denotes the 2 × 2 multiplier. Therefore, the unknowns can be recovered by


computing the terms

t (2) = t (1) S1 , t (3) = t (1) S1 S2 , . . . , t (n) = t (1) S1 · · · Sn

and selecting the first element from each t (i) ,


⎛ ⎞ ⎛ (1) ⎞
τ1 t
⎜ .. ⎟ ⎜ .. ⎟
⎝ . ⎠ = ⎝ . ⎠ e1 .
τn t (n)

This can be done using a matrix (product) parallel prefix algorithm on the set
t (1) , S1 , . . . , Sn . Noticing that t (i) and t (i+1) both contain the value τi , it is suffi-
cient to compute approximately half the terms, say t (1) , t (3) , . . . . For convenience
assume that n is odd. Then this can be accomplished by first computing the prod-
ucts t (1) S1 and S2i S2i+1 for i = 1, . . . , (n − 1)/2 and then applying parallel prefix
matrix product on these elements. For example, when n = 7, the computation can
be accomplished in 3 parallel steps, shown in Table 5.1.

Table 5.1 Steps of a parallel prefix matrix product algorithm to compute the recurrence (5.93)
Step 1 t (1) S1 S2:3 S4:5 S6:7
Step 2 t (1) S1:3 S2:5 S4:7
Step 3 t (1) S1:5 t (1) S1:7
For j > i, the term Si: j denotes the product Si Si+1 · · · S j
5.5 Tridiagonal Systems 143

Regarding stability in finite precision, to achieve the stated complexity, the algo-
rithm uses either parallel prefix matrix products or a lower banded triangular solver
(that also uses parallel prefix). An analysis of the latter process in [18] and some
improvements in [68], show that the bound for the 2-norm of the absolute forward
error contains a factor σ n+1 , where σ = maxi Si 2 . This suggests that the absolute
forward error can be very large in norm. Focusing on the parallel prefix approach, it
is concluded that the bound might be pessimistic, but deriving a tighter one, for the
general case, would be hard. Therefore, the algorithm must be used with precaution,
possibly warning the user when the error becomes large.

5.5.4 Recursive Doubling

Let a tridiagonal matrix be irreducible; then the following recurrence holds:

αi,i αi,i−1 φi
ξi+1 = − ξi − ξi−1 +
αi, i + 1 αi, i + 1 αi,i+1
⎛ ⎞
ξi
If we set x̂i = ⎝ξi−1 ⎠ then we can write the matrix recurrence
1
⎛ ⎞
ρi σi τi
αi,i αi,i−1 φi
x̂i+1 = Mi x̂i , where Mi = ⎝ 1 0 0 ⎠ , ρi = − , σi = − , τi = .
0 0 1 αi,i+1 αi,i+1 αi,i+1

The initial value is x̂1 = (ξ1 , 0, 1) . Observe that ξ1 is yet unknown. If it were
available, then the matrix recurrence can be used to compute all elements of x from
x̂2 , . . . , x̂n :

x̂2 = M1 x̂1 , x̂3 = M2 x̂2 = M2 M1 x̂1 , . . . , x̂n = Mn−1 · · · M1 x̂1 .

We can thus write

x̂n+1 = Pn x̂1 , where Pn = Mn · · · M1 .

From the structure of M j we can write


⎛ ⎞
π1,1 π1,2 π1,3
Pn = ⎝π2,1 π2,2 π3,3 ⎠ .
0 0 1
144 5 Banded Linear Systems

To find ξ1 we observe that ξ0 = ξn+1 = 0 and so 0 = π1,1 ξ1 + π1,3 from which


it follows that ξ1 = −π1,3 /π1,1 once the elements of Pn are computed. In [69], a
tridiagonal solver is proposed based on this formulation. We list this as Algorithm 5.6.
The algorithm uses matrix parallel prefix on the matrix sequence M1 , . . . , Mn in order
to compute Pn and then x̂1 , . . . , x̂n to recover x, and takes a total of 15n + O(1)
arithmetic operations on one processor. However, the matrix needs to be diagonally
dominant for the method to proceed without numerical problems; some relevant error
analysis was presented in [70].

5.5.5 Solving by Givens Rotations

Givens rotations can be used to implement the QR factorization in place of House-


holder reflections in order to bring a matrix into upper triangular form [22, 71].
Givens rotations become more attractive for matrices with special structure, e.g.
upper Hessenberg and tridiagonal. Givens rotations have been widely used in par-
allel algorithms for eigenvalues; see for example [18, 72] and our discussion and
references in Chaps. 7 and 8. Parallel algorithms for linear systems based on Givens
rotations were proposed early on in [12, 18].

Algorithm 5.6 rd_pref: tridiagonal system solver using rd and matrix parallel
prefix.
Input: Irreducible A = [αi,i−1 , αi,i , αi,i+1 ] ∈ Rn×n , and right-hand side f ∈ Rn .
Output: Solution of Ax = f .
1: doall i = 1, . . . , n
αi,i αi,i−1 φi
2: ρi = − αi,i+1 , σi = − αi,i+1 , τi = αi,i+1 .
3: end
4: Compute the products P2⎛= M2 M1⎞, . . . , Pn = Mn · · · M1 using a parallel prefix matrix product
ρi σi τi
algorithm, where Mi = ⎝ 1 0 0 ⎠.
0 0 1
5: Compute ξ1 = −(Pn )1,3 /(Pn )1,1
6: doall i = 2, . . . , n
7: x̂i = Pi x̂1 where x̂1 = (ξ1 , 0, 1) .
8: end
9: Gather the elements of x from {x̂1 , . . . , x̂n }

We first consider the Givens based parallel algorithm from [12, Algorithm I]).
As will become evident, the algorithm shares many features with the parallel LDU
factorization (Algorithm 5.5). Then in Sect. 5.5.5 we revisit the Givens rotation solver
based on Spike partitioning from [12, Algorithm II].
We will assume that the matrix is irreducible and will denote the elementary
2 × 2 Givens rotation submatrix that will be used to eliminate the element in position
(i + 1, i) of the tridiagonal matrix by

ci si
G i+1,i = (5.95)
−si ci .
5.5 Tridiagonal Systems 145

To perform the elimination of subdiagonal element in position (i + 1, i) we use the


block-diagonal matrix
⎛ ⎞
Ii−1
(i)
G i+1,i = ⎝ G i+1,i ⎠.
In−i−1

(i) (1)
Let A(0) = A and A(i) = G i+1,i · · · G 2,1 A(0) be the matrix after i = 1, . . . , n − 1
rotation steps so that the subdiagonal elements in positions (2, 1), . . . , (i, i − 1) of
A(i) are 0 and A(n−1) = R is upper triangular. Let
⎛ ⎞
λ1 μ1 ν1
⎜ λ2 μ2 ν2 ⎟
⎜ ⎟
⎜ . .
.. .. . .. ⎟
⎜ ⎟
⎜ ⎟
A(i−1) =⎜
⎜ λ i−1 μ i−1 νi−1

⎟ (5.96)
⎜ π c α ⎟
⎜ i i−1 i,i+1 ⎟
⎜ α α α ⎟
⎝ i+1,i i+1,i+1 i+1,i+2 ⎠
.. .. ..
. . .

be the result after the subdiagonal entries in rows 2, . . . , i have been annihilated.
Observe that R is banded, with bandwidth 3. If i = n − 1, the process terminates
and the element in position (n, n) is πn ; otherwise, if i < n − 1 then rows i and i + 1
need to be multiplied by a rotation matrix to zero out the subdiagonal element in row
i + 1.
Next, rows i and i + 1 of A(i−1) need to be brought into their final form. The
following relations hold for i = 1, . . . , n − 1 and initial values c0 = 1, s0 = 0:
πi αi+1,i
ci = , si = (5.97)
λi λi
πi = ci−1 αi,i − si−1 ci−2 αi−1,i (5.98)
μi = si αi+1,i+1 + ci ci−1 αi,i+1 (5.99)
νi = si αi+1,i+2 (5.100)

From the above we obtain

ci λi = ci−1 αi,i − si−1 ci−2 αi−1,i


αi,i−1
= ci−1 αi,i − ci−2 αi−1,i
λi−1

that we write as
ci αi,i−1 αi−1,i
λi = αi,i − ci−1 .
ci−2 λi−1
ci−1
146 5 Banded Linear Systems

Observe the similarity with the nonlinear recurrence (5.92) we encountered in the
steps leading to the parallel LDU factorization Algorithm 5.5. Following the same
approach, we apply the change of variables
τi ci
= λi (5.101)
τi−1 ci−1

in order to obtain the linear recurrence

τi = αi,i τi−1 − αi,i−1 αi−1,i τi−2 , τ0 = 1, τ1 = α1,1 , (5.102)

and thus the system


⎛ ⎞⎛ ⎞ ⎛ ⎞
1 τ0 1
⎜ −α1,1 1 ⎟⎜ τ1 ⎟ ⎜0⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜α1,2 α2,1 −α2,2 1 ⎟⎜ τ2 ⎟ ⎜0⎟
⎜ ⎟⎜ ⎟ = ⎜ ⎟ (5.103)
⎜ . . ⎟⎜ .. ⎟ ⎜ .. ⎟
⎝ . ⎠⎝ . ⎠ ⎝.⎠
Gant αn−2,n−1 αn−1,n−2 −αn−1,n−1 1 τn−1 0

The coefficient matrix can be generated in one parallel multiplication. As noted


in our earlier discussion for Algorithm 5.5, forming and solving the system takes
2 log n + O(1) steps using 2n − 4 processors (cf. Algorithm 3.3). Another way is to
observe that the recurrence (5.102) can also be expressed as follows



αi,i 1
τi τi−1 = τi−1 τi−2 , (5.104)
−αi,i−1 αi−1,i 0



where τ1 τ0 = α1,1 1 (5.105)

We already discussed the application of parallel prefix matrix multiplication to com-


pute the terms of this recurrence in the context of the LDU algorithm; cf. Sect. 5.5.3.
From relation (5.101) and the initial values for c0 = 1, τ0 = 1, it follows that
ci ci c1
τi = λi τi−1 = λi · · · λ1 · · · τ0
ci−1 ci−1 c0
= λi · · · λ1 ci .
i
Setting θi = ( j=1 λ j )
2 it follows from trigonometric identities that

θi = αi+1,i
2
θi−1 + τi2 , for i = 1, . . . , n − 1 (5.106)

with initial values θ0 = τ0 = 1. This amounts to a linear recurrence for θi that can
be expressed as the unit lower bidiagonal system
5.5 Tridiagonal Systems 147
⎛ ⎞
⎛ ⎞⎛ ⎞ τ02
1 θ0 ⎜ ⎟
⎜−α2,1
2 1 ⎟⎜ θ1 ⎟ ⎜ τ12 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ −α 2 ⎟⎜ θ2 ⎟ ⎜ τ22 ⎟
⎜ 3,2 1 ⎟⎜ ⎟=⎜ ⎟ (5.107)
⎜ .. .. ⎟⎜ .. ⎟ ⎜ ⎟
⎝ . . ⎠⎝ . ⎠ ⎜ .. ⎟
⎝ . ⎠
−αn,n−1
2 1 θn−1
τn−1
2

On 2(n − 1) processors, one step is needed to form the coefficient matrix and right-
hand side. This lower bidiagonal system can be solved in 2 log n steps on n − 1
processors, (cf. [18, Lemma 1] and Theorem 3.2).
As before, we can also write the recurrence (5.106) as
 2


αi+1,i 0
θi 1 = θi−1 1 , θ0 = 1. (5.108)
τi2 1

It is easy to see that all θi ’s can be obtained using a parallel algorithm for the prefix
matrix product. Moreover, the matrices are all nonnegative, and so the componen-
twise relative errors are expected to be small; cf. [68]. This further strengthens the
finding in [12, Lemma 3] that the normwise relative forward error bound in comput-
ing the θi ’s only grows linearly with n.
Using the values {θi , τi } the following values can be computed in 3 parallel steps
for i = 0, . . . , n − 1:

τi2 2 θi αi+1,i
2
ci2 = , λi = , si2 = . (5.109)
θi θi−1 λi2

From these, the values μi , νi along the superdiagonals of R = A(n−1) can be obtained
in 3 parallel steps on 2n processors.
The solution of a system using QR factorization also requires the multiplication of
the right-hand side vector by the matrix Q that is the product of the n − 1 associated
Givens rotations. By inspection, it can be easily seen that this product is lower
Hessenberg; cf. [73]. An observation with practical implications made in [12] that is
not difficult is that Q can be expressed as Q = W LY + S J  [12], where L is the
lower triangular matrix with all nonzero elements equal to 1, J = (e2 , . . . , en , 0)
and W = diag(ω1 , . . . , ωn ), Y = diag(η1 , . . . , ηn ) and S = diag(s1 , . . . , sn−1 , 0),
where the {ci , si } are as before and
ci−1
ωi = ci ρi−1 , ηi = , i = 1, . . . , n, (5.110)
ρi−1

i
and ρi = (−1)i s j , i = 1, . . . , n − 1 with ρ0 = c0 = cn = 1.
j=1
148 5 Banded Linear Systems

The elements ρi are obtained from a parallel prefix product in log n steps using
n−2
2 processors. It follows that multiplication by Q can be decomposed into five
easy steps involving arithmetic, 3 being multiplications of a vector with the diagonal
matrices W, Y and S, and a computation of all partial sums of a vector (the effect of
L). The latter is a prefix sum and can be accomplished in log n parallel steps using
n processors; see for instance [74].
Algorithm 5.7, called pargiv, incorporates all these steps to solve the tridiagonal
system using the parallel generation and application of Givens rotations.

Algorithm 5.7 pargiv: tridiagonal system solver using Givens rotations.


Input: Irreducible A = [αi,i−1 , αi,i , αi,i+1 ] ∈ Rn×n , and right-hand side f ∈ Rn .
Output: Solution of Ax = f .
//Stage I :
1: Compute (τ1 . . . , τn−1 ) by solving the banded lower triangular system (5.103) or using parallel
prefix matrix product for (5.104).
//Stage II:
2: Compute (θ1 , . . . , θn−1 ) by solving the banded lower triangular system (5.107) or using parallel
prefix matrix product for (5.108).
//Stage III:
3: Compute {ci , λi } for i = 1 : n followed by {si } from Eqs. 5.109.
//Stage IV:
4: Compute W and Y from (5.110).
5: Compute fˆ = W LY f + S J f
6: Solve Rx = fˆ where R = A(n−1) is the banded upper triangular matrix in (5.96).

The leading cost of pargiv is 9 log n parallel operations: There are 2 log n in
each of stages I , II because of the special triangular systems that need to be solved,
2 log n in Stage IV (line 4) because of the special structure of the factors W, L , Y, S, J
and finally, another 3 log n in stage IV (line 6) also because of the banded trangular
system. Stage III contributes only a constant to the cost.
From the preceding discussion the following theorem holds:
Theorem 5.6 ([12, Theorem 3.1, Lemma 3.1]) Let A be a nonsingular irreducible
tridiagonal matrix of order n. Algorithm pargiv for solving Ax = f based on the
orthogonal factorization QA = R constructed by means of Givens rotations takes
T p = 9 log n + O(1) parallel steps on 3n processors. The resulting speedup is
S p = O( logn n ) and the efficiency E p = O( log1 n ). Specifically, the speedup over
8n
the algorithm implemented on a single processor is approximately 3 log n and over
Gaussian elimination with partial pivoting approximately Moreover, if B ∈ 4n
3 log n .
R , then the systems AX = B can be solved in the same number of parallel
n×k

operations using approximately (2 + k)n processors.


Regarding the numerical properties of the algorithm, we have already seen that
the result of stage I can have large absolute forward error. Since the bound grows
exponentially with n if the largest 2-norm of the multipliers Si is larger than 1, care
5.5 Tridiagonal Systems 149

is required in using the method in the general case. Another difficulty is that the
computation of the diagonal elements of W and Y in Stage I V can lead to underflow
and overflow. Both of the above become less severe when n is small. We next describe
a partitioning approach for circumventing these problems.

5.5.6 Partitioning and Hybrids

Using the tridiagonal solvers we encountered so far as basic components, we can


construct more complex methods either by combination into hybrids that attempt to
combine the advantages of each component method (see e.g. [30]), and/or by applying
partitioning in order to create another layer of parallelism. In addition, in the spirit
of the discussion for Spike in Sect. 5.2.2, we could create polyalgorithms that adapt
according to the characteristics of the problem and the underlying architecture.
Many algorithms for tridiagonal systems described in the literature adopt the parti-
tioning approach. Both Spike (cf. Sect. 5.1) and the algorithms used in ScaLAPACK
[2, 3, 75] for tridiagonal and more general banded systems are of this type. In a well
implemented partitioning scheme it appears relatively easy to obtain an algorithm
that returns almost linear speedup for a moderate number of processors, say p, by
choosing the number of partitions equal to the number of processors. The sequen-
tial complexity of tridiagonal solvers is linear in the size of the problem, therefore,
since the reduced system is of size O( p) and p  n, then the parallel cost of the
partitioning method would be O(n/ p), resulting in linear speedup.
We distinguish three major characteristics of partitioning methods: (i) the parti-
tioning scheme for the matrix, (ii) the method used to solve the subsystems, and (iii)
the method used to solve the reduced system. The actual partitioning of the matrix,
for example, leads to submatrices and a reduced system that can be anything from
well conditioned to almost singular. In some cases, the original matrix is SPD or
diagonally dominant, and so the submatrices along the diagonal inherit this prop-
erty and thus partitioning can be guided solely by issues such as load balancing and
parallel efficiency. In the general case, however, there is little known beforehand
about the submatrices. One possibility would be to attempt to detect singularity of
the submatrices corresponding to a specific partitioning, and repartition if this is
the case. Note that it is possible to detect in linear time whether a given tridiagonal
matrix can become singular due to small perturbations of its coefficients [76, 77]. We
next show a parallel algorithm based on Spike that detects exact singularity in any
subsystem and implements a low-cost correction procedure to compute the solution
of the original (nonsingular) tridiagonal system. From the partitionings proposed in
the literature (see for instance [5, 12, 15, 56, 57, 78, 79]) we consider the Spike
partitioning that we already encountered for banded matrices earlier in this chapter
and write the tridiagonal system Ax = f as follows:
150 5 Banded Linear Systems
⎛ ⎞⎛ ⎞ ⎛ ⎞
A1,1 A1,2 x1 f1
⎜ A2,1 A2,2 A2,3 ⎟ ⎜ x2 ⎟ ⎜ f 2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ .. .. .. ⎟ ⎜ . ⎟ = ⎜ . ⎟, (5.111)
⎝ . . . ⎠ ⎝ .. ⎠ ⎝ .. ⎠
A p, p−1 A p, p xp fp

with all the diagonal submatrices being square. Furthermore, to render the exposition
simpler we assume that p divides n and that each Ai,i is of size, m = n/ p. Under these
assumptions, the Ai,i ’s are tridiagonal, the subdiagonal blocks A2,1 , . . . , A p, p−1 are
multiples of e1 em and each superdiagonal block A , . . . , A
1,2 p−1, p is a multiple of

em e1 .
The algorithm we describe here uses the Spike methodology based on a Givens-
QR tridiagonal solver applied to each subsystem. In fact, the first version of this
method was proposed in [12] as a way to restrict the size of systems on which pargiv
(Algorithm 5.7) is applied and thus prevent the forward error from growing too large.
This partitioning for stabilization was inspired by work on marching methods; cf. [12,
p. 87] and Sect. 6.4.7 of Chap. 6. Today, of course, partitioning is considered primarily
as a method for adding an extra level of parallelism and greater opportunities for
exploiting the underlying architecture. Our Spike algorithm, however, does not have
to use pargiv to solve each subsystem. It can be built, for example, using a sequential
method for each subsystem.
Key to the method are the following two results that restrict the rank deficiency
of tridiagonal systems.
Proposition 5.5 Let the nonsingular tridiagonal matrix A be of order n, where
n = pm and let it be partitioned into a block-tridiagonal matrix of p blocks of order
m each in the diagonal with rank-1 off-diagonal blocks. Then each of the tridiagonal
submatrices along the diagonal and in block positions 2 up to p − 1 have rank at
least m − 2 and the first and last submatrices have rank at least m − 1. Moreover, if
A is irreducible, then the rank of each diagonal submatrix is at least m − 1.
The proof is omitted; cf. related results in [12, 80]. Therefore, the rank of each
diagonal block of an irreducible tridiagonal matrix is at least n −1. With this property
in mind, in infinite precision at least, we can correct rank deficiencies by only adding
a rank-1 matrix. Another result will also be useful.

Proposition 5.6 ([12, Lemma 3.2]) Let the order n irreducible tridiagonal matrix
A be singular. Then its rank is exactly n − 1 and the last row of the triangular factor
R in any Q R decomposition of A will be zero, that is πn = 0. Moreover, if this
is computed via the sequence of Givens rotations (5.95), the last element satisfies
cn−1 = 0.

The method, we call SP_Givens, proceeds as follows. Given the partitioned sys-
tem (5.111), each of the submatrices Ai,i along the diagonal is reduced to upper
triangular form using Givens transformations. This can be done, for example, by
means of Algorithm 5.7 (pargiv).
5.5 Tridiagonal Systems 151

In the course of this reduction, we obtain (in implicit form) the orthogonal matrices
Q 1 , . . . , Q p such that Q i Ai,i = Ri , where Ri is upper triangular. Applying the same
transformations on the right-hand side, the original system becomes
⎛ ⎞⎛ ⎞ ⎛ ⎞
R1 B2 x1 f˜1
⎜C 2 R2 B3 ⎟ ⎜ x2 ⎟ ⎜ f˜2 ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ .. .. .. ⎟ ⎜ . ⎟=⎜ . ⎟ (5.112)
⎝ . . . ⎠ ⎝ .. ⎠ ⎝ .. ⎠
Cp Rp xp f˜p

where

Ci = Q i Ai,i−1 , Bi+1 = Q i Ai,i+1 , f˜i = Q i f i .

We denote the transformed system by Ãx = b̃.


Recall from the presentation of pargiv that the orthogonal matrices Q i can be
written in the form

Q i = Wi LYi + Si J 

where the elements of the diagonal matrices Wi , Yi , Si , see (5.109) and (5.110)
correspond to the values obtained in the course of the upper triangularization of Ai,i
by Givens rotations. Recalling that Ai,i−1 = α(i−1)m+1,(i−1)m e1 em  and A
i,i+1 =

αim,im+1 em e1 , then Ci and Bi+1 are given by:

Ci = Q i Ai,i−1 = α(i−1)m+1,(i−1)m (Wi LYi e1 + Si J  e1 )em




= α(i−1)m+1,(i−1)m w(i) em

, where w(i) = diag(Wi ),

(i)
(note that J  e1 = 0 and e1 Yi e1 = η1 = 1 from (5.110)) and

Bi+1 = Q i Ai,i+1 = αim,im+1 (Wi LYi em + Si J  em )e1


(i) (i) (i)
= αim,im+1 (ηm ωm em + sm−1 em−1 )e1 .

(i)
From relations (5.110), and the fact that cm = 1, it holds that

(i) (i)
Bi+1 = αim,im+1 (cm−1 em + sm−1 em−1 )e1 . (5.113)

Note that the only nonzero terms of each Ci = Q i Ai,i−1 are in the last column
whereas the only nonzero elements of Bi+1 = Q i Ai,i+1 are the last two elements of
the first column.
Consider now any block row of the above partition. If all diagonal blocks Ri
are invertible, then we proceed exactly as with the Spike algorithm. Specifically, for
152 5 Banded Linear Systems

each block row, consisting of (Ci , Ri , Bi+1 ) and the right-hand side f˜i , the following
(3 p − 2) independent triangular systems can be solved simultaneously:

(1) (1) (1)


R1−1 (sm−1 em−1 + ηm ωm ), R1−1 f˜1 ,
(i) (i) (i)
Ri−1 w(i) , Ri−1 (sm−1 em−1 + ηm ωm ), Ri−1 f˜i , (i = 2, . . . , p − 1), (5.114)
−1
Rp w , ( p) R −1
p fp
˜

Note that we need to solve upper triangular systems with 2 or 3 right-hand sides
to generate the spikes as well as update the right-hand side systems.
We next consider the case when one or more of the tridiagonal blocks A_{i,i} is
singular. Then it is possible to modify Ã so that the triangular matrices above are
made invertible. In particular, since the tridiagonal matrix A is invertible, so will
be the coefficient matrix Ã in (5.112). When one or more of the submatrices
A_{i,i} is singular, then so will be the corresponding submatrices R_i and, according to
Proposition 5.6, this would manifest itself with a zero value appearing at the lower
right corner of the corresponding R_i's. To handle this situation, the algorithm applies
multiple boostings to shift away from zero the values at the corners of the triangular
blocks R_i along the diagonal. As we will see, we do not have to do this for the
last block R p . Moreover, all these boostings are independent and can be applied in
parallel. This step for blocks other than the last one is represented as a multiplication
of à with a matrix, Pboost , of the form


P_boost = I_n + \sum_{i=1}^{p−1} ζ_i e_{im+1} e_{im}^⊤,

where ζ_i = 1 if |(R_i)_{m,m}| < threshold, and ζ_i = 0 otherwise.

In particular, consider any k = 1, . . . , p − 1, then


 
e_{km}^⊤ Ã P_boost e_{km} = (R_k)_{m,m} + ζ_k e_{km}^⊤ Ã e_{km+1}
                          = (R_k)_{m,m} + ζ_k α_{(k−1)m,(k−1)m+1} c_{m−1}^{(k)}.        (5.115)

Observe that the right-hand side is nonzero since in case (Rk )m,m = 0, the term
immediately to its right (that is in position (km, km + 1)) is nonzero. In this way,
matrix à Pboost has all diagonal blocks nonsingular except possibly the last one. Call
these (modified in case of singularity, unmodified otherwise) diagonal blocks R̃i and
set the block diagonal matrix R̃ = diag[R̃_1, . . . , R̃_{p−1}, R̃_p]. If A_{p,p} is found to be
singular, then R̃ p = diag[ R̂ p , 1], where R̂ p is the leading principal submatrix of R p
of order m − 1 which will be nonsingular in case R p was found to be singular. Thus,
as constructed, R̃ is nonsingular.

It is not hard to verify that


P_boost^{−1} = I_n − \sum_{i=1}^{p−1} ζ_i e_{im+1} e_{im}^⊤.

Taking the above into account, the original system is equivalent to


Ã P_boost P_boost^{−1} x = f̃,

where the matrix à Pboost is block-tridiagonal with all its p diagonal blocks upper
triangular and invertible with the possible exception of the last one, which might
contain a zero in the last diagonal element. It is also worth noting that the above
amounts to the Spike DS factorization of a modified matrix, in particular

A Pboost = D̃ S̃ (5.116)

where D̃ = diag[ Q̃ 1 , . . . , Q̃ p ] R̃ and S̃ is the Spike matrix.


The reduced system of order 2( p − 1) or 2 p − 1 is then obtained by reordering
the matrix and numbering first the unknowns m, m + 1, 2m, 2m + 1, . . . , ( p − 1)m,
( p − 1)m + 1. The system is of order 2 p − 1 if R p is found to be singular in which
case the last component of the solution is ordered last in the reduced system. The
resulting system has the form
  
\begin{pmatrix} T & 0 \\ S & I \end{pmatrix}
\begin{pmatrix} x̂_1 \\ x̂_2 \end{pmatrix} =
\begin{pmatrix} f̂_1 \\ f̂_2 \end{pmatrix}

where T is tridiagonal and nonsingular. The reduced system T x̂1 = fˆ1 is solved
first. Note that even when its order is 2 p − 1 and the last diagonal element of T is
zero, the reduced system is invertible.
Finally, once x̂1 is computed, the remaining unknowns x̂2 are easily computed in
parallel.
We next consider the solution of the reduced system. This is essentially tridiagonal
and thus a hybrid scheme can be used: That is, apply the same methodology and use
partitioning and DS factorization producing a new reduced system, until the system
is small enough to switch to a parallel method not based on partitioning, e.g. pargiv,
and later to a serial method.
The idea of boosting was originally described in [81] as a way to reduce row
interchanges and preserve sparsity in the course of Gaussian elimination. Boosting
was used in sparse solvers including Spike; cf. [16, 82]. The mechanism consists
of adding a suitable value whenever a small diagonal element appears in the pivot
position so as to avoid instability. This is equivalent to a rank-1 modification of the
matrix. For instance, if before the first step of LU on a matrix A, element α1,1 were
found to be small enough, then A would be replaced by A + γ e_1 e_1^⊤ with γ chosen

so as to make α1,1 + γ large enough to be an acceptable pivot. If no other boosting


were required in the course of the LU decomposition, the factors computed would
satisfy LU = A + γ e_1 e_1^⊤. From these factors it is possible to recover the solution
A−1 f using the Sherman-Morrison-Woodbury formula.
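To make the mechanism concrete, the following is a minimal serial sketch, in Python with NumPy/SciPy (used here only for illustration; the function name solve_with_boost and its interface are ours), of how A^{-1} f can be recovered from the LU factors of a boosted matrix A + γ e_1 e_1^⊤ via the Sherman-Morrison formula. A single boost at the (1,1) pivot is assumed.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def solve_with_boost(A, f, gamma):
    # Sketch only: LU of the boosted matrix A + gamma*e1*e1^T, followed by
    # recovery of A^{-1} f through the Sherman-Morrison formula.
    n = A.shape[0]
    e1 = np.zeros(n)
    e1[0] = 1.0
    lu, piv = lu_factor(A + gamma * np.outer(e1, e1))
    y = lu_solve((lu, piv), f)    # (A + gamma*e1*e1^T)^{-1} f
    z = lu_solve((lu, piv), e1)   # (A + gamma*e1*e1^T)^{-1} e1
    # Sherman-Morrison: A^{-1} f = y + gamma * z * (e1^T y) / (1 - gamma * e1^T z)
    return y + gamma * z * y[0] / (1.0 - gamma * z[0])

In the tridiagonal solver discussed here this recovery step is avoided altogether since, as explained next, the boost is folded into a right preconditioner built from the spike columns.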
In our case, boosting is applied on diagonal blocks Ri , of Ã, that contain a zero
(in practice, a value below some tolerance threshold) at their corner position. The
boosting is based on the spike columns of Ã, thus avoiding the need for solving
auxiliary systems in the Sherman-Morrison-Woodbury formula. Therefore, boosting
can be expressed as the product


Ã \prod_{\{i : |(R_i)_{m,m}| ≈ 0\}} (I + γ_i e_{im+1} e_{im}^⊤)

which amounts to right preconditioning with a low rank modification of the identity
and this as well as the inverse transformation can be applied without solving auxiliary
systems.
We note that the method is appropriate when the partitioning results in blocks
along the diagonal whose singularities are revealed by the Givens QR factorization.
In this case, the blocks can be rendered nonsingular by rank-1 modifications. It will
not be effective, however, if some blocks are almost, but not exactly singular, and
QR is not rank revealing without extra precautions, such as pivoting. See also [83]
for more details regarding this algorithm and its implementation on GPUs.
Another partition-based method that is also applicable to general banded matrices
and does not necessitate that the diagonal submatrices are invertible is proposed
in [80]. The algorithm uses row and possibly column pivoting when factoring the
diagonal blocks so as to prevent loss of stability. The linear system is partitioned as
in [5], that is differently from the Spike.
Factorization based on block diagonal pivoting without interchanges
If the tridiagonal matrix is symmetric and all subsystems Ai,i are nonsingular, then
it is possible to avoid interchanges by computing an L B L^⊤ factorization, where L
is unit lower triangular and B is block-diagonal, with 1 × 1 and 2 × 2 diagonal
blocks. This method avoids the loss of symmetry caused by partial pivoting when
the matrix is indefinite; cf. [84, 85] as well as [45]. This strategy can be extended
for nonsymmetric systems computing an L B M^⊤ factorization with M also unit
lower triangular; cf. [86, 87]. These methods become attractive in the context of
high performance systems where interchanges can be detrimental to performance. In
[33] diagonal pivoting of this type was combined with Spike as well as “on the fly”
recursive Spike (cf. Sect. 5.2.2 of this chapter) to solve large tridiagonal systems on
GPUs, clusters of GPUs and systems consisting of CPUs and GPUs. We next briefly
describe the aforementioned L B M^⊤ factorization.
The first step of the algorithm illustrates the basic idea. The crucial observation is
that if a tridiagonal matrix, say A, is nonsingular, then it must hold that either α1,1 or
the determinant of its 2 × 2 leading principal submatrix, that is α1,1 α2,2 − α1,2 α2,1
must be nonzero. Therefore, if we partition the matrix as

A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix},

where A_{1,1} is d × d; that is, for d = 1 we take A_{1,1} = α_{1,1}, whereas for d = 2 it is

\begin{pmatrix} α_{1,1} & α_{1,2} \\ α_{2,1} & α_{2,2} \end{pmatrix}.

Then the following factorization will be valid for either d = 1 or d = 2:



A = \begin{pmatrix} I_d & 0 \\ A_{2,1} A_{1,1}^{-1} & I_{n-d} \end{pmatrix}
    \begin{pmatrix} A_{1,1} & 0 \\ 0 & A_{2,2} - A_{2,1} A_{1,1}^{-1} A_{1,2} \end{pmatrix}
    \begin{pmatrix} I_d & A_{1,1}^{-1} A_{1,2} \\ 0 & I_{n-d} \end{pmatrix}.        (5.117)

Moreover, the corresponding Schur complement, S_d = A_{2,2} − A_{2,1} A_{1,1}^{-1} A_{1,2}, is tridiagonal
and nonsingular, and so the same strategy can be applied recursively until the
final factorization is computed. In finite precision, instead of testing if any of these
values are exactly 0, a different approach is used. In this, the root μ = (√5 − 1)/2 ≈ 0.62
of the quadratic equation μ^2 + μ − 1 = 0 plays an important role. Two pivoting
strategies proposed in the literature are as follows. If the largest element, say τ , of
A is known, then select d = 1 if

|α1,1 |τ ≥ μ|α2,1 α1,2 |. (5.118)

Otherwise, d = 2. In both cases, the factorization is updated as in (5.117). Another


strategy that is based on information from only the adjacent equations, is to set

τ = max{|α2,2 |, |α1,2 |, |α2,1 |, |α2,3 |, |α3,2 |}.

and apply the same strategy, (5.118). From the error analysis conducted in [86], it is
claimed that the method is backward stable.
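As an illustration, here is a small serial sketch (Python/NumPy; the function name, the array conventions, and the default choice of τ are ours) of the pivot-size selection for a tridiagonal matrix with main diagonal b, subdiagonal a and superdiagonal c, using criterion (5.118) with μ = (√5 − 1)/2; the diagonal is updated with the Schur-complement contribution of (5.117) as the elimination proceeds. No safeguards against breakdown are included.

import numpy as np

def diag_pivot_sizes(a, b, c, tau=None):
    # Returns the sequence of 1x1 / 2x2 pivot sizes chosen by criterion (5.118).
    # b: main diagonal, a: subdiagonal (A[k+1,k]), c: superdiagonal (A[k,k+1]).
    mu = (np.sqrt(5.0) - 1.0) / 2.0
    b = np.array(b, dtype=float)
    n = len(b)
    if tau is None:                      # largest element of A (first strategy)
        tau = max(np.abs(b).max(), np.abs(a).max(), np.abs(c).max())
    sizes, k = [], 0
    while k < n:
        if k == n - 1 or abs(b[k]) * tau >= mu * abs(a[k] * c[k]):
            sizes.append(1)                              # 1x1 pivot b[k]
            if k + 1 < n:
                b[k+1] -= a[k] * c[k] / b[k]             # Schur update, cf. (5.117)
            k += 1
        else:
            sizes.append(2)                              # 2x2 pivot
            det = b[k]*b[k+1] - a[k]*c[k]
            if k + 2 < n:
                b[k+2] -= a[k+1] * c[k+1] * b[k] / det   # Schur update
            k += 2
    return sizes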

5.5.7 Using Determinants and Other Special Forms

It is one of the basic tenets of numerical linear algebra that determinants and Cramer’s
rule are not to be used for solving general linear systems. When the matrix is tridi-
agonal, however, determinantal equalities can be used to show that the inverse has
a special form. This can be used to design methods for computing elements of the
inverse, and for solving linear systems. These methods are particularly useful when
we want to compute one or few selected elements of the inverse or the solution,
something that is not possible with standard direct methods. In fact, there are several
applications in science and engineering that require such selective computations, e.g.
see some of the examples in [88].
156 5 Banded Linear Systems

It has long been known, for example, that the inverse of symmetric irreducible
tridiagonal matrices, has special structure; reference [89] provides several remarks
on the history of this topic, and the relevant discovery made in [90]. An algorithm
for inverting such matrices was presented in [91]. Inspired by that work, a paral-
lel algorithm for computing selected elements of the solution or the inverse that is
applicable for general tridiagonal matrices was introduced in [67]. A similar algo-
rithm for solving tridiagonal linear systems was presented in [92]. We briefly outline
the basic idea of these methods.
For the purpose of this discussion we recall the notation A = [αi,i−1 , αi,i , αi,i+1 ].
As before we assume that it is nonsingular and irreducible. Let

A−1 = [τi, j ] i, j = 1, 2, . . . , n,

then
τ_{i,j} = (−1)^{i+j} det(A(j,i))/det(A)

where A( j, i) is that matrix of order (n − 1) obtained from A by deleting the ith


column and jth row. We next use the notation Ai−1 for the leading principal tridiag-
onal submatrix of A and à j+1 for the trailing tridiagonal submatrix of order n − j.
Assuming that we pick i ≤ j, we can write
A(j,i) = \begin{pmatrix} A_{i-1} & & \\ Z_1 & S_{j-i} & \\ & Z_2 & Ã_{j+1} \end{pmatrix}

where Ai−1 and à j+1 are tridiagonal of order i −1 and n− j respectively, S j−i is of or-
der j −i and is lower triangular with diagonal elements αi,i+1 , αi+1,i+2 , . . . , α j−1, j .
Moreover, Z_1 is of size (j − i) × (i − 1) with first row α_{i,i−1} (e_{i−1}^{(i−1)})^⊤ and Z_2 of
order (n − j) × (j − i) with first row α_{j+1,j} (e_{j−i}^{(j−i)})^⊤. Hence, for i ≤ j,

τ_{i,j} = (−1)^{i+j} \frac{det(A_{i−1}) \, det(Ã_{j+1}) \prod_{k=i}^{j−1} α_{k,k+1}}{det(A)}.

Rearranging the terms in the above expression we obtain

τ_{i,j} = \frac{(−1)^i det(A_{i−1})}{\prod_{k=2}^{i} α_{k,k−1}} \cdot \frac{(−1)^j det(Ã_{j+1})}{\prod_{k=j}^{n−1} α_{k,k+1}} \cdot \frac{\prod_{k=i}^{n−1} α_{k,k+1} \prod_{k=2}^{i} α_{k,k−1}}{det(A)}

or τ_{i,j} = υ_i ν_j ω_i. Similarly, τ_{i,j} = υ_j ν_i ω_i for i ≥ j. Therefore

τ_{i,j} = \begin{cases} υ_i ν_j ω_i & \text{if } i ≤ j \\ υ_j ν_i ω_i & \text{if } i ≥ j \end{cases}

This shows that the lower (resp. upper) triangular part of the inverse of an irreducible
tridiagonal matrix is the lower (resp. upper) triangular part of a rank-1 matrix. The
respective rank-1 matrices are generated by the vectors u = (υ_1, . . . , υ_n)^⊤, v =
(ν_1, . . . , ν_n)^⊤ and w = (ω_1, . . . , ω_n)^⊤. Moreover, to compute one element of the
inverse we only need a single element from each of u, v and w. Obviously, if the
vectors u, v and w are known, then we can compute any of the elements of the inverse
and with little extra work, any element of the solution. Fortunately, these vectors can
be computed independently from the solution of three lower triangular systems of
very small bandwidth and very special right-hand sides, using, for example, the
algorithm of Theorem 3.2. Let us see how to do this.
Let dk = det(Ak ). Observing that in block form

A_k = \begin{pmatrix} A_{k−1} & α_{k−1,k} \, e_{k−1}^{(k−1)} \\ α_{k,k−1} \, (e_{k−1}^{(k−1)})^⊤ & α_{k,k} \end{pmatrix},

expanding along the last row, we then obtain

dk = αk,k dk−1 − αk,k−1 αk−1,k dk−2 k = 2, 3, . . . , n,

where d_0 = 1 and d_1 = α_{1,1}. Now, υ_i can be written as

υ_i = (−1)^i \left( \frac{α_{i−1,i−1} \, d_{i−2}}{α_{i,i−1} \prod_{k=2}^{i−1} α_{k,k−1}} − \frac{α_{i−2,i−1} \, d_{i−3}}{α_{i,i−1} \prod_{k=2}^{i−2} α_{k,k−1}} \right).

Thus,

υ_i = −α̂_{i−1} υ_{i−1} − β̂_{i−2} υ_{i−2},   i = 2, 3, . . . , n

in which
α̂i = αi,i /αi+1,i
β̂i = αi,i+1 /αi+2,i+1
υ0 = 0, υ1 = −1.

This is equivalent to the unit lower triangular system

L_1 u = e_1^{(n)}        (5.119)

where

L_1 = \begin{pmatrix}
1 & & & & \\
α̂_1 & 1 & & & \\
β̂_1 & α̂_2 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & β̂_{n−2} & α̂_{n−1} & 1
\end{pmatrix}.

We next follow a similar approach considering instead the trailing tridiagonal sub-
matrix

Ã_k = \begin{pmatrix} α_{k,k} & α_{k,k+1} (e_1^{(n−k)})^⊤ \\ α_{k+1,k} \, e_1^{(n−k)} & Ã_{k+1} \end{pmatrix}.

Expanding along the first row we obtain

d̂k = αk,k d̂k+1 − αk,k+1 αk+1,k d̂k+2 k = n − 1, . . . , 3, 2

in which d̂_i = det(Ã_i), d̂_{n+1} = 1 and d̂_n = α_{n,n}. Now, ν_j is given by

ν j = −γ̂ j+1 ν j+1 − δ̂ j+2 ν j+2 j = n − 1, . . . , 1, 0

where,
γ̂ j = α j, j /α j−1, j ,
δ̂ j = α j, j−1 /α j−2, j−1 , and
ν_{n+1} = 0, ν_n = (−1)^n.

Again, this may be expressed as the triangular system of order (n + 1),

L_2 v = (−1)^n e_1^{(n+1)}        (5.120)

where

L_2 = \begin{pmatrix}
1 & & & & \\
γ̂_n & 1 & & & \\
δ̂_n & γ̂_{n−1} & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & δ̂_2 & γ̂_1 & 1
\end{pmatrix}.

Finally, from

ω_i = \Bigl( \prod_{k=i}^{n−1} α_{k,k+1} \Bigr) \Bigl( \prod_{k=2}^{i} α_{k,k−1} \Bigr) \Big/ det(T)

we have
ωk+1 = θk+1 ωk k = 1, 2, . . . , n − 1

where,
θi = αi,i−1 /αi,i+1
ω_1 = (α_{1,2} α_{2,3} · · · α_{n−1,n})/det(T) = 1/ν_0.

In other words, w is the solution of the triangular system

L_3 w = (1/ν_0) e_1^{(n)}        (5.121)

where

L_3 = \begin{pmatrix}
1 & & & \\
−θ_2 & 1 & & \\
 & \ddots & \ddots & \\
 & & −θ_n & 1
\end{pmatrix}.

Observe that the systems

L_1 u = e_1^{(n)},   L_2 v = (−1)^n e_1^{(n+1)},   and   L_3 w̃ = e_1^{(n)},

where w̃ = ν0 w, can be solved independently for u, v, and w̃ using the parallel


algorithms of this chapter from which we obtain w = w̃/ν0 . One needs to be careful,
however, because as in other cases where parallel triangular solvers that require the
least number of parallel arithmetic operations are used, there is the danger of under-
flow, overflow or inaccuracies in the computed results; cf. the relevant discussions
in Sect. 3.2.3.
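The following serial sketch (Python/NumPy; the function name and the 1-based interface are ours) computes a selected entry τ_{i,j}, i ≤ j, directly from the determinant recurrences for d_k and d̂_k derived above, rather than from the three triangular systems; like the parallel triangular solvers just mentioned, it takes no precautions against overflow or underflow in the products.

import numpy as np

def tridiag_inverse_entry(a, b, c, i, j):
    # Entry tau_{i,j} (1-based indices, i <= j) of the inverse of the irreducible
    # tridiagonal matrix with main diagonal b, subdiagonal a (A[k+1,k]) and
    # superdiagonal c (A[k,k+1]).
    n = len(b)
    d = np.zeros(n + 1)                 # d[k] = det(A_k), with d[0] = 1
    d[0], d[1] = 1.0, b[0]
    for k in range(2, n + 1):
        d[k] = b[k-1] * d[k-1] - a[k-2] * c[k-2] * d[k-2]
    dh = np.zeros(n + 2)                # dh[k] = det(A~_k), with dh[n+1] = 1
    dh[n + 1], dh[n] = 1.0, b[n-1]
    for k in range(n - 1, 0, -1):
        dh[k] = b[k-1] * dh[k+1] - c[k-1] * a[k-1] * dh[k+2]
    prod_c = np.prod(c[i-1:j-1])        # prod_{k=i}^{j-1} alpha_{k,k+1}
    return (-1)**(i + j) * d[i-1] * dh[j+1] * prod_c / d[n]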
Remarks
The formulas for the inverse of a symmetric tridiagonal matrix with nonzero ele-
ments [91] were also extended independently to the nonsymmetric tridiagonal case
in [93, Theorem 2]. Subsequently, there have been many investigations concerning
the form of the inverse of tridiagonal and more general banded matrices. For exten-
sive discussion of these topics see [94] as well as [89] which is a seminal monograph
in the area of structured matrices.

References

1. Arbenz, P., Hegland, M.: On the stable parallel solution of general narrow banded linear systems.
High Perform. Algorithms Struct. Matrix Probl. 47–73 (1998)
2. Arbenz, P., Cleary, A., Dongarra, J., Hegland, M.: A comparison of parallel solvers for general
narrow banded linear systems. Parallel Distrib. Comput. Pract. 2(4), 385–400 (1999)
3. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). URL http://www.netlib.org/scalapack
4. Conroy, J.: Parallel algorithms for the solution of narrow banded systems. Appl. Numer. Math.
5, 409–421 (1989)
5. Dongarra, J., Johnsson, L.: Solving banded systems on a parallel processor. Parallel Comput.
5(1–2), 219–246 (1987)

6. George, A.: Numerical experiments using dissection methods to solve n by n grid problems.
SIAM J. Numer. Anal. 14, 161–179 (1977)
7. Golub, G., Sameh, A., Sarin, V.: A parallel balance scheme for banded linear systems. Numer.
Linear Algebra Appl. 8, 297–316 (2001)
8. Johnsson, S.: Solving narrow banded systems on ensemble architectures. ACM Trans. Math.
Softw. 11, 271–288 (1985)
9. Meier, U.: A parallel partition method for solving banded systems of linear equations. Parallel
Comput. 2, 33–43 (1985)
10. Tang, W.: Generalized Schwarz splittings. SIAM J. Sci. Stat. Comput. 13, 573–595 (1992)
11. Wright, S.: Parallel algorithms for banded linear systems. SIAM J. Sci. Stat. Comput. 12,
824–842 (1991)
12. Sameh, A., Kuck, D.: On stable parallel linear system solvers. J. Assoc. Comput. Mach. 25(1),
81–91 (1978)
13. Dongarra, J.J., Sameh, A.: On some parallel banded system solvers. Technical Report
ANL/MCS-TM-27, Mathematics Computer Science Division at Argonne National Labora-
tory (1984)
14. Gallivan, K., Gallopoulos, E., Sameh, A.: CEDAR—an experiment in parallel computing.
Comput. Math. Appl. 1(1), 77–98 (1994)
15. Lawrie, D.H., Sameh, A.: The computation and communication complexity of a parallel banded
system solver. ACM TOMS 10(2), 185–195 (1984)
16. Polizzi, E., Sameh, A.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel
Comput. 32, 177–194 (2006)
17. Polizzi, E., Sameh, A.: SPIKE: a parallel environment for solving banded linear systems.
Comput. Fluids 36, 113–120 (2007)
18. Sameh, A., Kuck, D.: A parallel QR algorithm for symmetric tridiagonal matrices. IEEE Trans.
Comput. 26(2), 147–153 (1977)
19. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
20. Demko, S., Moss, W., Smith, P.: Decay rates for inverses of band matrices. Math. Comput.
43(168), 491–499 (1984)
21. Björck, Å.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
22. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins. University Press,
Baltimore (2013)
23. Davis, T.: Algorithm 915, SuiteSparseQR: multifrontal multithreaded rank-revealing sparse
QR factorization. ACM Trans. Math. Softw. 38(1), 8:1–8:22 (2011). doi:10.1145/2049662.
2049670, URL http://doi.acm.org/10.1145/2049662.2049670
24. Lou, G.: Parallel methods for solving linear systems via overlapping decompositions. Ph.D.
thesis, University of Illinois at Urbana-Champaign (1989)
25. Naumov, M., Sameh, A.: A tearing-based hybrid parallel banded linear system solver. J. Com-
put. Appl. Math. 226, 306–318 (2009)
26. Benzi, M., Golub, G., Liesen, J.: Numerical solution of saddle-point problems. Acta Numer.
1–137 (2005)
27. Hockney, R., Jesshope, C.: Parallel Computers. Adam Hilger (1983)
28. Ortega, J.M.: Introduction to Parallel and Vector Solution of Linear Systems. Plenum Press,
New York (1988)
29. Golub, G., Ortega, J.: Scientific Computing: An Introduction with Parallel Computing. Acad-
emic Press Inc., San Diego (1993)
30. Davidson, A., Zhang, Y., Owens, J.: An auto-tuned method for solving large tridiagonal systems
on the GPU. In: Proceedings of IEEE IPDPS, pp. 956–965 (2011)
31. Lopez, J., Zapata, E.: Unified architecture for divide and conquer based tridiagonal system
solvers. IEEE Trans. Comput. 43(12), 1413–1425 (1994). doi:10.1109/12.338101
32. Santos, E.: Optimal and efficient parallel tridiagonal solvers using direct methods. J. Super-
comput. 30(2), 97–115 (2004). doi:10.1023/B:SUPE.0000040615.60545.c6, URL http://dx.
doi.org/10.1023/B:SUPE.0000040615.60545.c6

33. Chang, L.W., Stratton, J., Kim, H., Hwu, W.M.: A scalable, numerically stable, high-
performance tridiagonal solver using GPUs. In: Proceedings International Conference High
Performance Computing, Networking Storage and Analysis, SC’12, pp. 27:1–27:11. IEEE
Computer Society Press, Los Alamitos (2012). URL http://dl.acm.org/citation.cfm?id=
2388996.2389033
34. Goeddeke, D., Strzodka, R.: Cyclic reduction tridiagonal solvers on GPUs applied to mixed-
precision multigrid. IEEE Trans. Parallel Distrib. Syst. 22(1), 22–32 (2011)
35. Codenotti, B., Leoncini, M.: Parallel Complexity of Linear System Solution. World Scientific,
Singapore (1991)
36. Ascher, U., Mattheij, R., Russell, R.: Numerical Solution of Boundary Value Problems for
Ordinary Differential Equations. Classics in Applied Mathematics. SIAM, Philadelphia (1995)
37. Isaacson, E., Keller, H.B.: Analysis of Numerical Methods. Wiley, New York (1966)
38. Keller, H.B.: Numerical Methods for Two-Point Boundary-Value Problems. Dover Publica-
tions, New York (1992)
39. Bank, R.E.: Marching algorithms and block Gaussian elimination. In: Bunch, J.R., Rose, D.
(eds.) Sparse Matrix Computations, pp. 293–307. Academic Press, New York (1976)
40. Bank, R.E., Rose, D.: Marching algorithms for elliptic boundary value problems. I: the constant
coefficient case. SIAM J. Numer. Anal. 14(5), 792–829 (1977)
41. Roache, P.: Elliptic Marching Methods and Domain Decomposition. CRC Press Inc., Boca
Raton (1995)
42. Richardson, L.F.: Weather Prediction by Numerical Process. Cambridge University Press.
Reprinted by Dover Publications, 1965 (1922)
43. Arbenz, P., Hegland, M.: The stable parallel solution of narrow banded linear systems. In: Heath,
M., et al. (eds.) Proceedings of Eighth SIAM Conference Parallel Processing and Scientific
Computing SIAM, Philadelphia (1997)
44. Bank, R.E., Rose, D.: Marching algorithms for elliptic boundary value problems. II: the variable
coefficient case. SIAM J. Numer. Anal. 14(5), 950–969 (1977)
45. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
46. Higham, N.: Stability of parallel triangular system solvers. SIAM J. Sci. Comput. 16(2), 400–
413 (1995)
47. Viswanath, D., Trefethen, L.: Condition numbers of random triangular matrices. SIAM J.
Matrix Anal. Appl. 19(2), 564–581 (1998)
48. Hockney, R.: A fast direct solution of Poisson’s equation using Fourier analysis. J. Assoc.
Comput. Mach. 12, 95–113 (1965)
49. Gander, W., Golub, G.H.: Cyclic reduction: history and applications. In: Luk, F., Plemmons, R.
(eds.) Proceedings of the Workshop on Scientific Computing, pp. 73–85. Springer, New York
(1997). URL http://people.inf.ethz.ch/gander/papers/cyclic.pdf
50. Amodio, P., Brugnano, L.: Parallel factorizations and parallel solvers for tridiagonal linear
systems. Linear Algebra Appl. 172, 347–364 (1992). doi:10.1016/0024-3795(92)90034-8,
URL http://www.sciencedirect.com/science/article/pii/0024379592900348
51. Heller, D.: Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems.
SIAM J. Numer. Anal. 13(4), 484–496 (1976)
52. Lambiotte Jr, J., Voigt, R.: The solution of tridiagonal linear systems on the CDC STAR 100
computer. ACM Trans. Math. Softw. 1(4), 308–329 (1975). doi:10.1145/355656.355658, URL
http://doi.acm.org/10.1145/355656.355658
53. Nassimi, D., Sahni, S.: An optimal routing algorithm for mesh-connected parallel computers.
J. Assoc. Comput. Mach. 27(1), 6–29 (1980)
54. Nassimi, D., Sahni, S.: Parallel permutation and sorting algorithms and a new generalized
connection network. J. Assoc. Comput. Mach. 29(3), 642–667 (1982)
55. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. 10(2),
345–363 (1973). URL http://www.jstor.org/stable/2156361
56. Amodio, P., Brugnano, L., Politi, T.: Parallel factorization for tridiagonal matrices. SIAM J.
Numer. Anal. 30(3), 813–823 (1993)

57. Johnsson, S.: Solving tridiagonal systems on ensemble architectures. SIAM J. Sci. Stat. Com-
put. 8, 354–392 (1987)
58. Zhang, Y., Cohen, J., Owens, J.: Fast tridiagonal solvers on the GPU. ACM SIGPLAN Not.
45(5), 127–136 (2010)
59. Amodio, P., Mazzia, F.: Backward error analysis of cyclic reduction for the solution of tridi-
agonal systems. Math. Comput. 62(206), 601–617 (1994)
60. Higham, N.: Bounding the error in Gaussian elimination for tridiagonal systems. SIAM J.
Matrix Anal. Appl. 11(4), 521–530 (1990)
61. Zhang, Y., Owens, J.: A quantitative performance analysis model for GPU architectures. In:
Proceedings of the 17th IEEE International Symposium on High-Performance Computer Ar-
chitecture (HPCA 17) (2011)
62. El-Mikkawy, M., Sogabe, T.: A new family of k-Fibonacci numbers. Appl. Math. Com-
put. 215(12), 4456–4461 (2010). URL http://www.sciencedirect.com/science/article/pii/
S009630031000007X
63. Fang, H.R., O’Leary, D.: Stable factorizations of symmetric tridiagonal and triadic matrices.
SIAM J. Math. Anal. Appl. 28(2), 576–595 (2006)
64. Mikkelsen, C., Kågström, B.: Parallel solution of narrow banded diagonally dominant linear
systems. In: Jónasson, L. (ed.) PARA 2010. LNCS, vol. 7134, pp. 280–290. Springer (2012).
doi:10.1007/978-3-642-28145-7_28, URL http://dx.doi.org/10.1007/978-3-642-28145-7_28
65. Mikkelsen, C., Kågström, B.: Approximate incomplete cyclic reduction for systems which are
tridiagonal and strictly diagonally dominant by rows. In: Manninen, P., Öster, P. (eds.) PARA
2012. LNCS, vol. 7782, pp. 250–264. Springer (2013). doi:10.1007/978-3-642-36803-5_18,
URL http://dx.doi.org/10.1007/978-3-642-36803-5_18
66. Bini, D., Meini, B.: The cyclic reduction algorithm: from Poisson equation to stochastic
processes and beyond. Numer. Algorithms 51(1), 23–60 (2008). doi:10.1007/s11075-008-
9253-0, URL http://www.springerlink.com/index/10.1007/s11075-008-9253-0; http://www.
springerlink.com/content/m40t072h273w8841/fulltext.pdf
67. Sameh, A.: Numerical parallel algorithms—a survey. In: Kuck, D., Lawrie, D., Sameh, A.
(eds.) High Speed Computer and Algorithm Organization, pp. 207–228. Academic Press, San
Diego (1977)
68. Mathias, R.: The instability of parallel prefix matrix multiplication. SIAM J. Sci. Comput.
16(4) (1995), to appear
69. Eğecioğlu, O., Koç, C., Laub, A.: A recursive doubling algorithm for solution of tridiagonal
systems on hypercube multiprocessors. J. Comput. Appl. Math. 27, 95–108 (1989)
70. Dubois, P., Rodrigue, G.: An analysis of the recursive doubling algorithm. In: Kuck, D., Lawrie,
D., Sameh, A. (eds.) High Speed Computer and Algorithm Organization, pp. 299–305. Acad-
emic Press, San Diego (1977)
71. Hammarling, S.: A survey of numerical aspects of plane rotations. Report Maths. 1, Middlesex
Polytechnic (1977). URL http://eprints.ma.man.ac.uk/1122/. Available as Manchester Institute
for Mathematical Sciences MIMS EPrint 2008.69
72. Bar-On, I., Codenotti, B.: A fast and stable parallel QR algorithm for symmetric tridiagonal
matrices. Linear Algebra Appl. 220, 63–95 (1995). doi:10.1016/0024-3795(93)00360-C, URL
http://www.sciencedirect.com/science/article/pii/002437959300360C
73. Gill, P.E., Golub, G., Murray, W., Saunders, M.: Methods for modifying matrix factorizations.
Math. Comput. 28, 505–535 (1974)
74. Lakshmivarahan, S., Dhall, S.: Parallelism in the Prefix Problem. Oxford University Press,
New York (1994)
75. Cleary, A., Dongarra, J.: Implementation in ScaLAPACK of divide and conquer algorithms
for banded and tridiagonal linear systems. Technical Report UT-CS-97-358, University of
Tennessee Computer Science Technical Report (1997)
76. Bar-On, I., Codenotti, B., Leoncini, M.: Checking robust nonsingularity of tridiagonal matrices
in linear time. BIT Numer. Math. 36(2), 206–220 (1996). doi:10.1007/BF01731979, URL
http://dx.doi.org/10.1007/BF01731979

77. Bar-On, I.: Checking non-singularity of tridiagonal matrices. Electron. J. Linear Algebra 6,
11–19 (1999). URL http://math.technion.ac.il/iic/ela
78. Bondeli, S.: Divide and conquer: a parallel algorithm for the solution of a tridiagonal system
of equations. Parallel Comput. 17, 419–434 (1991)
79. Wang, H.: A parallel method for tridiagonal equations. ACM Trans. Math. Softw. 7, 170–183
(1981)
80. Wright, S.: Parallel algorithms for banded linear systems. SIAM J. Sci. Stat. Comput. 12(4),
824–842 (1991)
81. Stewart, G.: Modifying pivot elements in Gaussian elimination. Math. Comput. 28(126), 537–
542 (1974)
82. Li, X., Demmel, J.: SuperLU-DIST: A scalable distributed-memory sparse direct solver for
unsymmetric linear systems. ACM TOMS 29(2), 110–140 (2003). URL http://doi.acm.org/10.
1145/779359.779361
83. Venetis, I.E., Kouris, A., Sobczyk, A., Gallopoulos, E., Sameh, A.: A direct tridiagonal solver
based on Givens rotations for GPU-based architectures. Technical Report HPCLAB-SCG-
06/11-14, CEID, University of Patras (2014)
84. Bunch, J.: Partial pivoting strategies for symmetric matrices. SIAM J. Numer. Anal. 11(3),
521–528 (1974)
85. Bunch, J., Kaufman, K.: Some stable methods for calculating inertia and solving symmetric
linear systems. Math. Comput. 31, 162–179 (1977)
86. Erway, J., Marcia, R.: A backward stability analysis of diagonal pivoting methods for solv-
ing unsymmetric tridiagonal systems without interchanges. Numer. Linear Algebra Appl. 18,
41–54 (2011). doi:10.1002/nla.674, URL http://dx.doi.org/10.1002/nla.674
87. Erway, J.B., Marcia, R.F., Tyson, J.: Generalized diagonal pivoting methods for tridiagonal
systems without interchanges. IAENG Int. J. Appl. Math. 4(40), 269–275 (2010)
88. Golub, G.H., Meurant, G.: Matrices, Moments and Quadrature with Applications. Princeton
University Press, Princeton (2009)
89. Vandebril, R., Van Barel, M., Mastronardi, N.: Matrix Computations and Semiseparable
Matrices. Volume I: Linear Systems. Johns Hopkins University Press (2008)
90. Gantmacher, F., Krein, M.: Sur les matrices oscillatoires et complèments non négatives. Com-
position Mathematica 4, 445–476 (1937)
91. Bukhberger, B., Emelyneko, G.: Methods of inverting tridiagonal matrices. USSR Comput.
Math. Math. Phys. 13, 10–20 (1973)
92. Swarztrauber, P.N.: A parallel algorithm for solving general tridiagonal equations. Math. Com-
put. 33, 185–199 (1979)
93. Yamamoto, T., Ikebe, Y.: Inversion of band matrices. Linear Algebra Appl. 24, 105–111 (1979).
doi:10.1016/0024-3795(79)90151-4, URL http://www.sciencedirect.com/science/article/pii/
0024379579901514
94. Strang, G., Nguyen, T.: The interplay of ranks of submatrices. SIAM Rev. 46(4), 637–646
(2004). URL http://www.jstor.org/stable/20453569
Chapter 6
Special Linear Systems

One key idea when attempting to build algorithms for large scale matrix problems
is to detect if the matrix has special properties, possibly due to the characteristics
of the application, that could be taken into account in order to design faster solution
methods. This possibility was highlighted early on by Turing himself, when he noted
in his report for the Automatic Computing Engine (ACE) that even though with the
storage capacities available at that time it would be hard to store and handle systems
larger than 50 × 50,
... the majority of problems have very degenerate matrices and we do not need to store
anything like as much (...) the coefficients in these equations are very systematic and mostly
zero. [1].

The special systems discussed in this chapter encompass those that Turing char-
acterized as “degenerate” in that they can be represented and stored much more
economically than general matrices as their entries are systematic (in ways that will
be made precise later), and frequently, most are zero. Because the matrices can be
represented with fewer parameters, they are also termed structured [2] or data sparse.
In this chapter, we are concerned with the solution of linear systems with methods
that are designed to exploit the matrix structure. In particular, we show the opportuni-
ties for parallel processing when solving linear systems with Vandermonde matrices,
banded Toeplitz matrices, a class of matrices that are called SAS-decomposable, and
special matrices that arise when solving elliptic partial differential equations which
are amenable to the application of fast direct methods, commonly referred to as rapid
elliptic solvers (RES).
Observe that to some degree, getting high speedup and efficiency out of parallel
algorithms for matrices with special structure is more challenging than for general
ones since the gains are measured vis-a-vis serial solvers of reduced complexity.
It is also worth noting that in some cases, the matrix structure is not known a
priori or is hidden and it becomes necessary to convert the matrix into a special
representation permitting the construction of fast algorithms; see for example [3–5].
This can be a delicate task because arithmetic and data representation are in finite
precision.

Another type of structure that is present in the Vandermonde and Toeplitz matrices
is that they have small displacement rank, a property introduced in [6] to characterize
matrices for which it is possible to construct low complexity algorithms; cf. [7].
What this means is that if A is the matrix under consideration, there exist lower
triangular matrices P, Q such that the rank of either A − PAQ or PA − AQ (called
displacements) is small. A similar notion, of block displacement, exists for block
matrices. For this reason, such matrices are also characterized as “low displacement”.
Finally note that even if a matrix is only approximately but not exactly structured,
that is, it can be expressed as A = S + E, where S is structured and E is nonzero but
small in some sense (e.g. has small rank or small norm), this can be valuable because
then the corresponding structured matrix S could be an effective preconditioner in
an iterative scheme.
A detailed treatment of structured matrices (in particular structured rank matrices,
that is matrices for which any submatrix that lies entirely below or above the main
diagonal has rank that is bounded above by some fixed value smaller than its size)
can be found in [8]. See also [9, 10] regarding data sparse matrices.

6.1 Vandermonde Solvers

We recall that Vandermonde matrices are determined from one vector, say x =
(ξ_1, . . . , ξ_n)^⊤, as

V_m(x) = \begin{pmatrix}
1 & 1 & \cdots & 1 \\
ξ_1 & ξ_2 & \cdots & ξ_n \\
\vdots & \vdots & & \vdots \\
ξ_1^{m−1} & ξ_2^{m−1} & \cdots & ξ_n^{m−1}
\end{pmatrix},

where m indicates the number of rows and the number of columns is the size of x.
When the underlying vector or the row dimension are implied by the context, the
symbols are omitted.
If V (x) is a square Vandermonde matrix of order n and Q = diag(ξ1 , . . . , ξn ),
then rank(V (x) − JV(x)Q) = 1. It follows that Vandermonde matrices have small
displacement rank (equal to 1). We are interested in the following problems for any
given nonsingular Vandermonde matrix V and vector of compatible size b.
1. Compute the inverse V −1 .
2. Solve the primal Vandermonde system V a = b.
3. Solve the dual Vandermonde system V  a = b.
The inversion of Vandermonde matrices (as described in [11]) and the solution
of Vandermonde systems (using algorithms in [12] or via inversion and multipli-
cation as proposed in [13]) can be accomplished with fast and practical algorithms
that require only O(n 2 ) arithmetic operations rather than the O(n 3 ) predicted by

(structure-oblivious) Gaussian elimination. See also [14, 15] and historical remarks
therein. Key to many algorithms is a fundamental result from the theory of poly-
nomial interpolation, namely that given n + 1 interpolation (node, value)-pairs,
{(ξk , βk )}k=0:n , where the ξk are all distinct, there exists a unique polynomial of
degree at most n, say p_n, that satisfies p_n(ξ_k) = β_k for k = 0, . . . , n. Writing
the polynomial in power form, p_n(ξ) = \sum_{j=0}^{n} α_j ξ^j, the vector of coefficients
a = (α_0, . . . , α_n)^⊤ is the solution of problem (3). We also recall the most com-
mon representations for the interpolating polynomial (see for example [16, 17]):
Lagrange form:


n
pn (ξ ) = βk lk (ξ ), (6.1)
k=0

where lk are the Lagrange basis polynomials, defined by

l_k(ξ) = \prod_{j=0, \, j ≠ k}^{n} \frac{ξ − ξ_j}{ξ_k − ξ_j}.        (6.2)

Newton form:

pn (ξ ) = γ0 + γ1 (ξ − ξ0 ) + · · · + γn (ξ − ξ0 ) · · · (ξ − ξn−1 ), (6.3)

where γ j are the divided differences coefficients.


A word of warning: Vandermonde matrices are notoriously ill-conditioned so
manipulating them in floating-point can be extremely error prone unless special
conditions hold. It has been shown in [18] that the condition with respect to the ∞-
norm when all nodes are positive grows as O(2^n) for Vandermonde matrices of order
n. The best known, well-conditioned Vandermonde matrices are those defined on the
n + 1 roots of unity. To distinguish this case we replace x by w = (1, ω, . . . , ω^n)^⊤ where
ω = exp(ι 2π/(n + 1)). The matrix V(w), of order n + 1, is unitary (up to scaling) and
thus is perfectly conditioned, symmetric, and moreover corresponds to the discrete
Fourier transform. We will need this matrix in the sequel so we show it here, in
simplified notation:
V_{n+1} = \begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & ω & ω^2 & \cdots & ω^n \\
1 & ω^2 & ω^4 & \cdots & ω^{2n} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & ω^n & ω^{2n} & \cdots & ω^{n^2}
\end{pmatrix}.        (6.4)

The action of V either in multiplication by a vector or in solving a linear system


on a uniprocessor can be computed using O((n +1) log(n +1)) arithmetic operations
using the fast Fourier transform (FFT); see for example [19] for a comprehensive
treatise on the subject. We call \frac{1}{\sqrt{n+1}} V(w) a unitary Fourier matrix. Good condi-
tioning can also be obtained for other sets of interpolation nodes on the unit circle,
[20, 21]. Some improvements are also possible for sequences of real nodes, though
the growth in the condition number remains exponential; cf. [22].
The algorithms in this section are suitable for Vandermonde matrices that are not
extremely ill-conditioned and that cannot be manipulated with the FFT. On the other
hand they must be large enough to justify the use of parallelism.
Finally, it is worth mentioning that the solution of Vandermonde systems of order
in the thousands using very high-precision on a certain parallel architecture has
been described in [23]. It would be of interest to investigate whether the parallel
algorithms we present here can be used, with extended arithmetic, to solve large
general Vandermonde systems.
Product to Power Form Conversion
All algorithms for Vandermonde matrices that we describe require, at some stage,
transforming a polynomial from product to power form. Given (real or complex)
values {ξ1 , . . . , ξn } and a nonzero γ then if

p(ξ) = γ \prod_{j=1}^{n} (ξ − ξ_j)        (6.5)

is the product form representation of the polynomial, we seek fast and practical
algorithms for computing the transformation

F : (ξ1 , . . . , ξn , γ ) → (α0 , . . . , αn ),
n
where \sum_{i=0}^{n} α_i ξ^i is the power form representation of p(ξ). The importance of making
available the power form representation of the interpolation polynomials was noted
early on in [11]. By implication, it is important to provide fast and practical trans-
formations between representations. The standard serial algorithm, implemented for
example by MATLAB's poly function, takes approximately n^2 arithmetic opera-
tions. The algorithms presented in this section for converting from product to power
form (6.1) and (6.2) are based on the following proposition.
Proposition 6.1 Let u_j = e_2^{(n+1)} − ρ_j e_1^{(n+1)} for j = 1, . . . , n be the vectors con-
taining the coefficients for the term x − ρ_j of p(x) = \prod_{j=1}^{n} (x − ρ_j), padded with zeros.
Denote by dft_k (resp. dft_k^{-1}) the discrete (resp. inverse discrete) Fourier transform
of length k and let a = (α_0, . . . , α_n)^⊤ be the vector of coefficients of the power form
of p_n(x). Then

a = dft_{n+1}^{-1}\bigl( dft_{n+1}(u_1) ⊙ · · · ⊙ dft_{n+1}(u_n) \bigr),

where ⊙ denotes the elementwise (Hadamard) product.

The proposition can be proved from classical results for polynomial multiplication
using convolution (e.g. see [24] and [25, Problem 2.4]).
In the following observe that if V is as given in (6.4), then

dft_{n+1}(u) = V u,   and   dft_{n+1}^{-1}(u) = \frac{1}{n+1} V^{*} u,        (6.6)

Algorithm 6.1 (pr2pw) uses the previous proposition to compute the coefficients
of the power form from the roots of the polynomial.

Algorithm 6.1 pr2pw: Conversion of polynomial from product form to power form.

Input: r = (ρ_1, . . . , ρ_n)^⊤ //product form \prod_{j=1}^{n} (x − ρ_j)
Output: coefficients a = (α_0, . . . , α_n)^⊤ //power form \sum_{i=0}^{n} α_i ξ^i
1: U = −e_1^{(n+1)} r^⊤ + e_2^{(n+1)} (e^{(n)})^⊤
2: doall j = 1 : n
3: Û:, j = dftn+1 (U:, j )
4: end
5: doall i = 1 : n + 1
6: α̂i = prod (Ûi,: )
7: end
8: a = dft_{n+1}^{-1}(â) //â = (α̂_1, . . . , α̂_{n+1})^⊤
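A compact serial rendering of Algorithm 6.1 in Python/NumPy (an illustrative sketch of ours, with np.fft standing in for dft) is shown below; the two doall loops correspond to the FFTs of the padded factors and the elementwise products of the transforms.

import numpy as np

def pr2pw(r):
    # Product form prod_j (x - rho_j) to power form coefficients alpha_0..alpha_n.
    n = len(r)
    U = np.zeros((n + 1, n), dtype=complex)
    U[0, :] = -np.asarray(r)        # coefficients of x - rho_j, lowest degree first
    U[1, :] = 1.0
    Uhat = np.fft.fft(U, axis=0)    # one length-(n+1) DFT per factor (lines 2-4)
    ahat = Uhat.prod(axis=1)        # elementwise products across factors (lines 5-7)
    return np.fft.ifft(ahat)        # power form coefficients (line 8); real inputs
                                    # give only negligible imaginary parts

For example, pr2pw([1.0, 2.0]) returns approximately [2, -3, 1], the coefficients of (x − 1)(x − 2).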


We next comment on Algorithm 6.1 (pr2pw). Both dft and prod can be imple-
mented using parallel algorithms. If p = O(n^2) processors are available, the cost
is O(log n). On a distributed memory system with p ≤ n processors, using a one-
dimensional block-column distribution for U , then in the first loop (lines 2–4) only
local computations in each processor are performed. In the second loop (lines 5–7)
there are (n + 1)/ p independent products of vectors of length n performed sequen-
tially on each processor at a total cost of (n − 1)(n + 1)/ p. The result is a vector of
length n + 1 distributed across the p processors. Vector â would then be distributed
across the processors, so the final step consists of a single transform of length n + 1
p log(n +1))
that can be performed in parallel over the p processors at a cost of O( n+1
parallel operations. So the parallel cost for Algorithm 6.1 (pr2pw) is

n (n + 1)
Tp = τ1 + (n − 1) + τ p , (6.7)
p p

where τ1 , τ p are the times for an dftn+1 on 1 and p processors respectively.


Using p = n processors, the dominant parallel cost is O(n log n) and is caused
by the first term, that is the computation of the DFTs via the FFT (lines 2–4).
We next show that the previous algorithm can be modified to reduce the cost of
the dominant, first step. Observe that the coefficient vectors u j in Proposition 6.1
are sparse and, in particular, linear combinations of e_1^{(n+1)} and e_2^{(n+1)} because the

polynomials being multiplied are all linear and monic. From (6.6) it follows that the
DFT of each u j is then

V(w) u_j = V(w) e_2 − ρ_j V(w) e_1 = w − ρ_j e = (1 − ρ_j, ω − ρ_j, . . . , ω^n − ρ_j)^⊤.        (6.8)

Therefore each DFT can be computed by means of a DAXPY-type operation with-


out using the FFT. Moreover, the computations can be implemented in a parallel
architecture in a straightforward manner. Assuming that the powers of ω are readily
available, the arithmetic cost of this step on p ≤ n processors is n(n + 1)/ p parallel
operations. If there are p = O(n^2) processors, the cost is 1 parallel operation. Algo-
rithm 6.2 (powform) implements this idea and also incorporates the multiplication
with a scalar coefficient γ when the polynomial is not monic. Note that the major
difference from Algorithm pr2pw is the replacement of the first major loop with the
doubly nested loop that essentially computes the size (n + 1) × n matrix product
\begin{pmatrix} 1 & 1 \\ 1 & ω \\ \vdots & \vdots \\ 1 & ω^n \end{pmatrix}
\begin{pmatrix} −ρ_1 & −ρ_2 & \cdots & −ρ_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}.

Since all terms are independent, this can be computed in 2n(n + 1)/ p parallel oper-
ations. It follows that the overall cost to convert from product to power form on p
processors is approximately

T_p = 2\frac{n(n+1)}{p} + τ_p,        (6.9)

where τ p is as in (6.7). When p = n, the dominant cost of powform is O(n) instead


of O(n log n) for pr2pw. On n^2 processors, the parallel cost of both algorithms is
O(log n). We note that the replacement of the FFT-based multiplication with V (w)
with matrix-multiplication implemented as linear combination of the columns was
inspired by early work in [26] on rapid elliptic solvers with sparse right-hand sides.
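The corresponding serial sketch of Algorithm 6.2 (again Python/NumPy, for illustration only; the interface is ours) replaces the forward FFTs by the closed form (6.8), so that a single inverse transform remains.

import numpy as np

def powform(r, gamma=1.0):
    # Conversion of gamma * prod_j (x - rho_j) to power form using (6.8):
    # the DFT of each padded linear factor is simply w - rho_j * e.
    n = len(r)
    w = np.exp(-2j * np.pi * np.arange(n + 1) / (n + 1))   # powers of omega
    Uhat = w[:, None] - np.asarray(r)[None, :]             # one column per factor, cf. (6.8)
    a = np.fft.ifft(Uhat.prod(axis=1))                     # single inverse DFT
    if np.isrealobj(np.asarray(r)):
        a = a.real                                          # real roots give real coefficients
    return gamma * a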

6.1.1 Vandermonde Matrix Inversion

We make use of the following result (see for example [15]):


Proposition 6.2 Consider the order n + 1 Vandermonde matrix V (x), where x =
(ξ0 , . . . , ξn ) . Then if its inverse is written in row form,

Algorithm 6.2 powform: Conversion from full product (roots and leading coef-
ficient) (6.5) to power form (coefficients) using the explicit formula (6.8) for the
transforms.
Function: a = powform(r, γ)
Input: vector r = (ρ1 , . . . , ρn ) and scalar γ
Output: power form coefficients a = (α0 , . . . , αn )
1: ω = exp(−ι2π/(n + 1))
2: doall i = 1 : n + 1
3: doall j = 1 : n
4: υ̂_{i,j} = ω^{i−1} − ρ_j
5: end
6: end
7: doall i = 1 : n + 1
8: α̂i−1 = prod (Ûi,: )
9: end
10: a = dft_{n+1}^{-1}(â)
11: if γ ≠ 1 then
12: a = γ a
13: end if

V^{-1} = \begin{pmatrix} v̂_1^⊤ \\ \vdots \\ v̂_{n+1}^⊤ \end{pmatrix}

then each row v̂i , i = 1, . . . , n + 1, is the vector of coefficients of the power form
representation of the corresponding Lagrange basis polynomial li−1 (ξ ).
Algorithm 6.3 for computing the inverse of V , by rows, is based on the previous
proposition and Algorithm 6.2 (powform).
Most operations of ivand occur in the last loop, where n + 1 conversions to
power form are computed. Using pr2pw on O(n^3) processors, these can be done in
O(log n) operations; the remaining computations can be accomplished in equal or
fewer steps so the total parallel cost of ivand is O(log n).

Algorithm 6.3 ivand: computing the Vandermonde inverse.


Input: x = (ξ0 , . . . , ξn ) with pairwise distinct values // Vandermonde matrix is V (x)
Output: v̂1 , . . . , v̂n+1 //V −1 = (v̂1 , . . . , v̂n+1 )
1: U = x (e^{(n+1)})^⊤ − e^{(n+1)} x^⊤ + I
2: doall i = 1 : n + 1
3: π_i = prod(U_{i,:}) //π_i = \prod_{j ≠ i} (ξ_i − ξ_j)
4: end
5: doall i = 1 : n + 1
6: x̂ (i) = (x1:i−1 ; xi+1:n )
7: v̂(i) = powform(x̂ (i) , 1/πi )
8: end

On (n + 1)^2 processors, the cost is dominated by O(n) parallel operations for


the last loop according to the preceding analysis of pr2pw for the case of n proces-
sors. Note that dedicating all processors for each invocation of pr2pw incurs higher
cost, namely O(n log n). The remaining steps of ivand incur smaller cost, namely
O(log n) for the second loop and O(1) for the first. So on (n + 1)^2 processors,
the parallel cost of ivand is O(n) operations. It is also easy to see that on p ≤ n
processors, the cost becomes O(n^3/p).
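For reference, a serial NumPy version of Algorithm 6.3 (our own sketch; np.poly plays the role of powform) builds the rows of V^{-1} as the power form coefficients of the Lagrange basis polynomials, as stated in Proposition 6.2.

import numpy as np

def ivand(x):
    # Row i of V(x)^{-1} holds the power form coefficients of the Lagrange
    # basis polynomial l_{i-1}.
    x = np.asarray(x, dtype=float)
    n1 = len(x)
    Vinv = np.empty((n1, n1))
    for i in range(n1):                      # independent tasks (a doall loop)
        xi = np.delete(x, i)
        pi = np.prod(x[i] - xi)              # pi_i = prod_{j != i} (xi_i - xi_j)
        Vinv[i, :] = np.poly(xi)[::-1] / pi  # coefficients, lowest degree first
    return Vinv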

6.1.2 Solving Vandermonde Systems and Parallel Prefix

The solution of primal or dual Vandermonde systems can be computed by first obtain-
ing V −1 using Algorithm 6.3 (ivand) and then using a dense BLAS2 (matrix vector
multiplication) to obtain V^{-1} b or V^{-⊤} b.
We next describe an alternative approach for dual Vandermonde systems,
(V (x)) a = b, that avoids the computation of the inverse. The algorithm consists
of the following major stages.
1. From the values in (x, b) compute the divided difference coefficients g =
(γ0 , . . . , γn ) for the Newton form representation (6.3).
2. Using g and x construct the coefficients of the power form representation for each
term γ_j \prod_{i=0}^{j−1} (ξ − ξ_i) of (6.3).
3. Combine these terms to produce the solution a as the vector of coefficients of the
power form representation for the Newton polynomial (6.3).

Step 1: Computing the divided differences


Table-based Approach
In this method, divided difference coefficients g = (γ0 , . . . , γn ) emerge as elements
of a table whose first two columns are the vectors x and b. The elements of the table
are built recursively, column by column while the elements in each column can
be computed independently. One version of this approach, Neville, is shown as
Algorithm 6.4. A matrix formulation of the generation of divided differences can be
found in [14]. Neville lends itself to parallel processing using e.g. the systolic array
model; cf. [27, 28].

Algorithm 6.4 Neville: computing the divided differences by the Neville method.
Input: x = (ξ0 , . . . , ξn ) , b = (β0 , . . . , βn ) where the ξ j are pairwise distinct.
Output: Divided difference coefficients c = (γ0 , . . . , γn )
1: c = b //initialization
2: do k = 0 : n − 1
3: doall i = k + 1 : n
4: γi = (γi − γi−1 )/(ξi − ξi−k−1 )
5: end
6: end
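A vectorized serial version of Algorithm 6.4 (a Python/NumPy sketch of ours) updates the table column by column; the slice update plays the role of the inner doall loop.

import numpy as np

def neville_divdiff(x, b):
    # Divided difference coefficients gamma_0, ..., gamma_n by the Neville table.
    x = np.asarray(x, dtype=float)
    g = np.asarray(b, dtype=float).copy()
    n1 = len(x)
    for k in range(n1 - 1):                                      # serial outer loop
        g[k+1:] = (g[k+1:] - g[k:-1]) / (x[k+1:] - x[:n1-k-1])   # doall over i
    return g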

Sums of Fractions Approach


We next describe a method of logarithmic complexity for constructing divided differ-
ences. The algorithm uses the sum of fractions representation for divided differences.
In particular, consider the order n + 1 matrix

U = x e^⊤ − e x^⊤ + I,        (6.10)

where x = (ξ0 , . . . , ξn ) . Its elements are



υ_{i,j} = \begin{cases} ξ_{i−1} − ξ_{j−1} & \text{if } i ≠ j \\ 1 & \text{otherwise} \end{cases}, \qquad i, j = 1, . . . , n + 1

Observe that the matrix is shifted skew-symmetric and that it is also used in Algorithm
6.3 (line 1). The divided differences can be expressed as a linear combination of the
values β_j with coefficients computed from U. Specifically, for l = 0, . . . , n,

γ_l = \frac{β_0}{υ_{1,1} \cdots υ_{1,l+1}} + \frac{β_1}{υ_{2,1} \cdots υ_{2,l+1}} + \cdots + \frac{β_l}{υ_{l+1,1} \cdots υ_{l+1,l+1}}.        (6.11)

The coefficients of the terms β0 , . . . , βl for γl are the (inverses) of the products of the
elements in rows 1 to l +1 of the order l +1 leading principal submatrix, U1:l+1,1:l+1 ,
of U . The key to the parallel algorithm is the following observation [29, 30]:
For fixed j, the coefficients of β j in the divided difference coefficients γ0 , γ1 , . . . , γn are
the (inverses) of the n + 1 prefix products {υ j,1 , υ j,1 υ j,2 , . . . , υ j,1 · · · υ j,n+1 }.

The need to compute prefixes arises frequently in Discrete and Computational Math-
ematics and is recognized as a kernel and as such has been studied extensively in
the literature. For completeness, we provide a short description of the problem and
parallel algorithms in Sect. 6.1.3.
We now return to computing the divided differences using (6.11). We list the
steps as Algorithm 6.5 (dd_pprefix). First U is calculated. Noting that the matrix is
shifted skew-symmetric, this needs 1 parallel subtraction on n(n + 1)/2 processors. Applying
n +1 independent instances of parallel prefix (Algorithm 6.8) and 1 parallel division,
all of the inverse coefficients of the divided differences are computed in log n steps
using (n + 1)n processors. Finally, the application of n independent instances of
a logarithmic depth tree addition algorithm yields the sought values in log(n + 1)
arithmetic steps on (n + 1)n/2 processors. Therefore, the following result holds.
Proposition 6.3 ([30]) The divided difference coefficients of the Newton inter-
polating polynomial of degree n can be computed from the n + 1 value pairs
{(ξk , βk ), k = 0, . . . , n} in 2 log(n + 1) + 2 parallel arithmetic steps using (n + 1)n
processors.

Algorithm 6.5 dd_pprefix: computing divided difference coefficients by sums of


rationals and parallel prefix.
Input: x = (ξ0 , . . . , ξn ) and b = (β0 , . . . , βn ) where the ξ j ’s are pairwise distinct
Output: divided difference coefficients {γ j }nj=0
1: U = x e^⊤ − e x^⊤ + I; Û = 1_{n+1,n+1} ./ U;
2: doall j = 1 : n + 1
3: P_{j,:} = prefix_opt(Û_{j,:}); //It is assumed that n = 2^k and that prefix_opt (cf. Sect. 6.1.3)
is used to compute prefix products
4: end
5: doall l = 1 : n + 1
6: γ_{l−1} = b_{1:l}^⊤ P_{1:l,l}
7: end
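A serial NumPy rendering of Algorithm 6.5 (our own sketch) uses elementwise reciprocals and a row-wise cumulative product in place of the parallel prefix step:

import numpy as np

def dd_prefix(x, b):
    # Divided differences via the sums-of-fractions formula (6.11).
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)
    n1 = len(x)
    U = x[:, None] - x[None, :] + np.eye(n1)       # U = x e^T - e x^T + I
    P = np.cumprod(1.0 / U, axis=1)                # row-wise prefix products of 1/U
    return np.array([b[:l+1] @ P[:l+1, l] for l in range(n1)])   # gamma_l, l = 0..n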

Steps 2–3: Constructing power form coefficients from the Newton form
The solution a of the dual Vandermonde system V^⊤(x) a = b contains the coefficients
of the power form representation of the Newton interpolation polynomial in (6.3).
Based on the previous discussion, this will be accomplished by independent invoca-
tions of Algorithm 6.2 (powform). Each of these returns the set of coefficients for
the power form of an addend, γ_l \prod_{j=0}^{l−1} (ξ − ξ_j), in (6.3). Finally, each of these inter-
mediate vectors is summed to return the corresponding element of a. When O(n^2)
processors are available, the parallel cost of this algorithm is T p = O(log n). These
steps are listed in Algorithm 6.6 (dd2pw).

Algorithm 6.6 dd2pw: converting divided differences to power form coefficients


Input: vector of nodes x = (ξ0 , . . . , ξn−1 ) and divided difference coefficients g = (γ0 , . . . , γn ).
Output: power form coefficients a = (α0 , . . . , αn ).
1: R = 0n+1,n+1 ; R1,1 = γ0 ;
2: doall j = 1 : n
3: R1: j+1, j+1 = powform((ξ0 , . . . , ξ j−1 ), γ j )
4: end
5: doall j = 1 : n + 1
6: α_{j−1} = sum(R_{j,:}, 2)
7: end
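The conversion step of Algorithm 6.6 admits the following serial sketch (Python/NumPy, with np.poly standing in for powform); given the nodes and the divided differences it accumulates the expanded Newton terms.

import numpy as np

def dd2pw(x, g):
    # Power form coefficients of sum_j gamma_j * prod_{i<j} (xi - x_i).
    x = np.asarray(x, dtype=float)
    n1 = len(g)
    a = np.zeros(n1)
    a[0] = g[0]
    for j in range(1, n1):                       # independent terms (a doall loop)
        a[:j+1] += g[j] * np.poly(x[:j])[::-1]   # lowest-degree-first coefficients
    return a

Combined with either divided-difference sketch above, dd2pw(x, g) returns the solution of the dual system V^⊤(x) a = b.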

6.1.3 A Brief Excursion into Parallel Prefix

Definition 6.1 Let ∗ be an associative binary operation on a set S. The prefix compu-
tation problem is as follows: Given an ordered n-tuple (α_1, α_2, . . . , α_n) of elements
of S, compute all n prefixes α_1 ∗ · · · ∗ α_i for i = 1, 2, . . . , n.
The computations of interest in this book are prefix sums on scalars and prefix
products on scalars and matrices. Algorithm 6.7 accomplishes the prefix computation;
to simplify the presentation, it is assumed that n = 2^k. The cost of the algorithm is
T_p = O(log n) parallel "∗" operations.

Algorithm 6.7 prefix: prefix computation


Input: a = (α_1, . . . , α_n) where n = 2^k, and an associative operation ∗.
Output: p = (π_1, . . . , π_n) where π_i = α_1 ∗ · · · ∗ α_i.
1: p = a;
2: do j = 1 : log n
3: h = 2^{j−1}
4: doall i = 1 : n − h
5: π̂_{i+h} = π_{i+h} ∗ π_i
6: end
7: doall i = 1 : n − h
8: πi+h = π̂i+h
9: end
10: end

Algorithm 6.7 (prefix) is straightforward but the total number of operations is of


the order n log n, so there is O(log n) redundancy over the simple serial computation.
A parallel prefix algorithm of O(1) redundancy can be built by a small modi-
fication of Algorithm 6.7 at the cost of a few additional steps that do not alter the
logarithmic complexity. We list this as Algorithm 6.8 (prefix_opt). The cost is
T p = 2 log n − 1 parallel steps and O p = 2n − log n − 2 operations. The computa-
tions are organized in two phases, the first performing k steps of a reduction using a
fan-in approach to compute final and intermediate prefixes at even indexed elements
2, 4, . . . , 2k and the second performing another k − 1 steps to compute the remaining
prefixes. See also [31–33] for more details on this and other implementations.

Algorithm 6.8 prefix_opt: prefix computation by odd-even reduction.


Input: Vector a = (α_1, . . . , α_n) where n is a power of 2. // ∗ is an associative operation
Output: Vector p of prefixes π_i = α_1 ∗ · · · ∗ α_i.
1: p = a
//forward phase
2: do j = 1 : log n
3: doall i = 2^j : 2^j : n
4: π̂_i = π_i ∗ π_{i−2^{j−1}}
5: end
6: doall i = 2^j : 2^j : n
7: π_i = π̂_i
8: end
9: end
//backward phase
10: do j = log n − 1 : −1 : 1
11: doall i = 3 · 2^{j−1} : 2^j : n − 2^{j−1}
12: π̂_i = π_i ∗ π_{i−2^{j−1}}
13: end
14: doall i = 3 · 2^{j−1} : 2^j : n − 2^{j−1}
15: π_i = π̂_i
16: end
17: end
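A direct transliteration of Algorithm 6.8 into serial Python (an illustrative sketch; the inner loops are the parallel doall steps, and the operands are combined earlier-element first so that non-commutative operations are also handled):

import operator

def prefix_opt(a, op=operator.add):
    # Inclusive prefixes of a under the associative operation op;
    # len(a) is assumed to be a power of 2.
    p = list(a)
    n = len(p)
    k = n.bit_length() - 1
    for j in range(1, k + 1):                                   # forward phase
        for i in range(2**j - 1, n, 2**j):
            p[i] = op(p[i - 2**(j - 1)], p[i])
    for j in range(k - 1, 0, -1):                               # backward phase
        for i in range(3 * 2**(j - 1) - 1, n - 2**(j - 1), 2**j):
            p[i] = op(p[i - 2**(j - 1)], p[i])
    return p

For example, prefix_opt([1, 2, 3, 4]) returns [1, 3, 6, 10].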

The pioneering array oriented programming language APL included the scan
instruction for computing prefixes [34]. In [35] many benefits of making prefix avail-
able as a primitive instruction are outlined. The monograph [32] surveys early work
on parallel prefix. Implementations have been described for several parallel archi-
tectures, e.g. see [36–40], and via using the Intel Threading Building Blocks C++
template library (TBB) in [41].

6.2 Banded Toeplitz Linear Systems Solvers

6.2.1 Introduction

Problems in mathematical physics and statistics give rise to Toeplitz, block-Toeplitz,


and the related Hankel and block-Hankel systems of linear equations. Examples
include problems involving convolutions, integral equations with difference kernels,
least squares approximations using polynomials as basis functions, and stationary
time series. Toeplitz matrices are in some sense the prototypical structured matrices.
In fact, in certain situations, banded Toeplitz matrices could be used as effective
preconditioners especially if fast methods for solving systems involving such pre-
conditioners are available.
We already discussed algorithms for linear systems with triangular Toeplitz and
banded triangular Toeplitz matrices in Sect. 3.2.4 of Chap. 3. Here, we present algo-
rithms for systems with more general banded Toeplitz matrices. Several uniprocessor
algorithms have been developed for the inversion of dense Toeplitz and Hankel
matrices of order n, all requiring O(n 2 ) arithmetic operations, e.g. see [42–50].
Extending these algorithms to block-Toeplitz and block-Hankel matrices of order np
with square blocks of order p shows that the inverse can be obtained in O(p^3 n^2)
operations; see [51, 52]. One of the early surveys dealing with inverses of Toeplitz
operators is given in [53]. Additional uniprocessor algorithms have been developed
for solving dense Toeplitz systems or computing the inverse of dense Toeplitz matri-
ces, e.g. see [54–58] in O(n log^2 n) arithmetic operations. These superfast schemes
must be used with care because sometimes they exhibit stability problems.
In what follows, we consider solving banded Toeplitz linear systems of equations
of bandwidth (2m +1), where the system size n is much larger than m. These systems
arise in the numerical solution of certain initial-value problems, or boundary-value
problems, for example. Uniprocessor banded solvers require O(n log m) arithmetic
operations, e.g. see [7]. We describe three algorithms for solving banded Toeplitz sys-
tems, originally presented in [59], that are suitable for parallel implementation. Essen-
tially, in terms of complexity, we show that a symmetric positive-definite banded
Toeplitz system may be solved in O(m log n) parallel arithmetic operations given
enough processors. Considering that a banded Toeplitz matrix of bandwidth 2m + 1
can be described by at most (2m + 1) floating-point numbers, a storage scheme
may be adopted so as to avoid excessive internode communications and maximize
conflict-free access within each multicore node.

The banded Toeplitz algorithms to be presented in what follows rely on several


facts concerning Toeplitz matrices and the closely related circulant matrices. We
review some of these facts and present some of the basic parallel kernels that are
used in Sect. 6.2.2.
Definition 6.2 Let A = [αij ] ∈ Rn×n . Then A is Toeplitz if αi j = αi− j , i, j =
1, 2, . . . , n. In other words, the entries of A along each diagonal are the same.
It follows from the definition that Toeplitz matrices are low displacement matrices.
In particular it holds that

A − JAJ = (A − (e1 Ae1 )I )e1 e1 + e1 e1 A,

where J is the lower bidiagonal matrix defined in section and the rank of the term
on the right-hand side of the equality above is at most 2.

Definition 6.3 Let A, E n ∈ Rn×n , where E n = [en , en−1 , . . . , e1 ], in which ei


is the ith column of the identity matrix In . A is then said to be persymmetric if
E n AE n = A ; in other words, if A is symmetric about the cross-diagonal.

Definition 6.4 ([60]) Let A ∈ R2n×2n . Hence, A is called centrosymmetric if it can


be written as  
B C
A=
C  En B En

where B, C ∈ Rn×n , B = B  , and C  = E n C E n . Equivalently, E 2n AE 2n = A.

Lemma 6.1 ([50]) Let A ∈ Rn×n be a nonsingular Toeplitz matrix. Then, both A
and A−1 are persymmetric; i.e., E n AE n = A , and E n A−1 E n = A− .

The proof follows directly from the fact that A is nonsingular, and E n2 = I .

Lemma 6.2 ([61]) Let A ∈ R2n×2n be a symmetric Toeplitz matrix of the form
 
B C
A= .
C B

Thus, A is also centrosymmetric. Furthermore, if


 
1 In E n
P2n = √
2 In −E n

we have  
 B + CEn 0
P2n A P2n = .
0 B − CEn

The proof is a direct consequence of Definition 6.4 and Lemma 6.1.


178 6 Special Linear Systems

Theorem 6.1 ([53, 62, 63]) Let A ∈ Rn×n be a Toeplitz matrix with all its leading
principal submatrices being nonsingular. Let also, Au = αe1 and Av = βen , where

u = (1, μ1 , μ2 , . . . , μn−1 ) ,
v = (νn−1 , νn−2 , . . . , ν1 , 1) .

Since A−1 is persymmetric, then α = β, and A−1 can be written as

α A−1 = UV − Ṽ Ũ (6.12)

in which the right-hand side consists of the triangular Toeplitz matrices

⎛ ⎞ ⎛ ⎞
⎜ 1 ⎟ ⎜ 1 ν1 ν n−2 ν n−1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ μ1 ⎟ ⎜ ν n−2 ⎟
⎜ 1 ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
U =⎜ ⎜ ⎟ V =⎜ ⎟
⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ μ n−2 ⎟ ⎜ ν1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
μ n−1 μ n−2 μ1 1 1
⎛ ⎞ ⎛ ⎞
⎜ 0 ⎟ ⎜ 0 μ n−1 μ2 μ 1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ν n−1 0 ⎟ ⎜ μ2 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
Ṽ = ⎜

⎟ Ũ = ⎜
⎟ ⎜


⎜ ⎟ ⎜ ⎟
⎜ ν2 ⎟ ⎜ μ n−1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
ν1 ν2 ν n−1 0 0

Equation (6.12) was first given in [62]. It shows that A−1 is completely determined
by its first and last columns, and can be used to compute any of its elements. Cancel-
lation, however, is assured if one attempts to compute an element of A−1 below the
cross-diagonal. Since A−1 is persymmetric, this situation is avoided by computing
the corresponding element above the cross-diagonal. Moreover, if A is symmetric,
its inverse is both symmetric and persymmetric, and u = E n v completely determines
A−1 via
α A−1 = UU − Ũ  Ũ . (6.13)

Finally, if A is a lower triangular Toeplitz matrix, then A−1 = α −1 U , see Sect. 3.2.3.
Definition 6.5 Let A ∈ Rn×n be a banded symmetric Toeplitz matrix with elements
αij = α|i− j| = αk , k = 0, 1, . . . , m. where m < n, αm = 0 and αk = 0 for k > m.
Then the complex rational polynomial

φ(ξ ) = αm ξ −m + · · · + α1 ξ −1 + α0 + α1 ξ + · · · + αm ξ m (6.14)

is called the symbol of A [64].


6.2 Banded Toeplitz Linear Systems Solvers 179

Lemma 6.3 Let A be as in Definition 6.5. Then A is positive definite for all n > m
if and only if:

m
(i) φ(eiθ ) = α0 + 2 αk cos kθ  0 for all θ , and
k=1
(ii) φ(eiθ ) is not identically zero.
Note that condition (ii) is readily satisfied since αm = 0.
Theorem 6.2 ([65]) Let A in Definition 6.5 be positive semi-definite for all n > m.
Then the symbol φ(ξ ) can be factored as

φ(ξ ) = ψ(ξ )ψ(1/ξ ) (6.15)

where
ψ(ξ ) = β0 + β1 ξ + · · · + βm ξ m (6.16)

is a real polynomial with βm = 0, β0 > 0, and with no roots strictly inside the unit
circle.
This result is from [66], see also [67, 68]. The factor ψ(ξ ) is called the Hurwitz
factor. Such a factorization arises mainly in the time series analysis and the realization
of dynamic systems. The importance of the Hurwitz factor will be apparent from the
following theorem.
Theorem 6.3 Let φ(eiθ ) > 0, for all θ , i.e. A is positive definite and the symbol
(6.14) has no roots on the unit circle. Consider the Cholesky factorization

A = LL (6.17)

where, ⎛ ⎞
(1)
λ0
⎜ ⎟
⎜ λ(2) λ(2) ⎟
⎜ 1 0 ⎟

⎜ λ(3)
2 λ(3)
1 λ(3)
0


⎜ .. .. .. ⎟
⎜ . . . ⎟
L=⎜
⎜ λ(m+1) (m+1) (m+1) (m+1)

⎟ (6.18)
⎜ m λm−1 λm−2 · · · λ0 ⎟
⎜ ⎟
⎜ λ(m−2)
m
(m+2)
λm−1 · · · λ(m+2) λ(m+2) ⎟
⎜ 1 0 ⎟
⎜ .. .. .. .. ⎟
⎝ . . . . ⎠
(k) (k) (k) (k)
0 λm λm−1 λ1 λ0

Then,
lim λ(k)
j = βj, 0  j  m,
k→∞

where β j is given by (6.16). In fact, if τ is the root of ψ(ξ ) closest to the unit circle
(note that |τ | > 1) with multiplicity p, then we have
180 6 Special Linear Systems

(k)
λ j = β j + O[(k − j)2( p−1) /|τ |2(k− j) ]. (6.19)

This theorem shows that the rows of the Cholesky factor L converge, linearly, with
an asymptotic convergence factor that depends on the magnitude of the root of ψ(ξ )
closest to the unit circle. The larger this magnitude, the faster the convergence.
Next, we consider circulant matrices that play an important role in one of the
algorithms in Sect. 6.2.2. In particular, we consider these circulants associated with
banded symmetric or nonsymmetric Toeplitz matrices. Let A = [α−m , . . . , α−1 , α0 ,
α1 , . . . , αm ] ∈ Rn×n denote a banded Toeplitz matrix of bandwidth (2m + 1) where
2m + 1  n, and αm , α−m = 0. Writing this matrix as

⎛ ⎞
B C
⎜ ⎟
⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎜ ⎟
A=⎜



⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎝ ⎠
D B

(6.20)

where B ∈ Rm×m is the Toeplitz matrix


⎛ ⎞
α0 α1 · · · αm−1
⎜ .. ⎟
⎜ α−1 α0 . ⎟
B=⎜
⎜ .


⎝ .. ..
. α1 ⎠
α−m+1 α−1 α0

and C, D ∈ Rm×m are the triangular Toeplitz matrices


⎛ ⎞ ⎛ ⎞
αm α −m α −m+1 α −1
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ α −2 ⎟
⎜ α m−1 α m ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
C =⎜ ⎟ and D = ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ α −m+1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
α1 α m−1 α m α −m

The circulant matrix à corresponding to A is therefore given by


6.2 Banded Toeplitz Linear Systems Solvers 181

⎛ ⎞
B C D
⎜ ⎟
⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎜ ⎟
à = ⎜



⎜ ⎟
⎜ D B C ⎟
⎜ ⎟
⎝ ⎠
C D B

(6.21)

which may also be written as the matrix polynomial


m
m
à = αj K j + α− j K n− j ,
j=0 j=1

where K = Jn + en e1 , that is

⎛ ⎞
0 1
⎜ ⎟
⎜ 0 1 ⎟
K =⎜



⎝ 0 1 ⎠
1 0 0
(6.22)

It is of interest to note that K  = K −1 = K n−1 , and K n = In .


In the next lemma we make use of the unitary Fourier matrix introduced in
Sect. 6.1.
Lemma 6.4 Let W be the unitary Fourier matrix of order n
⎛ ⎞
1 1 1 ··· 1
⎜1 ω ω2 ··· ω (n−1) ⎟
1 ⎜⎜ ω2 ω4

· · · ω2(n−1) ⎟
W = √ ⎜1 ⎟ (6.23)
n ⎜ .. .. .. .. ⎟
⎝. . . . ⎠
1 ω(n−1) ω2(n−1) · · · ω(n−1)
2

in which ω = ei(2π/n) is an nth root of unity. Then

W ∗ KW = Ω
(6.24)
= diag(1, ω, ω2 , . . . , ωn−1 ),
182 6 Special Linear Systems

and
W ∗ ÃW = Γ
m
m
(6.25)
= αjΩ j + α− j Ω − j
j=0 j=1

where Γ = diag(γ0 , γ1 , . . . , γn−1 ).


In other words, the kth eigenvalue of Ã, k = 0, 1, . . . , n − 1, is given by

γk = φ̃(ωk ) (6.26)

where

m
φ̃(ξ ) = αjξ j
j=−m

is the symbol of the nonsymmetric Toeplitz matrix (6.20), or


√ 
γk = nek+1 W a, (6.27)

in which a  = e1 Ã.


In what follows, we make use of the following well-known result.
Theorem 6.4 ([69]) Let W ∈ Cm×n be as in Lemma 6.4, y ∈ Cn and n be a power
of 2. Then the product W y may be obtained in T p = 3 log n parallel arithmetic
operations using p = 2n processors.

6.2.2 Computational Schemes

In this section we present three algorithms for solving the banded Toeplitz linear
system
Ax = f (6.28)

where A is given by (6.20) and n m. These have been presented first in [59]. We
state clearly the conditions under which each algorithm is applicable, and give com-
plexity upper bounds on the number of parallel arithmetic operations and processors
required.
The first algorithm, listed as Algorithm 6.9, requires the least number of par-
allel arithmetic operations of all three, 6 log n + O(1). It is applicable only when
the corresponding circulant matrix is nonsingular. The second algorithm, numbered
Algorithm 6.10 may be used if A, in (6.28), is positive definite or if all its principal
minors are non-zero. It solves (6.28) in O(m log n) parallel arithmetic operations.
The third algorithm, Algorithm 6.11, uses a modification of the second algorithm to
compute the Hurwitz factorization of the symbol φ(ξ ) of a positive definite Toeplitz
matrix. The Hurwitz factor, in turn, is then used by an algorithm proposed in [65] to
6.2 Banded Toeplitz Linear Systems Solvers 183

solve (6.28). This last algorithm is applicable only if none of the roots of φ(ξ ) lie on
the unit circle. In fact, the root of the factor ψ(ξ ) nearest to the unit circle should be
far enough away to assure early convergence of the modified Algorithm 6.10. The
third algorithm requires O(log m log n) parallel arithmetic operations provided that
the Hurwitz factor has already been computed. It also requires the least storage of all
three algorithms. In the next section we discuss another algorithm that is useful for
block-Toeplitz systems that result from the discretization of certain elliptic partial
differential equations.
A Banded Toeplitz Solver for Nonsingular Associated Circulant Matrices:
Algorithm 6.9
First, we express the linear system (6.28), in which the banded Toeplitz matrix A is
given by (6.20), as
( Ã − S)x = f (6.29)

where à is the circulant matrix (6.21) associated with A, and


 
0 D
S=U U ,
C 0

in which  
Im 0 · · · 0 0
U = .
0 0 · · · 0 Im

Assuming à is nonsingular, the Sherman-Morrison-Woodbury formula [70] yields


the solution
x = Ã−1 f − Ã−1 UG−1 U  Ã−1 f, (6.30)

where  −1
 −1 0 D
G = U Ã U− .
C 0

The number of parallel arithmetic operations required by this algorithm, Algo-


rithm 6.9, which is clearly dominated by the first stage, is 6 log n + O(m log m)
employing no more than 4n processors. The corresponding sequential algorithm
consumes O(n log n) arithmetic operations and hence realize O(n) speedup. Note
that at no time do we need more than 2n + O(m 2 ) storage locations.
The notion of inverting a Toeplitz matrix via correction of the correspond-
ing circulant-inverse is well known and has been used many times in the past;
cf. [65, 71].
Such an algorithm, when applicable, is very attractive on parallel architectures
that allow highly efficient FFT’s. See for example [72] and references therein. As we
have seen, solving (6.28) on a parallel architecture can be so inexpensive that one is
tempted to improve the solution via one step of iterative refinement. Since v1 , which
determines Ã−1 completely, is already available, each step of iterative refinement
184 6 Special Linear Systems

Algorithm 6.9 A banded Toeplitz solver for nonsymmetric systems with nonsingular
associated circulant matrices
Input: Banded nonsymmetric Toeplitz matrix A as in (6.20) and the right-hand side f .
Output: Solution of the linear system Ax = f
//Stage 1 //Consider the circulant matrix à associated with A as given by Eq. (6.21). First,
determine whether à is nonsingular and, if so, determine Ã−1 and y = Ã−1 f . Since the inverse
of a circulant matrix is also circulant, Ã−1 is completely determined by solving Ãv1 = e1 . This
is accomplished via (6.25), i.e., y = W Γ −1 W ∗ f and v1 = W Γ −1 W ∗ e1 . This computation is
organized as follows: √
1: Simultaneously form nW a and W ∗ f //see (6.27) and Theorem 6.4 √ (FFT). This is an inex-
pensive test for the nonsingularity of Ã. If none of the elements of nW a (eigenvalues of Ã)
vanish, we proceed to step (2). √
2: Simultaneously, obtain Γ −1 (W ∗ e1 ) and Γ −1 (W ∗ f ). //Note that nW ∗ e1 = (1, 1, . . . , 1) .
3: Simultaneously obtain v1 = W (Γ W e1 ) and y = W (Γ W ∗ f ) via the FFT.
−1 ∗ −1 //see
Theorem 6.4
//Stage 2 //solve the linear system
 
y1
Gz = U  y = (6.31)

where y1 , yν ∈ Rm contain the first and last m elements of y, respectively.


4: form the matrix G ∈ R2m×2m . //From Stage 1, v1 completely determines Ã−1 , in particular
we have  
F M
U  Ã−1 U = ,
N F
where F, M, and  N are Toeplitz matrices
 each of order m.
F M − C −1
5: Compute G = . //Recall that C −1 and D −1 are completely determined
N − D −1 F
by their first columns; see also Theorem 6.1
6: compute the solution z of (6.31)
//Stage 3
7: compute u = Ã−1 U z
//Stage 4
8: compute x = y − u

is equally inexpensive. The desirability of iterative refinement for this algorithm is


discussed later.

A Banded Toeplitz Solver for Symmetric Positive Definite Systems:


Algorithm 6.10
Here, we consider symmetric positive definite systems which implies that for the
banded Toeplitz matrix A in (6.20), we have D = C  , and B is symmetric positive
definite. For the sake of ease of illustration, we assume that n = 2q p, where p and q
are integers with p = O(log n)  m. On a uniprocessor, the Cholesky factorization is
the preferred algorithm for solving the linear system (6.28). Unless the corresponding
circulant matrix is also positive definite, however, the rows of L (the Cholesky factors
of A) will not converge, see Theorem 6.3. This means that one would have to store
6.2 Banded Toeplitz Linear Systems Solvers 185

O(mn) elements, which may not be acceptable for large n and relatively large m.
Even if the corresponding circulant matrix is positive definite, Theorem 6.3 indicates
that convergence can indeed be slow if the magnitude of that root of the Hurwitz
factor ψ(ξ ) (see Theorem 6.2) nearest to the unit circle is only slightly greater than
1. If s is that row of L at which convergence takes place, the parallel Cholesky
factorization and the subsequent forward and backward sweeps needed for solving
(6.28) are more efficient the smaller s is compared to n.
In the following, we present an alternative parallel algorithm that solves the
same positive definite system in O(m log n) parallel arithmetic operations with O(n)
processors, which requires no more than 2n + O(m 2 ) temporary storage locations.
For n = 2q p, the algorithm consists of (q + 1) stages which we outline as follows.

Stage 0.
Let the pth leading principal submatrix of A be denoted by A0 , and the right-hand
side of (6.28) be partitioned as

(0) (0) 
f  = ( f1 , f2 , . . . , f η(0) ),

where f i(0) ∈ R p and η = n/ p = 2q . In this initial stage, we simultaneously solve


the (η + 1) linear systems with the same coefficient matrix:

(0) ( p) (0)
A0 (z 0 , y1 , . . . , yη(0) ) = (e1 , f 1 , . . . , f p(0) ) (6.32)

( p)
where e1 is the first column of the identity I p . From Theorem 4.1 and the discussion
of the column sweep algorithm (CSweep) in Sect. 3.2.1 it follows that the above
systems can be solved in 9( p −1) parallel arithmetic operations using mη processors.

Stage ( j = 1, 2, . . . , q).
Let ⎛ ⎞
A j−1 0
⎜ C ⎟
Aj = ⎜


⎠ (6.33)
C
0 A j−1

be the leading 2r × 2r principal submatrix of A, where r = 2 j−1 p. Also, let f be


partitioned as
( j) ( j) 
f  = ( f1 , f2 , . . . , f ν( j) )
186 6 Special Linear Systems

( j)
where f i ∈ R2r , and ν = n/2r . Here, we simultaneously solve the (ν + 1) linear
systems
A j z j = e1(2r ) , (6.34)

( j) ( j)
A j yi = fi , i = 1, 2, . . . , ν, (6.35)

(r ) ( j−1) ( j−1)
where stage ( j − 1) has already yielded z j−1 = A−1 j−1 e1 and yi = A−1j−1 f i ,
i = 1, 2, . . . , 2ν. Next, we consider solving the ith linear system in (6.35). Observing
that  
( j−1)
( j) f 2i−1
fi = ( j−1) ,
f 2i

and premultiplying both sides of (6.35) by the positive definite matrix

D j = diag(A−1 −1
j−1 , A j−1 )

we obtain the linear system


   
( j−1)
Ir G j 0 ( j) y2i−1
yi = ( j−1) (6.36)
0 H j Ir y2i

where    
0 C
Gj = A−1
j−1 and H j = A−1
j−1 .
C 0

From Lemma 6.1, both A−1 −1 −1


j−1 and C are persymmetric, i.e., E r A j−1 E r = A j−1 and
E m C E m = C  , thus H j = Er G j E m . Using the Gohberg-Semencul formula (6.13),
see Theorem 6.1, H j may be expressed as follows. Let

u = αz j−1
= (1, μ1 , μ2 , . . . , μr −1 ) ,

and
ũ = Jr Er u.

Hence, H j is given by
α H j = (YY  
1 − Ỹ Ỹ1 )C , (6.37)

in which the m × r matrices Y and Ỹ are given by

Y = (u, Jr u, . . . , Jrm−1 u),


Ỹ = (ũ, Jr ũ, . . . , Jrm−1 ũ),
6.2 Banded Toeplitz Linear Systems Solvers 187

⎛ ⎞
1
⎜ ⎟
⎜ ⎟
⎜ μ1 1 ⎟
⎜ ⎟
⎜ ⎟
Y1 = (Im , 0)Y = ⎜
⎜ μ2 μ1 1 ⎟

⎜ ⎟
⎜ ⎟
⎜ ⎟
⎝ ⎠
μ m−1 μ2 μ1 1

and
⎛ ⎞
0
⎜ ⎟
⎜ ⎟
⎜ μ r−1 0 ⎟
⎜ ⎟
⎜ ⎟
Ỹ1 = (Im , 0)Ỹ = ⎜
⎜ μ r−2 μ r−1 0 ⎟

⎜ ⎟
⎜ ⎟
⎜ ⎟
⎝ ⎠
μ r−m+1 μ r−2 μ r−1 0

The coefficient matrix in (6.36), D −1


j A j , may be written as
⎛ ( j)

Ir −m N1 0
⎜ ( j) ⎟
⎝ 0 N2 0 ⎠,
( j)
0 N3 Ir −m

( j)
where the central block N2 is clearly nonsingular since A j is invertible. Note
( j)
also that the eigenvalues of N2 ∈ R2m×2m are the same as those eigenvalues of
(D −1 −1
j A j ) that are different from 1. Since D j A j is similar to the positive definite
−1/2 −1/2 ( j)
matrix D j A j D j , N2 is not only nonsingular, but also has all its eigenvalues
positive. Hence, the solution of (6.36) is trivially obtained if we first solve the middle
2m equations,
( j)
N2 h = g, (6.38)

or     
Im E m M j E m h 1,i g1,i
= , (6.39)
Mj Im h 2,i g2,i

where
M j = (Im , 0)H j = α −1 (Y1 Y1 − Ỹ1 Ỹ1 )C 
188 6 Special Linear Systems

( j) ( j)
and gk,i , h k,i , k = 1, 2, are the corresponding partitions of fi and yi , respectively.
( j)
Observing that N2 is centrosymmetric (see Definition 6.4), then from Lemma 6.2
we reduce (6.39) to the two independent linear systems

(Im + E m M j )(h 1,i + E m h 2,i ) = (g1,i + E m g2,i ),


(6.40)
(Im − E m M j )(h 1,i − E m h 2,i ) = (g1,i − E m g2,i ).

( j) has all its leading principal minors positive, these two systems
Since P2m N2 P2m
may be simultaneously solved using Gaussian elimination without partial pivoting.
( j)
Once h 1 and E m h 2 are obtained, yi is readily available.
 
( j)
( j) y2i−1 − Er H j E m h 2,i
yi = ( j−1) . (6.41)
y2i − H j h 1,i

A Banded Toeplitz Solver for Symmetric Positive Definite Systems


with Positive Definite Associated Circulant Matrices: Algorithm 6.11
The last banded Toeplitz solver we discuss in this section is applicable only to those
symmetric positive definite systems that have positive definite associated circulant
matrices. In other words, we assume that both A in (6.20) and à in (6.21) are symmet-
ric positive definite. Hence, from Lemma 6.4 and Theorem 6.3, row ı̂ of the Cholesky
factor of A will converge to the coefficients of ψ(ξ ), where ı̂ < n. Consider the
Cholesky factorization (6.17), A = LL , where

⎛ ⎞
S1
⎜ ⎟
⎜ ⎟
⎜ V1 S2 ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
L=⎜

Vi−1 Si ⎟

⎜ ⎟
⎜ V S ⎟
⎜ ⎟
⎜ ⎟
⎜ ⎟
⎜ V S ⎟
⎝ ⎠

(6.42)
in which ı̂ = mi + 1, S j and V j ∈ Rm×m are lower and upper triangular, respectively,
and S, V are Toeplitz and given by
6.2 Banded Toeplitz Linear Systems Solvers 189

⎛ ⎞ ⎛ ⎞
⎜ β0 ⎟ ⎜ β m β m−1 β1 ⎟
⎜ ⎟ ⎜ ⎟
⎜ β1 ⎟ ⎜ β2 ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
S=⎜ ⎟ and V = ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ β m−1 ⎟
⎜ ⎟ ⎜ ⎟
⎝ ⎠ ⎝ ⎠
β m−1 β1 β0 βm

Equating both sides of (6.17), we get

B = SS + VV ,

and
C = SV ,

where B, and C are given in (6.20), note that in this case D = C  . Hence, A can be
expressed as
 
Im
A = R R + VV (Im , 0) (6.43)
0

where R  ∈ Rn×n is the Toeplitz lower triangular matrix

⎛ ⎞
S
⎜ ⎟
⎜ ⎟
⎜ V S ⎟
⎜ ⎟
⎜ ⎟
R =⎜




⎜ ⎟
⎜ V S ⎟
⎜ ⎟
⎝ ⎠
V S

(6.44)


m
Assuming that an approximate Hurwitz factor, ψ̃(ξ ) = β̃ j ξ j , is obtained from
j=0
the Cholesky factorization of A, i.e.,
190 6 Special Linear Systems

⎛ ⎞
⎜ S̃ ⎟
⎜ ⎟
⎜ Ṽ ⎟
⎜ S̃ ⎟
⎜ ⎟
⎜ ⎟
R̃ = ⎜



⎜ ⎟
⎜ Ṽ S̃ ⎟
⎜ ⎟
⎝ ⎠
Ṽ S̃

is available, the solution of the linear system (6.28) may be computed via the
Sherman-Morrison-Woodbury formula [14], e.g. see [65],
 
−1 −1 Ṽ
x=F f −F Q −1 [Ṽ  , 0]F −1 f (6.45)
0

where
F = R̃  R̃,

and  
 −1 Ṽ
Q = Im + (Ṽ , 0)F .
0

Computing x in (6.45) may then be organized as follows:


(a) Solve the linear systems Fv = f , via the forward and backward sweeps R̃  w =
f and R̃v = w. Note that algorithm BTS (Algorithm 3.5) which can be used
for these forward and backward sweeps corresponds to Theorem 3.4. Further, by
computing R̃ −1 e1 , we completely determine R̃ −1 .
(b) Let L̃ 1 = R̃ − I0m and v1 = (Im , 0)v, and simultaneously form T = L̃ 1 Ṽ and
u = Ṽ  v1 . Thus, we can easily compute Q = Im + T  T .
(c) Solve the linear system Qa = u and obtain the column vector b = T a. See
Theorem 4.1 and Algorithms 3.1 and 3.2 in Sect. 3.2.1.
(d) Finally, solve the Toeplitz triangular system R̃c = b, and obtain the solution
x = v − c.
From the relevant basic lemmas and theorems we used in this section, we see that,
given the approximation to the Hurwitz factor R̃, obtaining the solution x in (6.45)
requires (10 + 6 log m) log n + O(m) parallel arithmetic operations using no more
than mn processors, and only n + O(m 2 ) storage locations, the lowest among all
three algorithms in this section.
Since the lowest number of parallel arithmetic operations for obtaining S̃ and Ṽ
via the Cholesky factorization of A using no less than m 2 processors is O(ı̂), the cost
of computing the Hurwitz factorization can indeed be very high unless ı̂ n. This
will happen only if the Hurwitz factor ψ(ξ ) has its nearest root to the unit circle of
magnitude well above 1.
6.2 Banded Toeplitz Linear Systems Solvers 191

Next, we present an alternative to the Cholesky factorization. A simple modifica-


tion of the banded Toeplitz positive definite systems solver (Algorithm 6.10), yields
an efficient method for computing the Hurwitz factor. Let r̃ = 2k p  m(i + 1) =
ı̂ + m − 1. Then, from (6.17), (6.36)–(6.42) we see that

E m Mk E m = S − S −1 C. (6.46)

In other words, the matrices M j in (6.39) converge to a matrix M (say), and the ele-
ments β̃ j , 0  j  m −1, of S̃ are obtained by computing the Cholesky factorization
of   −1
 −1 −1 0
E m C M E m = (0, Im )A j−1 .
Im

Now, the only remaining element of Ṽ , namely β̃m is easily obtained as αm /β̃0 ,
which can be verified from the relation C = SV .
Observing that we need only to compute the matrices M j , Algorithm 6.10 may be
(2r )
modified so that in any stage j, we solve only the linear systems A j z j = e1 , where
j = 0, 1, 2, . . ., and 2r = 2 j+1 p. Since we assume that convergence takes place in
the kth stage, where r̃ = 2k p  n/2, the number of parallel arithmetic operations
required for computing the Hurwitz factor is approximately 6mk  6m log(n/2 p)
using 2m r̃  mn processors.
Using positive definite Toeplitz systems with bandwidths 5, i.e. m = 2. Lemma 6.3
states that all sufficiently large pentadiagonal Toeplitz matrices of the form [1, σ, δ,
σ, 1] are positive definite provided that the symbol function φ(eiθ ) = δ + 2σ cos θ +
2 cos 2θ has no negative values. For positive δ, this condition is satisfied when (σ, δ)
corresponds to a point on or above the lower curve in Fig. 6.1. The matrix is diagonally
dominant when the point lies above the upper curve. For example, if we choose test
matrices for which δ = 6 and 0  σ  4; every point on the dashed line in Fig. 6.1
represents one of these matrices. In [59], numerical experiments were conducted to

Fig. 6.1 Regions in the 10


δ = 2+σ
(σ, δ)-plane of positive δ = 2–σ
definiteness and diagonal 8
dominance of pentadiagonal
Toeplitz matrices
(1, σ, δ, σ, 1) 6
δ


⎪ 2
2 δ = ⎨ (σ + 8) ⁄ 4 if σ ≤ 4
⎪ 2( σ – 1) if σ ≥ 4

0
-4 -2 0 2 4
σ
192 6 Special Linear Systems

Table 6.1 Parallel arithmetic operations, number of processors, and overall operation counts for
Algorithms 6.9, 2 and 3
Parallel arith. ops Number of Overall ops. Storage
processors
Algorithm 6.9 6 log n 2n 10n log n 2n
a Algorithm 6.10 18 log n 4n 4n log n 2n
b Algorithm 6.11 16 log n 2n 12n log n n
a The 2-by-2 linear systems (6.40) are solved via Cramer’s rule
b Algorithm 6.11 does not include the computation of the Hurwitz factors

compare the relative errors in the computed solution achieved by the above three
algorithm and other solvers. Algorithm 6.9 with only one step of iterative refinement
semed to yield the lowest relative error.
Table 6.1, summarizes the number of parallel arithmetic operations, and the num-
ber of required processors for each of the three pentadiagonal Toeplitz solvers (Algo-
rithms 6.9, 6.10, and 6.11). In addition, Table 6.1 lists the overall number of arithmetic
operations required by each solver if implemented on a uniprocessor, together with
the required storage. Here, the pentadiagonal test matrices are of order n, with the
various entries showing only the leading term.
It is clear, however, that implementation details of each algorithm on a given
parallel architecture will determine the cost of internode communications, and the
cost of memory references within each multicore node. It is such cost, rather than the
cost of arithmetic operations, that will determine the most scalable parallel banded
Toeplitz solver on a given architecture.

6.3 Symmetric and Antisymmetric Decomposition (SAS)

The symmetric-and-antisymmetric (SAS) decomposition method is aimed at certain


classes of coefficient matrices. Let A and B ∈ Rn×n , satisfy the relations A = PAP
and B = −PBP where P is a signed symmetric permutation matrix, i.e. a permutation
matrix in which its nonzero elements can be either 1 or −1, with P 2 = I . We call
such a matrix P a reflection matrix, and the matrices A and B are referred to as
reflexive and antireflexive, respectively, e.g. see [73–75]. The above matrices A and
B include centrosymmetric matrices C as special cases since a centrosymmetric
matrix C satisfies the relation CE = EC where the permutation matrix E is given
by Definition 6.3, i.e. the only nonzero elements of E are ones on the cross diagonal
with E 2 = I . In this section, we address some fundamental properties of such special
matrices A and B starting with the following basic definitions:
• Symmetric and antisymmetric vectors. Let P be a reflection matrix of order n. A
vector x ∈ Rn is called symmetric with respect to P if x = P x. Likewise, we say
a vector z ∈ Rn is antisymmetric if z = −Pz.
6.3 Symmetric and Antisymmetric Decomposition (SAS) 193

• Reflexive and antireflexive matrices. A matrix A ∈ Rn×n is said to be reflexive (or


antireflexive) with respect to a reflection matrix of P of order n if A = PAP (or
A = −PAP).
• A matrix A ∈ Rn×n is said to possess the SAS (or anti-SAS) property with respect
to a reflection matrix P if A is reflexive (or antireflexive) with respect to P.
Theorem 6.5 Given a reflection matrix P of order n, any vector b ∈ Rn can be
decomposed into two parts, u and v, such that

u+v =b (6.47)

where
u = Pu and v = −Pv. (6.48)

The proof is readily established by taking u = 21 (b + Pb) and v = 21 (b − Pb).


Corollary 6.1 Given a reflection matrix P of order n, any matrix A ∈ Rn×n can be
decomposed into two parts, U and V, such that

U +V = A (6.49)

where
U = PUP and V = −PVP. (6.50)

Proof Similar to Theorem 6.5, the proof is easily established if one takes U =
2 (A + PAP) and V = 2 (A − PAP).
1 1

Theorem 6.6 Given a linear system Ax = f , A ∈ Rn×n , and f , x ∈ Rn , with A


nonsingular and reflexive with respect to some reflection matrix P, then A−1 is also
reflexive with respect to P and x is symmetric (or antisymmetric) with respect to P
if and only if f is symmetric (or antisymmetric) with respect to P.
Proof Since P is a reflection matrix, i.e. P −1 = P, we have

A−1 = (PAP)−1 = PA−1 P. (6.51)

Therefore, A−1 is reflexive with respect to P. Now, if x is symmetric with respect


to P, i.e. x = P x, we have

f = Ax = PAPPx = PAx = P f, (6.52)

Further, since A is reflexive with respect to P, and f is symmetric with respect to


P, then from f = P f , Ax = f , and (6.51) we obtain

x = A−1 f = PA−1 P f = PA−1 f = P x. (6.53)

completing the proof.


194 6 Special Linear Systems

Corollary 6.2 Given a linear system Ax = f , A ∈ Rn×n , and f , x ∈ Rn , with


A nonsingular and antireflexive with respect to some reflection matrix P, then A−1
is also antireflexive with respect to P and x is antisymmetric (or symmetric) with
respect to P if and only if f is symmetric (or antisymmetric) with respect to P.
Theorem 6.7 Given two matrices A and B where A, B ∈ Rn×n , the following rela-
tions hold:
1. if both A and B are reflexive with respect to P, then

(α A)(β B) = P(α A)(β B)P and (α A + β B) = P(α A + β B)P;

2. if both A and B are antireflexive with respect to P, then

(α A)(β B) = P(α A)(β B)P and (α A + β B) = −P(α A + β B)P; and

3. if A is reflexive and B is antireflexive, or vise versa, with respect to P, then

(α A)(β B) = −P(α A)(β B)P

where α and β are scalars.


Three of the most common forms of P are given by the Kronecker products,

P = Er ⊗ ±E s , P = Er ⊗ ±Is , P = Ir ⊗ ±E s , (6.54)

where Er is as defined earlier, Ir is the identity of order r , and r × s = n.

6.3.1 Reflexive Matrices as Preconditioners

Reflexive matrices, or low rank perturbations of reflexive matrices, arise often in


several areas in computational engineering such as structural mechanics, see for
example [75]. As an illustration, we consider solving the linear system Az = f
where the stiffness matrix A ∈ R2n×2n is reflexive, i.e., A = PAP, and is given by,
 
A11 A12
A= , (6.55)
A21 A22

in which the reflection matrix P is of the form,


 
0 P1
(6.56)
P1 0

with P1 being some signed permutation matrix of order n, for instance. Now, consider
the orthogonal matrix
6.3 Symmetric and Antisymmetric Decomposition (SAS) 195
 
1 I −P1
X=√ . (6.57)
2 P1 I

Instead of solving Az = f directly, the SAS decomposition method leads to the


linear system,
à x̃ = f˜ (6.58)

where à = X  AX, x̃ = X  x, and f˜ = X  f . It can be easily verified that à is of


the form,  
A11 + A12 P1 0
à = . (6.59)
0 A22 − A21 P1

From (6.56), it is clear that the linear system has been decomposed into two inde-
pendent subsystems that can be solved simultaneously. This decoupling is a direct
consequence of the assumption that the matrix A is reflexive with respect to P.
In many cases, both of the submatrices A11 + A12 P1 and A22 − A21 P1 still
possess the SAS property, with respect to some other reflection matrix. For exam-
ple, in three-dimensional linear isotropic or orthotropic elasticity problems that are
symmetrically discretized using rectangular hexahedral elements, the decomposition
can be further carried out to yield eight independent subsystems each of order n/4,
and not possessing the SAS property; e.g. see [73, 75]. Now, the smaller decoupled
subsystems in (6.58) can each be solved by either direct or preconditioned iterative
methods offering a second level of parallelism.
While for a large number of orthotropic elasticity problems, the resulting stiffness
matrices can be shown to possess the special SAS property, some problems yield
stiffness matrices A that are low-rank perturbations of matrices that possess the
SAS property. As an example consider a three-dimensional isotropic elastic long
bar with asymmetry arising from the boundary conditions. Here, the bar is fixed at
its left end, ξ = 0, and supported by two linear springs at its free end, ξ = L,
as shown in Fig. 6.2. The spring elastic constants K 1 and K 2 are different. The
dimensionless constants and material properties are given as: length L, width b,
height c, Young’s Modulus E, and the Poisson’s Ratio ν. The loading applied to

P
z P z

M y
z c
K2
K1

b
L

Fig. 6.2 Prismatic bar with one end fixed and the other elastically supported
196 6 Special Linear Systems

this bar is a uniform simple bending moment M across the cross section at its right
end, and a concentrated force P at L , −b 2 , 2 . For finite element discretization, we
c

use the basic 8-node rectangular hexahedral elements to generate an N1 × N2 × N3


grid in which all discretized elements are of identical size. Here, Nd (d = 1, 2, or 3)
denotes the number of grid spacings in each direction d. The resulting system stiffness
matrix is not globally SAS-decomposable due to the presence of the two springs of
unequal strength. This problem would become SAS-decomposable, however, into
four independent subproblems (the domain is symmetrical only about two axes) if
the springs were absent. In other words, the stiffness matrix is SAS-decomposable
into only two independent subproblems if the two springs had identical stiffness.
Based on these observations, we consider splitting the stiffness matrix A into two
parts, U and W , in which U is reflexive and can be decomposed into four disjoint
submatrices via the SAS technique, and W is a rank-2 perturbation that contains
only the stiffness contributions from the two springs. Using the reflexive matrix U
as a preconditioner for the Conjugate Gradient method for solving a system of the
form Ax = f , convergence is realized in no more than three iterations to obtain an
approximate solution with a very small relative residual.

6.3.2 Eigenvalue Problems

If the matrix A, in the eigenvalue problem Ax = λx, possesses the SAS property,
it can be shown (via similarity transformations) that the proposed decomposition
approach can be used for solving the problem much more efficiently. For instance,
if P is of the form
P = E 2 ⊗ E n/2 (6.60)

with A = PAP partitioned as in (6.55), then using the orthogonal matrix,


 
1 I −E n/2
X=√ . (6.61)
2 E n/2 I

we see that X  AX is a block-diagonal matrix consisting of the two blocks: A1 =


A11 + A12 E n/2 , and A2 = A22 − A21 E n/2 .
Thus first, all the eigenvalues of the original matrix A can be obtained from those
of the decomposed submatrices A1 and A2 , which are of lower order than A. Fur-
ther, the amount of work for computing those eigenpairs is greatly reduced even for
sequential computations. Second, the extraction of the eigenvalues of the different
submatrices can be performed independently, which implies that high parallelism
can be achieved. Third, the eigenvalues of each submatrix are in general better sep-
arated, which implies faster convergence for schemes such as the QR algorithm
(e.g., see [76]).
6.3 Symmetric and Antisymmetric Decomposition (SAS) 197

To see how much effort can be saved, we consider the QR iterations for a real-
valued full matrix (or order N ). In using the QR iterations for obtaining the eigenpairs
of a matrix, we first reduce this matrix to the upper Hessenberg form (or a tridiagonal
matrix for symmetric problems). On a uniprocessor, the reduction step takes about
cN 3 flops [14] for some constant c. If the matrix A satisfying PAP = A can be
decomposed into four submatrices each of order N /4; then the amount of floating
point operations required in the reduction step is reduced to cN 3 /16. In addition,
because of the fully independent nature of the four subproblems, we can further
reduce the computing time on parallel architectures.
Depending on the form of the signed reflection matrix P, several similarity trans-
formations can be derived for this special class of matrices A = PAP. Next, we
present another computationally useful similarity transformation.
Theorem 6.8 ([75]) Let A ∈ Rn×n be partitioned as (Ai, j ), i,j = 1,2, and 3 with
A11 and A33 of order r and A22 of order s, where 2r + s = n. If A = PAP where P
is of the form ⎛ ⎞
0 0 P1
P = ⎝ 0 Is 0 ⎠
P1 0 0

in which P1 is some signed permutation matrix of order r, then there exists an orthog-
onal matrix,
⎛ ⎞
I √0 −P1
1 ⎝
X=√ 0 2Is 0 ⎠ . (6.62)
2 P 0 I
1

that yields X  AX as a block-diagonal matrix of the form,


⎛ √ ⎞
A11√+ A12 P1 2 A12 0
⎝ 2 A21 A22 0 ⎠. (6.63)
0 0 
A33 − A31 P1

It should be noted that if A is symmetric, then both diagonal blocks in (6.63) are
also symmetric. The same argument holds for the two diagonal blocks in (6.59).
Note that the application of the above decomposition method can be extended to
the generalized eigenvalue problem Ax = λBx if B also satisfies the SAS property,
namely B = PBP.

6.4 Rapid Elliptic Solvers

In this section we consider the parallelism in the solution of linear systems with
matrices that result from the discretization of certain elliptic partial differential equa-
tions. As it turns out, there are times when the equations, boundary conditions and
198 6 Special Linear Systems

discretization are such that the structure of the coefficient matrix for the linear system
allow one to develop very fast direct solution methods, collectively known as Rapid
Elliptic Solvers (RES for short) or Fast Poisson Solvers. The terms are due to the
fact that their computational complexity on a uniprocessor is only O(mn log mn) or
less for systems of mn unknowns compared to the O((mn)3 ) complexity of Gaussian
elimination for dense systems.
Here we are concerned with direct methods, in the sense that in the absence of
roundoff the solvers return an exact solution. When well implemented, RES are
faster than other direct and iterative methods [77, 78]. The downside is their limited
applicability: RES can only be used directly for special elliptic PDEs (meaning the
equation, domain of definition) under suitable discretization; moreover, their perfor-
mance in general depends on the boundary conditions and problem size. On the other
hand, there are many cases when RES cannot be used directly but can be helpful as
preconditioners. Interest in the design and implementation of parallel algorithms for
RES started in the early 1970s; see Sect. 6.4.9 for some historical notes. Theoretically,
parallel RES can solve the linear systems under consideration in O(log mn) parallel
operations on O(mn) processors instead of the fastest but impractical algorithm [79]
for general linear systems that requires O(log2 mn) parallel operations on O((mn)4 )
processors. In the sequel we describe and evaluate the properties of some interesting
algorithms from this class.

6.4.1 Preliminaries

The focus of this chapter is on parallel RES for the following model problem:
⎛ ⎞⎛u ⎞ ⎛ f1

T −I 1
⎜ −I T −I ⎟⎜ u ⎟ ⎜ f2 ⎟
⎟⎜ ⎟ ⎜ ⎟
2

⎜ . . . ⎜
⎟⎜ .⎟. ⎜ .. ⎟
⎜ . . . . . . ⎟⎜ ⎟ . =⎜ .⎟ ⎟, (6.64)
⎜ ⎟⎜ ⎟ ⎜ ⎟
⎜ .. ⎟
⎝ −I T −I ⎠ ⎝ ... ⎠ ⎝ .⎠
−I T un fn

where T = [−1, 4, −1]m and the unknowns and right-hand side u, f are conformally
partitioned into subvectors u i = (υi,1 , . . . , υi,m ) ∈ Rm and f i = (φi,1 , . . . , φi,m ) .
See Sect. 9.1 of Chap. 9 for some details regarding the derivation of this system but
also books such as [80, 81]. We sometimes refer to the linear system as discrete
Poisson system and to the block-tridiagonal matrix A in (6.64) as Poisson matrix.
Note that this is a 2-level Toeplitz matrix, that is a block-Toeplitz matrix with Toeplitz
blocks [82]. The 2-level Toeplitz structure is a consequence of constant coefficients
in the differential equation and the Dirichlet boundary conditions. It is worth noting
that the discrete Poisson system (6.64) can be reordered and rewritten as
6.4 Rapid Elliptic Solvers 199

[−In , T̃ , −In ]ũ = f˜.

This time the Poisson matrix has m blocks of order n each. The two systems are
equivalent; for that reason, in the algorithms that follow, the user can select the
formulation that minimizes the cost. For example, if the cost is modeled by T p =
γ1 log m log n + γ2 log2 n for positive γ1 , γ2 , then it is preferable to choose n ≤ m.
The aforementioned problem is a special version of the block-tridiagonal system

Au = f, where A = [W, T, W ]n , (6.65)

where T, W ∈ Rm×m are symmetric and commute under multiplication so that they
are simultaneously diagonalizable by the same orthogonal similarity transformation.
Many of the parallel methods we discuss here also apply with minor modifications
to the solution of system (6.65).
It is also of interest to note that A is of low block displacement rank. In particular
if we use our established notation for matrices Im and Jn then

A − (Jn ⊗ Im ) A(Jn ⊗ Im ) = XY + YX , (6.66)

where
 
1
X= T, W, 0, . . . , 0 , Y = (Im , 0, . . . , . . . , 0) .
2

Therefore the rank of the result in (6.66) is at most 2m. This can also be a starting point
for constructing iterative solvers using tools from the theory of low displacement rank
matrices; see for example [57, 82].

6.4.2 Mathematical and Algorithmic Infrastructure

We define the Chebyshev polynomials that are useful here and in later chapters. See
also refs. [83–86].
Definition 6.6 The degree-k Chebyshev polynomial of the 1st kind is defined as:

cos(k arccos ξ ) when |ξ | ≤ 1
Tk (ξ ) =
cosh(karccoshξ ) when |ξ | > 1.

The following recurrence holds:

Tk+1 (ξ ) = 2ξ Tk (ξ ) − Tk−1 (ξ )
where T0 (ξ ) = 1, T1 (ξ ) = ξ.
200 6 Special Linear Systems

The degree-k modified Chebyshev polynomial of the 2nd kind is defined as:

⎪ sin((k+1)θ) ξ
⎨ sin(θ) where cos(θ ) = 2 when 0 ≤ ξ < 2
Uˆk (ξ ) = k+1 when ξ = 2

⎩ sinh((k+1)ψ) where cosh(ψ) = ξ
sinh(ψ) 2 when ξ > 2.

The following recurrence holds:

Uˆk+1 (ξ ) = ξ Uˆk (ξ ) − Uˆk−1 (ξ )


where Uˆ0 (ξ ) = 1, Uˆ1 (ξ ) = ξ.

Because of the special structure of matrices A and T , several important quantities,


like their inverse, eigenvalues and eigenvectors, can be expressed analytically; cf. [83,
87–89].
Proposition 6.4 Let T = [−1, α, −1]n and α ≥ 2. For j ≥ i
 sinh(iξ ) sinh((n− j+1)ξ ) α
−1 
sinh(ξ ) sinh((n+1)ξ ) where cosh(ξ ) = 2 when α > 2
(T )i, j = n− j+1
i n+1 when α = 2.

The eigenvalues and eigenvectors of T are given by


 

λ j = α − 2 cos , (6.67)
m+1
     
2 jπ jmπ
qj = sin , . . . , sin , j = 1, . . . , m.
m+1 m+1 m+1

A similar formula, involving Chebyshev polynomials can be written for the inverse
of the Poisson matrix.
Proposition 6.5 For any nonsingular T ∈ Rm×m , the matrix A = [−I, T, −I ]n is
nonsingular if and only if Uˆn (T ) is nonsingular. Then, A−1 can be written as a block
matrix that has, as block (i, j) the order m submatrix

−1 Uˆn−1 (T )Uˆi−1 (T )Uˆn− j (T ), j ≥ i,
(A )i, j = (6.68)
Uˆn−1 (T )Uˆj−1 (T )Uˆn−i (T ), i ≥ j.

The eigenvalues and eigenvectors of the block-tridiagonal matrix A in Proposi-


tion 6.5 can be obtained using the properties of Kronecker products and sums; cf.
[14, 90].
Multiplication of the matrix of eigenvectors Q = (q1 , . . . , qm ) of T with a vector
y = (ψ1 , . . . , ψm ) amounts to computing the elements
6.4 Rapid Elliptic Solvers 201
  
2
m
ij π
ψ j sin , i = 1, . . . , m.
m+1 m+1
j=1

Therefore this multiplication is essentially equivalent to the discrete sine transform


(DST) of y. The DST and its inverse can be computed in O(m log m) operations
using FFT-type methods. For this, in counting complexities in the remainder of this
section we do not distinguish between the forward and inverse transforms.
The following operations are frequently needed in the context of RES:
(1.i, 1.ii) Given Y = (y1 , . . . , ys ) ∈ Rm×s , solve TX = Y , that is solve a linear
system with banded coefficient matrix for (i) s = 1 or (ii) s > 1 right-hand sides;
(1.iii) Given y ∈ Rm and scalars μ1 , . . . , μd solve the d linear systems (T −
μ j I )x j = y;
(1.iv) Given Y ∈ Rm×d and scalars μ j as before solve the d linear systems
(T − μ j I )x j = y j ;
(1.v) Given Y and scalars μ j as before, solve the sd linear systems (T −μ j I )x (k)
j =
yk , j = 1, . . . , d and k = 1, . . . , s;
(2) Given Y ∈ Rm×n compute the discrete Fourier-type transform (DST, DCT,
DFT) of all columns or of all rows.
For an extensive discussion of the fast Fourier and other discrete transforms used in
RES see the monograph [19] and the survey [72] and references therein for efficient
FFT algorithms.

6.4.3 Matrix Decomposition

Matrix decomposition (MD) refers to a large class of methods for solving the model
problem we introduced earlier as well as more general problems, including (6.65);
cf. [91]. As was stated in [92] “It seldom happens that the application of L processors
would yield an L-fold increase in efficiency relative to a single processor, but that
is the case with the MD algorithm.” The first detailed study of parallel MD for the
Poisson equation was presented in [93]. The Poisson matrix can be written as a
Kronecker sum of Toeplitz tridiagonal matrices. Specifically,

A = In ⊗ T̃m + T̃n ⊗ Im , (6.69)

where T̃k = [−1, 2, −1]k for k = m, n. To describe MD, we make use of this
representation. We also use the vec operator: Acting on a matrix Y = (y1 , . . . , yn ),
it returns the vector that is formed by stacking the columns of X , that is
202 6 Special Linear Systems
⎛ ⎞
y1
⎜ .. ⎟
vec(y1 , . . . , yn ) = ⎝ . ⎠ .
yn

The unvecn operation does the reverse: For a vector of mn elements, it selects its
n contiguous subvectors of length m and returns the matrix with these as columns.
Finally, the permutation matrix Πm,n ∈ Rmn×mn is defined as the unique matrix,
sometimes called vec-permutation, such that vec(A) = Πm,n vec(A ), where A ∈
Rm×n .
MD algorithms consist of 3 major stages. The first amounts to transforming A
and f , then solving a set of independent systems, followed by back transforming to
compute the final solution. These 3 stages are characteristic of MD algorithms; cf.
[91]. Moreover, directly or after transformations, they consist of several independent
subproblems that enable straightforward implementation on parallel architectures.
Denote by Q x the matrix of eigenvectors of T̃m . In the first stage, both sides of
the discrete Poisson system are multiplied by the block-diagonal orthogonal matrix
In ⊗ Q x . We now write

(In ⊗ Q   
x )(In ⊗ T̃m + T̃n ⊗ Im )(In ⊗ Q x )(In ⊗ Q x )u = (In ⊗ Q x ) f
(In ⊗ Λ̃m + T̃n ⊗ Im )(In ⊗ Q  
x )u = (In ⊗ Q x ) f. (6.70)

The computation of (In ⊗ Q  


x ) f amounts to n independent multiplications with Q x ,
one for each subvector of f .
Using the vec-permutation matrix Πm,n , (6.70) is transformed to

(Λ̃m ⊗ In + Im ⊗ T̃n )(Πm,n (In ⊗ Q  


x )u) = Πm,n (In ⊗ Q x ) f. (6.71)
  
B

(m)
Matrix B is block-diagonal with diagonal blocks of the form T̃n + λi I where
(m) (m)
λ1 , . . . , λm are the eigenvalues of T̃m . Recall also from Proposition 6.4 that the
eigenvalues are computable from closed formulas. Therefore, the transformed system
is equivalent to m independent subsystems of order n, each of which has the same
structure as T̃n . These are solved in the second stage of MD. The third stage of MD
consists of the multiplication of the result of the previous stage with In ⊗ Q x . From
(6.71), it follows that

u = (In ⊗ Q x )Πm,n B −1 Πm,n (In ⊗ Q 
x ) f. (6.72)

where Πm,n is as before.


The steps that we described can be applied to more general systems with coefficient
matrices of the form T1 ⊗ W2 + W1 ⊗ T2 where the order n matrices T1 , W1 commute
and W2 , T2 are order m; cf. [91]. Observe also that if we only need few of the u i s,
6.4 Rapid Elliptic Solvers 203

it is sufficient to first compute Πm,n B −1 Π 


m,n (In ⊗ Q x ) f and then apply Q x to the
appropriate subvectors.
Each of the 3 stages of the MD algorithm consists of independent operations
with matrices of order m and n rather than mn. In the case of the discrete Poisson
system, the multiplications with Q x can be implemented with fast, Fourier-type
transformations implementing the DST. We list the resulting method as Algorithm
6.12 and refer to it as MD- Fourier. Its cost on a uniprocessor is T1 = O(mn log m).
Consider the cost of each step on p = mn processors. Stage (I) consists of n
independent applications of a length m DST at cost O(log m). Stage (II) amounts to
the solution of m tridiagonal systems of order n. Using cr or paracr (see Sect. 5.5
of Chap. 5), these are solved in O(log n) steps. Stage (III) has a cost identical to stage
(I). The total cost of MD- Fourier on mn processors is T p = O(log m +log n). If the
number of processors is p < mn, the cost for stages (I, III) becomes O( np m log m).
In stage (II), if the tridiagonal systems are distributed among the processors, and
a sequential algorithm is applied to solve each system, then the cost is O( mp n).
Asymptotically, the dominant cost is determined by the first and last stages with a
total cost of T p = O( mn p log n).

Algorithm 6.12 MD- Fourier: matrix decomposition method for the discrete Pois-
son system.
Input: Block tridiagonal matrix A = [−Im , T, Im ]n , where T = [−1, 4, −1]n and the right-hand
side f = ( f 1 ; . . . ; f n )
//Stage I: apply fast DST on each subvector f 1 , . . . , f n
1: doall j = 1 : n
2: fˆj = Q 
x fj
3: set F̂ = ( fˆ1 , . . . , fˆn )
4: end
//Stage II: implemented with suitable solver (Toeplitz, tridiagonal, multiple shifts, multiple
right-hand sides)
5: doall i = 1 : m
(m)  and store the result in ith row of a temporary matrix Û ;
6: compute (T̃n + λi I )−1 F̂i,:
λ(m) (m)
1 , . . . , λm are the eigenvalues of T̃m
7: end
//Stage III: apply fast DST on each column of Û
8: doall j = 1 : n
9: u j = Q x Û:, j
10: end

6.4.4 Complete Fourier Transform

When both T̃m and T̃n in (6.69) are diagonalizable with Fourier-type transforms, it
becomes possible to solve Eq. (6.64) using another approach, called the complete
204 6 Special Linear Systems

Fourier transform method (CFT) that we list as Algorithm 6.13 [94]. Here, instead
of (6.72) we write the solution of (6.64) as

U = (Q y ⊗ Q x )(In ⊗ Λ̃m + Λ̃n ⊗ Im )−1 (Q  


y ⊗ Q x ) f,

where matrix (In ⊗ Λ̃m + Λ̃n ⊗ Im ) is diagonal. Multiplication by Q  


y ⊗ Q x amounts
to performing n independent DSTs of length m and m independent DSTs of length
n; similarly for Q y ⊗ Q x . The middle stage is an element-by-element division by
the diagonal of In ⊗ Λ̃m + Λ̃n ⊗ Im that contains the eigenvalues of A.
CFT is rich in operations that can be performed in parallel. As in MD, we dis-
tinguish three stages, the first and last of which are a combination of two steps, one
consisting of n independent DSTs of length m each, the other consisting of m inde-
pendent DSTs. The middle step consists of mn independent divisions. The total cost
on mn processors is T p = O(log m + log n).

Algorithm 6.13 CFT: complete Fourier transform method for the discrete Poisson
system.
Input: Block tridiagonal matrix A = [−Im , T, Im ]n , where T = [−1, 4, −1]n and the right-hand
side f
//The right-hand side arranged as the m × n matrix F = ( f 1 , . . . , f n )
Output: Solution u = (u 1 ; . . . ; u n )
//Stage Ia: DST on columns of F
1: doall j = 1 : n
2: fˆj = Q  x fj
3: set F̂ = ( fˆ1 , . . . , fˆn )
4: end //Stage Ib: DST on rows of F̂
5: doall i = 1 : m
6: F̃i,: = Q  y F̂i,:
7: end //Stage II: elementwise division of F̃ by eigenvalues of A
8: doall i = 1 : m
9: doall j = 1 : n
10: F̃i, j = F̂i, j /(λi(m) + λ(n)
j )
11: end
12: end //Stage IIIa: DST on rows of F̃
13: doall i = 1 : m
14: F̂i,: = Q y F̃i,:
15: end //Stage IIIb: DST on columns of F̂
16: doall j = 1 : n
17: u j = Q x F̂:, j
18: end
6.4 Rapid Elliptic Solvers 205

6.4.5 Block Cyclic Reduction

BCR for the discrete Poisson system (6.65) is a method that generalizes the cr
algorithm (cf. 5.5) for point tridiagonal systems (cf. Chap. 5) while taking advantage
of its 2-level Toeplitz structure; cf. [95]. BCR is more general than Fourier-MD in the
sense that it does not require knowledge of the eigenstructure of T nor does it deploy
Fourier-type transforms. We outline it next for the case that the number of blocks
is n = 2k − 1; any other value can be accommodated following the modifications
proposed in [96].
For steps r = 1, . . . , k − 1, adjacent blocks of equations are combined in groups
of 3 to eliminate 2 blocks of unknowns; in the first step, for instance, unknowns from
even numbered blocks are eliminated and a reduced system with block-tridiagonal
coefficient matrix A(1) = [−I, T 2 − 2I, −I ]2k−1 −1 , containing approximately only
half of the blocks remains. The right-hand side is transformed accordingly. Setting
T (0) = T , f (0) = f , and T (r ) = (T (r −1) )2 − 2I , the reduced system at the r th step
is

[−I, T (r ) , −I ]2k−r −1 u (r ) = f (r ) ,

where f (r ) = vec[ f 2(rr .1) , . . . , f 2(rr .(2


)
k−r −1) ] and

(r ) (r −1) (r −1) (r −1)


f j2r = f j2r −2r −1 + f j2r +2r −1 + T (r −1) f j2r . (6.73)

The key observation here is that the matrix T (r ) can be written as


 
T
T (r ) = 2T2r . (6.74)
2

From the closed form expressions of the eigenvalues of T and the roots of the Cheby-
shev polynomials of the 1st kind T2r , and relation (6.74), it is straightforward to show
that the roots of the matrix polynomial T (r ) are
 
(r ) (2 j − 1)
ρi = 2 cos π . (6.75)
2r +1

Therefore, the polynomial in product form is

(r ) (r )
T (r ) = (T − ρ2r I ) · · · (T − ρ1 I ). (6.76)

The roots (cf. (6.75)) are distinct therefore the inverse can also be expressed in terms
of the partial fraction representation of the rational function 1/2T2r ( ξ2 ):
206 6 Special Linear Systems

r

2
(r ) (r )
(T (r ) )−1 = γi (T − ρi I )−1 . (6.77)
i=1

From the analytic expression for T (r ) in (6.74) and standard formulas, the partial
fraction coefficients are equal to
 
(r ) 1 (2i − 1)π
γi = (−1) i+1
sin , i = 1, . . . , 2r .
2r 2r +1

The partial fraction approach for solving linear systems with rational matrix coeffi-
cients is discussed in detail in Sect. 12.1 of Chap. 12. Here we list as Algorithm 6.14
(simpleSolve_PF),
  one version that can be applied to solving systems of the form
d
j=1 (T − ρ j I ) x = b for mutually distinct ρ j ’s. This will be applied to solve
any systems with coefficient matrix such as (6.76).

 
d
Algorithm 6.14 simpleSolve_PF: solving j=1 (T − ρ j I ) x = b for mutually
distinct values ρ j from partial fraction expansions.
Input: T ∈ Rm×m , b ∈ Rm and distinct values {ρ1 , . . . , ρd } none of them equal to an eigenvalue
of T .  −1
d
Output: Solution x = j=1 (T − ρ j I ) b.
1: doall j = 1 : d
2: compute coefficient γ j =  1 , where p(ζ ) = dj=1 (ζ − ρ j )
p (τ j )
3: solve (T − ρ j I )x j = b
4: end
5: set c = (γ1 , . . . , γd ) , X = (x1 , . . . , xd )
6: compute and return x = X c

Next implementations of block cyclic reduction are presented that deploy the
product (6.76) and partial fraction representations (6.77) for operations with T (r ) .
After r < k − 1 block cyclic reduction steps the system has the form:
⎛ (r ) ⎞
f 2r
⎜ (r ) ⎟
⎜ f 2·2r ⎟
[−I, T (r ) , −I ]u (r ) = f (r ) , where f (r ) = ⎜
⎜ .. ⎟.
⎟ (6.78)
⎝ . ⎠
(r )
f (2k−r −1)·2r

Like the Poisson matrix, T (r ) is block-tridiagonal, it has the same eigenvectors as


T and its eigenvalues are available analytically from the polynomial form of T (r )
and the eigenvalues of T . If we stop after step r = r̂ and compute u (r̂ ) , then a back
substitution stage consists of solving consecutive block-diagonal systems to compute
the missing subvectors, as shown next for r = r̂ , r̂ − 1, . . . , 1.
6.4 Rapid Elliptic Solvers 207

⎛ ⎞ ⎛ (r −1) ⎞
u 1·2r −1 f 1·2r −1 + u 1·2r
⎜ ⎟ ⎜ f 3·2r −1 + u 3·2r −2r −1 + u 3·2r +2r −1 ⎟
(r −1)
⎜ u 3·2r −1 ⎟ ⎜ ⎟
diag[T (r −1) ] ⎜ .. ⎟=⎜
⎜ .. ⎟.

⎝ . ⎠ ⎝ . ⎠
u (2r̂ −r +1 −1)·2r −1 (r −1)
f (2r̂ −r +1 −1)·2r −1 + u (2r̂ −r +1 −1)·2r −1 −2r −1

If r̂ = k − 1 reduction steps are applied, a single tridiagonal system remains,

(k−1)
T (k−1) u 2k−1 = f 2k−1 , (6.79)

so back substitution can be used to recover all the subvectors.


As described, the algorithm is called Cyclic Odd-Even Reduction and Factoriza-
tion (CORF) and listed as Algorithm 6.15 [95]. Consider its cost on a uniprocessor.
Each reduction step r = 1, . . . , k − 1, requires 2k−r − 1 independent matrix-vector
multiplications, each with T (r −1) (in product form); cf. (6.73). To compute the “mid-
dle subvector” u 2k−1 from (6.79), requires solving 2k tridiagonal systems. Finally,
each step r = k, k − 1, . . . , 1, of back substitution, consists of solving 2k−r indepen-
dent tridiagonal systems with coefficient matrix T (r −1) . The cost of the first and last
stages is O(nm log n) while the cost of solving (6.79) is O(mn), for an overall cost of
T1 = O(mn log n). Note that the algorithm makes no use of fast transforms to achieve
its low complexity. Instead, this depends on the factored form of the T (r ) term. In both
the reduction and back substitution stages, CORF involves the independent applica-
tion in multiplications and system solves respectively, of matrices T (r ) ; see lines 3
and 9 respectively. CORF, therefore, provides two levels of parallelism, one from
the individual matrix operations, and another from their independent application.
Parallelism at this latter level, however, varies considerably. In the reduction, the

Algorithm 6.15 CORF: Block cyclic reduction for the discrete Poisson system
Input: Block tridiagonal matrix A = [−Im , T, Im ]n , where T = [−1, 4, −1]n and the right-hand
side f . It is assumed that n = 2k − 1.
Output: Solution u = (u 1 ; . . . ; u n )
//Stage I: Reduction
1: do r = 1 : k − 1
2: doall j = 1 : 2k−r − 1
(r ) (r −1) (r −1) (r −1)
3: f j2r = f j2r −2r −1 + f j2r +2r −1 + T (r −1) f j2r
//multiply T (r −1) f 2(rr −1)
j exploiting the product form (6.76)
4: end
5: end
//Stage II: Solution by back substitution
(k−1)
6: Solve T (k−1) u 2k−1 = f 2k−1
7: do r = k − 1 : −1 : 1
8: doall j = 1 : 2k−r
(r −1)
9: solve T (r −1) u (2 j−1)·2r −1 = f (2 j−1)·2r −1 + u (2 j−1)·2r −1 −2r −1 + u (2 j−1)·2r −1 +2r −1
10: end
11: end
208 6 Special Linear Systems

number of independent matrix operations is halved at every step, ranging from 2^{k−1} − 1 down to 1. In back substitution, the number is doubled at every step, from 1 up to 2^{k−1}. So in the first step of back substitution only one system, whose coefficient matrix is a product of 2^{k−1} matrices, is solved. Therefore, in the later reduction and earlier back substitution steps, the opportunity for independent manipulations using the product form representation diminishes dramatically and eventually disappears. In the back substitution stage (II), this problem can be resolved by replacing the product form (6.76), when solving with T^{(r)}, with the partial fraction representation (6.77) of its inverse. In particular, the solution is computed via Algorithm 6.14.
On mn processors, the first step of back substitution (line 6) can be implemented in O(log m) operations to solve each system in parallel and then O(log n) operations to sum the solutions by computing their linear combination using the appropriate partial fraction coefficients as weight vector. The next steps of back substitution consist of the loop in lines 7–11. The first loop iteration involves the solution of two systems, each with coefficient matrix T^{(k−2)}. Using the partial fraction representation for T^{(k−2)}, the solution of each system can be written as the sum of 2^{k−2} vectors that are solutions of tridiagonal systems. Using no more than mn processors, this step can be accomplished in O(log m) parallel steps to solve all the systems and log(n/2) for the summation. Continuing in this manner for all steps of back substitution, its cost is O(log n(log n + log m)) parallel operations. On the other hand, during the reduction stage, the multiplication with T^{(r)} involves 2^r consecutive multiplications with tridiagonal matrices of the form T − ρI, and these multiplications cannot readily be carried out independently. In the last reduction step, for example, one must compute T^{(k−2)} times a single vector. This takes O(n) parallel operations, irrespective of the number of processors, since the multiplications are performed consecutively. Therefore, when m ≈ n, the parallel cost lower bound of CORF is O(n), causing a performance bottleneck.
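To make the reduction stage concrete, the following NumPy sketch (ours, not from the text) performs a single block cyclic reduction step on a small discrete Poisson system and checks that the even-indexed subvectors of the true solution satisfy the reduced system (6.78) for r = 1. It assumes the standard BCR relation T^{(1)} = T^2 − 2I for the reduced diagonal block; the grid sizes are arbitrary illustrative choices.

import numpy as np

def poisson_blocks(m, n):
    # Assemble A = [-I_m, T, -I_m]_n with T = [-1, 4, -1]_m (dense, for checking only).
    T = 4 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
    return A, T

m, k = 5, 3
n = 2**k - 1                                    # n = 7 block unknowns
A, T = poisson_blocks(m, n)
rng = np.random.default_rng(0)
f = [rng.standard_normal(m) for _ in range(n)]  # f[0] = f_1, ..., f[n-1] = f_n
u = np.linalg.solve(A, np.concatenate(f)).reshape(n, m)

T1 = T @ T - 2 * np.eye(m)                      # assumed relation T^{(1)} = T^2 - 2I
for j in range(2, n, 2):                        # 1-based even block indices j = 2, 4, 6
    f1_j = f[j - 2] + f[j] + T @ f[j - 1]       # line 3 of Algorithm 6.15 with r = 1
    lhs = T1 @ u[j - 1]                         # block row of [-I, T^{(1)}, -I] at u_j
    if j - 2 >= 1:
        lhs -= u[j - 3]                         # neighbor u_{j-2}
    if j + 2 <= n:
        lhs -= u[j + 1]                         # neighbor u_{j+2}
    assert np.allclose(lhs, f1_j)
print('reduced system (6.78) verified for r = 1')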
An even more serious problem that hampers CORF is that the computation of
(6.73) leads to catastrophic errors. To see that, first observe that the eigenvalues of
T^{(r)} are

2\,T_{2^r}\!\left(\tfrac{\lambda_j}{2}\right), \quad j = 1, \ldots, m, \quad \text{where } \lambda_j = 4 - 2\cos\!\left(\tfrac{\pi j}{m+1}\right).

Therefore all eigenvalues of T lie in the interval (2, 6), so the largest eigenvalue of T^{(r)} will be of the order of 2 T_{2^r}(3 − δ) for some small δ. This is very large even for moderate values of r, since T_{2^r} is known to grow very large for arguments of
magnitude greater than 1. Therefore, computing (6.73) will involve the combination
of elements of greatly varying magnitude causing loss of information. Fortunately,
both the computational bottleneck and the stability problem in the reduction phase
can be overcome. In fact, as we show in the sequel, resolving the stability problem
also enables enhanced parallelism.

BCR Stabilization and Parallelism


The loss of information in BCR is reduced and its numerical properties are considerably improved if one applies a modification proposed by Buneman; we refer to [95] for a detailed discussion of the scheme and a proof of its numerical properties. The idea is, instead of computing f^{(r)} directly, to express it in terms of two vectors, p^{(r)} and q^{(r)}, updated so that at every step they satisfy
f^{(r)}_{j\cdot 2^r} = f^{(r-1)}_{j\cdot 2^r - 2^{r-1}} + f^{(r-1)}_{j\cdot 2^r + 2^{r-1}} + T^{(r-1)} f^{(r-1)}_{j\cdot 2^r}
             = T^{(r)} p^{(r)}_{j\cdot 2^r} + q^{(r)}_{j\cdot 2^r}.

In one version of this scheme, the recurrence (6.73) for f^{(r)}_{j\cdot 2^r} is replaced by the recurrences

p^{(r)}_j = p^{(r-1)}_j - (T^{(r-1)})^{-1}\left( p^{(r-1)}_{j-2^{r-1}} + p^{(r-1)}_{j+2^{r-1}} - q^{(r-1)}_j \right),
q^{(r)}_j = q^{(r-1)}_{j-2^{r-1}} + q^{(r-1)}_{j+2^{r-1}} - 2 p^{(r)}_j.

The steps of stabilized BCR are listed as Algorithm 6.16 (BCR). Note that because
the sequential costs of multiplication and solution with tridiagonal matrices are both
linear, the number of arithmetic operations in Buneman stabilized BCR is the same
as that of CORF.

Algorithm 6.16 BCR: Block cyclic reduction with Buneman stabilization for the discrete Poisson system
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f. It is assumed that n = 2^k − 1.
Output: Solution u = (u_1; . . . ; u_n)
//Initialization
1: p^{(0)}_j = 0_{m,1} and q^{(0)}_j = f_j (j = 1 : n)
//Stage I: Reduction. Vectors with subscript 0 or 2^k are taken to be 0
2: do r = 1 : k − 1
3:   doall j = 1 : 2^{k−r} − 1
4:     p^{(r)}_{j·2^r} = p^{(r−1)}_{j·2^r} − (T^{(r−1)})^{−1} ( p^{(r−1)}_{j·2^r−2^{r−1}} + p^{(r−1)}_{j·2^r+2^{r−1}} − q^{(r−1)}_{j·2^r} )
5:     q^{(r)}_{j·2^r} = q^{(r−1)}_{j·2^r−2^{r−1}} + q^{(r−1)}_{j·2^r+2^{r−1}} − 2 p^{(r)}_{j·2^r}
6:   end
7: end
//Stage II: Solution by back substitution. It is assumed that u_0 = u_{2^k} = 0
8: Solve T^{(k−1)} û_{2^{k−1}} = q^{(k−1)}_{2^{k−1}}, then set u_{2^{k−1}} = û_{2^{k−1}} + p^{(k−1)}_{2^{k−1}}
9: do r = k − 1 : −1 : 1
10:  doall j = 1 : 2^{k−r}
11:    solve T^{(r−1)} û_{(2j−1)·2^{r−1}} = q^{(r−1)}_{(2j−1)·2^{r−1}} − (u_{(2j−1)·2^{r−1}−2^{r−1}} + u_{(2j−1)·2^{r−1}+2^{r−1}})
12:    u_{(2j−1)·2^{r−1}} = û_{(2j−1)·2^{r−1}} + p^{(r−1)}_{(2j−1)·2^{r−1}}
13:  end
14: end

In terms of operations with T^{(r)}, the reduction phase (line 4) now consists only of applications of (T^{(r)})^{-1}, so parallelism is enabled by utilizing the partial fraction representation (6.77) as in the back substitution phase of CORF. Therefore, solutions with coefficient matrix T^{(r)} and 2^{k-r} − 1 right-hand sides for r = 1, . . . , k − 1 can be accomplished by solving 2^r independent tridiagonal systems for each right-hand side and then combining the partial solutions by multiplying 2^{k-r} − 1 matrices, each of size m × 2^r, with the vector of 2^r partial fraction coefficients. Therefore, Algorithm 6.16 can be efficiently implemented on parallel architectures using partial fractions to solve one or more independent linear systems with coefficient matrix T^{(r)} in lines 4, 8 and 11. If we assume that the cost of solving a tridiagonal system of order m using m processors is τ(m), then with P = mn processors the parallel cost of BCR is approximately the same for the two stages: it is easy to see that there is a total of 2kτ(m) + k^2 + O(k) operations. If we use paracr to solve the tridiagonal systems, the cost becomes 16 log n log m + log^2 n + O(log n). This is somewhat more than the cost of parallel Fourier-MD and CFT, but BCR is applicable to a wider range of problems, as we noted earlier.
Historically, the invention of stabilized BCR by Buneman preceded the parallelization of BCR based on partial fractions [97, 98]. In light of the preceding discussion, we can view the Buneman scheme as a method that resolves the parallelization bottleneck in the reduction stage of CORF while also curing the instability; it is thus an example where the introduction of multiple levels of parallelism stabilizes a numerical process. This is interesting, especially in view of discussions regarding the interplay between numerical stability and parallelism; cf. [99].
It was assumed so far, for convenience, that n = 2^k − 1 and that BCR was applied to the discrete Poisson problem originating from a PDE with Dirichlet boundary conditions. For general values of n or other boundary conditions, the reduction can be shown to generate more general matrix rational functions, with a numerator of nonzero degree that is smaller than that of the denominator. The systems with these matrices are then solved using the more general method listed as Algorithm 12.3 of Chap. 12. This not only enables parallelization but also eliminates the need for multiplications with the numerator polynomial; cf. the discussion in the Notes and References Sect. 6.4.9. In terms of the kernels of Sect. 6.4.2, one calls kernels from category (1.iii) (when r = k − 1) or (1.v) for other values of r.
It is also worth noting that the use of partial fractions in these cases is numerically safe; cf. [100] as well as the discussion in Sect. 12.1.3 of Chap. 12 for more details on the numerical issues that arise when using partial fractions.
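The “parallelism via partial fractions” idea is easy to demonstrate numerically. The sketch below (an illustration under stated assumptions, not the book's code) solves a system with T^{(r)} both directly and as a combination of independent shifted tridiagonal solves; it assumes the BCR relation T^{(r)} = (T^{(r-1)})^2 − 2I starting from T^{(0)} = T, and it obtains the roots and residues of the scalar polynomial numerically rather than from the closed forms in (6.76) and (6.77).

import numpy as np
from numpy.polynomial import polynomial as P

def matpolyval(coeffs, M):
    # Evaluate the matrix polynomial sum_k coeffs[k] * M^k by Horner's rule.
    R = np.zeros_like(M)
    for c in reversed(coeffs):
        R = R @ M + c * np.eye(M.shape[0])
    return R

m, r = 8, 3
T = 4 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)   # T = [-1, 4, -1]_m

p = np.array([0.0, 1.0])                    # scalar p_0(x) = x (coefficients, low order first)
for _ in range(r):                          # p_r(x) = p_{r-1}(x)^2 - 2, so T^{(r)} = p_r(T)
    p = P.polysub(P.polymul(p, p), [2.0])

Tr = matpolyval(p, T)                       # dense T^{(r)}, for the reference solve only
rho = P.polyroots(p)                        # 2^r distinct roots (all real here)
alpha = 1.0 / P.polyval(rho, P.polyder(p))  # residues of 1/p_r at its roots

b = np.random.default_rng(1).standard_normal(m)
x_direct = np.linalg.solve(Tr, b)
# The 2^r shifted solves below are mutually independent: this is the parallelism.
x_pf = sum(a * np.linalg.solve(T - z * np.eye(m), b) for a, z in zip(alpha, rho))
print(np.linalg.norm(x_direct - x_pf.real) / np.linalg.norm(x_direct))

The relative difference printed is at roundoff level for these sizes; in a parallel setting each of the 2^r shifted solves would be assigned to a different processor or group of processors.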

6.4.6 Fourier Analysis-Cyclic Reduction

There is another way to resolve the computational bottleneck of CORF and BCR and to stabilize the process: monitor the reduction stage, terminate it before the available parallelism is greatly reduced and accuracy is compromised, and switch to another method for the smaller system. The algorithm is thus made to adapt to the available computational resources while yielding acceptable solutions.
The method presented next is in this spirit but combines early stopping with the
MD-Fourier technique. The Fourier analysis-cyclic reduction method (FACR) is a
hybrid method consisting of l block cyclic reduction steps as in CORF, followed
by MD-Fourier for the smaller block-tridiagonal system and back substitution to
compute the final solution. To account for limiting the number of reduction steps,
l, the method is denoted by FACR(l). It is based on the fact that at any step of the
reduction stage of BCR, the coefficient matrix is A^{(r)} = [−I, T^{(r)}, −I]_{2^{k−r}−1} and, since T^{(r)} = 2T_{2^r}(T/2) has the same eigenvectors as T, with eigenvalues 2T_{2^r}(λ_i^{(m)}/2), the reduced system can be solved using Fourier-MD. If reduction is applied without stabilization, l must be small. Several analyses of the computational complexity of the algorithm (see e.g. [101]) indicate that for l ≈ log log m, the sequential
complexity is O(mn log(log n)). Therefore, properly designed FACR is faster than
MD-Fourier and BCR. In practice, the best choice for l depends on the relative
performance of the underlying kernels and other characteristics of the target computer
platform. The parallel implementation of all steps can proceed using the techniques
deployed for BCR and MD; FACR(l) can be viewed as an alternative to partial
fractions to avoid the parallel implementation bottleneck that was observed after a
few steps of reduction in BCR. For example, the number of systems solved in parallel in BCR can be monitored in order to trigger a switch to MD-Fourier before they become so few as not to make full use of the available parallel resources.

6.4.7 Sparse Selection and Marching

In many applications using a matrix or matrix function as an operator acting on a


sparse vector, one often desires to obtain only very few elements of the result (probing). In such a case, it is possible to realize great savings compared to when all elements of the result are required. We have already seen this in Sect. 6.1, where the DFT of vectors with only 2 nonzero elements was computed with BLAS1 (_AXPY) operations between columns of a Vandermonde matrix instead of the FFT. We mentioned then that this was inspired by the work in [26] in the context of RES. The situation is especially favorable when the matrix is structured and there are closed formulas for the inverse, as is the case for the Poisson matrix, which can be computed, albeit at high serial cost. A case where there could be further significant reductions in cost is when only few elements of the solution are sought. The next proposition shows that the number of arithmetic operations to compute c^⊤Ab when the matrix A is dense and the vectors b, c are sparse depends only on the sparsity of the vectors.
Proposition 6.6 Given an arbitrary matrix A and sparse vectors b, c, then c^⊤Ab can be computed with leading cost 2 nnz(b) nnz(c) operations.

Proof Let P_b and P_c be the permutations that order b and c so that their nonzero elements are listed first in P_b b and P_c c respectively. Then

c^⊤Ab = (P_c c)^⊤ P_c A P_b^⊤ P_b b = \begin{pmatrix} c_{nz}^⊤ & 0 \end{pmatrix} \begin{pmatrix} Â & 0 \\ 0 & 0_{μ̂,ν̂} \end{pmatrix} \begin{pmatrix} b_{nz} \\ 0 \end{pmatrix} = c_{nz}^⊤ Â\, b_{nz},

where Â is of dimension nnz(c) × nnz(b), μ̂ = nnz(c), ν̂ = nnz(b), and b_{nz} and c_{nz} are the subvectors of nonzero elements of b and c. This can be computed in

min((2 nnz(b) − 1) nnz(c), (2 nnz(c) − 1) nnz(b))

operations, proving the result. This cost can be reduced even further if A is also sparse.
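As a concrete illustration (ours; the function and variable names are hypothetical), the following sketch evaluates c^⊤Ab by touching only the rows and columns of A selected by the nonzero entries of c and b, which is exactly the block Â used in the proof.

import numpy as np

def sparse_bilinear(A, b_idx, b_val, c_idx, c_val):
    # c^T A b for b, c given in (index, value) form;
    # the cost is O(nnz(b) * nnz(c)) once the required entries of A are accessible.
    return c_val @ A[np.ix_(c_idx, b_idx)] @ b_val

rng = np.random.default_rng(2)
n = 1000
A = rng.standard_normal((n, n))
b_idx, c_idx = np.array([3, 17, 250]), np.array([5, 999])
b_val, c_val = rng.standard_normal(3), rng.standard_normal(2)

b = np.zeros(n); b[b_idx] = b_val
c = np.zeros(n); c[c_idx] = c_val
assert np.isclose(c @ A @ b, sparse_bilinear(A, b_idx, b_val, c_idx, c_val))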

From the explicit formulas in Proposition 6.4 we obtain the following result.
Proposition 6.7 Let T = [−1, α, −1]_m be a tridiagonal Toeplitz matrix and consider the computation of ξ = c^⊤T^{−1}b for sparse vectors b, c. Then ξ can be computed in O(nnz(b) nnz(c)) arithmetic operations.
This holds if the participating elements of T^{−1} are already available or if they can be computed in O(nnz(b) nnz(c)) operations as well.
When we seek only k elements of the solution vector, the cost becomes O(k nnz(b)). From Proposition 6.5 it follows that these ideas can also be used to
reduce costs when applying the inverse of A. To compute only u_n, we write

u_n = C^⊤ A^{−1} f, where C = (0, 0, . . . , I_m)^⊤ ∈ R^{(mn)×m},

hence

u_n = \sum_{j=1}^{n} Û_n^{-1}(T)\, Û_{j-1}(T)\, f_j \qquad (6.80)

since Û_0(T) = I. It is worth noting that on uniprocessors or computing platforms with limited parallelism, instead of the previous explicit formula, it is more economical to compute u_n as the solution of the following system:

Û_n(T) u_n = f_1 − [T, −I, 0, . . . , 0] H^{−1} f̄, where f̄ = [f_2, . . . , f_n]^⊤,



where

H = \begin{pmatrix}
-I & T & -I & & \\
   & -I & T & -I & \\
   &    & \ddots & \ddots & \ddots \\
   &    &        & -I & T \\
   &    &        &    & -I
\end{pmatrix}.

The term H^{−1}f̄ can be computed first using a block recurrence, each step of which consists of a multiplication of T with a vector and some other simple vector operations, followed by the solution with Û_n(T), which can be computed by utilizing its product form or the partial fraction representation of its inverse. Either way, this requires solving n linear systems with coefficient matrices that are simple shifts of T, a case of kernel (1.iv).
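A quick numerical check of (6.80) is given below (our sketch). It assumes, consistently with the use of Û_0(T) = I above, that the polynomials Û_j of Definition 6.6 obey the three-term recurrence Û_0(x) = 1, Û_1(x) = x, Û_{j+1}(x) = x Û_j(x) − Û_{j−1}(x), so that the matrices Û_j(T) can be generated by the same recurrence.

import numpy as np

m, n = 4, 6
T = 4 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
I = np.eye(m)

# U[j] = \hat{U}_j(T), generated by the assumed three-term recurrence.
U = [I, T.copy()]
for _ in range(2, n + 1):
    U.append(T @ U[-1] - U[-2])

A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), I)
rng = np.random.default_rng(3)
f = [rng.standard_normal(m) for _ in range(n)]
u = np.linalg.solve(A, np.concatenate(f)).reshape(n, m)

# (6.80): u_n = \hat{U}_n(T)^{-1} sum_j \hat{U}_{j-1}(T) f_j
rhs = sum(U[j - 1] @ f[j - 1] for j in range(1, n + 1))
u_n = np.linalg.solve(U[n], rhs)
assert np.allclose(u_n, u[-1])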
If the right-hand side is also very sparse, so that f_j = 0 except for l values of the index j, then the sum for u_n consists of only l block terms. Further cost reductions are possible if the nonzero f_j's are also sparse and if we only seek a few elements of u_n. One way to compute u_n based on these observations is to diagonalize T as in the Fourier-MD methods, applying the DFT to each of the nonzero f_j terms. Further savings are possible if these f_j's are also sparse, since the DFTs could then be computed directly as BLAS, without recourse to FFT kernels.
After the transform, the coefficients in the sum (6.80) are diagonal matrices with entries Û_n^{-1}(λ_i^{(m)}) Û_{j-1}(λ_i^{(m)}), i = 1, . . . , m. The terms can be computed directly from Definition 6.6. Therefore, each term of the sum (6.80) is the result of the element-by-element multiplication of the aforementioned diagonal with the vector f̂_j, which is the DFT of f_j. Thus

u_n = Q \sum_{j=1}^{n} \begin{pmatrix} Û_n^{-1}(λ_1^{(m)})\, Û_{j-1}(λ_1^{(m)})\, f̂_{j,1} \\ \vdots \\ Û_n^{-1}(λ_m^{(m)})\, Û_{j-1}(λ_m^{(m)})\, f̂_{j,m} \end{pmatrix}.

Diagonalization decouples the computations and facilitates parallel implementa-


tion as in the case of the Fourier-MD approach. Specifically, with O(mn) processors, vector u_n can be computed at the cost of O(log m) operations to perform the n independent length-m transforms of the f_j, j = 1, . . . , n, followed by a few parallel operations to prepare the partial subvectors, O(log n) operations to add them, and O(log m) operations for the back transform, for a total of O(log n + log m) parallel arithmetic operations. This cost can be lowered even further when most of the f_j's are zero.
The use of u_n to obtain u_{n−1} = −f_n + T u_n and then the values of the previous subvectors using the simple block recurrence

u_{j−1} = −f_j + T u_j − u_{j+1}, \quad j = n − 1, . . . , 2, \qquad (6.81)



is an example of what is frequently referred to as marching (in fact, here we march


backwards). The overall procedure of computing u n as described and then the remain-
ing subvectors from the block recurrence is known to cost only O(mn) operations
on a uniprocessor. Unfortunately, the method is unstable. A much better approach,
called generalized marching, was described in [83]. The idea is to partition the original system (6.64) into subproblems that are sufficiently small in size (to prevent instability
from manifesting itself). We do not describe this approach any further but note that
the partitioning into subproblems is a form of algebraic domain decomposition that
results in increased opportunities for parallelism, since each block recurrence can
be evaluated independently and thus assigned to a different processor. Generalized
marching is one more case where partitioning, and the hierarchical parallelism it induces, increases the available parallelism while also reducing the risk
of instability. In fact, generalized marching was the inspiration behind the stabilizing
transformations that also led from the Givens based parallel tridiagonal solver of
[102] to the Spike algorithm; cf. our discussion in Sects. 5.2 and 5.5.
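The instability of plain backward marching is easy to observe with a small experiment such as the following sketch (ours): starting from the exact u_n, it recovers u_{n−1} from the last block row and then applies (6.81); the recovered subvectors drift rapidly away from those of a direct solve as the recursion proceeds.

import numpy as np

m, n = 8, 20
T = 4 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
rng = np.random.default_rng(4)
f = rng.standard_normal((n, m))                       # rows are f_1, ..., f_n
u = np.linalg.solve(A, f.reshape(-1)).reshape(n, m)   # reference solution

v = np.empty_like(u)
v[n - 1] = u[n - 1]                                   # exact u_n
v[n - 2] = T @ v[n - 1] - f[n - 1]                    # u_{n-1} from the last block row
for j in range(n - 1, 1, -1):                         # (6.81): j = n-1, ..., 2 (1-based)
    v[j - 2] = -f[j - 1] + T @ v[j - 1] - v[j]
err = [np.linalg.norm(v[i] - u[i]) for i in range(n)]
print(f'error at u_(n-1): {err[-2]:.1e}   error at u_1: {err[0]:.1e}')

Even though the recurrence is exact in exact arithmetic, rounding errors are amplified at every step, which is why generalized marching restricts the recursion to small subproblems.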

6.4.8 Poisson Inverse in Partial Fraction Representation

Proposition 6.5, in combination with partial fractions, makes possible the design of a method for computing all or selected subvectors of the solution vector of the discrete Poisson system.
From Definition 6.6, the roots of the Chebyshev polynomial Û_n are

ρ_j = 2 \cos\left(\frac{jπ}{n+1}\right), \quad j = 1, . . . , n.

From Proposition 6.5, we can write the subvector u_i of the solution of the discrete Poisson system as

u_i = \sum_{j=1}^{n} (A^{-1})_{ij} f_j
    = \sum_{j=1}^{i-1} Û_n^{-1}(T)\, Û_{j-1}(T)\, Û_{n-i}(T)\, f_j + \sum_{j=i}^{n} Û_n^{-1}(T)\, Û_{i-1}(T)\, Û_{n-j}(T)\, f_j.

Because each block (A^{-1})_{i,j} is a rational matrix function with denominator of degree n and numerator of smaller degree, and the roots of the denominator are distinct, (A^{-1})_{i,j} can be expressed as a partial fraction sum of n terms.


u_i = \sum_{j=1}^{n} \sum_{k=1}^{n} γ^{(i)}_{j,k} (T − ρ_k I)^{-1} f_j \qquad (6.82)
    = \sum_{k=1}^{n} (T − ρ_k I)^{-1} \sum_{j=1}^{n} γ^{(i)}_{j,k} f_j \qquad (6.83)
    = \sum_{k=1}^{n} (T − ρ_k I)^{-1} f̃^{(i)}_k, \qquad (6.84)

where F = (f_1, . . . , f_n) ∈ R^{m×n}, and ρ_1, . . . , ρ_n and γ^{(i)}_{j,1}, . . . , γ^{(i)}_{j,n} are the roots of the denominator and the partial fraction coefficients for the rational polynomial for (A^{-1})_{i,j}, respectively. Let G^{(i)} = [γ^{(i)}_{j,k}] be the matrix that contains along each row the partial fraction coefficients (γ^{(i)}_{j,1}, . . . , γ^{(i)}_{j,n}) used in the inner sum of (6.82). In column format G^{(i)} = (g^{(i)}_1, . . . , g^{(i)}_n), where, in (6.84), f̃^{(i)}_k = F g^{(i)}_k. Based on this formula, it is straightforward to compute the u_i's as is shown in Algorithm 6.17; cf. [103]. Algorithm EES offers abundant parallelism and all u_i's can be evaluated

Algorithm 6.17 EES: Explicit Elliptic Solver for the discrete Poisson system
Input: Block tridiagonal matrix A = [−I_m, T, −I_m]_n, where T = [−1, 4, −1]_m, and the right-hand side f = (f_1; . . . ; f_n).
Output: Solution u = (u_1; . . . ; u_n) //the method can be readily modified to produce only selected subvectors
1: Compute the n roots ρ_k of Û_n(x) in (6.82)
2: Compute the coefficients γ^{(i)}_{j,k} in (6.83)
3: doall i = 1 : n
4:   doall k = 1 : n
5:     compute f̃^{(i)}_k = \sum_{j=1}^{n} γ^{(i)}_{j,k} f_j
6:     compute ū^{(i)}_k = (T − ρ_k I)^{-1} f̃^{(i)}_k
7:   end
8:   compute u_i = \sum_{k=1}^{n} ū^{(i)}_k
9: end

in O(log n + log m). For this, however, O(n^3 m) processors appear to be necessary, which is too high. An O(n) reduction in the processor count is possible by first observing that for each i, the matrix G^{(i)} of partial fraction coefficients is the sum of a Toeplitz matrix and a Hankel matrix, and then using fast multiplications with these special matrices; cf. [103].

MD-Fourier Based on Partial Fraction Representation of the Inverse


We next show that EES can be reorganized in such a way that its steps are the same as those of MD-Fourier, that is, independent DSTs and tridiagonal solves, and thus few or all u_i's can be computed in O(log n + log m) time using only O(mn) processors. This is not surprising, since MD-Fourier is also based on an explicit formula for the Poisson matrix inverse (cf. (6.72)), but it has some interest as it reveals a direct connection between MD-Fourier and EES.
The elements and structure of G^{(i)} are key to establishing the connection. For i ≥ j (the case i < j can be treated similarly)

γ^{(i)}_{j,k} = \frac{Û_{j-1}(ρ_k)\, Û_{n-i}(ρ_k)}{Û'_n(ρ_k)} = \frac{\sin(jθ_k)\, \sin((n+1−i)θ_k)}{\sin^2 θ_k \; Û'_n(ρ_k)} \qquad (6.85)

where Û'_n denotes the derivative of Û_n. From standard trigonometric identities it follows that for i ≥ j the numerator of (6.85) is equal to

\sin(jθ_k)\, \sin((n+1−i)θ_k) = (−1)^{k+1} \sin\frac{jkπ}{n+1}\, \sin\frac{ikπ}{n+1}. \qquad (6.86)

The same equality holds when i < j since relation (6.86) is symmetric with respect
to i and j. So from now on we use this numerator irrespective of the relative ordering
of i and j. With some further algebraic manipulations it can be shown that

Û'_n(x)\big|_{x=ρ_k} = (−1)^{k+1} \frac{n+1}{2 \sin^2 θ_k}.

From (6.85) and (6.86) it follows that the elements of G^{(i)} are

γ^{(i)}_{j,k} = \frac{2}{n+1}\, \sin\frac{jkπ}{n+1}\, \sin\frac{ikπ}{n+1}.

Hence, we can write

G^{(i)} = \frac{2}{n+1}
\begin{pmatrix}
\sin\frac{π}{n+1} & \cdots & \sin\frac{nπ}{n+1} \\
\vdots & \ddots & \vdots \\
\sin\frac{nπ}{n+1} & \cdots & \sin\frac{n^2 π}{n+1}
\end{pmatrix}
\begin{pmatrix}
\sin\frac{iπ}{n+1} & & \\
 & \ddots & \\
 & & \sin\frac{inπ}{n+1}
\end{pmatrix}.

We can thus write the factorization

G^{(i)} = Q D^{(i)}, \qquad (6.87)

where the matrix Q and the diagonal matrix D^{(i)} are

Q = \sqrt{\frac{2}{n+1}} \left[ \sin\frac{jkπ}{n+1} \right]_{j,k}, \qquad
D^{(i)} = \sqrt{\frac{2}{n+1}}\, \mathrm{diag}\!\left( \sin\frac{iπ}{n+1}, . . . , \sin\frac{inπ}{n+1} \right).

Observe that multiplication of a vector with the symmetric matrix Q is a DST. From
(6.83) it follows that


u_i = \sum_{k=1}^{n} (T − ρ_k I)^{-1} h^{(i)}_k \qquad (6.88)

where h^{(i)}_k is the kth column of F G^{(i)}. Setting H^{(i)} = (h^{(i)}_1, . . . , h^{(i)}_n) and recalling that Q encodes a DST, it is preferable to compute by rows

(H^{(i)})^⊤ = (F Q D^{(i)})^⊤ = D^{(i)} Q F^⊤.

This amounts to applying m independent DSTs, one for each row of F. These can be accomplished in O(log n) steps if O(mn) processors are available. Following that and the multiplication with D^{(i)}, it remains to solve n independent tridiagonal systems, each with coefficient matrix (T − ρ_k I) and right-hand side consisting of the kth column of H^{(i)}. With O(nm) processors this can be done in O(log m) operations. Finally, u_i is obtained by adding the n partial results, in O(log n) steps using the same number of processors. Therefore, the overall parallel cost for computing any u_i is O(log n + log m) operations using O(mn) processors.
It is actually possible to compute all subvectors u_1, . . . , u_n without exceeding the O(log n + log m) parallel cost while still using O(mn) processors. Set Ĥ = F Q and denote the columns of this matrix by ĥ_k. These terms do not depend on the index i. Therefore, in the above steps, the multiplication with the diagonal matrix D^{(i)} can be deferred until after the solution of the independent linear systems in (6.88). Specifically, we first compute the columns of the m × n matrix

H̆ = [(T − ρ_1 I)^{-1} ĥ_1, . . . , (T − ρ_n I)^{-1} ĥ_n].

We then multiply each of the columns of H̆ with the appropriate diagonal element of D^{(i)} and then sum the partial results to obtain u_i. However, this is equivalent to multiplying H̆ with the column vector consisting of the diagonal elements of D^{(i)}. Let us call this vector d^{(i)}; then from (6.87) it follows that the matrix (d^{(1)}, . . . , d^{(n)}) is equal to the DST matrix Q. Therefore, the computation of

H̆ (d^{(1)}, . . . , d^{(n)}) = H̆ Q,

can be performed using independent DSTs on the m rows of H̆. Finally, note that the coefficient matrices of the systems we need to solve are given by

T − ρ_k I = T − 2\cos\left(\frac{kπ}{n+1}\right) I = \left[−1,\; 2 + 2 − 2\cos\left(\frac{kπ}{n+1}\right),\; −1\right]_m = T̃_m + λ^{(n)}_k I.

It follows that the major steps of EES can be implemented as follows. First apply m independent DSTs, one for each row of F, and assemble the results in matrix Ĥ. Then solve n independent tridiagonal linear systems, each corresponding to the application of (T − ρ_k I)^{-1} to the kth column of Ĥ. If the results are stored in H̆, finally compute the m independent DSTs, one for each row of H̆. The cost is T_p = O(log m + log n) using O(mn) processors. We summarize as follows:
Proposition 6.8 The direct application of Proposition 6.5 to solve (6.64) is essentially equivalent to the Fourier-MD method, in particular with a reordered version in which the DSTs are first applied across the rows (rather than the columns) of the matrix of right-hand sides and the independent tridiagonal systems are T̃_m + λ^{(n)}_k I_m instead of T̃_n + λ^{(m)}_k I_n.
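The reordered computation of Proposition 6.8 fits in a few lines of NumPy (our sketch; the DST is applied here as an explicit dense matrix Q rather than with a fast transform, and the sizes are arbitrary):

import numpy as np

m, n = 6, 9
T = 4 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)        # T = [-1, 4, -1]_m
A = np.kron(np.eye(n), T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), np.eye(m))
F = np.random.default_rng(5).standard_normal((m, n))        # F = (f_1, ..., f_n)

k = np.arange(1, n + 1)
Q = np.sqrt(2.0 / (n + 1)) * np.sin(np.outer(k, k) * np.pi / (n + 1))  # orthonormal DST matrix
rho = 2.0 * np.cos(k * np.pi / (n + 1))                     # roots of \hat{U}_n

H_hat = F @ Q                                               # m row DSTs
H_brev = np.column_stack([np.linalg.solve(T - rho[j] * np.eye(m), H_hat[:, j])
                          for j in range(n)])               # n independent shifted solves
U = H_brev @ Q                                              # m row DSTs again; U = (u_1, ..., u_n)

u_direct = np.linalg.solve(A, F.T.reshape(-1)).reshape(n, m).T
assert np.allclose(U, u_direct)

Each of the three stages (row DSTs, the n shifted tridiagonal solves, row DSTs again) consists of independent tasks, which is what yields the O(log m + log n) parallel cost quoted above.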

6.4.9 Notes

Some important milestones in the development of RES are (a) reference [104],
where Fourier analysis and marching were proposed to solve Poisson’s equation; (b)
reference [87] appears to be the first to provide the explicit formula for the Poisson
matrix inverse, and also [88–90], where this formula, spectral decomposition, and a
first version of MD and CFT were described (called the semi-rational and rational
solutions respectively in [88]); (c) reference [105], on Kronecker (tensor) product
solvers.
The invention of the FFT together with reference [106] (describing FACR(1) and
the solution of tridiagonal systems with cyclic reduction) mark the beginning of the
modern era of RES. These and reference [95], with its detailed analysis of the major
RES were extremely influential in the development of the field; see also [107, 108]
and the survey of [80] on the numerical solution of elliptic problems including the
topic of RES.
The first discussions of parallel RES date from the early 1970s, in particular
references [92, 93] while the first implementation on the Illiac IV was reported in
[109]. There exist descriptions of implementations of RES for most important high
performance computing platforms, including vector processors, vector multiproces-
sors, shared memory symmetric multiprocessors, distributed memory multiproces-
sors, SIMD and MIMD processor arrays, clusters of heterogeneous processors and
in Grid environments. References for specific systems can be found for the Cray-1
[110, 111]; the ICL DAP [112]; Alliant FX/8 [97, 113]; Caltech and Intel hypercubes
[114–118]; Thinking Machines CM-2 [119]; Denelcor HEP [120, 121]; the Univer-
sity of Illinois Cedar machine [122, 123]; Cray X-MP [110]; Cray Y-MP [124]; Cray
T3E [125, 126]; Grid environments [127]; Intel multicore processors and processor
clusters [128]; GPUs [129, 130]. It is also worth noting the extensive early studies conducted at Yale on the FPS-164 attached array processor [131]. Proposals for special
purpose hardware can be found in [132]. RES were included in the PELLPACK
parallel problem-solving environment for elliptic PDEs [133].
The inverses of the special Toeplitz tridiagonal and 2-level Toeplitz matrices that we encountered in this chapter follow directly from the usual expansion formula for determinants and the adjoint formula for the matrix inverse; see [134].

MD-Fourier algorithms have been designed and implemented on vector and


parallel architectures and are frequently used as the baseline in evaluating new Pois-
son solvers; cf. [92, 93, 109, 111, 112, 115–117, 119, 124, 125, 131, 135, 136].
It is interesting to note that CFT was discussed in [104] long before the discovery
of the FFT. Studies of parallel CFT were conducted in [94, 137], and it was found
to have lower computational and communication complexity than MD and BCR for
systems with O(mn) processors connected in a hypercube. Specific implementations
were described in reference [131] for the FPS-164 attached array processor and in
[116] for Intel hypercubes.
BCR was first analyzed in the influential early paper [95]. Many formulations
and uses of the algorithm were studied by the authors of [138]. They interpret BCR,
applied on a block-Toeplitz tridiagonal system such as the discrete Poisson system, by
means of a set of matrix recurrences that resemble the ones we derived in Sect. 5.5.2
for simple tridiagonal systems. The approach, however, is different in that they con-
sider Schur complements rather than the structural interpretation provided by the
t-tridiagonal matrices (cf. Definition 5.1). They also consider a functional form of
these recurrences that provides an elegant, alternative interpretation of the splitting
in Corollary 5.2, if that were to be used with cr to solve block-Toeplitz systems.
It took more than 10 years for the ingenious “parallelism via partial fractions” idea1
to be applied to speed-up the application of rational functions of matrices on vectors,
in which case the gains can be really substantial. The application of partial fractions
to parallelize BCR was described in [98] and independently in [97]. References [94,
113, 123] also discuss the application of partial fraction based parallel BCR and
evaluate its performance on a variety of vector and parallel architectures. Parallel
versions of MD and BCR were implemented in CRAYFISHPAK, a package that
contained most of the functionality of FISHPAK. Results from using the package on
a Cray Y-MP were presented in [110].
FACR was first proposed in [106]. The issue of how to choose l for parameterized
vector and parallel architectures has been studied at length in [137] and for a shared
memory multiprocessor in [120] where it was proposed to monitor the number of
systems that can be solved in parallel.
A general conclusion is that the number of steps should be small (so that BCR
can be applied without stabilization) especially for high levels of parallelism. Its
optimal value for a specific configuration can be determined empirically. FACR(l)
can be viewed as an alternative to partial fractions to avoid the bottlenecks in parallel
implementation that are observed after a few steps of reduction in BCR. For example,
one can monitor the number of systems that can be solved in parallel in BCR and
before that number of independent computations becomes too small and no longer
acceptable, switch to MD. This approach was analyzed in [120] on a shared-memory
multiprocessor (Denelcor HEP). FACR is frequently found to be faster than Fourier-
MD and CFT; see e.g. [131] for results on an attached array processor (FPS-164).
See also [111, 112] for designs of FACR on vector and parallel architectures (Cray-1,
Cyber-205, ICL DAP).

1 Due to H.T. Kung.



Exploiting sparsity and solution probing was observed early on for RES in
reference [26] as we already noted in Sect. 6.1. These ideas were developed fur-
ther in [126, 139–141]. A RES for an SGI with 8 processors and a 16-processor Beowulf cluster based on [139, 140] was presented in [142].

References

1. Turing, A.: Proposed electronic calculator. www.emula3.com/docs/Turing_Report_on_ACE.


pdf (1946)
2. Dewilde, P.: Minimal complexity realization of structured matrices. In: Kailath, T., Sayed, A.
(eds.) Fast Reliable Algorithms for Matrices with Structure, Chapter 10, pp. 277–295. SIAM
(1999)
3. Chandrasekaran, S., Dewilde, P., Gu, M., Somasunderam, N.: On the numerical rank of the
off-diagonal blocks of Schur complements of discretized elliptic PDEs. SIAM J. Matrix Anal.
Appl. 31(5), 2261–2290 (2010)
4. Lin, L., Lu, J., Ying, L.: Fast construction of hierarchical matrix representation from matrix-
vector multiplication. J. Comput. Phys. 230(10), 4071–4087 (2011). doi:10.1016/j.jcp.2011.
02.033. http://dx.doi.org/10.1016/j.jcp.2011.02.033
5. Martinsson, P.: A fast randomized algorithm for computing a hierarchically semiseparable
representation of a matrix. SIAM J. Matrix Anal. Appl. 32(4), 1251–1274 (2011)
6. Kailath, T., Kung, S.-Y., Morf, M.: Displacement ranks of matrices and linear equations. J.
Math. Anal. Appl. 68(2), 395–407 (1979)
7. Kailath, T., Sayed, A.: Displacement structure: theory and applications. SIAM Rev. 37(3),
297–386 (1995)
8. Vandebril, R., Van Barel, M., Mastronardi, N.: Matrix Computations and Semiseparable Matri-
ces. Volume I: Linear Systems. Johns Hopkins University Press, Baltimore (2008)
9. Bebendorf, M.: Hierarchical Matrices: A Means to Efficiently Solve Elliptic Boundary Value
Problems. Lecture Notes in Computational Science and Engineering (LNCSE), vol. 63.
Springer, Berlin (2008). ISBN 978-3-540-77146-3
10. Hackbusch, W., Borm, S.: Data-sparse approximation by adaptive H2-matrices. Computing
69(1), 1–35 (2002)
11. Traub, J.: Associated polynomials and uniform methods for the solution of linear problems.
SIAM Rev. 8(3), 277–301 (1966)
12. Björck, A., Pereyra, V.: Solution of Vandermonde systems of equations. Math. Comput. 24,
893–903 (1971)
13. Gohberg, I., Olshevsky, V.: The fast generalized Parker-Traub algorithm for inversion of
Vandermonde and related matrices. J. Complex. 13(2), 208–234 (1997)
14. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins, Baltimore (2013)
15. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
16. Davis, P.J.: Interpolation and Approximation. Dover, New York (1975)
17. Gautschi, W.: Numerical Analysis: An Introduction. Birkhauser, Boston (1997)
18. Gautschi, W., Inglese, G.: Lower bounds for the condition number of Vandermonde matrices.
Numer. Math. 52, 241–250 (1988)
19. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM, Philadelphia
(1992)
20. Córdova, A., Gautschi, W., Ruscheweyh, S.: Vandermonde matrices on the circle: spectral
properties and conditioning. Numer. Math. 57, 577–591 (1990)
21. Berman, L., Feuer, A.: On perfect conditioning of Vandermonde matrices on the unit circle.
Electron. J. Linear Algebra 16, 157–161 (2007)

22. Gautschi, W.: Optimally scaled and optimally conditioned Vandermonde and Vandermonde-
like matrices. BIT Numer. Math. 51, 103–125 (2011)
23. Gunnels, J., Lee, J., Margulies, S.: Efficient high-precision matrix algebra on parallel archi-
tectures for nonlinear combinatorial optimization. Math. Program. Comput. 2(2), 103–124
(2010)
24. Aho, A., Hopcroft, J.E., Ullman, J.D.: The Design and Analysis of Computer Algorithms.
Addison-Wesley, Reading (1974)
25. Pan, V.: Complexity of computations with matrices and polynomials. SIAM Rev. 34(2), 255–
262 (1992)
26. Banegas, A.: Fast Poisson solvers for problems with sparsity. Math. Comput. 32(142), 441–
446 (1978). http://www.jstor.org/stable/2006156
27. Cappello, P., Gallopoulos, E., Koç, Ç.: Systolic computation of interpolating polynomials.
Computing 45, 95–118 (1990)
28. Koç, Ç., Cappello, P., Gallopoulos, E.: Decomposing polynomial interpolation for systolic
arrays. Int. J. Comput. Math. 38, 219–239 (1991)
29. Koç, Ç.: Parallel algorithms for interpolation and approximation. Ph.D. thesis, Department
of Electrical and Computer Engineering, University of California, Santa Barbara, June 1988
30. Eğecioğlu, Ö., Gallopoulos, E., Koç, Ç.: A parallel method for fast and practical high-order
Newton interpolation. BIT 30, 268–288 (1990)
31. Breshaers, C.: The Art of Concurrency—A Thread Monkey’s Guide to Writing Parallel Appli-
cations. O’Reilly, Cambridge (2009)
32. Lakshmivarahan, S., Dhall, S.: Parallelism in the Prefix Problem. Oxford University Press,
New York (1994)
33. Harris, M., Sengupta, S., Owens, J.: Parallel prefix sum (scan) with CUDA. GPU Gems 3(39),
851–876 (2007)
34. Falkoff, A., Iverson, K.: The evolution of APL. SIGPLAN Not. 13(8), 47–57 (1978). doi:10.
1145/960118.808372. http://doi.acm.org/10.1145/960118.808372
35. Blelloch, G.E.: Scans as primitive operations. IEEE Trans. Comput. 38(11), 1526–1538 (1989)
36. Chatterjee, S., Blelloch, G., Zagha, M.: Scan primitives for vector computers. In: Proceed-
ings of the 1990 ACM/IEEE Conference on Supercomputing, pp. 666–675. IEEE Computer
Society Press, Los Alamitos (1990). http://dl.acm.org/citation.cfm?id=110382.110597
37. Hillis, W., Steele Jr, G.: Data parallel algorithms. Commun. ACM 29(12), 1170–1183 (1986).
doi:10.1145/7902.7903. http://doi.acm.org/10.1145/7902.7903
38. Dotsenko, Y., Govindaraju, N., Sloan, P.P., Boyd, C., Manferdelli, J.: Fast scan algorithms on
graphics processors. In: Proceedings of the 22nd International Conference on Supercomputing
ICS’08, pp. 205–213. ACM, New York (2008). doi:10.1145/1375527.1375559. http://doi.
acm.org/10.1145/1375527.1375559
39. Sengupta, S., Harris, M., Zhang, Y., Owens, J.: Scan primitives for GPU computing. Graphics
Hardware 2007, pp. 97–106. ACM, New York (2007)
40. Sengupta, S., Harris, M., Garland, M., Owens, J.: Efficient parallel scan algorithms for many-
core GPUs. In: Kurzak, J., Bader, D., Dongarra, J. (eds.) Scientific Computing with Multicore
and Accelerators, pp. 413–442. CRC Press, Boca Raton (2010). doi:10.1201/b10376-29
41. Intel Corporation: Intel(R) Threading Building Blocks Reference Manual, revision 1.6 edn.
(2007). Document number 315415-001US
42. Bareiss, E.: Numerical solutions of linear equations with Toeplitz and vector Toeplitz matrices.
Numer. Math. 13, 404–424 (1969)
43. Gallivan, K.A., Thirumalai, S., Van Dooren, P., Varmaut, V.: High performance algorithms
for Toeplitz and block Toeplitz matrices. Linear Algebra Appl. 241–243, 343–388 (1996)
44. Justice, J.: The Szegö recurrence relation and inverses of positive definite Toeplitz matrices.
SIAM J. Math. Anal. 5, 503–508 (1974)
45. Trench, W.: An algorithm for the inversion of finite Toeplitz matrices. J. Soc. Ind. Appl. Math.
12, 515–522 (1964)
46. Trench, W.: An algorithm for the inversion of finite Hankel matrices. J. Soc. Ind. Appl. Math.
13, 1102–1107 (1965)

47. Phillips, J.: The triangular decomposition of Hankel matrices. Math. Comput. 25, 599–602
(1971)
48. Rissanen, J.: Solving of linear equations with Hankel and Toeplitz matrices. Numer. Math.
22, 361–366 (1974)
49. Xi, Y., Xia, J., Cauley, S., Balakrishnan, V.: Superfast and stable structured solvers for Toeplitz
least squares via randomized sampling. SIAM J. Matrix Anal. Appl. 35(1), 44–72 (2014)
50. Zohar, S.: Toeplitz matrix inversion: The algorithm of W. Trench. J. Assoc. Comput. Mach.
16, 592–701 (1969)
51. Watson, G.: An algorithm for the inversion of block matrices of Toeplitz form. J. Assoc.
Comput. Mach. 20, 409–415 (1973)
52. Rissanen, J.: Algorithms for triangular decomposition of block Hankel and Toeplitz matrices
with applications to factoring positive matrix polynomials. Math. Comput. 27, 147–154 (1973)
53. Kailath, T., Vieira, A., Morf, M.: Inverses of Toeplitz operators, innovations, and orthogonal
polynomials. SIAM Rev. 20, 106–119 (1978)
54. Gustavson, F., Yun, D.: Fast computation of Padé approximants and Toeplitz systems of
equations via the extended Euclidean algorithm. Technical report 7551, IBM T.J. Watson
Research Center, New York (1979)
55. Brent, R., Gustavson, F., Yun, D.: Fast solution of Toeplitz systems of equations and compu-
tation of Padé approximants. J. Algorithms 1, 259–295 (1980)
56. Morf, M.: Doubling algorithms for Toeplitz and related equations. In: Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal Processing, pp. 954–959 (1980)
57. Chandrasekaran, S., Gu, M., Sun, X., Xia, J., Zhu, J.: A superfast algorithm for Toeplitz
systems of linear equations. SIAM J. Matrix Anal. Appl. 29(4), 1247–1266 (2007). doi:10.
1137/040617200. http://dx.doi.org/10.1137/040617200
58. Xia, J., Xi, Y., Gu, M.: A superfast structured solver for Toeplitz linear systems via randomized
sampling. SIAM J. Matrix Anal. Appl. 33(3), 837–858 (2012)
59. Grcar, J., Sameh, A.: On certain parallel Toeplitz linear system solvers. SIAM J. Sci. Stat.
Comput. 2(2), 238–256 (1981)
60. Aitken, A.: Determinants and Matrices. Oliver Boyd, London (1939)
61. Cantoni, A., Butler, P.: Eigenvalues and eigenvectors of symmetric centrosymmetric matrices.
Numer. Linear Algebra Appl. 13, 275–288 (1976)
62. Gohberg, I., Semencul, A.: On the inversion of finite Toeplitz matrices and their continuous
analogues. Mat. Issled 2, 201–233 (1972)
63. Gohberg, I., Feldman, I.: Convolution equations and projection methods for their solution.
Translations of Mathematical Monographs, vol. 41. AMS, Providence (1974)
64. Gohberg, I., Levin, S.: Asymptotic properties of Toeplitz matrix factorization. Mat. Issled 1,
519–538 (1978)
65. Fischer, D., Golub, G., Hald, O., Leiva, C., Widlund, O.: On Fourier-Toeplitz methods for
separable elliptic problems. Math. Comput. 28(126), 349–368 (1974)
66. Riesz, F., Sz-Nagy, B.: Functional Analysis. Frederick Ungar, New York (1956). (Translated
from second French edition by L. Boron)
67. Szegö, G.: Orthogonal Polynomials. Technical Report, AMS, Rhode Island (1959). (Revised
edition AMS Colloquium Publication)
68. Grenander, U., Szegö, G.: Toeplitz Forms and their Applications. University of California
Press, California (1958)
69. Pease, M.: The adaptation of the fast Fourier transform for parallel processing. J. Assoc.
Comput. Mach. 15(2), 252–264 (1968)
70. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
71. Morf, M., Kailath, T.: Recent results in least-squares estimation theory. Ann. Econ. Soc. Meas.
6, 261–274 (1977)
72. Franchetti, F., Püschel, M.: Fast Fourier transform. In: Padua, D. (ed.) Encyclopedia of Parallel
Computing. Springer, New York (2011)

73. Chen, H.C.: The SAS domain decomposition method. Ph.D. thesis, University of Illinois at
Urbana-Champaign (1988)
74. Chen, H.C., Sameh, A.: Numerical linear algebra algorithms on the Cedar system. In: Noor,
A. (ed.) Parallel Computations and Their Impact on Mechanics. Applied Mechanics Division,
vol. 86, pp. 101–125. American Society of Mechanical Engineers, New York (1987)
75. Chen, H.C., Sameh, A.: A matrix decomposition method for orthotropic elasticity problems.
SIAM J. Matrix Anal. Appl. 10(1), 39–64 (1989)
76. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford University Press, Oxford (1965)
77. Botta, E.: How fast the Laplace equation was solved in 1995. Appl. Numer. Math.
24(4), 439–455 (1997). doi:10.1016/S01689274(97)00041X. http://dx.doi.org/10.1016/
S0168-9274(97)00041-X
78. Knightley, J.R., Thompson, C.P.: On the performance of some rapid elliptic solvers on a vector
processor. SIAM J. Sci. Stat. Comput. 8(5), 701–715 (1987)
79. Csansky, L.: Fast parallel matrix inversion algorithms. SIAM J. Comput. 5, 618–623 (1977)
80. Birkhoff, G., Lynch, R.: Numerical Solution of Elliptic Problems. SIAM, Philadelphia (1984)
81. Iserles, A.: Introduction to Numerical Methods for Differential Equations. Cambridge Uni-
versity Press, Cambridge (1996)
82. Olshevsky, V., Oseledets, I., Tyrtyshnikov, E.: Superfast inversion of two-level Toeplitz matri-
ces using Newton iteration and tensor-displacement structure. Recent Advances in Matrix and
Operator Theory. Birkhäuser Verlag, Basel (2007)
83. Bank, R.E., Rose, D.: Marching algorithms for elliptic boundary value problems. I: the con-
stant coefficient case. SIAM J. Numer. Anal. 14(5), 792–829 (1977)
84. Lanczos, C.: Tables of the Chebyshev Polynomials Sn (x) and Cn (x). Applied Mathematics
Series, vol. 9. National Bureau of Standards, New York (1952)
85. Rivlin, T.: The Chebyshev Polynomials. Wiley-Interscience, New York (1974)
86. Abramowitz, M., Stegun, I.: Handbook of Mathematical Functions. Dover, New York (1965)
87. Karlqvist, O.: Numerical solution of elliptic difference equations by matrix methods. Tellus
4(4), 374–384 (1952). doi:10.1111/j.2153-3490.1952.tb01025.x. http://dx.doi.org/10.1111/
j.2153-3490.1952.tb01025.x
88. Bickley, W.G., McNamee, J.: Matrix and other direct methods for the solution of systems of
linear difference equations. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 252(1005), 69–131
(1960). doi:10.1098/rsta.1960.0001. http://rsta.royalsocietypublishing.org/cgi/doi/10.1098/
rsta.1960.0001
89. Egerváry, E.: On rank-diminishing operations and their application to the solution of linear
equations. Zeitschrift fuer angew. Math. und Phys. 11, 376–386 (1960)
90. Egerváry, E.: On hypermatrices whose blocks are computable in pair and their application in
lattice dynamics. Acta Sci. Math. Szeged 15, 211–222 (1953/1954)
91. Bialecki, B., Fairweather, G., Karageorghis, A.: Matrix decomposition algorithms for elliptic
boundary value problems: a survey. Numer. Algorithms (2010). doi:10.1007/s11075-010-
9384-y. http://www.springerlink.com/index/10.1007/s11075-010-9384-y
92. Buzbee, B.: A fast Poisson solver amenable to parallel computation. IEEE Trans. Comput.
C-22(8), 793–796 (1973)
93. Sameh, A., Chen, S.C., Kuck, D.: Parallel Poisson and biharmonic solvers. Computing 17,
219–230 (1976)
94. Swarztrauber, P.N., Sweet, R.A.: Vector and parallel methods for the direct solution of Pois-
son’s equation. J. Comput. Appl. Math. 27, 241–263 (1989)
95. Buzbee, B., Golub, G., Nielson, C.: On direct methods for solving Poisson’s equation. SIAM
J. Numer. Anal. 7(4), 627–656 (1970)
96. Sweet, R.A.: A cyclic reduction algorithm for solving block tridiagonal systems of arbitrary
dimension. SIAM J. Numer. Anal. 14(4), 707–720 (1977)
97. Gallopoulos, E., Saad, Y.: Parallel block cyclic reduction algorithm for the fast solution of
elliptic equations. Parallel Comput. 10(2), 143–160 (1989)
98. Sweet, R.A.: A parallel and vector cyclic reduction algorithm. SIAM J. Sci. Stat. Comput.
9(4), 761–765 (1988)

99. Demmel, J.: Trading off parallelism and numerical stability. In: Moonen, M.S., Golub, G.H.,
Moor, B.L.D. (eds.) Linear Algebra for Large Scale and Real-Time Applications. NATO ASI
Series E, vol. 232, pp. 49–68. Kluwer Academic Publishers, Dordrecht (1993)
100. Calvetti, D., Gallopoulos, E., Reichel, L.: Incomplete partial fractions for parallel evaluation
of rational matrix functions. J. Comput. Appl. Math. 59, 349–380 (1995)
101. Temperton, C.: On the FACR(l) algorithm for the discrete Poisson equation. J. Comput. Phys.
34, 314–329 (1980)
102. Sameh, A., Kuck, D.: On stable parallel linear system solvers. J. Assoc. Comput. Mach. 25(1),
81–91 (1978)
103. Gallopoulos, E., Saad, Y.: Some fast elliptic solvers for parallel architectures and their com-
plexities. Int. J. High Speed Comput. 1(1), 113–141 (1989)
104. Hyman, M.: Non-iterative numerical solution of boundary-value problems. Appl. Sci. Res. B
2, 325–351 (1951–1952)
105. Lynch, R., Rice, J., Thomas, D.: Tensor product analysis of partial differential equations. Bull.
Am. Math. Soc. 70, 378–384 (1964)
106. Hockney, R.: A fast direct solution of Poisson’s equation using Fourier analysis. J. Assoc.
Comput. Mach. 12, 95–113 (1965)
107. Haigh, T.: Bill Buzbee, Oral History Interview (2005). http://history.siam.org/buzbee.htm
108. Cooley, J.: The re-discovery of the fast Fourier transform algorithm. Mikrochim. Acta III,
33–45 (1987)
109. Ericksen, J.: Iterative and direct methods for solving Poisson’s equation and their adaptabil-
ity to Illiac IV. Technical report UIUCDCS-R-72-574, Department of Computer Science,
University of Illinois at Urbana-Champaign (1972)
110. Sweet, R.: Vectorization and parallelization of FISHPAK. In: Dongarra, J., Kennedy, K.,
Messina, P., Sorensen, D., Voigt, R. (eds.) Proceedings of the Fifth SIAM Conference on
Parallel Processing for Scientific Computing, pp. 637–642. SIAM, Philadelphia (1992)
111. Temperton, C.: Fast Fourier transforms and Poisson solvers on Cray-1. In: Hockney, R.,
Jesshope, C. (eds.) Infotech State of the Art Report: Supercomputers, vol. 2, pp. 359–379.
Infotech Int. Ltd., Maidenhead (1979)
112. Hockney, R.W.: Characterizing computers and optimizing the FACR(l) Poisson solver on
parallel unicomputers. IEEE Trans. Comput. C-32(10), 933–941 (1983)
113. Jwo, J.S., Lakshmivarahan, S., Dhall, S.K., Lewis, J.M.: Comparison of performance of three
parallel versions of the block cyclic reduction algorithm for solving linear elliptic partial
differential equations. Comput. Math. Appl. 24(5–6), 83–101 (1992)
114. Chan, T., Resasco, D.: Hypercube implementation of domain-decomposed fast Poisson
solvers. In: Heath, M. (ed.) Proceedings of the 2nd Conference on Hypercube Multiprocessors,
pp. 738–746. SIAM (1987)
115. Resasco, D.: Domain decomposition algorithms for elliptic partial differential equations.
Ph.D. thesis, Yale University (1990). http://www.cs.yale.edu/publications/techreports/tr776.
pdf. YALEU/DCS/RR-776
116. Cote, S.: Solving partial differential equations on a MIMD hypercube: fast Poisson solvers
and the alternating direction method. Technical report UIUCDCS-R-91-1694, University of
Illinois at Urbana-Champaign (1991)
117. McBryan, O., Van De Velde, E.: Hypercube algorithms and implementations. SIAM J. Sci.
Stat. Comput. 8(2), s227–s287 (1987)
118. Sweet, R., Briggs, W., Oliveira, S., Porsche, J., Turnbull, T.: FFTs and three-dimensional
Poisson solvers for hypercubes. Parallel Comput. 17, 121–131 (1991)
119. McBryan, O.: Connection machine application performance. Technical report CH-CS-434-
89, Department of Computer Science, University of Colorado, Boulder (1989)
120. Briggs, W.L., Turnbull, T.: Fast Poisson solvers for MIMD computers. Parallel Comput. 6,
265–274 (1988)
121. McBryan, O., Van de Velde, E.: Elliptic equation algorithms on parallel computers. Commun.
Appl. Numer. Math. 2, 311–318 (1986)

122. Gallivan, K.A., Heath, M.T., Ng, E., Ortega, J.M., Peyton, B.W., Plemmons, R.J., Romine,
C.H., Sameh, A., Voigt, R.G.: Parallel Algorithms for Matrix Computations. SIAM, Philadel-
phia (1990)
123. Gallopoulos, E., Sameh, A.: Solving elliptic equations on the Cedar multiprocessor. In: Wright,
M.H. (ed.) Aspects of Computation on Asynchronous Parallel Processors, pp. 1–12. Elsevier
Science Publishers B.V. (North-Holland), Amsterdam (1989)
124. Chan, T.F., Fatoohi, R.: Multitasking domain decomposition fast Poisson solvers on the Cray
Y-MP. In: Proceedings of the Fourth SIAM Conference on Parallel Processing for Scientific
Computing. SIAM (1989) (to appear)
125. Giraud, L.: Parallel distributed FFT-based solvers for 3-D Poisson problems in meso-scale
atmospheric simulations. Int. J. High Perform. Comput. Appl. 15(1), 36–46 (2001). doi:10.
1177/109434200101500104. http://hpc.sagepub.com/cgi/content/abstract/15/1/36
126. Rossi, T., Toivanen, J.: A parallel fast direct solver for block tridiagonal systems with separable
matrices of arbitrary dimension. SIAM J. Sci. Stat. Comput. 20(5), 1778–1796 (1999)
127. Tromeur-Dervout, D., Toivanen, J., Garbey, M., Hess, M., Resch, M., Barberou, N., Rossi,
T.: Efficient metacomputing of elliptic linear and non-linear problems. J. Parallel Distrib.
Comput. 63(5), 564–577 (2003). doi:10.1016/S0743-7315(03)00003-0
128. Intel Cluster Poisson Solver Library—Intel Software Network. http://software.intel.com/en-
us/articles/intel-cluster-poisson-solver-library/
129. Rossinelli, D., Bergdorf, M., Cottet, G.H., Koumoutsakos, P.: GPU accelerated simulations of
bluff body flows using vortex particle methods. J. Comput. Phys. 229(9), 3316–3333 (2010)
130. Wu, J., JaJa, J., Balaras, E.: An optimized FFT-based direct Poisson solver on CUDA GPUs.
IEEE Trans. Parallel Distrib. Comput. 25(3), 550–559 (2014). doi:10.1109/TPDS.2013.53
131. O’Donnell, S.T., Geiger, P., Schultz, M.H.: Solving the Poisson equation on the FPS-164.
Technical report, Yale University, Department of Computer Science (1983)
132. Vajteršic, M.: Algorithms for Elliptic Problems: Efficient Sequential and Parallel Solvers.
Kluwer Academic Publishers, Dordrecht (1993)
133. Houstis, E.N., Rice, J.R., Weerawarana, S., Catlin, A.C., Papachiou, P., Wang, K.Y., Gai-
tatzes, M.: PELLPACK: a problem-solving environment for PDE-based applications on mul-
ticomputer platforms. ACM Trans. Math. Softw. (TOMS) 24(1) (1998). http://portal.acm.org/
citation.cfm?id=285864
134. Meurant, G.: A review on the inverse of symmetric tridiagonal and block tridiagonal matrices.
SIAM J. Matrix Anal. Appl. 13(3), 707–728 (1992)
135. Hoffmann, G.R., Swarztrauber, P., Sweet, R.: Aspects of using multiprocessors for meteoro-
logical modelling. In: Hoffmann, G.R., Snelling, D. (eds.) Multiprocessing in Meteorological
Models, pp. 125–196. Springer, New York (1988)
136. Johnsson, S.: The FFT and fast Poisson solvers on parallel architectures. Technical Report
583, Yale University, Department of Computer Science (1987)
137. Hockney, R., Jesshope, C.: Parallel Computers. Adam Hilger, Bristol (1983)
138. Bini, D., Meini, B.: The cyclic reduction algorithm: from Poisson equation to stochastic
processes and beyond. Numerical Algorithms 51(1), 23–60 (2008). doi:10.1007/s11075-008-
9253-0. http://www.springerlink.com/content/m40t072h273w8841/fulltext.pdf
139. Kuznetsov, Y.A., Matsokin, A.M.: On partial solution of systems of linear algebraic equations.
Sov. J. Numer. Anal. Math. Model. 4(6), 453–467 (1989)
140. Vassilevski, P.: An optimal stabilization of the marching algorithm. Comptes Rendus Acad.
Bulg. Sci. 41, 29–32 (1988)
141. Rossi, T., Toivanen, J.: A nonstandard cyclic reduction method, its variants and stability.
SIAM J. Matrix Anal. Appl. 20(3), 628–645 (1999)
142. Bencheva, G.: Parallel performance comparison of three direct separable elliptic solvers. In:
Lirkov, I., Margenov, S., Wasniewski, J., Yalamov, P. (eds.) Large-Scale Scientific Computing.
Lecture Notes in Computer Science, vol. 2907, pp. 421–428. Springer, Berlin (2004). http://
dx.doi.org/10.1007/978-3-540-24588-9_48
Chapter 7
Orthogonal Factorization and Linear Least
Squares Problems

Orthogonal factorization (or QR factorization) of a dense matrix is an essential tool in


several matrix computations. In this chapter we consider algorithms for QR as well as
its application to the solution of linear least squares problems. The QR factorization is
also a major step in some methods for computing eigenvalues; cf. Chap. 8. Algorithms
for both the QR factorization and the solution of linear least squares problems are
major components in data analysis and areas such as computational statistics, data
mining and machine learning; cf. [1–3].

7.1 Definitions

We begin by introducing the two problems we address in this chapter.


Definition 7.1 (QR factorization) For any matrix A ∈ Rm×n , there exists a pair of
matrices Q ∈ Rm×m and R ∈ Rm×n such that:

A = Q R, (7.1)

where Q is orthogonal and R is either upper triangular when m ≥ n (i.e., R is of the form R = \begin{pmatrix} R_1 \\ 0 \end{pmatrix}) or upper trapezoidal when m < n.
If A is of maximal column rank n, and if we require the nonzero diagonal elements of R_1 to be positive, then the QR factorization is unique. Further, R_1 is the transpose of the Cholesky factor of the matrix A^⊤A. Computing the factorization (7.1) consists of pre-multiplying A by a finite number of elementary orthogonal transformations Q_1^⊤, . . . , Q_q^⊤, such that

Q_q^⊤ · · · Q_2^⊤ Q_1^⊤ A = R, \qquad (7.2)


where Q = Q_1 Q_2 · · · Q_q satisfies (7.2). When m > n, partitioning the columns of Q as Q = [U, V] with U ∈ R^{m×n} consisting of n orthonormal columns, the factorization can be expressed as

A = U R_1. \qquad (7.3)

This is referred to as the thin factorization, whereas the one obtained in (7.2) is referred to as the thick factorization.

Definition 7.2 (Linear least squares problem)

Obtain x ∈ Rn , so as to minimize the 2-norm of (b − Ax), (7.4)

where A ∈ Rm×n and b ∈ Rm .

The vector x is a solution if and only if the residual r = b − Ax is orthogonal to the


range of A: A^⊤r = 0. When A is of maximal column rank n, the solution is unique. When rank(A) < n, the solution of choice is the one with the smallest 2-norm. In this case x = A^+ b, where A^+ ∈ R^{n×m} is the Moore-Penrose generalized inverse of A.
Thus, solving (7.4) consists of first computing the QR factorization (7.1) of A. Further, since

‖b − Ax‖ = ‖Q^⊤(b − Ax)‖ = ‖Q^⊤b − Rx‖ = ‖c − Rx‖, \qquad (7.5)

the solution x is chosen so as to minimize ‖c − Rx‖. When A is of full column rank, the solution is unique. By partitioning the right-hand side c = \begin{pmatrix} c_1 \\ c_2 \end{pmatrix} conformally with Q = [U, V], the least squares solution x is obtained by solving the upper triangular system R_1 x = c_1. When A is rank deficient, a strategy of column pivoting is necessary for realizing a rank-revealing orthogonal factorization.
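As a simple illustration (ours, not the book's code), the thin factorization can be used directly to solve (7.4) when A has full column rank:

import numpy as np

rng = np.random.default_rng(6)
m, n = 50, 8
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

U, R1 = np.linalg.qr(A, mode='reduced')   # thin factorization A = U R_1
c1 = U.T @ b
x = np.linalg.solve(R1, c1)               # in practice, a triangular back substitution

r = b - A @ x                             # the residual is orthogonal to range(A): A^T r = 0
assert np.allclose(A.T @ r, 0, atol=1e-10)
assert np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0])

The rank deficient case requires column pivoting or another rank-revealing strategy, as noted above.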
In this chapter, we first review the existing parallel algorithms for computing the
QR-factorization of a full rank matrix, and then we discuss algorithms for the parallel
solution of rank deficient linear least squares problems. We also consider a method
for solving systems with tall-narrow matrices and its application to the adjustment
of geodetic networks.
Both the QR factorization and the linear least squares problem have been discussed
extensively in the literature, e.g. see [4, 5] and references therein.

7.2 QR Factorization via Givens Rotations

For the sake of illustrating this parallel factorization scheme, we consider only square
matrices A ∈ Rn×n even though the procedure can be easily extended to rectangular
matrices A ∈ Rm×n , with m > n. In this procedure, each orthogonal matrix Q j
which appears in (7.2) is built as the product of several plane rotations of the form

R_{ij} = \begin{pmatrix}
I_{i-1} & & & & \\
 & c_i^{(j)} & & s_i^{(j)} & \\
 & & I_{j-i-1} & & \\
 & -s_i^{(j)} & & c_i^{(j)} & \\
 & & & & I_{n-j}
\end{pmatrix}, \qquad (7.6)

where the nontrivial entries c_i^{(j)} and ±s_i^{(j)} lie in the ith and jth rows and columns,

( j) 2 ( j) 2
where ci + si = 1.
For example, one of the simplest organization for successively eliminating entries
below the main diagonal column by column starting with the first column, i.e. j =
1, 2, . . . , n − 1, i.e. Q  is the product of n(n − 1)/2 plane rotations.
( j)
Let us denote by Ri,i+1 that plane rotation which uses rows i and (i + 1) of A
to annihilate the off-diagonal element in position (i + 1, j). Such a rotation is given
by:

if |αi+1, j | > |αi, j |,


αi, j ( j) ( j) ( j)
τ = − αi+1, , si = √ 1 , ci = si τ,
j 1+τ 2
else (7.7)
α ( j) ( j) ( j)
τ= − αi+1, j
, ci = √ 1 , si = ci τ.
i, j 1+τ 2
end

Thus, the annihilation of the entries of column j is obtained by the sequence


( j) ( j) ( j)
of rotations Q j = R j, j+1 · · · Rn−2,n−1 Rn−1,n . Hence, the orthogonal matrix Q =
Q 1 · · · Q n−2 Q n−1 is such that Q  A is upper-triangular. This sequential algorithm
requires n(4n 2 − n − 3)/2 arithmetic operations, and (n − 1)/2 square roots.
Next, we construct a multiprocessor algorithm that achieves a speedup of O(n 2 )
using O(n 2 ) processors.

Theorem 7.1 ([6]) Let A be a nonsingular matrix of even order n, then we can
obtain the factorization (7.2) in T p = 10n − 15 parallel arithmetic operations and
(2n − 3) square roots using p = n(3n − 2)/2 processors. This results in a speedup
of O(n 2 ), efficiency O(1) and no redundant arithmetic operations.

Proof Let

( j) ( j)
( j) ci si
Ui,i+1 = ( j) ( j)
−si ci

be that plane rotation that acts on rows i and i + 1 for annihilating the element in
position (i +1, j). Starting with A1 = A, we construct the sequence Ak+1 = Q  k Ak ,
k = 1, . . . , 2n − 3, where Q 
k is the direct sum of the independent plane rotations
( j)
Ui,i+1 where the indices i, j are as given in Algorithm 7.1. Thus, each orthogonal
230 7 Orthogonal Factorization and Linear Least Squares Problems

Fig. 7.1 Ordering for the


Givens rotations: all the
entries of label k are
annihilated by Q k

matrix Q  k annihilates simultaneously several off-diagonal elements without de-


stroying zeros introduced in a previous step. Specifically, Q  k annihilates the entries
(i j + 1, j) where i j = 2 j + n − k − 2, for j = 1, 2, . . . , n − 1. Note that elements
annihilated by each Q  k relate to one another by the knight’s move on a Chess board,
see Fig. 7.1 for n = 16, and Algorithm 7.1.
Hence, A2n−2 ≡ Q   
2n−3 . . . Q 2 Q 1 A is upper triangular. Let
 
cs
U=
−s c

rotate the two row vectors u  and v ∈ Rq so as to annihilate v1 , the first component
of v. Then U can be determined in 3 parallel steps and one square root using two
processors. The resulting vectors û and v̂ have the elements

û 1 = (u 21 + v12 )1/2
û i = cu i + svi , 2 ≤ i ≤ q, (7.8)
v̂i = −su i + cvi , 2 ≤ i ≤ q.
7.2 QR Factorization via Givens Rotations 231

Each pair û i and v̂i in (7.8) is evaluated in 2 parallel arithmetic operations using
4 processors. Note that the cost of computing û 1 has already been accounted for in
determining U . Consequently, the total cost of each transformation Ak+1 = Q k Ak
is 5 parallel arithmetic operations and one square root. Thus we can triangularize A
in 5(2n − 3) parallel arithmetic operations and (2n − 3) square roots. The maximum
number of processors is needed for k = n − 1, and is given by


n/2
p=4 n − j = n(3n − 2)/2.
j=1

Ignoring square roots in both the sequential and multiprocessor algorithms, we


2
obtain a speedup of roughly n5 , and a corresponding efficiency of roughly 152
. If
we compare this parallel Givens reduction with the sequential Gaussian elimination
scheme, we obtain a speedup and efficiency that are one third of the above.

Algorithm 7.1 QR by Givens rotations


Input: A ∈ Rn×n (for even n).
Output: A is upper triangular.
1: do k = 1 : n − 1,
2: doall j = 1 :
k/2 ,
3: i = 2j + n − k − 2 ;
( j)
4: Apply on the left side of A the rotation Ri,i+1 to annihilate the entry (i + 1, j) ;
5: end
6: end
7: do k = n : 2n − 3,
8: doall j = k − n + 2 :
k/2 ,
9: i = 2j + n − k − 2 ;
( j)
10: Apply on the left side of A the rotation Ri,i+1 to annihilate the entry (i + 1, j) ;
11: end
12: end

Other orderings for parallel Givens reduction are given in [7–9], with the ordering
presented here judged as being asymptotically optimal [10].
If n is not even, we simply consider the orthogonal factorization of diag(A, 1).
Also, using square root-free Givens’ rotations, e.g., see [11] or [12], we can obtain the
positive diagonal matrix D and the unit upper triangular matrix R in the factorization
Q A = D 1/2 R in O(n) parallel arithmetic operations (and no square roots) employing
O(n 2 ) processors.
Observe that Givens rotations are also useful for determining QR factorization of
matrices with structures such as Hessenberg matrices or more special patterns (see
e.g. [13]).
232 7 Orthogonal Factorization and Linear Least Squares Problems

7.3 QR Factorization via Householder Reductions

Given a vector a ∈ Rs , an elementary reflector H = H (v) = Is − βvv can be


determined such that H a = ±ae1 . Here, β = 2/v v, v ∈ Rs , and e1 is the
first column of the identity Is . Note that H is orthogonal and symmetric, i.e. H is
involutory. Application of H to a given vector x ∈ Rs , i.e. H (v)x = x − β(v x)v
can be realized on a uniprocessor using two BLAS1 routines [14], namely _DOT
and _AXPY, involving only 4s arithmetic operations.
The orthogonal factorization of A is obtained by successively applying such
orthogonal transformations to A: A1 = A and for k = 1, . . . , min(n, m − 1),
Ak+1 = Pk Ak where
 
Ik−1 0
Pk = , (7.9)
0 H (vk )

with vk chosen such that all but the first entry of the column vector H (vk )Ak (k : m, k)
(in Matlab notations) are zero.
At step k, the rest of Ak+1 = Pk Ak is obtained using the multiplication
H (vk )Ak (k : m, k : n). It can be implemented by two BLAS2 routines [15], namely
the matrix-vector multiplication DGEMV and the rank-one update DGER. The pro-
cedure stores the matrix Ak+1 in place of Ak .
On a uniprocessor, the total procedure involves 2n 2 (m − n/3) + O(mn) arith-
metic operations for obtaining R. When necessary for subsequent calculations, the
sequence of the vectors (vk )1,n , can be stored for instance in the lower part of the trans-
formed matrix Ak . When an orthogonal basis Q = [q1 , . . . , qn ] ∈ Rm×n of the range
 
I
of A is needed, the basis is obtained by pre-multiplying the matrix n successively
0
by Pn , Pn−1 , . . . , P1 . On a uniprocessor, this procedure involves 2n 2 (m − n/3) +
O(mn) additional arithmetic operations instead of 4(m 2 nmn 2 + n 3 /3) + O(mn)
operations when the whole matrix Q = [q1 , · · · , qm ] ∈ Rm×m must be assembled.
Fine Grain Parallelism
The general pipelined implementation has already been presented in Sect. 2.3. Addi-
tional details are given in [16]. Here, however, we explore the finest grain tasks that
can be used in this orthogonal factorization.
Similar to Sect. 7.2, and in order to present an account similar to that in [17], we
assume that A is a square matrix of order n.
The classical Householder reduction produces the sequence Ak+1 = Pk Ak , k =
1, 2, . . . , n − 1, so that An is upper triangular, with Ak+1 given by,
 
Rk−1 bk−1 Bk−1
Ak+1 = .
0 ρkk e1 Hk Ck−1

where Rk−1 is upper triangular of order k −1. The elementary reflector Hk = Hk (νk ),
and the scalar ρkk can be obtained in (3+log(n−k +1)) parallel arithmetic operations
7.3 QR Factorization via Householder Reductions 233

and one square root, using (n − k + 1) processors. In addition, Hk Ck−1 can be


computed in (4 + log(n − k + 1)) parallel arithmetic operations using (n − k + 1)2
processors. Thus, the total cost of obtaining the orthogonal factorization Q  A = R,
where Q = P1 P2 · · · Pn−1 is given by,


n−1
(7 + 2 log(n − r + 1)) = 2n log n + O(n),
r =1

parallel arithmetic operations and (n −1) square roots, using no more than n 2 proces-
sors. Since the sequential algorithm requires T1 = O(n 3 ) arithmetic operations,
we obtain a speedup of S p = O(n 2 / log n) and an efficiency E p proportional to
(1/ log n) using p = O(n 2 ) processors. Such a speedup is not as good as that real-
ized by parallel Givens’ reduction.
Block Reflectors
Block versions of this Householder reduction have been introduced in [18] (the W Y
form), and in [19] (the GG  form). The generation and application of the W Y form
are implemented in the routines of LAPACK [20]. They involve BLAS3 primitives
[21]. More specifically, the W Y form consists of considering a narrow window of
few columns that is reduced to the upper triangular form using elementary reflectors.
If s is the window width, the s elementary reflectors are accumulated first in the form
Pk+s · · · Pk+1 = I + W Y  where W, Y ∈ Rm×s . This expression allows the use of
BLAS3 in updating the remaining part of A.
On a limited number of processors, the block version of the algorithm is the
preferred parallel scheme; it is implemented in ScaLAPACK [22] where the matrix
A is distributed on a two-dimensional grid of processes according to the block cyclic
scheme. The block size is often chosen large enough to allow BLAS3 routines to
achieve maximum performance on each involved uniprocessor. Note, however, that
while increasing the block size improves the granularity of the computation, it may
negatively affect concurrency. An optimal tradeoff therefore must be found depending
on the architecture of the computing platform.

7.4 Gram-Schmidt Orthogonalization

The goal is to compute an orthonormal basis Q = [q1 , . . . , qn ] of the subspace


spanned by the columns of the matrix A = [a1 , . . . , an ], in such a way that, for 1 ≤
k ≤ n, the columns of Q k = [q1 , . . . , qk ] is a basis of the subspace spanned by the
first k columns of A. The Gram-Schmidt schemes consists of applying successively
the transformations Pk , k = 1, . . . , n − 1, where Pk is the orthogonal projector onto
the orthogonal complement of the subspace spanned by {a1 , . . . , ak }. R = Q  A can
be built step-by-step during the process.
234 7 Orthogonal Factorization and Linear Least Squares Problems

Two different versions of the algorithm are obtained by expressing Pk in distinct


ways:
CGS: In the classical Gram-Schmidt (Algorithm 7.2): Pk = I − Q k Q  k , where
Q k = [q1 , . . . , qk ]. Unfortunately, it was proven in [23] to be numerically unre-
liable except when it is applied two times which makes it numerically equivalent
to the following, modified Gram-Schmidt (MGS) procedure.
MGS: In the modified Gram-Schmidt (Algorithm 7.3): Pk = (I − qk qk ) · · · (I −
q1 q1 ). Numerically, the columns of Q are orthogonal up to a tolerance determined
by the machine precision multiplied by the condition number of A [23]. By ap-
plying the algorithm a second time (i.e. with a complete reorthogonalization), the
columns of Q become orthogonal up to the tolerance determined by the machine
precision parameter similar to the Householder or Givens reductions.

Algorithm 7.2 CGS: classical Gram-Schmidt


Input: A = [a1 , · · · , an ] ∈ Rm×n .
Output: Q = [q1 , · · · , qn ] ∈ Rm×n , orthonormal basis of R (A).
1: q1 = a1 /a1  ;
2: do k = 1 : n − 1,
3: r = Q  k ak+1
4: w = ak+1 − Q k r ;
5: qk+1 = w/w ;
6: end

Algorithm 7.3 MGS: modified Gram-Schmidt


Input: A = [a1 , · · · , an ] ∈ Rm×n .
Output: Q = [q1 , · · · , qn ] ∈ Rm×n , orthonormal basis of R (A).
1: q1 = a1 /a1  ;
2: do k = 1 : n − 1,
3: w = ak+1 ;
4: do j = 1 : k,
5: α j,k+1 = q j w;
6: w = w − α j,k+1 q j ;
7: end
8: qk+1 = w/w ;
9: end

The basic procedures involve 2mn 2 +O(mn) arithmetic operations on a uniproces-


sor. CGS is based on BLAS2 procedures but it must be applied two times while MGS
is based on BLAS1 routines. As mentioned in Sect. 2.3, by inverting the two loops of
MGS, the procedure can proceed with a BLAS2 routine. This is only possible when
A is available explicitly (for instance, this is not possible with the Arnoldi process
as defined in Algorithm 9.3).
7.4 Gram-Schmidt Orthogonalization 235

The parallel implementations of Gram-Schmidt orthogonalization has been dis-


cussed in Sect. 2.3. For distributed memory architectures, however, a block partition-
ing must be considered, similar to that of ScaLAPACK for MGS. In order to proceed
with BLAS3 routines, a block version BGS is obtained by considering blocks of
vectors instead of single vectors as in MGS , and by replacing the normalizing step by
an application of MGS on the individual blocks. Here, the matrix A is partitioned as
A = (A1 , . . . , A ), where we assume that  divides n (n = q). The basic primitives
used in that algorithm belong to BLAS3 thus insuring high performance on most
architectures. Note that it is necessary to apply the normalizing step twice to reach
the numerical accuracy of MGS, e.g. see [24]. The resulting method B2GS is given in
Algorithm 7.4. The number of arithmetic operations required in BGS is the same as

Algorithm 7.4 B2GS: block Gram-Schmidt


Input: A = [A1 , · · · , A ] ∈ Rm×n .
Output: Q = [Q 1 , · · · , Q  ] ∈ Rm×n , orthonormal basis of R (A).
1: W = MGS(A1 );
2: Q 1 = MGS(W );
3: do k = 1 :  − 1,
4: W = Ak+1 ;
5: do i = 1 : k,
6: Ri,k+1 = Q i W ;
7: W = W − Q i Ri,k+1 ;
8: end
9: W = MGS(W ) ;
10: Q k+1 = MGS(W ) ;
11: end

that of MGS (2mn 2 ) while that of B2GS involves 2mq 2 = 2mnq additional arith-
metic operations to reorthogonalize the  blocks of size m × q. Thus, the number of
arithmetic operations required by B2GS is (1 + 1 ) times that of MGS. Consequently,
for B2GS to be competitive, we need  to be large or the number of columns in each
block to be small. Having blocks with a small number of columns, however, will not
allow us to capitalize on the higher performance afforded by BLAS3. Using blocks
of 32 or 64 columns is often a reasonable compromise on many architectures.

7.5 Normal Equations Versus Orthogonal Reductions

For linear least squares problems (7.4) in which m n, normal equations are
very often used in several applications including statistical data analysis. In matrix
computation literature, however, one is warned of possible loss of information in
forming A A explicitly, and solving the linear system,

A Ax = A b. (7.10)
236 7 Orthogonal Factorization and Linear Least Squares Problems

whose condition number is the square of that of A.


If it is known a priori that A is properly scaled and extremely well conditioned,
one may take advantage of the high parallel scalability of the matrix multiplication
A (A, b), and solving a small linear system of order n using any of the orthogonal
factorization schemes discussed so far. Note that the total number of arithmetic opera-
tions required for solving the linear least squares problem using the normal equations
approach is 2n 2 m + O(n 3 ), while that required using Householder’s reduction is
2n 2 m + O(mn) . Hence, from the sequential complexity point of view and consider-
ing the high parallel scalability of the multiplication A (A, b), the normal equations
approach has an advantage of realizing more parallel scalability whenever m n 2 .

7.6 Hybrid Algorithms When m  n

As an alternative to the normal equations approach for solving the linear least squares
problem (7.4) when m n, we offer the following orthogonal factorization scheme
for tall-narrow matrices of maximal column rank. Such matrices arise in a variety of
applications. In this book, we consider two examples: (i) the adjustment of geodetic
networks (see Sect. 7.7), and (ii) orthogonalization of a nonorthogonal Krylov basis
(see Sect. 9.3.2). In both cases, m could be larger than 106 while n is as small as
several hundreds.
The algorithm for the orthogonal factorization of a tall-narrow matrix A on a
distributed memory architecture was first introduced in [25]. For the sake of illus-
tration, consider the case of using two multicore nodes. Partitioning A as follows:
A = [A 
1 , A2 ], In the first stage of the algorithm, the orthogonal factorization
of A1 and A2 is accomplished simultaneously to result in the upper triangular fac-
tors T1 and T2 , respectively (Fig. 7.2a). Thus, the first most time consuming stage

Fig. 7.2 Parallel (a) (b)


annihilation when m n: a
two-step procedure
7.6 Hybrid Algorithms When m n 237

is accomplished without any communication overhead. In Fig. 7.2b, T1 and T2 are


depicted as upper triangular matrices of size 4 in order to simplify describing the
second stage – how Givens rotations can be used to annihilate T2 and update T1 .
Clearly, the entry (1, 1) of the first row of T2 can be annihilated by rotating the
first rows of T1 and T2 incurring communication overhead. Once the (1, 1) entry of
T2 is annihilated, it is possible to annihilate all entries on the first diagonal of T2
by rotating the appropriate rows of T2 without the need for communication with the
first node containing T1 . This is followed by rotating row 2 of T1 with row 1 of T2 to
annihilate the entry (1, 2) in T2 incurring communication overhead. Consequently,
local rotations in the second node can annihilate all elements on the second diagonal
of T2 , and so on. Overlapping the communication-free steps with those requiring
communication, we can accomplish the annihilation of T2 much faster. The ordering
of rotations in such a strategy is shown in Fig. 7.2b. Here, as soon as entry (2, 2) of
T2 is annihilated, the entry (1, 2) of T2 can be annihilated (via internode communica-
tions) simultaneously with the communication-free annihilation of entry (3, 3) of T2 ,
and so on. Rotations indicated with stars are the only ones which require internode
communication. By following this strategy, the method RODDEC in [26] performs
the orthogonal factorization of a matrix on a ring of processors. It is clear that for
a ring of p processors, the method is weakly scalable since the first stage is void
of any communication, while in the second stage we have only nearest neighbor
communications (i.e. no global communication). Variants of this approach have also
been recently considered in [27].

7.7 Orthogonal Factorization of Block Angular Matrices

Here we consider the orthogonal factorization of structured matrices of the form,

⎛ ⎞
B1 C1
⎜ B2 C2 ⎟
⎜ ⎟
A=⎜ .. .. ⎟ (7.11)
⎝ . . ⎠
Bp C p

where A, as well as each block Bi , for i = 1, . . . , p, is of full column rank. The block
angular form implies that the QR factorization of A yields an upper triangular matrix
with the same block structure; this can be seen from the Cholesky factorization of the
block arrowhead matrix A A. Therefore, similar to the first stage of the algorithm
described above in Sect. 7.6, for i = 1, . . . , p, multicore node i first performs the
orthogonal factorization
 of the block (Bi , Ci ). The transformed block is now of
Ri E i1
the form . As a result, after the first orthogonal transformation, and an
0 E i2
appropriate permutation, we have a matrix of the form:
238 7 Orthogonal Factorization and Linear Least Squares Problems
⎛ ⎞
R1 E 11
⎜ R2 E 21 ⎟
⎜ ⎟
⎜ .. .. ⎟
⎜ . . ⎟
⎜ ⎟
⎜ R p E p1 ⎟
A1 = ⎜

⎟. (7.12)
⎜ E 12 ⎟

⎜ E 22 ⎟
⎜ ⎟
⎜ .. ⎟
⎝ . ⎠
E p2

Thus, to complete the orthogonal factorization process, we need to obtain the


QR-factorization of the tall and narrow matrix consisting of the sub-blocks E i2 ,
i = 1, . . . , p. using our orthogonal factorization scheme described in (7.6). Other
approaches for the orthogonal factorization of block angular matrices are considered
in [5].
Adjustment of Geodetic Networks
As an illustration of the application of our tall and narrow orthogonal factorization
scheme, we consider the problem of adjustment of geodetic networks which has
enjoyed a revival with the advent of the Global Positioning System (GPS), e.g. see
[28, 29]. The geodetic least squares adjustment problem is to compute accurately
the coordinates of points (or stations) on the surface of the Earth from a collection
of measured distances and angles between these points. The computations are based
upon a geodetic position network which is a mathematical model consisting of several
mesh points or geodetic stations, with unknown positions over a reference surface or
in 3-dimensional space. These observations then lead to a system of overdetermined
nonlinear equations involving, for example, trigonometric identities and distance
formulas relating the unknown coordinates. Each equation typically involves only
a small number of unknowns. Thus, resulting in sparse overdetermined system of
nonlinear equations,

F(x) = q (7.13)

where x is the vector containing the unknown coordinates, and q represents the
observation vector. Using the Gauss-Newton method for solving the nonlinear system
(7.13), we will need to solve a sequence of linear least squares problems of the
form (7.4), where A denotes the Jacobian of F at the current vector of unknown
coordinates, initially x 0 , and r 0 = q − F(x 0 ). The least squares solution vector y
is the adjustment vector, i.e., x 1 = x 0 + y is the improved approximation of the
coordinate vector x, e.g. see [30, 31].
The geodetic adjustment problem just described has the computationally con-
venient feature that the geodetic network domain can be readily decomposed into
smaller subproblems. This decomposition is based upon the Helmert blocking of the
network as described in [32]. Here, the observation matrix A is assembled into the
block angular form (7.11) by the way in which the data is collected by regions.
7.7 Orthogonal Factorization of Block Angular Matrices 239

Fig. 7.3 Geographical


partition of a geodetic
network
a1 b a2

a3 c a4

Junction Interior stations


stations

If the geodetic position network on a geographical region is partitioned as shown


in Fig. 7.3, with two levels of nested bisection, the corresponding observation matrix
A has the particular block-angular form shown in (7.14),
⎛ ⎞
A1 B1 D1
⎜ A2 B2 D2 ⎟
A=⎜

⎟ (7.14)
A3 C 1 D3 ⎠
A4 C 2 D4

where for the sake of illustration we assume that Ai ∈ Rm i ×n i , Di ∈ Rm i ×n d , for


i = 1, . . . , 4, and B j ∈ Rm j ×n b , C j ∈ Rm j+2 ×n c , for j = 1, 2 are all with more
rows than columns and of maximal column rank.
In order to solve this linear least squares problem on a 4-cluster system, in which
each cluster consisting of several multicore nodes, via orthogonal factorization of A,
we proceed as follows:
Stage 1: Let cluster i (or rather its global memory) contain the observations
corresponding to region ai and its share of regions (b or c) and d. Now, each cluster
i proceeds with the orthogonal factorization of Ai and updating its portion of (Bi or
Ci−2 ) and Di . Note that in this stage, each cluster operates in total independence of
the other three clusters. The orthogonal factorization on each cluster can be performed
by one of the algorithms described in Sects. 7.3 or 7.6.
The result of this reduction is a matrix A(1) with the following structure
⎛ ⎞
R1 ∗ ∗ cluster 1
⎜ 0 ∗ ∗ ⎟
⎜ ⎟
⎜ R2 ∗ ∗ ⎟ cluster 2
⎜ ⎟
(1)
⎜ 0 ∗ ∗ ⎟
A =⎜ ⎜ ⎟ (7.15)
R3 ∗ ∗ ⎟ cluster 3
⎜ ⎟
⎜ 0 ∗ ∗ ⎟
⎜ ⎟
⎝ R4 ∗ ∗ ⎠ cluster 4
0 ∗ ∗
240 7 Orthogonal Factorization and Linear Least Squares Problems

where Ri ∈ Rn i ×n i are upper triangular and nonsingular for i = 1, . . . , 4. The stars


represent the resulting blocks after the orthogonal transformations are applied to the
blocks B∗ , C∗ and D∗ .
Stage 2: Each cluster continues, independently, the orthogonal factorization pro-
cedure: clusters 1 and 2 factor their portions of the transformed blocks B and clusters
3 and 4 factor their portions of the transformed blocks C. The resulting matrix, A(2a)
is of the form,
⎛ ⎞
R1 ∗ ∗
⎜ 0 T1 ∗ ⎟ cluster 1
⎜ ⎟
⎜ 0 0 ∗ ⎟
⎜ ⎟
⎜ R2 ∗ ∗ ⎟
⎜ ⎟
⎜ 0 T2 ∗ ⎟ cluster 2
⎜ ⎟
⎜ 0 0 ∗ ⎟
A (2a)
=⎜


⎟ (7.16)
⎜ R3 ∗ ∗ ⎟
⎜ 0 T3 ∗ ⎟ cluster 3
⎜ ⎟
⎜ 0 0 ∗ ⎟
⎜ ⎟
⎜ R4 ∗ ∗ ⎟
⎜ ⎟
⎝ 0 T4 ∗ ⎠ cluster 4
0 0 ∗

where the matrices Ti ∈ Rn b ×n b and Ti+2 ∈ Rn c ×n c are upper triangular, i = 1, 2.


Now clusters 1, and 2, and clusters 3 and 4 need to cooperate (i.e. communicate) so
as to annihilate the upper triangular matrices T1 and T4 , respectively, to obtain the
matrix A(2b) which is of the following structure,
⎛ ⎞
R1 ∗ ∗ cluster 1
⎜ 0 0 S1 ⎟
⎜ ⎟
⎜ R2 ∗ ∗ ⎟
⎜ ⎟
⎜ 0 2
T ∗ ⎟ cluster 2
⎜ ⎟
⎜ 0 0 S2 ⎟
A(2b) =⎜


⎟ (7.17)
⎜ R3 ∗ ∗ ⎟
⎜ 0 3
T ∗ ⎟ cluster 3
⎜ ⎟
⎜ 0 0 S3 ⎟
⎜ ⎟
⎝ R4 ∗ ∗ ⎠
0 0 S4 cluster 4

where T 2 and T 3 are upper triangular and nonsingular. This annihilation is organized
so as to minimize intercluster communication. To illustrate this procedure, consider
the annihilation of T1 by T2 where each is of order n b . It is performed by elimination
of diagonals as outlined in Sect. 7.6 (see Fig. 7.2). The algorithm may be described
as follows:
7.7 Orthogonal Factorization of Block Angular Matrices 241

do k = 1 : n b ,
rotate ek T2 and e1 T1 to annihilate the element of T1 in posi-
tion (1, k), where ek denotes the k-th column of the identity
(requires intercluster communication).
do i = 1 : n b − k,
rotate rows i and i + 1 of T1 to annihilate the element in
position (i + 1, i + k) (local to the first cluster).
end
end

Stage 3: This is similar to stage 2. Each cluster i obtains the orthogonal factoriza-
tion of Si in (7.17). Notice here, however, that the computational load is not perfectly
balanced among the four clusters since S2 and S3 may have fewer rows that S1 and
S4 . The resulting matrix A(3a) is of the form,
⎛ ⎞
R1 ∗ ∗
⎜ 0 0 V1 ⎟ cluster 1
⎜ ⎟
⎜ 0 0 0 ⎟
⎜ ⎟
⎜ R2 ∗ ∗ ⎟
⎜ ⎟
⎜ 0 2
T ∗ ⎟ cluster 2
⎜ ⎟
⎜ 0 0 V2 ⎟
A (3a)
=⎜


⎟ (7.18)
⎜ R3 ∗ ∗ ⎟
⎜ 0 2
T ∗ ⎟ cluster 3
⎜ ⎟
⎜ 0 0 V3 ⎟
⎜ ⎟
⎜ R4 ∗ ∗ ⎟
⎜ ⎟
⎝ 0 0 V4 ⎠ cluster 4
0 0 0

where V j is nonsingular upper triangular, j = 1, 2, 3, 4. This is followed by the


annihilation of all except the one in cluster2. Here, clusters 1 and 2 cooperate (i.e.
communicate) to annihilate V1 , while clusters 3 and 4 cooperate to annihilate V4 .
Next, clusters 2 and 3 cooperate to annihilate V3 . Such annihilation of the Vi ’s is
performed as described in stage 2. The resulting matrix structure is given by:
⎛ ⎞
R1 ∗ ∗ cluster 1
⎜ 0 0 0 ⎟
⎜ ⎟
⎜ R2 ∗ ∗ ⎟
⎜ ⎟
⎜ 0 2
T ∗ ⎟ cluster 2
⎜ ⎟
⎜ 0 0 2
V ⎟
A(3b) =⎜


⎟ (7.19)
⎜ R3 ∗ ∗ ⎟
⎜ 0 3
T ∗ ⎟ cluster 3
⎜ ⎟
⎜ 0 0 0 ⎟
⎜ ⎟
⎝ R4 ∗ ∗ ⎠
0 0 0 cluster 4
242 7 Orthogonal Factorization and Linear Least Squares Problems

where the stars denote dense rectangular blocks.


Stage 4: If the observation vector f were appended to the block D in (7.14),
the back-substitution process would follow immediately. After cluster 2 solves the
upper triangular system V 2 y = g, see (7.19), and communicating y to the other three
clusters, all clusters update their portion of the right-hand side simultaneously. Next,
clusters 2 and 3 simultaneously solve two triangular linear systems involving T 2 ,
3 . Following that, clusters 1 and 2, and clusters 3 and 4 respectively cooperate
and T
in order to simultaneously update their corresponding portions of the right-hand
side. Finally, each cluster j, j = 1, 2, 3, 4 independently solves a triangular system
(involving R j ) to obtain the least squares solution.

7.8 Rank-Deficient Linear Least Squares Problems

When A is of rank q < n, there is an infinite set of solutions of (7.4): for if x is a


solution, then any vector in the set S = x + N (A) is also a solution, where N (A)
is the (n − q)-dimensional null space of A. If the Singular Value Decomposition
(SVD) of A is given by,
 
Σ 0q,n−q
A=V U , (7.20)
0m−q,q 0m−q,n−q

where the diagonal entries of the matrix Σ = diag(σ1 , . . . , σq ) are the singular
values of A, and U ∈ Rn×n , V ∈ Rm×m are orthogonal matrices. If bis the  right-
 c1
hand side of the linear least squares problem (7.4), and c = V b = where
c2
c1 ∈ R q , then x is a solution in S if and only if there exists y ∈ Rn−q such that

Σ −1 c1
x =U . In this case, the corresponding minimum residual norm is equal
y
to c2 . In S , the classically selected solution is the one with smallest 2-norm, i.e.
when y = 0.
Householder Orthogonalization with Column Pivoting
In order to avoid computing the singular-value decomposition which is quite ex-
pensive, the factorization (7.20) is often replaced by the orthogonal factorization
introduced in [33]. This procedure computes the QR factorization by Householder
reductions with column pivoting. Factorization (7.2) is now replaced by
 
T
Q q · · · Q 2 Q 1 AP1 P2 · · · Pq = , (7.21)
0

where matrices P1 , . . . , Pq are permutations and T ∈ Rq×n is upper-trapezoidal.


The permutations are determined so as to insure that the absolute values of the
diagonal entries of T are non increasing. To obtain the so called complete orthogonal
7.8 Rank-Deficient Linear Least Squares Problems 243

factorization of A the process performs first the orthogonal factorization (7.21),


followed by the LQ factorization of T . Here, the LQ factorization is directly obtained
through the QR factorization of T  . Thus, the process yields a factorization similar
to (7.20) but with Σ being a nonsingular lower-triangular matrix. In Sect. 7.8 an
alternative transformation is proposed.
In floating point arithmetic, the process will not end up with exact zero blocks.
Neglecting roundoff errors, we obtain a factorization of the form:
 
 R11 R12
Q AP = ,
0 R22

where Q = Q 1 Q 2 · · · Q q and P1 P2 · · · Pq . The generalized inverse of A can then


approximated by
 +
R11 R12
B=P Q.
0 0

An upper bound on the Frobenius norm of the error in such an approximation of the
generalized inverse is given by

B − A+  F ≤ 2R22  F max{A+ 2 , B2 }, (7.22)

e.g. see [4].


The basic algorithm is based on BLAS2 routines but a block version based on
BLAS3 and its parallel implementation have been proposed in [34, 35]. Dynamic
column pivoting limits potential parallelism in the QR factorization. For instance,
it prevents the use of the efficient algorithm proposed for tall-narrow matrices (see
Sect. 7.6). In order to enhance parallelism, an attempt to avoid global pivoting has
been developed in [36]. It consists of partitioning the matrix into blocks of columns
and adopting a local pivoting strategy limited to each block which is allocated to
a single processor. This strategy is numerically acceptable as long as the condition
number of a block remains below some threshold. In order to control numerical qual-
ity of the procedure, an incremental condition estimator (ICE) is used, e.g. see [37].
Another technique consists of delaying the pivoting procedure by first performing
a regular QR factorization, and then applying the rank revealing procedure on the
resulting matrix R. This is considered next.
A Rank-Revealing Post-Processing Procedure
Let R ∈ Rn×n be the upper triangular matrix resulting from an orthogonal factoriza-
tion without pivoting. Next, determine a permutation P ∈ Rn×n and an orthogonal
matrix Q ∈ Rn×n such that,
 
R11 R12
Q  RP = , (7.23)
0 R22
244 7 Orthogonal Factorization and Linear Least Squares Problems

is an upper triangular matrix in which R11 ∈ Rq×q , where q is the number of singular
values larger than an a priori fixed threshold η > 0, and the 2-norm R22  = O(η).
Here, the parameter η > 0 is chosen depending on the 2-norm of R since at the end
of the process cond(R11 ) ≤ R η .

Algorithm 7.5 Rank-revealing QR for a triangular matrix.


Input: R ∈ Rn×n sparse triangular, η > 0.
Output: S : permuted upper triangular matrix, P is the permutation, q is the rank.
1: k = 1;  = n ; P = In ; S = R;
2: σmin = |S11 |; xmin = 1;
3: while k < ,
4: if σmin > η, then
5: k = k + 1;
6: Compute the smallest singular value σmin of S1:k,1:k and its corresponding left
singular vector xmin ;
7: else
8: From xmin , select the column j to be rejected;
9: Put the rejected column in the last position;
10: Eliminate the subdiagonal entries of Sk+1:,k+1: ;
11: Update(P) ;  =  − 1 ; k = k − 1;
12: Update σmin and xmin ;
13: end if
14: end while
15: if σmin > η, then
16: q = ;
17: else
18: q =  − 1;
19: end if

This post-processing procedure is given in Algorithm 7.5. The output matrix is


computed column by column. At step k, the smallest singular value of the triangular
matrix Sk ∈ Rk×k is larger than η. If, by appending the next column, the smallest
singular value passes below the threshold η, one of the columns is moved to the
right end of the matrix and the remaining columns to be processed are shifted to
the left by one position. This technique assumes that the smallest singular value and
corresponding left singular vector can be computed. The procedure ICE [37] updates
the estimate of the previous rank with O(k) operations. After an updating step to
restore the upper triangular form (by using Givens rotations), the process is repeated.
Proposition 7.1 Algorithm 7.5 involves O(K n 2 ) arithmetic operations where K is
the number of occurrences of column permutation.
Proof Let assume that at step k, column k is rejected in the last position (Step 9 of
Algorithm 7.5). Step 10 of the algorithm is performed by the sequence of Givens
( j) ( j)
rotations R j, j+1 for j = k + 1, . . . ,  where R j, j+1 is defined in (7.7). This process
involves O((l − k)2 ) arithmetic operations which is bounded by O(n 2 ). This proves
the proposition since the total use of ICE involves O(n 2 ) arithmetic operations as
well.
7.8 Rank-Deficient Linear Least Squares Problems 245

Algorithm 7.5 insures that R22  = O(η) as shown in the following theorem:
Theorem 7.2 Let us assume that the smallest singular values of blocks defined by
the Eq. (7.23) satisfy the following conditions:
 
R11 u 1
σmin (R11 ) > η and σmin ≤η
0 u2
   
u1 R12
for any column of the matrix ∈ Rn×(n−q) . Therefore the following
u2 R22
bounds hold
⎛ ⎞
−1
1 + R11 u 1 2
u 2  ≤ ⎝ −1
⎠ η, (7.24)
1 − ηR11 
⎛ ⎞
−1
n − q + R11 R12 2F
R22  F ≤ ⎝ −1
⎠ η, (7.25)
1 − ηR11 

where R11 ∈ Rq×q and where . F denotes the Frobenius norm.

Proof We have
   
R11 u 1 R11 u 1
η ≥ σmin = σmin ,
0 u2 0 ρ

where ρ = u 2 . Therefore,


−1 −1
1 R11 − ρ1 R11 u1
≤ 1 ,
η 0 ρ

−1 1 −1
≤ R11  + 1 + R11 u 1 2
ρ

which implies (7.24). The bound (7.25) is obtained by summing the squares of the
bounds for all the columns of R22 .
−1
Remark 7.1 The hypothesis of the theorem insure that τsep = 1−ηR11  is positive.
This quantity, however, might be too small if the singular values of R are not well
separated in the neighborhood of η.

Remark 7.2 The number of column permutations corresponds to the rank deficiency:

K = n − q,

where q is the η-rank of A.


246 7 Orthogonal Factorization and Linear Least Squares Problems

Once the decomposition (7.23) is obtained, the generalized inverse of R is approxi-


mated by
 +
R11 R12
S=P . (7.26)
0 0

An upper bound for the approximation error isgiven by (7.22).


 To finalize the com-
plete orthogonalization, we post-multiply T = R11 R12 ∈ Rq×n by the orthogonal
matrix Q obtained from the QR factorization of T  :

(L , 0) = T Q. (7.27)

This post processing rank-revealing procedure offers only limited chances for ex-
ploiting parallelism. In fact, if the matrix A is tall and narrow or when its rank
deficiency is relatively small, this post processing procedure can be used on a
uniprocessor.

References

1. Halko, N., Martinsson, P., Tropp, J.: Finding structure with randomness: probabilistic algo-
rithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011).
doi:10.1137/090771806
2. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series
in Statistics. Springer, New York (2001)
3. Kontoghiorghes, E.: Handbook of Parallel Computing and Statistics. Chapman & Hall/CRC,
New York (2005)
4. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
5. Björck, Å.: Numerical Methods for Least Squares Problems. SIAM, Philadelphia (1996)
6. Sameh, A., Kuck, D.: On stable parallel linear system solvers. J. Assoc. Comput. Mach. 25(1),
81–91 (1978)
7. Modi, J., Clarke, M.: An alternative givens ordering. Numerische Mathematik 43, 83–90 (1984)
8. Cosnard, M., Muller, J.M., Robert, Y.: Parallel QR decomposition of a rectangular matrix.
Numerische Mathematik 48, 239–249 (1986)
9. Cosnard, M., Daoudi, E.: Optimal algorithms for parallel Givens factorization on a coarse-
grained PRAM. J. ACM 41(2), 399–421 (1994). doi:10.1145/174652.174660
10. Cosnard, M., Robert, Y.: Complexity of parallel QR factorization. J. ACM 33(4), 712–723
(1986)
11. Gentleman, W.M.: Least squares computations by Givens transformations without square roots.
IMA J. Appl. Math. 12(3), 329–336 (1973)
12. Hammarling, S.: A note on modifications to the givens plane rotation. J. Inst. Math. Appl. 13,
215–218 (1974)
13. Kontoghiorghes, E.: Parallel Algorithms for Linear Models: Numerical Methods and Estima-
tion Problems. Advances in Computational Economics. Springer, New York (2000). http://
books.google.fr/books?id=of1ghCpWOXcC
14. Lawson, C., Hanson, R., Kincaid, D., Krogh, F.: Basic linear algebra subprogams for Fortran
usage. ACM Trans. Math. Softw. 5(3), 308–323 (1979)
15. Dongarra, J., Croz, J.D., Hammarling, S., Hanson, R.: An extended set of FORTRAN basic
linear algebra subprograms. ACM Trans. Math. Softw. 14(1), 1–17 (1988)
References 247

16. Gallivan, K.A., Plemmons, R.J., Sameh, A.H.: Parallel algorithms for dense linear algebra
computations. SIAM Rev. 32(1), 54–135 (1990). doi:10.1137/1032002
17. Sameh, A.: Numerical parallel algorithms—a survey. In: Kuck, D., Lawrie, D., Sameh, A.
(eds.) High Speed Computer and Algorithm Optimization, pp. 207–228. Academic Press, San
Diego (1977)
18. Bischof, C., van Loan, C.: The WY representation for products of Householder matrices. SIAM
J. Sci. Stat. Comput. 8(1), 2–13 (1987). doi:10.1137/0908009
19. Schreiber, R., Parlett, B.: Block reflectors: theory and computation. SIAM J. Numer. Anal.
25(1), 189–205 (1988). doi:10.1137/0725014
20. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
21. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
22. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). http://www.netlib.org/scalapack
23. Björck, Å.: Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT 7,
1–21 (1967)
24. Jalby, W., Philippe, B.: Stability analysis and improvement of the block Gram-Schmidt algo-
rithm. SIAM J. Stat. Comput. 12(5), 1058–1073 (1991)
25. Sameh, A.: Solving the linear least-squares problem on a linear array of processors. In: Sny-
der, L., Gannon, D., Jamieson, L.H., Siegel, H.J. (eds.) Algorithmically Specialized Parallel
Computers, pp. 191–200. Academic Press, San Diego (1985)
26. Sidje, R.B.: Alternatives for parallel Krylov subspace basis computation. Numer. Linear Alge-
bra Appl. 4, 305–331 (1997)
27. Demmel, J., Grigori, L., Hoemmen, M., Langou, J.: Communication-optimal parallel and se-
quential QR and LU factorizations. SIAM J. Sci. Comput. 34(1), 206–239 (2012). doi:10.1137/
080731992
28. Chang, X.W., Paige, C.: An algorithm for combined code and carrier phase based GPS posi-
tioning. BIT Numer. Math. 43(5), 915–927 (2003)
29. Chang, X.W., Guo, Y.: Huber’s M-estimation in relative GPS positioning: computational as-
pects. J. Geodesy 79(6–7), 351–362 (2005)
30. Bomford, G.: Geodesy, 3rd edn. Clarendon Press, England (1971)
31. Golub, G., Plemmons, R.: Large scale geodetic least squares adjustment by dissection and
orthogonal decomposition. Numer. Linear Algebra Appl. 35, 3–27 (1980)
32. Wolf, H.: The Helmert block method—its origin and development. In: Proceedings of the
Second Symposium on Redefinition of North American Geodetic Networks, pp. 319–325
(1978)
33. Businger, P., Golub, G.H.: Linear least squares solutions by Householder transformations.
Numer. Math. 7, 269–276 (1965)
34. Quintana-Ortí, G., Quintana-Ortí, E.: Parallel algorithms for computing rank-revealing QR fac-
torizations. In: Cooperman, G., Michler, G., Vinck, H. (eds.) Workshop on High Performance
Computing and Gigabit Local Area Networks. Lecture Notes in Control and Information Sci-
ences, pp. 122–137. Springer, Berlin (1997). doi:10.1007/3540761691_9
35. Quintana-Ortí, G., Sun, X., Bischof, C.: A BLAS-3 version of the QR factorization with column
pivoting. SIAM J. Sci. Comput. 19, 1486–1494 (1998)
36. Bischof, C.: A parallel QR factorization algorithm using local pivoting. In: Proceedings of
1988 ACM/IEEE Conference on Supercomputing, Supercomputing’88, pp. 400–499. IEEE
Computer Society Press, Los Alamitos (1988)
37. Bischof, C.H.: Incremental condition estimation. SIAM J. Matrix Anal. Appl. 11, 312–322
(1990)
Chapter 8
The Symmetric Eigenvalue
and Singular-Value Problems

Eigenvalue problems form the second most important class of problems in numerical
linear algebra. Unlike linear system solvers which could be direct, or iterative, eigen-
solvers can only be iterative in nature. In this chapter, we consider real symmetric
eigenvalue problems (and by extension complex hermitian eigenvalue problems), as
well as the problem of computing the singular value decomposition.
Given a symmetric matrix A ∈ Rn×n , the standard eigenvalue problem consists
of computing the eigen-elements of A which are:
• either all or a few selected eigenvalues only: these are all the n, or p  n selected,
roots λ of the nth degree characteristic polynomial:

det(A − λI ) = 0. (8.1)

The set of all eigenvalues is called the spectrum of A: Λ(A) = {λ1 , . . . , λn }, in


which each λ j is real.
• or eigenpairs: in addition to an eigenvalue λ, one seeks the corresponding eigen-
vector x ∈ Rn which is the nontrivial solution of the singular system

(A − λI )x = 0. (8.2)

The procedures discussed in this chapter apply with minimal changes in case the
matrix A is complex Hermitian since the eigenvalues are still real and the complex
matrix of eigenvectors is unitary.
Since any symmetric or Hermitian matrix can be diagonalized by an orthogonal
or a unitary matrix, the eigenvalues are insensitive to small symmetric perturbations
of A. In other words, when A + E is a symmetric update of A, the eigenvalues of
A + E cannot be at a distance exceeding E2 compared to those of A. Therefore,
the computation of the eigenvalues of A with largest absolute values is always well
conditioned. Only the relative accuracy of the eigenvalues with smallest absolute
values can be affected by perturbations. Accurate computation of the eigenvectors is
more difficult in case of poorly separated eigenvalues, e.g. see [1, 2].

© Springer Science+Business Media Dordrecht 2016 249


E. Gallopoulos et al., Parallelism in Matrix Computations,
Scientific Computation, DOI 10.1007/978-94-017-7188-7_8
250 8 The Symmetric Eigenvalue and Singular-Value Problems

The Singular Value Decomposition of a matrix can be expressed as a symmetric


eigenvalue problem.
Theorem 8.1 Let A ∈ Rm×n (m ≥ n) have the singular value decomposition

V  AU = Σ,

where U = [u 1 , . . . , u n ] ∈ Rn×n and V = [v1 , . . . , vm ] ∈ Rm×m are orthogo-


nal matrices and Σ ∈ Rm×n is a rectangular matrix which is diagonal. Then, the
symmetric matrix

Anr m = A A ∈ Rn×n , (8.3)

has eigenvalues σ1 2 ≥ · · · ≥ σn 2 ≥ 0, corresponding to the eigenvectors (u i ),


(i = 1, . . . , n). Anr m is called the matrix of the normal equations. The symmetric
matrix
 
0 A
Aaug = (8.4)
A 0

has eigenvalues ±σ1 , . . . , ±σn , corresponding to the eigenvectors


 
1 vi
√ , i = 1, . . . , n.
2 ±u i

Aaug is called the augmented matrix.

Parallel Schemes
We consider the following three classical standard eigenvalue solvers for problems
(8.1) and (8.2):
• Jacobi iterations,
• QR iterations, and
• the multisectioning method.
Details of the above three methods for uniprocessors are given in many references,
see e.g. [1–4].
While Jacobi’s method is capable of yielding more accurate eigenvalues and per-
fectly orthogonal eigenvectors, in general it consumes more time on uniprocessors, or
even on some parallel architectures, compared to tridiagonalization followed by QR
iterations. It requires more arithmetic operations and memory references. Its high
potential for parallelism, however, warrants its examination. The other two meth-
ods are often more efficient when combined with the tridiagonalization process:
T = Q  AQ, where Q is orthogonal and T is tridiagonal.
Computing the full or partial singular value decomposition of a matrix is based
on symmetric eigenvalue problem solvers involving either Anrm , or Aaug , as given by
8 The Symmetric Eigenvalue and Singular-Value Problems 251

(8.3) and (8.4), respectively. We consider variants of the Jacobi methods—(one-sided


[5], and two sided [6]), as well as variant of the QR method, e.g. see [3]. For the
QR scheme, similar to the eigenvalue problem, an initial stage reduces Anrm to the
tridiagonal form,
T = W  A AW, (8.5)

where W is an orthogonal matrix. Further, if A is of maximal column rank, T is


symmetric positive definite for which one can obtain the Cholesky factorization

T = B  B, (8.6)

where B is an upper bidiagonal matrix. Therefore the matrix V = AWB−1 is orthog-


onal and
A = VBW . (8.7)

This decomposition is called bidiagonalization of A. It can be directly obtained by


two sequences of Householder transformations respectively applied to the left and
the right sides of A without any constraints on the matrix rank.

8.1 The Jacobi Algorithms

Jacobi’s algorithm consists of annihilating successively off-diagonal entries of the


matrix via orthogonal similarity transformations. While the method was abandoned,
on uniprocessors, due to the high computational cost compared to Householder’s
reduction to the tridiagonal form followed by Francis’ QR iterations, it was revived
in the early days of parallel computing especially when very accurate eigenvalues
or perfectly orthogonal eigenvectors are needed, e.g. see [7]. From this (two-sided)
method, a one-sided Jacobi scheme is directly derived for the singular value problem.

8.1.1 The Two-Sided Jacobi Scheme for the Symmetric


Standard Eigenvalue Problem

Consider the eigenvalue problem


Ax = λx (8.8)

where A ∈ Rn×n is a dense symmetric matrix. The original Jacobi’s method for
determining all the eigenpairs of (8.8) reduces the matrix A to the diagonal form by
an infinite sequence of plane rotations.

Ak+1 = Uk Ak Uk , k = 1, 2, . . . ,
252 8 The Symmetric Eigenvalue and Singular-Value Problems

where A1 = A, and Uk = Rk (i, j, θikj ) is a rotation of the (i, j)-plane in which

u iik = u kj j = ck = cos θikj and u ikj = −u kji = sk = sin θikj .

The angle θikj is determined so that αik+1


j = α k+1
ji = 0, i.e.

2αikj
tan 2θikj = ,
αiik − α kj j

where |θikj | ≤ 14 π . For numerical stability, we determine the plane rotation by

1
ck =  and sk = ck tk ,
1 + tk2

where tk is the smaller root (in magnitude) of the quadratic equation

αiik − α kj j
tk2 + 2τk tk − 1 = 0, in which τk = cot 2θikj = . (8.9)
2αikj

Hence, tk may be written as

sign (τk )
tk =  (8.10)
|τk | + 1 + τk2

Each Ak+1 remains symmetric and differs from Ak only in the ith and jth rows and
columns, with the modified elements given by

αiik+1 = αiik + tk αikj ,


(8.11)
α k+1
j j = α j j − tk αi j ,
k k

and
k+1
αir = ck αir
k
+ sk α kjr , (8.12)

α k+1
jr = −sk αir + ck α jr ,
k k

in which 1 ≤ r ≤ n and r = i, j. If Ak is expressed as the sum

Ak = Dk + E k + E k , (8.13)

where Dk is diagonal and E k is strictly upper triangular, then as k increases E k  F


(k) (k)
approaches zero, and Ak approaches the diagonal matrix Dk = diag(λ1 , λ2 , . . . ,
8.1 The Jacobi Algorithms 253

(k)
λn ) (here, · F denotes the Frobenius norm). Similarly, the transpose of the product
(Uk · · · U2 U1 ) approaches a matrix whose jth column is an eigenvector correspond-
ing to λ j .
Several schemes are possible for selecting the sequence of elements αikj to be
eliminated via the plane rotations Uk . Unfortunately, Jacobi’s original scheme,
which consists of sequentially searching for the largest off-diagonal element, is
too time consuming for implementation on a multiprocessor. Instead, a simpler
scheme in which the off-diagonal elements (i, j) are annihilated in the cyclic fash-
ion (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), . . . , (n − 1, n) is usually adopted as
its convergence is assured [6]. We refer to each sequence of n(n−1) 2 rotations as a
sweep. Furthermore, quadratic convergence for this sequential cyclic Jacobi scheme
has been established (e.g. see [8, 9]). Convergence usually occurs within a small
number of sweeps, typically in O(log n) sweeps.
A parallel version of this cyclic Jacobi algorithm is obtained by the simultaneous
annihilation of several off-diagonal elements by a given Uk , rather than annihilat-
ing only one off-diagonal element and its symmetric counterpart as is done in the
sequential version. For example, let A be of order 8 and consider the orthogonal
matrix Uk as the direct sum of 4 independent plane rotations, where the ci ’s and si ’s
for i = 1, 2, 3, 4 are simultaneously determined. An example of such a matrix is

Uk = Rk (1, 3) ⊕ Rk (2, 8) ⊕ Rk (4, 7) ⊕ Rk (5, 6), (8.14)

where Rk (i, j) is that rotation which annihilates the (i, j) and ( j, i) off-diagonal
elements. Now, a sweep can be seen as a collection of orthogonal similarity trans-
formations where each of them simultaneously annihilates several off-diagonal pairs
and such that each of the off-diagonal entries is annihilated only once by the sweep.
For a matrix of order 8, an optimal sweep will consist of 7 successive orthogonal
transformations with each one annihilating distinct groups of 4 off-diagonal elements
simultaneously, as shown in the left array of Table 8.1, where the similarity trans-
formation of (8.14) is U6 . On the right array of Table 8.1, the sweep for a matrix

Table 8.1 Annihilation scheme for 2JAC

n = 8. n = 9.

The entries indicate the step in which these elements are annihilated
254 8 The Symmetric Eigenvalue and Singular-Value Problems

of order n = 9 appears to be made of 9 successive orthogonal transformations of 4


independent rotations.
Although several annihilation schemes are possible, we adopt the scheme 2JAC,
see [10], which is used for the two cases in Table 8.1. This scheme is optimal for
any n:
Theorem 8.2 The ordering defined in Algorithm 8.1, is optimal for any n: the n(n−1) 2
rotations of a sweep are partitioned into 2m − 1 steps, each consisting of p = n2 =
n − m independent rotations, where m = n+1 2 .
Proof Obviously, the maximum number of independent rotations in one step is n2 .
Algorithm 8.1 defines an ordering such that: (i) if n is odd, m = n+1
2 and the sweep
is composed of n = 2m − 1 steps each consisting of p = n−1 2 = n − m independent
rotations; (ii) if n is even, m = n2 and the sweep is composed of n − 1 = 2m − 1
steps each consisting of p = n2 independent rotations.
In the annihilation of a particular (i, j)-element, we update the off-diagonal ele-
ments in rows and columns i and j as given by (8.11) and (8.12). It is possible to
modify only those row or column entries above the main diagonal and utilize the
guaranteed symmetry of Ak . However, if one wishes to take advantage of the vector-
ization capability of a parallel computing platform, we may disregard the symmetry
of Ak and operate with full vectors on the entirety of rows and columns i and j in
(8.11) and (8.12), i.e., we may use a full matrix scheme. The product of the Uk ’s,
which eventually yields the eigenvectors for A, is accumulated in a separate two-
dimensional array by applying (8.11) and (8.12) to the identity matrix of order n.
Convergence is assumed to occur at stage k once each off-diagonal element is
below a small fraction of A F , see the tolerance parameter τ in line 38 of Algo-
rithm 8.1. Moreover, in [11, 12] it has been demonstrated that various parallel Jacobi
rotation ordering schemes (including 2JAC) are equivalent to the sequential row
ordering scheme for which convergence is assured. Hence, the parallel schemes
share the same convergence properties, e.g. quadratic convergence, see [9, 13]. An
ordering used in chess and bridge tournaments that requires n to be even but which can
be implemented on a ring or a grid of processors (see also [3]), has been introduced
in [14].

8.1.2 The One-Sided Jacobi Scheme for the Singular Value


Problem

The derivation of the one-sided Jacobi method is motivated by the singular value
decomposition of rectangular matrices. It can be used to compute the eigenvalues of
the square matrix A in (8.8) when A is symmetric positive definite. It is considerably
more efficient to apply a one-sided Jacobi method which in effect only post-multiplies
A by plane rotations. Let A ∈ Rm×n with m ≥ n and rank(A) = r ≤ n. The singular
value decomposition of A is defined by the corresponding notations:
8.1 The Jacobi Algorithms 255

Algorithm 8.1 2JAC: two-sided Jacobi scheme.


Input: A = (αi j ) ∈ Rn×n , symmetric, τ > 0
Output: D = diag(λ1 , · · · , λn )
n(n−1)
1: m = n+12 ; istop = 2 ;
2: while istop > 0
3: istop = n(n−1) 2 ;
4: do k = 1 : 2m − 1,
5: //Define Uk from n − m pairs:
6: Pk = ∅;
7: if k ≤ m − 1, then
8: do j = m − k + 1 : n − k,
9: if j ≤ 2m − 2k, then
10: i = 2m − 2k + 1 − j ;
11: else if j ≤ 2m − k − 1, then
12: i = 4m − 2k − j;
13: else
14: i = n;
15: end if
16: if i > j, then
17: Exchange i and j;
18: end if
19: Pk = Pk ∪ {(i, j)};
20: end
21: else
22: do j = 4m − n − k : 3m − k − 1,
23: if j < 2m − k + 1, then
24: i =n;
25: else if j ≤ 4m − 2k − 1, then
26: i = 4m − 2k − j;
27: else
28: i = 6m − 2k − 1 − j;
29: end if
30: if i > j, then
31: Exchange i and j;
32: end if
33: Pk = Pk ∪ {(i, j)};
34: end
35: end if
36: //Apply A := Uk AUk :
37: doall (i, j) ∈ Pk ,
38: if |αi j | ≤ A F .τ then
39: istop = istop − 1; R(i, j) = I ;
40: else
41: Determine the Jacobi rotation R(i, j) ;
42: Apply R(i, j) on the right side;
43: end if
44: end
45: doall (i, j) ∈ Pk ,
46: Apply rotation R(i, j) on the left side;
47: end
48: end
49: end while
50: doall i=1:n
51: λi = αii ;
52: end
256 8 The Symmetric Eigenvalue and Singular-Value Problems

A = V ΣU  , (8.15)

where U ∈ Rm×r and V ∈ Rn×r are orthogonal matrices and Σ = diag(σ1 , . . . , σr )


∈ Rr ×r with σ1 ≥ σ2 ≥ · · · ≥ σr > 0. The r columns of U and the r columns of V
yield the orthonormalized eigenvectors associated with the r non-zero eigenvalues
of A A and A A , respectively.
As indicated in [15], an efficient parallel scheme for computing the decomposition
(8.15) on a ring of processors is realized through using a method based on the
one-sided iterative orthogonalization method in [5] (see also [16, 17]). Further, this
singular value decomposition scheme was first implemented in [18]. In addition,
versions of the above two-sided Jacobi scheme have been presented in [14, 19]. Next,
we consider a few modifications to the scheme discussed in [15] for the determination
of the singular value decomposition (8.15) on shared memory multiprocessors, e.g.,
see [20].
Our main goal is to determine the orthogonal matrix U ∈ Rn×r , of (8.15) such
that
AU = Q = (q1 , q2 , . . . , qr ), (8.16)

is a matrix of orthogonal columns, i.e.

qi q j = σi2 δi j ,

in which δi j is the Kronecker-delta. Writing Q as

Q = V Σ where V  V = Ir , and Σ = diag(σ1 , . . . , σr ),

then factorization (8.15) is entirely determined. We construct the matrix U via the
plane rotations  
c −s
(ai , a j ) = (ãi , ã j ), i < j,
s c

so that
ãi ã j = 0 and ãi  ≥ ã j , (8.17)

where ai designates the ith column of the matrix A. This is accomplished by choosing
 1/2  
β +γ α
c= and s = if β > 0, (8.18)
2γ 2γ c
or  1/2  
γ −β α
s= and c = if β < 0, (8.19)
2γ 2γ s
8.1 The Jacobi Algorithms 257

where α = 2ai a j , β = ai 2 − a j 2 , and γ = (α 2 + β 2 )1/2 . Note that (8.17)


requires the columns of Q to decrease in norm from left to right thus assuring that
the resulting singular values σi appear in nonincreasing order. Several schemes can
be used to select the order of the (i, j)-plane rotations. Following the annihilation
pattern of the off-diagonal elements in the sequential Jacobi algorithm mentioned in
Sect. 8.1, we could certainly orthogonalize the columns in the same cyclic fashion and
thus perform the one-sided orthogonalization sequentially. This process is iterative
with each sweep consisting of 21 n(n − 1) plane rotations selected in cyclic fashion.
Adopting the parallel annihilation scheme in 2JAC outlined in Algorithm 8.1, we
obtain the parallel version 1JAC of the one-sided Jacobi method for computing the
singular value decomposition on a multiprocessor (see Algorithm 8.2). For example,
let n = 8 and m ≥ n so that in each sweep of our one-sided Jacobi algorithm we
simultaneously orthogonalize four pairs of the columns of A (see the left array in
Table 8.1). More specifically, we can orthogonalize the pairs (1, 3), (2, 8) (4, 7),
(5, 6) simultaneously via post-multiplication by an orthogonal transformation Uk
which consists of the direct sum of 4 plane rotations (identical to U6 introduced in
(8.14)). At the end of any particular sweep si we have

$$U_{s_i} = U_1 U_2 \cdots U_{2q-1},$$
where $q = (n+1)/2$, and hence
$$U = U_{s_1} U_{s_2} \cdots U_{s_t}, \qquad (8.20)$$

where t is the number of sweeps required for convergence.


In the orthogonalization step, lines 10 and 11 of Algorithm 8.2, we are implement-
ing the plane rotations given by (8.18) and (8.19), and hence guaranteeing the proper
ordering of column norms and singular values upon termination. Whereas 2JAC must
update rows and columns following each similarity transformation, 1JAC performs
only post-multiplication of Ak by each Uk and hence the plane rotation (i, j) affects
only columns i and j of the matrix Ak , with the updated columns given by

$$a_i^{k+1} = c\,a_i^k + s\,a_j^k, \qquad (8.21)$$
$$a_j^{k+1} = -s\,a_i^k + c\,a_j^k, \qquad (8.22)$$
where $a_i^k$ denotes the ith column of $A_k$, and c, s are determined by either (8.18) or
(8.19). On a parallel architecture with vector capability, one expects to realize high
performance in computing (8.21) and (8.22). Each processor is assigned one rotation
and hence orthogonalizes one pair of the n columns of matrix Ak .
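To make the column-pair update concrete, the following sketch applies the rotation defined by (8.18)–(8.19) to one pair of columns and overwrites them as in (8.21)–(8.22). It is a minimal serial NumPy illustration, not the tuned parallel kernel discussed above; the function name and the small-γ guard are our own.

import numpy as np

def orthogonalize_pair(A, i, j, eps=1e-15):
    # One Hestenes rotation: make columns i and j of A orthogonal, keeping
    # the heavier column on the left, cf. (8.17)-(8.19).
    ai, aj = A[:, i], A[:, j]
    alpha = 2.0 * (ai @ aj)                  # alpha = 2 a_i' a_j
    beta = ai @ ai - aj @ aj                 # beta  = ||a_i||^2 - ||a_j||^2
    gamma = np.hypot(alpha, beta)            # gamma = (alpha^2 + beta^2)^(1/2)
    if gamma < eps:                          # both columns numerically null
        return
    if beta >= 0.0:                          # formulas (8.18)
        c = np.sqrt((beta + gamma) / (2.0 * gamma))
        s = alpha / (2.0 * gamma * c)
    else:                                    # formulas (8.19)
        s = np.sqrt((gamma - beta) / (2.0 * gamma))
        c = alpha / (2.0 * gamma * s)
    # the rotation touches only columns i and j, as in (8.21)-(8.22)
    A[:, i], A[:, j] = c * ai + s * aj, -s * ai + c * aj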

Algorithm 8.2 1JAC: one-sided Jacobi for rank-revealing SVD


Input: A = [a1 , · · · , an ] ∈ Rm×n where m ≥ n, and τ > 0.
Output: Σ = diag(σ1 , · · · , σr ) and U = [u 1 , · · · , u r ] where r is the column-rank of A.
1: q = (n + 1)/2; istop = n(n − 1)/2; ν(A) = ‖A‖_F;
2: while istop > 0
3: istop = n(n − 1)/2;
4: do k = 1 : 2q − 1
5: Define the parallel list of pairs Pk as in Algorithm 8.1 lines 6–35;
6: doall (i, j) ∈ Pk
7: if |a_i^T a_j| / (‖a_i‖ ‖a_j‖) ≤ τ or ‖a_i‖ + ‖a_j‖ ≤ τ ν(A) then
8: istop = istop − 1;
9: else
10: Determine the rotation R(i, j) as in (8.18) and (8.19);
11: Apply R(i, j) on the right side;
12: end if
13: end
14: end
15: end while
16: doall i = 1 : n
17: σi = ai ;
18: if σi > τ ν(A) then
19: r_i = i; v_i = (1/σ_i) a_i;
20: else
21: ri = 0 ;
22: end if
23: end
24: r = max ri ; Σ = diag(σ1 , · · · , σr ); V = [v1 , · · · , vr ];

Following the convergence test used in [17], we test convergence in line 8 by counting the number of times the quantity
$$\frac{a_i^\top a_j}{\left[(a_i^\top a_i)(a_j^\top a_j)\right]^{1/2}}, \qquad (8.23)$$

falls below a given tolerance in any given sweep with the algorithm terminating when
the counter reaches n(n − 1)/2, the total number of column pairs, after any sweep.
Upon termination, the first r columns of the matrix A are overwritten by the matrix
Q from (8.16) and hence the non-zero singular values σi can be obtained via the r
square roots of the first r diagonal entries of the updated A A. The matrix V in (8.15),
which contains the leading r , left singular vectors of the original matrix A, is readily
obtained by column scaling of the updated matrix A (now overwritten by Q = V Σ)
by the r non-zero singular values. Similarly, the matrix U , which contains the right
singular vectors of the original matrix A, is obtained as in (8.20) as the product of the
orthogonal Uk ’s. This product is accumulated in a separate two-dimensional array
by applying the rotations used in (8.21) and (8.22) to the identity matrix of order n. It

is important to note that the use of the fraction in (8.23) is preferable to using $a_i^\top a_j$,
since this inner product is necessarily small for relatively small singular values.
Although 1JAC concerns the singular value decomposition of rectangular matri-
ces, it is most effective for handling the eigenvalue problem (8.8) when obtaining all
the eigenpairs of square symmetric matrices. Thus, if 1JAC is applied to a symmet-
ric matrix A ∈ Rn×n , the columns of the resulting V ∈ Rn×r are the eigenvectors
corresponding to the nonzero eigenvalues of A. The eigenvalue corresponding to vi
(i = 1, . . . , r) is obtained by the Rayleigh quotient $\lambda_i = v_i^\top A v_i$; therefore $|\lambda_i| = \sigma_i$.
The null space of A is the orthogonal complement of the subspace spanned by the
columns of V .
Algorithm 1JAC has two advantages over 2JAC: (i) no need to access both rows
and columns, and (ii) the matrix U need not be accumulated.
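A compact serial model of the whole 1JAC process is sketched below; it uses cyclic sweeps rather than the parallel ordering of Algorithm 8.1, omits the rank handling of Algorithm 8.2, and assumes full column rank. The function name one_sided_jacobi_svd is our own, and the sketch reuses the orthogonalize_pair routine given earlier; it is meant only to fix ideas.

import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    # Serial model of 1JAC: cyclic sweeps of pairwise rotations until all
    # column pairs pass the test (8.23).  Returns (sigma, V) with
    # A ~ V @ diag(sigma) @ U^T (U is not accumulated here).
    Q = np.array(A, dtype=float, copy=True)
    n = Q.shape[1]
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                qi, qj = Q[:, i], Q[:, j]
                if abs(qi @ qj) > tol * np.linalg.norm(qi) * np.linalg.norm(qj):
                    converged = False
                    orthogonalize_pair(Q, i, j)
        if converged:
            break
    sigma = np.linalg.norm(Q, axis=0)        # singular values, nonincreasing
    V = Q / sigma                            # left singular vectors (full rank assumed)
    return sigma, V

# For a symmetric A, |lambda_i| = sigma_i and the signed eigenvalue is the
# Rayleigh quotient lambda_i = V[:, i] @ A @ V[:, i].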

8.1.3 The Householder-Jacobi Scheme

As discussed above, 1JAC is certainly a viable parallel algorithm for computing


the singular value decomposition (8.15) of a dense matrix. However, for matrices
A ∈ Rm×n in which m  n, the arithmetic complexity can be reduced if an initial
orthogonal factorization of A is performed. One can then apply the one-sided Jacobi
method, 1JAC, to the resulting upper-triangular matrix, which may be singular, to
obtain the decomposition (8.15). In this section, we present a parallel scheme, QJAC,
which can be quite effective for computing (8.15) on a variety of parallel architectures.
When m ≥ n, the block orthogonal factorization schemes of ScaLAPACK (pro-
cedure PDGEQRF) may be used for computing the orthogonal factorization

A = QR (8.24)

where Q ∈ Rm×n is a matrix with orthonormal columns, and R ∈ Rn×n is an


upper-triangular matrix. These ScaLAPACK block schemes make full use of finely-
tuned matrix-vector and matrix-matrix primitives (BLAS2 and BLAS3) to assure
high performance. The 1JAC algorithm can then be used to obtain the singular
value decomposition of the upper-triangular matrix R. The benefit that a preliminary
factorization brings is to replace the scalar product of two vectors of length m by
the same operation on vectors of length n for each performed rotation. In order to
counterbalance the additional computation corresponding to the QR factorization,
the dimension n must be much smaller than m, i.e. m  n.
If m  n, however, an alternate strategy for obtaining the orthogonal factor-
ization (8.24) is essential for realizing high performance. In this case, the hybrid
Householder—Givens orthogonal factorization scheme for tall and narrow matrices,
given in Sect. 7.6, is adopted. Hence, the singular value decomposition of the matrix
A (m  n) can be efficiently determined via Algorithm QJAC, see Algorithm 8.3.
Algorithm 8.3 QJAC: SVD of a tall matrix (m ≫ n).
Input: A = (α_ij) ∈ R^{m×n}, such that m ≥ np (p is the number of processors), τ > 0.
Output: r ≤ n: rank of A, Σ = diag(σ_1, · · · , σ_r) the nonzero singular values, and V ∈ R^{m×r} the corresponding matrix of left singular vectors of A.
1: On p processors, apply the hybrid orthogonal factorization scheme outlined in Sect. 7.6 to obtain the QR-factorization of A, where Q ∈ R^{m×n} has orthonormal columns, and R ∈ R^{n×n} is upper triangular.
2: On p processors, apply 1JAC (Algorithm 8.2) to R: get the rank r determined from the threshold τ, and compute the nonzero singular values Σ = diag(σ_1, · · · , σ_r) and the matrix of left singular vectors Ṽ of R.
3: Recover the left singular vectors of A via V = Q Ṽ.

Note that in using 1JAC for computing the SVD of R, we must iterate on a full n × n matrix which is initially upper triangular. This sacrifice in storage must be made in order to take advantage of the inherent parallelism and vectorization available in 1JAC.
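A serial sketch of the QJAC idea follows; numpy.linalg.qr merely stands in for the parallel hybrid factorization of Sect. 7.6, and the toy 1JAC driver sketched earlier is applied to the triangular factor R.

import numpy as np

def qjac_svd(A):
    # QJAC sketch for a tall matrix A (m >> n): orthogonal factorization
    # first, then one-sided Jacobi on the small triangular factor.
    Qf, R = np.linalg.qr(A)                 # step 1 (stand-in for Sect. 7.6)
    sigma, Vr = one_sided_jacobi_svd(R)     # step 2: toy 1JAC on R
    V = Qf @ Vr                             # step 3: left singular vectors of A
    return sigma, V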
An implementation of Kogbetliantz algorithm for computing the singular value
decomposition of upper-triangular matrices has been shown to be quite effective on
systolic arrays, e.g. see [21]. Kogbetliantz method for computing the SVD of a real
square matrix A ∈ Rn×n mirrors the scheme 2JAC, above, in that the matrix A is
reduced to the diagonal form by an infinite sequence of plane rotations,

Ak+1 = Vk Ak Uk , k = 1, 2, . . . , (8.25)

where A1 ≡ A, and Uk = Uk (i, j, φikj ), Vk = Vk (i, j, θikj ) are plane rotations that
affect rows and columns i, and j. It follows that Ak approaches the diagonal matrix
Σ = diag(σ1 , σ2 , . . . , σn ), where σi is the ith singular value of A, and the products
(Vk · · · V2 V1 ), (Uk · · · U2 U1 ) approach matrices whose ith column is the respective
left and right singular vector corresponding to σi . When the σi ’s are not pathologically
close, it has been shown in [22] that the row (or column) cyclic Kogbetliantz method
ultimately converges quadratically. For triangular matrices, it has been demonstrated
in [23] that Kogbetliantz algorithm converges quadratically for those matrices having
multiple or clustered singular values provided that singular values of the same cluster
occupy adjacent diagonal elements of Aν , where ν is the number of sweeps required
for convergence. Even if we were to assume that R in (8.24) satisfies this condition for
quadratic convergence of the parallel Kogbetliantz method in [22], the ordering of the
rotations and subsequent row (or column) permutations needed to maintain the upper-
triangular form is less efficient on many parallel architectures. One clear advantage
of using 1JAC for obtaining the singular value decomposition of R lies in that the
rotations given in (8.18) or (8.19), and applied via the parallel ordering illustrated in
Table 8.1, see also Algorithm 8.1, require no processor synchronization among any
set of the n/2 or (n − 1)/2 simultaneous plane rotations. The convergence rate of
1JAC, however, does not necessarily match that of the Kogbetliantz algorithm.
Let
$$S_k = R_k^\top R_k = \tilde D_k + \tilde E_k + \tilde E_k^\top, \qquad (8.26)$$

where $\tilde D_k$ is a diagonal matrix and $\tilde E_k$ is strictly upper triangular. Although


quadratic convergence cannot be always guaranteed for 1JAC, we can always pro-
duce clustered singular values on adjacent positions of the diagonal matrix D̃k for
any A. This can be realized by monitoring the magnitudes of the elements of D̃k and
Ẽ k in (8.27) for successive values of k in 1JAC. After a particular number of critical
sweeps $k_{cr}$, $S_k$ approaches a block diagonal form in which each block corresponds
to a principal (diagonal) submatrix of D̃k containing a cluster of singular values of
A (see [20, 21]),
$$S_{k_{cr}} = \mathrm{diag}(T_1, T_2, T_3, \ldots, T_{n_c}). \qquad (8.27)$$

Thus, the SVD of each $T_i$, $i = 1, 2, \ldots, n_c$, can be computed in parallel by either


a Jacobi or Kogbetliantz method. Each symmetric matrix Ti will, in general, be
dense of order qi , representing the number of singular values of A contained in
the ith cluster. Since the quadratic convergence of Kogbetliantz method for upper-
triangular matrices [23] mirrors the quadratic convergence of the two-sided Jacobi
method, 2JAC, for symmetric matrices having clustered spectra [24], we obtain a
faster global convergence for k > kcr if 2JAC, rather than 1JAC, were used to obtain
the SVD of each block Ti . Thus, a hybrid method consisting of an initial phase of
several 1JAC iterations followed by 2JAC on the resulting subproblems, combines
the optimal parallelism of 1JAC and the faster convergence rate of the 2JAC method.
Of course, optimal implementation of such a method depends on the determination
of the critical number of sweeps kcr required by 1JAC. We should note here that
such a hybrid SVD scheme is quite suitable for implementation on multiprocessors
with hierarchical memory structure and vector capability.

8.1.4 Block Jacobi Algorithms

The above algorithms are well-suited for shared memory architectures. While they
can also be implemented on distributed memory systems, their efficiency on such
systems will suffer due to communication costs. In order to increase the granularity
of the computation (i.e., to increase the number of arithmetic operations between
two successive message exchanges), block algorithms are considered.
Allocating blocks of a matrix in place of single entries or vectors and replacing
the basic Jacobi rotations by more elaborate orthogonal transformations, we obtain
what is known as block Jacobi schemes. Let us consider the treatment of the pair of
off-diagonal blocks (i, j) in such algorithms:

• Two-sided Jacobi for the symmetric eigenvalue problem: the matrix A ∈ Rn×n is
partitioned into $p \times p$ blocks $A_{ij} \in \mathbb{R}^{q\times q}$ ($1 \le i, j \le p$ and $n = pq$). For any pair $(i, j)$ with $i < j$, let $J_{ij} = \begin{pmatrix} A_{ii} & A_{ij} \\ A_{ij}^\top & A_{jj} \end{pmatrix} \in \mathbb{R}^{2q\times 2q}$. The matrix $U(i, j)$ is the orthogonal matrix that diagonalizes $J_{ij}$.
• One-sided Jacobi for the SVD: the matrix A ∈ Rm×n (m ≥ n = pq) is partitioned
into p blocks Ai ∈ Rm×q (1 ≤ i ≤ p). For any pair (i, j) with i < j, let Ji j =
(Ai , A j ) ∈ Rm×2q . The matrix U (i, j) is the orthogonal matrix that diagonalizes
$J_{ij}^\top J_{ij}$.
• Two-sided Jacobi for the SVD (Kogbetliantz algorithm): the matrix $A \in \mathbb{R}^{n\times n}$ is partitioned into $p \times p$ blocks $A_{ij} \in \mathbb{R}^{q\times q}$ ($1 \le i, j \le p$). For any pair $(i, j)$ with $i < j$, let $J_{ij} = \begin{pmatrix} A_{ii} & A_{ij} \\ A_{ji} & A_{jj} \end{pmatrix} \in \mathbb{R}^{2q\times 2q}$. The matrices $U(i, j)$ and $V(i, j)$ are the orthogonal matrices defined by the SVD of $J_{ij}$.
For one-sided algorithms, each processor is allocated a block of columns instead of
a single column. The main features of the algorithm remain the same as discussed above, with the ordering of the rotations within a sweep as given in [25].
For the two-sided version, the allocation manipulates 2-D blocks instead of single
entries of the matrix. A modification of the basic algorithm in which one annihilates,
in each step, two symmetrically positioned off-diagonal blocks by performing a full
SVD on the smaller sized off-diagonal block has been proposed in [26, 27]. While
reasonable performance can be realized on distributed memory architectures, this
block strategy increases the number of sweeps needed to achieve convergence. In
order to reduce the number of needed iterations, a dynamic ordering has been inves-
tigated in [28] in conjunction with a preprocessing step consisting of a preliminary
QR factorization with column pivoting [29]. This procedure has also been considered
for the block one-sided Jacobi algorithm for the singular value decomposition, e.g.
see [30].
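The elementary step of the one-sided block variant can be sketched as follows: for a pair of column blocks $(A_i, A_j)$, form $J = (A_i, A_j)$, diagonalize $J^\top J$, and apply the resulting orthogonal matrix on the right. The code below is our own serial illustration; it does not include the dynamic ordering or the QR preprocessing cited above.

import numpy as np

def block_orthogonalize_pair(A, cols_i, cols_j):
    # One block step of one-sided block Jacobi: mutually orthogonalize the
    # columns in the two blocks by diagonalizing the Gram matrix of (A_i, A_j).
    idx = np.concatenate([cols_i, cols_j])
    J = A[:, idx]                    # J = (A_i, A_j), of size m x 2q
    G = J.T @ J                      # Gram matrix J^T J
    _, U = np.linalg.eigh(G)         # U diagonalizes J^T J (eigenvalues ascending)
    A[:, idx] = J @ U[:, ::-1]       # apply U on the right, largest columns first

# Example with blocks of q columns each:
#   q = 2
#   block_orthogonalize_pair(A, np.arange(0, q), np.arange(q, 2 * q))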

8.1.5 Efficiency of Parallel Jacobi Methods

Considering that: (a) scalable parallel schemes for tridiagonalization of a symmetric


matrix A are available together with efficient parallel schemes for extracting the
eigenpairs of a tridiagonal matrix, and (b) Jacobi schemes deal with the whole matrix
until all the eigenvalues and vectors are extracted and hence resulting in a much
higher cost of memory references and interprocessor communications; two-sided
parallel Jacobi schemes are not competitive in general compared to eigensolvers
that depend first on the tridiagonalization process, unless: (i) one requires all the
eigenpairs with high accuracy, e.g. see [7] and Sect. 8.1.2 for handling the singular-
value decomposition via Jacobi schemes, or (ii) Jacobi is implemented on a shared
memory parallel architecture (multi- or many-core) in which one memory reference

is almost as fast as an arithmetic operation for obtaining all the eigenpairs of a modest
size matrix.
In addition to the higher cost of communications in Jacobi schemes, in general,
they also incur almost the same order of overall arithmetic operations. To illustrate
this, let us assume that n is even. Diagonalizing a matrix A ∈ Rn×n on an architecture
of p = n/2 processors is obtained by a sequence of sweeps in which each sweep
requires (n − 1) parallel applications of p rotations. Therefore, each sweep costs
O(n 2 ) arithmetic operations. If, in addition, we assume that the number of sweeps
needed to obtain accurate eigenvalues to be O(log n), then the overall number of
arithmetic operations is O(n 2 log n). This estimation can even become as low as
O(n log n) by using O(n 2 ) processors. As seen in Sect. 8.1, one sweep of rotations
involves 6n 3 + O(n 2 ) arithmetic operations when taking advantage of symmetry,
and including accumulation of the rotations. This estimate must be compared to the
8n 3 + O(n 2 ) arithmetic operations needed to reduce the matrix A to the tridiagonal
form and to build the orthogonal matrix Q which realizes such tridiagonalization
(see next section).
A favorable situation for Jacobi schemes arises when one needs to investigate the
evolution of the eigenvalues of a sequence of slowly varying symmetric matrices
$(A_k)_{k\ge 0} \subset \mathbb{R}^{n\times n}$. Let $A_k$ have the spectral decomposition $A_k = U_k D_k U_k^\top$, where $U_k$ is an orthogonal matrix (full set of eigenvectors), and $D_k$ a diagonal matrix containing the eigenvalues. Hence, if the quantity $\|A_{k+1} - A_k\| / \|A_k\|$ is small, the matrix

$$B_{k+1} = U_k^\top A_{k+1} U_k, \qquad (8.28)$$

is expected to be close to a diagonal matrix. It is in such a situation that Jacobi


schemes yield, after very few sweeps, accurate eigenpairs of Bk+1 ,

$$V_{k+1}^\top B_{k+1} V_{k+1} = D_{k+1}, \qquad (8.29)$$
and therefore yielding the spectral decomposition $A_{k+1} = U_{k+1} D_{k+1} U_{k+1}^\top$, with $U_{k+1} = U_k V_{k+1}$. When one sweep only is sufficient to get convergence, Jacobi
schemes are competitive.
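A sketch of this tracking strategy is given below. For brevity, numpy.linalg.eigh stands in for the few Jacobi sweeps that would be applied to the nearly diagonal $B_{k+1}$; the point of the sketch is the warm start (8.28) and the accumulation $U_{k+1} = U_k V_{k+1}$. The function name is ours.

import numpy as np

def track_spectrum(A_seq):
    # Follow the eigendecomposition of a slowly varying sequence (A_k).
    # B_{k+1} = U_k^T A_{k+1} U_k is nearly diagonal, so very few sweeps
    # (replaced here by numpy.linalg.eigh) are needed at each step.
    D, U = np.linalg.eigh(A_seq[0])     # initial full decomposition
    history = [(D, U)]
    for A_next in A_seq[1:]:
        B = U.T @ A_next @ U            # (8.28): close to diagonal
        D, V = np.linalg.eigh(B)        # (8.29): cheap when B is nearly diagonal
        U = U @ V                       # U_{k+1} = U_k V_{k+1}
        history.append((D, U))
    return history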

8.2 Tridiagonalization-Based Schemes

Obtaining all the eigenpairs of a symmetric matrix A can be achieved by the following
two steps: (i) obtaining a symmetric tridiagonal matrix T which is orthogonally
similar to A: $T = U^\top A U$ where U is an orthogonal matrix; and (ii) obtaining the spectral factorization of the resulting tridiagonal matrix T, i.e. $D = V^\top T V$, and
computing the eigenvectors of A by the back transformation Q = UV.

8.2.1 Tridiagonalization of a Symmetric Matrix

Let A ∈ Rn×n be a symmetric matrix. The tridiagonalization of A will be obtained by


applying successively (n − 2) similarity transformations via Householder reflections.
The first elementary reflector H1 is chosen such that the bottom (n − 2) elements
of the first column of H1 A are annihilated:
$$H_1 A = \begin{pmatrix} \times & \times & \times & \times & \times & \times \\ \times & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \end{pmatrix}.$$

Therefore, by symmetry, applying H1 on the right results in the following pattern:


$$A_1 = H_1 A H_1 = \begin{pmatrix} \times & \times & 0 & 0 & 0 & 0 \\ \times & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \\ 0 & \times & \times & \times & \times & \times \end{pmatrix} \qquad (8.30)$$

In order to take advantage of symmetry, the reduction

$$A := H_1 A H_1 = (I - \alpha u u^\top) A (I - \alpha u u^\top)$$

can be implemented by a symmetric rank-2 update which is a BLAS2 routine [31]:

v := α A u;
w := v − ½ α (uᵀv) u;
A := A − (u wᵀ + w uᵀ).

By exploiting symmetry, the rank-2 update (that appears in the last step) results in
computational savings [32].
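An unblocked transcription of this rank-2 update into NumPy is sketched below, for illustration only; the blocked BLAS3 variant used by LAPACK's DSYTRD, mentioned next, is organized differently.

import numpy as np

def tridiagonalize(A):
    # Reduce a symmetric matrix to tridiagonal form by Householder similarity
    # transformations, using the symmetric rank-2 update described above.
    A = np.array(A, dtype=float, copy=True)
    n = A.shape[0]
    for k in range(n - 2):
        x = A[k + 1:, k].copy()
        sigma = np.linalg.norm(x)
        if sigma == 0.0:
            continue                              # column already in desired form
        u = x.copy()
        u[0] += sigma if x[0] >= 0 else -sigma    # Householder vector
        alpha = 2.0 / (u @ u)                     # H = I - alpha u u^T
        B = A[k + 1:, k + 1:]
        v = alpha * (B @ u)                       # v := alpha A u
        w = v - 0.5 * alpha * (u @ v) * u         # w := v - (alpha/2)(u'v) u
        A[k + 1:, k + 1:] = B - np.outer(u, w) - np.outer(w, u)   # rank-2 update
        hx = x - alpha * (u @ x) * u              # H x = (beta, 0, ..., 0)^T
        A[k + 1:, k] = hx
        A[k, k + 1:] = hx
    return A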
The tridiagonal matrix T is obtained by repeating the process successively on
columns 2 to (n − 2). The total procedure involves 4n 3 /3 + O(n 2 ) arithmetic oper-
ations. Assembling the matrix U = H1 · · · Hn−2 requires 4n 3 /3 + O(n 2 ) additional
operations. The benefit of the BLAS2 variant over that of the original BLAS1 based
scheme is illustrated in [33].
As indicated in Chap. 7, successive applications of Householder reductions can
be done in blocks (block Householder transformations) which allows the use of
BLAS3 [34]. Such reduction is implemented in routine DSYTRD of LAPACK [35]
which also takes advantage of the symmetry of the process. A parallel version of the
algorithm is implemented in routine PDSYTRD of ScaLAPACK [36].

8.2.2 The QR Algorithm: A Divide-and-Conquer Approach

The Basic QR Algorithm


The QR iteration for computing all the eigenvalues of the symmetric matrix A is
given by:

$$\left\{\begin{array}{l} A_0 = A \ \text{and}\ Q_0 = I; \\ \text{for } k \ge 0, \\ \quad (Q_{k+1}, R_{k+1}) = \text{QR-factorization of } A_k, \\ \quad A_{k+1} = R_{k+1} Q_{k+1}, \end{array}\right. \qquad (8.31)$$

where the QR factorization is defined in Chap. 7. Under weak assumptions, Ak


approaches a diagonal matrix for sufficiently large k. A direct implementation of
(8.31) would involve O(n 3 ) arithmetic operations at each iteration. If the matrix A is
tridiagonal, it is easy to see that so are the matrices Ak . Therefore, by first tridiago-
nalizing A, i.e. computing $T = Q^\top A Q$, at a cost of $O(n^3)$ arithmetic operations, the number
of arithmetic operations of each QR iteration in (8.31) is reduced to only O(n 2 ), or
even O(n) if only the eigenvalues are needed. To accelerate convergence, a shifted
version of the QR iteration is given by

$$\left\{\begin{array}{l} T_0 = T \ \text{and}\ Q_0 = I; \\ \text{for } k \ge 0, \\ \quad (Q_{k+1}, R_{k+1}) = \text{QR-factorization of } (T_k - \mu_k I), \\ \quad T_{k+1} = R_{k+1} Q_{k+1} + \mu_k I, \end{array}\right. \qquad (8.32)$$

where μk is usually chosen as the eigenvalue of the last 2 × 2 diagonal block of


Tk that is the closest to the nth diagonal entry of Tk (Wilkinson shift). The effective
computation of the QR-step can be either explicit as in (8.32) or implicit (e.g. see [3]).
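A minimal dense sketch of (8.32) with the Wilkinson shift is given below; it computes eigenvalues only, deflates one eigenvalue at a time, and uses numpy.linalg.qr in place of the O(n) implicit tridiagonal QR step. The function name and stopping constants are ours.

import numpy as np

def qr_eigvals_tridiag(T, tol=1e-12, max_iter=10000):
    # Shifted QR iteration (8.32) on a symmetric tridiagonal matrix, with
    # Wilkinson shifts and deflation of the last diagonal entry.
    T = np.array(T, dtype=float, copy=True)
    eigs = []
    while T.shape[0] > 1:
        for _ in range(max_iter):
            n = T.shape[0]
            if abs(T[n - 1, n - 2]) <= tol * (abs(T[n - 2, n - 2]) + abs(T[n - 1, n - 1])):
                break
            # Wilkinson shift: eigenvalue of the trailing 2x2 block closest to T[n-1,n-1]
            a, b, c = T[n - 2, n - 2], T[n - 2, n - 1], T[n - 1, n - 1]
            d = (a - c) / 2.0
            mu = c - np.sign(d if d != 0 else 1.0) * b * b / (abs(d) + np.hypot(d, b))
            Q, R = np.linalg.qr(T - mu * np.eye(n))
            T = R @ Q + mu * np.eye(n)
        eigs.append(T[-1, -1])
        T = T[:-1, :-1]                 # deflate the converged eigenvalue
    eigs.append(T[0, 0])
    return np.array(eigs)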
Parallel implementation of the above iteration is modestly scalable only if the
eigenvectors need to be computed as well. In the following sections, we present
two techniques for achieving higher degree of parallel scalability for obtaining the
eigenpairs of a tridiagonal matrix.
The Divide-and-Conquer Scheme
Since a straightforward implementation of the basic algorithm provides very limited
degree of parallelism, a divide-and-conquer technique is introduced. The resulting
parallel algorithm proved to yield such superior performance even on uniprocessors that it has been included in the sequential version of the library LAPACK as routine
xSTEDC. This method was first introduced in [37], and then modified for implemen-
tation on parallel architectures in [38].
The divide-and-conquer idea consists of “tearing” the tridiagonal matrix T in half
with a rank-one perturbation. Let us denote the three nonzero entries of the kth row
of the tridiagonal symmetric matrix T by βk , αk , βk+1 and let em , 1 < m < n, be
the mth column of the identity matrix In .

Thus, T can be expressed as the following block-diagonal matrix with a rank-one


correction:
$$T = \begin{pmatrix} T_1 & O \\ O & T_2 \end{pmatrix} + \rho\, v v^\top,$$
for some $v = e_m + \theta e_{m+1}$, with
$$\rho = \beta_m/\theta, \quad T_1 = T_{1:m,1:m} - \rho\, e_m e_m^\top, \quad \text{and} \quad T_2 = T_{m+1:n,m+1:n} - \rho\theta^2 e_{m+1} e_{m+1}^\top,$$

where the parameter θ is an arbitrary nonzero real number. A strategy for determining
a safe partitioning is given in [38].
Assuming that the full spectral decompositions of T1 and T2 are already available:
$Q_i^\top T_i Q_i = \Lambda_i$ where $\Lambda_i$ is a diagonal matrix and $Q_i$ an orthogonal matrix, for $i = 1, 2$. Let
$$\tilde Q = \begin{pmatrix} Q_1 & O \\ O & Q_2 \end{pmatrix} \quad \text{and} \quad \tilde\Lambda = \begin{pmatrix} \Lambda_1 & O \\ O & \Lambda_2 \end{pmatrix}.$$

Therefore $J = \tilde Q^\top T \tilde Q = \tilde\Lambda + \rho z z^\top$ where $z = \tilde Q^\top v$, and computing the eigen-


values of T is reduced to obtaining the eigenvalues of J .
Theorem 8.3 ([37]) Let Λ̃ = diag(λ̃1 , . . . , λ̃n ) be a diagonal matrix with distinct
entries, and let $z = (\zeta_1, \ldots, \zeta_n)^\top$ be a vector with no null entries.
Then the eigenvalues of the matrix $J = \tilde\Lambda + \rho z z^\top$ are the zeros $\lambda_i$ of the secular equation:
$$1 + \rho \sum_{k=1}^{n} \frac{\zeta_k^2}{\tilde\lambda_k - \lambda} = 0, \qquad (8.33)$$

in which λi and λ̃i are interleaved.


Thus, t = (Λ̃ − λk I )−1 z is an eigenvector of J corresponding to its simple
eigenvalue λk .
Proof The proof is based on the following identity:

$$\det(\tilde\Lambda + \rho z z^\top - \lambda I) = \det(\tilde\Lambda - \lambda I)\,\det(I + \rho(\tilde\Lambda - \lambda I)^{-1} z z^\top).$$

Since λ is not a diagonal entry of Λ̃, the eigenvalues of J are the zeros of the function
$f(\lambda) = \det(I + \rho(\tilde\Lambda - \lambda I)^{-1} z z^\top)$, or the roots of
$$1 + \rho\, z^\top (\tilde\Lambda - \lambda I)^{-1} z = 0, \qquad (8.34)$$

which is identical to (8.33). If u is an eigenvector of J corresponding to the eigenvalue


λ, then

Fig. 8.1 Levels and associated tasks of the divide-and-conquer method

$$u = -\rho\,(z^\top u)(\tilde\Lambda - \lambda I)^{-1} z,$$

which concludes the proof. For further details see [3].


Parallel computation of the zeros λi of the secular equation (8.33) can be easily
achieved since every zero is isolated in an interval defined by the set (λ̃i )i=1,n .
Bisection or the secant method can be used for simultaneously obtaining these zeros
with acceptable accuracy. In [38], a quadratically convergent rational approximation-
based scheme is provided.
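Since each root of (8.33) is bracketed by two consecutive poles $\tilde\lambda_i$ (with the last root bounded above by $\tilde\lambda_n + \rho\|z\|^2$ when $\rho > 0$), the roots can indeed be computed independently. The sketch below uses plain bisection rather than the rational approximation of [38]; it assumes $\rho > 0$, distinct $\tilde\lambda_i$ sorted increasingly, and nonzero $\zeta_i$, and the function name is ours.

import numpy as np

def secular_roots(d, z, rho, tol=1e-14, max_steps=200):
    # Solve 1 + rho * sum(z_k^2 / (d_k - lam)) = 0 by bisection in each
    # interval between consecutive poles (deflated case of Theorem 8.3).
    d, z = np.asarray(d, float), np.asarray(z, float)
    f = lambda lam: 1.0 + rho * np.sum(z**2 / (d - lam))
    n = len(d)
    uppers = np.append(d[1:], d[-1] + rho * (z @ z))   # bracket for each root
    roots = np.empty(n)
    for i in range(n):                     # each root is isolated, so this loop
        lo, hi = d[i], uppers[i]           # can run in parallel across i
        lo += tol * max(1.0, abs(lo))      # step slightly off the poles
        hi -= tol * max(1.0, abs(hi))
        for _ in range(max_steps):         # f is increasing on (d_i, d_{i+1})
            mid = 0.5 * (lo + hi)
            if f(mid) > 0.0:
                hi = mid
            else:
                lo = mid
        roots[i] = 0.5 * (lo + hi)
    return roots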
Thus, this eigensolver is organized into a tree of tasks as shown in Fig. 8.1. Using
p = 2r processors (r = 2 in Fig. 8.1), the computation is recursively organized into
an (r + 1)-level tree. Three types of tasks are considered:
1. Tearing (going down in the tree) involves O(r ) arithmetic operations.
2. Diagonalizing in parallel the p smaller tridiagonal matrices at the lowest level of the tree; each of these tasks involves $O\!\left((n/p)^3\right)$ arithmetic operations.
3. Updating the spectrum (going up the tree): at level i, there are $q = 2^i$ tasks, consuming $O\!\left((n/q)^3\right)$ arithmetic operations.

For $i = 0, \ldots, r$, a task of the third type is executed on $2^{r-i}$ processors at level i.


The algorithm computes all the eigenpairs of T . Eigenproblems corresponding
to the leaves can be handled using any sequential algorithm such as the QR (or QL)
algorithm. Computing the full spectrum with a Divide-and-Conquer approach con-
sumes O(n 3 ) arithmetic operations irrespective of the number of processors used. A
more precise arithmetic operation count reveals that the divide and conquer approach
results in some savings in the number of arithmetic operations compared to straight-
forward application of the QL algorithm. The divide and conquer algorithm is imple-
mented in LAPACK [35] through the routine xSTEDC.

8.2.3 Sturm Sequences: A Multisectioning Approach

Following the early work in [39], or the TREPS procedure in [40], one can use
the Sturm sequence properties to enable the computation of all the eigenvalues of a

tridiagonal matrix T , or only those eigenvalues in a given interval of the spectrum.


Once the eigenvalues are obtained, the corresponding eigenvectors can be retrieved
via inverse iteration. Let T be a symmetric tridiagonal matrix of order n,

T = [βi , αi , βi+1 ],

and let pn (λ) be its characteristic polynomial:

pn (λ) = det(T − λI ).

Then the sequence of the principal minors of the matrix can be built using the fol-
lowing recursion:

p0 (λ) = 1,
p1 (λ) = α1 − λ, (8.35)
$p_i(\lambda) = (\alpha_i - \lambda)\, p_{i-1}(\lambda) - \beta_i^2\, p_{i-2}(\lambda)$, $i = 2, \ldots, n$.

Assuming that no subdiagonal element of T is zero (since if some βi is equal to 0,


the problem can be partitioned into two smaller eigenvalue problems), the sequence
{ pi (λ)} is called the Sturm sequence of T in λ. It is well known (e.g. see [41]) that the
number of eigenvalues smaller than a given λ is equal to the number of sign variations
in the Sturm sequence (8.35). Hence one can find the number of eigenvalues lying
in a given interval [δ, γ ] by computing the Sturm sequence in δ and γ . The linear
recurrence (8.35), however, suffers from the possibility of over- or underflow. This
is remedied by replacing the Sturm sequence pi (λ) by the sequence

$$q_i(\lambda) = \frac{p_i(\lambda)}{p_{i-1}(\lambda)}, \quad i = 1, \ldots, n.$$

The second order linear recurrence (8.35) is then replaced by the nonlinear recurrence

$$q_1(\lambda) = \alpha_1 - \lambda, \qquad q_i(\lambda) = \alpha_i - \lambda - \frac{\beta_i^2}{q_{i-1}(\lambda)}, \quad i = 2, \ldots, n. \qquad (8.36)$$

Here, the number of eigenvalues that are smaller than λ is equal to the number of
negative terms in the sequence (qi (λ))i=1,...,n . It can be easily proved that qi (λ) is
the ith diagonal element of D in the factorization $LDL^\top$ of $T - \lambda I$. Therefore, for any $i = 1, \ldots, n$,
$$p_i(\lambda) = \prod_{j=1}^{i} q_j(\lambda).$$
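The eigenvalue count based on (8.36) takes only a few lines of code; the sketch below counts the negative $q_i$, with a crude guard against a zero pivot (a robust code would instead perturb λ slightly). The function name and the convention that beta holds the n − 1 off-diagonal entries are ours.

def sturm_count(alpha, beta, lam):
    # Number of eigenvalues of the symmetric tridiagonal T (diagonal alpha,
    # off-diagonal beta, len(beta) == len(alpha) - 1) that are smaller than lam,
    # obtained by counting the negative q_i in the recurrence (8.36).
    count = 0
    q = alpha[0] - lam
    count += q < 0
    for i in range(1, len(alpha)):
        if q == 0.0:                      # simple guard; robust codes perturb lam
            q = 1e-300
        q = alpha[i] - lam - beta[i - 1] ** 2 / q
        count += q < 0
    return count

# Number of eigenvalues of T in [delta, gamma]:
#   sturm_count(alpha, beta, gamma) - sturm_count(alpha, beta, delta)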

Given an initial interval, we can find all the eigenvalues lying in it by repeated
bisection or multisection of the interval. This partitioning process can be performed
until we obtain each eigenvalue to a given accuracy. On the other hand, we can stop

the process once we have isolated each eigenvalue. In the latter case, the eigenvalues
may be extracted using a faster method. Several methods are available for extracting
an isolated eigenvalue:
• Bisection (linear convergence);
• Newton’s method (quadratic convergence);
• The ZEROIN scheme [42] which is based on a combination of the secant and bisection methods (convergence of order $(\sqrt{5}+1)/2$).
Ostrowski [43] defines an efficiency index which links the amount of computation
to be done at each step and the order of the convergence. The respective indices of the
three methods are 1, 1.414 and 1.618. This index, however, is not the only aspect to
be considered here. Both ZEROIN and Newton methods require the use of the linear
recurrence (8.35) in order to obtain the value of det(T − λI ), and its first derivative,
for a given λ. Hence, if the possibility of over- or underflow is small, selecting the
ZEROIN method is recommended, otherwise selecting the bisection method is the
safer option. After the computation of an eigenvalue, the corresponding eigenvector
can be found by inverse iteration [3], which is normally a very fast process that often
requires no more than one iteration to achieve convergence to a low relative residual.
When some eigenvalues are computationally coincident, the isolation process
actually performs “isolation of clusters”, where a cluster is defined as a single eigen-
value or a number of computationally coincident eigenvalues. If such a cluster of
coincident eigenvalues is isolated, the extraction stage is skipped since convergence
has been achieved.
The whole computation consists of the following five steps:
1. Isolation by partitioning;
2. Extraction of a cluster by bisection or by the ZEROIN method;
3. Computation of the eigenvectors of the cluster by inverse iteration;
4. Grouping of close eigenvalues;
5. Orthogonalization of the corresponding groups of vectors by the modified Gram-
Schmidt process.
The method TREPS (standing for Tridiagonal Eigenvalue Parallel Solver by Sturm
sequences) is listed as Algorithm 8.4 and follows this strategy.
The Partitioning Process
Parallelism in this process is obviously achieved by performing simultaneously the
computation of several Sturm sequences. However, there are several ways for achiev-
ing this. Two options are:
• Performing bisection on several intervals, or
• Partitioning of one interval into several subintervals.
A multisection of order k splits the interval [δ, γ ] into k+1 subintervals [μi , μi+1 ],
where μi = δ + i((γ − δ)/(k + 1)) for i = 0, . . . , k + 1. If interval I contains only
one eigenvalue, approximating it with an absolute error ε, will require

Algorithm 8.4 TREPS: tridiagonal eigenvalue parallel solver by Sturm sequences


Input: T ∈ Rn×n symmetric tridiagonal; δ < γ (assumption: τ < γ − δ); 0 < τ ; k (multisection
degree).
Output: Λ(T ) ∩ [δ, γ ] = {λ1 , · · · , λ p } and the corresponding eigenvectors u 1 , · · · , u p .
1: Compute n(δ) and n(γ ); p = n(γ ) − n(δ); //n(α) is the number of eigenvalues of T smaller
than α which is computed by a Sturm sequence in α.
2: M = {([δ, γ ], n(δ), n(γ ))}; S = ∅; //Sets of intervals including either several eigenvalues
or one (simple or multiple) eigenvalue respectively.
3: while M ≠ ∅,
4: Select ([α, β], n(α), n(β)) ∈ M ; μ0 = α;
5: doall i = 1 : k,
6: μi = α + i(β − α)/(k + 1); Compute n(μi );
7: end
8: do i = 0 : k,
9: ti = n(μi+1 ) − n(μi );
10: if ti > 1 and |μi+1 − μi | > τ > 0, then
11: Store ([μi , μi+1 ], n(μi ), n(μi+1 )) in M ;
12: else
13: if n(μi+1 ) − n(μi ) > 0, then
14: Store ([μi , μi+1 ], n(μi ), n(μi+1 )) in S ;
15: end if
16: end if
17: end
18: end while
19: doall ([μ, ν], n(μ), n(ν)) ∈ S ,
20: Extract λ from [μ, ν];
21: if λ is simple, then
22: Compute the corresponding eigenvector u by inverse iteration ;
23: else
24: Compute a basis V = [vi1 , · · · , vi ] of the invariant subspace by inverse iteration and
combined orthogonalization ;
25: end if
26: end
27: doall Grouping of close eigenvalues,
28: Reorthogonalize the corresponding eigenvectors;
29: end

 
$$n_k = \left\lceil \log_2\!\left(\frac{\gamma - \delta}{\varepsilon}\right) \Big/ \log_2(k+1) \right\rceil$$

multisections of order k. Thus, the efficiency of the multisectioning of order k com-


pared to bisection (multisectioning of order 1) is

E f = n 1 /(k n k ) = (log2 (k + 1))/k.

Hence, for the extraction of eigenvalues, parallel bisections are preferable to one mul-
tisectioning of higher order. On the other hand, during the isolation step, the effi-
ciency of multisectioning is higher because: (i) multisectioning creates more tasks
than bisection, and (ii) often, one interval contains much more than one eigenvalue.
A reasonable strategy for choosing bisection or multisectioning is outlined in [44].

Computation of the Sturm Sequence


Parallelism in evaluating the recurrence (8.36) is very limited. Here, the computation
of the regular Sturm sequence (8.35) is equivalent to solving a banded lower triangular system $Lx = c$ of order $n + 1$:
$$L = \begin{pmatrix} 1 & & & & \\ (\lambda-\alpha_1) & 1 & & & \\ \beta_2^2 & (\lambda-\alpha_2) & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & \beta_n^2 & (\lambda-\alpha_n) & 1 \end{pmatrix}, \quad \text{and} \quad c = \begin{pmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.$$

The algorithm of Sect. 3.2.2 may be used, however, to capitalize on the vectorization
possible in evaluating the linear recurrence (8.35). For a vector length k < n/2 the
total number of arithmetic operations in the parallel algorithm is roughly 10n +
11k, compared to only 4n for the uniprocessor algorithm (we do not consider the
operations needed to compute βi2 since these quantities can be provided by the user),
resulting in an arithmetic redundancy which varies between 2.5 and 4. This algorithm,
therefore, is efficient only when vector operations are at least 4 times faster than their
sequential counterparts.
Computation of the Eigenvectors and Orthonormalization
The computation of an eigenvector can be started as soon as the corresponding eigen-
value is computed. It is obtained by inverse iteration (see Algorithm 8.5). When the

Algorithm 8.5 Computation of an eigenvector by inverse iteration.


Input: T ∈ Rn×n , λ ∈ R eigenvalue of T , x0 ∈ Rn (x0 = 0), τ > 0.
Output: x eigenvector of T corresponding to λ.
1: x = x₀/‖x₀‖; y = (T − λI)⁻¹x; α = ‖y‖;
2: while ατ < 1,
3:   x = y/α;
4:   y = (T − λI)⁻¹x; α = ‖y‖;
5: end while

eigenvalue λ is multiple of order r , a basis of the corresponding invariant subspace


can be obtained by a simultaneous iteration procedure (i.e. the iterate x in Algo-
rithm 8.5 is replaced by a block X ∈ Rn×r in which the columns are maintained
orthogonal by means of a Gram-Schmidt procedure).
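A sketch of Algorithm 8.5 for a tridiagonal T is given below, with scipy.linalg.solve_banded standing in for the factorization of T − λI; a production code would factor T − λI once, reuse it in every step, and guard against an exactly singular shift. The function name is ours.

import numpy as np
from scipy.linalg import solve_banded

def inverse_iteration_tridiag(alpha, beta, lam, tol=1e-10, max_it=50):
    # Eigenvector of the symmetric tridiagonal T (diagonal alpha, off-diagonal
    # beta) for the computed eigenvalue lam, by inverse iteration (Algorithm 8.5).
    n = len(alpha)
    ab = np.zeros((3, n))                    # banded storage of T - lam*I
    ab[0, 1:] = beta                         # superdiagonal
    ab[1, :] = np.asarray(alpha) - lam       # main diagonal
    ab[2, :-1] = beta                        # subdiagonal
    x = np.random.default_rng(0).standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(max_it):
        y = solve_banded((1, 1), ab, x)      # y = (T - lam I)^{-1} x
        a = np.linalg.norm(y)
        x = y / a
        if a * tol >= 1.0:                   # large ||y||  <=>  small residual
            break
    return x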
So, we consider the extraction of an eigenvalue and the computation of its eigen-
vector as two parts of the same task, with the degree of potential parallelism being
dependent on the number of desired eigenvalues.
The computed eigenvectors of distinct eigenvalues are necessarily orthogonal.
However, for close eigenvalues, this property is poorly satisfied and an orthogonaliza-
tion step must be performed. Orthonormalization is only performed on eigenvectors

whose corresponding eigenvalues meet a predefined grouping criterion (e.g., in the procedure TINVIT of the library EISPACK [45], two eigenvalues $\lambda_i$ and $\lambda_{i+1}$ are in the same group when $|\lambda_i - \lambda_{i+1}| \le 10^{-3}\|T\|_R$, where $\|T\|_R = \max_{i=1:n}(|\alpha_i| + |\beta_i|)$).
The modified Gram-Schmidt method is used to orthonormalize each group.
If we consider processing several groups of vectors in parallel, then we have two
levels of parallelism: (i) concurrent orthogonalization of the groups; and (ii) parallel
orthogonalization of a group (as discussed in Sect. 7.4). Selection of the appropriate
algorithm will depend on: (a) the levels of parallelism in the architecture, and (b)
whether we have many groups each containing a small number of eigenvalues, or
only a few groups each containing many more eigenvalues.
Reliability of the Sturm Sequences
The reliability of Sturm sequence computation in floating point arithmetic where the
sequence is no longer monotonic has been considered in [46]. It states that in very
rare situations, it is possible to obtain incorrect eigenvalues with regular bisection.
A robust algorithm, called MRRR in [47], is implemented in LAPACK via routine
DSTEMR for the computation of high quality eigenvectors. In addition, several
approaches are considered for its parallel implementation.

8.3 Bidiagonalization via Householder Reduction

The classical algorithm for reducing a rectangular matrix A to an upper bidiagonal matrix B, via the orthogonal transformation (8.7) (e.g. see [3]),
requires 4(mn 2 − n 3 /3) + O(mn) arithmetic operations. V and U can be assembled
in 4(mn 2 − n 3 /3) + O(mn) and 4n 3 /3 + O(n 2 ) additional operations, respectively.
The corresponding routine in LAPACK is based on BLAS3 procedures and a parallel
version of this algorithm is implemented in routine PDGEBRD of ScaLAPACK.
A bidiagonalization scheme that is more suited for parallel processing has been
proposed in [48]. This one-sided reduction consists of determining, as a first step, the
orthogonal matrix W which appears in the tridiagonalization (8.5). The matrix W
can be obtained by using Householder or Givens transformations, without forming
A A explicitly. The second step performs a QR factorization of F = AW. From
(8.5) and (8.6), we can see that R is upper bidiagonal. It is therefore possible to set
V = Q and B = R. This allows a simplified QR factorization. Two adaptations of
the approach in [48] have been considered in [49]: one for obtaining a numerically
more robust algorithm, and a second parallel block version which allows the use of
BLAS3.
Once the bidiagonalization is performed, all that remains is to compute the singular
value decomposition of B. When only the singular values are sought, they can be computed via the uniprocessor scheme in [3], in which the number of arithmetic operations per iteration is O(n); see routine PDGESVD in ScaLAPACK.
References

References

1. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
2. Stewart, G.W., Sun, J.: Matrix Perturbation Theory. Academic Press, Boston (1990)
3. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
4. Parlett, B.: The Symmetric Eigenvalue Problem. SIAM (1998)
5. Hestenes, M.: Inversion of matrices by biorthogonalization and related results. J. Soc. Ind.
Appl. Math. 6(1), 51–90 (1958). doi:10.1137/0106005, http://epubs.siam.org/doi/abs/10.1137/
0106005
6. Forsythe, G.E., Henrici, P.: The cyclic Jacobi method for computing the principal values of a
complex matrix (January 1960)
7. Demmel, J., Veselic, K., Physik, L.M., Hagen, F.: Jacobi’s method is more accurate than QR.
SIAM J. Matrix Anal. Appl 13, 1204–1245 (1992)
8. Schönhage, A.: Zur konvergenz des Jacobi-verfahrens. Numer. Math. 3, 374–380 (1961)
9. Wilkinson, J.H.: Note on the quadratic convergence of the cyclic Jacobi process. Numer. Math.
4, 296–300 (1962)
10. Sameh, A.: On Jacobi and Jacobi-like algorithms for a parallel computer. Math. Comput. 25,
579–590 (1971)
11. Luk, F., Park, H.: A proof of convergence for two parallel Jacobi SVD algorithms. IEEE Trans.
Comp. (to appear)
12. Luk, F., Park, H.: On the equivalence and convergence of parallel Jacobi SVD algorithms. IEEE
Trans. Comp. 38(6), 806–811 (1989)
13. Henrici, P.: On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing
eigenvalues of Hermitian matrices. Soc. Ind. Appl. Math. 6, 144–162 (1958)
14. Brent, R., Luk, F.: The solution of singular-value and symmetric eigenvalue problems on
multiprocessor arrays. SIAM J. Sci. Stat. Comput. 6(1), 69–84 (1985)
15. Sameh, A.: Solving the linear least-squares problem on a linear array of processors. In: L.
Snyder, D. Gannon, L.H. Jamieson, H.J. Siegel (eds.) Algorithmically Specialized Parallel
Computers, pp. 191–200. Academic Press (1985)
16. Kaiser, H.: The JK method: a procedure for finding the eigenvectors and eigenvalues of a real
symmetric matrix. Computers 15, 271–273 (1972)
17. Nash, J.: A one-sided transformation method for the singular value decomposition and algebraic
eigenproblem. Computers 18(1), 74–76 (1975)
18. Luk, F.: Computing the singular value decomposition on the Illiac IV. ACM Trans. Math. Sftw.
6(4), 524–539 (1980)
19. Brent, R., Luk, F., van Loan, C.: Computation of the singular value decomposition using mesh
connected processors. VLSI Comput. Syst. 1(3), 242–270 (1985)
20. Berry, M., Sameh, A.: An overview of parallel algorithms for the singular value and symmetric
eigenvalue problems. J. Comp. Appl. Math. 27, 191–213 (1989)
21. Charlier, J., Vanbegin, M., van Dooren, P.: On efficient implementations of Kogbetlianz’s
algorithm for computing the singular value decomposition. Numer. Math. 52, 279–300 (1988)
22. Paige, C., van Dooren, P.: On the quadratic convergence of Kogbetliantz’s algorithm for com-
puting the singular value decomposition. Numer. Linear Algebra Appl. 77, 301–313 (1986)
23. Charlier, J., van Dooren, P.: On Kogbetliantz’s SVD algorithm in the presence of clusters.
Numer. Linear Algebra Appl. 95, 135–160 (1987)
24. Wilkinson, J.: Almost diagonal matrices with multiple or close eigenvalues. Numer. Linear
Algebra Appl. 1, 1–12 (1968)
25. Luk, F.T., Park, H.: On parallel Jacobi orderings. SIAM J. Sci. Stat. Comput. 10(1), 18–26
(1989)
26. Bečka, M., Vajteršic, M.: Block-Jacobi SVD algorithms for distributed memory systems I:
hypercubes and rings. Parallel Algorithms Appl. 13, 265–287 (1999)
27. Bečka, M., Vajteršic, M.: Block-Jacobi SVD algorithms for distributed memory systems II:
meshes. Parallel Algorithms Appl. 14, 37–56 (1999)

28. Bečka, M., Okša, G., Vajteršic, M.: Dynamic ordering for a parallel block-Jacobi SVD algo-
rithm. Parallel Comput. 28(2), 243–262 (2002). doi:10.1016/S0167-8191(01)00138-7, http://
dx.doi.org/10.1016/S0167-8191(01)00138-7
29. Okša, G., Vajteršic, M.: Efficient pre-processing in the parallel block-Jacobi SVD algo-
rithm. Parallel Comput. 32(2), 166–176 (2006). doi:10.1016/j.parco.2005.06.006, http://www.
sciencedirect.com/science/article/pii/S0167819105001341
30. Bečka, M., Okša, G., Vajteršic, M.: Parallel Block-Jacobi SVD methods. In: M. Berry, K.
Gallivan, E. Gallopoulos, A. Grama, B. Philippe, Y. Saad, F. Saied (eds.) High-Performance
Scientific Computing, pp. 185–197. Springer, London (2012), http://dx.doi.org/10.1007/978-
1-4471-2437-5_1
31. Dongarra, J., Croz, J.D., Hammarling, S., Hanson, R.: An extended set of FORTRAN basic
linear algebra subprograms. ACM Trans. Math. Softw. 14(1), 1–17 (1988)
32. Dongarra, J.J., Kaufman, L., Hammarling, S.: Squeezing the most out of eigenvalue solvers on
high-performance computers. Linear Algebra Appl. 77, 113–136 (1986)
33. Gallivan, K.A., Plemmons, R.J., Sameh, A.H.: Parallel algorithms for dense linear algebra
computations. SIAM Rev. 32(1), 54–135 (1990). http://dx.doi.org/10.1137/1032002
34. Dongarra, J., Du Croz, J., Hammarling, S., Duff, I.: A set of level-3 basic linear algebra
subprograms. ACM Trans. Math. Softw. 16(1), 1–17 (1990)
35. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J.,
Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide, 3rd
edn. Society for Industrial and Applied Mathematics, Philadelphia (1999)
36. Blackford, L., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J.,
Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.: ScaLAPACK
User’s Guide. SIAM, Philadelphia (1997). http://www.netlib.org/scalapack
37. Cuppen, J.: A divide and conquer method for the symmetric tridiagonal eigenproblem. Numer.
Math. 36, 177–195 (1981)
38. Dongarra, J.J., Sorensen, D.C.: A fully parallel algorithm for the symmetric eigenvalue problem.
SIAM J. Sci. Stat. Comput. 8(2), s139–s154 (1987)
39. Kuck, D., Sameh, A.: Parallel computation of eigenvalues of real matrices. In: Information
Processing ’71, pp. 1266–1272. North-Holland (1972)
40. Lo, S.S., Philippe, B., Sameh, A.: A multiprocessor algorithm for the symmetric tridiagonal
eigenvalue problem. SIAM J. Sci. Stat. Comput. 8, S155–S165 (1987)
41. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford University Press, New York (1965)
42. Forsythe, G., Malcom, M., Moler, C.: Computer Methods for Mathematical Computation.
Prentice-Hall, New Jersey (1977)
43. Ostrowski, A.: Solution of Equations and Systems of Equations. Academic Press, New York
(1966)
44. Bernstein, H., Goldstein, M.: Parallel implementation of bisection for the calculation of eigen-
values of tridiagonal symmetric matrices. Computing 37, 85–91 (1986)
45. Garbow, B.S., Boyle, J.M., Dongarra, J.J., Moler, C.B.: Matrix Eigensystem Routines—
EISPACK Guide Extension. Springer, Heidelberg (1977)
46. Demmel, J.W., Dhillon, I., Ren, H.: On the correctness of some bisection-like parallel eigen-
value algorithms in floating point arithmetic. Electron. Trans. Numer. Anal. pp. 116–149 (1995)
47. Dhillon, D., Parlett, B., Vömel, C.: The design and implementation of the MRRR algorithm.
ACM Trans. Math. Softw. 32, 533–560 (2006)
48. Ralha, R.: One-sided reduction to bidiagonal form. Linear Algebra Appl. 358(1–3), 219–238
(2003)
49. Bosner, N., Barlow, J.L.: Block and parallel versions of one-sided bidiagonalization. SIAM J.
Matrix Anal. Appl. 29, 927–953 (2007)
Part III
Sparse Matrix Computations
Chapter 9
Iterative Schemes for Large Linear Systems

Sparse linear systems occur in a multitude of applications in computational science


and engineering. While the need for solving such linear systems is most prevalent in
numerical simulations using mathematical models based on differential equations, it
also arises in other areas. For example, the PageRank vector that was used by Google
to order the nodes of a network based on its link structure can be interpreted as the
solution of a linear system of order equal to the number of nodes [1, 2]. The system
can be very large and sparse so that parallel iterative methods become necessary;
see e.g. [3, 4]. We note that we will not describe any parallel asynchronous iterative
methods that are sometimes proposed for solving such systems when they are very
large, possibly distributed across several computers that are connected by means of
a relatively slow network; cf. [5–8].
As we mentioned in the preface of this book, iterative methods for the parallel
solution of the sparse linear systems that arise in partial differential equations have
been proposed since the very early days of parallel processing. In fact, some basic
ideas behind parallel architectures have been inspired by such methods. For instance,
SOLOMON, the first parallel SIMD system, was designed in order to facilitate the
communication necessary for the fast implementation of iterative schemes and for
solving partial differential equations with marching schemes. The Illiac IV followed
similar principles while with proper design of the numerical model, the SIMD Mas-
sively Parallel Processor (MPP) built for NASA’s Goddard Space Flight Center by
Goodyear Aerospace between 1980 and 1985 was also shown to lend itself to the
fast solution of partial differential equations for numerical weather prediction with
marching schemes in the spirit of “Richardson’s forecast factory” [9–12].
In this chapter, we explore some methods for the solution of nonsingular linear
systems of the form Ax = f . We explore some of the basic iterative schemes for
obtaining an approximation y of the solution vector x with a specified relative resid-
ual r / f  ≤ τ , where r = f − Ay, with τ being a given tolerance. We divide
our presentation into two parts: (i) splitting, and (ii) polynomial methods. We illus-
trate these parallel iterative schemes through the use of the linear algebraic systems
resulting from the finite-difference discretization of linear elliptic partial differential
equations.

9.1 An Example

Let us consider the two-dimensional linear elliptic boundary value problem


   
$$-\frac{\partial}{\partial x}\!\left(a(x,y)\frac{\partial u}{\partial x}\right) - \frac{\partial}{\partial y}\!\left(c(x,y)\frac{\partial u}{\partial y}\right) + d(x,y)\frac{\partial u}{\partial x} + e(x,y)\frac{\partial u}{\partial y} + f(x,y)\,u = g(x,y), \qquad (9.1)$$

defined on the unit square 0 ≤ x, y ≤ 1, with the Dirichlet boundary conditions


u(x, y) = 0. Using the five-point central finite-difference discretization on a uniform
mesh of width h, we obtain the linear system of equations

Au = g, (9.2)

in which u represents an approximation of the solution at the points (i h, j h), i, j =


1, 2, . . . , n, where n is assumed to be even in what follows. Here A is a block-
tridiagonal matrix of order $n^2$,
$$A = \begin{pmatrix} A_1 & B_1 & & & \\ C_2 & A_2 & B_2 & & \\ & \ddots & \ddots & \ddots & \\ & & C_{n-1} & A_{n-1} & B_{n-1} \\ & & & C_n & A_n \end{pmatrix}, \qquad (9.3)$$

in which $A_j = [\gamma_i^{(j)}, \alpha_i^{(j)}, \beta_i^{(j)}]$, $j = 1, 2, \ldots, n$, is tridiagonal of order n, $B_j = \mathrm{diag}(\mu_1^{(j)}, \mu_2^{(j)}, \ldots, \mu_n^{(j)})$, and $C_j = \mathrm{diag}(\nu_1^{(j)}, \nu_2^{(j)}, \ldots, \nu_n^{(j)})$. Correspondingly,
we write

$$u^\top = (u_1^\top, u_2^\top, \ldots, u_n^\top),$$
and
$$g^\top = (g_1^\top, g_2^\top, \ldots, g_n^\top).$$

If we assume that a(x, y), c(x, y) > 0, and f (x, y) ≥ 0 on the unit square, then
$\alpha_i^{(j)} > 0$, and provided that
$$0 < h < \min_{i,j}\left\{ \frac{2a_{i\pm\frac{1}{2},j}}{|d_{ij}|},\ \frac{2c_{i\pm\frac{1}{2},j}}{|e_{ij}|} \right\},$$

where $d_{ij}$ and $e_{ij}$ are the values of $d(ih, jh)$ and $e(ih, jh)$, respectively, while $a_{i\pm\frac12,j}$ and $c_{i\pm\frac12,j}$ denote the values of $a(x, y)$ and $c(x, y)$ on the staggered grid. Thus, we have
$$\beta_i^{(j)},\ \gamma_i^{(j)},\ \mu_i^{(j)},\ \text{and}\ \nu_i^{(j)} < 0. \qquad (9.4)$$
Furthermore,
$$\alpha_i^{(j)} \ge |\beta_i^{(j)} + \gamma_i^{(j)} + \mu_i^{(j)} + \nu_i^{(j)}|. \qquad (9.5)$$

With the above assumptions it can be shown that the linear system (9.2) has a unique
solution [13]. Before we show this, however, we would like to point out that the above
block-tridiagonal is a special one with particular properties. Nevertheless, we will
use this system to illustrate some of the basic iterative methods for solving sparse
linear systems. We will explore the properties of such a system by first presenting
the following preliminaries. These and their proofs can be found in many textbooks,
see for instance the classical treatises [13–15].
Definition 9.1 A square matrix B is irreducible if there exists no permutation matrix
Q for which
$$Q^\top B Q = \begin{pmatrix} B_{11} & B_{12} \\ 0 & B_{22} \end{pmatrix},$$

where B11 and B22 are square matrices.

Definition 9.2 A matrix B = (βi j ) ∈ Rm×m is irreducibly diagonally dominant if


it is irreducible, and
$$|\beta_{ii}| \ge \sum_{\substack{j=1 \\ j\ne i}}^{m} |\beta_{ij}|, \quad i = 1, 2, \ldots, m,$$

with strict inequality holding at least for one i.


Lemma 9.1 The matrix $A \in \mathbb{R}^{n^2\times n^2}$ in (9.2) is irreducibly diagonally dominant.

Theorem 9.1 (Perron-Frobenius) Let B ∈ Rm×m be nonnegative, i.e., B ≥ 0 or


βi j ≥ 0 for all i, j, and irreducible. Then
(i) B has a real, simple, positive eigenvalue equal to its spectral radius, $\rho(B) = \max_{1\le i\le m} |\lambda_i(B)|$, with a corresponding nonnegative eigenvector $x \ge 0$.
(ii) ρ(B) increases when any entry in B increases, i.e., if C ≥ B with C = B,
then ρ(B) < ρ(C).

Theorem 9.2 The matrix A in (9.2) is nonsingular. In fact A is an M-matrix and


A−1 > 0.

Corollary 9.1 The tridiagonal matrices A j ∈ Rn×n , j = 1, 2, . . . , n in (9.3) are
nonsingular, and $A_j^{-1} > 0$.

It is clear then from Theorem 9.2 that the linear system (9.2) has a unique solution
u, which can be obtained by an iterative or a direct linear system solver. We explore
first some basic iterative methods based on classical options for matrix splitting.

9.2 Classical Splitting Methods

If the matrix A in (9.2) can be written as

A = M − N, (9.6)

where M is nonsingular, then the system (9.2) can be expressed as

Mu = N u + g. (9.7)

This, in turn, suggests the iterative scheme

Mu k+1 = N u k + g, k ≥ 0, (9.8)

with u 0 chosen arbitrarily. In order to determine the condition necessary for the
convergence of the iteration (9.8), let δu k = u k − u and subtract (9.7) from (9.8) to
obtain the relation

M δu k+1 = N δu k , k ≥ 0. (9.9)

Thus,

δu k = H k δu 0 , (9.10)

where H = M −1 N . Consequently, lim δu k = 0 for any initial vector u 0 , if and


k→∞
only if the spectral radius of H is strictly less than 1, i.e. ρ(H ) < 1. For linear
systems in which the coefficient matrix has the same properties as those of (9.3), an
important splitting of the matrix A is given by the following.
Definition 9.3 A = M − N is a regular splitting of A if M is nonsingular with
M −1 ≥ 0 and N ≥ 0.
Theorem 9.3 If A = M − N is a regular splitting, then the iterative scheme (9.8)
converges for any initial iterate u 0 [13].

Simple examples of regular splitting of A are the classical Jacobi and Gauss-Seidel
iterative methods. For example, if we express A as

A = D − L − U,
where $D = \mathrm{diag}(\alpha_1^{(1)}, \ldots, \alpha_n^{(1)}; \ldots; \alpha_1^{(n)}, \ldots, \alpha_n^{(n)})$, and $-L$, $-U$ are the strictly
lower and upper triangular parts of A, respectively, then the iteration matrices of the
Jacobi and Gauss-Seidel schemes are respectively given by,

H J = D −1 (L + U )

and

HG.S. = (D − L)−1 U. (9.11)
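The generic iteration (9.8), stopped on the relative residual introduced at the beginning of the chapter, can be sketched as follows (dense and purely illustrative; the function name is ours). The Jacobi and Gauss-Seidel splittings correspond to the indicated choices of M.

import numpy as np

def splitting_iteration(M, N, f, tau=1e-8, max_iter=10000):
    # Stationary iteration M u_{k+1} = N u_k + f for A = M - N,
    # stopped when ||f - A u|| <= tau * ||f||.
    A = M - N
    u = np.zeros_like(f, dtype=float)
    for _ in range(max_iter):
        u = np.linalg.solve(M, N @ u + f)   # in practice: reuse a factorization of M
        if np.linalg.norm(f - A @ u) <= tau * np.linalg.norm(f):
            break
    return u

# Jacobi:       M = np.diag(np.diag(A)),  N = M - A
# Gauss-Seidel: M = np.tril(A),           N = M - A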

9.2.1 Point Jacobi

Here, the iteration is given by

u k+1 = H J u k + b, k ≥ 0, (9.12)

where u 0 is arbitrary, H J = D −1 (L +U ), or (I − D −1 A), and b = D −1 g. Assuming


that H J and b are obtained at a preprocessing stage, parallelism in each iteration is
realized by an effective sparse matrix-vector multiplication as outlined in Sect. 2.4.
If instead of using the natural ordering of the mesh points (i.e. ordering the mesh
points row by row) that yields the coefficient matrix in (9.3), we use the so called
red-black ordering [15], the linear system (9.2) is replaced by $A'u' = g'$, or
$$\begin{pmatrix} D_R & E_R \\ E_B & D_B \end{pmatrix} \begin{pmatrix} u^{(R)} \\ u^{(B)} \end{pmatrix} = \begin{pmatrix} g_R \\ g_B \end{pmatrix}, \qquad (9.13)$$

where D R and D B are diagonal matrices each of order n 2 /2, and each row of E R (or
E B ) contains no more than 4 nonzero elements. The subscripts (or superscripts) B,
R denote the quantities associated with the black and red points, respectively. One
can also show that
$$A' = \begin{pmatrix} D_R & E_R \\ E_B & D_B \end{pmatrix} = P^\top A P,$$

where A is given by (9.3), and P is a permutation matrix. Hence, the iterative point-
Jacobi scheme can be written as
 
(R)  
(R)  
DR 0 u k+1 0 −E R uk gR
= + , k ≥ 0.
0 DB u
(B) −E B 0 u
(B) g B
k+1 k

Observing that $u^{(R)}_{k+1}$ depends only on $u^{(B)}_k$, and $u^{(B)}_{k+1}$ depends only on $u^{(R)}_k$, it is clear we need to compute only half of each iterate $u_j$. Thus, choosing $u^{(R)}_0$ arbitrarily, we generate the sequences
$$\begin{array}{l} u^{(B)}_{2k+1} = -E_B' u^{(R)}_{2k} + g_B', \\ u^{(R)}_{2k+2} = -E_R' u^{(B)}_{2k+1} + g_R', \end{array} \qquad k = 0, 1, 2, \ldots \qquad (9.14)$$

where

$$E_B' = D_B^{-1} E_B, \quad E_R' = D_R^{-1} E_R, \qquad g_B' = D_B^{-1} g_B, \quad g_R' = D_R^{-1} g_R. \qquad (9.15)$$

Evaluating $E_R'$, $E_B'$, $g_R'$, and $g_B'$, in parallel, in a pre-processing stage, the degree of parallelism in each iteration is again governed by the effectiveness of the matrix-vector multiplication involving the matrices $E_B'$ and $E_R'$. Note that each iteration in (9.14) is equivalent to two iterations of (9.12). In other words we update roughly half the unknowns in each half step of (9.14). Once $u^{(B)}_j$ (say) is accepted as a reasonable estimate of $u^{(B)}$, i.e., convergence has taken place, $u^{(R)}$ is obtained directly from (9.13) as
$$u^{(R)} = -E_R' u^{(B)}_j + g_R'.$$

Clearly, the point Jacobi iteration using the red-black ordering of the mesh points
exhibits a higher degree of parallelism than using the natural ordering.
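For the model problem with a = c = 1 and d = e = f = 0, the two half-steps of (9.14) reduce to independent five-point averages over the two colors, which the NumPy sketch below expresses as whole-array operations; the helper names are ours, and the color-wise slicing is precisely the parallelism discussed above.

import numpy as np

def red_black_jacobi(g, h, sweeps=100):
    # Red-black iteration (9.14) for the 5-point discrete Poisson problem
    # -Laplace(u) = g with zero Dirichlet data; g is the n x n array of
    # interior right-hand-side values, and interior points are colored by
    # the parity of (i + j).
    n = g.shape[0]
    u = np.zeros((n + 2, n + 2))                     # ring of boundary zeros
    ii, jj = np.meshgrid(np.arange(1, n + 1), np.arange(1, n + 1), indexing="ij")
    red = (ii + jj) % 2 == 0
    black = ~red
    def relax(mask):
        avg = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
                      + h * h * g)
        u[1:-1, 1:-1][mask] = avg[mask]              # update one color at once
    for _ in range(sweeps):
        relax(black)      # first half-step of (9.14): all black points
        relax(red)        # second half-step: all red points
    return u[1:-1, 1:-1]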

9.2.2 Point Gauss-Seidel

Employing the natural ordering of the uniform mesh, the iterative scheme is given by

u k+1 = (D − L)−1 (U u k + g), k ≥ 0. (9.16)

Here again the triangular system to be solved is of a special structure, and obtaining
an approximation of the action of (D − L)−1 on a column vector can be obtained
effectively on a parallel architecture, e.g. via use of the Neumann series. Using the

point red-black ordering of the mesh, and applying the point Gauss-Seidel splitting
to (9.13), we obtain the iteration
$$\begin{pmatrix} u^{(R)}_{k+1} \\ u^{(B)}_{k+1} \end{pmatrix} = \begin{pmatrix} D_R^{-1} & 0 \\ -D_B^{-1} E_B D_R^{-1} & D_B^{-1} \end{pmatrix} \left[ \begin{pmatrix} 0 & -E_R \\ 0 & 0 \end{pmatrix} \begin{pmatrix} u^{(R)}_{k} \\ u^{(B)}_{k} \end{pmatrix} + \begin{pmatrix} g_R \\ g_B \end{pmatrix} \right], \quad k \ge 0.$$
Simplifying, we get
$$u^{(R)}_{k+1} = -E_R' u^{(B)}_k + g_R', \quad \text{and} \quad u^{(B)}_{k+1} = -E_B' u^{(R)}_{k+1} + g_B', \qquad (9.17)$$

where $u^{(B)}_0$ is chosen arbitrarily, and $E_R'$, $E_B'$, $g_B'$, $g_R'$ are obtained in parallel via a preprocessing stage as given in (9.15). Taking into consideration, however, that [13]

$$\rho(H_{G.S.}) = \rho^2(H_J), \qquad (9.18)$$
where
$$H_J = \begin{pmatrix} D_R^{-1} & 0 \\ 0 & D_B^{-1} \end{pmatrix} \begin{pmatrix} 0 & -E_R \\ -E_B & 0 \end{pmatrix}, \qquad H_{G.S.} = \begin{pmatrix} D_R & 0 \\ E_B & D_B \end{pmatrix}^{-1} \begin{pmatrix} 0 & -E_R \\ 0 & 0 \end{pmatrix},$$

we see that the red-black point-Jacobi scheme (9.14) and the red-black point-Gauss-
Seidel scheme (9.17) are virtually equivalent regarding degree of parallelism and for
solving (9.13) to a given relative residual.

9.2.3 Line Jacobi

Here we consider the so called line red-black ordering [15], where one row of mesh
points is colored red while the one below and above it is colored black. In this case
the linear system (9.2) is replaced by $A''v = f$, or
$$\begin{pmatrix} T_R & F_R \\ F_B & T_B \end{pmatrix} \begin{pmatrix} v^{(R)} \\ v^{(B)} \end{pmatrix} = \begin{pmatrix} f_R \\ f_B \end{pmatrix}, \qquad (9.19)$$

where TR = diag(A1 , A3 , A5 , . . . , An−1 ), TB = diag(A2 , A4 , A6 , . . . , An ),


$$F_R = \begin{pmatrix} B_1 & & & & \\ C_3 & B_3 & & & \\ & C_5 & B_5 & & \\ & & \ddots & \ddots & \\ & & & C_{n-1} & B_{n-1} \end{pmatrix}, \qquad F_B = \begin{pmatrix} C_2 & B_2 & & & \\ & C_4 & B_4 & & \\ & & \ddots & \ddots & \\ & & & C_{n-2} & B_{n-2} \\ & & & & C_n \end{pmatrix},$$

and A j , B j , and C j are as given by (9.3). In fact, it can be shown that

$$A'' = Q^\top A Q,$$
where Q is a permutation. Now,
$$A'' = \begin{pmatrix} T_R & 0 \\ 0 & T_B \end{pmatrix} - \begin{pmatrix} 0 & -F_R \\ -F_B & 0 \end{pmatrix}$$
is a regular splitting of $A''$ which gives rise to the line-Jacobi iterative scheme

(R) (B)
vk+1 = TR−1 (−FR vk + f R ),
(B) (R)
k≥0 (9.20)
vk+1 = TB−1 (−FB vk + f B ),

which can be written more economically as,

(B) (R)
v2k+1 = TB−1 (−FB v2k + f B ),
(R) (B)
k≥0 (9.21)
v2k+2 = TR−1 (−FR v2k+1 + f R ),

(R)
with v0 chosen arbitrarily. For efficient implementation of (9.21), we start the pre-
processing stage by obtaining the factorizations A j = D j L j U j , j = 1, 2, . . . , n,
where D j is diagonal, and L j , U j are unit lower and unit upper bidiagonal matrices,
respectively. Since each A j is diagonally dominant, the above factorizations are
obtained by Gaussian elimination without pivoting. Note that processor i handles
the factorization of A2i−1 and A2i . Using an obvious notation, we write

TR = D R L R U R , and TB = D B L B U B , (9.22)
9.2 Classical Splitting Methods 285

where, for example, L R = diag(L 1 , L 3 , . . . , L n−1 ), and L B = diag(L 2 , L 4 ,


. . . , L n ). Now, the iteration (9.21) is reduced to solving the following systems

(B) (R) (B) (B)


L B w2k+1 = (−G B v2k + h B ) ; U B v2k+1 = w2k+1 ,
and (9.23)
(R) (B) (R) (R)
L R w2k+2 = (−G R v2k+1 + h R ) ; U R v2k+2 = w2k+2

where

G R = D −1 −1
R F R , G B = D B FB , (9.24)

h R = D −1 −1
R f R , h B = DB f B , (9.25)

(R)
with v0 chosen arbitrarily. Thus, if we also compute G R , h R , and G B , h B in the
pre-processing stage, each iteration (9.23) has ample degree of parallelism that can
be further enhanced if solving triangular systems involving each L j is achieved via
any of the parallel schemes outlined in Chap. 3. Considering the ample parallelism
inherent in each iteration of the red-black point Jacobi scheme using n/2 multicore
nodes, it is natural to ask why should one consider the line red-black ordering. The
answer is essentially provided by the following theorem.

Theorem 9.4 ([13]) Let A be that matrix given in (9.2). Also let A = M1 − N1 =
M2 − N2 be two regular splittings of A. If N2 ≥ N1 ≥ 0, with neither N1 nor
(N2 − N1 ) being the null matrix, then

ρ(H1 ) < ρ(H2 ) < 1,

where Hi = Mi−1 Ni , i = 1,2.

Note that the regular splittings of A (red-black point-Jacobi) and A (red-black


line-Jacobi) are related. In fact, if we have the regular splittings:
   
DR 0 0 −E R
A = − ,
0 DB −E B 0

= MJ − NJ ,

and
   
TR 0 0 −FR
A = − ,
0 TB −FB 0

= MJ − NJ
286 9 Iterative Schemes for Large Linear Systems

Then A has the two regular splittings,

A = M J − N J = S  M J S − S  N J S,

where S = P  Q is a permutation matrix, and S  N J S ≥ N ≥ 0, equality excluded.


Consequently, ρ(H J ) < ρ(H J ) < 1, where

−1 −1
H J = M J N J and H J = M J N J , (9.26)

i.e., the line red-black Jacobi converges in fewer iterations than the point red-black
Jacobi scheme. Such advantage, however, quickly diminishes as n increases, i.e., as
h = 1/(n + 1) → 0, ρ(H J ) → ρ(H J ). For example, in the case of the Poisson
equation, i.e., when a(x, y) = c(x, y) = 1, d(x, y) = e(x, y) = 0, and f (x, y) =
f , we have

ρ(H J ) 1
= .
ρ(H J ) 2 − cos π h

For n as small as 100, the above ratio is roughly 0.9995.

9.2.4 Line Gauss-Seidel

Applying the Gauss-Seidel splitting to (9.19)—linered-black ordering—we obtain


the iteration
(R) (B)
vk+1 = TR−1 (−FR vk + f R ),
(B) (R) k≥0 (9.27)
vk+1 = TB−1 (−FB vk+1 + f B ),

(B)
with v0 chosen arbitrarily. If, in the pre-processing stage, TR and TB are factored as
shown in (9.22), the iteration (9.27) reduces to solving the following linear systems

(R) (B) (R) (R)


L R xk+1 = (−G R vk + h R ) ; U R vk+1 = xk+1 ,
and (9.28)
(B) (R) (B) (B)
L B xk+1 = (−G B vk+1 + h B ) ; U B vk+1 = xk+1 ,

where G and h are as given in (9.24 and 9.25). Consequently, each iteration (9.28)
can be performed with high degree of parallelism using (n/2) nodes assuming that
G R , B B , h R and h B have already been computed in a preprocessing stage. Similarly,
such parallelism can be further enhanced if we use one of the parallel schemes in
Chap. 3 for solving each triangular system involving L j , j = 1, 2, ....., n.
9.2 Classical Splitting Methods 287

We can also show that


)  2
ρ(HG.S. ρ(H J )
= .
ρ(HG.S. ) ρ(H J )

9.2.5 The Symmetric Positive Definite Case

If d(x, y) = e(x, y) = 0 in (9.1), then the linear system (9.2) is symmetric. Since A
( j)
is nonsingular with αi > 0, i, j = 1, 2, . . . , n, then by virtue of Gerschgorin’s the-
orem, e.g., see [16] and (9.5), the system (9.2) is positive-definite. Consequently, the
linear systems in (9.13) and (9.19) are positive-definite. In particular, the matrices TR
and TB in (9.19) are also positive-definite. In what follows, we consider two classical
acceleration schemes: (i) the cyclic Chebyshev semi-iterative method e.g. see [13,
17] for accelerating the convergence of line-Jacobi, and (ii) Young’s successive over-
relaxation, e.g. see [13, 15] for accelerating the convergence of line Gauss-Seidel,
without adversely affecting the parallelism inherent in these two original iterations.
In this section, we do not discuss the notion of multisplitting where the coefficient
matrix A has, say k possible splittings of the form M j − N j , j = 1, 2, ...., k. On
a parallel architecture k iterations as in Eq. (9.8) can proceed in parallel but, under
some conditions, certain combinations of such iterations can converge faster than
any based on one splitting. For example, see [5, 18–20].
The Cyclic Chebyshev Semi-Iterative Scheme
For the symmetric positive definite case, each A j in (9.2) is symmetric positive
definite tridiagonal matrix, and C j+1 = B 
j . Hence the red-black line-Jacobi iteration
(9.20) reduces to,
 
(R)
 
(R)
 
TR 0 vk+1 0 −FR vk fR
= + .
0 TB (B)
vk+1 −FR 0 (B)
vk fB

Since TR and TB are positive-definite, the above iteration may be written as



 
 
(R) (R)
xk+1 0 −K xk bR
= + , (9.29)
(B)
xk+1 −K  0 (B)
xk bB

−1/2 −1/2 −1/2 −1/2


where K = TR FR T B , b R = TR f R , b B = TB f B , and




(R) 1/2 (R)
xk TR 0 vk
(B) = 1/2 (B)
xk 0 TB vk

converges to the true solution


288 9 Iterative Schemes for Large Linear Systems

 −1  
I K bR
x= . (9.30)
K I bB

The main idea of the semi-iterative method is to generate iterates


k
yk = ψ j (K )x j , k≥0 (9.31)
j=0

with the coefficients ψ j (K ) chosen such that yk approaches the solution x faster than
(R) (B)
the iterates x 
j = (x j , x j ) produced by (9.29). Again from (9.29) it is clear
that if x0 is chosen as x, then x j = x for j > 0. Hence, if yk is to converge to x we
must require that


k
ψ j (K ) = 1, k ≥ 0.
j=0

Let δyk = yk − x and δxk = xk − x, then from (9.29) to (9.31), we have

δyk = qk (H )δy0 , k≥0

where
 
k
0 −K
H= , qk (H ) = ψ j (K )H j ,
−K  0
j=0

and δy0 = δx0 . Note that qk (I ) = I . If all the eigenvalues of H were known
beforehand, it would have been possible to construct the polynomials qk (H ), k ≥ 1,
such that the exact solution (ignoring roundoff errors) is obtained after a finite number
of iterations. Since this is rarely the case, and since

δyk  ≤ qk (H ) δy0  = qk (ρ)δy0 ,

where ρ = ρ(H ) is the spectral radius of H , we seek to minimize qk (ρ) for k ≥ 1.


Observing that the eigenvalues of H are given by ±λi , i = 1, 2, . . . , n 2 /2, with
0 < λi < 1, we construct the polynomial q̂k (ξ ) characterized by

min max |qk (ξ )| = max |q̂k (ξ )|,


qk (ξ )ε Q −ρ≤ξ ≤ρ −ρ≤ξ ≤ρ

where Q is the set of all polynomials qk (ξ ) of degree k such that q0 (ξ ) = qk (1) = 1.


The unique solution of this problem is well known [13]
9.2 Classical Splitting Methods 289

τk (ξ/ρ)
q̂k (ξ ) = , −ρ ≤ ξ ≤ ρ (9.32)
τk (1/ρ)

where τk (ξ ) is the Chebyshev polynomial of the first kind of degree k,

τk (ξ ) = cos(k cos−1 ξ ), |ξ | ≤ 1,
= cosh(k cosh−1 ξ ), |ξ | > 1

which satisfy the 3-term recurrence relation

τ j+1 (ξ ) = 2ξ τ j (ξ ) − τ j−1 (ξ ), j ≥ 1, (9.33)

with τ0 (ξ ) = 1, and τ1 (ξ ) = ξ . Hence, δyk = q̂k (H )δy0 , and from (9.29), (9.32),
and (9.33), we obtain the iteration

yk+1 = ωk+1 [b + H yk ] + (1 − ωk+1 )yk−1 , k≥1 (9.34)

where y1 = H y0 + b, i.e. ω1 = 1, with y0 chosen arbitrarily, and

2 τk (1/ρ)
ωk+1 = ρ τ
k+1 (1/ρ)  (9.35)
1/ 1 − ρ4 ωk ,
2
= k = 2, 3, . . .

with ω2 = 2/(2 − ρ 2 ). Consequently, the sequence yk converges to the solution x


with the errors δyk = yk − x satisfying the relation

δyk  ≤ δx0 /τk (ρ −1 ),


= 2δx0 e−kθ (1 + e−2kθ )−1 ,

where

θ = loge (ρ −1 + ρ −2 − 1).

Such reduction in errors is quite superior to the line-Jacobi iterative scheme (9.21)
in which

δvk  ≤ ρ k δv0 .

Note that H is similar to H J , hence ρ ≡ ρ(H ) = ρ(H J ) where H j is defined in


(9.26). Similar to the Jacobi iteration (9.21), (9.34) can be written more efficiently as
(R) (R) (R)
z 2k = z 2k−2 + ω2k Δz 2k ,
(B) (B) (B) k≥1 (9.36)
z 2k+1 = z 2k−1 + ω2k+1 Δz 2k+1 ,
290 9 Iterative Schemes for Large Linear Systems

−1/2 (R) (B) −1/2 (B)


where z 0(R) is chosen arbitrarily, z (R)
j = TR yj , z j = TB yj , with

(B) (R)
z1 = TB−1 ( f B − FR z 0 ),
(R) (B) (R) (9.37)
Δz 2k = TR−1 ( f R − FR z 2k−1 ) − z 2k−2 ,

and
(B) (R)
Δz 2k+1 = TB−1 ( f B − FR z 2k ) − z 2k−1 .

Assuming that we have a good estimate of ρ, the spectral radius of H J , and hence
have available the array ω j , in a preprocessing stage, together with a Cholesky
factorization of each tridiagonal matrix A j = L j D j L j , then each iteration (9.37)
can be performed with almost perfect parallelism. This is assured if each pair A2i−1
and A2i , as well as B2i−1 and B2i are stored in the local memory of the ith multicore
node i, i = 1, 2, . . . , n/2. Note that each matrix-vector multiplication involving FR
or FR , as well as solving tridiagonal systems involving TR or TB in each iteration
(9.37), will be performed with very low communication overhead. If we have no prior
knowledge of ρ, then we either have to deal with evaluating the largest eigenvalue
in modulus of the generalized eigenvalue problem
   
0 FR TR 0
u=λ u
FR 0 0 TB

before the beginning of the iterations, or use an adaptive procedure for estimating
ρ during the iterations (9.36). In the latter, early iterations can be performed with
nonoptimal acceleration parameters ω j in which ρ 2 is replaced by

(R) (R)
Δz 2k FR TB−1 FR Δz 2k
ρ2k
2
= ,
(R) (R)
Δz 2k TR Δz 2k

and as k increases, ρ2k approaches ρ, see [21].


Note that this method may be regarded as a polynomial method in which rk =
(I − AM −1 )k r0 , where we have the matrix splitting A = M − N .

The Line-Successive Overrelaxation Method


The line Gauss-Seidel iteration for the symmetric positive definite case is given by,
 
(R)
 
 
TR 0 vk+1 0 −FR vk(R) fR
= + .
FR TB (B)
vk+1 0 0 vk
(B) fB
9.2 Classical Splitting Methods 291

Consider, instead, the modified iteration


 
(R)
1  
(R)  
1
ω TR 0
vk+1
ω − 1 TR  −FR vk fR
= + , (9.38)
FR ω1 TB (B)
vk+1 0 1
ω − 1 T B v
(B) f B
k

in which ω is a real parameter. If ω = 1, we have the line Gauss-Seidel iteration, and


for ω > 1 (ω < 1), we have the line-successive overrelaxation (underrelaxation). It
is worthwhile mentioning that for 0 < ω ≤ 1, we have a regular splitting of
 
T R FR
C= (9.39)
FR TB

and convergence of (9.38) is assured. The following theorem shows how large can
ω be and still assuring that ρ(Hω ) < 1, where

Hω = Mω−1 Nω ,

in which
 1

Mω = ω TR 0 , (9.40)
FR ω1 TB

and
1  
ω − 1 TR  −FR
Nω = . (9.41)
ω − 1 TB
1
0

i.e.,
  
I 0 (1 − ω)I −ωTR−1 FR
Hω = . (9.42)
−ωTB−1 FR I 0 (1 − ω)I

Theorem 9.5 ρ(Hω ) < 1 if and only if 0 < ω < 2.

Proof The proof follows directly from the fact that the matrix

S=Mω + Nω 
ω − 1 TR 
2
0
=
ω − 1 TB
2
0

is symmetric positive definite, see [22], for 0 < ω < 2.



292 9 Iterative Schemes for Large Linear Systems

Theorem 9.6 Let


2
ω0 =  ,
1+ 1 − ρ 2 (H J )

where
  
TR−1 0 0 −FR
H J = .
0 TB−1 −FR 0

Then

ρ(Hω 0 ) = min ρ(Hω ).


0<ω<2

Proof Consider the algebraic eigenvalue problem Hω u = λu, or


     
(1 − ω)I −ωTR−1 FR u1 I 0 u1
=λ .
0 (1 − ω)I u2 ωTB−1 FR I u2

This can be reduced to the eigenvalue problem

(1 − ω − λ)2
TB−1 FR TR−1 FR u 2 = u2.
λω2

If we let H J w = μw, with w = (w1 , w2 ), we can similarly verify that

TB−1 FR TR−1 FR w2 = μ2 w2 .

Thus, if λ is an eigenvalue of Hω , then

(λ + ω − 1)2
μ2 = (9.43)
λω2

is the square of an eigenvalue of H J , or

λ2 + [2(ω − 1) − ω2 μ2 ]λ + (ω − 1)2 = 0.

Differentiating w.r.t. λ, we get

1 2 2
λ+ω−1= μ ω . (9.44)
2
From (9.43) and (9.44) ρ(Hω ) is a minimum when
9.2 Classical Splitting Methods 293

ρ 4 (H J )ω4
ρ 2 (H J ) = 1 ,
2 (H )ω2
4ω2 2ρ J −ω+1

i.e.,

ρ 2 (H J )ω2 − 4ω + 4 = 0.

Taking the smaller root of the above quadratic, we obtain the optimal value of ω,

2
ω0 =  ,
1+ 1 − ρ 2 (H J )

which minimizes ρ(Hω ). Hence,


ρ(Hω 0 ) = ω0 − 1
ρ 2 (H J )
=  , (9.45)
[1 + 1 − ρ 2 (H J )]2

note that ρ(H J ) < 1.




Algorithm 9.1 Line SOR iteration


(R) (B)
(i) L R xk+1 = −G R vk + hR
1: Solve (R) (R)
(ii) U R yk+1 = xk+1
(R) (R) (R)
2: vk+1 = ω0 yk+1 + (1 − ω0 )vk
3: Solve
(B) (R)
(i) L B xk+1 = −G B vk+1 + h B
(B) (B)
(ii) U B yk+1 = xk+1
(B) (B) (B)
4: vk+1 = ω0 yk+1 + (1 − ω0 )vk

Using the factorization (9.22) of TR and TB , then similar to the line Gauss-Seidel
iteration (9.23), the line-SOR iteration is given by Algorithm 9.2.5 where ω0 is the
(B)
optimal acceleration parameter, v0 is chosen arbitrarily, and G, h are as given in
(9.23). Assuming that G R , G B , and h R , h B are computed in a preprocessing stage
each iteration exhibits ample parallelism as before provided ω0 is known. If the
optimal parameter ω0 is not known beforehand, early iterations may be performed
with the non-optimal parameters

2
ωj =  ,
1 + 1 − μ2j
294 9 Iterative Schemes for Large Linear Systems

where
s  −1
j FR T R FR s j
μ2j =
s
j TB s j

(B) (B)
in which s j = v j − v j−1 , and as j increases μ j approaches ρ(H J ), see [21].

9.3 Polynomial Methods

Given an initial guess x0 ∈ Rn , the exact solution of Ax = f , can be expressed as

x = x0 + A−1 r0 , (9.46)

where r0 = f − Ax0 is the initial residual. A polynomial method is one which


approximates A−1 r0 for each iteration k = 1, 2, . . . by some polynomial expression
pk−1 (A)r0 so that the kth iterate is expressed as:

xk = x0 + pk−1 (A)r0 , (9.47)

where pk−1 ∈ Pk−1 is the set of all polynomials of degree no greater than k − 1.
In the following two sections, we consider two such methods. In Sect. 9.3.1 we
consider the use of an explicit polynomial for symmetric positive definite linear
systems. In Sect. 9.3.2 we consider the more general case where the linear system
could be nonsymmetric and the polynomial pk−1 is defined implicitly. This results
in an iterative scheme characterized by the way the vector pk−1 (A)r0 is selected.
This class of methods is referred to as Krylov subspace methods.

9.3.1 Chebyshev Acceleration

In this section we consider the classical Chebyshev method (or Stiefel iteration
[23]), and one of its generalizations—the block Stiefel algorithm [24]—for solving
symmetric positive definite linear systems of the form,

Ax = f, (9.48)

where A is of order n. Further, for the sake of illustration, let us assume that A has
a spectral radius less than 1. Also, let

0 < ν = μ1  μ2  · · · μn = μ < 1,
9.3 Polynomial Methods 295

where ν and μ are the smallest and largest eigenvalues of A, respectively. Then, the
classical Chebyshev iteration (or Stiefel iteration [23]) for solving the above linear
system is given by,

1: (a) Initial step


x0 : arbitrary
r0 = f − Ax0
x1 = x0 + γ −1 r0 ; r1 = f − Ax1 .
2: (b) For j = 0, 1, 2, 3, · · · , obtain the following:

1. Δx j = ω j r j + (γ ω j − 1)Δx j−1 ,
2. x j+1 = x j + Δx j , (9.49)
3. r j+1 = f − Ax j+1 .

Here, γ = β/α, in which

2 μ+ν
α= , β=
μ−ν μ−ν

and
 −1
1
ω j = γ − 2 ω j−1 , j ≥1

with ω0 = 2/γ .
This iterative scheme produces residuals r j that satisfy the relation

r j = P j (A)r0 (9.50)

where P j (λ) is a polynomial of degree j given by,

τ j (β − αλ)
P j (λ) = , ν  λ  μ, (9.51)
τ j (β)

where, as defined earlier, τ j (ξ ) is the Chebyshev polynomial of the first kind:



cos( j cos−1 ξ ), |ξ |  1,
τ j (ξ ) =
cos h( j cos h −1 ξ ), ξ  1.

As a result
r j 2
 [τ j (β)]−1 .
r0 2
296 9 Iterative Schemes for Large Linear Systems

Suppose now that ν is just an estimate of an interior eigenvalue μs+1 , where s is a


small integer

0 < μ1  · · ·  μs < ν  μs+1  · · ·  μn ≤ μ < 1.


n
Thus, if r0 = i=1 ηi z i , where z i is the eigenvector of A corresponding to μi , then
r j can be expressed as


s
n
rj = ηi P j (μi )z i + ηi P j (μi )z i = r j + r j . (9.52)
i=1 i=s+1

While r j is damped out quickly as j increases, i.e., for β = (μ + ν)/(μ − ν)

r j |2
 [τ j (β)]−1 , (9.53)
r0 2

the term r j is damped out at a much slower rate,

r j 2 τ j (β − αμ1 )
 .
r0 2 τ j (β)

The basic strategy of the block Stiefel algorithm is to annihilate the contributions of
the eigenvectors z 1 , z 2 , . . . , z s to the residuals r j so that eventually r j 2 approaches
zero as ζ j = 1/τ j [(μn +μs+1 )/(μn −μs+1 )] rather than ψ j = 1/τ j [(μn +μ1 )/(μn −
μ1 )] as in the classical Stiefel iteration [25]. Let Z = [z 1 , z 2 , . . . , z s ] be the ortho-
normal matrix consisting of the s-smallest eigenvectors. Then, from the fact that
r j = −A(x j − x) a projection process [26] produces the improved iterate

x̂ j = x j + Z (Z  AZ )−1 Z r j ,

for which the corresponding residual r̂ j = b − A x̂ j has zero projection onto z i ,


1  i  s, i.e., Z r̂ j = 0. Note that Z  AZ = diag(μ1 , . . . , μs ). This procedure is
essentially a deflation technique.
Assuming, for the time being, that we also have the optimal parameters μ = μn ,
ν = μs+1 where s is 2 or 3 such that μs < μs+1 , and ζk  ψk for k not too large,
together with a reasonable approximation of the eigenpairs μi and z i , 1  i  s.
In order to avoid the computation of inner products in each iteration, we adopt
an unconventional stopping criterion. Once ζi , the right-hand side of (9.53), drops
below a given tolerance, we consider r 2 to be sufficiently damped and start the
projection step. Therefore, once x+1 and r+1 are obtained, the improved iterate
x+1 is computed by
9.3 Polynomial Methods 297


s
x̂+1 = x+1 + μi (z ir+1 )z i . (9.54)
i=1

While it is reasonable to expect that the optimal parameters μ and ν(μn  μ <
1, μs  ν < μs+1 ) are known, for example, as a result of previously solving
problem (9.48) with a different right-hand side using the CG algorithm, it may not
be reasonable to assume that Z is known a priori. In this case, the projection step
(9.54) may be performed as follows.
Let k =  − s + 1, where  is determined as before, i.e., so that τk (β) is large
enough to assure that rk 2 is negligible compared to rk 2 . Now, from (9.50) and
(9.52)

rk  Pk (A)Z y,

where y  = (η1 , η2 , . . . , ηs ), or

rk  Z wk

in which

wk = (η1 Pk (μ1 ), . . . , ηs Pk (μs )).

Consequently,

R = [r−s+1 , r−s+2 , . . . , r ]  Z [w−s+1 , w−s+2 , . . . , w ] = Z W .

Let

R  = Q  U (9.55)

be the orthogonal factorization of R where Q  has orthonormal columns and U is


upper triangular of order s. As a result

Q   Z Θ

in which Θ is an orthogonal matrix or order s, and

x̂+1 = x+1 + Q  (Q  −1 
 AQ  ) Q  r+1 (9.56)

has the desired property that Z r̂+1  0. Note that the eigenvalues of Q 
 AQ  are
good approximations of μ1 , . . . , μs .
The projection stage consists of the six steps shown below.
1. The modified Gram-Schmidt factorization R = Q  U , as described in Algo-
rithm 7.3, where
298 9 Iterative Schemes for Large Linear Systems
⎡ ⎤
p11 p12 · · · p1s
(1) (1)
R = [r−s+1 , . . . , r ], Q  = [q−s+1 , . . . , q ], and U = ⎣ p22 · · · p2s ⎦ .
pss

2. Obtain the s inner-products γi = qi r+1 ,  − s + 1  i  .


3. Compute the upper triangular part of the symmetric matrix S = Q   AQ  .
4. Solve the linear system Sd = c, where c  = (γ , . . . , γ ).
 −s+1 
5. Compute d = Q  d = i=−s+1 qi δi , where δi = ei d in which ei is the ith
column of the identity.
6. Finally, obtain x̂+1 = x+1 + d .
For small s, e.g. s = 2 or 3, the cost of this projection stage is modest.

Algorithm 9.2 Block Stiefel iterations.


Input: A ∈ Rn×n (A SPD), f, x0 ∈ Rn , μ, ν ∈ R, tol > 0.
Output:  ∈ N, x ∈ Rn , r+1 
f .
β
1: α = μ−ν 2
;β = μ+ν −1
μ−ν ; γ = α ; r0 = f − A x 0 ; x 1 = x 0 + γ0 ; r1 = f − A x 1 ; Δx 0 = 0;
ω−1 = 0;
2: Determine  for which τ (β) ;
3: for j = 1 : , do
 −1
4: ω j = γ − 4α1 2 ω j−1 ;
5: θ j = γ ω j − 1;
6: Δx j = ω j r j + θ j Δx j−1 ;
7: x j+1 = x j + Δx j ;
8: r j+1 = f − A x j+1 ;
9: R j = [r j−s+1 , · · · , r j−1 , r j ];
10: end
//Projection step: Obtain the orthogonal factorization of R via MGS:
11: R = Q  U (Q  orthonormal) ;
12: Form S = Q   AQ  and c = Q  r+1 ;
13: Solve the system : Sd = c;
14: x̂+1 = x + Q  d;
15: x = x̂+1 ;
16: r = f − A x;
17: Compute rel. res. r+1 
f ;

Remark 9.1 On a parallel computing platform the block Stiefel iteration takes more
advantage of parallelism than the classical Chebyshev scheme. Further, on a platform
of many multicore nodes (peta- or exa-scale architectures), the block Stiefel itera-
tion could be quite scalable and consume far less time than the conjugate gradient
algorithm (CG) for obtaining a solution with a given level of the relative residual.
This is due to the fact that the block Stiefel scheme avoids the repeated fan-in and
fan-out operations (inner products) needed in each CG iteration.
9.3 Polynomial Methods 299

9.3.2 Krylov Methods

Modern algorithms for the iterative solution of linear systems involving large sparse
matrices are frequently based on approximations in Krylov subspaces. They allow
for implicit polynomial approximations in classical linear algebra problems, namely:
(i) computation of eigenvalues and corresponding eigenvectors, (ii) solving linear
systems, and (iii) matrix function evaluation. To make such a claim more precise,
we consider the problem of solving the linear system:

Ax = f, (9.57)

where A ∈ Rn×n is a large nonsingular matrix and f ∈ Rn .


In this section, we consider the Arnoldi process, which is at the core of many
such iterative schemes, and explore the potential of high performance on parallel
architectures. For general presentations of Krylov subspace methods, one is referred
to some of these references [27–30].
Krylov Subspaces
Definition 9.4 Given a matrix A ∈ Rn×n and a vector r0 , for k ≥ 1, the Krylov
subspace Kk (A, r0 ) is the subspace spanned by the set of vectors {r0 , Ar0 , A2 r0 , . . . ,
Ak−1r0 }.

A Krylov method is characterized by how the vector yk ∈ Kk (A, r0 ) is selected to


define the kth iterate as xk = x0 + yk . In general, the method implicitly defines the
polynomial pk−1 such that yk = pk−1 (A)r0 . For instance, the GMRES method [31]
consists of selecting yk such that the 2-norm of the residual rk = f − Axk = r0 − Ayk
is minimized over Kk (A, r0 ); this defines a unique polynomial but it is not made
explicit by the method.

Proposition 9.1 With the previous notations, we have the following assertions:
• The sequence of the Krylov subspaces is nested:

Kk (A, r0 ) ⊆ Kk+1 (A, r0 ), for k ≥ 0.

• There exists η ≥ 0 such that the previous sequence is increasing for k ≤ η (i.e.
dim (Kk (A, r0 )) = k), and is constant for k ≥ η (i.e. Kk (A, r0 ) = Kη (A, r0 )).
The subspace Kη (A, r0 ) is invariant with respect to A. When A is nonsingular,
the solution of (9.57) satisfies x ∈ x0 + Kη (A, r0 ).
• If Ak+1 r0 ∈ Kk (A, r0 ), then Kk (A, r0 ) is an invariant subspace of A.

Proof The proof is straightforward.

Working with the Krylov subspace Kk (A, r0 ) usually requires the knowledge of a
basis. As we will point out later, the canonical basis {r0 , Ar0 , A2 r0 , . . . , Ak−1 r0 } is
ill-conditioned and hence not appropriate for direct use. The most robust approach
300 9 Iterative Schemes for Large Linear Systems

consists of obtaining an orthogonal version of this canonical basis. Such a procedure


is called the Arnoldi process.
The Arnoldi Process
Let us assume that, for k ≥ 0, the columns of the matrix Vk = [v1 , v2 , . . . , vk ] form
an orthonormal basis of Kk (A, r0 ). An algorithm can be derived inductively as long
as the sequence of Krylov subspaces is not constant:
k = 1: by normalizing r0 , one gets v1 = rr00  .
induction: let Vk = [v1 , v2 , . . . , vk ] be an orthonormal basis of Kk (A, r0 ). The vector
wk = Avk obviously lies in Kk+1 (A, r0 ). If wk ∈ Kk (A, r0 ) then Kk (A, r0 ) is
invariant and the sequence is constant from that point on and the process stops.
Else, wk ∈ Kk+1 (A, r0 )\Kk (A, r0 ).1 Then the vector vk+1 can be defined by
vk+1 = PPkk w ⊥
wk  where Pk is the orthogonal projector onto Kk (A, r0 ) .
k

With this procedure, one realizes a Gram-Schmidt process as implemented in Algo-


rithm 7.2 of Chap. 7. As a result, it performs (step-by-step) a QR-factorization of
the matrix [r0 , Av1 , Av2 , . . . , Avk ] = Vk+1 Rk+1 , where the upper-triangular matrix
Rk+1 can be partitioned into
⎡ ⎤
r0 
Rk+1 = ⎣ Ĥk ⎦ ,
0

where Ĥk ∈ R(k+1)×k is an augmented upper-Hessenberg matrix. By considering


the modified version MGS of the Gram-Schmidt process, the Arnoldi process is then
expressed by Algorithm 9.3.

Algorithm 9.3 Arnoldi procedure.


Input: A ∈ Rn×n , r0 ∈ Rn and m > 0.
Output: V = [v1 , · · · , vm+1 ] ∈ Rn×(m+1) , Ĥk = (μi j ) ∈ R(m+1)×m .
1: v1 = r0 /r0  ;
2: for k = 1 : m, do
3: w = Avk ;
4: for j = 1 : k, do
5: μ j,k = vj w ;
6: w = w − μ j,k v j ;
7: end
8: μk+1,k = w; vk+1 = w/μk+1,k ;
9: end

We can now claim the main result:

1 Given two sets A and B , we define A \B = {x ∈ A | x ∈


/ B }.
9.3 Polynomial Methods 301

Theorem 9.7 (Arnoldi identities) The matrices Vm+1 and Ĥm satisfy the following
relation:

AVm = Vm+1 Ĥm , (9.58)

that can also be expressed by



AVm = Vm Hm + μm+1,m vm+1 em , (9.59)

where
• em ∈ Rm is the last canonical vector of Rm ,
• Hm ∈ Rm×m is the matrix obtained by deleting the last row of Ĥm .
The upper Hessenberg matrix Hm corresponds to the projection of the restriction of
the operator A on the Krylov subspace with basis Vm :

Hm = Vm AVm . (9.60)

The Arnoldi process involves m sparse matrix-vector multiplications and requires


O(m 2 n) arithmetic operations for the orthogonalization process. Note that we need
to store the sparse matrix A and the columns of Vm+1 .
Lack of Efficiency in Parallel Implementation
A straightforward parallel implementation consists of creating parallel procedures
for v → Av and the BLAS-1 kernels as well (combinations of _DOT and _AXPY).
As outlined in Sect. 2.1, inner products impede achieving high efficiency on parallel
computing platforms. It is not possible to replace the Modified Gram-Schmidt MGS
procedure by its classical counterpart CGS, not only due to the lack of numerical
stability of the latter, but also because of a dependency in the computation of vk+1
which requires availability of vk .
The situation would be much better if a basis of Kk (A, r0 ) were available before
the orthogonalization process. This is addressed below.
Nonorthogonal Bases for a Krylov Subspace
Consider Z m+1 = [z 1 , . . . , z m+1 ] ∈ Rn×m+1 such that {z 1 , . . . , z k } is a basis of
Kk (A, r0 ) for any k ≤ m + 1, with α0 z 1 = r0 , for α0 = 0. A natural model of
recursion which builds bases of this type is given by,

αk z k+1 = (A − βk I )z k − γk z k−1 , k ≥ 1, (9.61)

where αk = 0, βk , and γk are scalars with γ1 = 0 (and z 0 = 0). For instance,


the canonical basis of Km+1 (A, r0 ) corresponds to the special choice: αk = 1,
βk = 0, and γk = 0 for any k ≥ 1. Conversely, such a recursion insures that, for
1 ≤ k ≤ m + 1, z k = pk−1 (A)r0 , where pk−1 is a polynomial of degree k − 1.
302 9 Iterative Schemes for Large Linear Systems

By defining the augmented tridiagonal matrix


⎡ ⎤
β1 γ2
⎢ α1 β2 γ3 ⎥
⎢ ⎥
⎢ α2 β3 γ4 ⎥
⎢ ⎥
⎢ . . . . ⎥

T̂m = ⎢ α3 . . ⎥ ∈ R(m+1)×m , (9.62)

⎢ .. ⎥
⎢ . β γ ⎥
⎢ m−1 m ⎥
⎣ αm−1 βm ⎦
αm

the relations (9.61), can be expressed as

AZ m = Z m+1 T̂m . (9.63)

Given the QR factorization,

Z m+1 = Wm+1 Rm+1 , (9.64)

where the matrix Wm+1 ∈ Rn×(m+1) has orthonormal columns with Wm+1 e1 =
r0 /r0 , and where Rm+1 ∈ R(m+1)×(m+1) is upper triangular, then Z m = Wm Rm ,
where Rm is the leading m × m principal submatrix of Rm+1 , and Wm consists of the
first m columns of Wm+1 . Consequently, Z m = Wm Rm is an orthogonal factorization.
Substituting the QR factorizations of Z m+1 and Z m into (9.63) yields
−1
AWm = Wm+1 Ĝ m , Ĝ m = Rm+1 T̂m Rm , (9.65)

where we note that the (m + 1) × m matrix Ĝ m is an augmented upper Hessen-


berg matrix. We are interested in relating the matrices Wm+1 and Ĝ m above to the
corresponding matrices Vm+1 and Ĥm determined by the Arnoldi process.

Proposition 9.2 Assume that m steps of the Arnoldi process can be applied to the
matrix A ∈ Rn×n with an initial vector r0 ∈ Rn without breakdown yielding the
decomposition (9.58). Let
AWm = Wm+1 Ĝ m

be another decomposition, such that Wm+1 ∈ Rn×(m+1) has orthonormal columns,


Wm e1 = Vm e1 , and Ĝ m ∈ R(m+1)×m is an augmented upper Hessenberg matrix
with nonvanishing subdiagonal entries. Then Wm+1 = Vm+1 Dm+1 and Ĝ m =
−1
Dm+1 Ĥm Dm , where Dm+1 ∈ C(m+1)×(m+1) is diagonal with ±1 diagonal entries,
and Dm is the leading m ×m principal submatrix of Dm+1 . In particular, the matrices
Ĥm and Ĝ m are orthogonally equivalent, i.e., they have the same singular values. The
matrices Hm and G m obtained by removing the last row of Ĥm and Ĝ m , respectively,
are also orthogonally similar, i.e., they have the same eigenvalues.
9.3 Polynomial Methods 303

Proof The proposition is a consequence of the Implicit Q Theorem (e.g. see [16]).

Suitability for Parallel Implementation


In implementing the Arnoldi process through a non-orthonormal Krylov basis, the
two critical computational steps that need to be implemented as efficiently as possible
are those expressed in (9.63) and (9.64):
(a) Step 1—The Krylov basis generation in (9.63) consists of two levels of com-
putations. The first involves sparse matrix-vector multiplication while the second
involves the tridiagonalization process through a three-term recurrence. By an ade-
quate memory allocation of the sparse matrix A, the computational kernel v −→ A v
can usually be efficiently implemented on a parallel architecture (see Sect. 2.4.1). This
procedure, however, involves a global sum (see Algorithms 2.9 and 2.10) which, in
addition to the three-term recurrence limit parallel scalability. To avoid such lim-
itations, it is necessary to consider a second level of parallelism at the three-term
recursion level which must be pipelined. The pipelining pattern depends on the
matrix sparsity structure.
(b) Step 2—The QR factorization in (9.64) can be implemented on a parallel
architecture as outlined in Sect. 7.6 since the basis dimension m is much smaller than
the matrix size, i.e. m  n).
The combination of these two steps (a) and (b) leads to an implementation which
is fairly scalable unlike a straightforward parallel implementation of the Arnoldi
procedure. This is illustrated in [32, 33] where the matrix consists of overlapped
sparse diagonal blocks. In this case, it is shown that the number of interprocessor
communications is linear with respect to the number of diagonal blocks. This is not
true for the classical Arnoldi approach which involves inner products. An example
is also given in the GPREMS procedure, e.g. see [34] and Sect. 10.3. Unfortunately,
there is a limit on the size m of the basis generated this way as we explain below.
Ill-Conditioned Bases
From the relation (9.65), the regular Arnoldi relation (9.58) can be recovered by
considering the Hessenberg matrix Ĝ m . Clearly this computation can be suspect
when Rm+1 is ill-conditioned. If we define the condition number of Z m+1 as,

max Z m+1 y
y=1
cond(Z m+1 ) = , (9.66)
min Z m+1 y
y=1

then, when the basis Z m+1 is ill-conditioned the recovery fails since cond(Z m+1 ) =
cond(Rm+1 ). This is exactly the situation with the canonical basis. In the next section,
we propose techniques for choosing the scalars αk , βk , and γk in (9.62) in order to
limit the growth of the condition number of Z k as k increases.
Limiting the Growth of the Condition Number of Krylov Subspace Bases
The goal is therefore to define more appropriate recurrences of the type (9.61). We
explore two options: (i) using orthogonal polynomials, and (ii) defining a sequence
304 9 Iterative Schemes for Large Linear Systems

of shifts in the recursion (9.61). A comparison of the two approaches with respect
to controling the condition number is given in [35]. Specifically, both options are
compared by examining their effect on the convergence speed of GMRES.
The algorithms corresponding to these options require some knowledge of the
spectrum of the matrix A, i.e. Λ(A). Note that, Λ(A) need not be determined to
high accuracy. Applying, Algorithm 9.3 with a small basis dimension m 0 ≤ m, the
convex hull of Λ(A) is estimated by the eigenvalues of the Hessenberg matrix Hm 0 .
( j)
Since the Krylov bases are repeatedly built from a sequence of initial vectors r0 ,
the whole computation is not made much more expensive by such an estimation of
Λ(A). Moreover, at each restart, the convex hull may be updated by considering the
convex hull of the union of the previous estimates and the Ritz values obtained from
the last basis generation.
In the remainder of this section, we assume that A ∈ Cn×n . We shall indicate how
to maintain real arithmetic operations when the matrix A is real.
Chebyshev Polynomial Bases
The short generation recurrences of Chebyshev polynomials of the first kind are
ideal for our objective of creating Krylov subspace bases. As outlined before, such
recurrence is given by

Tk (t) := cosh(k cosh−1 (t)), k ≥ 0, |t| ≥ 1. (9.67)

Generating such Krylov bases has been discussed in several papers, e.g. see [35–37].
In what follows, we adopt the presentation given in [35]. Introducing the family of
ellipses in C,
 
E (ρ) := eiθ + ρ −2 e−iθ : −π < θ ≤ π , ρ ≥ 1,

where E (ρ) has foci at ±c with c = 2ρ −1 , with semi-major and semi-minor axes
given, respectively, by α = 1 + ρ12 and β = 1 − ρ12 , in which

α+β
ρ= . (9.68)
c

When ρ grows from 1 to ∞, the ellipse E (ρ) evolves from the interval [−2, +2] to
the unit circle.
Consider the scaled Chebyshev polynomials

(ρ) 1 ρ
Ck (z) := Tk ( z), k = 0, 1, 2, . . . .
ρk 2

then rom the fact that


   1  
1 −1 −iθ
Tk ρe + ρ e

= ρ k eikθ + ρ −k e−ikθ ,
2 2
9.3 Polynomial Methods 305

and
(ρ) 1  ikθ 
Ck (eiθ + ρ −2 e−iθ ) = e + ρ −2k e−ikθ , (9.69)
2
(ρ)
it follows that when z ∈ E (ρ) we have Ck (z) ∈ 21 E (ρ ) . Consequently, it can be
2k

(ρ)
proved that, for any ρ, the scaled Chebyshev polynomials (Ck ) are quite well con-
ditioned on E (ρ) for the uniform norm [35, Theorem 3.1]. This result is independent
of translation, rotation, and scaling of the ellipse, provided that the standard Cheby-
shev polynomials Tk are translated, rotated, and scaled accordingly. Let E (c1 , c2 , τ )
denote the ellipse with foci c1 and c2 and semi-major axis of length τ . This ellipse
can then be mapped by a similarity transformation φ(z) = μz + ν onto the ellipse
E (ρ) = E (− 2ρ1
, 2ρ
1
, 1 + ρ12 ) with a suitable value of ρ ≥ 1 by translation, rotation,
and scaling:

m = c1 +c
2 , is the center point of E (c1 , c2 , τ ),
2
⎢c |c2 −c1 |
= √ 2 , is the semi-focal distance of E (c1 , c2 , τ ),


⎢s = τ 2 − c2 , is the semi-minor axis of E (c1 , c2 , τ ),
⎢ (9.70)
⎢ρ = τ +s
c ,

⎣μ = ρ(c22−m) ,
ν = −μm.

The three last expressions are obtained from (9.68) and from the equalities φ(m) = 0
and φ(c2 ) = ρ2 .
It is now easy to define polynomials Sk (z) which are well conditioned on the
ellipse E (c1 , c2 , τ ), by defining for k ≥ 0

(ρ)
Sk (z) = Ck (φ(z)). (9.71)

Before using the polynomials Sk to generate Krylov subspace bases, however, we


outline first the recurrence which generates them, e.g. see, [38]):

⎨ T0 (z) = 1,
T1 (z) = z, (9.72)

Tk+1 (z) = 2z Tk (z) − Tk−1 (z), for k ≥ 1.

Since
(ρ)
Sk+1 (z) = Ck+1 (φ(z))
1 ρ
= k+1 Tk+1 ( φ(z))
ρ 2
1  ρ ρ 
= k+1 ρφ(z) Tk ( φ(z)) − Tk−1 ( φ(z)) ,
ρ 2 2
306 9 Iterative Schemes for Large Linear Systems

we obtain the following recurrence for S j (z),



⎨ S0 (z) = 1,
S1 (z) = 21 φ(z), (9.73)
⎩ S (z) = φ(z) Sk (z) − 1
Sk−1 (z), for k ≥ 1,
k+1 ρ2

where ρ and φ(z) = μz + ν are obtained from (9.70).


Let,

Z m+1 = [r0 , S1 (A)r0 , . . . , Sm (A)r0 ] ∈ Cn×(m+1) (9.74)

form a basis for the Krylov subspace Km+1 (A, r0 ), with m being a small integer.
Then the matrix T̂m , which satisfies (9.63), is given by,
⎛ ⎞
− 2ν
μ μ
2
⎜ 1
−μν 1 ⎟
⎜ μρ 2 μ ⎟
⎜ ⎟
⎜ 1
− μν 1 ⎟
⎜ μρ 2 μ ⎟
⎜ .. .. ⎟
⎜ . . ⎟
⎟ ∈ C(m+1)×m .
1
T̂m = ⎜ μρ 2 (9.75)
⎜ ⎟
⎜ .. ⎟
⎜ . − μν μ1 ⎟
⎜ ⎟
⎜ 1
− μν ⎟
⎝ μρ 2 ⎠
1
μρ 2

The remaining question concerns how the ellipse is selected for a given operator
A ∈ Cn×n . By extension of the special situation when A is normal, one selects the
smallest ellipse which contains the spectrum Λ(A) of A, see [35] and Sect. 9.3.2.
A straightforward application of the technique presented here is described in
Algorithm 9.4.

Algorithm 9.4 Chebyshev-Krylov procedure.


Input: A ∈ Cn×n , m > 1, r0 ∈ Cn , c1 , c2 , and τ > 0 such that the ellipse E (c1 , c2 , τ ) ⊃ Λ(A).
Output: Z m+1 ∈ Cn×(m+1) basis of Km+1 (A, r0 ), T̂m ∈ C(m+1)×m tridiagonal matrix such that
AZ m = Z m+1 T̂m .
1: Compute ρ, μ, and ν by (9.70) ;
2: Generate the Krylov subspace basis matrix Z m+1 using the recurrence (9.73).
3: Build the matrix T̂m ; see (9.75).

When A and r0 are real (A ∈ Rn×n and r0 ∈ Rn ), the computations in Algo-


rithm 9.4 are real if the ellipse is correctly chosen: since the spectrum Λ(A) is sym-
metric with respect to the real axis, so must be the chosen ellipse. By considering
only ellipses where the major axis is the real axis, the foci as well as the function φ
are real.
9.3 Polynomial Methods 307

In order to avoid the possibility of overflow or underflow, the columns (z 1 , . . . ,


−1
z m+1 ) of Z m+1 are normalized to obtain the new basis Wm+1 = Z m+1 Dm+1 where
Dm+1 = diag(z 1 , . . . , z m+1 ) is diagonal. The recurrence (9.73) must then be
modified accordingly, with the new basis Wm+1 satisfying the relation

A Wm = Wm+1 Ťm , (9.76)

−1 .
in which Ťm = Dm+1 T̂m Dm
Newton Polynomial Bases
This approach has been developed in [39] from an earlier result [40], and its imple-
mentation on parallel architectures is given in [32]. More recently, an improvement
of this approach has been proposed in [35].
Recalling the notations of Sect. 9.3.2, we consider the recurrence (9.61) with the
additional condition that γk = 0 for k ≥ 1. This condition reduces the matrix T̂m
(introduced in (9.62)) to the bidiagonal form. Denoting this bidiagonal matrix by
B̂m , the corresponding polynomials pk (z) can then be generated by the following
recurrence

pk (z) = ηk (z − βk ) pk−1 (z), k = 1, 2, . . . , m, (9.77)

where z k = pk−1 (A)r0 , with p0 = 1, ηk = α1k is a scaling factor, and βk is a zero


of the polynomial pk (z). The corresponding basis Z m+1 is called a scaled Newton
polynomial basis and is built from z 1 = α0 r0 for any nonzero α0 , by the recurrence

z k+1 = ηk (A − βk I )z k , k = 1, 2, . . . , m. (9.78)

Therefore

A Z m = Z m+1 B̂m , (9.79)

with
⎛ ⎞
β1
⎜ α1 β2 ⎟
⎜ ⎟
⎜ α2 β3 ⎟
⎜ ⎟
⎜ . . ⎟

B̂m = ⎜ α3 . ⎟ ∈ C(m+1)×m . (9.80)

⎜ .. ⎟
⎜ . β ⎟
⎜ m−1 ⎟
⎝ αm−1 βm ⎠
αm

The objective here is to mimic the characteristic polynomial of A with a special


order for enumerating the eigenvalues of A.
308 9 Iterative Schemes for Large Linear Systems

Definition 9.5 (Leja points) Let S be a compact set in C, such that (C ∪ {∞})\S
is connected and possesses a Green’s function. Let ζ1 ∈ S be arbitrary and let ζ j
for j = 2, 3, 4, . . . , satisfy

"
k "
k
|ζk+1 − ζ j | = max |z − ζ j |, ζk+1 ∈ S , k = 1, 2, 3, . . . . (9.81)
z∈S
j=1 j=1

Any sequence of points ζ1 , ζ2 , ζ3 , . . . satisfying (9.81) is said to be a sequence of


Leja points for S .

In [39], the set S is chosen to be Λ(Hm ), the spectrum of the upper Hessenberg
matrix Hm generated in Algorithm 9.3. These eigenvalues of Λ(Hm ) are the Ritz
values of A corresponding to the Krylov subspace Km (A, r0 ). They are sorted with
respect to the Leja ordering, i.e., they are ordered to satisfy (9.81) with S = Λ(Hm ),
and are used as the nodes βk in the Newton polynomials (9.78).
In [35], this idea is extended to handle more elaborate convex sets S containing
the Ritz values. This allows starting the process with a modest integer m 0 to determine
the convex hull of Λ(Hm 0 ). From this set, an infinite Leja sequence for which S =
Λ(Hm ). Note that the sequence has at most m terms. Moreover, when an algorithm
builds a sequence of Krylov subspaces, the compact set S can be updated by S :=
co (S ∪ Λ(Hm )) at every restart.
The general pattern of a Newton-Krylov procedure is given by Algorithm 9.5.

Algorithm 9.5 Newton-Krylov procedure.


Input: A ∈ Cn×n , m > 1, r0 ∈ Cn , a set S ⊃ Λ(A).
Output: Z m+1 ∈ Cn×(m+1) basis of Km+1 (A, r0 ), B̂m ∈ C(m+1)×m bidiagonal matrix such that
AZ m = Z m+1 B̂m .
1: From S , build a Leja ordered sequence of points {βk }k=1,··· ,m ;
2: for k = 1 : m, do
3: w = A z k − βk+1 z k ;
4: αk+1 = 1/w ; z k+1 = αk+1 w ;
5: end
6: Build the matrix B̂m (9.80).

When the matrix A is real, we can choose the set S to be symmetric with respect
to the real axis. In such a situation, the nonreal shifts βk are supposed to appear in
conjugate complex pairs. The Leja ordering is adapted to keep consecutive elements
of the same pair. Under this assumption, the recurrence (which appears in Algo-
rithm 9.6) involves real operations. The above bidiagonal matrix B̂m becomes the
tridiagonal matrix T̂m since at each conjugate pair of shifts an entry appears on the
superdiagonal.
References 309

Algorithm 9.6 Real Newton-Krylov procedure.


Input: A ∈ Rn×n , m > 1, r0 ∈ Rn , a set S ⊃ Λ(A).
Output: Z m+1 ∈ Rn×(m+1) basis of Km+1 (A, r0 ), T̂m ∈ R(m+1)×m tridiagonal matrix such that
AZ m = Z m+1 T̂m .
1: From S , build a Leja ordered sequence of points {βk }k=1,··· ,m ;
//conjugate values appear consecutively, the first value of the pair is the one with positive
imaginary part.
2: for k = 1 : m, do
3: if Im(βk ) == 0 then
4: z k+1 = A z k − βk z k ;
5: ηk+1 = w ; z k+1 = w/ηk+1 ;
6: else
7: if Im(βk ) > 0 then
8: w(1) = A z k − Re(βk )z k ;
9: w(2) = A w(1) − Re(βk )w(1) + Im(βk )2 z k ;
10: αk+1 = 1/w(1)  ; z k+1 = αk+1 w(1) ;
11: αk+2 = 1/w(2)  ; z k+2 = αk+1 w(2) ;
12: end if
13: end if
14: end
15: Build the matrix T̂m .

References

1. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings
of 7th International Conference World Wide Web, pp. 107–117. Elsevier Science Publishers
B.V, Brisbane (1998)
2. Langville, A., Meyer, C.: Google’s PageRank and Beyond: The Science of Search Engine
Rankings. Princeton University Press, Princeton (2006)
3. Gleich, D., Zhukov, L., Berkhin., P.: Fast parallel PageRank: a linear system approach. Technical
report, Yahoo Corporate (2004)
4. Gleich, D., Gray, A., Greif, C., Lau, T.: An inner-outer iteration for computing PageRank.
SIAM J. Sci. Comput. 32(1), 349–371 (2010)
5. Bahi, J., Contassot-Vivier, S., Couturier, R.: Parallel Iterative Algorithms. Chapman &
Hall/CRC, Boca Raton (2008)
6. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation. Prentice Hall, Engle-
wood Cliffs (1989)
7. Kollias, G., Gallopoulos, E., Szyld, D.: Asynchronous iterative computations with web infor-
mation retrieval structures: The PageRank case. In: PARCO, pp. 309–316 (2005)
8. Ishii, H., Tempo, R.: Distributed randomized algorithms for the PageRank computation. IEEE
Trans. Autom. Control 55(9), 1987–2002 (2010)
9. Kalnay, E., Takacs, L.: A simple atmospheric model on the sphere with 100% parallelism.
Advances in Computer Methods for Partial Differential Equations IV (1981). http://ntrs.nasa.
gov/archive/nasa/casi.ntrs.nasa.gov/19820017675.pdf. Also published as NASA Technical
Memorandum No. 83907, Laboratory for Atmospheric Sciences, Research Review 1980–1981,
pp. 89–95, Goddard Sapce Flight Center, Maryland, December 1981
10. Gallopoulos, E.: Fluid dynamics modeling. In: Potter, J.L. (ed.) The Massively Parallel Proces-
sor, pp. 85–101. The MIT Press, Cambridge (1985)
11. Gallopoulos, E., McEwan, S.D.: Numerical experiments with the massively parallel processor.
In: Proceedings of the 1983 International Conference on Parallel Processing, August 1983,
pp. 29–35 (1983)
310 9 Iterative Schemes for Large Linear Systems

12. Potter, J. (ed.): The Massively Parallel Processor. MIT Press, Cambridge (1985)
13. Varga, R.S.: Matrix Iterative Analysis. Prentice Hall Inc., Englewood Cliffs (1962)
14. Wachspress, E.L.: Iterative Solution of Elliptic Systems. Prentice-Hall Inc., Englewood Cliffs
(1966)
15. Young, D.: Iterative Solution of Large Linear Systems. Academic Press, New York (1971)
16. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
17. Golub, G.H., Varga, R.S.: Chebychev semi-iterative methods, successive overrelaxation itera-
tive methods, and second order Richardson iterative methods: part I. Numer. Math. 3, 147–156
(1961)
18. O’Leary, D., White, R.: Multi-splittings of matrices and parallel solution of linear systems.
SIAM J. Algebra Discret. Method 6, 630–640 (1985)
19. Neumann, M., Plemmons, R.: Convergence of parallel multisplitting iterative methods for
M-matrices. Linear Algebra Appl. 88–89, 559–573 (1987)
20. Szyld, D.B., Jones, M.T.: Two-stage and multisplitting methods for the parallel solution of
linear systems. SIAM J. Matrix Anal. Appl. 13, 671–679 (1992)
21. Hageman, L., Young, D.: Applied Iterative Methods. Academic Press, New York (1981)
22. Keller, H.: On the solution of singular and semidefinite linear systems by iteration. J. Soc.
Indus. Appl. Math. 2(2), 281–290 (1965)
23. Stiefel, E.L.: Kernel polynomials in linear algebra and their numerical approximations. U.S.
Natl. Bur. Stand. Appl. Math. Ser. 49, 1–22 (1958)
24. Saad, Y., Sameh, A., Saylor, P.: Solving elliptic difference equations on a linear array of
processors. SIAM J. Sci. Stat. Comput. 6(4), 1049–1063 (1985)
25. Rutishauser, H.: Refined iterative methods for computation of the solution and the eigenvalues
of self-adjoint boundary value problems. In: Engli, M., Ginsburg, T., Rutishauser, H., Seidel,
E. (eds.) Theory of Gradient Methods. Springer (1959)
26. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
27. Dongarra, J., Duff, I., Sorensen, D., van der Vorst, H.: Numerical Linear Algebra for High-
Performance Computers. SIAM, Philadelphia (1998)
28. Meurant, G.: Computer Solution of Large Linear Systems. Studies in Mathematics and its
Applications. Elsevier Science (1999). http://books.google.fr/books?id=fSqfb5a3WrwC
29. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
30. van der Vorst, H.A.: Iterative Krylov Methods for Large Linear Systems. Cambridge University
Press, Cambridge (2003). http://dx.doi.org/10.1017/CBO9780511615115
31. Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving non-
symmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)
32. Sidje, R.B.: Alternatives for parallel Krylov subspace basis computation. Numer. Linear Alge-
bra Appl. 305–331 (1997)
33. Nuentsa Wakam, D., Erhel, J.: Parallelism and robustness in GMRES with the Newton basis
and the deflated restarting. Electron. Trans. Linear Algebra (ETNA) 40, 381–406 (2013)
34. Nuentsa Wakam, D., Atenekeng-Kahou, G.A.: Parallel GMRES with a multiplicative Schwarz
preconditioner. J. ARIMA 14, 81–88 (2010)
35. Philippe, B., Reichel, L.: On the generation of Krylov subspace bases. Appl. Numer. Math.
(APNUM) 62(9), 1171–1186 (2012)
36. Joubert, W.D., Carey, G.F.: Parallelizable restarted iterative methods for nonsymmetric linear
systems. Part I: theory. Int. J. Comput. Math. 44, 243–267 (1992)
37. Joubert, W.D., Carey, G.F.: Parallelizable restarted iterative methods for nonsymmetric linear
systems. Part II: parallel implementation. Int. J. Comput. Math. 44, 269–290 (1992)
38. Parlett, B.: The Symmetric Eigenvalue Problem. SIAM, Philadelphia (1998)
39. Bai, Z., Hu, D., Reichel, L.: A Newton basis GMRES implementation. IMA J. Numer. Anal.
14, 563–581 (1994)
40. Reichel, L.: Newton interpolation at Leja points. BIT 30, 332–346 (1990)
Chapter 10
Preconditioners

In order to make iterative methods effective or even convergent it is frequently nec-


essary to combine them with an appropriate preconditioning scheme; cf. [1–5]. A
large number of preconditioning techniques have been proposed in the literature and
have been implemented on parallel processing systems. We described earlier in this
book methods for solving systems involving banded preconditioners, for example
the Spike algorithm discussed in Chap. 5.
In this chapter we consider some preconditioners that are particularly amenable
to parallel implementation. If the coefficient matrix A is available explicitly, in solv-
ing the linear system Ax = f , then we seek to partition it, either before or after
reordering depending on its structure, into two or more submatrices that can be
handled simultaneously in parallel. For example, A can be partitioned into several
(non-overlapped or overlapped) block rows in order to solve the linear system via an
accelerated block row projection scheme, see Sect. 10.2. If the reordered matrix A
were to result in a banded or block-tridiagonal matrix, then the block row projection
method will exhibit two or more levels of parallelism. Alternatively, let A be such
that it can be rewritten, after reordering, as the sum of a “generalized” banded matrix
M whose Frobenius norm is almost equal to that of A, and a general sparse matrix E
of a much lower rank than that of A and containing a much lower number of nonzeros
than A. Then this matrix M that encapsulates as many of the nonzeros as possible
may be used as a preconditioner. Hence, in the context of Krylov subspace methods,
in each outer iteration for solving Ax = f , there is a need to solve systems of the
for M z = r , involving the preconditioner M. Often, M can be represented as over-
lapped diagonal blocks to maximize the number of nonzero elements encapsulated
by M. Solving systems with such coefficient matrices M on parallel architectures
can be realized by: (i) Tearing—that we first encountered in Sect. 5.4, or (ii) via a
multiplicative Schwarz approach. Preconditioners of this type are discussed in the
following Sects. 10.1 and 10.3.

© Springer Science+Business Media Dordrecht 2016 311


E. Gallopoulos et al., Parallelism in Matrix Computations,
Scientific Computation, DOI 10.1007/978-94-017-7188-7_10
312 10 Preconditioners

10.1 A Tearing-Based Solver for Generalized Banded


Preconditioners

The tearing-based solver described in Sect. 5.4 is applicable for handling systems
involving such sparse preconditioner M. The only exception here is that each linear
system corresponding to an overlapped diagonal block M j can be solved using a
sparse direct solver. Such a direct solver, however, should have the following capa-
bility: given a system M j Y j = G j in which G j has only nonzero elements at the top
and bottom m rows, then the solver should be capable of obtaining the top and bottom
m rows of the solution Y j much faster than computing all of Y j . The sparse direct
solver PARDISO possesses such a feature [6]. This will allow solving the balance
system, and hence the original system, much faster and with a high degree of parallel
scalability.

10.2 Row Projection Methods for Large Nonsymmetric


Linear Systems

Most sparse nonsymmetric linear system solvers either require: (i) storage and
computation that grow excessively as the number of iterations increases, (ii) spe-
cial spectral properties of the coefficient matrix A to assure convergence or, (iii) a
symmetrization process that could result in potentially disastrously ill-conditioned
problems. One group of methods which avoids these difficulties is accelerated row
projection (RP) algorithms. This class of sparse system solvers start by partitioning
the coefficient matrix A of the linear system Ax = f of order n, into m block rows:

A = (A1 , A2 , . . . , Am ), (10.1)

and partition the vector f accordingly. A row projection (RP) method is any
algorithm which requires the computation of the orthogonal projections Pi x =
Ai (Ai Ai )−1 Ai x of a vector x onto R(Ai ), i = 1, 2, . . . , m. Note that the nonsin-
gularity of A implies that Ai has full column rank and so (Ai Ai )−1 exists.
In this section we present two such methods, bearing the names of their inventors,
and describe their properties. The first (Kaczmarz) has an iteration matrix formed as
the product of orthogonal projectors, while the second RP method (Cimmino) has an
iteration matrix formed as the sum of orthogonal projectors. Conjugate gradient (CG)
acceleration is used for both. Most importantly, we show the underlying relationship
between RP methods and the CG scheme applied to the normal equations. This, in
turn, provides an explanation for the behavior of RP methods, a basis for comparing
them, and a guide for their effective use.
Possibly the most important implementation issue for RP methods is that of choos-
ing the row partitioning which defines the projectors. An approach for banded systems
yields scalable parallel algorithms that require only a few extra vectors of storage,
10.2 Row Projection Methods for Large Nonsymmetric Linear Systems 313

and allows for accurate computations involving the applications of the necessary pro-
jections. Numerous numerical experiments show that these algorithms have superior
robustness and can be quite competitive with other solvers of sparse nonsymmetric
linear systems.
RP schemes have also been attracting the attention of researchers because of their
robustness in solving overdetermined and difficult systems that appear in applications
and because of the many interesting variations in their implementation, including
parallel asynchronous versions; see e.g. [7–12].

10.2.1 The Kaczmarz Scheme

As the name suggests, a projection method can be considered as a method of solution


which involves the projection of a vector onto a subspace. This method, which was
first proposed by Kaczmarz considers each equation as a hyperplane; (i.e. partition
using m = n); thus reducing the problem of finding the solution of a set of equations
to the equivalent problem of finding the coordinates of the point of intersection of
those hyperplanes. The method of solution is to project an initial iterate onto the first
hyperplane, project the resulting point onto the second, and continue the projections
on the hyperplanes in a cyclic order thus approaching the solution more and more
closely with each projection step [13–16].

Algorithm 10.1 Kaczmarz method (classical version)


1: Choose x0 ; r0 = f − Ax0 ; set k = 0. ρik
2: do i = 1 : n,
rki
3: αki = ai 22
;
4: xk = xk + αki ai ;
5: rk = rk − αki A ai ;
6: end
7: If a convergence criterion is satisfied, terminate the iterations; else set k = k +1 and go to Step 2.

Here ρik is the ith component of the residual rk = f − Axk and ai is the ith
row of the matrix A. For each k, Step 2 consists of n projections, one for each row
of A. Kaczmarz’s method was recognized as a projection in [17, 18]. It converges
for any system of linear equations with nonzero rows, even when it is singular and
inconsistent, e.g. see [19, 20], as well as [21–23]. The method has been used under
the name (unconstrained) ART (Algebraic Reconstruction Techniques) in the area
of image reconstruction [24–28].
The idea of projection methods was further generalized by in [29, 30] to include
the method of steepest descent, Gauss-Seidel, and other relaxation schemes.
In order to illustrate this idea, let the error and the residual at the kth step be
defined, respectively, as
314 10 Preconditioners

δxk = x − xk , (10.2)

and

rk = f − Axk . (10.3)

Then a method of projection is one in which at each step k, the error δxk is resolved
into two components, one of which is required to lie in a subspace selected at that
step, and the other is δxk+1 , which is required to be less than δxk in some norm. The
subspace is selected by choosing a matrix Yk whose columns are linearly independent.
Equivalently,

δxk+1 = δxk − Yk u k (10.4)

where u k is a vector (or a scalar if Yk has one column) to be selected at the kth step
so that

δxk+1  < δxk  (10.5)

where  ·  is some vector norm. The method of projection depends on the choice of
the matrix Yk in (10.4) and the vector norm in (10.5).
If we consider ellipsoidal norms, i.e.

s2 = s  Gs, (10.6)

where G is a positive definite matrix, then u k is selected such that δxk+1  is mini-
mized yielding,

u k = (Yk GYk )−1 Yk Gδxk (10.7)

and

δxk 2 − δxk+1 2 = δxk GYk u k . (10.8)

Since we do not know δxk , the process can be made feasible if we require that,

Yk G = Vk A (10.9)

in which the matrix Vk will have to be determined at each iteration. This will allow
u k to be expressed in terms of rk ,

uk = (Ykᵀ G Yk)⁻¹ Vk rk.        (10.10)



The 2-Partitions Case


Various choices of G in (10.6) give rise to various projection methods. In this section,
we consider the case when

G = I. (10.11)

This gives from (10.7) and (10.9)

uk = (Vk A Aᵀ Vkᵀ)⁻¹ Vk A δxk        (10.12)

and

xk+1 = xk + Aᵀ Vkᵀ (Vk A Aᵀ Vkᵀ)⁻¹ Vk rk.        (10.13)

Now let n be even, and let A be partitioned as

A = [ A1ᵀ ; A2ᵀ ]        (10.14)

where Aiᵀ, i = 1, 2, is an n/2 × n matrix. Let

V1 = (In/2, 0),   V2 = (0, In/2),   and   fᵀ = (f1ᵀ, f2ᵀ)        (10.15)

where fi, i = 1, 2, is an n/2 vector.


Then, one iteration of (10.13) can be written as

z1 = xk,    z2 = z1 + A1 (A1ᵀ A1)⁻¹ (f1 − A1ᵀ z1),
z3 = z2 + A2 (A2ᵀ A2)⁻¹ (f2 − A2ᵀ z2),    xk+1 = z3.        (10.16)

This is essentially the same as applying the block Gauss-Seidel method to

A Aᵀ y = f,    Aᵀ y = x.        (10.17)
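As an illustration of (10.16), the sketch below performs one block sweep for m = 2, obtaining each correction through a least-squares solve: np.linalg.lstsq applied to the underdetermined system Aiᵀ d = fi − Aiᵀ z returns its minimum-norm solution, which equals Ai (AiᵀAi)⁻¹ (fi − Aiᵀ z) when Ai has full column rank. The even splitting of the rows is an assumption made only for the example.

import numpy as np

def block_sweep_m2(A, f, x):
    """One sweep of (10.16): project successively onto the solution sets of
    the two block equations A1^T x = f1 and A2^T x = f2."""
    h = A.shape[0] // 2
    A1T, A2T = A[:h, :], A[h:, :]          # the two block rows
    f1, f2 = f[:h], f[h:]
    z = x.copy()
    # minimum-norm correction d solves A1^T d = f1 - A1^T z,
    # i.e. d = A1 (A1^T A1)^{-1} (f1 - A1^T z)
    z += np.linalg.lstsq(A1T, f1 - A1T @ z, rcond=None)[0]
    z += np.linalg.lstsq(A2T, f2 - A2T @ z, rcond=None)[0]
    return z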

Symmetrized m-Partition
Variations of (10.16) include the block-row Jacobi, the block row JOR and the block-
row SOR methods [31–34]. For each of these methods, we also have the correspond-
ing block-column method, obtained by partitioning the matrix in (10.14) by columns
instead of rows. Note that if A had been partitioned into n parts, each part being a
row of A, then (10.16) would be equivalent to the Kaczmarz method. In general, for
m partitions, the method of successive projections yields the iteration

xk+1 = Q u xk + f u = (I − Pm )(I − Pm−1 ) · · · (I − P1 )xk + f u (10.18)



where

f u = fˆm +(I − Pm ) fˆm−1 +(I − Pm )(I − Pm−1 ) fˆm−2 +· · ·+(I − Pm ) · · · (I − P2 ) fˆ1 ,

and fˆi = Ai (Aiᵀ Ai)⁻¹ fi.


While the robustness of (10.18) is remarkable and the iteration converges even
when A is singular or rectangular, as with any linear stationary process, the rate of
convergence is determined by the spectral radius of Q u and can be arbitrarily slow. For
this reason, it is proposed in [31, 35], to symmetrize Q u by following a forward sweep
through the rows with a backward sweep, and introduce an acceleration parameter
to get the iteration
xk+1 = Q(ω)xk + c (10.19)

Here, c is some vector to be defined later, and

Q(ω) = (I − ωP1)(I − ωP2) · · · (I − ωPm)² · · · (I − ωP2)(I − ωP1),        (10.20)

When A is nonsingular and 0 < ω < 2, the eigenvalues of the symmetric matrix
(I − Q(ω)) lie in the interval (0,1] and so the conjugate gradient (CG) may be used
to solve

(I − Q(ω))x = c. (10.21)

Note that iteration, (10.19) is equivalent to that of using the block symmetric suc-
cessive overrelaxation (SSOR) method to solve the linear system

A Aᵀ y = f,
x = Aᵀ y,        (10.22)

in which the blocking is that induced by the row partitioning of A. This gives a simple
expression for the right-hand side c = T (ω) f where,

T(ω) = Aᵀ (D + ωL)⁻ᵀ D (D + ωL)⁻¹,        (10.23)

in which A Aᵀ = L + D + Lᵀ is the usual splitting into strictly block-lower triangular,
block-diagonal, and strictly block-upper triangular parts.
Exploration of the effectiveness of accelerated row projection methods has been
made in [35] for the single row (m = n), and in [14, 36] for the block case (m ≥ 2).
Using sample problems drawn from nonself-adjoint elliptic partial differential equa-
tions, the numerical experiments in [14, 36] examined the issues of suitable block row
partitioning and methods for the evaluation of the actions of the induced projections.
Comparisons with preconditioned Krylov subspace methods, and preconditioned
CG applied to the normal equations show that RP algorithms are more robust but not
necessarily the fastest on uniprocessors.

The first implementation issue is the choice of ω in (10.21). Normally, the ‘opti-
mal’ ω is defined as the ωmin that minimizes the spectral radius of Q(ω). Later, we
show that ωmin = 1 for the case in which A is partitioned into two block rows, i.e.,
m = 2, see also [14]. This is no longer true for m ≥ 3, as can be seen by considering
A = [ 1 0 0 ; 1 1 0 ; 1 0 1 ] = [ A1ᵀ ; A2ᵀ ; A3ᵀ ].

For Q(ω) defined in (10.20), it can be shown that the spectral radii of Q(1), and
Q(0.9) satisfy,

ρ(Q(1)) = (7 + √17)/16 = 0.69519 ± 10⁻⁵,
ρ(Q(0.9)) ≤ 0.68611 ± 10⁻⁵ < ρ(Q(1)).

Hence, ωmin ≠ 1.
However, taking ω = 1 is recommended, as we explain in what follows. First,
however, we state two important facts.
Proposition 10.1 At least rank(A1 ) of the eigenvalues of Q(1) are zero.

Proof (I − P1)x = 0 for x ∈ R(P1) = R(A1). From the definition of Q(ω), N(I − P1) ⊆ N(Q(1)).

Proposition 10.2 When ω = 1 and x0 = 0, A1ᵀ xk = f1 holds in exact arithmetic
for every iteration of (10.19).

Proof Using the definition Pi = Ai (Aiᵀ Ai)⁻¹ Aiᵀ = Ai Ai⁺, (10.23) can be expanded
to show that the ith block column of T(1) is given by

∏_{j=1}^{i−1} (I − Pj) [ I + ∏_{j=i}^{m} (I − Pj) ∏_{j=m−1}^{i+1} (I − Pj) ] (Aiᵀ)⁺.        (10.24)

The first product above should be interpreted as I when i = 1 and the third product
should be interpreted as I when i = m − 1 and 0 when i = m, so that the first
summand in forming T(1) f is

[ I + (I − P1) · · · (I − Pm) · · · (I − P2) ] (A1ᵀ)⁺ f1.

Since A1ᵀ (I − P1) = 0, then A1ᵀ x1 = A1ᵀ (A1ᵀ)⁺ f1 = f1. The succeeding iterates
xk are obtained by adding elements in R(Q(1)) ⊆ R(I − P1) = N(A1ᵀ) to x1.

Taking ω = 1 is recommended for the following three reasons:


• Since CG acceleration is to be applied, the entire distribution of the spectrum must
be considered, not simply the spectral radius. Thus, from Proposition 10.1, when
ω = 1, many of the eigenvalues of the coefficient matrix in (10.21) are exactly equal
to 1. Moreover, since the number of needed CG iterations (in exact arithmetic) is
equal to the number of distinct eigenvalues, this suggests that numerically fewer
iterations are needed as compared to when ω ≠ 1.
• Numerical experience shows that ρ(Q(ω)) is not sensitive to changes in ω. This
matches classical results for the symmetric successive overrelaxation SSOR iter-
ations, which are not as sensitive to minor changes in ω compared to the method
of successive overrelaxation SOR. Hence, the small improvement that does result
from choosing ωmin is more than offset by the introduction of extra nonzero eigen-
values.
• From Proposition 10.2, A1ᵀ xk = f1 is satisfied for all k, and as we will show later,
remains so even after the application of CG acceleration. This feature means that,
when ω = 1, those equations deemed more important than others can be placed
into the first block and kept satisfied to machine precision throughout this iterative
procedure.

Definition 10.1 With the previous notation, for solving the system Ax = f,
the system resulting from the choice Q = Q(1) and T = T(1) is:

(I − Q)x = c,        (10.25)

where Q and T are as given by (10.20) and (10.23), respectively, with ω = 1, and
c = T f . The corresponding solver given by Algorithm 10.2 is referred to as KACZ,
the symmetrized version of the Kaczmarz method.

Algorithm 10.2 KACZ: Kaczmarz method (symmetrized version)


1: c = T f ;
2: Choose x0 ; set k = 0.
3: do i = 1 : m,
4: xk = (I − Pi )xk ;
5: end
6: do i = m − 1 : 1,
7: xk = (I − Pi )xk ;
8: end
9: xk = xk + c;
10: If a convergence criterion is satisfied, terminate the iterations; else set k = k + 1 and go to
Step 3.
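A minimal sketch of the double sweep of Algorithm 10.2 is given below; the projections are evaluated through least-squares solves, and the helper names are illustrative. A full KACZ iteration would add c = T f after the sweep (Step 9) and test for convergence.

import numpy as np

def apply_I_minus_P(AiT, x):
    """Compute (I - P_i) x where P_i = A_i (A_i^T A_i)^{-1} A_i^T,
    using a least-squares solve with A_i = AiT.T."""
    w = np.linalg.lstsq(AiT.T, x, rcond=None)[0]   # A_i w is the projection of x onto R(A_i)
    return x - AiT.T @ w

def kacz_sweep(blocks, x):
    """Steps 3-8 of Algorithm 10.2: forward sweep through the block rows,
    then backward sweep omitting the last block."""
    for AiT in blocks:                  # i = 1 : m
        x = apply_I_minus_P(AiT, x)
    for AiT in reversed(blocks[:-1]):   # i = m-1 : 1
        x = apply_I_minus_P(AiT, x)
    return x

Here blocks is a list of the block rows Aiᵀ; the composition of the two loops applies exactly Q(1) of (10.20).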

10.2.2 The Cimmino Scheme

This RP method can be derived as a preconditioner for the CG algorithm. Premulti-


plying the system Ax = f by

à = (A1 (A −1  −1  −1
1 A1 ) , A2 (A2 A2 ) , . . . , Am (Am Am ) ) (10.26)

we obtain

(P1 + P2 + · · · + Pm )x = Ã f. (10.27)

This system can also be derived as a block Jacobi method applied to the system
(10.22); see [37]. For nonsingular A, this system is symmetric positive definite and
can be solved via the CG algorithm. The advantage of this approach is that the
projections can be computed in parallel and then added.
In 1938, Cimmino [22, 38] first proposed an iteration related to (10.27), and since
then it has been examined by several others [21, 31, 32, 37, 39–42]. Later, we will
show how each individual projection can, for a wide class of problems, be computed
in parallel creating a solver with two levels of parallelism.
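The sketch below applies the Cimmino operator of (10.27); the serial loop stands in for the parallel dispatch of the independent projections, and the helper name is an assumption of the example.

import numpy as np

def cimmino_apply(blocks, x):
    """Apply the Cimmino operator P_1 + ... + P_m to x (cf. (10.27)).
    Each projection P_i x = A_i (A_i^T A_i)^{-1} A_i^T x is independent of
    the others, so in a parallel code each term would be computed on its
    own node and the partial results summed (reduced) at the end."""
    y = np.zeros_like(x)
    for AiT in blocks:                                  # independent tasks
        w = np.linalg.lstsq(AiT.T, x, rcond=None)[0]    # (A_i^T A_i)^{-1} A_i^T x
        y += AiT.T @ w                                  # accumulate P_i x
    return y

For nonsingular A the resulting operator is symmetric positive definite and can be handed directly to any CG routine.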

10.2.3 Connection Between RP Systems and the Normal Equations

Although KACZ can be derived as a block SSOR, and the Cimmino scheme as a block
Jacobi method for (10.22), a more instructive comparison can be made with CGNE—
the conjugate gradient method applied to the normal equations AᵀAx = Aᵀ f.
All three methods consist of the CG method applied to a system with coefficient
matrix W Wᵀ, where W is shown for each of the three methods in Table 10.1.
Intuitively an ill-conditioned matrix A is one in which some linear combination of
rows yields approximately the zero vector. For a block row partitioned matrix near
linear dependence may occur within a block, that is, some linear combination of the
rows within a particular block is approximately zero, or across the blocks, that is, the

Table 10.1 Comparison of system matrices for three row projection methods

Method      W
CGNE        (A1, A2, . . . , Am)
Cimmino     (Q1, Q2, . . . , Qm)
KACZ        (P1, (I − P1)P2, (I − P1)(I − P2)P3, . . . , ∏_{i=1}^{m−1}(I − Pi) Pm)

linear combination must draw on rows from more than one block row Ai . Now let
Ai = Q i Ui be the orthogonal decomposition of Ai in which the columns of Q i are
orthonormal. Examining the matrices W shows that CGNE acts on AᵀA, in which
near linear dependence could occur both from within and across blocks. Cimmino,
however, replaces each Ai with the orthonormal matrix Q i . In other words, Cimmino
avoids forming linear dependence within each block, but remains subject to linear
dependence formed across the blocks.
Similar to Cimmino, KACZ also replaces each Ai with the orthonormal matrix
Q i , but goes a step further since Pi (I − Pi ) = 0.
Several implications follow from this heuristic argument. We make two practical
observations. First, we note that the KACZ system matrix has a more favorable
eigenvalue distribution than that of Cimmino in the sense that KACZ has fewer small
eigenvalues and many more near the maximal one. Similarly, the Cimmino system
matrix is better conditioned than that of CGNE. Second, we note that RP methods
will require fewer iterations for matrices A where the near linear dependence arises
primarily from within a block row rather than across block rows. A third observation
is that one should keep the number of block rows small. The reason is twofold: (i)
partial orthogonalization across blocks in the matrix W of Table 10.1 becomes less
effective as more block rows appear; (ii) the preconditioner becomes less effective
(i.e. the condition number of the preconditioned system increases) when the number
of blocks increases. Further explanation of the benefit of keeping the number of block
rows small is seen from the case in which m = n, i.e. having n blocks. In such a
case, the ability to form a near linear dependence occurs only across rows where the
outer CG acceleration method has to deal with it.

10.2.4 CG Acceleration

Although the CG algorithm can be applied directly to the RP systems, special prop-
erties allow a reduction in the amount of work required by KACZ. CG acceleration
for RP methods was proposed in [35], and considered in [14, 36]. The reason that a
reduction in work is possible, and assuring that A1ᵀ xk = f1 is satisfied in every CG
outer iteration for accelerating KACZ follows from:
Theorem 10.1 Suppose that the CG algorithm is applied to the KACZ system
(10.25). Also, let rk = c − (I − Q)xk be the residual, and dk the search direc-
tion. If x0 = c is chosen as the starting vector, then rk , dk ∈ R(I − P1 ) for all k.
Proof

r0 = c − (I − Q)c = Qc
= (I − P1 )(I − P2 ) · · · (I − Pm ) · · · (I − P2 )(I − P1 )c ∈ R(I − P1 ).

Since d0 = r0 , the same is true for d0 . Suppose now that the theorem holds for step
(k − 1). Then dk−1 = (I − P1 )dk−1 and so

wk ≡ (I − Q)dk−1
= (I − P1 )dk−1 − (I − P1 )(I − P2 ) · · · (I − Pm ) · · · (I − P2 )(I − P1 )dk−1
∈ R(I − P1 ).

Since rk is a linear combination of rk−1 and wk , then rk ∈ R(I − P1 ). Further, since


dk is a linear combination of dk−1 and rk , then dk ∈ R(I − P1 ). The result follows
by induction.

This reduces the requisite number of projections from 2m − 1 to 2m − 2 because the


first multiplication by (I − P1 ) when forming (I − Q)dk can be omitted. Also, using
x0 = c for KACZ keeps the first block of equations satisfied in exact arithmetic.

Corollary 10.1 If x0 = c is chosen in the CG algorithm applied to the KACZ
system, then A1ᵀ xk = f1 for all k ≥ 0.

Proof The proof of Theorem 10.1 shows that c ∈ (A1ᵀ)⁺ f1 + R(I − P1). Since
A1ᵀ (I − P1) = A1ᵀ (I − A1 A1⁺) = 0, A1ᵀ x0 = A1ᵀ (A1ᵀ)⁺ f1 = f1 because A1 has
full column rank. For k > 0, dk−1 ∈ R(I − P1), so

A1ᵀ xk = A1ᵀ (xk−1 + αk dk−1) = A1ᵀ xk−1 = · · · = A1ᵀ x0 = f1.

In summary, CG acceleration for KACZ allows one projection per iteration to be


omitted and one block of equations to be kept exactly satisfied, provided that x0 = c
is used.
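The following sketch shows this CG acceleration started from x0 = c; it assumes that apply_I_minus_Q returns (I − Q)d, for instance d minus the double sweep of the earlier sketch with the leading (I − P1) omitted as allowed by Theorem 10.1. The function name and the stopping test are illustrative.

import numpy as np

def cg_kacz(apply_I_minus_Q, c, tol=1e-8, max_iter=200):
    """CG on (I - Q)x = c (system (10.25)) with starting vector x0 = c."""
    x = c.copy()
    r = c - apply_I_minus_Q(x)
    d = r.copy()
    rho = r @ r
    for _ in range(max_iter):
        if np.sqrt(rho) <= tol:
            break
        w = apply_I_minus_Q(d)          # only 2m - 2 projections are needed here
        alpha = rho / (d @ w)
        x += alpha * d
        r -= alpha * w
        rho_new = r @ r
        d = r + (rho_new / rho) * d
        rho = rho_new
    return x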

10.2.5 The 2-Partitions Case

If the matrix A is partitioned into two block rows (i.e. m = 2), a complete eigen-
analysis of RP methods is possible using the concept of the angles θk between the
two subspaces L i = R(Ai ), i = 1, 2. The definition presented here follows [43], but
for convenience L 1 and L 2 are assumed to have the same dimension. The smallest
angle θ1 ∈ [0, π/2] between L 1 and L 2 is defined by

cos θ1 = max_{u∈L1} max_{v∈L2} uᵀv
subject to ‖u‖ = ‖v‖ = 1.

Let u 1 and v1 be the attainment vectors; then for k = 2, 3, . . . , n/2 the remaining
angles between the two subspaces are defined as

cos θk = max_{u∈L1} max_{v∈L2} uᵀv
subject to ‖u‖ = ‖v‖ = 1,
           uiᵀ u = viᵀ v = 0,   i = 1, 2, . . . , k − 1.

Furthermore, when ui and vj are defined as above, uiᵀ vj = 0 for i ≠ j also holds.


From this, one can obtain the CS decomposition [44, 45], which is stated below in
terms of the projectors Pi .

Theorem 10.2 (CS Decomposition) Let Pi ∈ Rn×n be the orthogonal projectors
onto the subspaces Li, for i = 1, 2. Then there are orthogonal matrices U1 and U2
such that

P1 = U1 [ I 0 ; 0 0 ] U1ᵀ,   P2 = U2 [ I 0 ; 0 0 ] U2ᵀ,   U1ᵀ U2 = [ C −S ; S C ],

C = diag(c1, c2, . . . , cn/2),
S = diag(s1, s2, . . . , sn/2),
I = C² + S²,                                        (10.28)
1 ≥ c1 ≥ c2 ≥ · · · ≥ cn/2 ≥ 0.

In the above theorem ck = cos θk and sk = sin θk , where the angles θk are as defined
above. Now consider the nonsymmetric RP iteration matrix Qu = (I − ωP1)(I − ωP2).
Letting α = 1 − ω, using the above expressions of P1 and P2, we get

Qu = U1 [ α²C  −αS ; αS  C ] U2ᵀ.

Hence,

U2ᵀ Qu U2 = [ α²C² + αS²   (1 − α)C S ; α(1 − α)C S   C² + αS² ].        (10.29)

Since each of the four blocks is diagonal, U2ᵀ Qu U2 has the same eigenvalues as
the scalar 2 × 2 principal submatrices of the permuted matrix (inverse odd-even
permutation on the two sides). The eigenvalues are given by,

(1/2) [ (1 − α)² ci² + 2α ± |1 − α| ci √((1 − α)² ci² + 4α) ],        (10.30)

for i = 1, 2, . . . , n/2. For a given α the modulus of this expression is a maximum


when ci is largest, i.e., when ci = c1 . The spectral radius of Q u can then be found by
taking ci = c1 and choosing the positive sign in (10.30). Doing so and minimizing
with respect to α yields ωmin = 1 − αmin = 2/(1 + s1 ) and a spectral radius of
(1 − s1 )/(1 + s1 ). The same result was derived in [31] using the classical SOR
theory. The benefit of the CS decomposition is that the full spectrum of Q u is given
and not simply only its spectral radius. In particular, when ω = 1, the eigenvalues
of the nonsymmetric RP iteration matrix Qu become {c1², c2², . . . , c²n/2, 0} with the
zero eigenvalue being of multiplicity n/2.



Now, applying the CS decomposition to the symmetrized RP iteration, the matrix
Q(ω) is given by,

Q(ω) = (I − ωP1)(I − ωP2)² (I − ωP1)
     = U1 [ α²(α²C² + S²)   α(α² − 1)C S ; α(α² − 1)C S   α²S² + C² ] U1ᵀ        (10.31)

where again α = 1 − ω. When ω = 1, the eigenvalues of the symmetrized matrix


Q are identical to those of the unsymmetrized matrix Q u . One objection to the
symmetrization process is that Q(ω) requires three projections while Q u only needs
two. In the previous section, however, we showed that when ω = 1, KACZ can be
implemented with only two projections per iteration.
When m = 2, ω = 1 minimizes the spectral radius of Q(ω) as was shown in
[14]. This result can be obtained from the representation (10.31) in the same way as
(10.29) is obtained.
The CS decomposition also allows the construction of an example showing that
no direct relationship need exist between the singular value distribution of A and
its related RP matrices. Define

A = [ A1ᵀ ; A2ᵀ ] = [ 0  D ; S  C ]        (10.32)

where each block is n/2 × n/2, with C and S satisfying (10.28) and D =
diag(d1, d2, . . . , dn/2). Then P1 = [ I 0 ; 0 0 ] and P2 = [ C ; S ] (C  S), and the eigenval-
ues of [I − Q(1)] are {s1², s2², . . . , s²n/2, 1} while those of A are [ci ± √(ci² + 4si di)]/2.
i i i i
Clearly, A is nonsingular provided that each si and di is nonzero. If the si ’s are close
to 1 while the di ’s are close to 0, then [I − Q(1)] has eigenvalues that are clustered
near 1, while A has singular values close to both 0 and 1. Hence, A is badly condi-
tioned while [I − Q(1)] is well-conditioned. Conversely if the di ’s are large while
the si ’s are close to 0 in such a way that di si is near 1, then A is well-conditioned
while [I − Q(1)] is badly conditioned. Hence, the conditioning of A and its induced
RP system matrix may be significantly different.
We have already stated that the eigenvalue distribution for the KACZ systems is
better for CG acceleration than that of Cimmino, which in turn is better than that of
CGNE (CG applied to AᵀAx = Aᵀ f). For m = 2, the eigenvalues of KACZ are
{s1², s2², . . . , s²n/2, 1} while those of Cimmino are easily seen to be {1 − c1, . . . , 1 −
cn/2, 1 + cn/2, . . . , 1 + c1}, verifying that KACZ has a better eigenvalue distribution
than that of Cimmino. The next theorem shows that in terms of condition numbers
this heuristic argument is valid when m = 2.

Theorem 10.3 When m = 2, κ(AᵀA) ≥ κ(P1 + P2) ≥ κ(I − (I − P1)(I − P2)(I − P1)).
Furthermore, if c1 = cos θ1 is the canonical cosine corresponding to the
smallest angle between R(A1) and R(A2), then κ(P1 + P2) = (1 + c1)² κ(I − (I −
P1)(I − P2)(I − P1)).

Proof Without loss of generality, suppose that R(A1) and R(A2) have dimension
n/2. Let U1 = (G1, G2) and U2 = (H1, H2) be the matrices defined in Theorem
10.2 so that P1 = G1 G1ᵀ, P2 = H1 H1ᵀ, and G1ᵀ H1 = C. Set X = G1ᵀ A1 and
Y = H1ᵀ A2 so that A1 = P1 A1 = G1 G1ᵀ A1 = G1 X and A2 = H1 Y. It is easily
verified that the eigenvalues of (P1 + P2) are 1 ± ci, corresponding to the eigenvectors
gi ± hi where G1 = (g1, g2, . . . , gn/2) and H1 = (h1, h2, . . . , hn/2). Furthermore,
G1ᵀ g1 = H1ᵀ h1 = e1 and G1ᵀ h1 = H1ᵀ g1 = c1 e1, where e1 is the first unit vector. Then
AᵀA = A1 A1ᵀ + A2 A2ᵀ = G1 X Xᵀ G1ᵀ + H1 Y Yᵀ H1ᵀ, so (g1 + h1)ᵀ(AᵀA)(g1 +
h1) = (1 + c1)² e1ᵀ(AᵀA)e1, and (g1 − h1)ᵀ(AᵀA)(g1 − h1) = (1 − c1)² e1ᵀ(AᵀA)e1.
Thus, the minimax characterization of eigenvalues yields,

λmax(AᵀA) ≥ (1 + c1)² (e1ᵀ(AᵀA)e1) / (g1 + h1)ᵀ(g1 + h1)
          = (1 + c1)² (e1ᵀ(AᵀA)e1) / 2(1 + c1) = (1 + c1)(e1ᵀ(AᵀA)e1)/2

and λmin(AᵀA) ≤ (1 − c1)(e1ᵀ(AᵀA)e1)/2. Hence κ(AᵀA) ≥ (1 + c1)/(1 − c1) = κ(P1 +
P2), proving the first inequality.
To prove the second inequality, we see that from the CS Decomposition the eigen-
values of (I − (I − P1)(I − P2)(I − P1)) are given by {s1², s2², . . . , s²n/2, 1}, where
si² = 1 − ci² are the squares of the canonical sines, with the eigenvalue 1 being of
multiplicity n/2. Consequently,

κ(I − (I − P1)(I − P2)(I − P1)) = 1/s1²,

and

κ(P1 + P2) / κ(I − (I − P1)(I − P2)(I − P1)) = [(1 + c1)/(1 − c1)] × s1² = (1 + c1)².

Note that (1 + c1)² is a measure of the lack of orthogonality between R(A1) and


R(A2 ), and so measures the partial orthogonalization effect described above.

10.2.6 Row Partitioning Goals

The first criterion for a row partitioning strategy is that the projections Pi x =
Ai (Aiᵀ Ai)⁻¹ Aiᵀ x must be efficiently computable. One way to achieve this is through
parallelism: if Ai is the direct sum of blocks C j for j ∈ Si , then Pi is block-diagonal
[14]. The computation of Pi x can then be done by assigning each block of Pi to a dif-
ferent multicore node of a multiprocessor. The second criterion is storage efficiency.

The additional storage should not exceed O(n), that is, a few extra vectors, the num-
ber of which must not grow with increasing problem size n. The third criterion is
that the condition number of the subproblems induced by the projections should be
kept under control, i.e. monitored. The need for this is made clear by considering
the case when m = 1, i.e. when A is partitioned into a single block of rows. In
this case, KACZ simply solves the normal equations AᵀAx = Aᵀb, an approach
that can fail if A is severely ill-conditioned. More generally when m > 1, computing
y = Pi x requires solving a system of the form Aiᵀ Ai v = w. Hence the accuracy with
which the action of Pi can be computed depends on the condition number of Aiᵀ Ai
(i.e., κ(Aiᵀ Ai)), requiring the estimation of an upper bound of κ(Aiᵀ Ai). The fourth
criterion is that the number of partitions m, i.e. the number of projectors should be
kept as small as possible, and should not depend on n. One reason has already been
outlined in Proposition 10.1.
In summary, row partitioning should allow parallelism in the computations,
require at most O(n) storage, should yield well conditioned subproblems, with the
number of partitions m being a small constant. We should note that all four goals can
be achieved simultaneously for an important class of linear systems, namely banded
systems.

10.2.7 Row Projection Methods and Banded Systems

For general linear systems, algorithm KACZ (Algorithm 10.2) can exploit parallelism
only at one level, that within each projection: given a vector u obtain v = (I − P j )u,
where Pj = Aj (Ajᵀ Aj)⁻¹ Ajᵀ. This matrix-vector multiplication is essentially the
solution of the least-squares problem

min_w ‖u − Aj w‖₂,

where one seeks only the minimum residual v.


If A were block-tridiagonal of the form
⎛ ⎞
G 1 H1
⎜ J2 G 2 H2 ⎟
⎜ ⎟
⎜ J3 G 3 H3 ⎟
⎜ ⎟
⎜ J4 G 4 H4 ⎟
A=⎜



⎜ J5 G 5 H5 ⎟
⎜ J6 G 6 H6 ⎟
⎜ ⎟
⎝ J7 G 7 H7 ⎠
J8 G 8

where each G i , Hi , Ji is square of order q, for example, then using suitable row
permutations, one can enhance parallelism within each projection by extracting block
rows in which each block consists of independent submatrices.

Case 1: m = 2,
⎛ ⎞
G 1 H1
⎜ J2 G 2 H2 ⎟
⎜ ⎟
⎜ J G H ⎟
⎜ 5 5 5 ⎟
⎜ J G H ⎟

π1 A = ⎜ 6 6 6 ⎟
J3 G 3 H3 ⎟
⎜ ⎟
⎜ J4 G 4 H4 ⎟
⎜ ⎟
⎝ J7 G 7 H7 ⎠
J8 G 8

Case 2: m = 3,
⎛ ⎞
G 1 H1
⎜ J4 G 4 H4 ⎟
⎜ ⎟
⎜ J7 G 7 H7 ⎟
⎜ ⎟
⎜ J2 G 2 H2 ⎟
π2 A = ⎜



⎜ J5 G 5 H5 ⎟
⎜ J8 G 8 ⎟
⎜ ⎟
⎝ J3 G 3 H3 ⎠
J6 G 6 H6

Such permutations have two benefits. First, they introduce an outer level of paral-
lelism in each single projection. Second, the size of each independent linear least-
squares problem being solved is much smaller, leading to reduction in time and
memory requirements.
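The following sketch, in which a block row is assumed to be supplied as a list of independent sub-blocks together with the column indices each of them touches, shows how the action of (I − Pi) then decomposes into small independent least-squares problems; in a parallel code each pass through the loop would be assigned to a different node.

import numpy as np

def project_block_row(sub_blocks, x):
    """Apply (I - P_i) for a block row A_i^T that, after the permutation,
    is a direct sum of independent rectangular blocks.  Each entry of
    sub_blocks is a pair (cols, B), where B holds the rows of the sub-block
    and cols lists the unknowns it touches."""
    y = x.copy()
    for cols, B in sub_blocks:           # independent small problems
        xb = x[cols]
        w = np.linalg.lstsq(B.T, xb, rcond=None)[0]   # projection of xb onto range(B^T)
        y[cols] = xb - B.T @ w                        # local (I - P) action
    return y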
The Cimmino row projection scheme (Cimmino) is capable of exploiting paral-
lelism at two levels. For m = 3 in the above example, we have
(i) An outer level of parallel projections on all block rows using 8 nodes, one node
for each independent block, and
(ii) An inner level of parallelism in which one performs each projection (i.e., solv-
ing a linear least-squares problem) as efficiently as possible on the many cores
available on each node of the parallel architecture.

10.3 Multiplicative Schwarz Preconditioner with GMRES

We describe in this section an approach for creating a parallel Krylov subspace


method, namely GMRES, for solving systems of the form Ax = f preconditioned via
a Block Multiplicative Schwarz iteration, which is related to block-row projection via
the Kaczmarz procedure—see Sect. 10.2.1. Here, A consists of overlapped diagonal
blocks, see Fig. 10.1. While multiplicative Schwarz is not naturally amenable to parallel
computing, as it introduces a recursion between the local solves, it is often more

Fig. 10.1 A matrix domain decomposition with block overlaps

robust than its additive counterpart. By using Newton Krylov bases, as introduced
in Sect. 9.3.2, the resulting method does not suffer from the classical bottleneck of
excessive internode communications.

10.3.1 Algebraic Domain Decomposition of a Sparse Matrix

Let us consider a sparse matrix A ∈ Rn×n . The pattern of A is the set P =


{(k, l) | ak,l ≠ 0}, which is the set of the edges of the graph G = (W, P), where
W = {1, . . . , n} = [1 : n] is the set of vertices.

Definition 10.2 A domain decomposition of matrix A into p subdomains is defined


by a collection of sets of integers Wi ⊂ W = [1 : n], i = 1, . . . , p such that:

|i − j| > 1 ⟹ Wi ∩ Wj = ∅,
P ⊂ ∪_{i=1}^{p} (Wi × Wi).

In order to simplify the notation, the sets Wi are assumed to be intervals of integers.
This is not a restriction, since it is always possible to define a new numbering of
the unknowns which satisfies this constraint. Following this definition, a “domain”
decomposition can be considered as resulting from a graph partitioner but with poten-
tial overlap between domains. It can be observed that such a decomposition does
not necessarily exist (e.g. when A is a dense matrix). For the rest of our discussion,
we will assume that a graph partitioner has been applied resulting in p intervals
Wi = wi + [1 : m i ] whose union is W , W = [1 : n]. The submatrix of A corre-
sponding to Wi × Wi is denoted by Ai . We shall denote by Ii ∈ Rn×n the diagonal
matrix, or the sub-identity matrix, whose diagonal elements are set to one if the corre-
sponding node belongs to Wi and set to zero otherwise. In effect, Ii is the orthogonal
projector onto the subspace L i corresponding to the unknowns numbered by Wi . We
still denote by Ai the extension of block Ai to the whole space, in other words,

Ai = Ii AIi , (10.33)

For example, from Fig. 10.1 , we see that unlike the tearing method, the whole overlap
block C j belongs to A j as well as A j+1 .
Also, let

Āi = Ai + I¯i , (10.34)

where I¯i = I − Ii is the complement sub-identity matrix. For the sake of simplifying
the presentation of what follows, we assume that all the matrices Āi , for i = 1, . . . , p
are nonsingular. Hence, the generalized inverse Ai+ of Ai is given by Ai+ = Ii Āi−1 =
Āi−1 Ii .

Proposition 10.3 For any domain decomposition as given in Definition 10.2 the
following property holds:

|i − j| > 2 ⇒ Ii AI j = 0, ∀ i, j ∈ {1, . . . , p}.

Proof Let (k, l) ∈ Wi × Wj such that ak,l ≠ 0. Since (k, l) ∈ P, there exists m ∈
{1, . . . , p} such that k ∈ Wm and l ∈ Wm; therefore Wi ∩ Wm ≠ ∅ and Wj ∩ Wm ≠ ∅.
Consequently, from Definition 10.2, |i − m| ≤ 1 and | j − m| ≤ 1, which implies
|i − j| ≤ 2.

Next, we consider a special case which arises often in practice.

Definition 10.3 The domain decomposition is with a weak overlap if and only if the
following is true:

|i − j| > 1 ⇒ Ii AI j = 0, ∀ i, j ∈ {1, . . . , p}.

The set of unknowns which represents the overlap is defined by the set of integers
Ji = Wi ∩ Wi+1 , i = 1, . . . , p − 1, with the size of the overlap being si . Similar to
(10.33) and (10.34), we define

Ci = Oi AOi , (10.35)

and

C̄i = Ci + Ōi , (10.36)

where the diagonal matrix Oi ∈ Rn×n is a sub-identity matrix whose diagonal


elements are set to one if the corresponding node belongs to Ji and set to zero
otherwise, and Ōi = I − Oi .

Example 10.1 In the following matrix, it can be seen that the decomposition is of a
weak overlap if A2^{b,t} = 0 and A2^{t,b} = 0, i.e. when A is block-tridiagonal.

A = ( A1^{m,m}  A1^{m,b}                                        )
    ( A1^{b,m}  C1        A2^{t,m}  A2^{t,b}                    )
    (           A2^{m,t}  A2^{m,m}  A2^{m,b}                    )        (10.37)
    (           A2^{b,t}  A2^{b,m}  C2        A3^{t,m}          )
    (                               A3^{m,t}  A3^{m,m}          )

10.3.2 Block Multiplicative Schwarz

The goal of Multiplicative Schwarz methods is to iteratively solve the linear system

Ax = f (10.38)

where matrix A is decomposed into overlapping subdomains as described in the


previous section. The iteration consists of solving the original system in sequence on
each subdomain. This is a well-known method; for more details, see for instance [2,
4, 46, 47]. In this section, we present the main properties of the iteration and derive
an explicit formulation of the corresponding matrix splitting that allows a parallel
expression of the preconditioner when used with Krylov methods.
Classical Formulation
This method corresponds to a relaxation iteration defined by the splitting A = P − N .
The existence of the preconditioner P is not made explicit in the classical formulation
of the multiplicative Schwarz procedure. However, the existence of P is a direct
consequence of the coming expression (10.39). Let xk be the current iterate and
rk = f − Axk the corresponding residual. The two sequences of vectors can be
generated via xk+1 = xk + P⁻¹rk and rk+1 = f − Axk+1, as given by Algorithm 10.3,
which builds p sub-iterates together with their corresponding residuals.

Algorithm 10.3 Single block multiplicative Schwarz step


Input: A ∈ Rn×n , a partition of A, xk and rk = f − Axk ∈ Rn .
Output: xk+1 = xk + P −1 rk and rk+1 = f − Axk+1 , where P corresponds to one Multiplicative
Schwarz step.
1: xk,0 = xk and rk,0 = rk ;
2: do i = 1 : p,
3: z = Ai+ rk,i−1 ;
4: rk,i := rk,i−1 − A z;
5: xk,i := xk,i−1 + z;
6: end
7: xk+1 = xk, p and rk+1 = rk, p ;

Hence, it follows that,

rk+1 = (I − A Ap⁺) · · · (I − A A1⁺) rk.        (10.39)

Convergence of this iteration to the solution x of the system (10.38) is proven for
M-matrices and SPD matrices (e.g., see [46]).
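A dense, single-node sketch of Algorithm 10.3 is given below; the domains are passed as index arrays and each direct local solve is an illustrative stand-in for whatever subdomain solver is actually used.

import numpy as np

def multiplicative_schwarz_step(A, f, x, domains):
    """One step of Algorithm 10.3.  `domains` is a list of index arrays W_i;
    z = A_i^+ r is computed by solving the local system A(W_i, W_i) z_i = r(W_i)."""
    x = x.copy()
    r = f - A @ x
    for Wi in domains:
        z = np.zeros_like(x)
        z[Wi] = np.linalg.solve(A[np.ix_(Wi, Wi)], r[Wi])   # z = A_i^+ r
        r -= A @ z                                          # r := r - A z
        x += z                                              # x := x + z
    return x, r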
Embedding in a System of Larger Dimension
If the subdomains do not overlap, it can be shown [47] that the Multiplicative Schwarz
is equivalent to a Block Gauss-Seidel method applied on an extended system. In this
section, following [48], we present an extended system which embeds the original
system (10.38) into a larger one with no overlapping between subdomains.
For that purpose, we define the prolongation mapping and the restriction map-
ping. We assume for the whole section that the set of indices defining the domains
are intervals. As mentioned before, this does not limit the scope of the study since a
preliminary symmetric permutation of the matrix, corresponding to the same renum-
bering of the unknowns and the equations, can always end up with such a system.
For any vector x ∈ Rn, we consider the set of overlapping subvectors x(i) ∈ Rmi
for i = 1, . . . , p, where x(i) is the subvector of x corresponding to the indices Wi.
This vector can also be partitioned into x(i) = (x(i,t); x(i,m); x(i,b)) according to the indices
of the overlapping blocks Ci−1 and Ci (with the obvious convention that x(1,t) and
x(p,b) are zero-length vectors).
Definition 10.4 The prolongation mapping, which injects Rn into a space Rm where
m = Σ_{i=1}^{p} mi = n + Σ_{i=1}^{p−1} si, is defined as follows:

D : Rn → Rm
    x ↦ x̃,

where x̃ is obtained from vector x by duplicating all the blocks of entries cor-
responding to overlapping blocks: therefore, according to the previous notation,
x̃ = (x(1); . . . ; x(p)).
The restriction mapping consists of projecting a vector x̃ ∈ Rm onto Rn, which
consists of deleting the subvectors corresponding to the first appearance of each
overlapping block:

P : Rm → Rn
    x̃ ↦ x.

Embedding the original system in a larger one is done for instance in [47, 48].
We present here a special case. In order to avoid a tedious formal presentation of
the augmented system, we present its construction on an example which is generic

 In (10.37), is displayed an example with


enough to understand the definition of A.
three domains. Mapping D builds x̃ by duplicating some entries in vector x; mapping
x → x = D x expands vector x to include subvectors x (1,b) and x (2,b) :
⎛ ⎞
x (1,m)
⎛ (1,m) ⎞
x ⎜ x (1,b) ⎟
⎜ (2,t) ⎟
⎜x (2,t) ⎟ ⎜x ⎟
⎜ (2,m) ⎟ D ⎜ (2,m) ⎟
x =⎜
⎜ x ⎟ −→ 
⎟ x = ⎜x


⎟ (10.40)
⎝ x (3,t) ⎠ ⎜ x (2,b) ⎟
⎜ ⎟
x (3,m) ⎝ x (3,t) ⎠
x (3,m)
with x (1,b) = x (2,t) and x (2,b) = x (3,t) . (10.41)

The matrix Ã corresponding to the matrix given in (10.37) is

Ã = ( A1^{m,m}  A1^{m,b}                                                          )
    ( A1^{b,m}  C1                  A2^{t,m}  A2^{t,b}                            )
    ( A1^{b,m}            C1        A2^{t,m}  A2^{t,b}                            )
    (                     A2^{m,t}  A2^{m,m}  A2^{m,b}                            )        (10.42)
    (                     A2^{b,t}  A2^{b,m}  C2                  A3^{t,m}        )
    (                     A2^{b,t}  A2^{b,m}            C2        A3^{t,m}        )
    (                                                   A3^{m,t}  A3^{m,m}        )

The equalities (10.41) define a subspace J of Rm. This subspace is the range
of mapping D. These equalities, combined with the definition of matrix Ã, show that
J is an invariant subspace of Ã: ÃJ ⊂ J. Therefore solving system Ax = f is
equivalent to solving system Ã x̃ = f̃ where f̃ = D f. Operator P deletes entries
x(1,b) and x(2,b) from vector x̃.

Remark 10.1 The following properties are straightforward consequences of the pre-
vious definitions:
1. Ax = P Ã D x,
2. The subspace J = R(D) ⊂ Rm is an invariant subspace of Ã,
3. PD = In and DP is a projection onto J,
4. ∀x, y ∈ Rn, (y = Ax ⇔ D y = Ã D x).
This can be illustrated by diagram (10.43):

        A
  Rn    →    Rn
  D ↓        ↑ P        (10.43)
        Ã
  Rm    →    Rm.

One iteration of the Block Multiplicative Schwarz method on the original system
(10.38) corresponds to one Block Gauss-Seidel iteration on the enhanced system

Ã x̃ = D f,        (10.44)

where the diagonal blocks are the blocks defined by the p subdomains. More pre-
cisely, denoting by P̃ the block lower triangular part of Ã, the iteration defined in
Algorithm 10.3 can be expressed as follows:

x̃k = D xk,
r̃k = D rk,
x̃k+1 = x̃k + P̃⁻¹ r̃k,        (10.45)
xk+1 = P x̃k+1.

To prove it, let us partition Ã = P̃ − Ñ, where Ñ is the strictly upper block-
triangular part of (−Ã). Matrices P̃ and Ñ are partitioned by blocks according to
the domain definition. One iteration of the Block Gauss-Seidel method can then be
expressed by

x̃k+1 = x̃k + P̃⁻¹ r̃k.

The resulting block-triangular system is solved successively for each diagonal block.
To derive the iteration, we partition x̃k and x̃k+1 accordingly. At the first step, and
assuming x̃k,0 = x̃k and r̃k,0 = r̃k, we obtain

x̃k+1,1 = x̃k,0 + A1⁻¹ r̃k,0,        (10.46)

which is identical to the first step of the Multiplicative Schwarz, xk,1 = xk,0 + A1⁺ rk,0.
The ith step (i = 2, . . . , p),

x̃k+1,i = x̃k,i−1 + Ai⁻¹ ( f̃i − P̃i,1:i−1 x̃k+1,1:i−1 − Ai x̃k,i + Ñi,i+1:p x̃k,i+1:p ),        (10.47)

is equivalent to its counterpart xk,i+1 = xk,i + Ai+rk,i in the Multiplicative Schwarz


algorithm.
Therefore, we have the following diagram

        P⁻¹
  Rn    →     Rn
  D ↓         ↑ P
        P̃⁻¹
  Rm    →     Rm

and we conclude that P⁻¹ = P P̃⁻¹ D. Similarly, we claim that

N = P Ñ D,        (10.48)

and J is an invariant subspace of Rm for Ñ. We must remark that there is an abuse
in the notation, in the sense that the matrix denoted by P⁻¹ can be singular even
when P̃⁻¹ is nonsingular.

Explicit Formulation of the Multiplicative Schwarz Preconditioner


We first state a lemma that is needed to express the main result.
Notation: For 1 ≤ i ≤ j ≤ p, Ii: j is the identity on the union of the domains
Wk , (i ≤ k ≤ j) and I¯i: j = I − Ii: j .
Lemma 10.1 For any i ∈ {1, . . . , p − 1},

Āi+1 Ii + Ii+1 Āi − Ii+1 AIi = C̄i Ii:i+1 , (10.49)

and for any i ∈ {1, . . . , p},

Ai+ = Āi−1 Ii = Ii Āi−1 .

Proof Straightforward (see Fig. 10.2).

Theorem 10.4 ([49]) Let A ∈ Rn×n be algebraically decomposed into p subdo-
mains as described in Sect. 10.3.1 such that all the matrices Āi, i = 1, . . . , p, and the
matrices Ci, i = 1, . . . , p − 1, are nonsingular. The inverse of the Multiplicative Schwarz
preconditioner, i.e. the matrix P⁻¹, can be explicitly expressed by:

P⁻¹ = Āp⁻¹ C̄p−1 Āp−1⁻¹ C̄p−2 · · · Ā2⁻¹ C̄1 Ā1⁻¹        (10.50)

Fig. 10.2 Illustration of


(10.49). Legend I = Ii and
I + 1 = Ii+1

where for i = 1, . . . , p, matrix Āi corresponds to the block Ai and for i = 1, . . . , p −


1, matrix C̄i to the overlap Ci as given by (10.34), and (10.36), respectively.

Proof In [49], two proofs are proposed. We present here the proof by induction.
Since xk+1 = xk + P⁻¹rk and since xk+1 is obtained as the result of the successive steps
defined in Algorithm 10.3, for i = 1, . . . , p: xk,i+1 = xk,i + Ai⁺ rk,i, we shall
prove by reverse induction on i = p − 1, . . . , 0 that:

xk,p = xk,i + Āp⁻¹ C̄p−1 · · · C̄i+1 Āi+1⁻¹ Ii+1:p rk,i        (10.51)

For i = p − 1, the relation xk,p = xk,p−1 + Āp⁻¹ Ip rk,p−1 is obviously true. Let us
assume that (10.51) is valid for i and let us prove it for i − 1.

xk,p = xk,i−1 + Ai⁺ rk,i−1 + Āp⁻¹ C̄p−1 · · · C̄i+1 Āi+1⁻¹ Ii+1:p (I − A Ai⁺) rk,i−1
     = xk,i−1 + Āp⁻¹ · · · C̄i+1 (Ai⁺ + Ai+1⁺ Ii+1:p − Āi+1⁻¹ Ii+1:p A Ai⁺) rk,i−1.

The last transformation was possible since the supports of Ap, and of Cp−1, Cp−2,
. . . , Ci+1 are disjoint from domain i. Let us transform the matrix expression:

B = Ai⁺ + Ai+1⁺ Ii+1:p − Āi+1⁻¹ Ii+1:p A Ai⁺
  = Āi+1⁻¹ (Āi+1 Ii + Ii+1:p Āi − Ii+1:p A Ii) Āi⁻¹

Lemma 10.1 and elementary calculations imply that:

B = Āi+1⁻¹ (C̄i Ii:i+1 + Ii+2:p − Oi+1) Āi⁻¹
  = Āi+1⁻¹ C̄i Āi⁻¹ Ii:p

which proves that relation (10.51) is valid for i − 1. This ends the proof.
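The explicit formula (10.50) can be evaluated as in the following dense sketch; the index-array representation of the domains and overlaps, and the direct local solves, are illustrative assumptions standing in for the actual distributed kernels.

import numpy as np

def apply_schwarz_inverse(A, v, domains, overlaps):
    """Evaluate w = P^{-1} v from (10.50):
       P^{-1} = Abar_p^{-1} Cbar_{p-1} Abar_{p-1}^{-1} ... Cbar_1 Abar_1^{-1}.
    `domains` is [W_1, ..., W_p] and `overlaps` is [J_1, ..., J_{p-1}], given as
    index arrays.  Abar_i^{-1} solves on W_i and leaves the other unknowns
    unchanged; Cbar_i multiplies the overlap J_i by C_i = A(J_i, J_i)."""
    w = v.copy()
    p = len(domains)
    for i in range(p):                                       # rightmost factor first
        Wi = domains[i]
        w[Wi] = np.linalg.solve(A[np.ix_(Wi, Wi)], w[Wi])    # w := Abar_i^{-1} w
        if i < p - 1:
            Ji = overlaps[i]
            w[Ji] = A[np.ix_(Ji, Ji)] @ w[Ji]                # w := Cbar_i w
    return w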

Once the inverse of P is explicitly known, the matrix N = P − A can be expressed


explicitly as well. For this purpose, we assume that every block Ai , for i = 1, . . . , p −
1 is partitioned as follows:
 
Ai = ( Bi  Fi )
     ( Ei  Ci ),

and Ap = Bp.
Proposition 10.4 The matrix N is defined by the multiplicative Schwarz splitting
A = P − N. By embedding it in Ñ as expressed in (10.48), block Ni,j is the upper
part of block Ñi,j as referred to by the block structure of Ñ. The blocks can be expressed
as follows:

Ni,j = Gi · · · Gj−1 Bj,          when j > i + 1,
Ni,i+1 = Gi Bi+1 − [Fi, 0],                            (10.52)
Ni,j = 0                          otherwise,

for i = 1, . . . , p − 1 and j = 2, . . . , p and where Gi = [Fi Ci⁻¹, 0]. When the


algebraic domain decomposition is with a weak overlap, expression (10.52) becomes:

Ni,i+1 = Gi Bi+1 − [Fi, 0],   for i = 1, . . . , p − 1,
Ni,j = 0   otherwise.                                    (10.53)

The matrix N is of rank r ≤ Σ_{i=1}^{p−1} si, where si is the size of Ci.

Proof The proof of the expression (10.52) is based on a constructive proof of
Theorem 10.4, as detailed in [49].
The structure of row block Ni = [Ni,1, . . . , Ni,p], for i = 1, . . . , p − 1, of matrix
N is:

Ni = [ 0, · · · , 0, [Fi 0] ( 0  Ci⁻¹Bi+1(1,2) ; 0  −I ), Gi Gi+1 Bi+2, . . . , Gi · · · Gp−1 Bp ].

Therefore, the rank of the row block Ni is limited by the rank of the factor [Fi 0], which
cannot exceed si. This implies r ≤ Σ_{i=1}^{p−1} si.

Example 10.2 In Fig. 10.3, a block-tridiagonal matrix A is considered. Three overlap-
ping blocks are defined on it. The overlap is weak. The pattern of the matrix N,
as defined in Proposition 10.4, has only two nonzero blocks, as shown in Fig. 10.3.

Fig. 10.3 Expression of the block multiplicative Schwarz splitting for a block-tridiagonal matrix
with three overlapping blocks A1, A2, A3 and overlaps C1, C2. Left: pattern of A; Right: pattern of N,
where A = P − N is the corresponding splitting (N has only two nonzero blocks)

10.3.3 Block Multiplicative Schwarz as a Preconditioner for Krylov Methods

Left and Right Preconditioner of a Krylov Method


The goal is to solve the system (10.38) by using a preconditioned Krylov method
from an initial guess x0 and therefore with r0 = f − Ax0 as initial residual. Let
P ∈ Rn×n be a preconditioner of the matrix A. Two options are possible:

Left preconditioning: the system to be solved is P −1 Ax = P −1 f which defines the


Krylov subspaces Kk (P −1 A, P −1 r0 ).
Right preconditioning: the system to be solved is A P −1 y = f and the solution is
recovered from x = P −1 y. The corresponding sequence of Krylov subspaces is
therefore Kk (A P −1 , r0 ).

Because Kk (P −1 A, P −1 r0 ) = P −1 Kk (A P −1 , r0 ), the two preconditioners


exhibit similar behavior since they involve two related sequences of Krylov sub-
spaces.
Early Termination
We consider the advantage of preconditioning a Krylov method, via the splitting
A = P − N in which N is rank deficient—the case for the Multiplicative Schwarz
preconditioning. For solving the original system Ax = f , we define a Krylov method,
as being an iterative method which builds, from an initial guess x0 , a sequence of
iterates xk = x0 + yk such that yk ∈ Kk (A, r0 ) where Kk (A, r0 ) is the Krylov
subspace of degree k, built from the residual r0 of the initial guess: Kk (A, r0 ) =
Pk−1 (A)r0 , where Pk−1 (R) is the set of polynomials of degree k − 1 or less. The
vector yk is obtained by minimizing a given norm of the error xk − x, or by projecting
the initial error onto the subspace Kk (A, r0 ) in a given direction. Here, we consider
that, for a given k, when the property x ∈ x0 + Kk (A, r0 ) holds, it implies that
xk = x. This condition is satisfied when the Krylov subspace sequence becomes
stationary as given in Proposition 9.1.

Proposition 10.5 With the previous notations, if rank(N ) = r < n, then any Krylov
method, as it has just been defined, reaches the exact solution in at most r + 1
iterations.

Proof In P −1 A = I − P −1 N the matrix P −1 N is of rank r . For any degree k, the fol-


lowing inclusion Kk(P⁻¹A, P⁻¹r0) ⊂ Span(P⁻¹r0) + R(P⁻¹N) guarantees that the
dimension of Kk (P −1 A, P −1 r0 ) is at most r +1. Therefore, the method is stationary
from k = r + 1 at the latest. The proof is identical for the right preconditioning.

For a general nonsingular matrix, this result is applicable to the methods BiCG and
QMR, preconditioned by the Multiplicative Schwarz method. In exact arithmetic, the
number of iterations cannot exceed the total dimension s of the overlap by more than
1. The same result applies to GMRES(m) when m is greater than s.

Explicit Formulation and Nonorthogonal Krylov Bases


In the classical formulation of the Multiplicative Schwarz iteration (Algorithm 10.3)
the computation of the two steps

xk+1 = xk + P −1 rk and rk+1 = f − Axk+1 ,

is carried out recursively through the domains, whereas our explicit formulation (see
Theorem 10.4) decouples the two computations. The computation of the residual is
therefore more easily implemented in parallel since it is not included in the recursion.
Another advantage of the explicit formulation arises when it is used as a precondi-
tioner of a Krylov method. In such a case, the user is supposed to provide a code
for the procedure x → P −1 x. Since the method computes the residual, the classical
algorithm implies calculation of the residual twice.
The advantage is even higher when considering the a priori construction of a
nonorthogonal basis of the Krylov subspace. For this purpose, the basis is built by
a recursion of the type of (9.61) but where the operator A is now replaced by either
P −1 A or A P −1 . The corresponding algorithm is one of the following: 9.4, 9.5,
or 9.6. Here, we assume that we are using the Newton-Arnoldi iteration (i.e. the
Algorithm 9.5) with a left preconditioner.
Next, we consider the parallel implementation of the above scheme on a linear
array of processors P(q), q = 1, . . . , p. We assume that the matrix A and the
vectors involved are distributed as indicated in Sect. 2.4.1. In order to get rid of any
global communication, the normalization of w must be postponed. For a clearer
presentation, we skip for now the normalizing factors but we will show later how to
incorporate these normalization factors without harming parallel scalability.
Given a vector z 1 , building a Newton-Krylov basis can be expressed by the
loop:
do k = 1 : m,
z k+1 = P −1 A z k − λk+1 z k ;
end
At each iteration, this loop involves multiplying A by a vector (MV as defined in
Sect. 2.4.1), followed by solving a linear system involving the preconditioner P.
As shown earlier, Sect. 2.4.1, the MV kernel is implemented via Algorithm 2.11.
Solving systems involving the preconditioner P corresponds to the application of
one Block Multiplicative Schwarz iteration, which can be expressed from the explicit
formulation of P −1 as given by Theorem 10.4, and implemented by Algorithm 10.4.
Algorithms 2.11 and 10.4 can be concatenated into one. Running the resulting
program on each processor defines the flow illustrated in Fig. 10.4. The efficiency
of this approach is analyzed in [50]. It shows that if there is a good load balance
across all the subdomains and if τ denotes the maximum number of steps necessary
to compute a subvector vq , then the number of steps to compute one entire vector is
given by T = pτ , and consequently, to compute m vectors of the basis the number
of parallel steps is
T p = ( p − 3 + 3m)τ. (10.54)


Fig. 10.4 Pipelined construction of the Newton-Krylov basis corresponding to the block multiplica-
tive Schwarz preconditioning: the recursion zk+1 = αk(P⁻¹A − λk I)zk, k = 1 : m, sweeps across the
domains A1, . . . , Ap, and the wavefront of the computation yields an efficiency of about 1/3 with
one domain per processor

Algorithm 10.4 Pipelined multiplicative Schwarz iteration w := P −1 v: program for


processor p(q)
Input: In the local memory: Aq, vqᵀ = [(vq1)ᵀ, (vq2)ᵀ, (vq3)ᵀ] and wqᵀ = [(wq1)ᵀ, (wq2)ᵀ, (wq3)ᵀ].
Output: w := P −1 v.
1: if q > 1, then
2: receive z q1 from processor pq−1 ;
3: wq1 = vq1 ;
4: end if
5: solve Aq wq = vq ;
6: if q < p, then
7: wq3 := Cq wq3 ;
8: send wq3 to processor pq+1 ;
9: end if
10: if q > 1, then
11: send wq1 to processor pq−1 ;
12: end if
13: if q < p, then
14: receive t from processor pq+1 ;
15: wq3 := t;
16: end if

Thus, the speedup of the basis construction is given by,

Sp = p / (3 + (p − 3)/m).        (10.55)

While the efficiency Ep = 1/(3 + (p − 3)/m) grows with m, it is asymptotically limited to 1/3.
Assuming that the block-diagonal decomposition is with weak overlap as given by
Definition 10.3, then by using (10.53) we can show that the asymptotic efficiency
can reach 1/2.
If we do not skip the computation of the normalizing coefficients αk , the actual
computation should be as follows:
do k = 1 : m,
z k+1 = αk (P −1 A z k − λk+1 z k );
end
where αk = 1/‖P⁻¹A zk − λk+1 zk‖. When a slice of the vector zk+1 has been computed on
processor Pq , the corresponding norm can be computed and added to the norms of
the previous slices which have been received from the previous processor Pq−1 . Then
the result is sent to the next processor p(q + 1). When reaching the last processor
Pp , the norm of the entire vector is obtained, and can be sent back to all processors in
order to update the subvectors. This procedure avoids global communication which
could harm parallel scalability. The entries of the tridiagonal matrix Tˆm , introduced
in (9.62), must be updated accordingly.
A full implementation of the GMRES method preconditioned with the block mul-
tiplicative Schwarz iteration is coded in GPREMS following the PETSc formulations.
A description of the method is given in [50] where the pipeline for building the New-
ton basis is described. The flow of the computation is illustrated by Fig. 10.5. A new
version is available in a more complete set where deflation is also incorporated [51].
Deflation is of importance here as it limits the increase in the number of GMRES
iterations for a large number of subdomains.

Fig. 10.5 Flow of the computation vk+1 = σk P⁻¹(A − λk I)vk on processors P1, . . . , P4: communi-
cations for Ax, for P⁻¹y, and for the consistency of the computed vector in the overlapped region,
with the norms σk accumulated processor by processor (courtesy of the authors of [50])

References

1. Axelsson, O., Barker, V.A.: Finite Element Solution of Boundary Value Problems. Academic
Press Inc., Orlando (1984)
2. Meurant, G.: Computer Solution of Large Linear Systems. Studies in Mathematics and
its Applications. Elsevier Science, North-Holland (1999). http://books.google.fr/books?id=
fSqfb5a3WrwC
3. Chen, K.: Matrix Preconditioning Techniques and Applications. Cambridge University Press,
Cambridge (2005)
4. Saad, Y.: Iterative Methods for Sparse Linear Systems. SIAM, Philadelphia (2003)
5. van der Vorst, H.A.: Iterative Krylov Methods for Large Linear Systems. Cambridge University
Press, Cambridge (2003). http://dx.doi.org/10.1017/CBO9780511615115
6. Schenk, O., Gärtner, K.: Solving unsymmetric sparse systems of linear equations with pardiso.
Future Gener. Comput. Syst. 20(3), 475–487 (2004)
7. Censor, Y., Gordon, D., Gordon, R.: Component averaging: an efficient iterative parallel algo-
rithm for large and sparse unstructured problems. Parallel Comput. 27(6), 777–808 (2001)
8. Gordon, D., Gordon, R.: Component-averaged row projections: a robust, block-parallel scheme
for sparse linear systems. SIAM J. Sci. Stat. Comput. 27(3), 1092–1117 (2005)
9. Zouzias, A., Freris, N.: Randomized extended Kaczmarz for solving least squares. SIAM J.
Matrix Anal. Appl. 34(2), 773–793 (2013)
10. Liu, J., Wright, S., Sridhar, S.: An Asynchronous Parallel Randomized Kaczmarz Algorithm.
CoRR (2014). arXiv:abs/1201.3120 math.NA
11. Popa, C.: Least-squares solution of overdetermined inconsistent linear systems using Kacz-
marz’s relaxation. Int. J. Comput. Math. 55(1–2), 79–89 (1995)
12. Popa, C.: Extensions of block-projections methods with relaxation parameters to inconsistent
and rank-deficient least-squares problems. BIT Numer. Math. 38(1), 151–176 (1998)
13. Bodewig, E.: Matrix Calculus. North-Holland, Amsterdam (1959)
14. Kamath, C., Sameh, A.: A projection method for solving nonsymmetric linear systems on
multiprocessors. Parallel Comput. 9, 291–312 (1988/1989)
15. Tewarson, R.: Projection methods for solving space linear systems. Comput. J. 12, 77–80
(1969)
16. Tompkins, R.: Methods of steep descent. In: E. Beckenbach (ed.) Modern Mathematics for the
Engineer. McGraw-Hill, New York (1956). Chapter 18
17. Gastinel, N.: Procédé itératif pour la résolution numérique d’un système d’équations linéaires.
Comptes Rendus Hebd. Séances Acad. Sci. (CRAS) 246, 2571–2574 (1958)
18. Gastinel, N.: Linear Numerical Analysis. Academic Press, Paris (1966). Translated from the
original French text Analyse Numerique Lineaire
19. Tanabe, K.: A projection method for solving a singular system of linear equations. Numer.
Math. 17, 203–214 (1971)
20. Tanabe, K.: Characterization of linear stationary iterative processes for solving a singular
system of linear equations. Numer. Math. 22, 349–359 (1974)
21. Ansorge, R.: Connections between the Cimmino-methods and the Kaczmarz-methods for the
solution of singular and regular systems of equations. Computing 33, 367–375 (1984)
22. Cimmino, G.: Calcolo approssimato per le soluzioni dei sistemi di equazioni lineari. Ric. Sci.
Progr. Tech. Econ. Naz. 9, 326–333 (1938)
23. Dyer, J.: Acceleration of the convergence of the Kaczmarz method and iterated homogeneous
transformations. Ph.D. thesis, University of California, Los Angeles (1965)
24. Gordon, R., Bender, R., Herman, G.: Algebraic reconstruction photography. J. Theor. Biol. 29,
471–481 (1970)
25. Gordon, R., Herman, G.: Three-dimensional reconstruction from projections, a review of algo-
rithms. Int. Rev. Cytol. 38, 111–115 (1974)
26. Trummer, M.: A note on the ART of relaxation. Computing 33, 349–352 (1984)
27. Natterer, F.: Numerical methods in tomography. Acta Numerica 8, 107–141 (1999)

28. Byrne, C.: Applied Iterative Methods. A.K. Peters (2008)


29. Householder, A., Bauer, F.: On certain iterative methods for solving linear systems. Numer.
Math. 2, 55–59 (1960)
30. Householder, A.S.: The Theory of Matrices in Numerical Analysis. Dover Publications, New
York (1964)
31. Elfving, T.: Block iterative methods for consistent and inconsistent linear equations. Numer.
Math. 35, 1–12 (1980)
32. Kydes, A., Tewarson, R.: An iterative method for solving partitioned linear equations. Com-
puting 15, 357–363 (1975)
33. Peters, W.: Lösung linear Gleichungeneichungssysteme durch Projektion auf Schnitträume
von Hyperebenen and Berechnung einer verallgemeinerten Inversen. Beit. Numer. Math. 5,
129–146 (1976)
34. Wainwright, R., Keller, R.: Algorithms for projection methods for solving linear systems of
equations. Comput. Math. Appl. 3, 235–245 (1977)
35. Björck, Å., Elfving, T.: Accelerated projection methods for computing pseudoinverse solutions
of systems for linear equations. BIT 19, 145–163 (1979)
36. Bramley, R., Sameh, A.: Row projection methods for large nonsymmetric linear systems. SIAM
J. Sci. Stat. Comput. 13, 168–193 (1992)
37. Elfving, T.: Group iterative methods for consistent and inconsistent linear equations. Technical
Report LITH-MAT-R-1977-11, Linkoping University (1977)
38. Benzi, M.: Gianfranco Cimmino's contribution to numerical mathematics. In: Atti del Seminario
di Analisi Matematica dell’Università di Bologna, pp. 87–109. Technoprint (2005)
39. Gilbert, P.: Iterative methods for the three-dimensional reconstruction of an object from pro-
jections. J. Theor. Biol. 36, 105–117 (1972)
40. Lakshminarayanan, A., Lent, A.: Methods of least squares and SIRT in reconstruction. J. Theor.
Biol. 76, 267–295 (1979)
41. Whitney, T., Meany, R.: Two algorithms related to the method of steepest descent. SIAM J.
Numer. Anal. 4, 109–118 (1967)
42. Arioli, M., Duff, I., Noailles, J., Ruiz, D.: A block projection method for general sparse matrices.
SIAM J. Sci. Stat. Comput. 13, 47–70 (1990)
43. Bjorck, A., Golub, G.: Numerical methods for computing angles between linear subspaces.
Math. Comput. 27, 579–594 (1973)
44. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins (2013)
45. Stewart, G.: On the perturbation of pseudo-inverse, projections and linear least squares prob-
lems. SIAM Rev. 19, 634–662 (1977)
46. Benzi, M., Frommer, A., Nabben, R., Szyld, D.B.: Algebraic theory of multiplicative Schwarz
methods. Numerische Mathematik 89, 605–639 (2001)
47. Hackbusch, W.: Iterative Solution of Large Sparse Systems of Equations. Springer, New York
(1999)
48. Tang, W.: Generalized Schwarz splittings. SIAM J. Sci. Stat. Comput. 13, 573–595 (1992)
49. Atenekeng-Kahou, G.A., Kamgnia, E., Philippe, B.: An explicit formulation of the multiplica-
tive Schwarz preconditionner. Appl. Numer. Math. 57(11–12), 1197–1213 (2007)
50. Nuentsa Wakam, D., Atenekeng-Kahou, G.A.: Parallel GMRES with a multiplicative Schwarz
preconditioner. J. ARIMA 14, 81–88 (2010)
51. Nuentsa Wakam, D., Erhel, J.: Parallelism and robustness in GMRES with the newton basis
and the deflated restarting. Electron. Trans. Linear Algebra (ETNA) 40, 381–406 (2013)
Chapter 11
Large Symmetric Eigenvalue Problems

In this chapter we consider the following problems:


Eigenvalue problem: Given symmetric A ∈ Rn×n compute eigenpairs {λ, x} that
satisfy Ax = λx.
Generalized eigenvalue problem: Given symmetric A ∈ Rn×n and B ∈ Rn×n that
is s.p.d., compute generalized eigenpairs {λ, x} that satisfy Ax = λBx.
Singular value problem: Given A ∈ Rm×n , compute normalized vectors u ∈
Rm, v ∈ Rn and scalars σ ≥ 0 such that Av = σu and Aᵀu = σv.
We assume here that the underlying matrix is large and sparse so that it is impractical
to form a full spectral or singular-value decomposition as was the strategy in Chap. 8.
Instead, we present methods that seek only a few eigenpairs or singular triplets. The
methods can easily be adapted for complex Hermitian matrices.

11.1 Computing Dominant Eigenpairs and Spectral Transformations

The simplest method for computing the dominant eigenvalue (i.e. the eigenvalue of
largest modulus) and its corresponding eigenvector is the Power Method, listed in
Algorithm 11.1. The power method converges to the leading eigenpair for almost all
initial iterates x0 if the matrix has a unique eigenvalue of maximum modulus. For
symmetric matrices, this implies that the dominant eigenvalue λ is simple and −λ
is not an eigenvalue. This method can be easily implemented on parallel architec-
tures given an efficient parallel sparse matrix-vector multiplication kernel MV (see
Sect. 2.4.1).

Theorem 11.1 Let λ1 > λ2 ≥ · · · ≥ λn be the eigenvalues of the symmetric matrix


A ∈ Rn×n and u1, . . ., un the corresponding normalized eigenvectors. If λ1 > 0 is the
unique dominant eigenvalue and if x0ᵀ u1 ≠ 0, then the sequence of vectors xk


Algorithm 11.1 Power method.


Input: A ∈ Rn×n , x0 ∈ Rn and τ > 0.
Output: x ∈ Rn and λ ∈ R such that Ax − λx ≤ τ .
1: x0 = x0/‖x0‖ ; k = 0 ;
2: y = A x0 ;
3: λ = x0ᵀ y ; r = y − λ x0 ;
4: while ‖r‖ > τ ,
5: k := k + 1 ; xk = y/‖y‖ ;
6: y = A xk ;
7: λ = xkᵀ y ; r = y − λ xk ;
8: end while
9: x = xk ;

created by Algorithm 11.1 converges to ±u1. The method converges linearly with the
convergence factor given by max{ |λ2/λ1| , |λn/λ1| }.

Proof See e.g. [1].

Note, however, that the power method could exhibit very slow convergence. As
an illustration, we consider the Poisson matrix introduced in Eq. (6.64) of Chap. 6.
For order n = 900 (5-point discretization of the Laplace operator on the unit square
with a 30 × 30 uniform grid), the power method has a convergence factor of 0.996.
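As an illustrative (non-authoritative) sketch of Algorithm 11.1, the following Python/NumPy code applies the power method to the same Poisson matrix; the function name, tolerance, and iteration limit are our own choices.

import numpy as np
import scipy.sparse as sp

def power_method(A, x0, tau=1e-8, maxit=10000):
    # Normalize the starting vector (step 1 of Algorithm 11.1)
    x = x0 / np.linalg.norm(x0)
    lam = 0.0
    for _ in range(maxit):
        y = A @ x                      # sparse matrix-vector product
        lam = x @ y                    # Rayleigh quotient, lambda = x^T A x
        r = y - lam * x                # residual r = A x - lambda x
        if np.linalg.norm(r) <= tau:
            break
        x = y / np.linalg.norm(y)      # next normalized iterate
    return lam, x

# 2-D Poisson matrix on a 30 x 30 grid (n = 900), as in the example above
n1 = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n1, n1))
A = sp.kron(sp.identity(n1), T) + sp.kron(T, sp.identity(n1))
lam, x = power_method(A.tocsr(), np.random.rand(n1 * n1))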
The most straightforward way for simultaneously improving the convergence of
the power method and enhancing parallel scalability is to generalize it by iterating
on a block of vectors instead of a single vector. A direct implementation of the power
method on each vector of the block will not work since every column of the iterated
block will converge to the same dominant eigenvector. It is therefore necessary to
maintain strong linear independence across the columns. For that purpose, the block
will need to be orthonormalized in every iteration. The simplest implementation of
this method, called Simultaneous Iteration, for the standard symmetric eigenvalue
problem is given by Algorithm 11.2. Simultaneous Iteration was originally introduced
as Treppen-iteration (cf. [2–4]) and later extended for non-Hermitian matrices (cf.
[5, 6]).
For s = 1, this algorithm mimics Algorithm 11.1 exactly.

Theorem 11.2 Let λi (i = 1, . . . , n) be the eigenvalues of the symmetric matrix


A ∈ Rn×n and u 1 , . . . u n the corresponding normalized vectors, with the eigenval-
ues ordered such that : |λ1 | ≥ · · · ≥ |λs | > |λs+1 | ≥ · · · ≥ |λn |. Let P be the
spectral projector associated with the invariant subspace belonging to the eigenval-
ues λ1 , . . . , λs . Also, let X 0 = [x1 , . . . , xs ] ∈ Rn×s be a basis of this subspace such
that P X0 is of full column rank. Then,

‖(I − Pk)ui‖ = O( |λs+1/λi|^k ), for i = 1, . . . , s,   (11.1)

Algorithm 11.2 Simultaneous iteration method.


Input: A ∈ Rn×n , X 0 ∈ Rn×s and τ > 0.
Output: X ∈ Rn×s orthonormal and Λ ∈ Rs×s diagonal such that AX − X Λ ≤ τ .
1: X 0 = MGS(X 0 ) ;
2: Y = AX 0 ;
3: H = X0ᵀ Y ;
4: H = QΛQᵀ ; //Diagonalization of H
5: R = Y − X0 H ;
6: while ‖R‖ > τ ,
7: Xk = MGS(Y Q) ;
8: Y = A Xk ;
9: H = Xkᵀ Y ;
10: H = QΛQᵀ ; //Diagonalization of H
11: R = Y − Xk H ;
12: end while
13: X = X k ;

where Pk is the orthogonal projector onto the subspace spanned by the columns of
X k which is created by Algorithm 11.2.

Proof See e.g. [7] for the special case of a diagonalizable matrix.

Algorithms which implement simultaneous iteration schemes are inherently parallel.


For example, steps 7–11 of Algorithm 11.2 are performed using dense matrix kernels
whose high-performance parallel implementations have been discussed in earlier
chapters. Most of the time consumed in each iteration of Algorithm 11.2 is due to
the sparse matrix-dense matrix multiplication in step 8. We should point out that
the eigendecomposition of H = Xkᵀ Y may be performed on one multicore node
when s is small relative to the size of the original matrix.
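A compact Python/NumPy sketch of Algorithm 11.2 is given below, under the assumption that A is a (sparse) symmetric matrix and X0 a random n × s starting block; the QR factorization stands in for the MGS orthonormalization and numpy.linalg.eigh diagonalizes the small interaction matrix.

import numpy as np

def simultaneous_iteration(A, X0, tau=1e-8, maxit=1000):
    X, _ = np.linalg.qr(X0)              # orthonormalize the starting block (MGS in the text)
    for _ in range(maxit):
        Y = A @ X                        # sparse matrix times dense block
        H = X.T @ Y                      # interaction matrix H = X^T A X
        w, Q = np.linalg.eigh(H)         # diagonalization H = Q Lambda Q^T
        AX = Y @ Q                       # A times the Ritz block, reusing Y
        X = X @ Q                        # current Ritz vectors
        R = AX - X * w                   # residuals A x_i - lambda_i x_i
        if np.linalg.norm(R) <= tau:
            break
        X, _ = np.linalg.qr(AX)          # orthonormalize for the next sweep
    return w, X                          # Ritz values (ascending) and Ritz vectors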

11.1.1 Spectral Transformations

In the previous section, we explored methods for computing a few dominant eigen-
pairs. In this section, we outline techniques that allow the computation of those
eigenpairs belonging to the interior of the spectrum of a symmetric matrix (or op-
erator) A ∈ Rn×n . These techniques consist of considering a transformed operator
which has the same invariant subspaces as the original operator but with a rearranged
spectrum.
Deflation
Let us order the eigenvalues of A as: |λ1 | ≥ · · · ≥ |λ p | > |λ p+1 | ≥ · · · ≥ |λn |.
We assume that an orthonormal basis V = (v1 , . . . , v p ) ∈ Rn× p of the invariant
subspace V corresponding to the eigenvalues {λ1 , · · · , λ p } is already known. Let
P = I − V Vᵀ be the orthogonal projector onto V⊥ (the orthogonal complement
of V ).

Proposition 11.1 The invariant subspaces of the operator B = PAP are the same
as those of A, and the spectrum of B is given by,

Λ(B) = {0} ∪ {λ p+1 , . . . , λn }. (11.2)

Proof Let u i be the eigenvector of A corresponding to λi . If i ≤ p, then Pu i = 0


and Bu i = 0, else Pu i = u i and Bu i = λi u i .

Using the power method or simultaneous iteration on the operator B provides the
dominant eigenvalues of B which are now λ p+1 , λ p+2 , . . .. To implement the matrix-
vector multiplication involving the matrix B, it is necessary to perform the operation
Pv for any vector v. We need to state here that it is not recommended to use the
expression P x = x − V (Vᵀx) since this corresponds to the Classical Gram-Schmidt
method (CGS) which is not numerically reliable unless applied twice (see Sect. 7.4). It
is better to implement the Modified Gram-Schmidt version (MGS) via the operator
Pv = ∏_{i=1}^{p} (I − ui uiᵀ) v. Since B = PAP, a multiplication of B by a vector
implies two applications of the projector P to this vector. However, we can ignore
the pre-multiplication of A by P since the iterates xk in Algorithm 11.1, and Xk in
Algorithm 11.2, satisfy the relationships P xk = xk and P Xk = Xk, respectively.
Therefore we can use only the operator B̃ = PA in both the power and simultaneous
iteration methods. Hence, this procedure—known as the deflation process—increases
the number of arithmetic operations in each iteration by 4pn or 4spn, respectively,
on a uniprocessor.
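A minimal sketch of applying the deflated operator B̃ = PA to a vector, as needed by the power or simultaneous iteration methods, is shown below; the projector is applied one rank-one factor at a time in the MGS fashion, and U is assumed to hold the p already computed orthonormal eigenvectors.

import numpy as np

def deflated_matvec(A, U, v):
    w = A @ v                        # multiply by A first
    for i in range(U.shape[1]):      # apply P = prod_i (I - u_i u_i^T), one factor at a time
        u = U[:, i]
        w = w - u * (u @ w)
    return w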
In principle, by applying the above technique, it should be possible to determine
any part of the spectrum. There are two obstacles, however, which make this approach
not viable for determining eigenvalues that are very small in magnitude: (i) the num-
ber of arithmetic operations increases significantly; and (ii) the storage required for
the eigenvectors u 1 , . . . , u p could become too large. Next, we consider two alternate
strategies for a spectral transformation that could avoid these two disadvantages.
Shift-and-Invert Transformation
This shift-and-invert strategy transforms the eigenvalues of A which lie near zero
into the largest eigenvalues of A−1 and therefore allows efficient use of the power or
the simultaneous iteration methods for determining eigenvalues of A lying close to
the origin. This can be easily extended for the computation of the eigenvalues in the
neighborhood of any given value μ.

Lemma 11.1 (Shift-and-Invert) Let A ∈ Rn×n and μ ∈ R. For any eigenpair


(ν, u) of (A − μI )−1 , (λ = μ + 1/ν, u) is an eigenpair of A. Using the power
method (Algorithm 11.1) for obtaining the dominant eigenpair of the matrix B =
(A − μI )−1 , one builds a sequence of vectors which converges to the eigenvector u
of the eigenvalue λ closest to μ, as long as the initial vector used to start the power
method is not chosen orthogonal to u. The convergence factor of the method is given
 
by, τ =  λ−μ , where λ̃ is the second closest eigenvalue to μ.
λ̃−μ

Algorithm 11.3 Shift-and-invert method .


Input: A ∈ Rn×n , μ ∈ R, x0 ∈ Rn and τ > 0.
Output: x ∈ Rn and λ ∈ R such that Ax − λx ≤ τ .
1: Perform an LU-factorization of (A − μI ).
2: x0 := x0/‖x0‖ ; k = 0 ;
3: Solve (A − μI )y = x0 ; ρ = ‖y‖ ;
4: x1 = y/ρ ; t = x0/ρ ;
5: λ1 = x1ᵀ t ; r1 = t − λ1 x1 ;
6: while ‖rk+1‖ > τ ,
7: k := k + 1 ;
8: Solve (A − μI )y = xk ; ρ = ‖y‖ ;
9: xk+1 = y/ρ ; t = xk/ρ ;
10: λk+1 = xk+1ᵀ t ; rk+1 = t − λk+1 xk+1 ;
11: end while
12: λ := λk+1 + μ ; x = xk+1 ;

The resulting method is illustrated in Algorithm 11.3.


Solving the linear system (step 8) in each iteration k is relatively inexpensive since the
LU-factorization of (A −μI ) is performed only once (step 1). In order to increase the
order of convergence to an asymptotic cubic rate, it is possible to update the shift μ at
each iteration (see the Rayleigh Quotient Iteration [1]) requiring an LU-factorization
of (A − μk I ) at each iteration k. As a compromise, one can re-evaluate the shift μ
only periodically, e.g. every q > 1 iterations.
The Shift-and-Invert transformation may also be combined with the Simultaneous
Iteration method (Algorithm 11.2), resulting in algorithms which are better suited for
parallel architectures than the single vector iteration, even though the latter could
achieve better rates of convergence.
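A sketch of Algorithm 11.3 in Python using SciPy's sparse LU factorization, computed once outside the loop, might look as follows; A is assumed to be a sparse symmetric matrix and mu, x0, and tau are supplied by the caller.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def shift_and_invert(A, mu, x0, tau=1e-8, maxit=1000):
    n = A.shape[0]
    lu = spla.splu(sp.csc_matrix(A - mu * sp.identity(n)))  # step 1: factor (A - mu*I) once
    x = x0 / np.linalg.norm(x0)
    nu = 0.0
    for _ in range(maxit):
        y = lu.solve(x)                  # solve (A - mu*I) y = x_k
        rho = np.linalg.norm(y)
        x, t = y / rho, x / rho          # note that t = (A - mu*I) x_{k+1}
        nu = x @ t                       # Rayleigh quotient of x_{k+1} w.r.t. A - mu*I
        if np.linalg.norm(t - nu * x) <= tau:
            break
    return mu + nu, x                    # eigenvalue of A closest to mu, and its eigenvector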
Polynomial Transformations for Intermediate Eigenvalues
The Shift-and-Invert transformation is efficient for targeting any region of the spec-
trum of a matrix but it involves solving linear systems. A more easily computable
transformation is via the use of polynomials: for any polynomial p and any eigenpair
(λ, u) of A, the pair ( p(λ), u) is an eigenpair of p(A).
Although polynomial transformations usually give rise to a considerably higher
number of iterations than Shift-and-Invert transformations, they are still useful be-
cause of their ability to cope with extremely large matrices and realizing high effi-
ciency on parallel architectures.
As a simple illustration, we consider the quadratic transformation p(A) =
I − c(A − a I )(A − bI ) combined with the method of simultaneous iterations for
computing all the eigenvalues of A lying in a given interval [a, b] as introduced in
[8]. The scaling factor c is chosen so that min( p(c1 ), p(c2 )) = −1 where the interval
[c1 , c2 ] includes, as closely as possible, the spectrum of A. This yields

c = min_{k=1,2} 2 / ((ck − a)(ck − b)),   (11.3)

which implies that for any eigenvalue λ of A,

if λ ∈ [a, b], then 1 ≤ p(λ) ≤ 1 + g, where g = c(b − a)²/4,
else −1 ≤ p(λ) ≤ 1.

Therefore the polynomial transformation makes dominant the eigenvalues lying in


[a, b]. The eigenvalues of p(A), however, can be densely concentrated around 1. For
that reason, the authors of [8] apply the simultaneous iteration not to B = p(A) but
to Tm (B) where Tm is the Chebyshev polynomial of the first kind of order m. The
resulting procedure is given in Algorithm 11.4.

Algorithm 11.4 Computing intermediate eigenvalues.


Input: A ∈ Rn×n , [a, b] : interval of the eigenvalues to compute, [c1 , c2 ] : interval including all
the eigenvalues, X 0 ∈ Rn×s , m, and τ > 0.
Output: X = [x1 , · · · , xs ] ∈ Rn×s orthonormal and Λ = (λ1 , · · · , λs ) such that Axi −λi xi  ≤ τ
for i = 1, · · · , s.
1: c = mink=1,2 2/[(ck − a)(ck − b)] ; B ≡ I − c(A − a I )(A − bI ) ;
2: X 0 = MGS(X 0 ) ;
3: while True ,
4: Yk0 = X k ;
5: Yk1 = multiply(B, X k ) ;
6: do t = 1 : m − 1,
7: Ykt+1 = 2 × multiply(B, Ykt ) − Ykt−1 ;
8: end
9: Yk = Ykm ;
10: do i = 1 : s,
11: λi = Xk(:, i)ᵀ Yk(:, i) ;
12: ρi = ‖Yk(:, i) − λi Xk(:, i)‖ ;
13: end
14: if maxi=1:s ρi ≤ τ , then
15: EXIT ;
16: end if
17: [Uk , Rk ] = qr(Yk) ;
18: Hk = Rk Rkᵀ ;
19: Hk = Qk Λk Qkᵀ ; //Diagonalization of Hk
20: X k+1 = Uk Q k ;
21: end while
22: X = X k ;

Theorem 11.3 The ith column xi of X k as generated by Algorithm 11.4 converges


to an eigenvector u corresponding to an eigenvalue λ ∈ [a, b] of A such that

‖u − xi^(k)‖ = O(q^k), where   (11.4)

q = max_{μ∈Λ(A), μ∉[a,b]} |Tm(p(μ))| / |Tm(p(λ))|.   (11.5)

Proof This result is a straightforward application of Theorem 11.2 to the matrix


Tm ( p(A)).

Note that the intervals [a, b] and [c1 , c2 ] should be chosen so as to avoid obtaining
a very small g which could lead to very slow convergence.
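The filtering stage of Algorithm 11.4 (lines 4–9) can be sketched in Python as follows: B = I − c(A − aI)(A − bI) is applied matrix-free, and Tm(B)X is built by the Chebyshev three-term recurrence; the function and variable names are illustrative.

import numpy as np

def apply_B(A, X, a, b, c):
    # B X = X - c (A - a I)(A - b I) X, using only products with A
    W = A @ X - b * X
    return X - c * (A @ W - a * W)

def chebyshev_filter(A, X, a, b, c1, c2, m):
    c = min(2.0 / ((ck - a) * (ck - b)) for ck in (c1, c2))  # scaling factor (11.3)
    Y0, Y1 = X, apply_B(A, X, a, b, c)                       # T_0(B)X and T_1(B)X
    for _ in range(1, m):
        Y0, Y1 = Y1, 2.0 * apply_B(A, Y1, a, b, c) - Y0      # three-term recurrence
    return Y1                                                # T_m(B) X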

11.1.2 Use of Sturm Sequences

The scheme for obtaining the eigenpairs of a tridiagonal matrix that lie in a given
interval, TREPS (Algorithm 8.4), introduced in Sect. 8.2.3, can be applied to any
sparse symmetric matrix A with minor modifications as described below.
Let μ ∈ R be located somewhere inside the spectrum of the sparse symmetric
matrix A, but not one of its eigenvalues; hence A − μI is indefinite. If we compute
the symmetric factorization Pᵀ(A − μI)P = L D Lᵀ via the stable algorithm
introduced in [9], where D is a symmetric block-diagonal matrix with blocks of
order 1 or 2, P is a permutation matrix, and L is a lower triangular matrix with a unit
diagonal, then, according to the Sylvester Law of Inertia (e.g. see [10]), the number
of negative eigenvalues of D is equal to the number of eigenvalues of A which are
smaller than μ. This allows one to iteratively partition a given interval for extracting
the eigenvalues belonging to this interval. On a parallel computer, the strategy of
TREPS can be followed as described in Sect. 8.2.3. The eigenvector corresponding
to a computed eigenvalue λ can be obtained by inverse iteration using the last L D Lᵀ
factorization utilized in determining λ.
An efficient sparse L D Lᵀ factorization scheme by Iain Duff et al. [11] is im-
plemented as routine MA57 in the Harwell Subroutine Library [12] (it is also available in
MATLAB as the function ldl).
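A small dense-matrix sketch of the inertia count underlying this strategy is given below; it relies on scipy.linalg.ldl, whereas a truly sparse code would use a factorization such as MA57.

import numpy as np
from scipy.linalg import ldl

def count_eigenvalues_below(A, mu):
    # Number of eigenvalues of A smaller than mu = number of negative
    # eigenvalues of D in A - mu*I = L D L^T (Sylvester's law of inertia).
    _, D, _ = ldl(A - mu * np.eye(A.shape[0]))
    count, i, n = 0, 0, D.shape[0]
    while i < n:
        if i + 1 < n and D[i, i + 1] != 0.0:   # 2x2 block of D
            count += int(np.sum(np.linalg.eigvalsh(D[i:i + 2, i:i + 2]) < 0))
            i += 2
        else:                                  # 1x1 block of D
            count += int(D[i, i] < 0)
            i += 1
    return count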
If the matrix A ∈ Rn×n is banded with a narrow semi-bandwidth d, two options
are viable for applying Sturm sequences:
1. The matrix A is first tridiagonalized via the orthogonal similarity transformation
T = Qᵀ AQ using parallel schemes such as those outlined in either [13] or [14],
which consume O(n 2 d) arithmetic operations on a uniprocessor.
2. The TREPS strategy is used through the band-LU factorization of A −μI with no
pivoting to determine the Sturm sequences, e.g. see [15]. Note that the calculation
of one sequence on a uniprocessor consumes O(nd 2 ) arithmetic operations.
The first option is usually more efficient on parallel architectures since the reduction
to the tridiagonal form is performed only once, while in the second option we will
need to obtain the LU-factorization of A − μI for different values of μ.
In handling very large symmetric eigenvalue problems on uniprocessors, Sturm
sequences are often abandoned as inefficient compared to the Lanczos method. How-
ever, on parallel architectures the Sturm sequence approach offers great advantages.
This is due mainly to three factors: (i) recent improvements in parallel factorization
schemes of large sparse matrices on a multicore node, (ii) the ability of the Sturm
sequence approach to determine exactly the number of eigenvalues lying in a given
subinterval, and (iii) the use of many multicore nodes, one node per subinterval.
subinterval.

11.2 The Lanczos Method

The Lanczos process is equivalent to the Arnoldi process which is described in


Sect. 9.3.2 but adapted to symmetric matrices.
When A is symmetric, Eq. (9.60) implies that Hm is symmetric and upper Hes-
senberg, i.e. Hm = Tm is a tridiagonal matrix. Later, we show that for sufficiently
large values of m, the eigenvalues of Tm could be used as good approximations of
the extreme eigenvalues of A. Denoting the matrix T̂m by
⎛ ⎞
α1 β2
⎜ β2 · · ⎟
⎜ ⎟
⎜ · · · ⎟
⎜ ⎟
⎜ βk αk βk+1 ⎟
T̂m = ⎜

⎟ ∈ R(m+1)×m ,
⎟ (11.6)
⎜ · · · ⎟
⎜ · · βm ⎟
⎜ ⎟
⎝ βm αm ⎠
βm+1

we present the following theorem.

Theorem 11.4 (Lanczos identities) The matrices Vm+1 and T̂m satisfy the following
relation:

Vm+1ᵀ Vm+1 = Im+1,   (11.7)
and AVm = Vm+1 T̂m,   (11.8)

The equality (11.8) can also be written as

AVm = Vm Tm + βm+1 vm+1 emᵀ,   (11.9)

where em ∈ Rm is the last canonical vector of Rm , and Tm ∈ Rm×m is the symmetric


tridiagonal matrix obtained by deleting the last row of T̂m .

11.2.1 The Lanczos Tridiagonalization

Many arithmetic operations can be skipped in the Lanczos process when compared
to that of Arnoldi: since the entries of the matrix located above the first superdiagonal
are zeros, orthogonality between most of the pairs of vectors is mathematically guar-
anteed. The procedure is implemented as shown in Algorithm 11.5. The matrix T̂m

Algorithm 11.5 Lanczos procedure (no reorthogonalization).


Input: A ∈ Rn×n symmetric, v ∈ Rn and m > 0.
Output: V = [v1 , · · · , vm+1 ] ∈ Rn×(m+1) , T̂m ∈ R(m+1)×m as given in (11.6).
1: v1 = v/‖v‖ ;
2: do k = 1 : m,
3: w := Avk ;
4: if k > 1, then
5: w := w − βk vk−1 ;
6: end if
7: αk = vkᵀ w ;
8: w := w − αk vk ;
9: βk+1 = ‖w‖ ;
10: if βk+1 = 0, then
11: Break ;
12: end if
13: vk+1 = w/βk+1 ;
14: end

is usually stored in the form of the two vectors (α1 , . . . , αm ) and (0, β2 , . . . , βm+1 ).
In this algorithm, it is easy to see that it is possible to proceed without storing the
basis Vm since only vk−1 and vk are needed to perform iteration k. This interesting
feature allows us to use large values for m. Theoretically (i.e. in exact arithmetic), at
some m ≤ n the entry βm+1 becomes zero. This implies that Vm is an orthonormal
basis of an invariant subspace and that all the eigenvalues of Tm are eigenvalues of
A. Thus, the Lanczos process terminates with Tm being irreducible (since (βk )k=2:m
are nonzero) with all its eigenvalues simple.
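A direct Python/NumPy transcription of Algorithm 11.5, keeping only the two most recent basis vectors, might look as follows; it returns the two diagonals that define T̂m.

import numpy as np

def lanczos(A, v, m):
    alpha, beta = np.zeros(m), np.zeros(m + 1)   # beta[0] is the unused placeholder beta_1
    v_prev = np.zeros_like(v)
    v_cur = v / np.linalg.norm(v)
    for k in range(m):
        w = A @ v_cur
        if k > 0:
            w = w - beta[k] * v_prev             # subtract beta_k v_{k-1}
        alpha[k] = v_cur @ w
        w = w - alpha[k] * v_cur
        beta[k + 1] = np.linalg.norm(w)
        if beta[k + 1] == 0.0:                   # an invariant subspace has been found
            break
        v_prev, v_cur = v_cur, w / beta[k + 1]
    return alpha, beta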
The Effect of Roundoff Errors
Unfortunately, in floating-point arithmetic the picture is not so straightforward: the
orthonormality expressed in (11.7) is no longer assured. Instead, the relation (11.8)
is now replaced by

AVm = Vm+1 T̂m + E m , (11.10)

where Em ∈ Rn×m, with ‖Em‖ = O(u‖A‖).


Enforcing Orthogonality
An alternate version of the Lanczos method that maintains orthogonality during
the tridiagonalization process can be realized as done in the Arnoldi process (see
Algorithm 9.3). The resulting procedure is given by Algorithm 11.6. Since in this
version the basis Vm+1 must be stored, the parameter m cannot be chosen as large as
in Algorithm 11.5. Hence, a restarting strategy must be considered (see Sect. 11.2.2).
A Block Version of the Lanczos Method with Reorthogonalization
There exists a straightforward generalization of the Lanczos method which operates
with blocks of r vectors instead of a single vector (see [16–18]). Starting with an

Algorithm 11.6 Lanczos procedure (with reorthogonalization).


Input: A ∈ Rn×n symmetric, v ∈ Rn and m > 0.
Output: V = [v1 , . . . , vm+1 ] ∈ Rn×(m+1) , T̂m ∈ R(m+1)×m as given in (11.6).
1: v1 = v/‖v‖ ;
2: do k = 1 : m,
3: w := Avk ;
4: do j = 1 : k,
5: μ = vjᵀ w ;
6: w := w − μ vj ;
7: end
8: αk = vkᵀ w ;
9: w := w − αk vk ;
10: βk+1 = ‖w‖ ;
11: if βk+1 = 0, then
12: Break ;
13: end if
14: vk+1 = w/βk+1 ;
15: end

initial matrix V1 ∈ Rn× p with orthonormal columns, the block Lanczos algorithm
for reducing A to the block-tridiagonal matrix
T = \begin{pmatrix} G1 & R1ᵀ & & & \\ R1 & G2 & R2ᵀ & & \\ & \ddots & \ddots & \ddots & \\ & & Rk−2 & Gk−1 & Rk−1ᵀ \\ & & & Rk−1 & Gk \end{pmatrix}   (11.11)

is given by Algorithm 11.7. In other words,

AV = V T + Z (0, . . . , 0, I p ) (11.12)

where
V = (V1 , V2 , . . . , Vk ). (11.13)

Since the blocks R j are upper-triangular, the matrix T is a symmetric block-


tridiagonal matrix of half-bandwidth p.

11.2.2 The Lanczos Eigensolver

Let us assume that the tridiagonalization is given by (11.9). For any eigenpair (μ, y)
of Tm , where y = (γ1 , . . . , γm ) is of unit 2-norm, the vector x = Vm y is called the
Ritz vector which satisfies

Algorithm 11.7 Block-Lanczos procedure (with full reorth.)


Input: A ∈ Rn×n symmetric, Y ∈ Rn× p and k > 0, and p > 0.
Output: V = [v1 , · · · , vm ] ∈ Rn×m , T ∈ Rm×m as given in (11.11) and where m ≤ kp, and
Z ∈ Rn× p which satisfies (11.12).
1: Orthogonalize Y into V1 ; V = [];
2: do j = 1 : k,
3: V := [V, V j ];
4: W := AV j ;
5: do i = 1 : j,
6: F = Viᵀ W ;
7: W := W − Vi F ;
8: end
9: Gj = Vjᵀ W ;
10: W := W − V j G j ;
11: [V j+1 , R j ]=qr(W );
12: if R j singular, then
13: Break ;
14: end if
15: end
16: Z = W ;

Ax − μx = βm+1 γm vm+1 , (11.14)


and ‖Ax − μx‖ = βm+1 |γm|.   (11.15)

Clearly, there exists an eigenvalue λ ∈ Λ(A) such that

|μ − λ| ≤ ‖Ax − μx‖ / ‖x‖,   (11.16)

(e.g. see [7, p. 79]). Therefore, in exact arithmetic, we have

|μ − λ| ≤ βm+1 |γm |. (11.17)

The convergence of the Ritz values to the eigenvalues of A has been investigated
extensively (e.g. see the early references [19, 20]), and can be characterized by the
following theorem.
Theorem 11.5 (Eigenvalue convergence) Let Tm be the symmetric tridiagonal ma-
trix built by Algorithm 11.5 from the initial vector v. Let the eigenvalues of A and
Tm be respectively denoted by (λi )i=1:n and (μi(m) )i=1:m and labeled in decreasing
order. The difference between the ith exact and approximate eigenvalues, λi and
μi^(m), respectively, satisfies the following inequality,

0 ≤ λi − μi^(m) ≤ (λ1 − λn) ( κi^(m) tan ∠(v, ui) / T_{m−i}(1 + 2γi) )²   (11.18)

where ui is the eigenvector corresponding to λi, γi = (λi − λi+1)/(λi+1 − λn), and κi^(m) is given by

κ1^(m) = 1, and κi^(m) = ∏_{j=1}^{i−1} (μj^(m) − λn)/(μj^(m) − λi), for i > 1,   (11.19)

with Tk being the Chebyshev polynomial of first kind of order k.

Proof See [7] or [21].

The Lanczos Eigensolver without Reorthogonalization


Let us consider the effect of the loss of orthogonality on any Ritz pair (μ, x = Vm y).
Recalling that the orthonormality of Vm could be completely lost, then from (11.10)
(similar to (11.15) and (11.16)), there exists an eigenvalue λ ∈ Λ(A) such that

|μ − λ| ≤ (βm+1 |γm| + ‖Em‖) / ‖x‖,   (11.20)

where ‖x‖ ≥ σmin(Vm) can be very small. Several authors discussed how accurately
can the eigenvalues of Tm approximate the eigenvalues of A in finite precision, for
example, see the historical overview in [21]. The theoretical framework establishing
that loss of orthogonality appears as soon as Vm includes a good approximation of an
eigenvector of A and that, by continuing the process, new copies of the same
eigenvalue are regenerated, was developed in [20, 22].
A scheme for discarding “spurious” eigenvalues by computing the eigenvalues of
the tridiagonal matrix T̃m obtained from Tm by deleting its first row and first column
was developed in [23]. As a result, any common eigenvalue of Tm and T̃m is deemed
spurious.
It is therefore possible to compute a given part of the spectrum Λ(A). For in-
stance, if the sought after eigenvalues are those that belong to an interval [a, b],
Algorithm 11.8 provides those estimates with their corresponding eigenvectors with-
out storing the basis Vm . Once an eigenvector y of Tm is computed, the Ritz vector
x = Vm y needs to be computed. To overcome the lack of availability of Vm , however,
a second pass of the Lanczos process is performed with the same initial vector v,
accumulating the products in Step 6 of Algorithm 11.8 on the fly.
Algorithm 11.8 exhibits a high level of parallelism as outlined below:
• Steps 2 and 3: Computing the eigenvalues of a tridiagonal matrix via multisectioning and bisection using Sturm sequences (the TREPS method),
• Step 5: Inverse iterations generate q independent tasks,
• Steps 6 and 7: Using parallel sparse matrix-vector (and multivector) multiplication kernels, as well as accumulating dense matrix-vector products on the fly, and
• Step 7: Computing simultaneously the norms of the q columns of W.

Algorithm 11.8 LANCZOS1: Two-pass Lanczos eigensolver.


Input: A ∈ Rn×n symmetric, [a, b] ⊂ R, and v ∈ Rn , and m ≥ n.
Output: {λ1 , · · · , λ p } ⊂ [a, b] ∩ Λ(A) and their corresponding eigenvectors U = [u 1 , . . . , u p ] ∈
Rn× p .
1: Perform the Lanczos procedure from v to generate Tm without storing the basis Vm ;
2: Compute the eigenvalues of Tm in [a, b] by TREPS (Algorithm 8.4);
3: Compute the eigenvalues of T̃m in [a, b] by TREPS;
4: Discard the spurious eigenvalues of Tm to get a list of q eigenvalues: {λ1 , · · · , λq } ⊂ [a, b];
5: Compute the corresponding eigenvectors of Tm by inverse iteration : Y = [y1 , . . . , yq ] ;
6: Redo the Lanczos procedure from v and perform W = Vm Y without storing the basis Vm ;
7: Using (11.16), check for convergence of the columns of W as eigenvectors of A;
8: List the converged eigenvalues {λ1 , · · · , λ p } ⊂ [a, b] and corresponding normalized eigenvec-
tors U = [u 1 , . . . , u p ] ∈ Rn× p .

The Lanczos Eigensolver with Reorthogonalization


When Algorithm 11.6 is used, the dimension m is usually too small to obtain good
approximations of the extremal eigenvalues of A from the Ritz values (μi^(m))i=1:m, i.e.
the eigenvalues of Tm. Therefore, a restarting strategy becomes mandatory.
Assuming that we are seeking the p largest eigenvalues of A, the simplest tech-
nique consists of computing the p largest eigenpairs (μi , yi ) of Tm and restart-
ing the Lanczos process with the vector v which is the linear combination v =
p
Vm ( i=1 μi yi ). This approach of creating the restarting vector is not as effec-
tive as the Implicitly Restarted Arnoldi Method (IRAM) given in [24] and imple-
mented in ARPACK [25]. Earlier considerations of implicit restarting appeared in
[26, 27]. The technique can be seen as a special polynomial filter applied to the
initial starting vector v1 , i.e. the first vector used in the next restart is given by,
ṽ1 = pm−1 (A)v1 ∈ Km (A, v1 ) where the polynomial pm−1 is implicitly built.

Algorithm 11.9 LANCZOS2: Iterative Lanczos eigensolver.


Input: A ∈ Rn×n symmetric, and v ∈ Rn , and m ≤ n and p < m.
Output: The p largest eigenvalues {λ1 , · · · , λ p } ⊂ Λ(A) and their corresponding eigenvectors
U = [u 1 , · · · , u p ] ∈ Rn× p .
1: k = 0;
2: Perform the Lanczos procedure from v to generate Tm1 and the basis Vm1 (Algorithm 11.6);
3: repeat
4: k = k + 1 ;
5: if k > 1, then
6: Apply the restarting procedure IRAM to generate V pk and T pk from Vmk−1 and Tmk−1 ;
7: Perform the Lanczos procedure to generate Tmk and the basis Vmk from Vpk and Tpk ;
8: end if
9: Compute the p largest Ritz pairs and the corresponding residuals;
10: until Convergence

Since all the desired eigenpairs do not converge at the same iteration, Algo-
rithm 11.9 can be accelerated by incorporating a deflating procedure that allows the
storing of the converged vectors and applying the rest of the algorithm on PA, instead

of A, where P is the orthogonal projector that maintains the orthogonality of the next
basis with respect to the converged eigenvectors.
Parallel scalability of Algorithm 11.9 can be further enhanced by using a block ver-
sion of the Lanczos eigensolver based on Algorithm 11.7. Once the block-tridiagonal
matrix Tm is obtained, any standard algorithm can be employed to find all its eigen-
pairs. For convergence results for the block scheme analogous to Theorem 11.5, see
[17, 18]. An alternative which involves fewer arithmetic operations consists of or-
thogonalizing the newly computed Lanczos block against the (typically few) con-
verged Ritz vectors. This scheme is known as the block Lanczos scheme with selec-
tive orthogonalization. The practical aspects of enforcing orthogonality in either of
these ways are discussed in [28–30]. The IRAM strategy was adapted for the block
Lanczos scheme in [31].

11.3 A Block Lanczos Approach for Solving Symmetric Perturbed Standard Eigenvalue Problems

In this chapter, we have presented several algorithms for obtaining approximations


of a few of the extreme eigenpairs of the n × n sparse symmetric eigenvalue problem,

Ax = λx, (11.21)

which are suitable for implementation on parallel architectures with high efficiency.
In several computational science and engineering applications there is the need for
developing efficient parallel algorithms for approximating the extreme eigenpairs of
the series of slightly perturbed eigenvalue problems,

A(Si )y = λ(Si )y, i = 1, 2, .., m (11.22)

where
A(Si ) = A + B Si B  (11.23)

in which B ∈ Rn×p, Si = Siᵀ ∈ Rp×p, with B being of full-column rank p ≪ n,


and m is a small positive integer.
While any of the above symmetric eigenvalue problem solvers can be adapted to
handle the series of problems in (11.22), we consider here an approach based on the
block Lanczos algorithm as defined in Sect. 11.2.1, [16].

11.3.1 Starting Vectors for A(Si )x = λx

We are now in a position to discuss how to choose the starting block V1 for the block
Lanczos reduction of A(Si ) in order to take advantage of the fact that A has already

been reduced to block tridiagonal form via the algorithm described above. Recalling
that

A(Si ) = A + B Si B  ,

where 1 ≤ i ≤ m, p ≪ n, and the matrix B has full column rank, the idea of
the approach is rather simple. We take as a starting block the matrix V1 given by the
orthogonal factorization of B, i.e. B = V1 R0, where R0 ∈ Rp×p is upper triangular.
Then, the original matrix A is reduced to the block tridiagonal matrix T via the
block Lanczos algorithm.
In the following, we show that this choice of the starting Lanczos block leads to a
reduced block tridiagonal form of the perturbed matrices A(Si ) which is only a rank
p perturbation of the original block tridiagonal matrix T . Let V , contain the Lanczos
blocks V1 , V2 , . . . , Vk generated by the algorithm. From the orthogonality property
of these matrices, we have

Viᵀ Vj = Ip for i = j, and Viᵀ Vj = 0 for i ≠ j,   (11.24)

where Ip is the identity matrix of order p. Now let Ei = B Si Bᵀ; hence

Vᵀ Ei V = Vᵀ B Si Bᵀ V = Vᵀ V1 R0 Si R0ᵀ V1ᵀ V, i = 1, 2, . . . , m.   (11.25)

Since
V1 = V (Ip, 0, . . . , 0)ᵀ,   (11.26)

then,
Vᵀ Ei V = diag( R0 Si R0ᵀ, 0, . . . , 0 ),   (11.27)

and,

Vᵀ A(Si)V = Vᵀ(A + B Si Bᵀ)V = Vᵀ AV + Vᵀ Ei V = T(Si)   (11.28)

where
T(Si) = \begin{pmatrix} G̃1 & R1ᵀ & & & \\ R1 & G2 & R2ᵀ & & \\ & \ddots & \ddots & \ddots & \\ & & Rk−2 & Gk−1 & Rk−1ᵀ \\ & & & Rk−1 & Gk \end{pmatrix}   (11.29)

with

G̃1 = G1 + R0 Si R0ᵀ,

i.e. the matrix Vᵀ A(Si)V has the same structure as Vᵀ AV. It is then clear that
the advantage of choosing such a set of starting vectors is that the block Lanczos
algorithm needs to be applied only to the matrix A to yield the block tridiagonal matrix
T. Once T is formed, all the block tridiagonal matrices T(Si) can be easily obtained
by adding the term R0 Si R0ᵀ to the first diagonal block G1 of T. The matrix V
is independent of Si and, hence, remains the same for all A(Si). Consequently, the
computational savings will be significant for large-scale engineering computations
which require many small modifications (or reanalyses) of the structure.
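A tiny NumPy sketch of this reanalysis step: once T and R0 have been computed for A, each T(Si) is obtained by updating only the leading p × p block; all names are placeholders.

import numpy as np

def perturbed_block_tridiagonal(T, R0, Si):
    # T(S_i): copy T and add R0 Si R0^T to its (1,1) block G1, cf. (11.29)
    p = R0.shape[0]
    T_Si = T.copy()
    T_Si[:p, :p] += R0 @ Si @ R0.T
    return T_Si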

11.3.2 Starting Vectors for A(Si )−1 x = μx

The starting set of vectors discussed in the previous section are useful for handling
A(Si )x = λx when the largest eigenvalues of A(Si ) are required. For many engi-
neering or scientific applications, however, only a few of the smallest (in magnitude)
eigenvalues and their corresponding eigenvectors are desired. In such a case, one usu-
ally considers the shift and invert procedure instead of working with A(Si )x = λx
directly. In other words, to seek the eigenvalue near zero, one considers instead the
1
problem A(Si )−1 x = x. Accordingly, to be able to take advantage of the nature of
λ
the perturbations, we must choose an appropriate set of starting vectors.

It is not clear at this point how the starting vectors for (A + B Si Bᵀ)−1 x = μx,
where μ = 1/λ, can be properly chosen so as to yield a block tridiagonal structure
analogous to that shown in (11.29). However, if we assume that both A and Si are
nonsingular, the Woodbury formula yields,

(A + B Si Bᵀ)−1 = A−1 − A−1 B(Si−1 + Bᵀ A−1 B)−1 Bᵀ A−1.   (11.30)

Thus, if in reducing A−1 to the block tridiagonal form

Ṽᵀ A−1 Ṽ = T̃   (11.31)

via the block Lanczos scheme, we choose the first block, Ṽ1 of Ṽ as that orthonormal
matrix resulting from the orthogonal factorization

A−1 B = Ṽ1 R̃0 (11.32)



where R̃0 is upper triangular of order p, then

Ṽᵀ (A + B Si Bᵀ)−1 Ṽ = T̃ + (Ip, 0)ᵀ Ẽ(Si) (Ip, 0) = T̃(Si)   (11.33)

in which Ẽ(Si ) is a p × p matrix given by

Ẽ(Si) = R̃0 (Si−1 + Bᵀ A−1 B)−1 R̃0ᵀ.   (11.34)

Note that T̃ (Si ) is a block tridiagonal matrix identical to T̃ in (11.31), except for the
first diagonal block. Furthermore, note that as before, Ṽ is independent of Si and
hence, remains constant for 1 ≤ i ≤ m.
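For the inverted problem, a correspondingly small sketch of forming the first-block update Ẽ(Si) of (11.34) is shown below; here AinvB = A−1B and R0t = R̃0 are assumed to have been computed once (e.g. from a sparse factorization of A and the QR factorization (11.32)) and are reused for every Si.

import numpy as np

def first_block_update(B, AinvB, R0t, Si):
    # E~(S_i) = R~0 (S_i^{-1} + B^T A^{-1} B)^{-1} R~0^T, all of size p x p
    M = np.linalg.inv(Si) + B.T @ AinvB
    return R0t @ np.linalg.solve(M, R0t.T)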

11.3.3 Extension to the Perturbed Symmetric Generalized Eigenvalue Problems

In this section, we address the extension of our approach to the perturbed generalized
eigenvalue problems of type

K(Si)x = λM x,  K(Si) = (K + B Si Bᵀ)   (11.35)

and

K x = λM(Si)x,  M(Si) = (M + B Si Bᵀ)   (11.36)

where K (Si ) and M(Si ) are assumed to be symmetric positive definite for all i,
1 ≤ i ≤ m. In structural mechanics, K (Si ) and M(Si ) are referred to as stiffness
and mass matrices, respectively.
Let K = LK LKᵀ and M = LM LMᵀ be the Cholesky factorizations of K and
M, respectively. Then the generalized eigenvalue problems (11.35) and (11.36) are
reduced to the standard form

(K̃ + B̃ Si B̃ᵀ)y = λy   (11.37)

and

(M̂ + B̂ Si B̂ᵀ)z = λ−1 z   (11.38)

where

K̃ = LM−1 K LM−ᵀ,  B̃ = LM−1 B,  and y = LMᵀ x   (11.39)

and

M̂ = LK−1 M LK−ᵀ,  B̂ = LK−1 B,  and z = LKᵀ x.   (11.40)

Now, both problems can be treated as discussed above. If one seeks those eigenpairs
closest to zero in (11.35), then we need only obtain the block tridiagonal form asso-
ciated with K̃ −1 once the relevant information about the starting orthonormal block
is obtained from the orthogonal factorization for B̃. Similarly, in (11.36), one needs
the block tridiagonal form associated with M̂ based on the orthogonal factorization
of B̂.
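A brief dense-matrix sketch of the reduction (11.37) using SciPy's Cholesky and triangular solvers is shown below; the analogous reduction (11.38) simply exchanges the roles of K and M. The function name is ours.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def reduce_to_standard(K, M, B):
    LM = cholesky(M, lower=True)                              # M = L_M L_M^T
    tmp = solve_triangular(LM, K, lower=True)                 # L_M^{-1} K
    Ktilde = solve_triangular(LM, tmp.T, lower=True).T        # L_M^{-1} K L_M^{-T}
    Btilde = solve_triangular(LM, B, lower=True)              # L_M^{-1} B
    return Ktilde, Btilde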

11.3.4 Remarks

Typical examples of the class of problems described in the preceding sections arise
in the dynamic analysis of modified structures. A frequently encountered problem is
how to take into account, in analysis and design, changes introduced after the initial
structural dynamic analysis has been completed. Typically, the solution process is
of an iterative nature and consists of repeated modifications to either the stiffness
or the mass of the structure in order to fine-tune the constraint conditions. Clearly,
the number of iterations depends on the complexity of the problem, together with
the nature and number of constraints. Even though these modifications may be only
slight, a complete reanalysis of the new eigenvalues and eigenvectors of the modified
eigensystem is often necessary. This can drive the computational cost of the entire
process up dramatically especially for large scale structures.
The question then is how information obtained from the initial/previous analysis
can be readily exploited to derive the response of the new modified structure without
extensive additional computations. To illustrate the usefulness of our approach in this
respect, we present some numerical applications from the free vibrations analysis of
an undamped cantilever beam using finite elements. Without loss of generality, we
consider only modifications to the system stiffness matrix.
We assume that the beam is uniform along its span and that it is composed of
a linear, homogeneous, isotropic, elastic material. Further, the beam is assumed
to be slender, i.e. deformation perpendicular to the beam axis is due primarily to
bending (flexing), and shear deformation perpendicular to the beam axis can be
neglected; shear deformation and rotational inertia effects only become important
when analyzing deep beams at low frequencies or slender beams at high frequencies.
Subsequently, we consider only deformations normal to the undeformed beam axis
and we are only interested in the few lowest natural frequencies.
The beam possesses an additional support at its free end by a spring (assumed
to be massless) with various stiffness coefficients αi , i = 1, 2, . . . , m, as shown in
Fig. 11.1. The beam is assumed to have length L = 3.0, a distributed mass m̄ per
unit length, and a flexural rigidity E I .

Fig. 11.1 A cantilever beam with additional spring support (EI = 10^7, m̄ = 390, L = 3.0, spring stiffness αi)

First, we discretize the beam, without spring support, using one-dimensional solid
beam finite elements, each of length ℓ = 0.3. Using a lumped mass approach, we
get a diagonal mass matrix M, while the stiffness matrix K is block-tridiagonal of
the form

K = (EI/ℓ³) \begin{pmatrix} As & Cᵀ & & & \\ C & As & Cᵀ & & \\ & \ddots & \ddots & \ddots & \\ & & C & As & Cᵀ \\ & & & C & Ar \end{pmatrix}

in which each As , Ar , and C is of order 2. Note that both M and K are symmetric
positive definite and each is of order 20. Including the spring with stiffness αi , we
obtain the perturbed stiffness matrix

K(αi) = K + αi b bᵀ   (11.41)

where b = e19 (the 19th column of I20 ). In other words, K (αi ) is a rank-1 perturbation
of K . Hence, the generalized eigenvalue problems for this discretization are given
by

K(αi)x = (K + αi b bᵀ)x = λM x, i = 1, 2, . . . , m   (11.42)

which can be reduced to the standard form by symmetrically scaling K (αi ) using
the diagonal mass matrix M, i.e.

M −1/2 K (αi )M −1/2 y = λy

in which y = M 1/2 x.
Second, we consider modeling the beam as a three-dimensional object using
regular hexahedral elements, see Fig. 11.2. This element possesses eight nodes, one
at each corner, with each having three degrees of freedom, namely, the components
of displacement u, v, and w along the directions of the x, y, and z axes, respectively.

Fig. 11.2 The regular hexahedral finite element, with local axes x, y, z and corresponding nodal displacements u, v, w

In the case of the hexahedral element, [32, 33], the element stiffness and mass
matrices possess the SAS property (see Chap. 6) with respect to some reflection
matrix and, hence, they can each be recursively decomposed into eight submatrices.
Because of these properties of the hexahedral element (i.e. three levels of symmet-
ric and antisymmetric decomposability), if SAS ordering of the nodes is employed,
then the system stiffness matrix K , of order n, satisfies the relation

PKP = K (11.43)

where P = Pᵀ is a permutation matrix (P² = I). Therefore, depending on the


number of planes of symmetry, the problem can be decomposed into up to eight
subproblems. In the case of the three-dimensional cantilever beam studied here, see
Fig. 11.3, having only two axes of symmetry with the springs removed, then the
problem can be decomposed into only 4 independent subproblems,

Qᵀ K Q = diag(K1, . . . , K4)   (11.44)

Fig. 11.3 A 3-D cantilever beam of length L with additional spring supports of stiffnesses k1 and k2



in which Q is an orthogonal matrix that can be easily constructed from the permu-
tation submatrices that constitute P. The matrices K i are each of order n/4 and of
a much smaller bandwidth. Recall that in obtaining the smallest eigenpairs, we
need to solve systems of the form

K z = g. (11.45)

It is evident that in each step of the block Lanczos algorithm, we have four inde-
pendent systems that can be solved in parallel, with each system of a much smaller
bandwidth than that of the stiffness matrix K .

11.4 The Davidson Methods

In 1975, an algorithm for solving the symmetric eigenvalue problem, called David-
son’s method [34], emerged from the computational chemistry community. This
successful eigensolver was later generalized and its convergence proved in [35, 36].
The Davidson method can be viewed as a modification of Newton’s method ap-
plied to the system that arises from treating the symmetric eigenvalue problem as a
constrained optimization problem involving the Rayleigh quotient. This method is
a precursor to other methods that appeared later such as Trace Minimization [37,
38], and Jacobi-Davidson [39] which are discussed in Sect. 11.5. These approaches
essentially take the viewpoint that the eigenvalue problem is a nonlinear system of
equations and attempt to find a good way to correct a given approximate eigenpair.
In practice, this requires solving a correction equation which updates the current
approximate eigenvector, in a subspace that is orthogonal to it.

11.4.1 General Framework

All versions of the Davidson method may be regarded as various forms of precondi-
tioning the basic Lanczos method. In order to illustrate this point, let us consider a
symmetric matrix A ∈ Rn×n . Both the Lanczos and Davidson algorithms generate,
at some iteration k, an orthonormal basis Vk = (v1 , . . . , vk ) of a k-dimensional sub-
space Vk of Rn and the symmetric interaction matrix Hk = Vkᵀ AVk ∈ Rk×k. In
the Lanczos algorithm, Vk is the Krylov subspace Vk = Kk (A, v), but for Davidson
methods, this is not the case. In the Lanczos method, the goal is to obtain Vk such
that some eigenvalues of Hk are good approximations of some eigenvalues of A.
In other words, if (λ, y) is an eigenpair of Hk , then it is expected that the Ritz pair
(λ, x), where x = Vk y, is a good approximation of an eigenpair of A. Note that this
occurs only for some eigenpairs of Hk and only at convergence.

Davidson methods differ from the Lanczos scheme in the definition of the new
direction which will be added to the subspace Vk to obtain Vk+1 . In Lanczos schemes,
Vk+1 is the basis of the Krylov subspace Kk+1 (A, v). In other words, if reorthog-
onalization is not considered (see Sect. 11.2), the vector vk+1 is computed by a
three term recurrence. In Davidson methods, however, the local improvement of the
direction of the Ritz vector towards the sought after eigenvector is obtained by a
quasi-Newton step in which the next vector is obtained by reorthogonalization with
respect to Vk . Moreover, in this case, the matrix Hk is no longer tridiagonal. There-
fore, one iteration of Davidson methods involves more arithmetic operations than the
basic Lanczos scheme, and is at least as expensive as the Lanczos scheme with full re-
orthogonalization. Note also, that Davidson methods require the storage of the basis
Vk thus limiting the maximum value kmax in order to control storage requirements.
Consequently, Davidson methods are implemented with periodic restarts.
To compute the smallest eigenvalue of a symmetric matrix A, the template of the
basic Davidson method is given by Algorithm 11.10. In this template, the operator
Ck of Step 12 represents any type of correction, to be specified later in Sect. 11.4.3.
Each Davidson method is characterized by this correction step.

Algorithm 11.10 Generic Davidson method.


Input: A ∈ Rn×n symmetric, and v ∈ Rn , and m < n.
Output: The smallest eigenvalue λ ∈ Λ(A) and its corresponding eigenvector x ∈ Rn .
1: v1 = v/‖v‖ ;
2: repeat
3: do k = 1 : m,
4: compute Wk = AVk ;
5: compute the interaction matrix Hk = Vkᵀ AVk ;
6: compute the smallest eigenpair (λk , yk ) of Hk ;
7: compute the Ritz vector xk = Vk yk ;
8: compute the residual rk = Wk yk − λk xk ;
9: if convergence then
10: Exit from the loop k;
11: end if
12: compute the new direction tk = Ck rk ; //Davidson correction
13: Vk+1 = MGS([Vk , tk ]) ;
14: end
15: V1 =MGS([xk , tk ]);
16: until Convergence
17: x = xk ;

11.4.2 Convergence

In this section, we present a general convergence result for the Davidson methods
as well as the Lanczos eigensolver with reorthogonalization (Algorithm 11.9). For a
more general context, we consider the block version of these algorithms. Creating
Algorithm 11.11 as the block version of Algorithm 11.10, we generalize Step 11 so

that the corrections Ck,i can be independently chosen. Further, when some, but not
all, eigenpairs have converged, a deflation process is considered by appending the
converged eigenvectors as the leading columns of Vk .

Algorithm 11.11 Generic block Davidson method.


Input: A ∈ Rn×n symmetric, p ≥ 1, V ∈ Rn× p , and p ≤ m ≤ n .
Output: The p smallest eigenvalues {λ1 , · · · , λ p } ⊂ Λ(A) and their corresponding eigenvectors
X = [x1 , · · · , x p ] ∈ Rn× p .
1: V1 = MGS(V ) ; k = 1;
2: repeat
3: compute Wk = AVk ;
4: compute the interaction matrix Hk = Vkᵀ AVk ;
5: compute the p smallest eigenpairs (λk,i , yk,i ) of Hk , for i = 1, · · · , p ;
6: compute the p corresponding Ritz vectors X k = Vk Yk where Yk = [yk,1 , · · · yk, p ] ;
7: compute the residuals Rk = [rk,1 , · · · , rk, p ] = Wk Yk − X k diag(λk,1 , · · · , λk, p );
8: if convergence then
9: Exit from this loop k;
10: end if
11: compute the new block Tk = [tk,1 , . . . , tk, p ] where tk,i = Ck,i rk,i ; //Davidson correction
12: Vk+1 = MGS([Vk , Tk ]) ;
13: if dim(Vk ) ≤ m − p, then
14: Vk+1 = MGS([Vk , Tk ]) ;
15: else
16: Vk+1 =MGS([X k , Tk ]);
17: end if
18: k = k + 1;
19: until Convergence
20: X = X k ; λi = λk,i for i = 1, · · · , p.

Theorem 11.6 (Convergence of the Block Davidson methods) In Algorithm 11.11,


let Vk be the subspace spanned by the columns of Vk . Under the assumption that

xk,i ∈ Vk+1 , for i = 1, . . . , p, and k ∈ N, (11.46)

the sequence {λk,i }k≥1 is nondecreasing and convergent for i = 1, . . . , p.


Moreover, if in addition,
1. there exist K 1 , K 2 > 0 such that for any k ≥ 1 and for any vector v ∈ Vk⊥ , the
set of matrices {Ck,i }k≥1 satisfies,

K1 ‖v‖² ≤ vᵀ Ck,i v ≤ K2 ‖v‖², and   (11.47)

2. for any i = 1, . . . , p and k ≥ 1,

(I − Vk Vkᵀ)Ck,i (A − λk,i I)xk,i ∈ Vk+1,   (11.48)

then limk→∞ λk,i is an eigenvalue of A and the elements in {xk,i }k≥1 yield the
corresponding eigenvectors.

Proof See [36].

This theorem provides a convergence proof of the single vector version of the David-
son method expressed in Algorithm 11.10, as well as the LANCZOS2 method since
in this case Ck,i = I , even for the block version.

11.4.3 Types of Correction Steps

To motivate the distinct approaches for the correction Step 12 in Algorithm 11.10,
let us assume that a vector x of unit norm approximates an unknown eigenvector
(x + y) of A, where y is chosen orthogonal to x. The quantity λ = ρ(x) (where
ρ(x) = xᵀAx denotes the Rayleigh quotient of the unit 2-norm vector x) approximates the
eigenvalue λ + δ which corresponds to the eigenvector x + y:
λ + δ = ρ(x + y) = (x + y)ᵀA(x + y) / ‖x + y‖².
The quality of the initial approximation is measured by the norm of the residual
r = Ax − λx, which is orthogonal to the vector x since r = (I − x xᵀ)Ax.
Denoting by θ the angle ∠(x, x + y), by t the orthogonal projection of
x onto x + y, and setting z = x − t, we have the following.

Proposition 11.2 Using the previous notations, the correction δ to the approxi-
mation λ of an eigenvalue, and the orthogonal correction y for the corresponding
approximate eigenvector x, satisfy:

(A − λI )y = −r + δ(x + y),
(11.49)
y ⊥ x,

for which the following inequalities hold :

|δ| ≤ 2‖A‖ tan²θ,   (11.50)

‖r‖ ≤ 2‖A‖ tan θ ( sin θ / cos²θ + 1 ).   (11.51)

Proof Equation (11.49) is directly obtained from

A(x + y) = (λ + δ)(x + y).



Moreover

λ + δ = ρ(t) = (1 + tan²θ) tᵀAt
       = (1 + tan²θ)(x − z)ᵀA(x − z)
       = (1 + tan²θ)(λ − 2xᵀAz + zᵀAz).

Since z is orthogonal to the eigenvector t, we obtain xᵀAz = (t + z)ᵀAz = zᵀAz,
and therefore

δ = −zᵀAz + (tan²θ)(λ − zᵀAz) = (tan²θ)(λ − ρ(z)), so that
|δ| ≤ 2(tan²θ)‖A‖,

which proves (11.50). Inequality (11.51) is a straightforward consequence of (11.49)


and (11.50).
Since the system (11.49) is not easy to solve, it is replaced by an approximate
one obtained by deleting the nonlinear term δ(x + y) whose norm is of order O(θ 2 )
whereas ‖r‖ = O(θ). Thus, the problem reduces to solving a linear system under
constraints:

(A − λI )y = −r,
(11.52)
y ⊥ x,

Unfortunately, this system has no solution except the trivial solution when λ is an
eigenvalue. Three options are now considered to overcome this difficulty.
First Approach: Using a Preconditioner
The first approach consists of replacing Problem (11.52) by

(M − λI )y = −r, (11.53)

where M is chosen as a preconditioner of A. Then, the solution y is projected onto


Vk⊥ . Originally in [34], Davidson considered M = diag(A) to solve (11.53). This
method is called the Davidson method.
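A one-line sketch of this classical correction in Python is given below: the new direction is tk = (diag(A) − λI)−1 rk, guarded against near-zero denominators; the guard value is an arbitrary choice of ours.

import numpy as np

def davidson_correction(diagA, lam, r, eps=1e-12):
    d = diagA - lam                   # diagonal of (M - lambda I) with M = diag(A)
    d[np.abs(d) < eps] = eps          # avoid division by (nearly) zero entries
    return r / d                      # new direction t_k, to be orthogonalized against V_k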
Second Approach: Using an Iterative Solver
When λ is close to an eigenvalue, solve approximately

(A − λI )y = −r, (11.54)

by an iterative method with a fixed or variable number of iterations. In such a situation,


the method is similar to inverse iteration in which the ill-conditioning of the system

provokes an error which is in the direction of the desired eigenvector. The iterative
solver, however, must be capable of handling symmetric indefinite systems.
Third Approach: Using Projections
By considering the orthogonal projector P = I − Vk Vkᵀ onto Vk⊥, the
system to be solved becomes

P(A − λI )y = −r,
(11.55)
P y = y,

Note that the system matrix can be also expressed as P(A − λI )P since P y = y.
This system is then solved via an iterative scheme that accommodates symmetric
indefiniteness. The Jacobi-Davidson method [39] follows this approach.
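A matrix-free sketch of solving (11.55) approximately: the projected operator P(A − λI)P is wrapped in a SciPy LinearOperator and a few MINRES iterations (suitable for symmetric indefinite systems) are applied; the iteration count is an arbitrary choice.

import numpy as np
from scipy.sparse.linalg import LinearOperator, minres

def projected_correction(A, V, lam, r, maxiter=20):
    n = A.shape[0]
    def P(x):                                   # orthogonal projector P = I - V V^T
        return x - V @ (V.T @ x)
    def op(x):                                  # apply P (A - lambda I) P
        Px = P(x)
        return P(A @ Px - lam * Px)
    M = LinearOperator((n, n), matvec=op, dtype=float)
    t, _ = minres(M, -P(r), maxiter=maxiter)    # approximate solve of (11.55)
    return P(t)                                 # enforce the constraint P y = y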
The study [40] compares the second approach, which can be seen as the Rayleigh
Quotient method, with the Newton-Grassmann method, which corresponds to the third
approach. The study concludes that the two correction schemes have comparable
behavior. In [40], the authors also provide a stopping criterion for controlling the
inner iterations of an iterative solver for the correction vectors.
This approach is studied in the section devoted to the Trace Minimization method
as an eigensolver of the generalized symmetric eigenvalue problem (see Sect. 11.5).
Also, in this section we describe the similarity between the Jacobi-Davidson method
and the method that preceded it by almost two decades: the Trace Minimization
method.

11.5 The Trace Minimization Method for the Symmetric Generalized Eigenvalue Problem

The generalized eigenvalue problem

Ax = λBx, (11.56)

where A and B are n × n real symmetric matrices with B being positive definite,
arises in many applications, most notably in structural mechanics [41, 42] and plasma
physics [43, 44]. Usually, A and B are large, sparse, and only a few of the eigenvalues
and the associated eigenvectors are desired. Because of the size of the problem, meth-
ods that rely only on operations like matrix-vector multiplications, inner products,
and vector updates are usually considered.
Many methods fall into this category (see, for example [45, 46]). The basic idea in
all of these methods is building a sequence of subspaces that, in the limit, contain the
desired eigenvectors. Most of the early methods iterate on a single vector, i.e. using
one-dimensional subspaces, to compute one eigenpair at a time. If, however, several
eigenpairs are needed, a deflation technique is frequently used. Another alternative
is to use block analogs of the single vector methods to obtain several eigenpairs

simultaneously. The well-known simultaneous iteration [47], or subspace iteration


[1] (originally described in [2]) has been extensively studied in the 1960s and the
1970s [45, 47–50].
In this chapter we do not include practical eigensolvers based on contour integra-
tion. The first such method was proposed in [51], and later it was enhanced as a
subspace iteration similar to the trace minimization scheme, or Davidson-type trace
minimization, resulting in the Hermitian eigensolver FEAST, e.g. see [52].
Let A be symmetric positive definite and assume that we seek the smallest p
eigenpairs, where p ≪ n. In simultaneous iteration, the sequence of subspaces of
dimension p is generated by the following recurrence:

X k+1 = A−1 BXk , k = 0, 1, . . . , (11.57)

where X 0 is an n × p matrix of full rank. The eigenvectors of interest are magnified at


each iteration step, and will eventually dominate X k . The downside of simultaneous
iteration is that linear systems of the form Ax = b have to be solved repeatedly
which is a significant challenge for large problems. Solving these linear systems
inexactly often compromises global convergence. A variant of simultaneous iteration
that avoids this difficulty is called the trace minimization method. This method was
proposed in 1982 [37]; see also [38]. Let X k be the current approximation to the
eigenvectors corresponding to the p smallest eigenvalues where XkᵀBXk = Ip. The
idea of the trace minimization scheme is to find a correction term Δk that is B-
orthogonal to Xk such that

tr((Xk − Δk)ᵀ A(Xk − Δk)) < tr(Xkᵀ AXk).

It follows that, for any B-orthonormal basis Xk+1 of the new subspace span{Xk − Δk},
we have

tr(Xk+1ᵀ AXk+1) < tr(Xkᵀ AXk),

i.e. span{X k − Δk } gives rise to a better approximation of the desired eigenspace


than span{X k }. This trace reduction property can be maintained without solving any
linear system exactly.
The trace minimization method can be accelerated via shifting strategies. The
introduction of shifts, however, may compromise the robustness of the trace mini-
mization scheme. Various techniques have been developed to prevent unstable con-
vergence (see the section on “Randomization” for details). A simple way to get
around this difficulty is to utilize expanding subspaces. This, in turn, places the trace
minimization method into a class of methods that includes the Lanczos method [53],
Davidson’s method [34], and the Jacobi-Davidson method [39, 54, 55].
The Lanczos method has become popular after the ground-breaking analysis by
Paige [20], resulting in many practical algorithms [17, 56–58] (see [10, 59] for an
overview). The original Lanczos algorithm was developed for handling the standard
eigenvalue problem only, i.e. B = I . Extensions to the generalized eigenvalue prob-

lem [60–62] require solving a linear system of the form Bx = b at each iteration
step, or factorizing matrices of the form (A − σ B) during each iteration. Davidson’s
method, which can be regarded as a preconditioned Lanczos method, was intended to
be a practical method for standard eigenvalue problems in quantum chemistry where
the matrices involved are diagonally dominant. In the past two decades, Davidson’s
method has gone through a series of significant improvements [35, 36, 63–65]. A
development is the Jacobi-Davidson method [39], published in 1996, which is a
variant of Davidson’s original scheme and the well-known Newton’s method. The
Jacobi-Davidson algorithm for the symmetric eigenvalue problem may be regarded
as a generalization of the trace minimization scheme (which was published 15 years
earlier) that uses expanding subspaces. Both utilize an idea that dates back to Ja-
cobi [66]. As we will see later, the current Jacobi-Davidson scheme can be further
improved by the techniques developed in the trace minimization method.
Throughout this section, the eigenpairs of (11.56) are denoted by (xi, λi), 1 ≤ i ≤ n,
with the eigenvalues arranged in ascending order.

11.5.1 Derivation of the Trace Minimization Algorithm

Here, we derive the trace minimization method originally presented in [37]. We


assume that A is positive definite, otherwise problem (11.56) can be replaced by

(A − μB)x = (λ − μ)Bx

with μ < λ1 < 0, that ensures a positive definite (A − μB).


The trace minimization method is motivated by the following theorem.

Theorem 11.7 (Beckenbach and Bellman [67], Sameh and Wisniewski [37]). Let A
and B be as given in Problem (11.56), and let X ∗ be the set of all n × p matrices X
for which XᵀBX = Ip, 1 ≤ p ≤ n. Then

min_{X∈X∗} tr(XᵀAX) = ∑_{i=1}^{p} λi.   (11.58)

where λ1 ≤ λ2 ≤ · · · ≤ λn are the eigenvalues of Problem (11.56). The equality


holds if and only if the columns of the matrix X, which achieves the minimum, span
the eigenspace corresponding to the smallest p eigenvalues.

If we denote by E/F the matrix E F−1 and by X the set of all n × p matrices of
full rank, then (11.58) is equivalent to

min_{X∈X} tr( XᵀAX / XᵀBX ) = ∑_{i=1}^{p} λi.

XᵀAX/XᵀBX is called the generalized Rayleigh quotient. Most of the early methods
that compute a few of the smallest eigenvalues are devised explicitly or implicitly
by reducing the generalized Rayleigh quotient step by step. A simple example is
the simultaneous iteration scheme for a positive definite matrix A where the current
approximation Xk is updated by (11.57). It can be shown by the Courant-Fischer
theorem [1] and the Kantorovič inequality [68, 69] that

λi( Xk+1ᵀAXk+1 / Xk+1ᵀBXk+1 ) ≤ λi( XkᵀAXk / XkᵀBXk ), 1 ≤ i ≤ p.   (11.59)

The equality holds only when Xk already spans an eigenspace of Problem (11.56).

Algorithm 11.12 Simultaneous iteration for the generalized eigenvalue problem.


1: Choose a block size s  p and an n × s matrix V1 of full rank such that V1 BV1 = Is .
2: do k = 1, 2, . . . until convergence,
3: Compute Wk = AVk and the interaction matrix Hk = Vk Wk .
4: Compute the eigenpairs (Yk , Θk ) of Hk . The eigenvalues are arranged in ascending order and
the eigenvectors are chosen to be orthogonal.
5: Compute the corresponding Ritz vectors X k = Vk Yk .
6: Compute the residuals Rk = Wk Yk − B X k Θk .
7: Test for convergence.
8: Solve the linear system

AZk+1 = BXk , (11.60)


by an iterative method.
9: B-orthonormalize Z k+1 into Vk+1 .
10: end
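As an illustration only, the following Python/NumPy sketch implements one possible reading of Algorithm 11.12 for small dense test matrices; the routine name, the random starting block, the Frobenius-norm convergence test, and the use of a direct solve in place of an iterative solution of (11.60) are our own assumptions rather than part of the algorithm as stated.

    import numpy as np
    from scipy.linalg import eigh, solve, cholesky

    def simultaneous_iteration(A, B, s, tol=1e-8, maxit=200):
        # Sketch of Algorithm 11.12 for small dense SPD matrices A and B.
        n = A.shape[0]
        V = np.linalg.qr(np.random.rand(n, s))[0]
        L = cholesky(V.T @ B @ V, lower=True)       # B-orthonormalize: V' B V = I_s
        V = V @ np.linalg.inv(L).T
        for _ in range(maxit):
            W = A @ V
            H = V.T @ W                             # interaction matrix H_k
            theta, Y = eigh(H)                      # Ritz values (ascending) and vectors
            X = V @ Y                               # Ritz vectors X_k
            R = W @ Y - B @ X @ np.diag(theta)      # residuals R_k
            if np.linalg.norm(R) < tol:
                break
            Z = solve(A, B @ X)                     # A Z_{k+1} = B X_k (direct solve here)
            L = cholesky(Z.T @ B @ Z, lower=True)   # B-orthonormalize Z_{k+1} into V_{k+1}
            V = Z @ np.linalg.inv(L).T
        return theta, X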

In [37], simultaneous iteration is derived in a way that allows exploration of the
trace minimization property explicitly. At each iteration step, the previous approxi-
mation X_k, which satisfies X_k^T B X_k = I_s and X_k^T A X_k = Θ_k, where Θ_k is diagonal
(with the ith element denoted by θ_{k,i}), is corrected with Δ_k that is obtained by

     minimizing  tr[(X_k − Δ_k)^T A (X_k − Δ_k)],
                                                                                (11.61)
     subject to  X_k^T B Δ_k = 0.

As a result, the matrix Z_{k+1} = X_k − Δ_k always satisfies

     tr(Z_{k+1}^T A Z_{k+1}) ≤ tr(X_k^T A X_k),                                  (11.62)

and

     Z_{k+1}^T B Z_{k+1} = I_s + Δ_k^T B Δ_k,                                    (11.63)

which guarantee that

     tr(X_{k+1}^T A X_{k+1}) ≤ tr(X_k^T A X_k)                                   (11.64)

for any B-orthonormal basis X k+1 of the subspace span{Z k+1 }. The equality in
(11.64) holds only when Δk = 0, i.e. X k spans an eigenspace of (11.56) (see
Theorem 11.10 for details).
Using Lagrange multipliers, the solution of the minimization problem (11.61) can
be obtained by solving the saddle-point problem

     ( A         B X_k ) ( Δ_k )   ( A X_k )
     ( X_k^T B   0     ) ( L_k ) = ( 0     ),                                    (11.65)

where L k represents the Lagrange multipliers. Several methods may be used to solve
(11.65) using either direct methods via the Schur complement or via preconditioned
iterative schemes, e.g. see the detailed survey [70]. In [37], (11.65) is further reduced
to solving the following positive-semidefinite system

(PAP)Δk = PAXk , (11.66)

where P is the orthogonal projector P = I − B X_k (X_k^T B^2 X_k)^{-1} X_k^T B. This system is


solved by the conjugate gradient method (CG) in which zero is chosen as the initial
iterate so that the linear constraint X_k^T B Δ_k^{(ℓ)} = 0 is automatically satisfied for any
intermediate Δ_k^{(ℓ)}. This results in the following basic trace minimization algorithm.

Algorithm 11.13 The basic trace minimization algorithm.


1: Choose a block size s  p and an n × s matrix V1 of full rank such that V1 BV1 = Is .
2: do k = 1, 2, . . . until convergence,
3: Compute Wk = AVk and the interaction matrix Hk = Vk Wk .
4: Compute the eigenpairs (Yk , Θk ) of Hk . The eigenvalues are arranged in ascending order and
the eigenvectors are chosen to be orthogonal.
5: Compute the corresponding Ritz vectors X k = Vk Yk .
6: Compute the residuals R_k = A X_k − B X_k Θ_k = W_k Y_k − B X_k Θ_k.
7: Test for convergence.
8: Solve the positive-semidefinite linear system (11.66) approximately via the CG scheme.
9: B-orthonormalize X k − Δk into Vk+1 .
10: end

From now on, we will refer to the linear system (11.66) in step 8 as the inner
system(s). It is easy to see that the exact solution of the inner system is

     Δ_k = X_k − A^{-1} B X_k (X_k^T B A^{-1} B X_k)^{-1},                        (11.67)

thus the subspace spanned by X k − Δk is the same subspace spanned by A−1 BXk . In
other words, if the inner system (11.66) is solved exactly at each iteration step, the
trace minimization algorithm above is mathematically equivalent to simultaneous

iteration. As a consequence, global convergence of the basic trace minimization


algorithm follows exactly from that of simultaneous iteration.
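To make step 8 of Algorithm 11.13 concrete, here is a hedged Python sketch of the projected system (11.66) solved by a block of CG iterations started from zero, so that every intermediate correction automatically satisfies X_k^T B Δ = 0; the dense projector, the fixed residual-reduction factor rtol, and the function name are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def tracemin_inner_cg(A, B, X, rtol=0.5, maxit=50):
        # Approximate (P A P) D = P A X with P = I - B X (X'B^2 X)^{-1} X'B,
        # running s independent CG iterations in lock-step from D = 0.
        BX = B @ X
        M = np.linalg.inv(BX.T @ BX)                # small s x s matrix (X' B^2 X)^{-1}
        P = lambda Y: Y - BX @ (M @ (BX.T @ Y))     # apply the orthogonal projector
        D = np.zeros_like(X)
        R = P(A @ X)                                # initial residual (since D = 0)
        Q = R.copy()
        rho = np.sum(R * R, axis=0)
        rho0 = rho.copy()
        for _ in range(maxit):
            W = P(A @ Q)                            # P A P Q = P A Q because Q = P Q
            alpha = rho / np.sum(Q * W, axis=0)
            D += Q * alpha
            R -= W * alpha
            rho_new = np.sum(R * R, axis=0)
            if np.all(np.sqrt(rho_new) <= rtol * np.sqrt(rho0)):
                break
            Q = R + Q * (rho_new / rho)
            rho = rho_new
        return D                                    # X - D spans the next subspace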

Theorem 11.8 ([1, 4, 37]) Let A and B be positive definite and let s  p be the block
size such that the eigenvalues of Problem (11.56) satisfy 0 < λ1  λ2  · · ·  λs <
λs+1  · · ·  λn . Let also the initial iterate X 0 be chosen such that it has linearly
independent columns and is not deficient in any eigen-component associated with
the p smallest eigenvalues. Then the ith column of X_k, denoted by x_{k,i}, converges to
the eigenvector x_i corresponding to λ_i for i = 1, 2, . . . , p, with an asymptotic rate
of convergence bounded by λ_i/λ_{s+1}. More specifically, at each step, the error

     φ_i = (x_{k,i} − x_i)^T A (x_{k,i} − x_i)                                    (11.68)

is reduced asymptotically by a factor of (λ_i/λ_{s+1})^2.

The main difference between the trace minimization algorithm and simultaneous
iteration is in step 8. If both (11.60) and (11.66) are solved via the CG scheme exactly,
the performance of either algorithm is comparable in terms of time consumed, as
observed in practice. The additional cost in performing the projection P at each CG
step (once rather than twice) is not high because the block size s is usually small,
i.e. s ≪ n. This additional cost is sometimes compensated for by the fact that PAP,
when it is restricted to the subspace {v ∈ R n |Pv = v}, is better conditioned than A
as will be seen in the following theorem.

Theorem 11.9 Let A and B be as given in Theorem 11.8 and P be given as in (11.66),
and let νi , μi , 1  i  n be the eigenvalues of A and PAP arranged in ascending
order, respectively. Then, we have

     0 = μ_1 = μ_2 = · · · = μ_s < ν_1 ≤ μ_{s+1} ≤ μ_{s+2} ≤ · · · ≤ μ_n ≤ ν_n.

Proof The proof is a straightforward consequence of the Courant-Fischer theorem


[1], and hence omitted.

11.5.2 Practical Considerations

In practice, however, the inner systems (11.66) are always solved approximately,
particularly for large problems. Note that the error (11.68) in the ith column of X k
is reduced asymptotically by a factor of (λi /λs+1 )2 at each iteration step. Thus, we
should not expect high accuracy in the early Ritz vectors even if the inner systems are
solved to machine precision. Further, convergence of the trace minimization scheme
is guaranteed if a constant relative residual tolerance is used for the inner system
(11.66) in each outer iteration.

A Convergence Result
We prove convergence of the trace minimization algorithm under the assumption
that the inner systems in (11.66) are solved inexactly. We assume that, for each i,
1  i  s, the ith inner system in (11.66) is solved approximately by the CG scheme
with zero as the initial iterate such that the 2-norm of the residual is reduced by a factor
γ < 1. The computed correction matrix will be denoted by Δ_k^c = (d_{k,1}^c, d_{k,2}^c, . . . , d_{k,s}^c)
to distinguish it from the exact solution Δ_k = (d_{k,1}, d_{k,2}, . . . , d_{k,s}) of (11.66).
We begin the convergence proof with two lemmas. We first show that, in each
iteration, the columns of X_k − Δ_k^c are linearly independent, and the sequence {X_k}_{k=0}^∞
in the trace minimization algorithm is well-defined. In the second, we show that the
computed correction matrix Δ_k^c satisfies

     tr[(X_k − Δ_k^c)^T A (X_k − Δ_k^c)] ≤ tr(X_k^T A X_k).

This assures that, no matter how prematurely the CG process is terminated, tr(X_k^T A X_k)
always forms a decreasing sequence bounded from below by Σ_{i=1}^{s} λ_i.

Lemma 11.2 For each k = 0, 1, 2, . . ., Z_{k+1} = X_k − Δ_k^c is of full column rank.

Proof Since d_{k,i}^c is an intermediate approximation obtained from the CG process,
there exists a polynomial p(t) such that

     d_{k,i}^c = p(PAP)(P A x_{k,i}),

where x_{k,i} is the ith column of X_k and P is the projector in (11.66). As a consequence,
for each i, d_{k,i}^c is B-orthogonal to X_k, i.e. X_k^T B d_{k,i}^c = 0. Thus the matrix

     Z_{k+1}^T B Z_{k+1} = I_s + (Δ_k^c)^T B Δ_k^c

is nonsingular, and Z_{k+1} is of full column rank. □




Lemma 11.3 Suppose that the inner systems in (11.66) are solved by the CG scheme
with zero as the initial iterate. Then, for each i, (x_{k,i} − d_{k,i}^{(ℓ)})^T A (x_{k,i} − d_{k,i}^{(ℓ)})
decreases monotonically with respect to the step ℓ of the CG scheme.

Proof The exact solution of the inner system (11.66) is given by

     Δ_k = X_k − A^{-1} B X_k (X_k^T B A^{-1} B X_k)^{-1}

for which PΔ_k = Δ_k. For each i, 1 ≤ i ≤ s, the intermediate d_{k,i}^{(ℓ)} in the CG process
also satisfies P d_{k,i}^{(ℓ)} = d_{k,i}^{(ℓ)}. Thus, it follows that

     (d_{k,i} − d_{k,i}^{(ℓ)})^T PAP (d_{k,i} − d_{k,i}^{(ℓ)}) = (d_{k,i} − d_{k,i}^{(ℓ)})^T A (d_{k,i} − d_{k,i}^{(ℓ)})
          = (x_{k,i} − d_{k,i}^{(ℓ)})^T A (x_{k,i} − d_{k,i}^{(ℓ)}) − e_i^T (X_k^T B A^{-1} B X_k)^{-1} e_i.

Since the CG process minimizes the PAP-norm of the error d_{k,i} − d_{k,i}^{(ℓ)}
on the expanding Krylov subspace [37], both (d_{k,i} − d_{k,i}^{(ℓ)})^T PAP (d_{k,i} − d_{k,i}^{(ℓ)}) and
(x_{k,i} − d_{k,i}^{(ℓ)})^T A (x_{k,i} − d_{k,i}^{(ℓ)}) decrease monotonically. □


Theorem 11.10 Let X_k, Δ_k^c, and Z_{k+1} be as given in Lemma 11.2. Then
lim_{k→∞} Δ_k^c = 0.

Proof By the definition of Δ_k^c, we have

     Z_{k+1}^T B Z_{k+1} = I_s + (Δ_k^c)^T B Δ_k^c ≡ I_s + T_k.

From the spectral decomposition,

     Z_{k+1}^T B Z_{k+1} = U_{k+1} D_{k+1}^2 U_{k+1}^T,

where U_{k+1} is an s × s orthogonal matrix and D_{k+1}^2 = diag(δ_1^{(k+1)}, δ_2^{(k+1)}, . . . , δ_s^{(k+1)}),
we see that δ_i^{(k+1)} = 1 + λ_i(T_k) ≥ 1. Further, from the definition of X_{k+1}, there
exists an orthogonal matrix V_{k+1} for which

     X_{k+1} = Z_{k+1} U_{k+1} D_{k+1}^{-1} V_{k+1}.

Denoting by z_i^{(k+1)} the diagonal elements of the matrix U_{k+1}^T Z_{k+1}^T A Z_{k+1} U_{k+1}, it
follows that

     tr(X_{k+1}^T A X_{k+1}) = tr( D_{k+1}^{-1} U_{k+1}^T Z_{k+1}^T A Z_{k+1} U_{k+1} D_{k+1}^{-1} )
                             = z_1^{(k+1)}/δ_1^{(k+1)} + z_2^{(k+1)}/δ_2^{(k+1)} + · · · + z_s^{(k+1)}/δ_s^{(k+1)}
                             ≤ z_1^{(k+1)} + z_2^{(k+1)} + · · · + z_s^{(k+1)}
                             = tr(Z_{k+1}^T A Z_{k+1})
                             ≤ tr(X_k^T A X_k),

which implies that

     · · · ≥ tr(X_k^T A X_k) ≥ tr(Z_{k+1}^T A Z_{k+1}) ≥ tr(X_{k+1}^T A X_{k+1}) ≥ · · · .

Since the sequence is bounded from below by Σ_{i=1}^{s} λ_i, it converges to a positive
number t ≥ Σ_{i=1}^{s} λ_i. Moreover, the two sequences

     z_1^{(k+1)}/δ_1^{(k+1)} + z_2^{(k+1)}/δ_2^{(k+1)} + · · · + z_s^{(k+1)}/δ_s^{(k+1)},   k = 1, 2, . . .

and

     z_1^{(k+1)} + z_2^{(k+1)} + · · · + z_s^{(k+1)},   k = 1, 2, . . .

also converge to t. Therefore,

     z_1^{(k+1)} λ_1(T_k)/(1 + λ_1(T_k)) + z_2^{(k+1)} λ_2(T_k)/(1 + λ_2(T_k)) + · · · + z_s^{(k+1)} λ_s(T_k)/(1 + λ_s(T_k)) → 0.

Observing that for any i, 1 ≤ i ≤ s,

     z_i^{(k+1)} ≥ λ_1( U_{k+1}^T Z_{k+1}^T A Z_{k+1} U_{k+1} )
                = λ_1( Z_{k+1}^T A Z_{k+1} )
                = min_{y≠0} (y^T Z_{k+1}^T A Z_{k+1} y) / (y^T y)
                = min_{y≠0} [ (y^T Z_{k+1}^T A Z_{k+1} y) / (y^T Z_{k+1}^T B Z_{k+1} y) ] · [ (y^T Z_{k+1}^T B Z_{k+1} y) / (y^T y) ]
                ≥ min_{y≠0} (y^T Z_{k+1}^T A Z_{k+1} y) / (y^T Z_{k+1}^T B Z_{k+1} y)
                ≥ λ_1(A, B)
                > 0,

we have

     λ_i(T_k) → 0,   i = 1, 2, . . . , s,

i.e. lim_{k→∞} Δ_k^c = 0. □

Theorem 11.11 If for each 1 ≤ i ≤ s, the CG process for the ith inner system
(11.66),

     (PAP) d_{k,i} = P A x_{k,i},   d_{k,i}^T B X_k = 0,

is terminated such that the 2-norm of the residual is reduced by a factor γ < 1, i.e.

     ||P A x_{k,i} − (PAP) d_{k,i}^c||_2 ≤ γ ||P A x_{k,i}||_2,                   (11.69)

then the columns of X_k converge to s eigenvectors of Problem (11.56).



Proof Condition (11.69) implies that

     ||P A x_{k,i}||_2 − ||P A d_{k,i}^c||_2 ≤ γ ||P A x_{k,i}||_2,

and consequently

     ||P A x_{k,i}||_2 ≤ (1/(1 − γ)) ||P A d_{k,i}^c||_2.

It follows from Theorem 11.10 that lim_{k→∞} P A X_k = 0, i.e.

     lim_{k→∞} [ A X_k − B X_k (X_k^T B^2 X_k)^{-1} X_k^T B A X_k ] = 0.

In other words, span{X_k} converges to an eigenspace of Problem (11.56). □




Randomization
Condition (11.69) in Theorem 11.11 is not essential because the constant γ can
be arbitrarily close to 1. The only deficiency in Theorem 11.11 is that it does not
establish ordered convergence in the sense that the ith column of X k converges to the
ith eigenvector of the problem. This is called unstable convergence in [4]. In practice,
roundoff errors turn unstable convergence into delayed stable convergence. In [4], a
randomization technique to prevent unstable convergence in simultaneous iteration
was introduced. Such an approach can be incorporated into the trace minimization
algorithm as well: After step 8 of Algorithm 11.13, we append a random vector to
X k and perform the Ritz processes 3 and 4 on the augmented subspace of dimension
s + 1. The extra Ritz pair is discarded after step 4.
Randomization slightly improves the convergence of the first s Ritz pairs [47].
Since it incurs additional cost, it should be used only in the first few steps when a
Ritz pair is about to converge.
Terminating the CG Process
Theorem 11.11 gives a sufficient condition for the convergence of the trace minimiza-
tion algorithm. However, the asymptotic rate of convergence of the trace minimiza-
tion algorithm will be affected by the premature termination of the CG processes.
The algorithm behaves differently when the inner systems are solved inexactly. It
is not clear how the parameter γ should be chosen to avoid performing excessive
CG iterations while maintaining the asymptotic rate of convergence. In [37], the CG
process is terminated by a heuristic stopping strategy.
Let d_{k,i}^{(ℓ)} be the approximate solution at the ℓth step of the CG process for the ith
column of X_k and d_{k,i} the exact solution; then the heuristic stopping strategy in [37]
can be outlined as follows:

1. From Theorem 11.8, it is reasonable to terminate the CG process for the ith column
   of Δ_k when the error

        ε_{k,i}^{(ℓ)} = [ (d_{k,i} − d_{k,i}^{(ℓ)})^T A (d_{k,i} − d_{k,i}^{(ℓ)}) ]^{1/2}

   is reduced by a factor of τ_i = λ_i/λ_{s+1}, called the error reduction factor.


2. The quantity ε_{k,i}^{(ℓ)} can be estimated by

        [ (d_{k,i}^{(ℓ)} − d_{k,i}^{(ℓ+1)})^T A (d_{k,i}^{(ℓ)} − d_{k,i}^{(ℓ+1)}) ]^{1/2},

   which is readily available from the CG process.


3. The error reduction factor τi = λi /λs+1 , 1  i  s, can be estimated by the
ratio of the Ritz values τ̃k,i = θk,i /θk,s+1 . Since θk,s+1 is not available, θk−1,s is
used instead and is fixed after a few steps because it will eventually converge to
λs rather than λs+1 .

11.5.3 Acceleration Techniques

For problems in which the desired eigenvalues are poorly separated from the remain-
ing part of the spectrum, the algorithm converges slowly. Like other inverse iteration
schemes, the trace minimization algorithm can be accelerated by shifting. Actually,
the formulation of the trace minimization algorithm makes it easier to incorporate
shifts. For example, if the Ritz pairs (xi , θi ), 1  i  i 0 , have been accepted as
eigenpairs and θi0 < θi0 +1 , then θi0 can be used as a shift parameter for computing
subsequent eigenpairs. Due to the deflation effect, the linear systems

     [P(A − θ_{i_0} B)P] d_{k,i} = P A x_{k,i},   X_k^T B d_{k,i} = 0,   i_0 + 1 ≤ i ≤ s,

are consistent and can still be solved by the CG scheme. Moreover, the trace reduc-
tion property still holds. In the following, we introduce two more efficient shifting
techniques which improve further the performance of the trace minimization algo-
rithm.
Single Shift
We know from Sect. 11.5.1 that global convergence of the trace minimization al-
gorithm follows from the monotonic reduction of the trace, which in turn depends
on the positive definiteness of A. A simple and robust shifting strategy would be
finding a scalar σ close to λ1 from below and replace A by (A − σ B) in step 8 of
Algorithm 11.13. After the first eigenvector has converged, find another σ close to
λ2 from below and continue until all the desired eigenvectors are obtained. If both
A and B are explicitly available, it is not hard to find a σ satisfying σ ≤ λ_1.

In the trace minimization algorithm, the subspace spanned by X k converges to the


invariant subspace Vs corresponding to the s smallest eigenvalues. If the subspace
spanned by X k is close enough to Vs , a reasonable bound for the smallest eigenvalue
can be obtained. More specifically, let Q be a B-orthonormal matrix obtained by
appending n − s columns to X_k, i.e. Q = (X_k, Y_k) and Q^T B Q = I_n. Then Problem
(11.56) is reduced to the standard eigenvalue problem

     (Q^T A Q) u = λ u.                                                          (11.70)

Since

     Q^T A Q = ( Θ_k           X_k^T A Y_k )   ( Θ_k    C_k         )
               ( Y_k^T A X_k   Y_k^T A Y_k ) = ( C_k^T  Y_k^T A Y_k ),           (11.71)

by the Courant-Fischer theorem, we have

     λ_1 ≥ λ_min( [Θ_k, 0; 0, Y_k^T A Y_k] ) + λ_min( [0, C_k; C_k^T, 0] )
         ≥ min{ θ_1, λ_1(Y_k^T A Y_k) } − ||C_k||_2.

Similarly [1], it is easy to show that ||C_k||_2 = ||R_k||_{B^{-1}}, in which R_k = A X_k − B X_k Θ_k
is the residual matrix. If

     θ_{k,1} ≤ λ_1(Y_k^T A Y_k),                                                 (11.72)

we get

     λ_1 ≥ θ_{k,1} − ||R_k||_{B^{-1}}.                                           (11.73)

In particular, if (11.72) holds for the orthonormal complement of x_{k,1}, we have

     λ_1 ≥ θ_{k,1} − ||r_{k,1}||_{B^{-1}}.                                       (11.74)

This heuristic bound for the smallest eigenvalue suggests the following shifting
strategy (we denote λ_0 = −∞). If the first i_0 ≥ 0 eigenvalues have converged,
use σ = max{λ_{i_0}, θ_{k,i_0+1} − ||r_{k,i_0+1}||_{B^{-1}}} as the shift parameter. If θ_{k,i_0+1} lies in
a cluster, replace the B^{-1}-norm of r_{k,i_0+1} by the B^{-1}-norm of the residual matrix
corresponding to the cluster containing θ_{k,i_0+1}.
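As a small illustration of this rule, the following sketch (with hypothetical names, and a dense solve for the B^{-1}-norm) computes the shift σ = max{λ_{i_0}, θ_{k,i_0+1} − ||r_{k,i_0+1}||_{B^{-1}}} once the first i_0 eigenvalues have been accepted.

    import numpy as np

    def single_shift(theta, R, B, accepted):
        # theta: current Ritz values (ascending); R: residual columns A x - theta B x;
        # accepted: list of already converged eigenvalues (length i0).
        i0 = len(accepted)
        r = R[:, i0]                                  # residual of Ritz pair i0 + 1
        r_binv = np.sqrt(r @ np.linalg.solve(B, r))   # ||r||_{B^{-1}}
        lam_prev = accepted[-1] if i0 > 0 else -np.inf
        return max(lam_prev, theta[i0] - r_binv)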
Multiple Dynamic Shifts
In [37], the trace minimization algorithm is accelerated with a more aggressive shift-
ing strategy. At the beginning of the algorithm, a single shift is used for all the

columns of X k . As the algorithm proceeds, multiple shifts are introduced dynami-


cally and the CG process is modified to handle possible breakdown. This shifting
strategy is motivated by the following theorem.
Theorem 11.12 ([1]) For an arbitrary nonzero vector u and scalar σ, there is an
eigenvalue λ of (11.56) such that

     |λ − σ| ≤ ||(A − σB)u||_{B^{-1}} / ||Bu||_{B^{-1}}.

We know from the Courant-Fischer theorem that the targeted eigenvalue λi is always
below the Ritz value θk,i . Further, from Theorem 11.12, if θk,i is already very close
to the targeted eigenvalue λ_i, then λ_i must lie in the interval [θ_{k,i} − ||r_{k,i}||_{B^{-1}}, θ_{k,i}].
This observation leads to the following shifting strategy for the trace minimization
algorithm. At step k of the outer iteration, the shift parameters σk,i , 1  i  s, are
determined by the following rules (here, λ0 = −∞ and the subscript k is dropped
for the sake of simplicity):
1. If the first i_0, i_0 ≥ 0, eigenvalues have converged, choose

      σ_{k,i_0+1} = θ_{i_0+1}                                        if θ_{i_0+1} + ||r_{i_0+1}||_{B^{-1}} ≤ θ_{i_0+2} − ||r_{i_0+2}||_{B^{-1}},
      σ_{k,i_0+1} = max{ θ_{i_0+1} − ||r_{i_0+1}||_{B^{-1}}, λ_{i_0} }   otherwise.

2. For any other column j, i_0 + 1 < j ≤ p, choose the largest θ_l such that

      θ_l < θ_j − ||r_j||_{B^{-1}}

   as the shift parameter σ_j. If no such θ_l exists, use θ_{i_0+1} instead.


3. Choose σ_i = θ_i if θ_{i−1} has been used as the shift parameter for column (i − 1)
   and

      θ_i < θ_{i+1} − ||r_{i+1}||_{B^{-1}}.

4. Use σi0 +1 as the shift parameter for other columns if any.


Solving the Inner Systems
With the multiple shifts, the inner systems in (11.66) become

     [P(A − σ_{k,i} B)P] d_{k,i} = P A x_{k,i},   X_k^T B d_{k,i} = 0,   1 ≤ i ≤ s,      (11.75)

with P = I − B X_k (X_k^T B^2 X_k)^{-1} X_k^T B. Clearly, the linear systems could be indefinite,


and the CG process for such systems may break down. A simple way to get around
this problem is either by using another solver such as MINRES (e.g. see [71] or [72]),
or terminating the CG process when a near breakdown is detected. In [37], the CG
process is also terminated when the error (x_{k,i} − d_{k,i}^{(ℓ)})^T A (x_{k,i} − d_{k,i}^{(ℓ)}) increases by
a small factor. This helps maintain global convergence which is not guaranteed in
the presence of shifting.

11.5.4 A Davidson-Type Extension

The shifting strategies described in the previous section, improve the performance
of the trace minimization algorithm considerably. Although the randomization tech-
nique, the shifting strategy, and the roundoff error actually make the algorithm sur-
prisingly robust for a variety of problems, further measures to guard against unstable
convergence are necessary for problems in which the desired eigenvalues are clus-
tered. A natural way to maintain stable convergence is by using expanding subspaces
with which the trace reduction property is automatically maintained.
The best-known method that utilizes expanding subspaces is that of Lanczos. It
uses the Krylov subspaces to compute an approximation of the desired eigenpairs,
usually the largest. This idea was adopted by Davidson, in combination with the si-
multaneous coordinate relaxation method, to obtain what he called the “compromise
method” [34], known as Davidson's method today. In this section, we generalize the
trace minimization algorithm described in the previous sections by casting it into the
framework of the Davidson method. We start by the Jacobi-Davidson method, ex-
plore its connections to the trace minimization method, and develop a Davidson-type
trace minimization algorithm.
The Jacobi-Davidson Method
As mentioned in Sect. 11.5, the Jacobi-Davidson scheme is a modification of the
Davidson method. It uses the same ideas presented in the trace minimization method
to compute a correction term to a previously computed Ritz pair, but with a different
objective. In the Jacobi-Davidson method, for a given Ritz pair (x_i, θ_i) with x_i^T B x_i = 1,
a correction vector d_i is sought such that

     A(x_i + d_i) = λ_i B(x_i + d_i),   x_i^T B d_i = 0,                         (11.76)

where λi is the eigenvalue targeted by θi . Since the targeted eigenvalue λi is not


available during the iteration, it is replaced by an approximation σi . Ignoring high-
order terms in (11.76), we get

     ( A − σ_i B   B x_i ) ( d_i )   ( −r_i )
     ( x_i^T B     0     ) ( l_i ) = (  0   ),                                   (11.77)

where ri = Axi − θi Bxi is the residual vector associated with the Ritz pair (xi , θi ).
Note that replacing ri with Axi does not affect di . In [39, 55], the Ritz value θi is
used in place of σi at each step.
The block Jacobi-Davidson algorithm, described in [55], may be outlined as shown
in Algorithm 11.14 which can be regarded as a trace minimization scheme with
expanding subspaces.
The performance of the block Jacobi-Davidson algorithm depends on how good
the initial guess is and how efficiently and accurately the inner system (11.78) is
solved. If the right-hand side of (11.78) is taken as the approximate solution to the
inner system (11.78), the algorithm is reduced to the Lanczos method. If the inner

Algorithm 11.14 The block Jacobi-Davidson algorithm.


Choose a block size s  p and an n × s matrix V1 such that V1 BV1 = Is .
For k = 1, 2, . . . until convergence, do
1: Compute Wk = AVk and the interaction matrix Hk = Vk Wk .
2: Compute the s smallest eigenpairs (Yk , Θk ) of Hk . The eigenvalues are arranged in ascending
order and the eigenvectors are chosen to be orthogonal.
3: Compute the corresponding Ritz vectors X k = Vk Yk .
4: Compute the residuals Rk = Wk Yk − B X k Θk .
5: Test for convergence.
6: for 1 ≤ i ≤ s, solve the indefinite system

        ( A − θ_i B    B x_{k,i} ) ( d_{k,i} )   ( r_{k,i} )
        ( x_{k,i}^T B  0         ) ( l_{k,i} ) = ( 0       ),                    (11.78)

   or preferably its projected form

        [P_i (A − θ_{k,i} B) P_i] d_{k,i} = P_i r_{k,i},   x_{k,i}^T B d_{k,i} = 0,        (11.79)

   approximately, where P_i = I − B x_{k,i} (x_{k,i}^T B^2 x_{k,i})^{-1} x_{k,i}^T B is an orthogonal projector,
   and r_{k,i} = A x_{k,i} − θ_{k,i} B x_{k,i} is the residual corresponding to the Ritz pair (x_{k,i}, θ_{k,i}).
7: If dim(Vk )  m − s, then

Vk+1 = ModG S B (Vk , Δk ),

else

Vk+1 = ModG S B (X k , Δk ).

Here, ModG S B stands for the Gram-Schmidt process with reorthogonalization [73] with respect
to B-inner products, i.e. (x, y) = x  By.
End for

system (11.78) is solved to a low relative residual, it is reduced to the simultaneous


Rayleigh quotient iteration (RQI, see [1]) with expanding subspaces, which con-
verges cubically. If the inner system (11.78) is solved to a modest relative residual,
the performance of the algorithm is in-between. In practice, however, the stage of
cubic convergence is often reached after many iterations. The algorithm almost al-
ways “stagnates” at the beginning. Increasing the number of iterations for the inner
systems makes little difference or, in some cases, even derails convergence to the
desired eigenpairs. Note that the Ritz shifting strategy in the block Jacobi-Davidson
algorithm forces convergence to eigenpairs far away from the desired ones. However,
since the subspace is expanding, the Ritz values are decreasing and the algorithm is
forced to converge to the smallest eigenpairs.
Another issue with the block Jacobi-Davidson algorithm is ill-conditioning. At the
end of each outer Jacobi-Davidson iteration, as the Ritz values approach a multiple
eigenvalue or a cluster of eigenvalues, the inner system (11.79) becomes poorly
conditioned. This makes it difficult for an iterative solver to compute even a crude
approximation of the solution of the inner system.

All these problems can be partially solved by the techniques developed in the
trace minimization method, i.e. the multiple dynamic shifting strategy, the implicit
deflation technique (dk,i is required to be B-orthogonal to all the Ritz vectors obtained
in the previous iteration step), and the dynamic stopping strategy. We call the modified
algorithm a Davidson-type trace minimization algorithm [38].
The Davidson-Type Trace Minimization Algorithm
Let s  p be the block size, m  s be a given integer that limits the dimension of
the subspaces. The Davidson-type trace minimization algorithm is given by Algo-
rithm 11.15. The orthogonality requirement d_i^{(k)} ⊥ B X_k is essential in the original

Algorithm 11.15 The Davidson-type trace minimization algorithm.


Choose a block size s  p and an n × s matrix V1 such that V1 BV1 = Is .
For k = 1, 2, . . . until convergence, do
1: Compute Wk = AVk and the interaction matrix Hk = Vk Wk .
2: Compute the s eigenpairs (Yk , Θk ) of Hk . The eigenvalues are arranged in ascending order and
the eigenvectors are chosen to be orthogonal.
3: Compute the corresponding Ritz vectors X k = Vk Yk .
4: Compute the residuals Rk = Wk Yk − B X k Θk .
5: Test for convergence.
6: for 1  i  s, solve the indefinite system

        [P(A − σ_{k,i} B)P] d_{k,i} = P r_{k,i},   X_k^T B d_{k,i} = 0,          (11.80)


to a certain accuracy determined by the stopping criterion described in Sect. 11.5.2. The shift
parameters σk,i , 1  i  s, are determined according to the dynamic shifting strategy described
in Sect. 11.5.3.
7: If dim(Vk )  m − s, then

Vk+1 = ModG S B (Vk , Δk ),

else

Vk+1 = ModG S B (X k , Δk ).

End for

trace minimization algorithm for maintaining the trace reduction property (11.64). In
the current algorithm, it appears primarily as an implicit deflation technique. A more
efficient approach is to require d_i^{(k)} to be B-orthogonal only to “good” Ritz vectors.
The number of outer iterations realized by this scheme is decreased compared to
the trace minimization algorithm in Sect. 11.5.3, and compared to the block Jacobi-
Davidson algorithm. In the block Jacobi-Davidson algorithm, the number of outer
iterations cannot be reduced further when the number of iterations for the inner sys-
tems is increased. However, in the Davidson-type trace minimization algorithm, the
number of outer iterations decreases steadily with increasing the number of iterations
for the inner systems. Note that reducing the number of outer iterations enhances the
efficiency of implementation on parallel architectures.

11.5.5 Implementations of TraceMIN

In this section, we outline two possible implementation schemes for algorithm


TraceMIN on a cluster of multicore nodes interconnected via a high performance
network.
One version is aimed at seeking a few of the smallest eigenpairs, while the other
is more suitable if we seek a larger number of eigenvalues in the interior of the
spectrum, together with the corresponding eigenvectors. We refer to these versions
as TraceMIN_1 and TraceMIN_2, respectively.
TraceMIN_1
In TraceMIN_1, i.e. Algorithm 11.13, we assume that the matrices A and B of the
generalized eigenvalue problem are distributed across multiple nodes. We need the
following kernels and algorithms on a distributed memory architecture, in which
multithreading is used within each node.

• Multiplication of a sparse matrix by a tall narrow dense matrix (multivectors)


• Global reduction
• B-orthonormalization
• Solving sparse saddle-point problems via direct or iterative schemes

All other operations in TraceMIN_1 do not require internode communications,


and can be expressed as simple multithreaded BLAS or LAPACK calls. More specif-
ically, the computation of all the eigenpairs of Hk can be realized on a single node
using a multithreaded LAPACK eigensolver since Hk is of a relatively small size
s ≪ n.
In order to minimize the number of costly global communications, namely the all-
reduce operations, one should group communications in order to avoid low parallel
efficiency.
Sparse Matrix-multivector Multiplication
A sparse matrix multiplication Y = AX requires communication of some elements of
the tall-dense matrix X (or a single vector x). The amount of communication required
depends on the sparsity pattern of the matrix. For example, a narrow banded matrix
requires only nearest neighbor communication, but a general sparse matrix requires
more internode communications. Once all the relevant elements of X are obtained
on a given node, one may use a multithreaded sparse matrix multiplication scheme.
Global Reduction
There are two types of global reduction in our algorithm: multiplying two dense
matrices, and concurrently computing the 2-norms of many vectors. In the first case,
one may call a multithreaded dense matrix multiplication routine on each node, and
perform all-reduce to obtain the result. In the second, each node repeatedly calls a
dot product routine, followed by one all-reduce to obtain all the inner products.

B-orthonormalization
This may be achieved via the eigendecomposition of matrices of the form W^T B W
or W^T A W to obtain a section of the generalized eigenvalue problem, i.e. to obtain a
matrix V (with s columns) for which V^T A V is diagonal, and V^T B V is the identity
matrix of order s. This algorithm requires one call to a sparse matrix-multivector
multiplication kernel, one global reduction operation to compute H , one call to a
multithreaded dense eigensolver, one call to a multithreaded dense matrix-matrix
multiplication routine, and s calls to a multithreaded vector scaling routine. Note
that only two internode communication operations are necessary. In addition, one
can take full advantage of the multicore architecture of each node.
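A minimal Python sketch of this kernel is given below, assuming a user-supplied routine B_mult that applies the sparse matrix B to a block of vectors; it produces V with V^T B V = I_s (obtaining the full section, with V^T A V diagonal as well, additionally requires the Ritz step of the outer iteration).

    import numpy as np
    from scipy.linalg import eigh

    def b_orthonormalize(W, B_mult):
        BW = B_mult(W)                 # one sparse matrix-multivector product
        H = W.T @ BW                   # small s x s matrix (one global reduction)
        d, Q = eigh(H)                 # dense eigensolver, run redundantly on each node
        return (W @ Q) / np.sqrt(d)    # dense matrix product plus s column scalings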
Linear System Solver
TraceMIN does not require a very small relative residual in solving the saddle point
problem, hence we may use a modest stopping criterion when solving the linear
system (11.66). Further, one can simultaneously solve the s independent linear sys-
tems via a single call to the CG scheme. The main advantage of such a procedure is
that one can group the internode communications, thereby reducing the associated
cost of the CG algorithm as compared with solving the systems one at a time. For
instance, in the sparse matrix-vector multiplication and the global reduction schemes
outlined above, grouping the communications results in fewer MPI communication
operations.
We now turn our attention to TraceMIN_2, which efficiently computes a larger
number of eigenpairs corresponding to any interval in the spectrum.
TraceMIN_2: Trace Minimization with Multisectioning
In this implementation of TraceMIN, similar to our implementation of TREPS (see
Sect. 11.1.2), we use multisectioning to divide the interval under consideration into
a number of smaller subintervals. Our hybrid MPI+OpenMP approach assigns dis-
tinct subsets of the subintervals to different nodes. Unlike the implementation of
TraceMIN_1, we assume that each node has a local memory capable of storing all
elements of A and B. For each subinterval [a_i, a_{i+1}], we compute the LDL^T
factorization of A − a_i B and A − a_{i+1} B to determine the inertia at each end-
point of the interval. This yields the number of eigenvalues of Ax = λBx that
lie in the subinterval [a_i, a_{i+1}]. TraceMIN_1 is then used on each node, taking
full advantage of as many cores as possible, to compute the lowest eigenpairs of
(A − σ B)x = (λ − σ )Bx, in which σ is the mid-point of each corresponding subin-
terval. Thus, TraceMIN_2 has two levels of parallelism: multiple nodes working
concurrently on different sets of subintervals, with each node using multiple cores
for implementing the TraceMIN_1 iterations on a shared memory architecture.
Note that this algorithm requires no internode communication after the intervals are
selected, which means it scales extremely well. We employ a recursive scheme to
divide the interval into subintervals. We select a variable n e , denoting the maximum
number of eigenvalues allowed to belong to any subinterval. Consequently, any in-
terval containing more than n e eigenvalues is divided in half. This process is repeated

until we have many subintervals, each containing no more than n_e
eigenvalues. We then assign the subintervals amongst the nodes so that each node
is in charge of roughly the same number of subintervals. Note that TraceMIN_2
requires only one internode communication since the intervals can be subdivided in
parallel.
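A hedged sketch of the recursive subdivision follows; it assumes a routine inertia(sigma) returning the number of eigenvalues of Ax = λBx below sigma, obtained in practice from the LDL^T factorization of A − σB (not shown here).

    def subdivide(a, b, inertia, n_e):
        # Split [a, b] until each subinterval holds at most n_e eigenvalues.
        count = inertia(b) - inertia(a)
        if count <= n_e:
            return [(a, b, count)]
        mid = 0.5 * (a + b)
        return subdivide(a, mid, inertia, n_e) + subdivide(mid, b, inertia, n_e)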

11.6 The Sparse Singular-Value Problem

11.6.1 Basics

Algorithms for computing few of the singular triplets of large sparse matrices consti-
tute a powerful computational tool that has a significant impact on numerous science
and engineering applications. The singular-value decomposition (partial or complete)
is used in data analysis, model reduction, matrix rank estimation, canonical corre-
lation analysis, information retrieval, seismic reflection tomography, and real-time
signal processing.
In what follows, similar to the basic material we introduced in Chap. 8 about the
singular-value decomposition, we introduce the notations we use in this chapter,
together with additional basic facts related to the singular-value problem.
The numerical accuracy of the ith approximate singular triplet (ũ i , σ̃i , ṽi ) of the
real matrix A can be determined via the eigensystem of the 2-cyclic matrix

     A_aug = ( 0     A )
             ( A^T   0 ),                                                        (11.81)

see Theorem 8.1, and is measured by the norm of the residual vector r_i given by

     ||r_i||_2 = || A_aug (ũ_i^T, ṽ_i^T)^T − σ̃_i (ũ_i^T, ṽ_i^T)^T ||_2 / ( ||ũ_i||_2^2 + ||ṽ_i||_2^2 )^{1/2},

which can also be written as

     ||r_i||_2^2 = ( ||A ṽ_i − σ̃_i ũ_i||_2^2 + ||A^T ũ_i − σ̃_i ṽ_i||_2^2 ) / ( ||ũ_i||_2^2 + ||ṽ_i||_2^2 ),      (11.82)

The backward error [74], given by

     η_i = max{ ||A ṽ_i − σ̃_i ũ_i||_2 , ||A^T ũ_i − σ̃_i ṽ_i||_2 },

could be used as a measure of absolute accuracy. Introducing a normalizing factor,


the above expression could be used for assessing the relative error.
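As a small sketch, the residual norm (11.82) of an approximate triplet can be evaluated directly; the function below is illustrative and assumes dense NumPy arrays.

    import numpy as np

    def triplet_residual(A, u, sigma, v):
        # ||r||^2 = ( ||A v - sigma u||^2 + ||A'u - sigma v||^2 ) / ( ||u||^2 + ||v||^2 )
        r1 = A @ v - sigma * u
        r2 = A.T @ u - sigma * v
        return np.sqrt((r1 @ r1 + r2 @ r2) / (u @ u + v @ v))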
Alternatively, as outlined in Chap. 8, we may compute the SVD of A indirectly
via the eigenpairs of either the n × n matrix A^T A or the m × m matrix A A^T. Thus, if

     V^T (A^T A) V = diag(σ_1^2, σ_2^2, . . . , σ_r^2, 0, . . . , 0),      (n − r trailing zeros)

where r = min(m, n), and σ_i is the ith nonzero singular value of A corresponding to
the right singular vector v_i, then the corresponding left singular vector, u_i, is obtained
as u_i = (1/σ_i) A v_i. Similarly, if

     U^T (A A^T) U = diag(σ_1^2, σ_2^2, . . . , σ_r^2, 0, . . . , 0),      (m − r trailing zeros)

where σ_i is the ith nonzero singular value of A corresponding to the left singular
vector u_i, then the corresponding right singular vector, v_i, is obtained as v_i = (1/σ_i) A^T u_i.
As stated in Chap. 8, computing the SVD of A via the eigensystems of either
A^T A or A A^T may be adequate for determining the largest singular triplets of A, but
some loss of accuracy is observed for the smallest singular triplets, e.g. see also [23].
In fact, if the smallest singular value of A is smaller than √u ||A||, where u denotes the
unit roundoff, then it will be computed as a zero eigenvalue of A^T A (or A A^T). Thus,
it is preferable to compute the smallest singular values of A via the eigendecomposition
of the augmented matrix A_aug = ( 0  A ; A^T  0 ).
Note that, whereas the squares of the smallest and largest singular values of A are
the lower and upper bounds of the spectrum of A^T A or A A^T, the smallest singular
values of A lie in the middle of the spectrum of A_aug in (11.81). Further, similar to
(11.82), the norms of the residuals corresponding to the ith eigenpairs of A^T A and
A A^T are given by

     ||r_i||_2 = ||A^T A ṽ_i − σ̃_i^2 ṽ_i||_2 / ||ṽ_i||_2   and   ||r_i||_2 = ||A A^T ũ_i − σ̃_i^2 ũ_i||_2 / ||ũ_i||_2,

respectively.
When A is a square nonsingular matrix, it may be advantageous (in certain cases)
to compute the needed few largest singular triplets of A^{-1}, which are 1/σ_n ≥ · · · ≥ 1/σ_1.
This approach has the drawback of needing to solve linear systems involving the
matrix A. If a suitable parallel direct sparse solver is available, this strategy provides a
robust algorithm. The resulting computational scheme becomes of value in enhancing
parallelism for the subspace method described below. This approach can also be
extended to rectangular matrices of full column rank if the upper triangular matrix
R in the orthogonal factorization A = Q R is available:

Proposition 11.3 Let A ∈ R^{m×n} (m ≥ n) be of rank n. Let

     A = Q R,   where Q ∈ R^{m×n} and R ∈ R^{n×n},

such that Q^T Q = I_n and R is upper triangular.

The singular values of R are the same as those of A.



Consequently, the smallest singular values of A can be computed from the largest
eigenvalues of R^{-1} R^{-T}, or of the matrix ( 0  R^{-1} ; R^{-T}  0 ).
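For a dense illustration of Proposition 11.3 (a sketch only; for large sparse A one would instead work with R^{-1} implicitly through triangular solves), the smallest singular values can be read off the triangular factor:

    import numpy as np

    def smallest_singular_values_via_R(A, k=3):
        R = np.linalg.qr(A, mode='r')              # n x n upper-triangular factor of A = QR
        s = np.linalg.svd(R, compute_uv=False)     # singular values of R equal those of A
        return np.sort(s)[:k]                      # the k smallest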
Sensitivity of the Smallest Singular Value
In order to compute the smallest singular value in a reliable way, one must investigate
the sensitivity of the singular values with respect to perturbations of the matrix under
consideration.

Theorem 11.13 Let A ∈ R^{m×n} (m ≥ n) be of rank n and Δ ∈ R^{m×n}. If the singular
values of A and A + Δ are respectively denoted by

     σ_1 ≥ σ_2 ≥ · · · ≥ σ_n   and   σ̃_1 ≥ σ̃_2 ≥ · · · ≥ σ̃_n,

then

     |σ_i − σ̃_i| ≤ ||Δ||_2,   for i = 1, . . . , n.

Proof see [75].

When applied to the smallest singular value, this result yields the following.
Proposition 11.4 The relative condition number of the smallest singular value of a
nonsingular matrix A is equal to κ_2(A) = σ_1/σ_n.
This means that the smallest singular value of an ill-conditioned matrix cannot be
computed with high accuracy even with a backward-stable algorithm.
In [76] it is shown that for some special class of matrices, e.g. tall and narrow sparse
matrices, an accurate computation of the smallest singular value may be obtained
via a combination of a parallel orthogonal factorization scheme of A, with column
pivoting, and a one-sided Jacobi algorithm (see Chap. 8).
Since the nonzero singular values are roots of a polynomial (e.g. roots of the
characteristic polynomial of the augmented matrix), then when simple, they are
differentiable with respect to the entries of the matrix. More precisely, one can state
that:

Theorem 11.14 Let σ be a nonzero simple singular value of the matrix A = (αi j )
with u = (μi ) and v = (νi ) being the corresponding normalized left and right
singular vectors. Then, the singular value is differentiable with respect to the matrix
A, or

     ∂σ/∂α_{ij} = μ_i ν_j,   ∀ i, j = 1, . . . , n.

Proof See [77].



The effect of a perturbation of the matrix on the singular vectors can be more
significant than that on the singular values. The sensitivity of the singular vectors
depends upon the distribution of the singular values. When a simple singular value
is not well separated from the rest, the corresponding left and right singular vec-
tors are poorly determined. This is made precise by the theorem below, see [75],
which we state here without proof. Let A ∈ Rn×m (n ≥ m) have the singular value
decomposition  
 Σ
U AV = .
0

Let U and V be partitioned as U = (u 1 U2 U3 ) and V = (v1 V2 ) where u 1 ∈ Rn ,


U2 ∈ Rn×(m−1) , U3 ∈ Rn×(n−m) , v1 ∈ Rm and U2 ∈ Rm×(m−1) . Consequently,
⎛ ⎞
σ1 0
U  AV = ⎝ 0 Σ2 ⎠ .
0 0

Let A be perturbed by the matrix E, where


⎛ ⎞
γ11 g12 
U  E V = ⎝ g21 G 22 ⎠ .
g31 G 32

Theorem 11.15 Let h = σ_1 g_12 + Σ_2 g_21, and Ã = A + E. If (σ_1 I − Σ_2) is nonsin-
gular (i.e. if σ_1 is a simple singular value of A), then the matrix

     U^T Ã V = ( σ_1 + γ_11   g_12^T     )
               ( g_21         Σ_2 + G_22 )
               ( g_31         G_32       )

has a right singular vector of the form

     (            1             )
     ( (σ_1^2 I − Σ_2^2)^{-1} h )  + O(||E||_2^2).

Next, we present a selection of parallel algorithms for computing the extreme sin-
gular values and the corresponding vectors of a large sparse matrix A ∈ Rm×n . In
particular, we present the simultaneous iteration method and two Lanczos schemes
for computing a few of the largest singular values and the corresponding singular
vectors of A. Our parallel algorithms of choice for computing a few of the smallest
singular triplets, however, are the trace minimization and the Davidson schemes.

11.6.2 Subspace Iteration for Computing the Largest


Singular Triplets

Subspace iteration, presented in Sect. 11.1, can be used to obtain the largest singular
triplets via obtaining the dominant eigenpairs of the symmetric matrix G = Ãaug ,
where

     Ã_aug = ( γ I_m   A     )
             ( A^T     γ I_n ),                                                  (11.83)

in which the shift parameter γ is chosen to assure that G is positive definite. This
method generates the sequence

     Z_k = G^k Z_0,

where the initial iterate is the n×s matrix Z 0 = (z 1 , z 2 , . . . , z s ), in which s = 2 p with


p being the number of desired largest singular values. Earlier, we presented the basic
form of simultaneous iteration without Rutishauser’s classic Chebyshev acceleration
in RITZIT (see [4]). This particular algorithm incorporates both a Rayleigh-Ritz
procedure and an acceleration scheme via Chebyshev polynomials. The iteration
which embodies the RITZIT procedure is given in Algorithm 11.16. The Rayleigh
Quotient matrix, Hk , in step 3 is essentially the projection of G 2 onto the subspace
spanned by the columns of Z k−1 . The three-term recurrence in step 7 follows from
the mapping of the Chebyshev polynomial of the first kind, of degree q, Tq (x),
onto the interval [−e, e], where e is chosen to be slightly smaller than the smallest
eigenvalue of the SPD matrix Hk . This use of Chebyshev polynomials has the desired
effect of damping out the unwanted eigenvalues of G and producing the improved
rate of convergence: O(Tk (θs )/Tk (θ1 )) which is considerably higher than the rate
of convergence without the Chebyshev acceleration: O(θs /θ1 ), where (θ1 ≥ θ2 ≥
· · · ≥ θs ) are the eigenvalues of Hk , or the square root of the diagonal matrix Δ2k in
step 4 of Algorithm 11.16.

Algorithm 11.16 SISVD: inner iteration of the subspace iteration as implemented


in Rutishauser’s RITZIT.
1: Compute Ck = G Z k−1 ; //without assembling G.
2: Factor Ck = Q k Rk ;
3: Form Hk = Rk Rk ;
4: Factor Hk = Pk Δ2k Pk ;
5: Form Z k = Q k Pk ;
6: do j=2:q,
7:    Z_{k+j} = (2/e) G Z_{k+j−1} − Z_{k+j−2};
8: end

The orthogonal factorization in step 2 of Algorithm 11.16 may be computed by


a modified Gram-Schmidt procedure or by Householder transformations provided

that the orthogonal matrix Q k is explicitly available for the computation of Z k in


step 5. On parallel architectures, especially those having hierarchical memories, one
may achieve higher performance by using either a block Gram-Schmidt (Algorithm
B2GS) or the block Householder orthogonalization method in step 2. Further, we
could use the parallel two- or one-sided Jacobi method, described earlier, to obtain
the spectral decomposition in step 4 of RITZIT (as originally suggested in [47]).
In fact, when appropriately adapted for symmetric positive definite matrices, the
one-sided Jacobi scheme can realize high performance on parallel architectures (e.g.
see [78]), provided that the dimension of the current subspace, s, is not too large.
For larger subspace size s, an optimized implementation of the parallel Cuppen’s
algorithm [79] may be used instead in step 4.
The success of Rutishauser’s subspace iteration method using Chebyshev accel-
eration relies upon
 the following strategy for limiting the degree of the Chebyshev
polynomial, Tq xe , on the interval [−e, e], where e = θs (assuming we use a block
of s vectors, and q = 1 initially):

qnew = min{2qold , q̂}, where






1, ⎛ ⎞ if θs < ξ1 θ1
q̂ = ξ2 (11.84)

⎪ 2 × max ⎝   , 1⎠ otherwise.
⎩ arccosh θ1 θs

Here, ξ1 = 0.04 and ξ2 = 4. The polynomial degree of the current iteration is then
taken to be q = q_new. It can easily be shown that the strategy in (11.84) insures that

     | T_q(θ_1/θ_s) | = cosh( q · arccosh(θ_1/θ_s) ) ≤ cosh(8) < 1500.

Although this has been successful for RITZIT, we can generate several variations
of polynomial-accelerated subspace iteration schemes SISVD using a more flexible
bound. Specifically, we can consider an adaptive strategy for selecting the degree q
in which ξ1 and ξ2 are treated as control parameters for determining the frequency
and the degree of polynomial acceleration, respectively. In other words, large (small)
values of ξ1 , inhibit (invoke) polynomial acceleration, and large (small) values of
ξ2 yield larger (smaller) polynomial degrees when acceleration is selected. Corre-
spondingly, the number of matrix-vector multiplications will increase with ξ2 and
the total number of iterations may well increase with ξ1 . Controlling the parameters,
ξ1 and ξ2 , allows us to monitor the method’s complexity so as to maintain an optimal
balance between the dominating kernels (specifically, sparse matrix—vector multi-
plication, orthogonalization, and spectral decomposition). We demonstrate the use
of such controls in the polynomial acceleration-based trace minimization method for
computing a few of the smallest singular triplets in Sect. 11.6.4.
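A sketch of the degree selection (11.84), with ξ1 and ξ2 exposed as the control parameters discussed above (the function name is ours):

    import numpy as np

    def chebyshev_degree(theta1, theta_s, q_old, xi1=0.04, xi2=4.0):
        # q_new = min(2*q_old, q_hat), following (11.84)
        if theta_s < xi1 * theta1:
            q_hat = 1.0
        else:
            q_hat = 2.0 * max(xi2 / np.arccosh(theta1 / theta_s), 1.0)
        return int(min(2 * q_old, q_hat))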

11.6.3 The Lanczos Method for Computing a Few


of the Largest Singular Triplets

The single-vector Lanczos tridiagonalization procedures (with and without reorthog-


onalization) and the corresponding Lanczos eigensolver as well as its block version,
discussed above in Sect. 11.2, may be used to obtain an approximation of the extreme
singular values and the associated singular vectors by considering the standard eigen-
value problem for the (m + n) × (m + n) symmetric indefinite matrix A_aug:

     A_aug = ( 0     A )
             ( A^T   0 ).                                                        (11.85)

Note that, the largest singular triplets of A will be obtained with much higher accuracy
than their smallest counterparts using the Lanczos method.
The Single-Vector Lanczos Method
Using Algorithms 11.5 or 11.6 for the tridiagonalization of Aaug , which is refer-
enced only through matrix-vector multiplications, we generate elements of the cor-
responding tridiagonal matrices to be used by the associated Lanczos eigensolvers:
Algorithms 11.8 or 11.9, respectively. We denote this method by LASVD.
The Block Lanczos Method
As mentioned earlier, one could use the block version of the Lanczos method as an
alternate to the single-vector scheme LASVD. The resulting block version BLSVD
uses block three-term recurrence relations which require sparse matrix-tall dense
matrix multiplications, dense matrix multiplications, and dense matrix orthogonal
factorizations. These are primitives that achieve higher performance on parallel ar-
chitectures. In addition, this block version of the Lanczos algorithm is more robust
for eigenvalue problems with multiple or clustered eigenvalues. Again, we consider
the standard eigenvalue problem involving the 2-cyclic matrix Aaug . Exploiting the
structure of the matrix Aaug , we can obtain an alternative form for the Lanczos re-
cursion (11.9). If we apply the Lanczos recursion given by (11.9) to Aaug with a
starting vector ũ = (u^T, 0)^T such that ||ũ||_2 = 1, then all the diagonal entries of the
real symmetric tridiagonal Lanczos matrices generated are identically zero. In fact,
this Lanczos recursion, for i = 1, 2, . . . , k, reduces to

     β_{2i} v_i       = A^T u_i − β_{2i−1} v_{i−1},
     β_{2i+1} u_{i+1} = A v_i − β_{2i} u_i,                                       (11.86)

where u_1 ≡ u, v_0 ≡ 0, and β_1 ≡ 0. Unfortunately, (11.86) can only compute the
distinct singular values of A but not their multiplicities. Following the block Lanczos
recursion as done in Algorithm 11.7, (11.86) can be represented in matrix form as

     A^T Û_k = V̂_k J_k + Z_k,
     A V̂_k   = Û_k J_k^T + Ẑ_k,                                                  (11.87)

where Û_k = (u_1, . . . , u_k), V̂_k = (v_1, . . . , v_k), with J_k being a k × k bidiagonal
matrix in which (J_k)_{jj} = β_{2j} and (J_k)_{j,j+1} = β_{2j+1}. In addition, Z_k and Ẑ_k contain
remainder terms. It is easy to show that the nonzero singular values of J_k are the
same as the positive eigenvalues of

     K_k ≡ ( O      J_k )
           ( J_k^T  O   ).                                                       (11.88)
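Before turning to the block analogue, here is a hedged Python sketch of the single-vector recursion (11.86) (Golub–Kahan bidiagonalization); no reorthogonalization is performed, and the random starting vector and the returned quantities are illustrative choices rather than part of the method as stated.

    import numpy as np

    def golub_kahan_bidiag(A, k, rng=np.random.default_rng(0)):
        # Builds U (m x k), V (n x k) and the bidiagonal J_k of (11.86):
        # alpha_i = beta_{2i} on the diagonal, beta_i = beta_{2i+1} on the superdiagonal.
        m, n = A.shape
        U = np.zeros((m, k + 1)); V = np.zeros((n, k))
        alpha = np.zeros(k); beta = np.zeros(k)
        u = rng.standard_normal(m)
        U[:, 0] = u / np.linalg.norm(u)
        for i in range(k):
            v = A.T @ U[:, i] - (beta[i - 1] * V[:, i - 1] if i > 0 else 0.0)
            alpha[i] = np.linalg.norm(v); V[:, i] = v / alpha[i]
            u = A @ V[:, i] - alpha[i] * U[:, i]
            beta[i] = np.linalg.norm(u); U[:, i + 1] = u / beta[i]
        J = np.diag(alpha) + np.diag(beta[:-1], 1)
        return U[:, :k], V, np.linalg.svd(J, compute_uv=False)  # approximate singular values of A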

For the block analogue of (11.87), we make the simple substitutions

u i ↔ Ui , vi ↔ Vi ,

where Ui and Vi are matrices of order m × b and n × b, respectively, with b being the
current block size. The matrix Jk is now a block-upper bidiagonal matrix of order
bk
     J_k ≡ ( S_1   R_1                        )
           (       S_2   R_2                  )
           (             ·     ·              )
           (                   ·    R_{k−1}   )
           (                        S_k       ),                                 (11.89)

in which Si and Ri are b × b upper-triangular matrices. If Ui and Vi form mutually


orthogonal sets of bk vectors so that Ûk and V̂k are orthonormal matrices, then the
singular values of the matrix Jk will be identical to those of the original matrix A.
Given the upper block-bidiagonal matrix Jk , we approximate the singular triplets
of A by first computing the singular triplets of Jk . To determine the left and right
singular vectors of A from those of Jk , we must retain the Lanczos vectors of Ûk
and V̂_k. Specifically, if {σ_i^{(k)}, y_i^{(k)}, z_i^{(k)}} is the ith singular triplet of J_k, then the
approximation to the ith singular triplet of A is given by {σ_i^{(k)}, Û_k y_i^{(k)}, V̂_k z_i^{(k)}}, where
Û_k y_i^{(k)} and V̂_k z_i^{(k)} are the left and right approximate singular vectors, respectively. The
computation of singular triplets for Jk requires two phases. The first phase reduces
Jk to an upper-bidiagonal matrix Ck having diagonal elements {α1 , α2 , . . . , αbk }
and superdiagonal elements {β1 , β2 , . . . , βbk−1 } via a finite sequence of orthogonal
transformations (thus preserving the singular values of Jk ). The second phase reduces
Ck to the diagonal form by a modified Q R algorithm. This diagonalization procedure
is discussed in detail in [80]. The resulting diagonalized Ck will yield the approximate
singular values of A, while the corresponding left and right singular vectors are
determined through multiplications by all the left and right transformations used in
both phases of the SVD of Jk .
There are a few options for the reduction of Jk to the bidiagonal matrix, Ck . The
use of either Householder or Givens reductions that zero out or chase off elements
generated above the first super-diagonal of Jk was advocated in [81]. This bidiago-

nalization, and the subsequent diagonalization processes offer only poor data locality
and limited parallelism. For this reason, one should adopt instead the single vector
Lanczos bidiagonalization recursion given by (11.86) and (11.87) as the strategy
of choice for reducing the upper block-bidiagonal matrix Jk to the bidiagonal form
(Ck ), i.e.

     J_k Q̂   = P̂ C_k,
     J_k^T P̂ = Q̂ C_k^T,                                                          (11.90)

or

     J_k q_j   = α_j p_j + β_{j−1} p_{j−1},
     J_k^T p_j = α_j q_j + β_j q_{j+1},                                           (11.91)

where P̂ = ( p1 , p2 , . . . , pbk ) and Q̂ = (q1 , q2 , . . . , qbk ) are orthonormal matri-


ces of order bk × bk. Note that the recursions in (11.91) which require banded
matrix-vector multiplications can be realized via optimized level-2 BLAS routines.
For orthogonalization of the outermost Lanczos vectors, {Ui } and {Vi }, as well as
the innermost Lanczos vectors, { pi } and {qi }, one can apply a complete reorthog-
onalization [10] strategy to insure robust approximation of the singular triplets of
the matrix A. Such a hybrid Lanczos approach which incorporates in each outer
iteration of the block Lanczos SVD recursion, inner iterations of the single-vector
Lanczos bidiagonalization procedure has been introduced in [82]. As an alternative
to the outer recursion given by (11.87), which is derived from the 2-cyclic matrix
Aaug , Algorithm 11.17 depicts the simplified outer block Lanczos recursion for ap-
proximating primarily the largest eigenpairs of A^T A. Combining the equations
in (11.87), we obtain

     A^T A V̂_k = V̂_k H_k,

where H_k = J_k^T J_k is the symmetric block-tridiagonal matrix


     H_k ≡ ( S_1    R_1^T                              )
           ( R_1    S_2     R_2^T                      )
           (        R_2     ·       ·                  )
           (                ·       ·       R_{k−1}^T  )
           (                        R_{k−1}  S_k       ),                        (11.92)

Applying the block Lanczos recursion [10] in Algorithm 11.17 for computing the
eigenpairs of the n × n symmetric positive definite matrix A^T A, the tridiagonalization
of Hk via an inner Lanczos recursion follows from a simple application of (11.9).
Analogous to the reduction of Jk in (11.89), computation of the eigenpairs of the
resulting tridiagonal matrix can be performed via a Jacobi or a QR-based symmetric
eigensolver.

Algorithm 11.17 BLSVD: hybrid Lanczos outer iteration (Formation of symmetric


block-tridiagonal matrix Hk ).
1: Choose V1 ∈ Rn×b orthonormal and c = max(b, k).
2: Compute S_1 = V_1^T A^T A V_1, (V_0, R_0 = 0 initially).
3: do i = 1 : k, (where k = c/b)
4:    Compute Y_{i−1} = A^T A V_{i−1} − V_{i−1} S_{i−1} − V_{i−2} R_{i−2}^T;
5:    Orthogonalize Y_{i−1} against {V_ℓ}_{ℓ=0}^{i−1};
6:    Factor Y_{i−1} = V_i R_{i−1};
7:    Compute S_i = V_i^T A^T A V_i;
8: end

Clearly, A^T A should never be formed explicitly. The above algorithms require
only sparse matrix-vector multiplications involving A and A^T. Higher parallel scal-
ability is realized in the outer (block) Lanczos iterations by the multiplication of
A and A^T by b vectors rather than multiplication by a single vector. A stable vari-
A and A by b vectors rather than multiplication by a single vector. A stable vari-
ant of the block Gram-Schmidt orthogonalization [4], which requires efficient dense
matrix-vector multiplication (level-2 BLAS) routines or efficient dense matrix-matrix
multiplication (level-3 BLAS) routines as Algorithm B2GS, is used to produce the
orthogonal projections of Yi (i.e. Ri−1 ) and Wi (i.e. Si ) onto Ṽ ⊥ and Ũ ⊥ , respectively,
where U0 and V0 contain the converged left and right singular vectors, respectively,
and

Ṽ = (V0 , V1 , . . . , Vi−1 ) and Ũ = (U0 , U1 , . . . , Ui−1 ).

11.6.4 The Trace Minimization Method for Computing


the Smallest Singular Triplets

Another candidate subspace method for the SVD of sparse matrices is based upon the
trace minimization algorithm, discussed earlier in this chapter, for the generalized
eigenvalue problem

H x = λGx, (11.93)

where H and G are symmetric, with G being positive definite. In order to compute
the t smallest singular triplets of an m × n matrix A, we could set H = A^T A and
G = I_n. If Y is defined as the set of all n × p matrices Y for which Y^T Y = I_p,
where p = 2t, then as illustrated before, we have

     min_{Y ∈ Y} trace(Y^T H Y) = Σ_{i=1}^{p} σ̃_{n−i+1}^2,                        (11.94)

where σ̃i is a singular value of A, λi = σ̃i2 is an eigenvalue of H , and σ̃1 ≥ σ̃2 ≥


· · · ≥ σ̃n . In other words, given an n × p matrix Y which forms a section of the
eigenvalue problem

H z = λz, (11.95)

i.e.

     Y^T H Y = Σ̃,   Y^T Y = I_p,                                                 (11.96)

     Σ̃ = diag(σ̃_n^2, σ̃_{n−1}^2, . . . , σ̃_{n−p+1}^2),

our trace minimization scheme TRSVD generates the sequence Yk , whose first t
columns converge to the t left singular vectors corresponding to the t-smallest sin-
gular values of A.
Polynomial Acceleration Techniques for TRSVD
Convergence of the trace minimization algorithm TRSVD can be accelerated via
either a shifting strategy as discussed earlier in Sect. 11.5.3, via a Chebyshev accel-
eration strategy as illustrated for subspace iterations SISVD, or via a combination
of both strategies. For the time being we focus first on Chebyshev acceleration.
Hence, in order to dampen the unwanted (i.e. largest) singular values of A, we need
to consider applying the trace minimization scheme to the generalized eigenvalue
problem

     x = (1/P_q(λ)) P_q(H) x,                                                    (11.97)

where Pq (x) = Tq (x) + ε, and Tq (x) is the Chebyshev polynomial of degree q with
ε chosen so that Pq (H ) is (symmetric) positive definite. The appropriate quadratic
minimization problem here can be expressed as

     minimize  (y_j^{(k)} − d_j^{(k)})^T (y_j^{(k)} − d_j^{(k)})                  (11.98)

subject to the constraints

     Y_k^T P_q(H) d_j^{(k)} = 0,   j = 1, 2, . . . , p.

In effect, we approximate the smallest eigenvalues of H as the largest eigenvalues of


the matrix Pq (H ) in which the gaps between its eigenvalues are considerably larger
than those between the eigenvalues of H .
Although the additional number of sparse matrix-vector multiplications involving
Pq (H ) could be significantly higher than before, especially for large degrees q,
the saddle-point problems that need to be solved in each outer trace minimization
iteration become of the form

     ( I              P_q(H) Y_k ) ( d_j^{(k)} )   ( y_j^{(k)} )
     ( Y_k^T P_q(H)   0          ) ( l         ) = ( 0         ),   j = 1, 2, . . . , p,      (11.99)

which are much easier to solve. It can be shown that the updated eigenvector approx-
imation, y_j^(k+1), is determined by

y_j^(k+1) = y_j^(k) − d_j^(k) = Pq(H) Yk [Ykᵀ Pq²(H) Yk]⁻¹ Ykᵀ Pq(H) y_j^(k).

Thus, we may not need to use an iterative solver for determining Yk+1 since the matrix
[Ykᵀ Pq²(H) Yk]⁻¹ is of relatively small order p. Using the orthogonal factorization

Pq(H) Yk = Q̂ R̂,

we have

[Ykᵀ Pq²(H) Yk]⁻¹ = R̂⁻¹ R̂⁻ᵀ.

The polynomial degree, q, is determined by the same strategy adopted for SISVD.
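To make the formulas above concrete, the following sketch (Python/NumPy; the helper apply_PqH is a hypothetical routine, e.g. a Chebyshev recurrence applying Pq(H) to a block of vectors) forms one polynomial-accelerated update; the surrounding TRSVD logic (forming sections, convergence tests, deflation) is deliberately omitted.

    import numpy as np

    def trsvd_cheb_update(Yk, apply_PqH):
        # One update of the polynomial-accelerated trace minimization iteration.
        # With Z = Pq(H) Yk = Q R (thin QR), [Yk' Pq(H)^2 Yk]^{-1} = (Z'Z)^{-1} = R^{-1} R^{-T},
        # so the update formula above collapses to an orthogonal projection of Yk onto range(Q).
        Z = apply_PqH(Yk)                      # Z = Pq(H) Yk
        Q, R = np.linalg.qr(Z)                 # thin QR factorization
        return Q @ (Q.T @ Yk)                  # Y_{k+1}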


Adding a Shifting Strategy for TRSVD
As discussed in [37], we can also accelerate the convergence of the Yk ’s to eigenvec-
tors of H by incorporating Ritz shifts (see [1]) into TRSVD. Specifically, we modify
the symmetric eigenvalue problem as follows,

(H − ν_j^(k) I) zj = (λj − ν_j^(k)) zj,    j = 1, 2, . . . , s,    (11.100)

where ν_j^(k) = (σ̃_{n−j+1}^(k))² is the j-th approximate eigenvalue at the k-th iteration of
TRSVD, with λj, zj being an exact eigenpair of H. In other words, we simply use
our most recent approximations to the eigenvalues of H from our k-th section within
TRSVD as Ritz shifts. As was shown by Wilkinson in [83], the Rayleigh quotient
iteration associated with (11.100) will ultimately achieve cubic convergence to the
square of an exact singular value of A, σ²_{n−j+1}, provided ν_j^(k) is sufficiently close
to σ²_{n−j+1}. Note, however, that ν_j^(k+1) < ν_j^(k) for all k, i.e. we approximate
the eigenvalues of H from above, so that H − ν_j^(k) I is not positive definite and an
appropriate linear system solver must therefore be adopted. Algorithm 11.18
outlines the basic steps of TRSVD that appropriately utilize polynomial (Chebyshev)
acceleration prior to using Ritz shifts. It is important to note that once shifting has
been invoked (Steps 8–15) we abandon the use of the Chebyshev polynomials Pq(H)
and solve the resulting saddle-point problems using appropriate solvers that take
into account that the (1, 1) block could be indefinite. The switch from
non-accelerated (or polynomial-accelerated) trace minimization iterations to trace
minimization iterations with Ritz shifts is accomplished by monitoring the reduction
of the residuals in (11.82) for isolated eigenvalues (r_j^(k)) or for clusters of eigenvalues
(R_j^(k)).

Algorithm 11.18 TRSVD: trace minimization with Chebyshev acceleration and Ritz shifts.
1: Choose an initial n × p subspace iterate Y0 = [y_1^(0), y_2^(0), . . . , y_p^(0)];
2: Form a section, i.e. determine Y0 such that Y0ᵀ Pq(H) Y0 = Ip and Y0ᵀ Y0 = Γ0 where Γ0 is
diagonal;
3: do k = 0, 1, 2, . . . until convergence,
4: Determine the approximate singular values Σk = diag(σ̃_n^(k), · · · , σ̃_{n−p+1}^(k)) from the Ritz
values of H corresponding to the columns of Yk;
5: Rk = H Yk − Yk Σk²; //Compute residuals
6: Analyze the current approximate spectrum (Gerschgorin disks determine nc groups Gℓ of
eigenvalues)
7: //Invoke Ritz shifting strategy ([37])
8: do ℓ = 1 : nc,
9: if Gℓ = {σ̃_{n−j+1}^(k)} includes a unique eigenvalue, then
10: Shift is selected if ‖r_j^(k)‖2 ≤ η‖r_j^(k0)‖2, where η ∈ [10⁻³, 10⁰] and k0 < k;
11: else
12: //Gℓ is a cluster of c eigenvalues
Shift is selected if ‖R^(k)‖F ≤ η‖R^(k0)‖F, where R^(k) ≡ {r_j^(k), . . . , r_{j+c−1}^(k)} and k0 < k;
13: end if
14: Disable polynomial acceleration if shifting is selected;
15: end
16: //Deflation: Reduce subspace dimension, p, by the number of the H-eigenpairs accepted;
17: Adjust the polynomial degree q for Pq(H) in iteration k + 1 (if needed);
18: Update subspace iterate Yk+1 = Yk − Δk as in (11.99) or for the shifted problem;
19: end

11.6.5 Davidson Methods for the Computation of the Smallest Singular Values

The smallest singular value of a matrix A may be obtained by applying one of the
various versions of the Davidson methods to obtain the smallest eigenvalue of the
matrix C = AᵀA, or to obtain the innermost positive eigenvalue of the 2-cyclic
augmented matrix in (11.81). We assume that one has the basic kernels for matrix-
vector multiplications using either A or Aᵀ. Multiplying Aᵀ by a vector is often
considered a drawback. Thus, whenever possible, the so-called “transpose-free”
methods should be used. Even though one can avoid such a drawback when dealing
with the interaction matrix Hk = Vkᵀ C Vk = (AVk)ᵀ(AVk), we still have to compute
the residuals corresponding to the Ritz pairs, which do involve multiplication of the
transpose of a matrix by a vector.
For the regular single-vector Davidson method, the correction vector is obtained
by approximately solving the system

AᵀA tk = rk.    (11.101)

Obtaining an exact solution of (11.101) would yield the Lanczos algorithm applied to
C⁻¹. Once the Ritz value approaches the square of the sought-after smallest singular
value, it is recommended that we solve (11.101) without any shifts; the benefit is that
we deal with a fixed symmetric positive definite system matrix.
The approximate solution of (11.101) can be obtained by performing a fixed
number of iterations of the Conjugate Gradient scheme, or by solving an approxi-
mate linear system Mtk = rk with a direct method, where M is obtained from an
approximate factorization of A or AᵀA [84]:
Incomplete LU factorization of A: Here, M = U⁻¹L⁻¹L⁻ᵀU⁻ᵀ, where L and U
are the factors of an incomplete LU factorization of A in which one drops entries
of the reduced matrix that are below a given threshold. This version of the
Davidson method is called DAVIDLU.
Incomplete QR factorization of A: Here, M = R⁻¹R⁻ᵀ, where R is the upper
triangular factor of an incomplete QR factorization of A. This version of the
Davidson method is called DAVIDQR.
Incomplete Cholesky of AᵀA: Here, M = L⁻ᵀL⁻¹, where L is the lower triangular
factor of an incomplete Cholesky factorization of the normal equations. This
version of the Davidson method is called DAVIDIC.
Even though the construction of any of the above approximate factorizations may
fail, experiments presented in [84] show the effectiveness of the above three precon-
ditioners whenever they exist.
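As an illustration of the first of these options, the sketch below (Python/SciPy) builds an operator applying M = U⁻¹L⁻¹L⁻ᵀU⁻ᵀ to a residual vector; it assumes A is square and sparse, and SciPy's spilu stands in for the threshold-based incomplete LU described in [84].

    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    def make_davidlu_preconditioner(A, drop_tol=1e-3):
        # incomplete LU of A with threshold dropping (a stand-in for the ILU in the text)
        ilu = spla.spilu(sp.csc_matrix(A), drop_tol=drop_tol)
        def apply_M(r):
            # M r = U^{-1} L^{-1} L^{-T} U^{-T} r, i.e. an approximation of (A'A)^{-1} r
            y = ilu.solve(r, trans='T')      # y = L^{-T} U^{-T} r
            return ilu.solve(y)              # U^{-1} L^{-1} y
        return apply_M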
The corresponding method is given as Algorithm 11.19. At step 11, the pre-
conditioner C is defined by one of the methods DAVIDLU, DAVIDIC, or DAVIDQR.
Steps 3, 4, and 13 must be implemented in such a way that redundant com-
putations are skipped.
Similar to trace minimization, the Jacobi-Davidson method can be used directly on
the matrix AᵀA to compute the smallest eigenvalue and the corresponding eigenvec-
tor. In [85] the Jacobi-Davidson method has been adapted for obtaining the singular
values of A by considering the eigenvalue problem corresponding to the 2-cyclic
augmented matrix.

11.6.6 Refinement of Left Singular Vectors

Having determined approximate singular values, σ̃i, and their corresponding right
singular vectors, ṽi, to a user-specified tolerance for the residual

r̂i = AᵀAṽi − σ̃i²ṽi,    (11.102)


Algorithm 11.19 Computing smallest singular values by the block Davidson method.

Input: A ∈ Rm×n, p ≥ 1, V ∈ Rn×p, and 2p ≤ ℓ ≤ n ≤ m.
Output: The p smallest singular values {σ1, · · · , σp} of A and their corresponding right sin-
gular vectors X = [x1, · · · , xp] ∈ Rn×p.
1: V1 = MGS(V); k = 1;
2: repeat
3: compute Uk = AVk and Wk = AᵀUk;
4: compute the interaction matrix Hk = Vkᵀ Wk;
5: compute the p smallest eigenpairs (σ²_{k,i}, y_{k,i}) of Hk, for i = 1, · · · , p;
6: compute the p corresponding Ritz vectors Xk = Vk Yk where Yk = [y_{k,1}, · · · , y_{k,p}];
7: compute the residuals Rk = [r_{k,1}, · · · , r_{k,p}] = Wk Yk − Xk diag(σ²_{k,1}, · · · , σ²_{k,p});
8: if convergence then
9: Exit;
10: end if
11: compute the new block Tk = [t_{k,1}, . . . , t_{k,p}] where t_{k,i} = C r_{k,i};
12: if dim(Vk) ≤ ℓ − p, then
13: Vk+1 = MGS([Vk, Tk]);
14: else
15: Vk+1 = MGS([Xk, Tk]);
16: end if
17: k = k + 1;
18: until Convergence
19: X = Xk; σi = σ_{k,i} for i = 1, · · · , p.

we must then obtain an approximation to the corresponding left singular vector, ui,
via

ui = (1/σ̃i) Aṽi.    (11.103)

As mentioned in Sect. 11.6.1, however, it is quite possible that square roots of
the approximate eigenvalues of AᵀA will be poor approximations to those singular
values of A which are extremely small. This phenomenon, of course, will lead to
poor approximations to the left singular vectors. Even if σ̃i is an acceptable singular
value approximation, the residual corresponding to the singular triplet {σ̃i, ũi, ṽi},
defined by (11.82), has a 2-norm bounded from above by

‖ri‖2 ≤ ‖r̂i‖2 / ( σ̃i (‖ũi‖2² + ‖ṽi‖2²)^{1/2} ),    (11.104)

where r̂i is the residual given in (11.102) for the symmetric eigenvalue problem for
AᵀA or (γ²In − AᵀA). Scaling by σ̃i can easily lead to significant loss of accuracy
in estimating the singular triplet residual norm, ‖ri‖2, especially when σ̃i approaches
the machine unit roundoff.
One remedy is to refine the initial approximation of the left singular vectors,
corresponding to the few computed singular values and right singular vectors, via
inverse iteration. To achieve this, consider the following equivalent eigensystem,

( γIn   Aᵀ ) ( vi )                ( vi )
( A    γIm ) ( ui )  =  (γ + σi)   ( ui ),      (11.105)

where {σi , u i , vi } is the ith singular triplet of A, and

γ = min(1, max{σi }). (11.106)

One possible refinement recursion (via inverse iteration) is thus given by

( γIn   Aᵀ ) ( ṽi^(k+1) )                 ( ṽi^(k) )
( A    γIm ) ( ũi^(k+1) )  =  (γ + σ̃i)    ( ũi^(k) ),      (11.107)

where {σ̃i, ũi, ṽi} is the ith computed smallest singular triplet. By applying block
Gaussian elimination to (11.107) we obtain a more convenient form (reduced system)
of the recursion

( γIn        Aᵀ           ) ( ṽi^(k+1) )                 (        ṽi^(k)          )
( 0    γIm − (1/γ)AAᵀ     ) ( ũi^(k+1) )  =  (γ + σ̃i)    ( ũi^(k) − (1/γ)Aṽi^(k)  ).      (11.108)

Our iterative refinement strategy for an approximate singular triplet of A, {σ̃i, ũi, ṽi},
is then defined by the last m equations of (11.108), i.e.

( γIm − (1/γ)AAᵀ ) ũi^(k+1) = (γ + σ̃i) ( ũi^(k) − (1/γ)Aṽi ),      (11.109)

where the superscript k is dropped from ṽi since we refine only our left singular
vector approximation, ũi. If ũi^(0) ≡ ui from (11.105), then (11.109) can be rewritten
as

( γIm − (1/γ)AAᵀ ) ũi^(k+1) = (γ − σ̃i²/γ) ũi^(k),      (11.110)

followed by the normalization

ũi^(k+1) = ũi^(k+1) / ‖ũi^(k+1)‖2.

It is easy to show that the left-hand-side matrix in (11.109) is symmetric positive
definite provided (11.106) holds. Accordingly, we may use parallel conjugate gra-
dient iterations to refine each singular triplet approximation. Hence, the refinement
procedure outlined in Algorithm 11.20 may be considered as a black box procedure
to follow the eigensolution of AᵀA or γ²In − AᵀA by any one of the above candidate
methods for the sparse singular-value problem. The iterations in steps 2–9 of the refinement
scheme in Algorithm 11.20 terminate once the norms of the residuals of all p ap-
proximate singular triplets (‖ri‖2) fall below a user-specified tolerance or after kmax
iterations.

Algorithm 11.20 Refinement procedure for the left singular vector approximations
obtained via scaling.
Input: A ∈ Rm×n, p approximate singular values Σ = diag(σ̃1, · · · , σ̃p) and their corresponding
approximate right singular vectors V = [ṽ1, · · · , ṽp].
1: U0 = AV Σ⁻¹; //By definition: Uk = [ũ1^(k), · · · , ũp^(k)]
2: do j = 1 : p,
3: k = 0;
4: while ‖Aṽj − σ̃j ũj^(k)‖ > τ,
5: k := k + 1;
6: Solve (γIm − (1/γ)AAᵀ) ũj^(k+1) = (γ − σ̃j²/γ) ũj^(k); //See (11.110)
7: Set ũj^(k+1) = ũj^(k+1)/‖ũj^(k+1)‖2;
8: end while
9: end
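A minimal sketch of one such refinement sweep is given below (Python/SciPy); the function and parameter names are ours, conjugate gradient iterations are used for the reduced system (11.110), and the stopping test mirrors step 4 of Algorithm 11.20.

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def refine_left_vector(A, sigma, v, u0, gamma, tol=1e-8, kmax=20):
        # coefficient matrix of (11.110): gamma*I_m - (1/gamma) * A * A'
        m = A.shape[0]
        op = LinearOperator((m, m),
                            matvec=lambda x: gamma * x - (A @ (A.T @ x)) / gamma)
        u = u0 / np.linalg.norm(u0)
        for _ in range(kmax):
            if np.linalg.norm(A @ v - sigma * u) <= tol:
                break
            rhs = (gamma - sigma**2 / gamma) * u
            u, _ = cg(op, rhs, x0=u)          # a parallel CG solver in a production code
            u = u / np.linalg.norm(u)
        return u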

References

1. Parlett, B.N.: The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs (1980)
2. Bauer, F.: Das verfahren der treppeniteration und verwandte verfahren zur losung algebraischer
eigenwertprobleme. ZAMP 8, 214–235 (1957)
3. Wilkinson, J.H.: The Algebraic Eigenvalue Problem. Oxford University Press, New York (1965)
4. Rutishauser, H.: Simultaneous iteration method for symmetric matrices. Numer. Math. 16,
205–223 (1970)
5. Stewart, G.W.: Simultaneous iterations for computing invariant subspaces of non-Hermitian
matrices. Numer. Math. 25, 123–136 (1976)
6. Stewart, W.J., Jennings, A.: Algorithm 570: LOPSI: a simultaneous iteration method for real
matrices [F2]. ACM Trans. Math. Softw. 7(2), 230–232 (1981). doi:10.1145/355945.355952
7. Saad, Y.: Numerical Methods for Large Eigenvalue Problems. Halstead Press, New York (1992)
8. Sameh, H., Lermit, J., Noh, K.: On the intermediate eigenvalues of symmetric sparse matrices.
BIT 185–191 (1975)
9. Bunch, J., Kaufman, K.: Some stable methods for calculating inertia and solving symmetric
linear systems. Math. Comput. 31, 162–179 (1977)
10. Golub, G., Van Loan, C.: Matrix Computations, 4th edn. Johns Hopkins University Press,
Baltimore (2013)
11. Duff, I., Gould, N.I.M., Reid, J.K., Scott, J.A., Turner, K.: The factorization of sparse symmetric
indefinite matrices. IMA J. Numer. Anal. 11, 181–204 (1991)
12. Duff, I.: MA57-a code for the solution of sparse symmetric definite and indefinite systems.
ACM TOMS 118–144 (2004)
13. Kalamboukis, T.: Tridiagonalization of band symmetric matrices for vector computers. Com-
put. Math. Appl. 19, 29–34 (1990)
14. Lang, B.: A parallel algorithm for reducing symmetric banded matrices to tridiagonal form.
SIAM J. Sci. Comput. 14(6), 1320–1338 (1993). doi:10.1137/0914078
15. Philippe, B., Vital, B.: Parallel implementations for solving generalized eigenvalue problems
with symmetric sparse matrices. Appl. Numer. Math. 12, 391–402 (1993)
16. Carey, C., Chen, H.C., Golub, G., Sameh, A.: A new approach for solving symmetric eigenvalue
problems. Comput. Sys. Eng. 3(6), 671–679 (1992)
17. Golub, G., Underwood, R.: The block Lanczos method for computing eigenvalues. In: Rice, J.
(ed.) Mathematical Software III, pp. 364–377. Academic Press, New York (1977)
18. Underwood, R.: An iterative block Lanczos method for the solution of large sparse symmetric
eigenproblems. Technical Report STAN-CS-75-496, Computer Science, Stanford University,
Stanford (1975)
19. Kaniel, S.: Estimates for some computational techniques in linear algebra. Math. Comput. 20,
369–378 (1966)
20. Paige, C.: The computation of eigenvalues and eigenvectors of very large sparse matrices. Ph.D.
thesis, London University, London (1971)
21. Meurant, G.: The Lanczos and Conjugate Gradient Algorithms: from Theory to Finite Precision
Computations (Software, Environments, and Tools). SIAM, Philadelphia (2006)
22. Paige, C.C.: Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigen-
problem. Linear Algebra Appl. 34, 235–258 (1980)
23. Cullum, J.K., Willoughby, R.A.: Lanczos Algorithms for Large Symmetric Eigenvalue Com-
putations. SIAM, Philadelphia (2002)
24. Lehoucq, R., Sorensen, D.: Deflation techniques for an implicitly restarted Arnoldi iteration.
SIAM J. Matrix Anal. Appl. 17, 789–821 (1996)
25. Lehoucq, R., Sorensen, D., Yang, C.: ARPACK User’s Guide: Solution of Large-Scale Eigen-
value Problems With Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia (1998)
26. Calvetti, D., Reichel, L., Sorensen, D.C.: An implicitly restarted Lanczos method for large
symmetric eigenvalue problems. Electron. Trans. Numer. Anal. 2, 1–21 (1994)
27. Sorensen, D.: Implicit application of polynomial filters in a k-step Arnoldi method. SIAM J.
Matrix Anal. Appl. 13, 357–385 (1992)
28. Lewis, J.G.: Algorithms for sparse matrix eigenvalue problems. Technical Report STAN-CS-
77-595, Department of Computer Science, Stanford University, Palo Alto (1977)
29. Ruhe, A.: Implementation aspects of band Lanczos algorithms for computation of eigenvalues
of large sparse symmetric matrices. Math. Comput. 33, 680–687 (1979)
30. Scott, D.: Block lanczos software for symmetric eigenvalue problems. Technical Report
ORNL/CSD-48, Oak Ridge National Laboratory, Oak Ridge (1979)
31. Baglama, J., Calvetti, D., Reichel, L.: IRBL: an implicitly restarted block Lanczos method for
large-scale Hermitian eigenproblems. SIAM J. Sci. Comput. 24(5), 1650–1677 (2003)
32. Chen, H.C., Sameh, A.: Numerical linear algebra algorithms on the cedar system. In: Noor,
A. (ed.) Parallel Computations and Their Impact on Mechanics, Applied Mechanics Division,
vol. 86, pp. 101–125. American Society of Mechanical Engineers (1987)
33. Chen, H.C.: The sas domain decomposition method. Ph.D. thesis, University of Illinois at
Urbana-Champaign (1988)
34. Davidson, E.: The iterative calculation of a few of the lowest eigenvalues and corresponding
eigenvectors of large real-symmetric matrices. J. Comput. Phys. 17, 817–825 (1975)
35. Morgan, R., Scott, D.: Generalizations of Davidson’s method for computing eigenvalues of
sparse symmetric matrices. SIAM J. Sci. Stat. Comput. 7, 817–825 (1986)
36. Crouzeix, M., Philippe, B., Sadkane, M.: The Davidson method. SIAM J. Sci. Comput. 15,
62–76 (1994)
37. Sameh, A.H., Wisniewski, J.A.: A trace minimization algorithm for the generalized eigenvalue
problem. SIAM J. Numer. Anal. 19(6), 1243–1259 (1982)
38. Sameh, A., Tong, Z.: The trace minimization method for the symmetric generalized eigenvalue
problem. J. Comput. Appl. Math. 123, 155–170 (2000)
39. Sleijpen, G., van der Vorst, H.: A Jacobi-Davidson iteration method for linear eigenvalue
problems. SIAM J. Matrix Anal. Appl. 17, 401–425 (1996)
40. Simoncini, V., Eldén, L.: Inexact Rayleigh quotient-type methods for eigenvalue computations.
BIT Numer. Math. 42(1), 159–182 (2002). doi:10.1023/A:1021930421106
41. Bathe, K., Wilson, E.: Large eigenvalue problems in dynamic analysis. ASCE J. Eng. Mech.
Div. 98, 1471–1485 (1972)
42. Bathe, K., Wilson, E.: Solution methods for eigenvalue problems in structural mechanics. Int.
J. Numer. Methods Eng. 6, 213–226 (1973)
43. Grimm, R., Greene, J., Johnson, J.: Computation of the magnetohydrodynamic spectrum in
axisymmetric toroidal confinement systems. Methods Comput. Phys. 16 (1976)
44. Gruber, R.: Finite hybrid elements to compute the ideal magnetohydrodynamic spectrum of an
axisymmetric plasma. J. Comput. Phys. 26, 379–389 (1978)
45. Stewart, G.: A bibliographical tour of the large, sparse generalized eigenvalue problems. In:
Banch, J., Rose, D. (eds.) Sparse Matrix Computations, pp. 113–130. Academic Press, New
York (1976)
46. van der Vorst, H., Golub, G.: One hundred and fifty years old and still alive: eigenproblems. In:
Duff, I., Watson, G. (eds.) The State of the Art in Numerical Analysis, pp. 93–119. Clarendon
Press, Oxford (1997)
47. Rutishauser, H.: Computational aspects of F. L. Bauer's simultaneous iteration method. Numerische
Mathematik 13(1), 4–13 (1969). doi:10.1007/BF02165269
48. Clint, M., Jennings, A.: The evaluation of eigenvalues and eigenvectors of real symmetric
matrices by simultaneous iteration. Computers 13, 76–80 (1970)
49. Levin, A.: On a method for the solution of a partial eigenvalue problem. J. Comput. Math.
Math. Phys. 5, 206–212 (1965)
50. Stewart, G.: Accelerating the orthogonal iteration for the eigenvalues of a Hermitian matrix.
Numer. Math. 13, 362–376 (1969)
51. Sakurai, T., Sugiura, H.: A projection method for generalized eigenvalue problems using nu-
merical integration. J. Comput. Appl. Math. 159, 119–128 (2003)
52. Tang, P., Polizzi, E.: FEAST as a subspace iteration eigensolver accelerated by approximate
spectral projection. SIAM J. Matrix Anal. Appl. 35(2), 354–390 (2014)
53. Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential
and integral operators. J. Res. Natl. Bur. Stand. 45, 225–280 (1950)
54. Fokkema, D.R., Sleijpen, G.A.G., van der Vorst, H.A.: Jacobi-Davidson style QR and QZ
algorithms for the reduction of matrix pencils. SIAM J. Sci. Comput. 20(1), 94–125 (1998)
55. Sleijpen, G., Booten, A., Fokkema, D., van der Vorst, H.: Jacobi-Davidson type methods for
generalized eigenproblems and polynomial eigenproblems. BIT 36, 595–633 (1996)
56. Cullum, J., Willoughby, R.: Lanczos and the computation in specified intervals of the spectrum
of large, sparse, real symmetric matrices. In: Duff, I., Stewart, G. (eds.) Proceedings of the
Sparse Matrix 1978. SIAM (1979)
57. Parlett, B., Scott, D.: The Lanczos algorithm with selective orthogonalization. Math. Comput.
33, 217–238 (1979)
58. Simon, H.: The Lanczos algorithm with partial reorthogonalization. Math. Comput. 42, 115–
142 (1984)
59. Cullum, J., Willoughby, R.: Computing eigenvalues of very large symmetric matrices—an
implementation of a Lanczos algorithm with no reorthogonalization. J. Comput. Phys. 44,
329–358 (1984)
60. Ericsson, T., Ruhe, A.: The spectral transformation Lanczos method for the solution of large
sparse generalized symmetric eigenvalue problems. Math. Comput. 35, 1251–1268 (1980)
61. Grimes, R., Lewis, J., Simon, H.: A shifted block Lanczos algorithm for solving sparse sym-
metric generalized Eigenproblems. SIAM J. Matrix Anal. Appl. 15, 228–272 (1994)
62. Kalamboukis, T.: A Lanczos-type algorithm for the generalized eigenvalue problem ax = λbx.
J. Comput. Phys. 53, 82–89 (1984)
63. Liu, B.: The simultaneous expansion for the solution of several of the lowest eigenvalues and
corresponding eigenvectors of large real-symmetric matrices. In: Moler, C., Shavitt, I. (eds.)
Numerical Algorithms in Chemistry: Algebraic Method, pp. 49–53. University of California,
Lawrence Berkeley Laboratory (1978)
64. Stathopoulos, A., Saad, Y., Fischer, C.: Robust preconditioning of large, sparse, symmetric
eigenvalue problems. J. Comput. Appl. Math. 197–215 (1995)
65. Wu, K.: Preconditioning techniques for large eigenvalue problems. Ph.D. thesis, University of
Minnesota (1997)
66. Jacobi, C.: Über ein leichtes verfahren die in der theorie der säculärstörungen vorkommenden
gleichungen numerisch aufzulösen. Crelle's J. für reine und angewandte Mathematik 30, 51–94
(1846)
67. Beckenbach, E., Bellman, R.: Inequalities. Springer, New York (1965)
68. Kantorovic̆, L.: Functional analysis and applied mathematics (Russian). Uspekhi Mat. Nauk.
3, 9–185 (1948)
69. Newman, M.: Kantorovich’s inequality. J. Res. Natl. Bur. Stand. B. Math. Math. Phys. 64B(1),
33–34 (1959). http://nvlpubs.nist.gov/nistpubs/jres/64B/jresv64Bn1p33_A1b.pdf
70. Benzi, M., Golub, G., Liesen, J.: Numerical solution of Saddle-point problems. Acta Numerica
pp. 1–137 (2005)
71. Elman, H., Silvester, D., Wathen, A.: Performance and analysis of Saddle-Point preconditioners
for the discrete steady-state Navier-Stokes equations. Numer. Math. 90, 641–664 (2002)
72. Paige, C.C., Saunders, M.A.: Solution of sparse indefinite systems of linear equations. SIAM
J. Numer. Anal. 12(4), 617–629 (1975)
73. Daniel, J., Gragg, W., Kaufman, L., Stewart, G.: Reorthogonalization and stable algorithms for
updating the Gram-Schmidt QR factorization. Math. Comput. 136, 772–795 (1976)
74. Sun, J.G.: Condition number and backward error for the generalized singular value decompo-
sition. SIAM J. Matrix Anal. Appl. 22(2), 323–341 (2000)
75. Stewart, G.G., Sun, J.: Matrix Perturbation Theory. Academic Press, Boston (1990)
76. Demmel, J., Gu, M., Eisenstat, S., Slapničar, I., Veselić, K., Drmač, Z.: Computing the singular
value decomposition with high relative accuracy. Linear Algebra Appl. 299(1–3), 21–80 (1999)
77. Sun, J.: A note on simple non-zero singular values. J. Comput. Math. 6(3), 258–266 (1988)
78. Berry, M., Sameh, A.: An overview of parallel algorithms for the singular value and symmetric
eigenvalue problems. J. Comput. Appl. Math. 27, 191–213 (1989)
79. Dongarra, J., Sorensen, D.C.: A fully parallel algorithm for the symmetric eigenvalue problem.
SIAM J. Sci. Stat. Comput. 8(2), s139–s154 (1987)
80. Golub, G., Reinsch, C.: Singular Value Decomposition and Least Squares Solutions. Springer
(1971)
81. Golub, G., Luk, F., Overton, M.: A block Lanczos method for computing the singular values
and corresponding singular vectors of a matrix. ACM Trans. Math. Softw. 7, 149–169 (1981)
82. Berry, M.: Large scale singular value decomposition. Int. J. Supercomput. Appl. 6, 13–49
(1992)
83. Wilkinson, J.: Inverse Iteration in Theory and in Practice. Academic Press (1972)
84. Philippe, B., Sadkane, M.: Computation of the fundamental singular subspace of a large matrix.
Linear Algebra Appl. 257, 77–104 (1997)
85. Hochstenbach, M.: A Jacobi-Davidson type SVD method. SIAM J. Sci. Comput. 23(2), 606–
628 (2001)
Part IV
Matrix Functions and Characteristics
Chapter 12
Matrix Functions and the Determinant

Many applications require the numerical evaluation of matrix functions, such as
polynomials and rational or transcendental functions with matrix arguments (e.g.
the exponential, the logarithm or trigonometric functions), or scalar functions of the matrix
elements, such as the determinant. When the underlying matrices are large, it is
important to have methods that lend themselves to parallel implementation.
As with iterative methods, two important applications are the numerical solution of
ODEs and the computation of network metrics. Regarding the former, it has long been
known that many classical schemes for solving ODEs amount to approximating the
action of the matrix exponential on a vector [1]. Exponential integrators have become
a powerful tool for solving large scale ODEs (e.g. see [2] and references therein) and
software has been developed to help in this task on uniprocessors and (to a much
lesser extent) parallel architectures (e.g. see [3–6]). Recall also that in our discussion
of rapid elliptic solvers in Sect. 6.4 we have already encountered computations with
matrix rational functions.
Matrix function-based metrics for networks have a long history. It was shown as
early as 1949 in [7] that if A is an adjacency matrix for a digraph, then the elements
of Ak show the number of paths of length k that connect any two nodes. A few years
later, a status score was proposed for each node of a social network based on the
underlying link structure [8]. Specifically, the score used the values of the product
of the resolvent (ζ I − A)−1 with a suitable vector. These ideas have resurfaced with
ranking algorithms such as Google’s PageRank and HITS [9–11] and interest in this
topic has peaked in the last decade due in part to the enormous expansion of the Web
and social networks. For example, researchers are studying how to compute such
values fast for very large networks; cf. [12].
Some network metrics use the matrix exponential [13, 14], the matrix hyperbolic
sine and cosine functions [15–17] or special series and polynomials [18–20]. In other
areas, such as multivariate statistics, economics, physics, and uncertainty quantifi-
cation, one is interested in the determinant, the diagonal of the matrix inverse and its
trace; cf. [21–23].
In this chapter we consider a few parallel algorithms for evaluating matrix rational
functions, the matrix exponential and the determinant. The monographs [24, 25] and
[26] are important references regarding the theoretical background and the design of
high quality algorithms for problems of this type on uniprocessors. Reference [27]
is a useful survey of software for matrix functions. Finally, it is worth noting that the
numerical evaluation of matrix functions is an extremely active area of research, and
many interesting developments are currently under consideration or yet to come.

12.1 Matrix Functions

It is well known from theory that if A is diagonalizable with eigenvalues {λi}_{i=1}^{n}, if
the scalar function f(ζ) is such that f(λi) exists for i = 1, . . . , n, and Q⁻¹AQ = Λ is
the matrix of eigenvalues, then the matrix function is defined as f(A) = Q f(Λ) Q⁻¹.
If A is non-diagonalizable, and Q⁻¹AQ = diag(J1, . . . , Jp) is its Jordan canonical
form, then f(A) is defined as f(A) = Q diag(f(J1), . . . , f(Jp)) Q⁻¹, assuming that
for each eigenvalue λi the derivatives { f(λi), f^(1)(λi), . . . , f^(ni−1)(λi)/(ni − 1)! } exist, where
ni is the size of the largest Jordan block containing λi, and f(Ji) is the Toeplitz upper
triangular matrix with first row the vector

( f(λi), f^(1)(λi), . . . , f^(mi−1)(λi)/(mi − 1)! ).

A matrix function can also be defined using the Cauchy integral

f(A) = (1/(2πι)) ∫_Γ f(ζ)(ζ I − A)⁻¹ dζ,    (12.1)

where Γ is a closed contour of winding number one enclosing the eigenvalues of A.
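For a diagonalizable matrix, the eigendecomposition definition above translates directly into a few lines of code; the following sketch (Python/NumPy, our own naming) is adequate only when the eigenvector matrix Q is well conditioned, and is shown purely to fix ideas.

    import numpy as np

    def fun_of_matrix(f, A):
        # f(A) = Q f(Lambda) Q^{-1} for a diagonalizable A
        # (for a real A with complex eigenvalues the result may carry a tiny imaginary part)
        lam, Q = np.linalg.eig(A)
        return Q @ np.diag(f(lam)) @ np.linalg.inv(Q)

    # example: the matrix exponential of a small random matrix
    A = np.random.rand(5, 5)
    E = fun_of_matrix(np.exp, A)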


A general template for computations involving matrix functions is

D + C f (A)B (12.2)

where f is the function under consideration defined on the spectrum of the square
matrix A and B, C, D are of compatible shapes. A large variety of matrix computa-
tions, including most of the BLAS, matrix powers and inversion, linear systems with
multiple right-hand sides and bilinear forms can be cast as in (12.2). We focus on
rational functions (that is ratios of polynomials) and the matrix exponential. Observe
that the case of functions with a linear denominator amounts to matrix inversion or
solving linear systems, which have been addressed earlier in this book.
Polynomials and rational functions are primary tools in the practical approxima-
tion of scalar functions. Moreover, as is well known from theory, matrix functions
can also be defined by means of a unique polynomial (the Hermite interpolating poly-
nomial) that depends on the underlying matrix and whose degree is at most that of the
minimal polynomial. Even though, in general, it is impractical to use it, its existence
provides a strong motivation for using polynomials. Finally, rational functions are
sometimes even more effective than polynomials in approximating scalar functions.
Since their manipulation in a matrix setting involves the solution of linear systems,
an interesting question, that we also discuss, is how to manipulate matrix rational
functions efficiently in a parallel environment.
We have already encountered matrix rational functions in this book. Recall our
discussion of algorithms BCR and EES in Sect. 6.4, where we used ratios of Cheby-
shev polynomials, and described the advantage of using partial fractions in a parallel
setting. In this section we extend this approach to more general rational functions.
Specifically, we consider the case where f in (12.2) is a rational function denoted by

r(ζ) = q(ζ)/p(ζ),    (12.3)

where the polynomials q and p are such that r (A) exists.


Remark 12.1 Unless noted otherwise, we make the following assumptions for the
rational function in (12.3): (i) the degree of q (denoted by deg q) is no larger than
the degree of p. (ii) The polynomials p and q have no common roots. (iii) The roots
of p (the poles of r ) are mutually distinct. (iv) The degrees of p and q are much
smaller than the matrix size. (v) We also assume that the coefficients have been
normalized so that p is monic. Because these assumptions hold for a large number
of computations of interest, we will be referring to them as standard assumptions
for the rational function.
Since r (A) exists, the set of poles of r and the set of eigenvalues of A have empty
intersection. Assumption (iv) is directly related to our focus on large matrices and the
fact that it is impractical and in the context of most problems, not necessary anyway,
to compute with matrix polynomials and rational functions of very high degree.
In the sequel, we assume that the roots of scalar polynomials as well as partial frac-
tion coefficients are readily available, as the cost of their computation is insignificant
relative to the matrix operations. The product form representation of p is


p(ζ) = ∏_{j=1}^{d} (ζ − τj),   where d = deg p.

Two cases of the template (12.2) amount to computing one of the following:

either r (A) = ( p(A))−1 q(A), or x = ( p(A))−1 q(A)b. (12.4)

The vector x can be computed by first evaluating r (A) and then multiplying by b. As
with matrix inversion, we can approximate x without first computing r (A) explicitly.
12.1.1 Methods Based on the Product Form of the Denominator

Algorithm 12.1 is one way of evaluating x = (p(A))⁻¹b when q ≡ 1. A parallel
implementation can only exploit the parallelism available in the solution of each
linear system. Algorithm 12.2, on the other hand, contains a parallel step at each one
of the d outer iterations. In both cases, because the matrices are shifts of the same
A, it might be preferable to first reduce A to an orthogonally similar Hessenberg
form. We comment more extensively on this reduction in the next subsection and in
Sect. 13.1.2.

Algorithm 12.1 Computing x = (p(A))⁻¹b when p(ζ) = ∏_{j=1}^{d} (ζ − τj) with τj
mutually distinct.
Input: A ∈ Rn×n , b ∈ Rn and values τ j distinct from the eigenvalues of A.
Output: Solution ( p(A))−1 b.
1: x0 = b
2: do j = 1 : d
3: solve (A − τ j I )x j = x j−1
4: end
5: return x = xd

Algorithm 12.2 Computing X = (p(A))⁻¹ when p(ζ) = ∏_{j=1}^{d} (ζ − τj) with τj
mutually distinct
Input: A ∈ Rn×n and values τj distinct from the eigenvalues of A.
Output: (p(A))⁻¹.
1: xi^(0) = ei for i = 1, . . . , n and X = (x1^(0), . . . , xn^(0)).
2: do j = 1 : d
3: doall i = 1 : n
4: solve (A − τj I) xi^(j) = xi^(j−1)
5: end
6: end
7: return X = (x1^(d), . . . , xn^(d))

We next consider evaluating r in the case that the numerator, q, is non-trivial
and/or its degree is larger than that of the denominator (violating, for the purposes of this
discussion, standard assumption (i)). If q is in power form, then Horner’s rule can
be used. The parallel evaluation of scalar polynomials has been discussed in the
literature and in this book (cf. the parallel Horner scheme in Sect. 3.3); see also [28–
34]. The presence of matrix arguments introduces some challenges as the relative
costs of the operations are very different. In particular, the costs of multiplying by
a scalar coefficient (in the power form), subtracting a scalar from the diagonal (in
the product form) or adding two matrices are much smaller than the cost of matrix
multiplication.
A method that takes into account these differences and uses approximately 2√d
matrix multiplications was presented in [24], based on an original algorithm (some-
times called Paterson-Stockmeyer) from Ref. [35]. The basic idea is to express p(ζ)
as a polynomial of degree about d/s in the variable ζ^s, for suitable s, with coefficients
that are themselves polynomials in ζ of degree less than s. The parallelism in the above
approach stems first from the matrix multiplications; cf. Sect. 2.2 for related algorithms
for dense matrices. Further opportunities are also available. For example, when d = 6
and s = 3, the expression used in [24] is

p(A) = π6 I (A³)² + (π5 A² + π4 A + π3 I) A³ + (π2 A² + π1 A + π0 I).

Evidently, the matrix terms in parentheses can be evaluated in parallel. A parallel
implementation on GPUs was considered in [36].
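A compact serial sketch of the idea follows (Python/NumPy; the routine name and interface are ours, not from [24] or [35]). It shows the two sources of matrix products: forming the powers A², . . . , A^s and the Horner recurrence in the variable A^s; choosing s ≈ √d minimizes their total, which is the source of the ≈ 2√d count quoted above.

    import numpy as np

    def paterson_stockmeyer(coeffs, A, s):
        # evaluate p(A) = sum_j coeffs[j] * A**j with about d/s + s matrix products;
        # coeffs[j] is the coefficient of zeta**j and s is the splitting parameter
        n = A.shape[0]
        d = len(coeffs) - 1
        P = [np.eye(n), A.copy()]                # I, A, A**2, ..., A**s
        for _ in range(2, s + 1):
            P.append(P[-1] @ A)
        As = P[s]
        result = np.zeros((n, n))
        for block in range(d // s, -1, -1):      # Horner's rule in the variable A**s
            C = np.zeros((n, n))
            for j in range(s):
                k = block * s + j
                if k <= d:
                    C = C + coeffs[k] * P[j]     # coefficient polynomial of degree < s
            result = result @ As + C
        return result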
We next sketch a slightly different approach. Consider, for example, a polynomial
of degree 7,

p(ζ) = Σ_{j=0}^{7} πj ζ^j.

This can be written as a cubic polynomial in ζ², with coefficients that are linear in ζ,

Σ_{j=0}^{3} (π_{2j} + π_{2j+1} ζ) ζ^{2j},

or as a linear polynomial in ζ⁴ with coefficients that are cubic polynomials in ζ,

p(ζ) = Σ_{j=0}^{1} (π_{4j} + π_{4j+1} ζ + π_{4j+2} ζ² + π_{4j+3} ζ³) ζ^{4j}.

These can be viewed as modified instances of a recursive approach described in [28]
that consists of expressing p with “polynomial multiply-and-add” operations, where
one of the two summands consists of the terms of even degree and the other is the
product of ζ with another even polynomial. Assuming for simplicity that d = 2^k − 1,
we can write p as a polynomial multiply-and-add

p_{2^k−1}(ζ) = p^(1)_{2^{k−1}−1}(ζ) + ζ^{2^{k−1}} p̂^(1)_{2^{k−1}−1}(ζ),    (12.5)

where we use the subscript to explicitly denote the maximum degree of the respective
polynomials. Note that the process can be applied recursively on each term p^(1)_j and
p̂^(1)_j. We can express this entirely in matrix form as follows. Define the matrices of
order 2n
 
Mj = ( A      0 )
     ( πj I   I ),      j = 0, 1, . . . , 2^k − 1.      (12.6)

Then it follows that

M_{2^k−1} M_{2^k−2} · · · M0 = ( A^{2^k}         0 )
                               ( p_{2^k−1}(A)    I ),      (12.7)

so the sought polynomial is in block position (2, 1) of the product. Moreover,

M_{2^k−1} M_{2^k−2} · · · M_{2^{k−1}} = ( A^{2^{k−1}}             0 )
                                        ( p̂^(1)_{2^{k−1}−1}(A)   I )

and

M_{2^{k−1}−1} M_{2^{k−1}−2} · · · M0 = ( A^{2^{k−1}}             0 )
                                       ( p^(1)_{2^{k−1}−1}(A)    I )

where the terms p^(1)_{2^{k−1}−1}, p̂^(1)_{2^{k−1}−1} are as in decomposition (12.5). The connection
of the polynomial multiply-and-add approach (12.5) with the product form (12.7)
using the terms Mj defined in (12.6) motivates the design of algorithms for the
parallel computation of the matrix polynomial based on a parallel fan-in approach,
e.g. as that proposed to solve triangular systems in Sect. 3.2.1 (Algorithm 3.2); a small
sketch is given below. With no limit on the number of processors, such a scheme would
take log d stages, each consisting of a matrix multiply (squaring) and a matrix
multiply-and-add. See also [37] for interesting connections between these approaches.
For a limited number of processors, the Paterson-Stockmeyer algorithm can be applied
in each processor to evaluate a low degree polynomial corresponding to the multiplication
of some of the terms Mj. Subsequently, the fan-in approach can be applied to combine
these intermediate results.

12.1.2 Methods Based on Partial Fractions

An alternative approach to the methods of the previous subsection is based on the


partial fraction representation. This not only highlights the linear aspects of rational
functions, as noted in [26], but it simplifies their computation in a manner that
engenders parallel processing. The partial fraction representation also emerges in the
approximation of matrix functions via numerical quadrature based on the Cauchy
integral form (12.1); cf. [5, 38].
In the simplest case of linear p and q, that is solving

(A − τ I )x = (A − σ I )b, (12.8)

the Laurent expansion for the rational function around the pole τ can be written as

(ζ − σ)/(ζ − τ) = (τ − σ)/(ζ − τ) + 1,

and so the matrix-vector multiplications can be completely avoided by replacing
(12.8) with

x = b + (τ − σ )(A − τ I )−1 b;

cf. [39]. Recall that we have already made use of partial fractions to enable the
parallel evaluation of some special rational functions in the rapid elliptic solvers of
Sect. 6.4. The following result is well known; see e.g. [40].
Theorem 12.1 Let p, q be polynomials such that deg q ≤ deg p, and let τ1 , . . . , τd
be the distinct zeros of p, with multiplicities of μ1 , . . . , μd , respectively, and distinct
from the roots of q. Then at each τj, the rational function r(ζ) = q(ζ)/p(ζ) has a pole of
order at most μj, and the partial fraction representation of r is given by

r(ζ) = ρ0 + Σ_{i=1}^{d} πi(ζ),   with   πi(ζ) = Σ_{j=1}^{μi} γ_{i,j} (ζ − τi)^{−j},

where πi(ζ) denotes the principal part of r at the pole τi, ρ0 = lim_{|ζ|→∞} r(ζ), and
the constants γ_{i,j} are the partial fraction expansion coefficients of r.
When the standard assumptions hold for the rational function, the partial fraction
expansion is


r(A) = ρ0 I + Σ_{j=1}^{d} ρj (A − τj I)⁻¹,   with   ρj = q(τj)/p′(τj),    (12.9)

which exhibits ample opportunities for parallel evaluation. One way to compute r(A)
is to first evaluate the terms (A − τj I)⁻¹ for j = 1, . . . , d simultaneously, followed by
the weighted summation. To obtain the vector r(A)b, we can first solve the d systems
(A − τj I)xj = b simultaneously, followed by computing x = ρ0 b + Σ_{j=1}^{d} ρj xj. The
latter can also be written as an MV operation,

x = ρ0 b + X r,   where X = (x1, . . . , xd), and r = (ρ1, . . . , ρd)ᵀ.    (12.10)

Algorithm 12.3 uses partial fractions to compute r (A)b.


Algorithm 12.3 Computing x = (p(A))⁻¹q(A)b when p(ζ) = ∏_{j=1}^{d} (ζ − τj) and
the roots τj are mutually distinct
Input: A ∈ Rn×n, b ∈ Rn, values {τ1, . . . , τd} (roots of p).
Output: Solution x = (p(A))⁻¹q(A)b.
1: doall j = 1 : d
2: compute coefficient ρj = q(τj)/p′(τj)
3: solve (A − τj I)xj = b
4: end
5: set ρ0 = lim_{|ζ|→∞} q(ζ)/p(ζ), r = (ρ1, . . . , ρd)ᵀ, X = (x1, . . . , xd)
6: compute x = ρ0 b + X r //it is important to compute this step in parallel

The partial fraction coefficients are computed only once for any given function,
thus under the last of the standard assumptions listed in Remark 12.1, the cost of line
2 is not taken into account; cf. [40] for a variety of methods. On d processors, each
can undertake the solution of one linear system (line 3 of Algorithm 12.3). Except
for line 6, where one needs to compute the linear combination of the d solution
vectors, the computations are totally independent. Thus, the cost of this algorithm
on d processors is equal to a (possibly complex) linear system solve per processor
and a dense MV. For general dense matrices of size n, the cost of solving the linear
systems dominates. For structured matrices, the cost of each system solve could be
comparable or even less than that of a single dense sequential MV. For tridiagonals
of order n, for example, the cost of a single sequential MV, in line 6, dominates
the O(n) cost of the linear system solve in line 3. Thus, it is important to consider
using a parallel algorithm for the dense MV. If the number of processors is smaller
than d then more systems can be assigned to each processor, whereas if it is larger,
more processors can participate in the solution of each system using some parallel
algorithm.
It is straightforward to modify Algorithm 12.3 to compute r (A) directly. Line 3
becomes an inversion while line 6 a sum of d inverses, costing O(n²d). For structured
matrices for which the inversion can be accomplished at a cost smaller than O(n³),
the cost of the final step can become comparable or even larger, assuming that the
inverses have no special structure.
Next, we consider in greater detail the solution of the linear systems in Algo-
rithm 12.3. Notice that all matrices involved are simple diagonal shifts of the form
(A − τ j I ). One possible preprocessing step is to transform A to an orthogonally
similar upper Hessenberg or (if symmetric) tridiagonal form; cf. [41] for an early
use of this observation, [42] for its application in computing the matrix exponential
and [43] for the case of reduction to tridiagonal form when the matrix is symmetric.
This preprocessing, incorporated in Algorithm 12.4, leads to considerable savings
since the reduction step is performed only once while the Hessenberg linear systems
in line 5 require only O(n²) operations. The cost drops to O(n) when the matrix is
symmetric since the reduced system becomes tridiagonal. In both cases, the dominant
cost is the reduction in line 1; e.g. see [44, 45] for parallel solvers of such linear
systems. In Chap. 13 we will make use of the same preprocessing step for computing
the matrix pseudospectrum.

Algorithm 12.4 Compute x = (p(A))⁻¹q(A)b when p(ζ) = ∏_{j=1}^{d} (ζ − τj) and
the roots τj are mutually distinct
Input: A ∈ Rn×n, b ∈ Rn, values {τ1, . . . , τd} (roots of p).
Output: Solution x = (p(A))⁻¹q(A)b.
1: [Q, H] = hess(A) //Q obtained as product of Householder transformations so that QᵀAQ =
H is upper Hessenberg
2: compute b̂ = Qᵀb //exploit the product form
3: doall j = 1 : d
4: compute coefficient ρj = q(τj)/p′(τj)
5: solve (H − τj I)x̂j = b̂
6: compute xj = Q x̂j //exploit the product form
7: end
8: set ρ0 = lim_{|ζ|→∞} q(ζ)/p(ζ), r = (ρ1, . . . , ρd)ᵀ, X = Q(x̂1, . . . , x̂d)
9: compute and return x = ρ0 b + X r //it is important to compute this step in parallel

Note that when the rational function is real, any complex poles appear in conjugate
pairs. Then the partial fraction coefficients also appear in conjugate pairs which makes
possible significant savings. For example, if (ρ, τ ) and (ρ̄, τ̄ ) are two such conjugate
pairs, then

ρ(A − τ I)⁻¹ + ρ̄(A − τ̄ I)⁻¹ = 2 Re( ρ(A − τ I)⁻¹ ).    (12.11)

Therefore, pairs of linear system solves can be combined into one.
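A sketch combining the two observations follows (Python/SciPy; our own interface). The Hessenberg reduction is done once, the loop over poles is the doall of Algorithm 12.4 (written serially here), and for a real matrix each conjugate pole pair is handled by a single complex solve as in (12.11); it assumes complex poles come with their conjugates and conjugate weights.

    import numpy as np
    from scipy.linalg import hessenberg, solve

    def rational_times_vector(A, b, poles, weights, rho0=0.0):
        n = A.shape[0]
        H, Q = hessenberg(A, calc_q=True)            # Q' A Q = H, done once
        bh = Q.T @ b
        x = rho0 * b
        for tau, rho in zip(poles, weights):         # independent shifted solves
            if np.imag(tau) < 0:
                continue                             # folded into its conjugate partner below
            y = solve(H - tau * np.eye(n), bh)       # an O(n^2) Hessenberg solver in practice
            if np.imag(tau) > 0:
                x = x + Q @ (2.0 * np.real(rho * y)) # the pair (tau, conj(tau)), cf. (12.11)
            else:
                x = x + Q @ np.real(rho * y)
        return x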


Historical Remark
The use of partial fraction decomposition to enable parallel processing can be
traced back to an ingenious idea proposed in [46] for computing ζ d , where ζ is
a scalar and d a positive integer, using O(log d) parallel additions and a few par-
allel operations. Using this as a basis, methods for computing expressions such as
{ζ², ζ³, . . . , ζ^d}, ∏_{i=1}^{d} (ζ + αi), and Σ_{i=0}^{d} αi ζ^i have also been developed. We briefly outline
the idea since it is at the core of the technique we described above. Let p(ω) = ω^d − 1,
which in factored form is p(ω) = ∏_{k=1}^{d} (ω − ωk), where ωk = e^{2πι(k−1)/d} (k = 1, . . . , d)
are the dth roots of unity. Then, we can compute the logarithmic derivative p′/p
explicitly as well as from the product form,


(p(ω))′ = Σ_{j=1}^{d} ∏_{k=1, k≠j}^{d} (ω − ωk).
p′(ζ)/p(ζ) = d ζ^{d−1}/(ζ^d − 1) = Σ_{j=1}^{d} 1/(ζ − ωj),

and therefore

ζ^d = 1/(1 − d/(ζ y)),   where   y = Σ_{j=1}^{d} 1/(ζ − ωj).

Therefore, ζ^d can be computed by means of Algorithm 12.5.

Algorithm 12.5 Computing ζ^d using partial fractions.

Input: ζ ∈ C, positive integer d, and values ωj = e^{2πι(j−1)/d} (j = 1, . . . , d).
Output: τ = ζ^d.
1: doall j = 1 : d
2: compute ψj = (ζ − ωj)⁻¹
3: end
4: compute the sum y = Σ_{j=1}^{d} ψj
5: compute τ = 1/(1 − d/(ζ y))
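A quick numerical check of the identity behind Algorithm 12.5 (Python/NumPy, serial; the sum in step 4 is the parallel reduction):

    import numpy as np

    zeta, d = 1.7, 8
    omega = np.exp(2j * np.pi * np.arange(d) / d)   # the d-th roots of unity
    psi = 1.0 / (zeta - omega)                      # steps 1-3
    y = psi.sum()                                   # step 4
    tau = 1.0 / (1.0 - d / (zeta * y))              # step 5
    print(abs(tau - zeta**d))                       # agrees to roundoff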

On d processors, the total cost of the algorithm is log d parallel additions for the
reduction, one parallel division and a small number of additional operations. At first
sight, the result is somewhat surprising; why should one go through this process to
compute powers in O(log d) parallel additions rather than apply repeated squaring
to achieve the same in O(log d) (serial) multiplications? The reason is that the com-
putational model in [46] was of a time when the cost of (scalar) multiplication was
higher than addition. Today, this assumption holds if we consider matrix operations.
In fact, [46] briefly mentions the possibility of using partial fraction expansions with
matrix arguments. Note that similar tradeoffs drive other fast algorithms, such as the
3M method for complex matrix multiplication, and Strassen’s method; cf. Sect. 2.2.2.

12.1.3 Partial Fractions in Finite Precision

Partial fraction expansions are very convenient for introducing parallelism but one
must be alert to roundoff effects in floating-point arithmetic. In particular, there are
cases where the partial fraction expansion contains terms that are large and of mixed
sign. If the computed result is small relative to the summands, it could be polluted
by the effect of catastrophic cancellation. For instance, building on an example from
[47], the expansion of p(ζ ) = (ζ − α)(ζ − δ)(ζ + δ) is

1 (α 2 − δ 2 )−1 (2δ(δ − α))−1 (2δ(δ + α))−1


= + + . (12.12)
p(ζ ) ζ −α ζ −δ ζ +δ
When |δ| is very small and |α| ≫ |δ|, the O(1/δ) factors in the last two terms lead to
catastrophic cancellation.
In this example, the danger of catastrophic cancellation is evident from the pres-
ence of two nearby poles that trigger the generation of large partial fraction coeffi-
cients of different sign. As shown in [47], however, cancellation
 can also be caused
by the distribution of the poles. For instance, if p(ζ) = ∏_{j=1}^{d} (ζ − j/d), then

1/p(ζ) = Σ_{j=1}^{d} ρj/(ζ − j/d),   with   ρj = d^{d−1} / ∏_{k=1, k≠j}^{d} (j − k).    (12.13)

When d = 20, then ρ20 = −ρ1 = 20^19/19! ≈ 4.3 × 10⁷. Finally note that even if
no catastrophic cancellation takes place, multiplications with large partial fraction
coefficients will magnify any errors that are already present in each partial fraction
term.
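The growth is easy to reproduce numerically; a small check of the coefficient magnitudes in (12.13) for d = 20 (Python/NumPy):

    import numpy as np

    d = 20
    idx = np.arange(1, d + 1)
    rho = np.array([float(d)**(d - 1) / np.prod([j - k for k in idx if k != j])
                    for j in idx])
    print(rho.min(), rho.max())   # roughly -4.3e7 and +4.3e7: large, mixed-sign coefficients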
In order to prevent large coefficients that can lead to cancellation effects when
evaluating partial fraction expansions, we follow [47] and consider hybrid represen-
tations for the rational polynomial, standing between partial fractions and the product
form. Specifically, we use a modification of the incomplete partial fraction decompo-
sition (IPF for short) that was devised in [48] to facilitate computing partial fraction
coefficients corresponding to denominators with quadratic factors which also helps
one avoid complex arithmetic; see also [40]. For example, if p(ζ) = (ζ − α)(ζ² + δ²)
is real with one real root α and two purely imaginary roots, ±ιδ, then an IPF that
avoids complex coefficients is
 
1/p(ζ) = (α² + δ²)⁻¹ ( 1/(ζ − α) − (ζ + α)/(ζ² + δ²) ).

The coefficients appearing in this IPF are ±(α² + δ²)⁻¹ and −α(α² + δ²)⁻¹. They
are all real and, fortuitously, bounded for very small values of δ.
In general, assuming that the rational function r = q/p satisfies the standard
assumptions and that we have a factorization of the denominator p(ζ) = ∏_{l=1}^{μ} pl(ζ)
into non-trivial polynomial factors p1, . . . , pμ, then an IPF based on these factors is

r(ζ) = Σ_{l=1}^{μ} hl(ζ)/pl(ζ),    (12.14)

where the polynomials h l are such that deg h l ≤ deg pl and their coefficients can
be computed using algorithms from [40]. It must be noted that the characterization
“incomplete” does not mean that it is approximate but rather that it is not fully
developed as a sum of d terms with linear denominators or powers thereof and thus
does not fully reveal some of the distinct poles.
Instead of this IPF, that we refer to as additive because it is written as a sum of


terms, we can use a multiplicative IPF of the form
r(ζ) = q(ζ)/p(ζ) = ∏_{l=1}^{μ} gl(ζ)/pl(ζ),    (12.15)

where g1, . . . , gμ are suitable polynomials with deg gl ≤ deg pl, and each term
gl/pl is expanded into its partial fractions.
The goal is to construct this decomposition and in turn the partial fraction expan-
sions so that none of the partial fraction coefficients exceeds in absolute value some
selected threshold, τ , and the number of terms, μ, is as small as possible. As in [47]
we expand each term above and obtain the partial fraction representation
r(ζ) = ∏_{l=1}^{μ} ( ρ_0^(l) + Σ_{j=1}^{kl} ρ_j^(l)/(ζ − τ_{j,l}) ),    Σ_{l=1}^{μ} kl = deg p,

where the poles have been re-indexed from {τ1, . . . , τd} to {τ_{1,1}, . . . , τ_{k1,1}, τ_{1,2}, . . . ,
τ_{kμ,μ}}. One idea is to perform a brute force search for an IPF that returns coefficients
that are not too large. For instance, for the rational function in Eq. (12.13) above, the
partial fraction coefficients when d = 6 are

ρ1 = −ρ6 = 64.8, ρ2 = −ρ5 = −324, ρ3 = −ρ4 = 648 (12.16)

If we set μ = 2 with k1 = 5 and k2 = 1 then we obtain the possibilities shown in
Table 12.1. Because of symmetries, some values are repeated. Based on these values,
we conclude that the smallest of the maximal coefficients are obtained when we
select p1 to be one of

∏_{j=1, j≠3}^{6} (ζ − j/6)    or    ∏_{j=1, j≠4}^{6} (ζ − j/6).

Table 12.1 Partial fraction coefficients for 1/p_{1,k}(ζ), where p_{1,k}(ζ) = ∏_{j=1, j≠k}^{6} (ζ − j/6), in the incomplete
partial fraction expansion of 1/( p_{1,k}(ζ) (ζ − k/n) )

k = 1      2        3        4        5        6
 54.0     43.2     32.4     21.6     10.8     54.0
−216.0   −162.0   −108.0    −54.0   −108.0   −216.0
 324.0    216.0    108.0    108.0    216.0    324.0
−216.0   −108.0    −54.0   −108.0   −162.0   −216.0
  54.0     10.8     21.6     32.4     43.2     54.0

If we select μ = 2, k1 = 4 and k2 = 2, the coefficients are as tabulated in


Table 12.2 (omitting repetitions).
Using μ = 2 as above but k1 = k2 = 3, the coefficients obtained are much
smaller as the following expansion shows:
 
1 1 1 1
=
p1 (ζ ) p2 (ζ ) (ζ − 16 )(ζ − 36 )(ζ − 56 ) (ζ − 26 )(ζ − 46 )(ζ − 1)
 
9 9 9 9 9 9
= − + − + .
2(ζ − 16 ) ζ− 3
6 2(ζ − 56 ) 2(ζ − 26 ) ζ− 4
6
2(ζ − 1)

From the above we conclude that when μ = 2, there exists an IPF with maximum
coefficient equal to 9, which is a significant reduction from the maximum value of
648, found in Eq. (12.16) for the usual partial fractions (μ = 1).
If the rational function is of the form 1/ p, an exhaustive search like the above
can identify the IPF decomposition that will minimize the coefficients. For the par-
allel setting, one has to consider the tradefoff between the size of the coefficients,
determined by τ , and the level of parallelism, determined by μ. For a rational func-
tion with numerator and denominator degrees (0, d) the total number of cases that
must be examined is equal to the total number of k-partitions of d elements, for
k = 1, . . . , d. For a given k, the number of partitions is equal to the Stirling number
of the 2nd kind, S(d, k), while their sum is Bell number, B(s). These grow rapidly,
for instance the B(6) = 203 whereas B(10) = 115,975 [49]. Therefore, as d grows
larger the procedure can become very costly if not prohibitive. Moreover, we also
have to address the computation of appropriate factorizations that are practical when
the numerator is not trivial. One option is to start the search and stop as soon as an IPF
with coefficients smaller than some selected threshold, say τ , has been computed.
We next consider an effective approach for computing IPF, proposed in [47] and
denoted by IPF(τ ), where the value τ is an upper bound for the coefficients that
is set by the user. When τ = 0 no decomposition is applied and when τ = ∞
(or sufficiently large) it is the usual decomposition with all d terms. We outline the
application of the method for rational functions with numerator 1 and denominator
degree n. The partial fraction coefficients in this case are

( dp/dζ (τi) )⁻¹ = ( ∏_{j=1, j≠i}^{d} (τi − τj) )⁻¹,    i = 1, . . . , d,

so their magnitude is determined by the magnitude of these products. To minimize


their value, the Leja ordering of the poles is used. Recalling Definition 9.5, if the set
of poles is T = {τj}_{j=1}^{d}, we let τ_{1,1} satisfy τ_{1,1} = arg min_{θ∈T} |θ| and let τ_{k,1} ∈ T
be such that


∏_{j=1}^{k−1} |τ_{k,1} − τ_{j,1}| = max_{θ∈T} ∏_{j=1}^{k−1} |θ − τ_{j,1}|,    τ_{k,1} ∈ T,   k = 1, 2, . . . .
Table 12.2 Partial fraction coefficients for 1/p_{1,(k,i)}(ζ), where p_{1,(k,i)}(ζ) = ∏_{j=1, j≠k,i}^{6} (ζ − j/6), in the incomplete partial fraction expansion of 1/( p_{1,(k,i)}(ζ) (ζ − k/n)(ζ − i/n) )

(1, 2) (1, 3) (1, 4) (1, 5) (1, 6) (2, 3) (2, 4) (2, 5) (2, 6) (3, 4) (3, 5) (3, 6) (4, 5)
36.0 27.0 18.0 9.0 36.0 21.6 14.4 7.2 27.0 10.8 5.4 18.0 3.6
−108.0 −72.0 −36.0 −54.0 −108.0 −54.0 −27.0 −36.0 −72.0 −18.0 −18.0 −36.0 −36.0
108.0 54.0 36.0 72.0 108.0 36.0 18.0 36.0 54.0 18.0 27.0 36.0 54.0
−36.0 −9.0 −18.0 −27.0 −36.0 −3.6 −5.4 −7.2 −9.0 −10.8 −14.4 −18.0 −21.6
6.0 3.0 2.0 1.5 1.2 6.0 3.0 2.0 1.5 6.0 3.0 2.0 6.0
−6.0 −3.0 −2.0 −1.5 −1.2 −6.0 −3.0 −2.0 −1.5 −6.0 −3.0 −2.0 −6.0

The pairs (k, i) in the header row indicate the poles left out to form the first factor. The partial fraction coefficients for the second factor ((ζ − k/n)(ζ − i/n))⁻¹
are listed in the bottom two rows. Since these are smaller than the coefficients for the first factor, we need not be concerned about their values
The idea is then to compute the coefficients of the partial fraction decomposition of
( ∏_{j=1}^{k} (ζ − τ_{j,1}) )⁻¹ for increasing values of k until one or more coefficients exceed
the threshold τ. Assume this happens for some value of k = k1 + 1. Then set as first
factor of the sought IPF(τ) the term (p1(ζ))⁻¹, where


p1(ζ) = ∏_{j=1}^{k1} (ζ − τ_{j,1}),

and denote the coefficients ρ_j^(l), j = 1, . . . , k. Next, remove the poles {τ_{j,1}}_{j=1}^{k1} from
T, perform Leja ordering on the remaining ones, and repeat the procedure to form
the next factor. On termination, the IPF(τ) representation is
∏_{j=1}^{d} (ζ − τj)⁻¹ = ∏_{l=1}^{μ} ( Σ_{j=1}^{kl} ρ_j^(l)/(ζ − τ_{j,l}) ).

This approach is implemented in Algorithm 12.6. The algorithm can be generalized to
provide the IPF(τ) representation of rational functions r = q/p with deg q ≤ deg p;
more details are provided in [47].
Generally, the smaller the value of τ , the larger the number of factors μ in this
representation. If τ = 0, then Algorithm 12.6 yields the product form representation
of 1/ p with the poles ordered so that their magnitude increases with their index. This
ordering is appropriate unless it causes overflow; cf. [50] regarding the Leja ordering
in product form representation.
There are no available bounds for the magnitude of the incomplete partial fraction
coefficients when the Leja ordering is applied on general sets T . We can, however,
bound the product of the coefficients ρ_j^(d) of the simple partial fraction decompo-
sition (IPF(∞)), namely ∏_{j=1}^{d} |ρ_j^(d)|, in terms of d, which motivates the use of the Leja ordering.
As was shown in [47, Theorem 4.3], if

rd(ζ) = 1/pd(ζ),   where   pd(ζ) = ∏_{j=1}^{d} (ζ − τ_j^(d)),

and for each d the poles τ_j^(d) are distinct and are the first d Leja points from a set T
that is compact in C and whose complement is connected and regular for the Dirichlet
problem, then, denoting by ρ̂_j^(d) the partial fraction coefficients for any other
arbitrary set of pairwise distinct points from T, we have


∏_{j=1}^{d} |ρ_j^{(d)}| ≤ χ^{−d(d−1)},    lim_{d→∞} ∏_{j=1}^{d} |ρ_j^{(d)}|^{1/(d(d−1))} = χ^{−1} ≤ lim_{d→∞} ∏_{j=1}^{d} |ρ̂_j^{(d)}|^{1/(d(d−1))},

where χ is the transfinite diameter of T .



Algorithm 12.6 Computing the IPF(τ) representation of (p(ζ))^{−1} when p(ζ) = ∏_{j=1}^{d} (ζ − τ_j) and the roots τ_j are mutually distinct
Input: T = {τ_j}_{j=1}^{d}, τ ≥ 0;
Output: the poles {τ_{j,l}}_{j=1}^{k_l} and coefficients {ρ_j^{(l)}}_{j=1}^{k_l} of each factor l = 1, . . . , μ, where ∑_{l=1}^{μ} k_l = d;
1: j = 1, μ = 1, k = 0;
2: while j ≤ d
3: k = k + 1; // j − 1 poles already selected and k − 1 poles in present factor of the IPF(τ) representation
4: if k = 1 then
5: choose τ_{1,μ} ∈ T such that |τ_{1,μ}| = min_{t∈T} |t|;
6: ρ_1^{(μ)} = 1; j = j + 1;
7: else
8: choose τ_{k,μ} ∈ T such that ∏_{l=1}^{k−1} |τ_{k,μ} − τ_{l,μ}| = max_{t∈T} ∏_{l=1}^{k−1} |t − τ_{l,μ}|;
9: do l = 1 : k − 1
10: ρ̃_l^{(μ)} = ρ_l^{(μ)} (τ_{l,μ} − τ_{k,μ})^{−1};
11: end
12: ρ̃_k^{(μ)} = ∏_{l=1}^{k−1} (τ_{k,μ} − τ_{l,μ})^{−1};
13: if max_{1≤l≤k} |ρ̃_l^{(μ)}| ≤ τ then
14: do l = 1 : k
15: ρ_l^{(μ)} = ρ̃_l^{(μ)};
16: end
17: j = j + 1
18: else
19: T = T \ {τ_{l,μ}}_{l=1}^{k−1}; k_μ = k − 1; k = 0; μ = μ + 1; //begin new factor
20: end if
21: end if
22: end while
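As an illustration only, the following Python sketch mirrors the structure of Algorithm 12.6 (the function name ipf_leja and the list-based bookkeeping are our own choices; accepted poles are removed from T as soon as they are accepted, which does not change the result):

import numpy as np

def ipf_leja(poles, tau):
    """Sketch of Algorithm 12.6: split 1/prod_j (z - poles[j]) into IPF(tau) factors.
    Returns a list of factors; factor l is a pair (poles_l, coeffs_l) such that
    1/p(z) = prod_l ( sum_j coeffs_l[j] / (z - poles_l[j]) )."""
    T = list(poles)                    # poles not yet assigned to a factor
    factors = []                       # completed factors
    cur_poles, cur_rho = [], []        # factor under construction
    while T:
        if not cur_poles:
            t = min(T, key=abs)        # start a factor with the pole of smallest modulus
            T.remove(t)
            cur_poles, cur_rho = [t], [1.0 + 0.0j]
        else:
            # Leja choice: maximize the product of distances to the poles already accepted
            t = max(T, key=lambda s: np.prod([abs(s - p) for p in cur_poles]))
            new_rho = [r / (p - t) for r, p in zip(cur_rho, cur_poles)]
            new_rho.append(1.0 / np.prod([t - p for p in cur_poles]))
            if max(abs(r) for r in new_rho) <= tau:
                T.remove(t)            # accept the pole into the current factor
                cur_poles, cur_rho = cur_poles + [t], new_rho
            else:                      # threshold exceeded: close this factor, start a new one
                factors.append((cur_poles, cur_rho))
                cur_poles, cur_rho = [], []
    factors.append((cur_poles, cur_rho))
    return factors

With tau = np.inf a single factor (the usual partial fraction expansion) is returned, while tau = 0 reproduces the product form representation discussed above.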

12.1.4 Iterative Methods and the Matrix Exponential

We next extend our discussion to include cases in which the underlying matrix is so large
that iterative methods become necessary. It is then only feasible to compute f (A)B,
where B consists of one or few vectors, rather than f (A). A typical case is in solving
large systems of differential equations, e.g. those that occur after spatial discretization
of initial value problems of parabolic type. Then, the function in the template (12.2)
is related to the exponential; a large number of exponential integrators, methods
suitable for these problems, have been developed; cf. [2] for a survey. See also [51–
55]. The discussion that follows focuses on the computation of exp(A)b.
The two primary tools used for the effective approximation of exp(A)b when
A is large, are the partial fraction representation of rational approximations to the
exponential and Krylov projection methods. The early presentations [56–60] on the
combination of these principles for computing the exponential also contained pro-
posals for the parallel computation of the exponential. This was a direct consequence
of the fact that partial fractions and the Arnoldi procedure provide opportunities for
large and medium grain parallelism. These tools are also applicable for more general
matrix functions. Moreover, they are of interest independently of parallel processing.

For this reason, the theoretical underpinnings of such methods for the exponential and
other matrix functions have been the subject of extensive research; see e.g. [61–65].
One class of rational approximations consists of the Padé approximants. Their numer-
ator and denominator polynomials are known analytically; cf. [66, 67]. Padé, like
Taylor series, are designed to provide good local approximations, e.g. near 0. More-
over, the roots of the numerator and denominator are simple and contain no common
values. For better approximation, the identity exp(A) = (exp(Ah))^{1/h} is used to
cluster the eigenvalues close to the origin. The other possibility is to construct a
rational Chebyshev approximation on some compact set in the complex plane. The
Chebyshev rational approximation is primarily applicable to matrices that are sym-
metric negative definite and has to be computed numerically; a standard reference
for the power form coefficients of the numerator and denominator polynomials with
deg p = deg q (“diagonal approximations”) ranging from 1 to 30 is [68]. Refer-
ence [58] describes parallel algorithms based on the Padé and Chebyshev rational
approximations, illustrating their effectiveness and the advantages of the Chebyshev
approximation [68–70] for negative definite matrices. An alternative to the deli-
cate computations involved in the Chebyshev rational approximation is to use the
Caratheodory-Fejér (CF) method to obtain good rational approximations followed
by their partial fraction representation; cf. [5, 26].
Recall that any complex poles appear in conjugate pairs. These can be grouped
as suggested in Eq. (12.11) to halve the number of solves with (A − τ I ), where
τ is one member of each conjugate pair. In the common case that A is Hermitian and τ
is complex, (A − τ I ) is complex symmetric and there exist Lanczos and CG type
iterative methods that take advantage of this special structure; cf. [71–73]. Another
possibility is to combine terms corresponding to conjugate shifts and take advantage
of the fact that

(A − τI)(A − τ̄I) = A² − 2ℜ(τ)A + |τ|²I = A(A − 2ℜ(τ)I) + |τ|²I

is a real matrix. Then, MV operations in a Krylov method would only involve real
arithmetic.
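For illustration (our own minimal sketch, not code from the references), the combined factor above can be applied to a real vector without complex arithmetic:

import numpy as np

def apply_conjugate_pair(A, tau, v):
    """Apply (A - tau I)(A - conj(tau) I) = A(A - 2 Re(tau) I) + |tau|^2 I to v
    using two real matrix-vector products (A and v real, tau complex)."""
    w = A @ v
    return A @ (w - 2.0 * tau.real * v) + (abs(tau) ** 2) * v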
Consider now the solution of the linear systems corresponding to the partial frac-
tion terms. As noted in the previous subsection, if one were able to use a direct
method, it would be possible to save computation and lower redundancy by first
reducing the matrix to Hessenberg form. When full reduction is prohibitive because
of the size of A, one can apply a “partial” approach using the Arnoldi process to
build an orthonormal basis, V_ν, for the Krylov subspace K_ν(A, b), where ν ≪ n.
The parallelism in Krylov methods was discussed in Sect. 9.3.2 of Chap. 9. Here,
however, it also holds that

V_ν^∗ (A − τI) V_ν = H_ν − τ I_ν

for any τ and so the same basis reduces not only A but all shifted matrices to Hes-
senberg form [74, 75]. This is the well known shift invariance property of Krylov

subspaces that can be expressed by Kν (A, b) = Kν (A − τ I, b); cf. [76]. This prop-
erty implies that methods such as FOM and GMRES, can proceed by first computing
the basis via Arnoldi followed by the solution of (different) small systems, followed
by computing the solution to each partial fraction term using the same basis and then
combining using the partial fraction coefficients. For example, if FOM can be used,
then setting β = ‖b‖, an approximation to r(A)b from K_ν(A, b) is

x̃ = βV_ν ∑_{j=1}^{d} ρ_j (H_ν − τ_j I)^{−1} e_1 = βV_ν r(H_ν) e_1,        (12.17)

on the condition that all matrices (Hν − τ j I ) are invertible. The above approach can
also be extended to handle restarting; cf. [77] when solving the shifted systems. It
is also possible to solve the multiply shifted systems by Lanczos based approaches
including BiCGSTAB, QMR and transpose free QMR; cf. [78, 79].
Consider now (12.17). By construction, if r (Hν ) exists, it must also provide a
rational approximation to the exponential. In other words, we can write

exp(A)b ≈ βVν exp(Hν )e1 . (12.18)

This is the Krylov approach for approximating the exponential, cf. [57, 58, 80,
81]. Note that the only difference between (12.17) and (12.18) is that r (Hν ) was
replaced by exp(Hν ). It can be shown that the term βVν exp(Hν )e1 is a polynomial
approximation of exp(A)b with a polynomial of degree ν − 1 which interpolates
the exponential, in the Hermite sense, on the set of eigenvalues of Hν ; [82]. In fact,
for certain classes of matrices, this type of approximation is extremely accurate;
cf. [61] and references therein. Moreover, Eq. (12.18) provides a framework for
approximating general matrix functions using Krylov subspaces. We refer to [83]
for generalizations and their parallel implementations.
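The following Python sketch illustrates both (12.17) and (12.18) with dense linear algebra (the helper names arnoldi, krylov_rational and krylov_exp are our own; the partial fraction data ρ_j, τ_j of a rational approximation to the exponential are assumed to be supplied by the user, and no breakdown handling is included):

import numpy as np
from scipy.linalg import expm, solve

def arnoldi(A, b, m):
    """Arnoldi process: V (n x m) with orthonormal columns and H (m x m) upper
    Hessenberg such that V^* A V = H (residual term ignored, no breakdown check)."""
    n = b.shape[0]
    V = np.zeros((n, m), dtype=complex)
    H = np.zeros((m, m), dtype=complex)
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = np.vdot(V[:, i], w)
            w -= H[i, j] * V[:, i]
        if j + 1 < m:
            H[j + 1, j] = np.linalg.norm(w)
            V[:, j + 1] = w / H[j + 1, j]
    return V, H

def krylov_exp(A, b, m):
    """exp(A) b  ≈  beta * V_m exp(H_m) e_1, cf. (12.18)."""
    beta = np.linalg.norm(b)
    V, H = arnoldi(A, b, m)
    return beta * V @ expm(H)[:, 0]

def krylov_rational(A, b, m, rho, taus, rho0=0.0):
    """r(A) b with r(z) = rho0 + sum_j rho[j]/(z - taus[j]), evaluated as in (12.17):
    one Arnoldi basis is shared by all the small shifted Hessenberg solves."""
    beta = np.linalg.norm(b)
    V, H = arnoldi(A, b, m)
    e1 = np.zeros(H.shape[0]); e1[0] = 1.0
    y = rho0 * e1.astype(complex)
    for r_j, t_j in zip(rho, taus):   # independent small solves: large grain parallelism
        y = y + r_j * solve(H - t_j * np.eye(H.shape[0]), e1)
    return beta * V @ y

The single Arnoldi factorization is what realizes the shift invariance: all the small shifted systems share V_ν and H_ν, so in this sketch the large grain parallelism is confined to the independent solves with H_ν − τ_j I.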
Consider, for example, using transpose free QMR. Then there is a multiply shifted
version of TFQMR [78] where each iteration consists of computations with the
common Krylov information to be shared between all systems and computations
that are specific to each term, building or updating data special to each term. As
described in [56], at each iteration of multiply shifted TFQMR, the dimension of the
underlying Krylov subspace increases by 2. The necessary computations are of two
types, one set that is used to advance the dimension of the Krylov subspace that are
independent of the total number of systems and consist of 2 MVs, 4 dot products
and 6 vector updates. The other set consists of computations specific to each term,
namely 9 vector updates and a few scalar ones, that can be conducted in parallel.
One possibility is to stop the iterations when an accurate approximation has been
obtained for all systems. Finally, there needs to be an MV operation to combine the
partial solutions as in (12.10). Because the roots τ j are likely to be complex, we
expect that this BLAS2 operation will contain complex elements.
An important characteristic of this approach is that it has both large grain paral-
lelism because of the partial fraction decomposition as well as medium grain par-

allelism because of the shared computations with A and vectors of that size. Recall
that the shared computations occur because of our desire to reduce redundancy by
exploiting the shift invariance of Krylov subspaces.
The exploitation of the multiple shifts reduces redundancy but also curtails par-
allelism. Under some conditions (e.g. high communication costs) this might not be
desirable. A more general approach that provides more flexibility over the amount
of large and medium grain parallelism was proposed in [56]. The idea is to organize
the partial fraction terms into groups, and to express the rational function as a double
sum, say


r(A) = ρ_0 I + ∑_{l=1}^{k} ∑_{j∈I_l} ρ_j^{(l)} (A − ζ_j I)^{−1},

where the index sets I_l, l = 1, . . . , k, are a partition of {1, 2, . . . , deg p} and the
sets of coefficients {ρ_j^{(1)}}, . . . , {ρ_j^{(k)}} are a partition of the set {ρ_1, . . . , ρ_d}. Then we
can build a hybrid scheme in which the inner sum is constructed using the multi-
ply shifted approach, but the k components of the outer sum are treated completely
independently. The extreme cases are k = deg p, in which all systems are treated
independently, and k = 1, in which all systems are solved with a single instance of
the multiply shifted approach. This flexible hybrid approach was used on the Cedar
vector multiprocessor but can be useful in the context of architectures with hierar-
chical parallelism, that are emerging as a dominant paradigm in high performance
computing.
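As a schematic of this flexible grouping (our own sketch; a dense shifted solve stands in for the multiply shifted Krylov solver that would be used in practice, and the process pool mimics the fully independent treatment of the k groups):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def solve_group(args):
    """Inner sum over one group: sum_{j in group} rho_j (A - zeta_j I)^{-1} b."""
    A, b, rhos, zetas = args
    y = np.zeros(b.shape[0], dtype=complex)
    for rho, zeta in zip(rhos, zetas):
        y += rho * np.linalg.solve(A - zeta * np.eye(A.shape[0]), b)
    return y

def hybrid_rational_apply(A, b, groups, rho0=0.0, workers=2):
    """r(A) b with the partial fraction terms split into independent groups.
    `groups` is a list of (rhos, zetas) pairs; the outer sum over groups is
    processed in parallel (large grain), while each group would internally
    share one Krylov basis (medium grain) in a real implementation."""
    tasks = [(A, b, rhos, zetas) for rhos, zetas in groups]
    with ProcessPoolExecutor(max_workers=workers) as ex:
        partial = list(ex.map(solve_group, tasks))
    return rho0 * b + sum(partial)

On some platforms the process pool requires the calling script to be guarded by if __name__ == "__main__".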
In the more general case where we need to compute exp(A)B, where B is a block
of vectors that replaces b in (12.9), there is a need to solve a set of shifted systems
for each column of B. This involves ample parallelism, but also replication of work,
given what we know about the Krylov invariance to shifting. As before, it is natural
to seek techniques (e.g. [84–86]) that will reduce the redundancy. The final choice,
of course, will depend on the problem and on the characteristics of the underlying
computational platform.
We conclude by recalling from Sect. 12.1.3 that the use of partial fractions requires
care to avoid catastrophic cancellations and loss of accuracy. Table 12.3 indicates
how large the partial fraction coefficients of some Padé approximants become as
the degrees of the numerator and denominator increase. To avoid problems, we
can apply incomplete partial fractions as suggested in Sect. 12.1.3. For example, in
Table 12.4 we present results with the IPF(τ ) algorithm that extends (12.6) to the
case of rational functions that are the quotient of polynomials of equal degree. The
algorithm was applied to evaluate the partial fraction representations of diagonal Padé
approximations for the matrix exponential applied on a vector, that is rd,d (−Aδ)b
as approximations for exp(−Aδ)b with A = (1/h²)[−1, 2, −1]_N using h = 1/(N + 1)
with N = 998. The right-hand side b = ∑_{j=1}^{N} (1/j) v_j, where v_j are the eigenvectors
of A, ordered so that v_j corresponds to the eigenvalue λ_j = (4/h²) sin²( jπ/(2(N + 1)) ). Under
“components” we show the groups that formed from the application of the IPF

Table 12.3 Base 10 logarithm of the partial fraction coefficient of largest magnitude of the Padé approximation of e^z with numerator and denominator degrees deg q, deg p. Data are from Ref. [47]

deg p \ deg q      2      4      8     14     20
 2               1.1
 4               1.3    2.3
 8               1.8    2.7    4.6
14               2.5    3.5    5.3    8.1
20               3.3    4.2    6.1    8.8   11.5

Table 12.4 Components and base 10 logarithm of the maximum relative errors when exp(−Aδ)b is evaluated based on the diagonal Padé approximant of exp(−Aδ) and using IPF(τ) for δ = 1 × 10^{−5}. Data extracted from [47]

 d    τ       Components           Errors      d    τ       Components           Errors
14    0       {1, . . . , 1}       −5.5       24    0       {1, . . . , 1}       −13.1
      10^4    {8, 4, 2}            −5.5             10^4    {9, 5, 5, 3, 2}      −12.7
      10^8    {13, 1}              −5.5             10^8    {16, 8}              −9.1
      ∞       {14}                 −5.5             ∞       {24}                 −2.6
18    0       {1, . . . , 1}       −8.2       28    0       {1, . . . , 1}       −13.7
      10^4    {8, 5, 4, 1}         −8.2             10^4    {8, 6, 5, 5, 3, 1}   −12.6
      10^8    {15, 3}              −8.2             10^8    {18, 9, 1}           −8.6
      ∞       {18}                 −6.1             ∞       {28}                 −1.7

algorithm. For example, when d = 24, the component set for τ = 10^8 indicates that
to keep the partial fraction coefficients below that value, the rational function was
written as a product of two terms, one consisting of a sum of 16 elements, the other
of a sum of 8. The resulting relative error in the final solution is approximately 10^{−9}.
If instead one were to use the full partial fraction as a sum of 24 terms, the error
would be 10^{−2.6}.

12.2 Determinants

The determinant of a matrix A ∈ Cn×n is easily computed from its LU fac-


torization: if PAQ = LU, where P and Q are permutation matrices, L is a
unit lower triangular matrix, and U = (ν_{ij}) an upper triangular matrix, then
det A = det P × det Q × ∏_{i=1}^{n} ν_{ii}. Using Gaussian elimination with partial pivot-
ing, we obtain the factorization of the dense matrix A, P A = LU . This guarantees
that the computed determinant of U is the exact determinant of a slightly perturbed
A (see [87]). The main problem we face here is how to guarantee that no over- or
underflow occurs when accumulating the multiplication of the diagonal entries of

U. The classical technique consists of normalizing this quantity by representing the
determinant by the triplet (ρ, K, n), where

det A = ρ K^n,        (12.19)

in which |ρ| = 1 and ln K = (1/n) ∑_{i=1}^{n} ln |ν_{ii}|.
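A minimal sketch of this normalization, assuming SciPy's dense LU factorization (the function name logdet_triplet is ours and A is assumed nonsingular):

import numpy as np
from scipy.linalg import lu_factor

def logdet_triplet(A):
    """Return (rho, K, n) with det(A) = rho * K**n and |rho| = 1, avoiding
    over/underflow by working with log|u_ii| instead of the product itself."""
    n = A.shape[0]
    lu, piv = lu_factor(A)                    # P A = L U; diag(U) sits on the diagonal of lu
    d = np.diag(lu)
    swaps = np.count_nonzero(piv != np.arange(n))
    sign_P = -1.0 if swaps % 2 else 1.0       # signature of the row permutation
    rho = sign_P * np.prod(d / np.abs(d))     # unit-modulus factor (a phase if A is complex)
    K = np.exp(np.mean(np.log(np.abs(d))))    # ln K = (1/n) * sum ln|u_ii|
    return rho, K, n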
For a general dense matrix, the complexity of the computation is O(n 3 ), and
O(n 2 ) for a Hessenberg matrix. For the latter, a variation of the above procedure is
the use of Hyman’s method (see [87] and references therein). In the present section,
we consider techniques that are more suitable for large structured sparse matrices.

12.2.1 Determinant of a Block-Tridiagonal Matrix

The computation of the determinant of a nonsingular large sparse matrix can be


obtained by using any of the parallel sparse LU factorization schemes used in
MUMPS [88], or SuperLU [89], for example. On distributed memory parallel archi-
tectures, such schemes realize very modest parallel efficiencies at best. If the sparse
matrices can be reordered into the block-tridiagonal form, however, we can apply
the Spike factorization scheme (see Sect. 5.2) to improve the parallel scalability of
computing matrix determinants. This approach is described in [90].
Let the reordered sparse matrix A be of the form,
      ⎛ A_1       A_{1,2}   0         · · ·      0         ⎞
      ⎜ A_{2,1}   A_2       ⋱         ⋱          ⋮         ⎟
      ⎜ 0         ⋱         ⋱         ⋱          0         ⎟ ,        (12.20)
      ⎜ ⋮         ⋱         ⋱         ⋱          A_{q−1,q} ⎟
      ⎝ 0         · · ·     0         A_{q,q−1}  A_q       ⎠
where for i = 1, . . . , q − 1 the blocks Ai+1,i and Ai,i+1 are coupling matrices
defined by:
 
            ⎛ 0    0 ⎞                   ⎛ 0    C_{i+1} ⎞
A_{i,i+1} = ⎜        ⎟ ,     A_{i+1,i} = ⎜              ⎟ ;
            ⎝ B_i  0 ⎠                   ⎝ 0    0       ⎠

with A_i ∈ C^{n_i×n_i}, B_i ∈ C^{b_i×l_i}, C_{i+1} ∈ C^{c_{i+1}×r_{i+1}}. We assume throughout that
c_i ≤ n_i, b_i ≤ n_i and l_{i−1} + r_{i+1} ≤ n_i.

Consider the Spike factorization scheme in Sect. 5.2.1, where, without loss of
generality, we assume that D = diag(A1 , A2 , . . . , Aq ) is nonsingular, and for i =
1, . . . , q, we have the factorizations Pi Ai Q i = L i Ui of the sparse diagonal blocks
Ai . Let P = diag(P1 , P2 , . . . , Pq ), Q = diag(Q 1 , Q 2 , . . . , Q q ) and

S = D −1 A = In + T,

where

In = diag(Is1 , Ir2 , Il1 , Is2 , Ir3 , Il2 · · · , Irq , Ilq−1 , Isq ),

and
      ⎛ 0_{s_1}   0         V_1                                                ⎞
      ⎜ 0         0_{r_2}   V_1^b                                              ⎟
      ⎜           W_2^t     0_{l_1}   0         0         V_2^t                ⎟
      ⎜           W_2       0         0_{s_2}   0         V_2                  ⎟
  T = ⎜           W_2^b     0         0         0_{r_3}   V_2^b                ⎟ ,
      ⎜                               ⋱                   ⋱           ⋱        ⎟
      ⎜                                         W_q^t     0_{l_{q−1}} 0        ⎟
      ⎝                                         W_q       0           0_{s_q}  ⎠

in which blank blocks are zero and the pattern of the middle block rows (hidden by the dots) repeats that of block rows 2 through q − 1,

and where the right and left spikes, respectively, are given by [91]:

V_1 ∈ C^{n_1×l_1}, partitioned into row blocks of sizes s_1 and r_2 as V_1 = [V_1; V_1^b],            (12.21)

V_1 = (A_1)^{−1} [0; B_1] = Q_1 U_1^{−1} L_1^{−1} P_1 [0; B_1],                                       (12.22)

and for i = 2, . . . , q − 1,

V_i ∈ C^{n_i×l_i}, partitioned into row blocks of sizes l_{i−1}, s_i, r_{i+1} as V_i = [V_i^t; V_i; V_i^b],   (12.23)

V_i = (A_i)^{−1} [0; B_i] = Q_i U_i^{−1} L_i^{−1} P_i [0; B_i],                                       (12.24)

W_i ∈ C^{n_i×r_i}, partitioned into row blocks of sizes l_{i−1}, s_i, r_{i+1} as W_i = [W_i^t; W_i; W_i^b],   (12.25)

W_i = (A_i)^{−1} [C_i; 0] = Q_i U_i^{−1} L_i^{−1} P_i [C_i; 0],                                       (12.26)

and

W_q ∈ C^{n_q×r_q}, partitioned into row blocks of sizes l_{q−1} and s_q as W_q = [W_q^t; W_q],        (12.27)

W_q = (A_q)^{−1} [C_q; 0] = Q_q U_q^{−1} L_q^{−1} P_q [C_q; 0],                                       (12.28)

where [X; Y] denotes vertical stacking.

With this partition, it is clear that:




det A = ( ∏_{i=1}^{q} sign P_i × sign Q_i × det U_i ) det S,        (12.29)

where sign P stands for the signature of permutation P.


However, as we have seen in Sect. 5.2.1, the Spike matrix S can be reordered
to result in a 2 × 2 block-upper triangular matrix with one of the diagonal blocks
the identity matrix and the other, Ŝ, being the coefficient matrix of the nonsingular
reduced system, see (5.10) and (5.11), which is of order n − ∑_{k=1}^{q} s_k. In other words,

det S = det Ŝ.

Thus, the computation of the determinant of Ŝ requires an LU-factorization of Ŝ,
which can be realized via one of the parallel sparse direct solvers MUMPS, SuperLU,
or IBM’s WSMP.

12.2.2 Counting Eigenvalues with Determinants

The localization of eigenvalues of a given matrix A is of interest in many scientific


applications. When the matrix is real symmetric or complex Hermitian, a procedure
based on the computation of Sturm sequences allows the safe application of bisections
on real intervals to localize the eigenvalues as shown in Sects. 8.2.3 and 11.1.2. The
problem is much harder for real nonsymmetric or complex non-Hermitian matrices
and especially for non-normal matrices.
Let us assume that some Jordan curve Γ is given in the complex plane, and that
one wishes to count the number of eigenvalues of the matrix A that are surrounded by
Γ . Several procedures have been proposed for such a task in [92] and, more recently,
a complete algorithm has been proposed in [93] which we present below.

The number of surrounded eigenvalues is determined by evaluating the integral


from the Cauchy formula (see e.g. [94, 95]):

N_Γ = (1/(2iπ)) ∮_Γ f′(z)/f(z) dz,        (12.30)

where f (z) = det(z I − A) is the characteristic polynomial of A. This integral is also


considered in work concerning nonlinear eigenvalue problems [96, 97].
Let us assume that Γ = ∪_{i=0}^{N−1} [z_i, z_{i+1}] is a polygonal curve where [z_i, z_{i+1}] ⊂
C denotes the line segment with end points z_i and z_{i+1}. This is a user-defined curve
which approximates the initial Jordan curve. Hence,

f(z + h) = f(z) det(I + h R(z)),        (12.31)

where R(z) = (zI − A)^{−1} is the resolvent. Also, let Φ_z(h) = det(I + h R(z)); then

∫_z^{z+h} f′(z)/f(z) dz = ln(Φ_z(h)) = ln |Φ_z(h)| + i arg(Φ_z(h)).

The following lemma determines the stepsize control which guarantees proper
integration. The branch (i.e. a determination arg0 of the argument), which is to be
followed along the integration process, is fixed by selecting an origin z 0 ∈ Γ and by
insuring that

arg0 ( f (z 0 )) = Arg( f (z 0 )). (12.32)

Lemma 12.1 (Condition (A)) Let z and h be such that [z, z + h] ⊂ Γ. If

|Arg(Φz (s))| < π, ∀s ∈ [0, h], (12.33)

then,

arg0 ( f (z + h)) = arg0 ( f (z)) + Arg(Φz (h)), (12.34)

where arg0 is determined as in (12.32) by a given z 0 ∈ Γ .

Proof 1 See [93].

Condition (A) is equivalent to

Φ_z(s) ∉ (−∞, 0],   ∀s ∈ [0, h].

It can be replaced by a more strict condition. Since

Φz (s) = 1 + δ, with δ = ρeiθ ,

a sufficient condition for (12.33) to be satisfied, which we denote as Condition (B),


is ρ < 1, i.e.

|Φz (s) − 1| < 1, ∀s ∈ [0, h]. (12.35)

This condition can be approximated by considering the tangent Ψ_z(s) = 1 + sΦ′_z(0)
and substituting it in (12.35) to obtain Condition (C):

|h| < 1 / |Φ′_z(0)|.        (12.36)

The derivative Φ′_z(0) is given by:

Φ′_z(0) = trace(R(z)),        (12.37)

e.g. see [21, 97–100]. The most straightforward procedure, but not the most efficient,
consists of approximating the derivative with the ratio

Φ′_z(0) ≈ (Φ_z(s) − 1)/s,
where s = αh with an appropriately small α. Therefore, the computation imposes an
additional LU factorization for evaluating the quantity Φz (s). This approach doubles
the computational effort when compared to computing only one determinant per
vertex as needed for the integration.

Fig. 12.1 Application of EIGENCNT on a random matrix of order 5. The eigenvalues are indicated by the stars. The polygonal line is defined by the 10 points with circles; the other points of the line are automatically introduced to insure the conditions as specified in [93]

A short description of the procedure EIGENCNT is given in Algorithm 12.7. For


a detailed description, see [93]. The algorithm exhibits a parallel loop in which the
only critical section is related to the management of the ordered list of vertices Z .
For a real matrix A and when the line Γ is symmetric with respect to the real axis,
the computational cost is halved by only integrating in the half complex plane of
nonnegative imaginary values.

Algorithm 12.7 EIGENCNT: counting eigenvalues surrounded by a curve.
Input: A ∈ R^{n×n} and a polygonal line Γ.
Output: The number neg of algebraic eigenvalues of A which are surrounded by Γ.
1: Z = ordered list of the vertices of Γ; Status(Z) = “no information”; //The last element of the list is equal to the first one.
2: while Status(Z) ≠ “ready”,
3: doall z ∈ Z,
4: if Status(z) ≠ “ready”, then
5: if Status(z) = “no information”, then
6: Compute Φ_z; store the result in Integral(I(z)); //I(z) is the rank of z in the list Z.
7: end if
8: Compute Φ′_z(0);
9: Control the stepsize and either insert points in Z with status “no information”, or set Status(z) = “ready”.
10: end if
11: end
12: end while
13: Integral(1 : N) = Integral(2 : N + 1)/Integral(1 : N); //N: number of vertices in Z
14: neg = round( (∑_{k=1}^{N} arg Integral(k)) / (2π) );
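For illustration, a dense-matrix Python sketch of the counting procedure (our own simplification of EIGENCNT: Condition (C) with a safety factor replaces the full stepsize control, the resolvent is formed explicitly, and Γ is assumed to be traversed counterclockwise and to stay away from the eigenvalues):

import numpy as np

def count_eigenvalues(A, vertices, safety=0.5):
    """Count the eigenvalues of A enclosed by the closed polygon through `vertices`
    by accumulating arg(Phi_z(h)), Phi_z(h) = det(I + h R(z)), with steps limited by
    |h| < safety / |trace(R(z))| (Condition (C))."""
    n = A.shape[0]
    I = np.eye(n)
    total_arg = 0.0
    verts = list(vertices) + [vertices[0]]          # close the polygon
    for z_start, z_end in zip(verts[:-1], verts[1:]):
        z = z_start
        while z != z_end:
            R = np.linalg.inv(z * I - A)            # resolvent R(z); dense sketch only
            hmax = safety / abs(np.trace(R))
            h = z_end - z
            if abs(h) > hmax:                       # stepsize control
                h *= hmax / abs(h)
                z_next = z + h
            else:
                z_next = z_end                      # last step on this edge
            total_arg += np.angle(np.linalg.det(I + h * R))
            z = z_next
    return int(round(total_arg / (2.0 * np.pi)))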

We illustrate the consequence of the stepsize control on a random matrix of order


5. The polygonal line Γ is determined by 10 points regularly spaced on a circle with
center at 0 and radius 1.3. Figure 12.1 depicts the eigenvalues of A, the line Γ , and
the points that are automatically inserted by the procedure. The figure illustrates that,
when the line gets closer to an eigenvalue, the segment length becomes smaller.

References

1. Varga, R.: Matrix Iterative Analysis. Springer Series in Computational Mathematics, 2nd edn.
Springer, Berlin (2000)
2. Hochbruck, M., Ostermann, A.: Exponential integrators. Acta Numer. 19, 209–286 (2010).
doi:10.1017/S0962492910000048
3. Sidje, R.: EXPOKIT: software package for computing matrix exponentials. ACM Trans. Math.
Softw. 24(1), 130–156 (1998)
4. Berland, H., Skaflestad, B., Wright, W.M.: EXPINT—A MATLAB package for exponential
integrators. ACM Trans. Math. Softw. 33(1) (2007). doi:10.1145/1206040.1206044. http://
doi.acm.org/10.1145/1206040.1206044
5. Schmelzer, T., Trefethen, L.N.: Evaluating matrix functions for exponential integrators via
Carathéodory-Fejér approximation and contour integrals. ETNA 29, 1–18 (2007)

6. Skaflestad, B., Wright, W.: The scaling and modified squaring method for matrix functions
related to the exponential. Appl. Numer. Math. 59, 783–799 (2009)
7. Festinger, L.: The analysis of sociograms using matrix algebra. Hum. Relat. 2, 153–158 (1949)
8. Katz, L.: A new status index derived from sociometric index. Psychometrika 18(1), 39–43
(1953)
9. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Pro-
ceedings of 7th International Conference World Wide Web, pp. 107–117. Elsevier Science
Publishers B.V., Brisbane (1998)
10. Kleinberg, J.: Authoritative sources in a hyperlinked environment. J. ACM 46, 604–632 (1999)
11. Langville, A., Meyer, C.: Google’s PageRank and Beyond: The Science of Search Engine
Rankings. Princeton University Press, Princeton (2006)
12. Bonchi, F., Esfandiar, P., Gleich, D., Greif, C., Lakshmanan, L.: Fast matrix computations for
pairwise and columnwise commute times and Katz scores. Internet Math. 8(2011), 73–112
(2011). http://projecteuclid.org/euclid.im/1338512314
13. Estrada, E., Higham, D.: Network properties revealed through matrix functions. SIAM Rev.
52, 696–714 (2010)
14. Fenu, C., Martin, D., Reichel, L., Rodriguez, G.: Network analysis via partial spectral factor-
ization and Gauss quadrature. SIAM J. Sci. Comput. 35, A2046–A2068 (2013)
15. Estrada, E., Rodríguez-Velázquez, J.: Subgraph centrality and clustering in complex hyper-
networks. Phys. A: Stat. Mech. Appl. 364, 581–594 (2006). http://www.sciencedirect.com/
science/article/B6TVG-4HYD5P6-3/1/69cba5c107b2310f15a391fe982df305
16. Benzi, D., Ernesto, E., Klymko, C.: Ranking hubs and authorities using matrix functions.
Linear Algebra Appl. 438, 2447–2474 (2013)
17. Estrada, E., Hatano, N.: Communicability Graph and Community Structures in Complex
Networks. CoRR arXiv:abs/0905.4103 (2009)
18. Baeza-Yates, R., Boldi, P., Castillo, C.: Generic damping functions for propagating importance
in link-based ranking. J. Internet Math. 3(4), 445–478 (2006)
19. Kollias, G., Gallopoulos, E., Grama, A.: Surfing the network for ranking by multidamping.
IEEE TKDE (2013). http://www.computer.org/csdl/trans/tk/preprint/06412669-abs.html
20. Kollias, G., Gallopoulos, E.: Multidamping simulation framework for link-based ranking. In:
A. Frommer, M. Mahoney, D. Szyld (eds.) Web Information Retrieval and Linear Algebra
Algorithms, no. 07071 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und
Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany (2007). http://drops.
dagstuhl.de/opus/volltexte/2007/1060
21. Bai, Z., Fahey, G., Golub, G.: Some large-scale matrix computation problems. J. Com-
put. Appl. Math. 74(1–2), 71–89 (1996). doi:10.1016/0377-0427(96)00018-0. http://www.
sciencedirect.com/science/article/pii/0377042796000180
22. Bekas, C., Curioni, A., Fedulova, I.: Low-cost data uncertainty quantification. Concurr. Com-
put.: Pract. Exp. (2011). doi:10.1002/cpe.1770. http://dx.doi.org/10.1002/cpe.1770
23. Stathopoulos, A., Laeuchli, J., Orginos, K.: Hierarchical probing for estimating the trace of
the matrix inverse on toroidal lattices. SIAM J. Sci. Comput. 35, S299–S322 (2013). http://
epubs.siam.org/doi/abs/10.1137/S089547980036869X
24. Higham, N.: Functions of Matrices: Theory and Computation. SIAM, Philadelphia (2008)
25. Golub, G., Meurant, G.: Matrices. Moments and Quadrature with Applications. Princeton
University Press, Princeton (2010)
26. Trefethen, L.: Approximation Theory and Approximation Practice. SIAM, Philadelphia
(2013)
27. Higham, N., Deadman, E.: A catalogue of software for matrix functions. version 1.0. Technical
Report 2014.8, Manchester Institute for Mathematical Sciences, School of Mathematics, The
University of Manchester (2014)
28. Estrin, G.: Organization of computer systems: the fixed plus variable structure computer. In:
Proceedings Western Joint IRE-AIEE-ACM Computer Conference, pp. 33–40. ACM, New
York (1960)

29. Lakshmivarahan, S., Dhall, S.K.: Analysis and Design of Parallel Algorithms: Arithmetic and
Matrix Problems. McGraw-Hill Publishing, New York (1990)
30. Maruyama, K.: On the parallel evaluation of polynomials. IEEE Trans. Comput. C-22(1)
(1973)
31. Munro, I., Paterson, M.: Optimal algorithms for parallel polynomial evaluation. J. Comput.
Syst. Sci. 7, 189–198 (1973)
32. Pan, V.: Complexity of computations with matrices and polynomials. SIAM Rev. 34(2), 255–
262 (1992)
33. Reynolds, G.: Investigation of different methods of fast polynomial evaluation. Master’s thesis,
EPCC, The University of Edinburgh (2010)
34. Eberly, W.: Very fast parallel polynomial arithmetic. SIAM J. Comput. 18(5), 955–976 (1989)
35. Paterson, M., Stockmeyer, L.: On the number of nonscalar multiplications necessary to eval-
uate polynomials. SIAM J. Comput. 2, 60–66 (1973)
36. Alonso, P., Boratto, M., Peinado, J., Ibáñez, J., Sastre, J.: On the evaluation of matrix poly-
nomials using several GPGPUs. Technical Report, Department of Information Systems and
Computation, Universitat Politécnica de Valéncia (2014)
37. Bernstein, D.: Fast multiplication and its applications. In: Buhler, J., Stevenhagen, P. (eds.)
Algorithmic number theory: lattices, number fields, curves and cryptography, Mathematical
Sciences Research Institute Publications (Book 44), pp. 325–384. Cambridge University Press
(2008)
38. Trefethen, L., Weideman, J., Schmelzer, T.: Talbot quadratures and rational approximations.
BIT Numer. Math. 46, 653–670 (2006)
39. Swarztrauber, P.N.: A direct method for the discrete solution of separable elliptic equations.
SIAM J. Numer. Anal. 11(6), 1136–1150 (1974)
40. Henrici, P.: Applied and Computational Complex Analysis. Wiley, New York (1974)
41. Enright, W.H.: Improving the efficiency of matrix operations in the numerical solution of stiff
differential equations. ACM TOMS 4(2), 127–136 (1978)
42. Choi, C.H., Laub, A.J.: Improving the efficiency of matrix operations in the numerical solution
of large implicit systems of linear differential equations. Int. J. Control 46(3), 991–1008 (1987)
43. Lu, Y.: Computing a matrix function for exponential integrators. J. Comput. Appl. Math.
161(1), 203–216 (2003)
44. Ltaief, H., Kurzak, J., Dongarra., J.: Parallel block Hessenberg reduction using algorithms-
by-tiles for multicore architectures revisited. Technical Report. Innovative Computing Labo-
ratory, University of Tennessee (2008)
45. Quintana-Ortí, G., van de Geijn, R.: Improving the performance of reduction to Hessenberg
form. ACM Trans. Math. Softw. 32, 180–194 (2006)
46. Kung, H.: New algorithms and lower bounds for the parallel evaluation of certain rational
expressions and recurrences. J. Assoc. Comput. Mach. 23(2), 252–261 (1976)
47. Calvetti, D., Gallopoulos, E., Reichel, L.: Incomplete partial fractions for parallel evaluation
of rational matrix functions. J. Comput. Appl. Math. 59, 349–380 (1995)
48. Henrici, P.: An algorithm for the incomplete partial fraction decomposition of a rational
function into partial fractions. Z. Angew. Math. Phys. 22, 751–755 (1971)
49. Graham, R., Knuth, D., Patashnik, O.: Concrete Mathematics. Addison-Wesley, Reading
(1989)
50. Reichel, L.: The ordering of tridiagonal matrices in the cyclic reduction method for Poisson’s
equation. Numer. Math. 56(2/3), 215–228 (1989)
51. Butcher, J., Chartier, P.: Parallel general linear methods for stiff ordinary dif-
ferential and differential algebraic equations. Appl. Numer. Math. 17(3), 213–222
(1995). doi:10.1016/0168-9274(95)00029-T. http://www.sciencedirect.com/science/article/
pii/016892749500029T. Special Issue on Numerical Methods for Ordinary Differential Equa-
tions
52. Chartier, P., Philippe, B.: A parallel shooting technique for solving dissipative ODE’s. Com-
puting 51(3–4), 209–236 (1993). doi:10.1007/BF02238534

53. Chartier, P.: L-stable parallel one-block methods for ordinary differential equations. SIAM J.
Numer. Anal. 31(2), 552–571 (1994). doi:10.1137/0731030
54. Gander, M., Vandewalle, S.: Analysis of the parareal time-parallel time-integration method.
SIAM J. Sci. Comput. 29(2), 556–578 (2007)
55. Maday, Y., Turinici, G.: A parareal in time procedure for the control of partial differential
equation. C. R. Math. Acad. Sci. Paris 335(4), 387–392 (2002)
56. Baldwin, C., Freund, R., Gallopoulos, E.: A parallel iterative method for exponential propa-
gation. In: D. Bailey, R. Schreiber, J. Gilbert, M. Mascagni, H. Simon, V. Torczon, L. Watson
(eds.) Proceedings of Seventh SIAM Conference on Parallel Processing for Scientific Com-
puting, pp. 534–539. SIAM, Philadelphia (1995). Also CSRD Report No. 1380
57. Gallopoulos, E., Saad, Y.: Efficient solution of parabolic equations by Krylov approximation
methods. SIAM J. Sci. Stat. Comput. 1236–1264 (1992)
58. Gallopoulos, E., Saad, Y.: On the parallel solution of parabolic equations. In: Proceedings of
the 1989 International Conference on Supercomputing, pp. 17–28. Herakleion, Greece (1989)
59. Gallopoulos, E., Saad, Y.: Efficient parallel solution of parabolic equations: implicit methods
on the Cedar multicluster. In: J. Dongarra, P. Messina, D.C. Sorensen, R.G. Voigt (eds.)
Proceedings of Fourth SIAM Conference Parallel Processing for Scientific Computing, pp.
251–256. SIAM, (1990) Chicago, December 1989
60. Sidje, R.: Algorithmes parallèles pour le calcul des exponentielles de matrices de grandes
tailles. Ph.D. thesis, Université de Rennes I (1994)
61. Lopez, L., Simoncini, V.: Analysis of projection methods for rational function approximation
to the matrix exponential. SIAM J. Numer. Anal. 44(2), 613–635 (2006)
62. Popolizio, M., Simoncini, V.: Acceleration techniques for approximating the matrix exponen-
tial operator. SIAM J. Matrix Anal. Appl. 30, 657–683 (2008)
63. Frommer, A., Simoncini, V.: Matrix functions. In: Schilders, W., van der Vorst, H.A., Rommes,
J. (eds.) Model Order Reduction: Theory, Research Aspects and Applications, pp. 275–303.
Springer, Berlin (2008)
64. van den Eshof, J., Hochbruck, M.: Preconditioning Lanczos approximations to the matrix
exponential. SIAM J. Sci. Comput. 27, 1438–1457 (2006)
65. Gu, C., Zheng, L.: Computation of matrix functions with deflated restarting. J. Comput. Appl.
Math. 237(1), 223–233 (2013). doi:10.1016/j.cam.2012.07.020. http://www.sciencedirect.
com/science/article/pii/S037704271200310X
66. Varga, R.S.: On higher order stable implicit methods for solving parabolic partial differential
equations. J. Math. Phys. 40, 220–231 (1961)
67. Baker Jr, G., Graves-Morris, P.: Padé Approximants. Part I: Basic Theory. Addison Wesley,
Reading (1991)
68. Carpenter, A.J., Ruttan, A., Varga, R.S.: Extended numerical computations on the 1/9 conjec-
ture in rational approximation theory. In: Graves-Morris, P.R., Saff, E.B., Varga, R.S. (eds.)
Rational Approximation and Interpolation. Lecture Notes in Mathematics, vol. 1105, pp.
383–411. Springer, Berlin (1984)
69. Cody, W.J., Meinardus, G., Varga, R.S.: Chebyshev rational approximations to e−x in [0, +∞)
and applications to heat-conduction problems. J. Approx. Theory 2(1), 50–65 (1969)
70. Cavendish, J.C., Culham, W.E., Varga, R.S.: A comparison of Crank-Nicolson and Chebyshev
rational methods for numerically solving linear parabolic equations. J. Comput. Phys. 10,
354–368 (1972)
71. Freund, R.W.: Conjugate gradient-type methods for linear systems with complex symmetric
coefficient matrices. SIAM J. Sci. Stat. Comput. 13(1), 425–448 (1992)
72. Axelsson, O., Kucherov, A.: Real valued iterative methods for solving complex symmetric
linear systems. Numer. Linear Algebra Appl. 7(4), 197–218 (2000)
73. Howle, V., Vavasis, S.: An iterative method for solving complex-symmetric systems arising
in electrical power modeling. SIAM J. Matrix Anal. Appl. 26, 1150–1178 (2005)

74. Datta, B.N., Saad, Y.: Arnoldi methods for large Sylvester-like observer matrix equations,
and an associated algorithm for partial spectrum assignment. Linear Algebra Appl. 154–156,
225–244 (1991)
75. Gear, C.W., Saad, Y.: Iterative solution of linear equations in ODE codes. SIAM J. Sci. Stat.
Comput. 4, 583–601 (1983)
76. Parlett, B.N.: The Symmetric Eigenvalue Problem. Prentice Hall, Englewood Cliffs (1980)
77. Simoncini, V.: Restarted full orthogonalization method for shifted linear systems. BIT Numer.
Math. 43(2), 459–466 (2003)
78. Freund, R.: Solution of shifted linear systems by quasi-minimal residual iterations. In: Reichel,
L., Ruttan, A., Varga, R. (eds.) Numerical Linear Algebra, pp. 101–121. W. de Gruyter, Berlin
(1993)
79. Frommer, A.: BiCGStab(ℓ) for families of shifted linear systems. Computing 70, 87–109
(2003)
80. Druskin, V., Knizhnerman, L.: Two polynomial methods of calculating matrix functions of
symmetric matrices. U.S.S.R. Comput. Math. Math. Phys. 29, 112–121 (1989)
81. Friesner, R.A., Tuckerman, L.S., Dornblaser, B.C., Russo, T.V.: A method for exponential
propagation of large systems of stiff nonlinear differential equations. J. Sci. Comput. 4(4),
327–354 (1989)
82. Saad, Y.: Analysis of some Krylov subspace approximations to the matrix exponential oper-
ator. SIAM J. Numer. Anal. 29, 209–228 (1992)
83. Güttel, S.: Rational Krylov methods for operator functions. Ph.D. thesis, Technischen Uni-
versity Bergakademie Freiberg (2010)
84. Simoncini, V., Gallopoulos, E.: A hybrid block GMRES method for nonsymmetric systems
with multiple right-hand sides. J. Comput. Appl. Math. 66, 457–469 (1996)
85. Darnell, D., Morgan, R.B., Wilcox, W.: Deflated GMRES for systems with multiple shifts
and multiple right-hand sides. Linear Algebra Appl. 429, 2415–2434 (2008)
86. Soodhalter, K., Szyld, D., Xue, F.: Krylov subspace recycling for sequences of shifted linear
systems. Appl. Numer. Math. 81, 105–118 (2014)
87. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM, Philadelphia
(2002)
88. MUMPS: A parallel sparse direct solver. http://graal.ens-lyon.fr/MUMPS/
89. SuperLU (Supernodal LU). http://crd-legacy.lbl.gov/~xiaoye/SuperLU/
90. Kamgnia, E., Nguenang, L.B.: Some efficient methods for computing the determinant of large
sparse matrices. ARIMA J. 17, 73–92 (2014). http://www.inria.fr/arima/
91. Polizzi, E., Sameh, A.: A parallel hybrid banded system solver: the SPIKE algorithm. Parallel
Comput. 32, 177–194 (2006)
92. Bertrand, O., Philippe, B.: Counting the eigenvalues surrounded by a closed curve. Sib. J. Ind.
Math. 4, 73–94 (2001)
93. Kamgnia, E., Philippe, B.: Counting eigenvalues in domains of the complex field. Electron.
Trans. Numer. Anal. 40, 1–16 (2013)
94. Rudin, W.: Real and Complex Analysis. McGraw Hill, New York (1970)
95. Silverman, R.A.: Introductory Complex Analysis. Dover Publications, Inc., New York (1972)
96. Bindel, D.: Bounds and error estimates for nonlinear eigenvalue problems. Berkeley Applied
Math Seminar (2008). http://www.cims.nyu.edu/~dbindel/present/berkeley-oct08.pdf
97. Maeda, Y., Futamura, Y., Sakurai, T.: Stochastic estimation method of eigenvalue density for
nonlinear eigenvalue problem on the complex plane. J. SIAM Lett. 3, 61–64 (2011)
98. Bai, Z., Golub, G.H.: Bounds for the trace of the inverse and the determinant of symmetric
positive definite matrices. Ann. Numer. Math. 4, 29–38 (1997)
99. Duff, I., Erisman, A., Reid, J.: Direct Methods for Sparse Matrices. Oxford University Press
Inc., New York (1989)
100. Golub, G.H., Meurant, G.: Matrices, Moments and Quadrature with Applications. Princeton
University Press, Princeton (2009)
Chapter 13
Computing the Matrix Pseudospectrum

The ε-pseudospectrum, Λε (A), of a square matrix (pseudospectrum for short) is the


locus of eigenvalues of A + E for all possible E such that ‖E‖ ≤ ε for some matrix
norm and given ε > 0; that is

Λ_ε(A) = {z ∈ C : z ∈ Λ(A + E), ‖E‖ ≤ ε},        (13.1)

where Λ(A) denotes the set of eigenvalues of A. Unless mentioned otherwise we


assume that the 2-norm is used. According to relation (13.1), the pseudospec-
trum is characterized by means of the eigenvalues of bounded perturbations of A.
Pseudospectra have many interesting properties and provide information regarding
the matrix that is sometimes more useful than the eigenvalues; the seminal mono-
graph on the subject is [1].
The pseudospectral regions of a matrix are nested as ε varies, i.e. if ε1 > ε2
then necessarily Λε1 (A) ⊃ Λε2 (A). Trivially then, Λ0 (A) = Λ(A). Therefore, to
describe the pseudospectrum for some ε it is sufficient to plot the boundary curve(s)
∂Λε (A) (there are more than one boundaries when Λε (A) is not simply connected).
To make the pseudospectrum a practical gauge we need fast methods for comput-
ing it. As we will see shortly, however, the computation of pseudospectra for large
matrices is expensive, more so than other matrix characteristics (eigenvalues, sin-
gular values, condition number). On the other hand, as the success of the EigTool
package indicates (cf. Sect. 13.4), there is interest in using pseudospectra. Parallel
processing becomes essential in this task. In this chapter we consider parallelism in
algorithms for computing pseudospectra. We will see that there exist several oppor-
tunities for parallelization that go beyond the use of parallel kernels for BLAS type
matrix operations. On the other hand, one must be careful not to create algorithms
with abundantly redundant parallelism.
There exist two more characterizations of the pseudospectrum, that turn out to be
more useful in practice than perturbation based relation (13.1). The second charac-
terization is based on the smallest singular value of the shifted matrix A − z I :

Λε (A) = {z ∈ C : σmin (A − z I ) ≤ ε}, (13.2)



Fig. 13.1 Illustrations of pseudospectra for matrix grcar of order n = 50. The left frame was
computed using function ps from the Matrix Computation Toolbox that is based on relation 13.1 and
shows the eigenvalues of matrices A + E j for random perturbations E j ∈ C50×50 , j = 1, . . . , 10
where E j  ≤ 10−3 . The frame on the right was computed using the EigTool package and is
based on relation 13.2; it shows the level curves defined by {z : s(z) ≤ ε} for ε = 10−1 down to
10−10

For brevity, we also write s(z) to denote σmin (A−z I ). Based on this characterization,
for any ε > 0, we can classify any point z to be interior to Λε (A) if s(z) < ε or to
be exterior when s(z) > ε. By convention, a point of ∂Λε (A) is assimilated to an
interior point.
The third characterization is based on the resolvent of A:

Λ_ε(A) = {z ∈ C : ‖(A − zI)^{−1}‖ ≥ ε^{−1}}.        (13.3)

We denote the resolvent matrix by R(z) = (A − z I )−1 . A basic fact is the


following:
Proposition 13.1 The sets defined by (13.1)–(13.3) are identical.
Figure 13.1 illustrates the pseudospectrum of matrix grcar of order 50 computed
using algorithms based on the first two definitions. Matrix grcar is pentadiagonal
Toeplitz corresponding to the symbol t(z) = z^{−3} + z^{−2} + z^{−1} + 1 − z and is frequently
used to benchmark pseudospectrum codes.

13.1 Grid Based Methods

13.1.1 Limitations of the Basic Approach

Characterization (13.1) appears to necessitate the computation of all eigenvalues of


matrices of the form A + E where E is a bounded perturbation. This is an expensive
task that becomes impractical for large matrices. Not only that, but it is not clear how

to select the perturbations so as to achieve a satisfactory approximation. In particular,


there is no guarantee that any of the randomly selected perturbations will cause the
extreme dislocations that are possible in the eigenvalues of A if any ε-bounded
perturbation is allowed.
A straightforward algorithm for computing pseudospectra is based on character-
ization (13.2). We list it as Algorithm 13.1, and call it GRID.

Algorithm 13.1 GRID: Computing Λε (A) based on Def. (13.2).


Input: A ∈ Rn×n , l ≥ 1 positive values in decreasing order E = {ε1 , ε2 , ..., εl }
Output: plot of ∂Λε (A) for all ε ∈ E
1: construct Ω ⊇ Λε (A) for the largest ε of interest and a grid Ωh discretizing Ω with gridpoints
zk .
2: doall z k ∈ Ωh
3: compute s(z k ) = σmin (z k I − A)
4: end
5: Plot the l contours of s(z k ) for all values of E .
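A serial sketch of GRID with dense linear algebra follows (the names psa_grid, re, im are ours; in a parallel implementation the doubly nested loop below is the doall loop, with gridpoints distributed across processors):

import numpy as np
from scipy.linalg import svdvals
import matplotlib.pyplot as plt

def psa_grid(A, re, im, eps_levels):
    """GRID sketch: evaluate s(z) = sigma_min(z I - A) on a rectangular mesh
    and draw the level curves for the requested values of eps."""
    n = A.shape[0]
    S = np.empty((len(im), len(re)))
    for j, y in enumerate(im):          # independent gridpoints: the doall loop
        for i, x in enumerate(re):
            S[j, i] = svdvals((x + 1j * y) * np.eye(n) - A)[-1]   # smallest singular value
    plt.contour(re, im, S, levels=sorted(eps_levels))
    plt.show()
    return S

# Example usage (hypothetical sizes):
# A = np.random.randn(50, 50)
# psa_grid(A, np.linspace(-3, 3, 80), np.linspace(-3, 3, 80), [1e-1, 1e-2, 1e-3])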

GRID is simple to implement and offers parallelism at several levels: large grain,
since the singular value computations at each point are independent, as well as
medium and fine grain grain parallelism when computing each s(z k ). The sequential
cost is typically modeled by

T1 = |Ωh | × Cσmin (13.4)

where |Ω_h| denotes the number of nodes of Ω_h and C_{σ_min} is the average cost for
computing s(z). The total sequential cost becomes rapidly prohibitive as the size of
A and/or the number of nodes increase. Given that the cost of computing s(z) is at
least O(n 2 ) and even O(n 3 ) for dense matrix methods, and that a typical mesh could
easily contain O(104 ) points, the cost can be very high even for matrices of moderate
size. For large enough |Ωh | relative to the number of processors p, GRID can be
implemented with almost perfect speedup by simple static assignment of |Ωh |/ p
computations of s(z) per processor.
Load balancing also becomes an issue when the cost of computing s(z) varies
a lot across the domain. This could happen when iterative methods, such as those
described in Chap. 11, Sects. 11.6.4 and 11.6.5, are used to compute s(z).
Another obstacle is that the smallest singular value of a given matrix is usually
the hardest to compute (much more so than computing the largest); moreover, this
computation has to be repeated for as many points as necessary, depending on the
resolution of Ω. Not surprisingly, therefore, as the size of the matrix and/or the
number of mesh points increase the cost of the straightforward algorithm above also
becomes prohibitive. Advances over GRID for computing pseudospectra are based
on some type of dimensionality reduction on the domain or on the matrix, that lower

the cost of the factors in the cost model (13.4). In order to handle large matrices, it
is necessary to construct algorithms that combine or blend the two approaches.

13.1.2 Dense Matrix Reduction

Despite the potential for near perfect speedup on a parallel system, Algorithm 13.1
(GRID) entails much redundant computation. Specifically, σmin is computed over
different matrices, but all of them are shifts of one and the same matrix. In a sequen-
tial algorithm, it is preferable to first apply a unitary similarity transformation, say
Q ∗ AQ = T , that reduces the matrix to complex upper triangular (via Schur factor-
ization) or upper Hessenberg form. In both cases, the singular values remain intact,
thus σmin (A − z I ) = σmin (T − z I ). The gains from this preprocessing are substan-
tial in the sequential case: the total cost lowers from |Ωh |O(n 3 ) to that of the initial
reduction, which is O(n 3 ), plus |Ωh |O(n 2 ). The cost of the reduction is amortized
as the number of gridpoints increases.
We illustrate these ideas in Algorithm 13.2 (GRID_fact ) which reduces A to
Hessenberg or triangular form and then approximates each s(z k ) from the (square
root of the) smallest eigenvalue of (z k I − T )∗ (z k I − T ). This is done using inverse
Lanczos iteration, exploiting the fact that the cost at each iteration is quadratic since
the systems to be solved at each step are Hessenberg or triangular.
For a parallel implementation, we need to decide upon the following:
1. How to evaluate the preprocessing steps (lines 2–5 of Algorithm 13.2). On a par-
allel system, the reduction step can either be replicated on one or more processors
and the reduced matrix distributed to all processors participating in computing
the values σmin (T − z j I ). The overall cost will be that for the preprocessing plus
(|Ω_h|/p) O(n²) when p ≤ |Ω_h|. The cost of the reduction is amortized as the number
of gridpoints allocated per processor increases.


2. How to partition the work between processors so as to achieve load balance: A
system-level approach is to use queues. Specifically, one or more points in Ωh
can be assigned to a task that is added to a central queue. The first task in the
queue is dispatched to the next available processor, until no more tasks remain.
Every processor computes its corresponding value(s) of s(z) and once finished, is
assigned another task until none remains. A distributed queue can also be used for
processors with multithreading capabilities, in which case the groups of points
assigned to each task are added to a local queue and a thread is assigned to each
point in turn. We also note that static load balancing strategies can also be used
if the workload differential between gridpoints can be estimated beforehand.

Algorithm 13.2 GRID_fact: computes Λε (A) using Def. 13.2 and factorization.
Input: A ∈ Rn×n , l ≥ 1 positive values in decreasing order E = {ε1 , ε2 , ..., εl }, logical variable
schur set to 1 to apply Schur factorization.
Output: plot of ∂Λε (A) for all ε ∈ E
1: Ωh as in line 1 of Algorithm 13.1.
2: T = hess(A) //reduce A to Hessenberg form
3: if schur == 1 then
4: T = schur(T, complex ) //compute complex Schur factor T
5: end if
6: doall z_k ∈ Ω_h
7: compute s(z_k) = √( λ_min( (z_k I − T)^∗ (z_k I − T) ) )  //using inverse Lanczos iteration and
exploiting the structure of T
8: end
9: Plot the l contours of s(z k ) for all values of E .
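The computation in line 7 can be sketched as follows, assuming T is the complex Schur (upper triangular) factor; for brevity we use plain inverse power iteration on (T − z_k I)^∗(T − z_k I), driven by two O(n²) triangular solves per step, instead of the inverse Lanczos iteration (the function name smin_triangular is ours):

import numpy as np
from scipy.linalg import solve_triangular

def smin_triangular(T, z, iters=30, seed=0):
    """Estimate s(z) = sigma_min(T - z I) for upper triangular T by power iteration
    on ((T - z I)^* (T - z I))^{-1}; each step costs two triangular solves."""
    n = T.shape[0]
    Tz = T - z * np.eye(n)
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    v /= np.linalg.norm(v)
    sigma = np.inf
    for _ in range(iters):
        y = solve_triangular(Tz, v, trans='C')      # (T - z I)^* y = v
        w = solve_triangular(Tz, y)                 # (T - z I) w = y, so w = B^{-1} v
        nw = np.linalg.norm(w)
        sigma = 1.0 / np.sqrt(nw)                   # ||w|| converges to 1/sigma_min^2
        v = w / nw
    return sigma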

13.1.3 Thinning the Grid: The Modified GRID Method

We next consider an approach whose goal is to reduce the number of gridpoints where
s(z) needs to be evaluated. It turns out that it is possible to rapidly “thin the grid” by
stripping disk shaped regions from Ω. Recall that for any given ε and gridpoint z,
algorithm GRID uses the value of s(z) to classify z as lying outside or inside Λε (A).
In that sense, GRID makes pointwise use of the information it computes at each z. A
much more effective idea is based on the fact that at any point where the minimum
singular value s(z) is evaluated, given any value of ε, either the point lies in the
pseudospectrum or its boundary or a disk is constructed whose points are guaranteed
to be exterior to Λε (A). The following result is key.
Proposition 13.2 ([2]) If s(z) = r > ε then

D ◦ (z, r − ε) ∩ Λε (A) = ∅,

where D ◦ (z, r − ε) is the open disk centered at z with radius r − ε.


This provides a mechanism for computing pseudospectra based on “inclusion-
exclusion”. At any point s(z) is computed (as in GRID_Fact this is better done
after initially reducing A to Hessenberg or Schur form), it is readily confirmed that
the point is included in the pseudospectrum (as in GRID), otherwise it defines an
“exclusion disk” centered at that point. Algorithm 13.3 (MOG) deploys this strategy
and approximates the pseudospectrum by repeatedly pruning the initial domain that
is assumed to completely enclose the pseudospectrum until no more exclusions can
be applied. This is as robust as GRID but can be much faster. It can also be shown
that as we move away from the sought pseudospectrum boundary, the exclusion
disks become larger which makes the method rather insensitive to the choice of Ω
compared to GRID. We mention in passing that a suitable enclosing region for the
pseudospectrum that can be used as the initial Ω for both GRID and MOG is based
on the field-of-values [3].

Algorithm 13.3 MoG: Computing Λε (A) based on inclusion-exclusion [2]


Input: A ∈ Rn×n , l ≥ 1 and ε
Output: plot of Λε (A)
1: construct Ω ⊇ Λε (A) and a grid Ωh discretizing Ω with gridpoints z k
2: reduce A to its Hessenberg or Schur form, if feasible
3: doall z k ∈ Ωh
4: compute exclusion disk Δ.
5: Set Ωh = Ωh \ Δ //Remove from Ωh all gridpoints in Δ.
6: end
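A serial sketch of the inclusion-exclusion mechanism follows (our own simplification: active points are processed in arbitrary order, s(z) is computed with a dense SVD, and the preliminary reduction is omitted):

import numpy as np
from scipy.linalg import svdvals

def mog_sketch(A, re, im, eps):
    """Inclusion-exclusion sketch: returns the gridpoints classified as belonging to
    Lambda_eps(A); points falling inside an exclusion disk are never evaluated."""
    n = A.shape[0]
    X, Y = np.meshgrid(re, im)
    Z = (X + 1j * Y).ravel()
    active = np.ones(Z.size, dtype=bool)
    inside = np.zeros(Z.size, dtype=bool)
    while active.any():
        k = np.flatnonzero(active)[0]          # pick any active gridpoint
        s = svdvals(Z[k] * np.eye(n) - A)[-1]
        active[k] = False
        if s <= eps:
            inside[k] = True                   # z_k lies in the pseudospectrum
        else:
            # Proposition 13.2: the open disk of radius s - eps contains no point of Lambda_eps(A)
            active &= np.abs(Z - Z[k]) >= (s - eps)
    return Z[inside]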

Figure 13.2 (from [4]) illustrates the application of MOG for matrix triangle
(a Toeplitz matrix with symbol z^{−1} + (1/4)z² from [5]) of order 32. MOG required only
676 computations of s(z) to approximate Λε (A) for ε = 10−1 compared to the 2500
needed for GRID.
Unlike GRID, the preferred sequence of exclusions in MOG cannot be determined
beforehand but depends on the order in which the gridpoints are swept since Ωh is
modified in the course of the computation. This renders a parallel implementation
more challenging. Consider for instance a static allocation policy of gripoints, e.g.
block partitioning. Then a single computation within one processor might exclude
many points allocated to this and possibly other processors. In other words, if s(z) is
computed, then it is likely that values s(z + Δz) at nearby points will not be needed.
This is an action reverse from the spatial locality of reference principle in computer
science and can cause load imbalance.
To handle this problem, it was proposed in [4] to create a structure accessible to
all processors that contains information about the status of each gridpoint. Points are
classified in three categories: Active, if s(z) still needs to be computed, inactive, if
they have been excluded, and fixed, if s(z) ≤ ε. This structure must be updated as
the computation proceeds. The pseudospectrum is available when there are no more
active points.

Fig. 13.2 Using MOG to compute the pseudospectrum of triangle(32) for ε = 1e−1 (from [4])

Based on this organization, we construct a parallel MOG algorithm as follows.


We first assume that a discretization Ωh of a rectangular region Ω enclosing the
pseudospectrum is available. Initially, the set of active nodes N is defined to be
the set of all gridpoints. As the algorithm proceeds, this set is updated to contain
only those gridpoints that have not been excluded in previous steps. The algorithm
consists of three stages. The first stage aims to find the smallest bounding box B of
∂Λε (A), the perimeter of which does not contain any active points. The second stage
performs a collapse of the perimeter of B towards ∂Λε (A). At the end of this stage
no further exclusions are performed. The set of active nodes N contains only mesh
nodes z ∈ Λ_ε(A). In the third stage all remaining s(z) for z ∈ N are computed.
The motivation behind the first stage is to attempt large exclusions that have
small overlap. Since the exclusion disks are not known a priori, the heuristic is that
gridpoints lying on the outer convex hull enclosing the set of active points offer a
better chance for larger exclusions. The second stage is designed to separate those
points that should remain active from those that should be marked inactive but have
not been rendered so in stage 1 because of their special location; e.g. stage 1 does
not exclude points that lie deep inside some concave region of the pseudospectrum.
To see this, consider in Fig. 13.3 the case of matrix grcar (1000), where dots (‘·’)
label gridpoints that remain active after the end of stage 1 and circles (‘o’) indicate
gridpoints that remain active after stage 2 is completed.
A master node could take the responsibility of gathering the singular values and
distributing the points, on the bounding box B (stage 1) and on the hull H (stage 2).
The administrative task assigned to the master node (exclusions and construction of
the hull) is light; it is thus feasible to also use the master node to compute singular
triplets. In conclusion, stages 1 and 2 adopt a synchronized queue model: The master
waits for all processes, including itself, to finish computing s(z) for their currently
allocated point and only then proceeds to dispatch the next set. Stage 3 is similar to
the parallel implementation of GRID, the only difference being that the points where
we need to compute s(z) are known to satisfy s(z) ≤ ε.

Fig. 13.3 Method MOG: outcome from stages 1 (dots ‘·’) and 2 (circles ‘o’) for matrix grcar(1000) on a 50 × 50 grid (from [4])

If the objective is to compute the Λε (A) for ε1 > · · · > εs , we can apply MOG
to estimate the enclosing Λε1 (A) followed by GRID for all remaining points. An
alternative is to take advantage of the fact that, at any gridpoint z,

s(z) − ε_1 < s(z) − ε_2 < · · · < s(z) − ε_s.

Discounting negative and zero values, these define a sequence of concentric disks
that do not contain any point of the pseudospectra for the corresponding values of ε.
For example, the disk centered at z with radius s(z) − ε_j has no intersection with
Λ_{ε_j}(A). This requires careful bookkeeping but its parallel implementation would
greatly reduce the need to use GRID.

13.2 Dimensionality Reduction on the Domain: Methods Based on Path Following

One approach that has the potential for dramatic dimensionality reduction on the
domain is to trace the individual pseudospectral boundaries using some type of path
following. We first review the original path following idea for pseudospectra and
then examine three methods that enable parallelization. Interestingly, parallelization
also helps to improve the numerical robustness of path following.

13.2.1 Path Following by Tangents

To trace the curve, we can use predictor-corrector techniques that require differen-
tial information from the curve, specifically tangents and normals. The following
important result shows that such differential information at any point z is available
if, together with σmin , one computes the corresponding singular vectors (that is, the minimum singular triplet {σmin , u min , vmin }). It is useful here to note that Λε (A) can
be defined implicitly by the equation

g(x, y) = ε, where g(x, y) = σmin ((x + i y)I − A), (13.5)

(here we identify the complex plane C with R2 ). The key result is the following that
is a generalization of Theorem 11.14 to the complex plane [6]:
Theorem 13.1 Let z = x + i y ∈ C \ Λ(A). Then g(x, y) is real analytic in a
neighborhood of (x, y), if σmin ((x + i y)I − A) is a simple singular value. The
gradient of g(x, y) is equal to

∇g(x, y) = (ℜ(v∗min u min ), ℑ(v∗min u min )) = v∗min u min ,

where u min and vmin denote the left and right singular vectors corresponding to σmin .
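The theorem can be exercised directly with a dense SVD, as in the following sketch (illustrative only; for large matrices the triplet would be obtained iteratively): the minimum singular triplet yields both s(z) and the gradient of g at essentially no extra cost.

import numpy as np

def min_triplet(A, z):
    # minimum singular triplet of zI - A: sigma_min, u_min (left), v_min (right)
    U, S, Vh = np.linalg.svd(z * np.eye(A.shape[0]) - A)
    return S[-1], U[:, -1], Vh[-1, :].conj()

def grad_g(A, x, y):
    # Theorem 13.1: grad g(x, y) = (Re(v* u), Im(v* u)) at z = x + iy
    _, u, v = min_triplet(A, x + 1j * y)
    w = np.vdot(v, u)            # v_min^* u_min (np.vdot conjugates its first argument)
    return np.array([w.real, w.imag])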
From the value of the gradient at any point of the curve, one can make a small predic-
tion step along the tangent followed by a correction step to return to a subsequent point
on the curve. Initially, the method needs to locate one point of each non-intersecting
boundary ∂Λε (A). The path following approach was proposed in [7]. We denote it
by PF and list it as Algorithm 13.4, incorporating the initial reduction to Hessenberg
form.

Algorithm 13.4 PF: computing ∂Λε (A) by predictor-corrector path following [7].
Input: A ∈ Rn×n , value ε > 0.
Output: plot of the ∂Λε (A)
1: Transform A to upper Hessenberg form and set k = 0.
2: Find initial point z 0 ∈ ∂Λε (A)
3: repeat
4: Determine rk ∈ C, |rk | = 1, steplength τk and set z̃ k = z k−1 + τk rk .
5: Correct along dk ∈ C, |dk | = 1 by setting z k = z̃ k + θk dk where θk is some steplength.
6: until termination

Unlike typical prediction-correction, the prediction direction rk in PF is taken


orthogonal to the previous correction direction dk−1 ; this is in order to avoid an
additional (expensive) triplet evaluation. Each Newton iteration at a point z requires
the computation of the singular triplet (σmin , u min , vmin ) of z I − A. This is the dom-
inant cost per step and determines the total cost of the algorithm. Computational
experience indicates that a single Newton step is sufficient to obtain an adequate
correction. What makes PF so appealing in terms of cost and justifies our previ-
ous characterization of the dimensionality reduction as dramatic, is that by tracing
∂Λε (A), a computation over a predefined two dimensional grid Ωh is replaced with
a computation over points on the pseudospectrum boundary. On the other hand, a
difficulty of PF is that when the pseudospectrum boundary contains discontinuities
of direction or folds that bring sections of the curve close by, path following might
get disoriented unless the step size is very small. Moreover, unlike GRID, the only
opportunities for parallelism are in the computation of the singular triplets and in the
computation of multiple boundaries.
It turns out that it is actually possible to improve performance and address the
aforementioned numerical weaknesses by constructing a parallel path following algo-
rithm. We note in passing that this is another one of these cases that we like to under-
line in this book, where parallelism actually improves the numerical properties of
an algorithm. The following two observations are key. First, that when the boundary
is smooth, the prediction-correction can be applied to discover several new points
further along the curve. This entails the computation of several independent singular
triplets. Second, that by attempting to advance using several points at a time, there
is more information available for the curve that can be used to prevent the algorithm
from straying off the correct path. Algorithm 13.5 is constructed based on these ideas.

Fig. 13.4 Cobra: Position of pivot (z^{piv}_{k−1} ), initial prediction (z̃ k ), support (z^{sup}_k ), first order predictors (ζ j,0 ) and corrected points (z^j_k ). (A proper scale would show that h ≪ H)

It lends itself for parallel processing while being more robust than the original path
following. The name used is Cobra in order to evoke that snake’s spread neck.
Iteration k is illustrated in Fig. 13.4. Upon entering the repeat loop in line 4, it is assumed that there is a point, z^{piv}_{k−1}, available that is positioned close to the curve being traced. The loop consists of three steps. In the first prediction-correction step (lines 4–5), just like in Algorithm PF, a support point, z^{sup}_k, is computed first using a small stepsize h. In the second prediction-correction step (lines 7–8), z^{piv}_{k−1} and z^{sup}_k determine a prediction direction dk and m equidistant points, ζ i,0 ∈ dk , i = 1, . . . , m, are selected in the direction of dk . Then h = |z̃ k − z^{piv}_{k−1}| is the stepsize of the method; note that we can interpret H = |z^{piv}_{k−1} − ζ m,0 | as the length of the “neck” of cobra. Then each ζ i,0 ∈ dk , i = 1, . . . , m is corrected to obtain z^i_k ∈ ∂Λε (A), i = 1, . . . , m. This is implemented using only one step of Newton iteration on σmin (A − z I ) − ε = 0. All corrections are independent and can be performed in parallel. In the third step (line 10), the next pivot, z^{piv}_k, is selected using some suitable criterion.
The correction phase is implemented with Newton iteration along the direction of
steepest ascent dk = ∇g(x̃k , ỹk ), where x̃k , ỹk are the real and imaginary parts of z̃ k .
Theorem 13.1 provides the formula for the computation of the gradient of s(z) − ε.
The various parameters needed to implement the main steps of the procedure can be
found in [7].
In the correction phase of Cobra and other path following schemes, Newton’s method
is applied to solve the nonlinear equation g(x, y) = ε for some ε > 0 (cf. (13.5)).
Therefore, we need to compute ∇g(x, y). From Theorem 13.1, this costs only 1 DOT
operation.
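A minimal sketch of such a correction step, under the convention of Theorem 13.1 and with a dense SVD standing in for the triplet kernel, is the following; writing the gradient as the complex number v∗min u min , one Newton step toward the level set s(z) = ε along the steepest ascent direction reads:

import numpy as np

def newton_correct(A, z, eps):
    # one Newton step toward s(z) = eps along the steepest ascent direction
    U, S, Vh = np.linalg.svd(z * np.eye(A.shape[0]) - A)
    s, u, v = S[-1], U[:, -1], Vh[-1, :].conj()
    g = np.vdot(v, u)                        # gradient of g as a complex number (Theorem 13.1)
    return z - (s - eps) * g / abs(g) ** 2   # Newton step in the plane, written in complex form

One or two such steps are typically enough for the correction; the cost is dominated by the triplet computation, the update itself being a single DOT-like operation.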
Cobra offers large-grain parallelism in its second step. The number of gridpoints
on the cobra “neck” determines the amount of large-grain parallelism available that is entirely due to path following. If the number of points m on the “neck”
is equal to or a multiple of the number of processors, then the maximum speedup
due to path following of Cobra over PF is expected to be m/2. A large number
of processors would favor making m large and thus it is well suited for obtaining
∂Λε (A) at very fine resolutions. The “neck” length H , of course, cannot be arbitrarily
large and once there are enough points to adequately represent ∂Λε (A), any further

Algorithm 13.5 Cobra: algorithm to compute ∂Λε (A) using parallel path following
[8]. After initialization (line 1), it consists of three stages: i) prediction-correction
(lines 4-5), ii) prediction-correction (lines 7-8), and iii) pivot selection (line 10).
Input: A ∈ Rn×n , value ε > 0, number m of gridpoints for parallel path following.
Output: pseudospectrum boundary ∂Λε (A)
1: Transform A to upper Hessenberg form and set k = 0.
2: Find initial point z 0 ∈ ∂Λε (A)
3: repeat
4: Set k = k + 1 and predict z̃ k .
5: Correct using Newton and compute z^{sup}_k .
6: doall j = 1, ..., m
7: Predict ζ j,0 .
8: Correct using Newton and compute z^j_k .
9: end
10: Determine next pivot z^{piv}_k .
11: until termination

subdivisions of H would be useless. This limits the amount of large-grain parallelism


that can be obtained in the second phase of Cobra. It is then reasonable to exploit
the parallelism available in the calculation of each singular triplet, assuming that
the objective is to compute the pseudospectrum of matrices that are large. Another
source of parallelism would be to compute several starting points on the curve and
initiate Cobra on each of them in parallel.
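The doall of lines 6–9 maps naturally onto a pool of workers, as in the following sketch (illustrative; the dense SVD again stands in for the triplet kernel and the helper names are not from a published code):

import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _correct(args):
    A, z, eps = args
    U, S, Vh = np.linalg.svd(z * np.eye(A.shape[0]) - A)
    g = np.vdot(Vh[-1, :].conj(), U[:, -1])
    return z - (S[-1] - eps) * g / abs(g) ** 2

def correct_neck(A, zetas, eps, workers=4):
    # correct the m predicted points of the "neck" independently (lines 6-9)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_correct, [(A, z, eps) for z in zetas]))

Choosing m equal to, or a multiple of, the number of workers keeps the pool busy, consistent with the m/2 speedup estimate over PF mentioned above.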
We note some additional advantages of the three-step approach defined above.
The first step is used to obtain the chordal direction linking the pivot and support
points. For this we use a stepsize ĥ which can be very small in order to capture the
necessary detail of the curve and also be close to the tangential direction. Neverthe-
less, the chordal direction is often better for our purpose: For instance, if for points in
the neighborhood of the current pivot ∂Λε (A) is downwards convex and lies below
the tangent then the predicted points {ζ j,0 }mj=1 lie closer to ∂Λε (A) than if we had
chosen them along the tangent. We also note that unlike PF, the runtime of Cobra is
not adversely affected if the size of ĥ is small. Instead, it is the “neck” length H that
is the effective stepsize since it determines how fast we are marching along the curve.
Furthermore, the Newton correction steps that we need to apply on the m grid points
ζ j,0 are done in parallel. Finally, the selection procedure gives us the opportunity to
reject one or more points. In the rare cases that all parallel corrections fail, only the
support or pivot point can be returned, which would be equivalent to performing one
step of PF with (small) stepsize ĥ. These characteristics make Cobra more robust
and more effective than PF.

13.2.2 Path Following by Triangles

Another parallel method for circumventing the difficulties encountered by PF at seg-


ments where ∂Λε (A) has steep turns and angular shape is by employing overlapping
triangles rather than the “cobra” structure. We describe this idea (originally from [9])
and its incorporation into an algorithm called PAT.
Definition 13.1 Two complex points a and b are said to be ε-separated when one is
interior and the other exterior to Λε (A).
Given two distinct complex numbers z 0 and z 1 , we consider the regular lattice
S(z 0 , z 1 ) = {Skl = z 0 + k (z 1 − z 0 ) + l (z 1 − z 0 ) e^{iπ/3} , (k, l) ∈ Z2 }.

The nodes are regularly spaced:

|Sk,l+1 − Sk,l | = |Sk+1,l − Sk,l | = τ,

where τ = |z 0 − z 1 |. This mesh defines the set Ω(z 0 , z 1 ) of equilateral triangles of


mesh size τ (see Fig. 13.5).
Denote by O L the subset of Ω(z 0 , z 1 ) where T ∈ O L if, and only if, at least two
vertices of T are ε-separated. It is easy to prove that, for all z 0 ≠ z 1 , the set O L is a
finite set.
Definition 13.2 When T ∈ O L , two vertices are interior and one vertex is exterior
or the reverse. The vertex alone in its class is called the pivot of T and is denoted
p(T ).
Let us define a transformation F which maps any triangle of O L into another
triangle of the same set: it is defined by

Fig. 13.5 PAT: The lattice and the equilateral triangles

F(T ) = R( p(T ), θ ),  with θ = π/3 if p(T ) is interior, else θ = −π/3,     (13.6)

where R(z, θ ) denotes the rotation centered at z ∈ C with angle θ .
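In the complex plane the transformation amounts to rotating the triangle about its pivot, as the following sketch indicates (illustrative; is_interior(z) would in practice be an s(z) ≤ ε test, and the function is called F_map here to avoid clashing with other names):

from math import pi
from cmath import exp

def pivot(triangle, is_interior):
    # the vertex alone in its class (Definition 13.2)
    inside = [is_interior(v) for v in triangle]
    k = inside.index(True) if sum(inside) == 1 else inside.index(False)
    return triangle[k]

def F_map(triangle, is_interior):
    # rotate the triangle about its pivot by +pi/3 (interior pivot) or -pi/3, cf. (13.6)
    p = pivot(triangle, is_interior)
    theta = pi / 3 if is_interior(p) else -pi / 3
    return [p + (v - p) * exp(1j * theta) for v in triangle]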


Proposition 13.3 F is a bijection from O L onto O L .

Proof See [9].

In Fig. 13.6, four situations are listed, depending on the type of the two points p(T )
and p(F(T )).

Definition 13.3 For any given T ∈ O L we define the F−orbit of T to be the set
O(T ) = {Tn , n ∈ Z} ⊂ O L , where Tn ≡ F^n (T ).

Proposition 13.4 Let O(T ) be the F−orbit of a given triangle T ;


1. If N = card(O(T )) is the cardinality of O(T ), then N is even and is the smallest positive integer such that F^N (T ) = T .
2. Σ^{N−1}_{i=0} θi = 0 (mod 2π ), where θi is the angle of the rotation F(·) when applied to Ti .

Fig. 13.6 PAT: Illustrations of four situations for transformation F



Algorithm 13.6 PAT: path-following method to determine Λε (A) with triangles.


Input: A ∈ Rn×n , ε > 0, τ > 0 mesh size , θ lattice inclination, z 0 interior point (σmin (A − z 0 I ) ≤
ε).
Output: An initial triangle T0 = (z i , z e , z ) ∈ O L , the orbit O(T0 ), a polygonal line of N vertices
(N = card(O(T0 ))) that belong to ∂Λε (A).
1: h = τ eiθ ; Z = z 0 + h ; It = 0 ; //Looking for an exterior point
2: while σmin (A − Z I ) ≤ ε,
3: It = It + 1 ; h = 2h; Z = z 0 + h ;
4: end while
5: z i = z 0 ; z e = Z ; //STEP I: Defining the initial triangle
6: do k = 1 : It ,
7: z = (z e + z i )/2 ;
8: if σmin (A − z I ) ≤ ε, then
9: zi = z ;
10: else
11: ze = z ;
12: end if
13: end
14: T1 = F(T0 ) ; i = 1 ; I = {[z i , z e ], [z , p(T1 )]} ; //STEP II: Building the F−orbit of T0
15: while Ti = T0 ,
16: Ti+1 = F(Ti ) ; i = i + 1 ;
17: From Ti insert in I a segment which intersects ∂Λε (A) ;
18: end while
//STEP III: Extraction of points of ∂Λε (A)
19: doall I ∈ I ,
20: Extract by bisection z I ∈ ∂Λε (A) ∩ I ;
21: end

Proof See [9].


The method is called PAT and is listed as Algorithm 13.6. It includes three steps:
1. STEP I: Construction of the initial triangle. The algorithm assumes that an interior
point z 0 is already known. Typically, such a point is provided as an approximation
of an eigenvalue of A lying in the neighborhood of a given complex number.
The mesh size τ and the inclination θ of the lattice must also be provided by the
user. The first task consists in finding a point Z of the lattice that is exterior to
Λε (A). The technique of iteratively doubling the size of the interval is guaranteed
to stop since lim|z|→∞ σmin (A − z I ) = ∞. Then, an interval of prescribed size
τ and intersecting ∂Λε (A) is obtained by bisection. Observe that all the interval
endpoints that appear in lines 1–13 are nodes of the lattice.
2. STEP II: Construction of the chain of triangles. Once the initial triangle is known,
the transformation F is iteratively applied. At each new triangle a new interval
intersecting ∂Λε (A) is defined: one endpoint is the pivot of the new triangle and
the other endpoint is the new point which has been determined with the new
triangle. Since every node Skl of the lattice is exactly characterized by its integer
coordinates k and l, the test of equality between triangles is exact.
3. STEP III: Definition of the polygonal line supported by ∂Λε (A). This step is
performed by a classical bisection process.
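The bisection of STEP III, applied independently to each interval produced in STEP II, can be sketched as follows (illustrative; a dense SVD stands in for s(z) and η denotes the requested accuracy):

import numpy as np

def s_min(A, z):
    return np.linalg.svd(z * np.eye(A.shape[0]) - A, compute_uv=False)[-1]

def extract_point(A, a, b, eps, eta):
    # bisection on a segment [a, b] whose endpoints are eps-separated
    a_inside = s_min(A, a) <= eps
    while abs(b - a) > eta:
        mid = (a + b) / 2
        if (s_min(A, mid) <= eps) == a_inside:
            a = mid                    # mid lies on the same side as a
        else:
            b = mid
    return (a + b) / 2                 # a vertex of the polygonal line on the boundary

Each interval is an independent task, which is precisely the doall of line 19.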

The minimum number of triangles in an orbit is 6. This corresponds to the situation


where the pivot is the same for all the triangles of the orbit and therefore the angles of
all the rotations involved are identical. Such an orbit corresponds to a large τ when
compared to the length of ∂Λε (A).
Proposition 13.5 Assume that Algorithm 13.6 determines an orbit O(T0 ) that
includes N triangles and that N > 6. Let ℓ be the length of the curve ∂Λε (A)
and η be the required absolute accuracy for the vertices of the polygonal line. The
complexity of the computation is given by the number Nε of points where s(z) is
computed. This number is such that

Nε ≤ (N + 1) ⌈log2 (τ/η)⌉  and  N = O(ℓ/τ ).

Proof See [9].

Remark 13.1 When the matrix A and the interior point z 0 are real, and when the
inclination θ is null, the orbit is symmetric with respect to the real axis. The compu-
tation of the orbit can be stopped as soon as a new interval is real. The entire orbit is
obtained by symmetry. This halves the effort of computation.

Remark 13.2 In Algorithm 13.6, one may stop at line 19, once the orbit is determined. From the chain of triangles, a polygonal line of exterior points is then obtained which follows closely the polygonal line that would have been built with the full computation. This reduces the number of computations significantly.

To illustrate the results obtained by PAT, consider matrix grcar(100) and two experiments. In both cases, the initial interior point is automatically determined in the neighborhood of the point z = 0.8 − 1.5i. The results are reported in Table 13.1 and in Fig. 13.7.

Parallel Tasks in PAT

Parallelism in PAT, as in the previous algorithms, arises at two levels: (i) within
the kernel computing s(z); (ii) across computations of s(z) at different values z =
z 1 , z 2 , . . .. The main part of the computation is done in the last loop (line 19) of
Algorithm 13.6 that is devoted to the computation of points of the curve ∂Λε (A) from
a list of intervals that intersect the curve. As indicated by the doall instruction, the

Table 13.1 PAT: number of triangles in two orbits for matrix grcar(100)

ε           τ       N
10^{−12}    0.05    146
10^{−6}     0.2     174

Fig. 13.7 PAT: Two orbits for the matrix grcar(100); eigenvalues are plotted by (red) ‘*’, the two chains of triangles are drawn in black (ε = 10^{−12} ) and blue (ε = 10^{−6} )

iterations are independent. For a parallel version of the method, two processes can
be created: P1 dedicated to STEP I and STEP II, and P2 to STEP III. Process P1
produces intervals which are sent to P2 . Process P2 consumes the received intervals,
each interval corresponding to an independent task. This approach makes it possible to start
STEP III before STEP II terminates.
The parallel version of PAT, termed PPAT, applies this approach (see [10]
for details). To increase the initial production of intervals intersecting the curve,
a technique for starting the orbit simultaneously in several places is implemented
(as noted earlier, Cobra could also benefit from a similar technique). When A is
large and sparse, PPAT computes s(z) by means of a Lanczos procedure applied to

( O        R^{−1} )
( R^{−∗}   0      )

where R is the upper triangular factor of the sparse QR factorization
of the matrix (A − z I ). The two procedures, namely the QR factorization and the
triangular system solves at each iteration, are also parallelized.

13.2.3 Descending the Pseudospectrum

PF, Cobra, PAT and PPAT construct ∂Λε (A) for any given ε. It turns out that once
an initial boundary has been constructed, a path following approach can be applied,
yielding an effective way for computing additional boundaries for smaller values of
ε that is rich in large grain parallelism.
The idea is, instead of tracing the curve based on tangent directions, to follow the normals along directions in which s(z) decreases. This was implemented as an algorithm called the Pseudospectrum Descent Method (PsDM for short). Essentially, given enough gridpoints on the outermost pseudospectrum curve, the algorithm creates a front that steps along the normals, in the direction of decrease of s(z) at each point, until the next sought inner boundary is reached. The process can be repeated to compute

Fig. 13.8 Pseudospectrum contours and trajectories of points computed by PsDM for ε = 10^{−1} , ..., 10^{−3} for matrix kahan(100). Arrows show the directions used in preparing the outermost curve with path following and the directions used in marching from the outer to the inner curves with PsDM

pseudospectral contours. Figure 13.8 shows the results from the application of PsDM
to matrix kahan(100) (available via the MATLAB gallery; see also [11]). The
plot shows (i) the trajectories of the points undergoing the steepest descent and (ii)
the corresponding level curves. The intersections are the actual points computed
by PsDM.
To be more specific, assume that an initial contour ∂Λε (A) is available in the
form of some approximation (e.g. piecewise linear) based on N points z k previously
computed using Cobra (with only slight modifications, the same idea applies if PAT
or PPAT are used to approximate the first contour). In order to obtain a new set of
points that define an inner level curve we proceed in 2 steps:
Step 1: Start from z k and compute an intermediate point w̃k by a single modified
Newton step towards a steepest descent direction dk obtained earlier.
Step 2: Correct w̃k to wk using a Newton step along the direction lk of steepest
descent at w̃k .
Figure 13.9 (from [12]) illustrates the basic idea using only one initial point. Applying
one Newton step at z k requires ∇s(z k ). To avoid this extra cost, we apply the same
idea as in PF and Cobra and use the fact that

∇s(x̃ k + i ỹ k ) = (ℜ(g∗min qmin ), ℑ(g∗min qmin ))

where gmin and qmin are the corresponding left and right singular vectors. This quan-
tity is available from the original path following procedure. In essence we approxi-
mated the gradient based at z k with the gradient based at z̃ k , so that

z k = z̃ k − (s(x̃ k + i ỹ k ) − ε)/(q∗min gmin ),     (13.7)

Fig. 13.9 Computing ∂Λδ (A), δ < ε

Applying a correction based on gradient information from z̃ k as in (13.7), it follows that

w̃ k = z k − (ε − δ)/(q∗min gmin ),     (13.8)

where δ < ε and ∂Λδ (A) is the new pseudospectrum boundary we are seeking. Once w̃ k is available, we perform a second Newton step that yields w k :

w k = w̃ k − (s(w̃ k ) − δ)/(u∗min vmin ),     (13.9)

where the triplet used is associated with w̃k . These steps can be applied indepen-
dently to all N points. This is one sweep of PsDM; we list it as Algorithm 13.7
(PsDM_sweep).

Algorithm 13.7 PsDM_sweep: single sweep of method PsDM.


Input: N points z k of the initial contour ∂Λε (A).
Output: N points wk on the target contour ∂Λδ (A), δ < ε.
1: doall k = 1, . . . , N
2: Compute the intermediate point w̃k according to (13.8).
3: Compute the target point wk using (13.9).
4: end
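The body of PsDM_sweep for a single point translates into a few lines, as in the following sketch (illustrative; a dense SVD stands in for the triplet kernel and the gradient convention is that of Theorem 13.1):

import numpy as np

def min_triplet(A, z):
    U, S, Vh = np.linalg.svd(z * np.eye(A.shape[0]) - A)
    return S[-1], U[:, -1], Vh[-1, :].conj()

def psdm_point(A, z, u, v, eps, delta):
    # descend from a point z of the eps-curve to the delta-curve, delta < eps
    g = np.vdot(v, u)                              # gradient reused from the previous curve
    w_tilde = z - (eps - delta) * g / abs(g) ** 2  # intermediate point, cf. (13.8)
    s, u2, v2 = min_triplet(A, w_tilde)            # the only triplet computed in this step
    g2 = np.vdot(v2, u2)
    w = w_tilde - (s - delta) * g2 / abs(g2) ** 2  # corrected point, cf. (13.9)
    return w, u2, v2                               # the new triplet is reused by the next sweep

A sweep simply maps psdm_point over the N current points; since the calls are independent, the doall of Algorithm 13.7 parallelizes trivially.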

Assume now that the new points computed after PsDM_sweep define satisfactory
approximations of a nearby contour ∂Λδ (A), where δ < ε. We can continue in this
manner to approximate boundaries of the pseudospectrum nested in ∂Λδ (A). As
noted earlier, the application of PsDM_sweep uses ∇s(x̃k + i ỹk ), i.e. the triplet

at z̃ k that was available from the original computation of ∂Λε (A) with Cobra or
PF. Observe now that as the sweep proceeds to compute ∂Λδ̃ (A) from ∂Λδ (A) for
δ̃ < δ, it also computes the corresponding minimum singular triplet at w̃k . Therefore
enough derivative information is available for PsDM_sweep to be reapplied with
starting points computed via the previous application of PsDM_sweep and it is not
necessary to run PF again. Using this idea repeatedly we obtain the PsDM method
listed as Algorithm 13.8 and illustrated in Fig. 13.10.

Algorithm 13.8 PsDM: pseudospectrum descent method


Input: N points z k approximating a contour ∂Λε (A).
Output: M × N points approximating M contours ∂Λδi , ε > δ1 > . . . > δ M .
1: δ0 = ε
2: do i = 1, . . . , M
3: Compute N points of ∂Λδi by PsDM_sweep on the N points of ∂Λδi−1
4: end

Observe that each sweep can be interpreted as a map that takes as input Nin
points approximating Λδi (A) and produces Nout points approximating Λδi+1 (A)
(δi+1 < δi ). In order not to burden the discussion, we restrict ourselves to the case
that Nin = Nout . Nevertheless, this is not an optimal strategy. Specifically, since for a
given collection of ε’s, the corresponding pseudospectral boundaries are nested sets
on the complex plane, when ε > δ > 0, the area of Λδ (A) is likely to be smaller than
the area of Λε (A); similarly, the length of the boundary is likely to change; it would
typically become smaller, unless there is separation and creation of disconnected
components whose total perimeter exceeds that of the original curve, in which case
it might increase.
The cost of computing the intermediate points w̃k , k = 1, . . . , N is small since
the derivatives at z̃ k , k = 1, . . . , N have already been computed by PF or a previous
application of PsDM_sweep. Furthermore, we have assumed that σmin (z k I − A) = ε

Fig. 13.10 Pseudospectrum descent process for a single point

for all k, since the points z k approximate ∂Λε (A). On the other hand, computing
the final points wk , k = 1, . . . , N requires N triplet evaluations. Let Cσmin denote
the average cost for computing the triplet. We can then approximate the cost of
PsDM_sweep by T1 = N Cσmin . The target points can be computed independently.
On a system with p processors, we can assign the computation of at most ⌈N / p⌉ target points to each processor; one sweep will then proceed with no need for synchronization and communication, and its total cost is approximated by T p = ⌈N / p⌉ Cσmin .
Some additional characteristics of PsDM that are worth noting (cf. the references
in Sect. 13.4) are the following:
• It can be shown that the local error induced by one sweep of PsDM stepping from
one contour to the next is bounded by a multiple of the square of the stepsize of
the sweep. This factor depends on the analytic and geometric characteristics of the
pseudospectrum.
• A good implementation of PsDM must incorporate a scheme to monitor any signif-
icant length reduction (more typical) or increase (if there is separation and creation
of two or more disconnected components) between boundaries as ε changes and
make the number of points computed per boundary by PsDM adapt accordingly.
• PsDM can capture disconnected components of the pseudospectrum lying inside
the initial boundary computed via PF.
As the number of points, N , on each curve is expected to be large, PsDM is much
more scalable than the other path following methods and offers many more oppor-
tunities for parallelism while avoiding the redundancy of GRID. On the other hand,
PsDM complements these methods: We expect that one would first run Cobra, PAT
or PPAT to obtain the first boundary using the limited parallelism of these methods,
and then would apply PsDM that is almost embarrassingly parallel. In particular, each
sweep can be split into as many tasks as the number of points it handles and each
task can proceed independently, most of its work being triplet computations. Fur-
thermore, when the step does not involve adaptation, no communication is necessary
between sweeps.
Regarding load balancing, as in GRID, when iterative methods need to be used, the
cost of computing the minimum singular triplets can vary a lot between points on the
same boundary and not as much between neighboring points. In the absence of any
other a priori information on how the cost of computing the triplets varies between
points, numerical experiments (cf. the references) indicate that a cyclic partitioning
strategy is more effective than block partitioning.
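The two static strategies can be written down in a few lines; the sketch below (with hypothetical helper names) returns, for each of the p processors, the indices of the boundary points it is responsible for.

def block_partition(n, p):
    # contiguous chunks of (almost) equal size
    size, rem = divmod(n, p)
    parts, start = [], 0
    for r in range(p):
        stop = start + size + (1 if r < rem else 0)
        parts.append(list(range(start, stop)))
        start = stop
    return parts

def cyclic_partition(n, p):
    # deal the points out round-robin; clusters of expensive points get spread out
    return [list(range(r, n, p)) for r in range(p)]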

13.3 Dimensionality Reduction on the Matrix: Methods Based on Projection

When the matrix is very large, the preprocessing by complete reduction of the matrix
into a Hessenberg or upper triangular form, recommended in Sect. 13.1.2, is no
longer feasible. Instead, we seek some form of dimensionality reduction that provides

acceptable approximations to the sought pseudospectra at reasonable cost. Recalling


the Arnoldi recurrence (Eq. (9.59) in Chap. 9), an early idea [13] was to apply Krylov
methods to approximate σmin (A − z I ), from σmin (H − z I ), where H was either
the square or the rectangular (augmented) upper Hessenberg matrix that is obtained
during the Arnoldi iteration and I is the identity matrix or a section thereof to conform
in size with H . A practical advantage of this approach is that, whatever the number
of gridpoints, it requires only one (expensive) transformation; this time, however, we
would be dealing with a smaller matrix. An objection is that until a sufficient number
of Arnoldi steps have been performed so that matrix A is duly transformed into upper
Hessenberg, there is no guarantee that the minimum singular value of the square
Hessenberg matrix will provide acceptable approximations to the smallest one for
A. On the other hand, if A is unitarily similar to an upper Hessenberg matrix H , then
the ε-pseudospectra of the ( j + 1) × j upper left sections of H for j = 1, . . . , n − 1
are nested (cf. [1]),

Λε (H2,1 ) ⊆ Λε (H3,2 ) ⊆ · · · ⊆ Λε (A).

Therefore, for large enough m, Λε (Hm+1,m ) could be a good approximation for


Λε (A). This idea was further refined in the EigTool package.

13.3.1 An EigTool Approach for Large Matrices

EigTool obtains the augmented upper Hessenberg matrix H̃m = Hm+1,m and ρ Ritz
values as approximations of the eigenvalues of interest. A grid Ωh is selected over
the Ritz values and algorithm GRID is applied. Denoting by I˜ the identity matrix
Im augmented by a zero row, the computation of σmin ( H̃m − z I˜) at every gridpoint
can be done economically by exploiting the Hessenberg structure. In particular, the
singular values of H̃m − z I˜ are the same as those of the square upper triangular
factor of its “thin Q R” decomposition and these can be computed fast by inverse
iteration or inverse Lanczos iteration. A parallel algorithm for approximating the
pseudospectrum locally, as EigTool does, can be formulated by slightly modifying the algorithm in [14, Fig. 2]. We call it LAPSAR and list it as Algorithm 13.9. In addition
to the parallelism made explicit in the loop in lines 3–6, we assume that the call in
line 1 is a parallel implementation of implicitly restarted Arnoldi.
The scheme approximates ρ eigenvalues, where ρ < m (e.g. the EigTool default
is m = 2ρ) using implicitly restarted Arnoldi [15] and then uses the additional
information obtained from this run to compute the pseudospectrum corresponding
to those eigenvalues by using the augmented Hessenberg matrix H̃m .
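The computation behind lines 4–5 of Algorithm 13.9 can be sketched as follows (illustrative; the dense SVD of the small triangular factor stands in for the inverse Lanczos iteration mentioned above):

import numpy as np

def s_from_augmented_hessenberg(H_tilde, z):
    # sigma_min of z*Itilde - Htilde via the triangular factor of its thin QR
    m = H_tilde.shape[1]                         # H_tilde is (m+1) x m
    I_tilde = np.vstack([np.eye(m), np.zeros((1, m))])
    _, R = np.linalg.qr(z * I_tilde - H_tilde)   # thin QR; R is m x m
    return np.linalg.svd(R, compute_uv=False)[-1]

Since R is only m × m and m is small compared to n, the cost per gridpoint is modest, and it is reduced further when the Hessenberg structure is exploited as in EigTool.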
It is worth noting that at any step m of the Arnoldi process, the following result
holds [16, Lemma 2.1]:

(A + Δ)Wm = Wm Hm , where Δ = −h m+1,m wm+1 w∗m     (13.10)

Algorithm 13.9 LAPSAR: Local approximation of pseudospectrum around Ritz


values obtained from ARPACK.
Input: A ∈ Rn×n , l ≥ 1 positive values in decreasing order E = {ε1 , ε2 , ..., εl }; ρ the number of approximated eigenvalues; OPTS options to use for the approximation, e.g. to specify the
types of eigenvalues of interest (largest in magnitude, ...) similar to the SIGMA parameter in the
MATLAB eigs function.
Output: plot ε-pseudospectrum in some appropriate neighborhood of ρ approximate eigenvalues
of A obtained using ARPACK.
1: [Rvals, H̃m ] = arpack(A, ρ, m, OPTS) //call an ARPACK implementation using subspace
size m for ρ < m eigenvalues.
2: Define grid Ωh over region containing Ritz values Rvals
3: doall z k ∈ Ωh
4: [Q, R] = qr(z k I˜ − H̃m , 0) //thin Q R exploiting the Hessenberg structure
5: compute s(z k ) = √(λmin (R ∗ R)) //using inverse [Lanczos] iteration and exploiting the structure of R
6: end
7: Plot the l contours of s(z k ) for all values of E .

This means that the eigenvalues of Hm are also eigenvalues of a specific perturbation
of A. Therefore, whenever |h m+1,m | is smaller than ε, the eigenvalues of Hm also
lie in the pseudospectrum of A for this specific perturbation. Of course one must
be careful in pushing this approach further. Using the pseudospectrum of Hm in
order to approximate the pseudospectrum of A entails a double approximation. Even
though, when |h m+1,m | < ε, the eigenvalues of Hm (that is, its 0-pseudospectrum) are contained in Λε (A), this is not necessarily true for Λε (Hm ).

13.3.2 Transfer Function Approach

Based on the matrix resolvent characterization of the pseudospectrum (13.3), we


can construct an approximation of the entire pseudospectrum instead of focusing
on regions around selected eigenvalues. The key idea is to approximate the norm of
the resolvent, ‖R(z)‖ = ‖(A − z I )−1 ‖, based on transfer functions G z (A, E, D) =
D R(z)E, where D ∗ , E are full rank—typically long and thin—matrices, of row
dimension n. If Wm = [w1 , . . . , wm ] denotes the orthonormal basis for the Krylov
subspace Km (A, w1 ) constructed by the Arnoldi iteration, then two options are to
use E = Wm+1 and D = Wm∗ or D = W∗m+1 . If we write

Wm+1 = (Wm , wm+1 )   and   Hm+1,m = ( Hm,m        )
                                     ( h m+1,m e∗m )

and abbreviate G z,m (A) = G z (A, Wm+1 , Wm∗ ) it can be shown that

1/σmin ( H̃m − z I˜) ≤ ‖G z,m (A)‖ ≤ 1/σmin ( H̃m − z I˜) + ‖G z (A, Wm+1 , Wm∗ ) ũ‖     (13.11)

where ũ is the (m + 1)th left singular vector of H̃m − z I˜, and that

1/σmin ( H̃m − z I˜) ≤ ‖G z,m (A)‖ ≤ 1/σmin (A − z I ) = ‖R(z)‖.

Therefore, ‖G z,m (A)‖ provides an approximation to ‖R(z)‖ that improves monotonically with m and is at least as good as, and frequently better than, 1/σmin ( H̃m − z I˜) [16, Proposition 2.2]. We thus use it to approximate the pseudospectrum.
One might question the wisdom of this approach on account of its cost, as it
appears that evaluating Wm∗ (A − z I )−1 Wm+1 at any z requires solving m + 1 systems
of size n. Complexity is reduced significantly by introducing the function φz =
Wm∗ (A − z I )−1 wm+1 and then observing that


G z,m (A) = ((I − h m+1,m φz e∗m )(Hm,m − z I )−1 , φz ).     (13.12)

Therefore, to compute G z,m (A) for the values of z dictated by the underlying
domain method (e.g. all gridpoints in Ωh for straightforward GRID) we must compute
φz , solve m Hessenberg systems of size m, and then compute the norm of G z,m (A).
The last two steps are straightforward. To obtain the term (A − z I )−1 wm+1 requires
solving one linear system per shift z. Moreover, they all have the same right-hand side,
wm+1 . From the shift invariance property of Krylov subspaces, that is Kd (A, r ) =
Kd (A − z I, r ), it follows that if a Krylov subspace method, e.g. GMRES, is utilized
then the basis to use to approximate all solutions will be the same and so the Arnoldi
process needs to run only once. In particular, if we denote by Ŵd+1 the orthogonal basis
for the Krylov subspace Kd (A, wm+1 ) and by Fd+1,d the corresponding (d + 1) × d
upper Hessenberg matrix, then

(A − z I )Ŵd = Ŵd+1 (Fd+1,d − z I˜d ),

where I˜d = (Id , 0)⊤ . We refer to Sect. 9.3.2 of Chap. 9 regarding parallel implemen-
tations of the Arnoldi iterations and highlight one implementation of the key steps
of the parallel transfer function approach, based on GMRES to solve the multiply
shifted systems in Algorithm 13.10 that we call TR.
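Before turning to the parallel algorithm, the identity (13.12) can be checked on a small dense example. The following sketch (illustrative; a textbook Arnoldi and direct solves take the place of GMRES, and the matrix and parameter values are arbitrary) builds φz , assembles G z,m (A) from (13.12), and compares its norm with that of the directly computed Wm∗ (A − z I )−1 Wm+1 .

import numpy as np

def arnoldi(A, v, m):
    n = A.shape[0]
    W = np.zeros((n, m + 1), dtype=complex)
    H = np.zeros((m + 1, m), dtype=complex)
    W[:, 0] = v / np.linalg.norm(v)
    for j in range(m):
        w = A @ W[:, j]
        for i in range(j + 1):                       # modified Gram-Schmidt
            H[i, j] = np.vdot(W[:, i], w)
            w = w - H[i, j] * W[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        W[:, j + 1] = w / H[j + 1, j]
    return W, H

rng = np.random.default_rng(0)
n, m, z = 60, 20, 1.2 + 0.5j
A = rng.standard_normal((n, n)) / np.sqrt(n)
W, H = arnoldi(A, rng.standard_normal(n), m)
Wm, w_next, Hmm = W[:, :m], W[:, m], H[:m, :m]
phi = Wm.conj().T @ np.linalg.solve(A - z * np.eye(n), w_next)
e_m = np.eye(m)[m - 1]
G = np.hstack([(np.eye(m) - H[m, m - 1] * np.outer(phi, e_m))
               @ np.linalg.inv(Hmm - z * np.eye(m)), phi[:, None]])
direct = Wm.conj().T @ np.linalg.solve(A - z * np.eye(n), W)
print(np.linalg.norm(G, 2), np.linalg.norm(direct, 2))  # should agree to roughly machine precision

In TR itself the explicit solves are replaced by a single Krylov basis for Kd (A, wm+1 ), shared by all shifts z.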
TR consists of two parts, the first of which applies Arnoldi processes to obtain
orthogonal bases for the Krylov subspaces used for the transfer function and for the
solution of the size n system. One advantage of the method is that after the first
part is completed and the basis matrices are computed, we only need to repeat the
second part for every gridpoint. TR is organized around a row-wise data partitioning:
Each processor is assigned a number of rows of A and the orthogonal bases. At each
step, the vector that is to be orthogonalized to become the new element of the basis Wm

Algorithm 13.10 TR: approximating the ε-pseudospectrum Λε (A) using definition (13.3), on p processors.
Input: A ∈ Rn×n , l ≥ 1 positive values in decreasing order E = {ε1 , ε2 , ..., εl }; vector w1 with
‖w1 ‖ = 1, scalars m, d.
Output: plot of the ε pseudospectrum for all ε ∈ E
1: Ωh as in line 1 of Algorithm 13.1. //The set Ωh = {z i |i = 1, ..., M} is partitioned across
processors; processor pr_id is assigned nodes indexed I (pr_id), where I (1), . . . , I ( p) is a
partition of {1, ..., M}.
2: doall pr_id = 1, ..., p
3: (Wm+1 , Hm+1,m ) ← arnoldi(A, w1 , m)
4: (Ŵd+1 , Fd+1,d ) ← arnoldi(A, wm+1 , d)
//Each processor has some rows of Wm+1 and Ŵd+1 , say W^{(i)}_{m+1} and Ŵ^{(i)}_{d+1} , and all have
access to Hm+1,m and Fd+1,d .
5: do j ∈ I (pr_id)
6: compute y j = argmin y ‖(Fd+1,d − z j I˜d )y − e1 ‖ and broadcast y j to all processors
7: end
8: Compute φz,pr_id = (W^{(pr_id)}_m )∗ Ŵ^{(pr_id)}_d Y where Y = (y1 , ..., y M )
9: end
//The following is a reduction operation
10: Compute Φz = Σ^{p}_{k=1} φz,k in processor 1 and distribute its columns, (Φz )k , (k = 1, . . . , M)
to all processors
11: doall pr_id = 1, ..., p
12: do j ∈ I (pr_id)
13: D j = (I − h m+1,m (Φz ) j e∗m )(Hm,m − z j I )−1
14: nm_gz j = ‖(D j , (Φz ) j )‖
15: end
16: end
17: Gather all nm_gzi , i = 1, . . . , M in processor 1. //They are the computed values of
1/‖G zi (A)‖
18: Plot the contours of 1/‖G zi (A)‖, i = 1, ..., M, corresponding to E .

is assumed to be available in each processor. This requires a broadcast, at some point


during the iteration, but allows the matrix-vector multiplication to take place locally
on each processor in the sequel.
After completing the Arnoldi iterations (Algorithm 13.10, lines 3–4), a set of rows
of Wm+1 and Ŵd+1 are distributed among the processors. We denote these rows by
W^{(i)}_{m+1} and Ŵ^{(i)}_{d+1} , where i = 1, . . . , p. In line 6, each node computes its share of the
columns of matrix Y . For the next steps, in order for the local φz to be computed, all
processors must access the whole Y , hence all-to-all communication is necessary. In
lines 12–15 each processor works on its share of gridpoints and columns of matrix Φz . Finally, processor 1 gathers the approximations of {1/s(z i )}, i = 1, . . . , M. For lines 8–10, note that φz,i = W∗i Ŵi Y , where Wi and Ŵi , i = 1, ..., p, are the subsets of contiguous rows of the basis matrices W and Ŵ that each node privately owns. Then, Φz = Σ^{p}_{i=1} φz,i , which is a reduction operation applied on matrices, so it can
be processed in parallel. It is worth noting that the preferred order in applying the
two matrix-matrix multiplications of line 8 depends on the dimensions of the local

submatrices W^{(i)}_m , Ŵ^{(i)}_d and on the column dimension of Y , and could be decided at
runtime.
Another observation is that the part of Algorithm 13.10 after line 7 involves only
dense matrix computations and limited communication. Remember also that at the
end of the first part, each processor has a copy of the upper Hessenberg matrices.
Hence, we can use BLAS3 to exploit the memory hierarchy available in each proces-
sor. Furthermore, if M ≥ p, the workload is expected to be evenly divided amongst
processors.
Observe that TR can also be applied in stages, to approximate the pseudospectrum
for subsets of Ωh at a time. This might be preferable when the size of the matrix
and the number of gridpoints are very large, in which case the direct application of TR to the entire grid becomes very demanding in memory and communication. Another possibility is to combine the transfer function approach with one of the methods
we described earlier that utilizes dimensionality reduction on the domain. For very
large problems, it is possible that the basis Ŵd+1 cannot be made large enough to
provide an adequate approximation for (A − z I )−1 wm+1 at one or more values of z.
Restarting then becomes necessary, using as starting vectors the residuals corresponding to each shift (other than for those systems that have already been approximated to the desired tolerance). Unfortunately, the residuals are in general different, and the shift invariance property is no longer readily applicable.
Cures for this problem have been proposed both in the context of GMRES [17] and
FOM (Full Orthogonalization Method) [18]. Another possibility is to use short recurrence methods, e.g. QMR [19] or BiCGStab [20]. For example, it was shown that if FOM is used, then the residuals from all shifted systems are collinear; cf. [21].

13.4 Notes

The seminal reference on pseudospectra and their computation is [1]. The initial
reduction of A to Schur or Hessenberg form prior to computing its pseudospectra
and the idea of continuation were proposed in [22]. A first version of a Schur-based
GRID algorithm written in MATLAB was proposed in [23]. The EigTool package
was initially developed in [24]. The queue approach for the GRID algorithm and its
variants was proposed in [25] and further extended in [4] who also showed several
examples where imbalance could result from simple static work allocation. The idea
of inclusion-exclusion and the MOG algorithm 13.3 were first presented in [2]. The
parallel Modified Grid Method was described in [4] and experiments on a cluster of
single processor PC’s over Ethernet running the Cornell Multitasking Toolbox [26]
demonstrated the potential of this parallelization approach. MOG was extended for
matrix polynomials in [27] also using the idea of “inclusion disks”. These, combined
with the concentric exclusion disks we mentioned earlier, provide the possibility for
a parallel MOG-like algorithm that constructs the pseudospectrum for several values

of ε. The path following techniques for the pseudospectrum originate from early
work in [7] where they demonstrated impressive savings over GRID. Cobra was
devised and proposed in [8]. Advancing by triangles was proposed in [9, 10]. The
Pseudospectrum Descent Method (PsDM) was described in [12]. The programs were
written in Fortran-90 and the MPI library and tested on an 8 processor SGI Origin
200 system. The method shares a lot with an original idea described in [28] for the
independent computation of eigenvalues.
The transfer function approach and its parallel implementation were described in
[16, 29]. Finally, parallel software tools for constructing pseudospectra based on the
algorithms described in this chapter can be found in [30, 31].

References

1. Trefethen, L., Embree, M.: Spectra and Pseudospectra. Princeton University Press, Princeton
(2005)
2. Koutis, I., Gallopoulos, E.: Exclusion regions and fast estimation of pseudospectra (2000).
Submitted for publication (ETNA)
3. Braconnier, T., McCoy, R., Toumazou, V.: Using the field of values for pseudospectra genera-
tion. Technical Report TR/PA/97/28, CERFACS, Toulouse (1997)
4. Bekas, C., Kokiopoulou, E., Koutis, I., Gallopoulos, E.: Towards the effective parallel compu-
tation of matrix pseudospectra. In: Proceedings of the 15th ACM International Conference on
Supercomputing (ICS’01), pp. 260–269. Sorrento (2001)
5. Reichel, L., Trefethen, L.N.: Eigenvalues and pseudo-eigenvalues of Toeplitz matrices. Linear
Algebra Appl. 162–164, 153–185 (1992)
6. Sun, J.: Eigenvalues and eigenvectors of a matrix dependent on several parameters. J. Comput.
Math. 3(4), 351–364 (1985)
7. Brühl, M.: A curve tracing algorithm for computing the pseudospectrum. BIT 33(3), 441–445
(1996)
8. Bekas, C., Gallopoulos, E.: Cobra: parallel path following for computing the matrix
pseudospectrum. Parallel Comput. 27(14), 1879–1896 (2001)
9. Mezher, D., Philippe, B.: PAT—a reliable path following algorithm. Numer. Algorithms 1(29),
131–152 (2002)
10. Mezher, D., Philippe, B.: Parallel computation of the pseudospectrum of large matrices. Parallel
Comput. 28(2), 199–221 (2002)
11. Higham, N.: The Matrix Computation Toolbox. Technical Report, Manchester Centre for Com-
putational Mathematics (2002). http://www.ma.man.uc.uk/~higham/mctoolbox
12. Bekas, C., Gallopoulos, E.: Parallel computation of pseudospectra by fast descent. Parallel
Comput. 28(2), 223–242 (2002)
13. Toh, K.C., Trefethen, L.: Calculation of pseudospectra by the Arnoldi iteration. SIAM J. Sci.
Comput. 17(1), 1–15 (1996)
14. Wright, T., Trefethen, L.N.: Large-scale computation of pseudospectra using ARPACK and
Eigs. SIAM J. Sci. Comput. 23(2), 591–605 (2001)
15. Lehoucq, R., Sorensen, D., Yang, C.: ARPACK User’s Guide: Solution of Large-Scale Eigen-
value Problems with Implicitly Restarted Arnoldi Methods. SIAM, Philadelphia (1998)
16. Simoncini, V., Gallopoulos, E.: Transfer functions and resolvent norm approximation of large
matrices. Electron. Trans. Numer. Anal. (ETNA) 7, 190–201 (1998). http://etna.mcs.kent.edu/
vol.7.1998/pp190-201.dir/pp190-201.html
17. Frommer, A., Glässner, U.: Restarted GMRES for shifted linear systems. SIAM J. Sci. Comput.
19(1), 15–26 (1998)

18. Simoncini, V.: Restarted full orthogonalization method for shifted linear systems. BIT Numer.
Math. 43(2), 459–466 (2003)
19. Freund, R.: Solution of shifted linear systems by quasi-minimal residual iterations. In: Reichel,
L., Ruttan, A., Varga, R. (eds.) Numerical Linear Algebra, pp. 101–121. W. de Gruyter, Berlin
(1993)
20. Frommer, A.: BiCGStab(ℓ) for families of shifted linear systems. Computing 7(2), 87–109
(2003)
21. Simoncini, V.: Restarted full orthogonalization method for shifted linear systems. BIT Numer.
Math. 43(2), 459–466 (2003)
22. Lui, S.: Computation of pseudospectra with continuation. SIAM J. Sci. Comput. 18(2), 565–573
(1997)
23. Trefethen, L.: Computation of pseudospectra. Acta Numerica 1999, vol. 8, pp. 247–295. Cam-
bridge University Press, Cambridge (1999)
24. Wright, T.: Eigtool: a graphical tool for nonsymmetric eigenproblems (2002). http://web.
comlab.ox.ac.uk/pseudospectra/eigtool. (At the Oxford University Computing Laboratory site)
25. Frayssé, V., Giraud, L., Toumazou, V.: Parallel computation of spectral portraits on the Meiko
CS2. In: Liddel, H., et al. (eds.) LNCS: High-Performance Computing and Networking, vol.
1067, pp. 312–318. Springer, Berlin (1996)
26. Zollweg, J., Verma, A.: The Cornell multitask toolbox. http://www.tc.cornell.edu/Services/
Software/CMTM/. Directory Services/Software/CMTM at http://www.tc.cornell.edu
27. Fatouros, S., Psarrakos, P.: An improved grid method for the computation of the pseudospectra
of matrix polynomials. Math. Comput. Model. 49, 55–65 (2009)
28. Koutis, I.: Spectrum through pseudospectrum. http://arxiv.org/abs/math.NA/0701368
29. Bekas, C., Kokiopoulou, E., Gallopoulos, E., Simoncini, V.: Parallel computation of
pseudospectra using transfer functions on a MATLAB-MPI cluster platform. In: Recent
Advances in Parallel Virtual Machine and Message Passing Interface, Proceedings of the 9th
European PVM/MPI Users’ Group Meeting. LNCS, vol. 2474. Springer, Berlin (2002)
30. Bekas, C., Kokiopoulou, E., Gallopoulos, E.: The design of a distributed MATLAB-based
environment for computing pseudospectra. Future Gener. Comput. Syst. 21(6), 930–941 (2005)
31. Mezher, D.: A graphical tool for driving the parallel computation of pseudosprectra. In: Pro-
ceedings of the 15th ACM International Conference on Supercomputing (ICS’01), pp. 270–276.
Sorrento (2001)
Index

A BLAS3, 20, 29, 30, 82, 100, 104, 233,


α-bandwidth, 42, 43 235, 264, 272, 462
Amdahl’s law, 13 DOT, 448
APL language, 176 Block cyclic reduction, 137, 139, 206
Arnoldi’s method, 299–303, 350, 351, 460, Buneman stabilization, see Stabilization
461
identities, 301
iteration, 337, 458, 460, 461 C
restarted, 355, 459 CFT, see Rapid elliptic solvers
CG, 37, 113–115, 297, 298, 312, 318–321,
372–378, 380
acceleration, 318, 320, 321
B CGNE, 323
Banded preconditioners, 91, 95, 99 preconditioned, 316
Banded solver, 65, 129, 176 Chebyshev acceleration, 390, 396, 398
Banded Toeplitz solver, 183–192 Chebyshev-Krylov procedure, 306
Bandweight, 42 Cholesky factorization, 25, 26, 99, 126, 179,
Bandwidth (matrix), 38, 40–43, 52, 57, 64, 184, 188–191, 237, 251
66–69, 72, 91, 92, 105, 115, 128–130, incomplete, 399
141, 145, 157, 176, 180, 191, 352, Cimmino method, 320, 323
363 Cobra, see Path following methods
B2GS, see Gram-Schmidt algorithm Complete Fourier transform, see CFT
BCR, see Rapid elliptic solvers Conjugate gradient, see CG
BGS, see Gram-Schmidt algorithm Conjugate gradient of the normal equations,
BiCG, 336 see CGNE
BiCGStab, 463 CORF, see Rapid elliptic solvers
Biconjugate gradient, see BiCG Cramer’s rule, 155, 192
Bidiagonalization CS decomposition, 322, 323
via Householder reduction, 272
BLAS, 17, 20, 25, 29, 30, 52, 82, 100, 104,
139, 172, 211, 232–235, 243, 264, D
272, 384, 410, 426, 439, 462 Davidson method, 363–370, 381, 389, 398,
_AXPY, 18, 32, 301 399
_DOT, 18, 301 dd_pprefix, see Divided differences
BLAS1, 20, 25, 139, 211, 232, 264, 301 dft, 168, 169, 171
BLAS2, 19, 25, 82, 100, 172, 232, 234, Diagonal boosting, 87, 100
243, 264, 394, 395, 426 Direct solver, 94, 95, 312, 431

© Springer Science+Business Media Dordrecht 2016 467


E. Gallopoulos et al., Parallelism in Matrix Computations,
Scientific Computation, DOI 10.1007/978-94-017-7188-7
468 Index

Discrete Fourier transform, see dft for tridiagonal systems, 127


Distributed memory class, 10 with diagonal pivoting, 132, 134
Divided differences with pairwise pivoting, 83
dd_pprefix, 173 with partial pivoting, 84, 91, 148
Neville, 172 without pivoting, 82, 188
Domain decomposition, 214, 327, 328 Generalized minimal residual method,
algebraic, 214 see GMRES
DDBiCGstab, 125 Givens rotations, 144, 147, 148, 150, 151,
DDCG, 125 154, 214, 231, 233, 234, 237, 244,
Domain decomposition BiCGstab, 259, 272, 393
see Domain decomposition ordering, 230
Domain decomposition CG, parallel algorithm, 144
see Domain decomposition solver, 144
Dominant Givens rotations solver
eigenpair, 345, 346, 390 based on Spike partitioning, 144
eigenvalue, 343, 346 GMRES, 299, 304, 426, 461, 463
Gram-Schmidt algorithm, 29, 233, 269, 271,
300, 382, 390
E B2GS, 391, 395
EES, see Rapid elliptic solvers block, 391
EIGENCNT algorithm, 434 classical, 234, 346
Eigendecomposition, 385 modified, 26–28, 234, 235, 272, 297,
Eigenspace, 369–372, 377 298, 300, 301, 345, 346, 348, 364, 365,
Elliptic problems, 129, 165, 183, 197, 198, 400
218, 277, 316 orthogonalization, 26, 235, 395
with reorthogonalization, 382
Graphical Processing Units (GPUs), 6, 154,
F 218, 413
FACR, see Rapid elliptic solvers GRID, see Grid methods
FEAST eigensolver, 369 Grid methods
Fast Fourier transform (FFT) algorithm, GRID, 441–446, 458, 459, 461, 463
168–170, 183, 184, 201, 211, 218, modified, 443–446
219 GRID_fact, 442
multiplication, 170 MOG, 443–446, 463
Fiedler vector, 41, 43
Schur-based GRID, 463
Finite difference discretization, 30
GRID_fact, see Grid methods
Finite elements discretization, 30, 360, 361
FOM, 426, 463
Forward substitution, 52, 141
FACR, see Rapid elliptic solvers H
Full orthogonalization method, see FOM Half-bandwidth, 40, 42, 352
Householder algorithm
hybrid, 259
G Householder orthogonalization, 391
Gauss-Seidel method, 281 with column pivoting, 242
block, 315, 332 Householder transformations, 144, 232, 233,
line, 287, 290, 291, 293 242, 251, 264, 417
point, 283 Householder vector, 26
splitting, 283 Householder-Jacobi scheme, 259
red-black point, 283
Gaussian elimination, 79, 82–84, 87, 91,
127, 129, 132, 134, 148, 153, 167, I
188, 198, 231, 284, 401 Incomplete cyclic reduction, 136, 137
dense systems, 198 Intel TBB library, 176
Index 469

Invariant subspace, 270, 271, 299, 331, 333, hybrid, 394


344–346, 351, 379 preconditioned, 370
Inverse iteration, 268, 270, 271, 349, 354, single-vector, 394
355, 367, 378, 401, 459 tridiagonalization, 392
Inverse Lanczos iteration, 443, 459 with reorthogonalization, 355
IPF, see Partial fractions without reorthogonalization, 354
Iterative refinement, 80, 183, 184, 192, 401 Lanczos blocks, 356, 357
Lanczos eigensolver, 354–356, 364, 392
J LANCZOS2, 366
Jacobi iteration for the SVD Lanczos identities, 350
1JAC, 257, 259–262, 388 Lanczos recursion, 392, 394
QJAC, 259 Lanczos vectors, 393, 394
2JAC, 254, 257, 260, 261 LAPACK, see Numerical libraries and
Jacobi method for linear systems, 261–263 environments
block, 319 LDL factorization
block-row, 315 sparse, 385
line-Jacobi scheme, 284, 289 LDU factorization, 141, 144, 146
point-Jacobi scheme, 282, 283 trid_ldu, 142
red-black line-Jacobi scheme, 285, 287 Least squares polynomials, 176
red-black point-Jacobi scheme, 283, 285 Least squares problems, 30, 105, 114, 227,
red-black scheme, 286 228, 235, 236, 238, 239, 242
Jacobi rotations, 254, 255, 261 Linear systems
Jacobi-Davidson method, 370, 381–383
banded, 91, 94, 96, 111, 115
outer iteration, 382
block tridiagonal, 94, 126, 127, 129, 131,
139, 141, 149, 150, 154, 203, 205, 207,
208, 210, 218, 219, 290
K
KACZ, see Kaczmarz method block-Hankel, 176
Kaczmarz method, see Row projection block-Toeplitz, 176
algorithms diagonally dominant SPD, 115
Krylov subspace discrete Poisson, 198, 202, 205, 214, 219
bases, 236, 299, 301, 303–306, 327, 337, nonsingular, 105, 277
460, 461 nonsymmetric, 115
invariant, 427 positive-semidefinite, 372
method, 38, 103, 299, 308, 329, 425, 426, reduced, 95, 98–105, 130, 136, 149, 153,
458 205, 401, 416, 431
modified methods, 125 singular, 249
polynomial methods, 294 symmetric indefinite, 368
preconditioned methods, 85, 299, 316 symmetric positive definite, 115, 184,
preconditioned scheme, 38 188
projection methods, 424 Toeplitz, 52, 61–73, 127, 176–192, 201,
Krylov subspace iteration 203, 212, 215, 218, 219
inner, 100
tridiagonal, 91, 156
outer, 91, 99, 100
underdetermined, 106, 110
Vandermonde, 166, 168, 172, 174
L LU factorization, 60, 82, 84, 85, 91, 92, 94–
Lagrange multipliers, 372 96, 99–101, 104, 128, 129, 154, 347,
Lagrange polynomials, 167, 171 349, 428, 429, 431, 433
Lanczos algorithm, 350, 351, 354–356, 358, banded, 95
363, 364, 369, 389 block, 84, 85, 128, 129
block, 351, 352, 358 incomplete, 399
eigensolver, 354–356, 364, 392 parallel sparse, 429
470 Index

with partial pivoting, 99 spectral radius, 109, 280, 288, 290, 294,
without pivoting, 99 316–318, 322, 323
LU/UL strategy, 104 spectrum, 109, 249, 267, 268, 304, 306,
308, 318, 322, 345–347, 349, 354, 378,
385, 387, 410
Spike, 96, 101, 153, 431
M sub-identity, 327, 328
Marching algorithms, 129, 130, 150, 214, symmetric indefinite, 392
277 symmetric positive definite (SPD), 118,
Matrix 119, 123–126, 133, 149, 298, 391, 394
banded, 91, 95, 105, 115, 116, 129, 149, Toeplitz, 62–72, 127, 165, 166, 176–191,
154, 159, 311, 384, 394 198, 201, 212, 215, 218, 410, 440, 444
bidiagonal, 177, 251, 272, 284, 307, 308, triangular, 40, 49, 53, 57, 58, 60, 62, 64,
393 66, 80, 81, 107, 130, 134, 142, 147, 148,
circulant, 177, 180, 182–185, 188 152, 166, 189, 231, 237, 240, 243, 244,
diagonal, 132, 216, 217, 252, 260, 261, 259–261, 300, 349, 387, 393, 410, 428,
266, 281, 327, 328, 390 431
diagonal form, 33, 38, 96, 115, 251, 260, tridiagonal, 98, 115, 127, 132–136, 141,
263, 307, 349, 357, 358, 360, 393, 394, 143, 144, 148, 150, 152, 154, 156, 157,
416, 429 159, 197, 198, 200, 201, 203, 204, 207–
diagonalizable, 199, 345, 410 209, 215, 219, 262–265, 267, 268, 278,
diagonally dominant, 81, 85–87, 99, 100, 280, 287, 290, 302, 306, 308, 309, 311,
104, 115, 119, 125, 127, 133, 134, 137– 339, 349, 350, 352–354, 356–359, 392,
139, 144, 191, 279, 284, 370 394
Hermitian, 249, 343, 344, 431 unitary, 249
Hessenberg, 231, 300–304, 308, 429, zero, 120, 129, 134, 135
458–462 Matrix decomposition, 201–204, 210, 211,
indefinite, 349 213, 215, 216, 218, 219
irreducible, 40, 128, 142–144, 148, 150, MD- Fourier, 215, 216
156, 157, 279, 351 Matrix reordering, 37, 38, 42, 129, 311
Jacobian, 238 minimum-degree, 37
Jordan canonical form, 410 nested dissection, 37
M-matrix, 280, 330 reverse Cuthill-McKee, 37–39, 91
multiplication, 20, 22, 24, 30, 60, 80, 87, spectral, 38–43, 91
345, 384, 385, 395, 413, 418, 462 weighted spectral, 39, 40, 42
non-diagonalizable, 410 Matrix splitting-based paracr, 134, 136
nonnegative, 80, 133 incomplete, 137
nonsingular, 38, 79, 83, 120, 123, 229, Matrix-free methods, 31
299, 336, 387, 388 Maximum product on diagonal algorithm,
norm, 439 see MPD algorithm
orthogonal, 120, 151, 194, 196, 197, 202, Maximum traversal, 40
228, 229, 242, 243, 246, 250, 251, 256, MD- Fourier, see Matrix decomposition
262, 263, 266, 272, 297, 322, 363, 375, Memory
391 distributed, 9, 10, 19, 28, 33, 96, 169,
orthornormal, 320, 358, 379 218, 235, 236, 261, 262, 384, 429
permutation, 37, 40, 192, 194, 197, 202, global, 9, 239
282, 286, 349, 362, 428 hierarchical, 261
positive definite, 314 local, 13, 24, 36, 290, 338, 385
positive semidefinite, 41 shared, 9, 218, 219, 261, 262
pseudospectrum, 417 Minimum residual iteration, 380
reduction, 356, 393, 394, 458, 463 MOG, see Grid methods
Schur form, 443, 444, 463 MPD algorithm, 40
skew-symmetric, 173 Multiplicative Schwarz algorithm, 329, 332
Index 471

    as preconditioner, 333
    splitting, 334
MUMPS, see Numerical libraries and environments

N
Neville, see Divided differences
Newton basis, 339
Newton form, 172, 174
Newton Grassmann method, 368
Newton interpolating polynomial, 174
Newton polynomial basis, 307
Newton’s method, 238
    iteration, 37, 75, 447, 448
    step, 364, 455
Newton-Arnoldi iteration, 337
Newton-Krylov procedure, 308, 327, 337
Nonlinear systems, 238
Norm
    ∞-norm, 167
    2-norm, 130, 143, 148, 228, 242, 244, 299, 352, 366, 374, 376, 384, 400, 439
    ellipsoidal norm, 314
    Frobenius norm, 37–39, 243, 245, 253, 311
    matrix norm, 439
    vector norm, 314
Numerical libraries and environments
    ARPACK, 355, 459
    EISPACK
        TINVIT, 272
    GotoBLAS, 20
    Harwell Subroutine Library, 349
    LAPACK, 100, 104, 265, 267, 272, 384
        DAXPY, 170
        DGEMV, 232
        DGER, 232
        DSTEMR, 272
        DSYTRD, 264
    MATLAB, 30, 168, 232, 349, 454, 459, 463
        Matrix Computation Toolbox, 440
    MUMPS, 429, 431
    PARDISO solver, 312
    PELLPACK, 218
    ScaLAPACK, 92, 137, 149, 259
        PDGEQRF, 259
        PDSYTRD, 264
    SuperLU, 429, 431
    WSMP, 431
Numerical quadrature, 414

O
Ordering
    Leja, 308, 421, 423
    red-black, 281–283, 285, 286
Orthogonal factorization
    Givens orthogonal, 259
    sparse, 107

P
paracr, 136
Parallel computation, 13, 141, 267, 424
Parallel cyclic reduction, see paracr
Partial fractions
    coefficients, 206, 208, 210, 215, 411, 416, 417, 419–421, 423, 426–428
    expansion, 415, 417–420, 423, 426
    incomplete
        IPF, 419–421, 423, 424, 427
    representation, 205, 206, 208, 210, 213, 215, 414, 415, 420, 424, 425, 427
    solving linear systems, 210
PAT, see Path following methods
Path following methods
    Cobra, 448–450, 454, 455, 457
    PAT, 450, 452–455
    PF, 447–450, 454, 455, 457, 458
    PPAT, 454, 455
PDGESVD, see Singular value decomposition
Perron-Frobenius theorem, 279
PF, see Path following methods
Pivoting strategies
    column, 60, 154, 228, 242, 243, 262, 388
    complete, 60
    diagonal, 132, 134, 154
    incremental, 84
    pairwise, 82–85, 91
    partial, 84, 85, 87, 91, 92, 99, 100, 148, 154, 188, 428
Poisson solver, 219
Polyalgorithm, 149
Polynomial
    Chebyshev, 50, 199, 200, 205, 214, 289, 295, 304, 305, 348, 354, 390, 391, 411
    evaluation, 69
    Horner's rule, 69, 414
    Paterson-Stockmeyer algorithm, 413, 414
Power method, 343, 344, 346
powform algorithm, 170, 171
PPAT, see Path following methods
prefix, see Prefix computation
prefix_opt, see Prefix computation
pr2pw algorithm, 169–172
Preconditioned iterative solver, 111
Preconditioning, 18, 99, 113, 115, 311
    Krylov subspace schemes, 38
    left and right, 122, 154, 336
Prefix
    parallel matrix prefix, 142–144, 146–148
    parallel prefix, 142–144, 146, 148, 175, 176
Prefix computation
    prefix, 174, 175
    prefix_opt, 174, 175
Programming
    Single Instruction Multiple threads (SIMT), 6
    Multiple Instruction Multiple Data (MIMD), 9–11, 218
    Single Instruction Multiple Data (SIMD), 5, 6, 10, 13, 218, 277
    Single Program Multiple Data organization (SPMD), 11
PsDM, see Pseudospectrum methods
Pseudospectrum computation
    LAPSAR, 459
    PsDM, 454, 456–458
    TR, 461–463

Q
QMR, 426, 463
QR algorithm, 265
    iterations, 197, 265
QR factorization, 60, 147, 154, 227, 228, 231, 237, 242, 243, 246, 262, 265, 272, 302, 303, 399, 454
    by Householder transformations, 242
    incomplete, 399
    solving linear systems, 227
    sparse, 454
    thin, 459
    via Givens rotations, 154
    with column pivoting, 60, 262
Quadratic equation, 155

R
Rapid elliptic solvers, 165, 170, 198, 409, 415
    BCR, 205, 209–211, 219, 411
    CFT, 204, 219
    CORF, 207–211
    EES, 215–217
    FACR, 211, 219
Rayleigh quotient, 259, 363, 366, 371, 397
    generalized, 371, 382
    matrix, 390
Rayleigh quotient iteration, 347
Rayleigh quotient method, 368
Rayleigh-Ritz procedure, 390
rd_pref, see Tridiagonal solver
Recursive doubling, 127
RES, see Rapid elliptic solvers
RODDEC, 237
Row projection algorithms
    block, 311
    Cimmino, 326
    Kaczmarz method
        classical, 326
        symmetrized, 319–321, 323, 325
Rutishauser's
    Chebyshev acceleration, 390
    RITZIT, 390, 391
    subspace iteration, 391

S
SAS decomposition, 165, 195, 196, 362
ScaLAPACK, see Numerical libraries and environments
Scatter and gather procedure, 33
Schur complement, 132, 155, 219, 372
Schur decomposition, 442, 443
Schwarz preconditioning, 336
Sherman-Morrison-Woodbury formula, 127, 154, 183, 190
simpleSolve_PF algorithm, 206
Simultaneous iteration method, 271, 345–348, 369, 371–373, 377, 389, 390
Singular value decomposition (SVD), 259–262, 386, 387, 393, 395
    block Lanczos, 394
    BLSVD, 392, 394
    hybrid scheme, 261
    sparse, 401
SOR, 293, 315, 316, 318, 319, 322
    block symmetric (SSOR), 316
SP_Givens method, 150
Sparse direct solver, 312, 387, 431
Sparse inner product, 32
Sparse matrices
    computations, 17, 33
    fill-in, 31
    graph representation, 31
    reordering, 36
    storage, 31–33
    systems, 18, 37
Sparse matrix-multivector multiplication, 384
Sparse matrix-vector multiplication, 37, 301, 395, 396
Sparse preconditioner, 312
Spike algorithm, 94–96, 99–101, 103–105, 115, 151, 214, 311
    on-the-fly, 103
    recursive, 101–103
    truncated version, 104
Spike DS factorization, 153
Spike factorization, 96, 99, 429, 430
Spike partitioning, 144, 149
Spike preconditioner, 104
Spike-balance scheme, 111
    projection-based, 112
Stabilization
    Buneman, 209, 210
Steepest descent, 313, 454, 455
Strassen's algorithm, 22–24, 418
    parallel, 24
Strassen-Winograd algorithm, 23
Subspace iteration, 91, 99, 100, 271, 345–348, 369, 371–373, 377, 389–391, 396
    SISVD, 391, 396, 397
Successive over-relaxation, see SOR
SuperLU, see Numerical libraries and environments
SVD, see Singular value decomposition
Symmetric-antisymmetric decomposition, see SAS decomposition

T
TFQMR, 426
TR, see Pseudospectrum methods
Trace minimization
    TRSVD, 396–398
    TraceMIN, 384, 385
    TraceMIN_1, 384, 385
    TraceMIN_2, 385
Trace minimization algorithm
    Davidson-type, 381, 383
Transpose-free methods, 398
TREPS, see Eigenvalue parallel solver
Triangular solver
    block banded (BBTS), 59
    CSweep, 53, 59, 185
    DTS, 54, 56, 57, 60, 61
    stability, 59
Triangular Toeplitz solver
    banded (BTS), 190
Triangular Toeplitz solver (TTS), 63
Tridiagonal eigenvalue parallel solver
    TREPS, 267, 269, 349, 354, 355
Tridiagonal solver
    Givens-QR, 150
    pargiv, 148, 150, 151, 153
    rd_pref, 144
TRSVD, see Trace minimization

U
UL-factorization, 104, 105
unvec operation, 202

V
Vandermonde solvers, 166
vec operation, 202

W
Weighted bandwidth reduction, 39

Z
ZEROIN method, 269