High Performance Numerical
Libraries
Sathish Vadhiyar
Gaussian Elimination - Review
Version 1
for each column i
zero it out below the diagonal by adding multiples of row i to
later rows
for i= 1 to n-1
for each row j below row i
for j = i+1 to n
add a multiple of row i to row j
for k = i to n
A(j, k) = A(j, k) – A(j, i)/A(i, i) * A(i, k)
i
0
0 0 k
i
0 0 i,i
0 0 X
j 0 0 X
0 0 x
Gaussian Elimination - Review
Version 2 – Remove A(j, i)/A(i, i) from inner loop
for each column i
zero it out below the diagonal by adding multiples of row i to
later rows
for i= 1 to n-1
for each row j below row i
for j = i+1 to n
m = A(j, i) / A(i, i)
for k = i to n
A(j, k) = A(j, k) – m* A(i, k)
i
0
0 0 k
i
0 0 i,i
0 0 X
j 0 0 X
0 0 x
Gaussian Elimination - Review
Version 3 – Don’t compute what we already know
for each column i
zero it out below the diagonal by adding multiples of row i to
later rows
for i= 1 to n-1
for each row j below row i
for j = i+1 to n
m = A(j, i) / A(i, i)
for k = i+1 to n
A(j, k) = A(j, k) – m* A(i, k)
i
0
0 0 k
i
0 0 i,i
0 0 X
j 0 0 X
0 0 x
Gaussian Elimination - Review
Version 4 – Store multipliers m below diagonals
for each column i
zero it out below the diagonal by adding multiples of row i to
later rows
for i= 1 to n-1
for each row j below row i
for j = i+1 to n
A(j, i) = A(j, i) / A(i, i)
for k = i+1 to n
A(j, k) = A(j, k) – A(j, i)* A(i, k)
i
0
0 0 k
i
0 0 i,i
0 0 X
j 0 0 X
0 0 x
GE - Runtime
Divisions
1+ 2 + 3 + … (n-1) = n2/2 (approx.)
Multiplications / subtractions
12 + 22 + 32 + 42 +52 + …. (n-1)2 = n3/3 – n2/2
Total
2n3/3
Parallel GE
1st step – 1-D block partitioning along
blocks of n columns by p processors
i
0
0 k
0
i
0 0 i,i
0 0 X
j 0 0 X
0 0 x
1D block partitioning - Steps
1. Divisions
n2/2
2. Broadcast
xlog(p) + ylog(p-1) + zlog(p-3) + … log1 <
n2logp
3. Multiplications and Subtractions
(n-1)n/p + (n-2)n/p + …. 1x1 = n3/p (approx.)
Runtime:
< n2/2 +n2logp + n3/p
2-D block
To speedup the divisions
P
0
0 k
0
i
0 0 i,i Q
0 0 X
j 0 0 X
0 0 x
2D block partitioning - Steps
1. Broadcast of (k,k)
logQ
2. Divisions
n2/Q (approx.)
3. Broadcast of multipliers
xlog(P) + ylog(P-1) + zlog(P-2) + …. = n2/Q logP
4. Multiplications and subtractions
n3/PQ (approx.)
Problem with block partitioning for
GE
Once a block is finished, the
corresponding processor remains idle
for the rest of the execution
Solution? -
Onto cyclic
The block partitioning algorithms waste
processor cycles. No load balancing
throughout the algorithm.
Onto cyclic
0123012301230
cyclic 1-D block-cyclic 2-D block-cyclic
Load balance, block operations, Has everything
Load balance but column factorization
Block cyclic
Having blocks in a processor can lead
to block-based operations (block
matrix multiply etc.)
Block based operations lead to high
performance
GE: Miscellaneous
GE with Partial Pivoting
1D block-column partitioning: which is
better? Column or row pivoting
•Column pivoting does not involve any extra steps since pivot search
and exchange are done locally on each processor. O(n-i-1)
•The exchange information is passed to the other processes by
piggybacking with the multiplier information
• Row pivoting
• Involves distributed search and exchange – O(n/P)+O(logP)
2D block partitioning: Can restrict the pivot
search to limited number of columns
Sparse Iterative Methods
Iterative & Direct methods – Pros
and Cons.
Iterative methods do not give
accurate results.
Convergence cannot be predicted
But absolutely no fills.
Parallel Jacobi, Gauss-Seidel, SOR
For problems with grid structure (1-
D, 2-D etc.), Jacobi is easily
parallelizable
Gauss-Seidel and SOR need recent
values. Hence ordering of updates
and sequencing among processors
But Gauss-Seidel and SOR can be
parallelized using red-black ordering
or checker board
2D Grid example
13 14 15 16
9 10 11 12
5 6 7 8
1 2 3 4
Red-Black Ordering
Color alternate nodes in each
dimension red and black
Number red nodes first and then
black nodes
Red nodes can be updated
simultaneously followed by
simultaneous black nodes updates
2D Grid example – Red Black
Ordering
15 7 16 8
5 13 6 14
11 3 12 4
1 9 2 10
In general, reordering can
affect convergence
Graph Coloring
In general multi-colored graph coloring
Ordering for parallel computing of Gauss-
Seidel and SOR