CZ4102 – High Performance Computing
Lecture 11:
Matrix Algorithms
- Dr Tay Seng Chuan
Reference:
“Introduction to Parallel Computing” – Chapter 8.
Topic Overview
• Parallel Algorithms for
- Matrix-Vector Multiplication
- Matrix-Matrix Multiplication
• Performance Analysis
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitions for parallel processing.
• The run-time performance of such algorithms depends on
the amount of overhead incurred relative to the
computation workload.
• As a rule of thumb, good speedup can be achieved if the
computation granularity outweighs overheads such as the
communication cost, consolidation cost (an algorithmic
penalty), data packaging, etc.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1 vector x
to yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and n² additions,
so the total workload is Θ(n²).
[Figure: A (n x n) x x (n x 1) = y (n x 1).]
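As a concrete reference point, here is a minimal serial implementation in C (a sketch; the 2 x 2 test values in main are illustrative only):

#include <stdio.h>

/* Serial matrix-vector multiplication y = A x.
 * A is n x n in row-major order; x and y have n elements.
 * The nested loops perform n^2 multiply-add pairs, which is
 * the Theta(n^2) serial workload quoted above. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}

int main(void) {
    double A[4] = {1, 2, 3, 4};           /* 2 x 2 example matrix */
    double x[2] = {1, 1}, y[2];
    matvec(2, A, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]); /* prints y = [3, 7] */
    return 0;
}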
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix A is partitioned among n processors,
with each processor storing a complete row of the
matrix.
• The n x 1 vector x is distributed such that each process
owns one of its elements.
[Figure: rowwise 1-D partitioning of A, x, and y; one row and one vector element per process.]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p = n)
• Since each process starts with only one element of
vector x, an all-to-all broadcast is required to
distribute all the elements x[j] to all the processes.
• Process Pi now computes
y[i] = Σj A[i, j] x[j], summing over j = 0, 1, ..., n − 1.
• The all-to-all broadcast and the computation of y[i]
both take time Θ(n). Therefore, the parallel time
is Θ(n).
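A quick cost check (added for completeness): the process-time product is p x TP = n x Θ(n) = Θ(n²), which matches the Θ(n²) serial workload, so the 1-D formulation with p = n is cost-optimal.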
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p < n)
• Consider now the case when p < n and we use block 1-D
partitioning.
• Each process initially stores n/p complete rows of the
matrix, and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes
and involves messages of size n/p.
• This is followed by n/p local dot products, each of length n
(computation time = n/p x n = n²/p).
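A minimal MPI sketch of this rowwise algorithm (the function name and the assumption that p divides n are ours, not from the slides):

#include <mpi.h>
#include <stdlib.h>

/* Rowwise 1-D matrix-vector multiplication for p < n.
 * Each process holds n/p rows of A (local_A, row-major) and
 * n/p elements of x (local_x); it returns its n/p elements
 * of y in local_y.  Assumes p divides n. */
void rowwise_matvec(int n, const double *local_A, const double *local_x,
                    double *local_y, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* All-to-all broadcast of the vector: afterwards every
     * process owns all n elements of x.
     * Cost: ts log p + tw (n/p)(p - 1). */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(local_x, nlocal, MPI_DOUBLE,
                  x, nlocal, MPI_DOUBLE, comm);

    /* n/p local dot products of length n: n^2/p multiply-adds. */
    for (int i = 0; i < nlocal; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
    free(x);
}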
Recap: All-to-All Broadcast
Based on the Doubling-Up Approach
• For p processes and messages of size m, the all-to-all
broadcast takes T = ts log p + tw x m x (p − 1).
• Now m = n/p, we have
T = ts log p + tw x (n/p) x (p − 1) ~ ts log p + tw x n.
[Figure: all-to-all broadcast of vector elements.]
• Thus, the parallel run time of matrix-vector
multiplication based on rowwise 1-D
partitioning (p < n) is
TP = n²/p + ts log p + tw x n.
• This is cost-optimal.
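Multiplying the run time by p makes the claim explicit (a quick check): p x TP = n² + ts p log p + tw n p. The n² term equals the serial workload, and the overhead terms remain O(n²) as long as p = O(n), so the rowwise formulation with p < n is cost-optimal.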
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• The n x n matrix is partitioned among n² processors such
that each processor owns a single element.
• The n x 1 vector x is distributed only in the last column of
n processors.
[Figure: 2-D partitioning of A among p = n² processors; x resides in the last processor column.]
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• We must first align the vector
with the matrix appropriately.
• The first communication step
for the 2-D partitioning aligns
the vector x along the
principal diagonal of the
matrix.
• The second step copies the
vector elements from each
diagonal process to all the
processes in the
corresponding column using
n simultaneous broadcasts
among all processors in the
column.
• Finally, the result vector is
computed by performing an
all-to-one reduction along the
columns.
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• Three basic communication operations are used in this algorithm: one-to-one
communication to align the vector along the main diagonal, one-to-all broadcast
of each vector element among the n processes of each column, and all-to-one
reduction in each row.
• These communications take Θ(log n) time. The computation time is Θ(1). The
parallel time of this algorithm is therefore Θ(log n) + Θ(1) = Θ(log n).
• The cost (process-time product) is n² x Θ(log n) = Θ(n² log n) > Θ(n²); hence, the
algorithm is not cost-optimal.
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• When using fewer than n² processors, each process owns an
(n/√p) x (n/√p) block of the matrix; p (i.e., √p x √p) processors are used.
• The vector is distributed in portions of n/√p elements in the last process-
column only.
• In this case, the message sizes for the alignment, broadcast, and reduction
are all n/√p.
• The computation is a product of an (n/√p) x (n/√p) submatrix with a
vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• The first alignment step takes time ts + tw n/√p.
• The broadcast and reduction each take time (ts + tw n/√p) log √p.
• The local matrix-vector products take time n²/p.
• The total time is TP ≈ n²/p + ts log p + tw (n/√p) log p.
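Placing the two run times side by side (a comparison added here, not on the slide) shows the payoff of 2-D partitioning:

1-D (p < n):   TP = n²/p + ts log p + tw n
2-D (p < n²):  TP = n²/p + ts log p + tw (n/√p) log p

The per-word communication term falls from tw n to tw (n/√p) log p, and the 2-D version can also employ up to n² processors instead of at most n.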
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense,
square matrices A and B to yield the product matrix
C = A x B.
• The serial complexity is O(n³).
[Figure: A (n x n) x B (n x n) = C (n x n).]
Matrix-Matrix Multiplication
• A useful concept in this case is called block operations.
In this view, an n x n matrix A can be regarded as a q x q
array of blocks Ai,j (0 ≤ i, j < q) such that each block is an
(n/q) x (n/q) submatrix.
• Computing C = A x B in this view performs q³ block
multiplications, each involving (n/q) x (n/q) matrices:
each of the q² result blocks Ci,j requires q block multiplications.
[Figure: an n x n matrix viewed as a q x q array of (n/q) x (n/q) blocks.]
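A serial C sketch of this block view (the routine name block_matmul is ours; assumes q divides n), making the q³ block multiplications explicit:

/* Blocked view of C = A x B for n x n row-major matrices.
 * b = n/q is the block size.  The three outer loops run over
 * block indices: q^3 block multiplications in total, each on
 * (n/q) x (n/q) submatrices. */
void block_matmul(int n, int q, const double *A, const double *B, double *C) {
    int b = n / q;                       /* assumes q divides n */
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C(bi,bj) += A(bi,bk) x B(bk,bj) on b x b blocks */
                for (int i = bi * b; i < (bi + 1) * b; i++)
                    for (int j = bj * b; j < (bj + 1) * b; j++)
                        for (int k = bk * b; k < (bk + 1) * b; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}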
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p
blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p)
each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block
Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k
and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast the blocks of A along rows and the
blocks of B along columns.
• Perform the local submatrix multiplications.
Matrix-Matrix Multiplication
• The two all-to-all broadcasts take time 2(ts log √p + tw (n²/p)(√p − 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p)
sized submatrices, taking √p x (n/√p)³ = n³/p time.
• The parallel run time is approximately TP = n³/p + ts log p + 2 tw n²/√p.
• The major drawback of this algorithm is that it is not memory-optimal.
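Why it is not memory-optimal (a one-line check): after the two all-to-all broadcasts, every process stores √p blocks of A and √p blocks of B, i.e. Θ(n²/√p) words each, so all p processes together hold Θ(n² √p) words, a factor of √p more than the Θ(n²) required to store the input matrices.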
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we
schedule the computations
of the processes of the
ith row such that, at any
given time, each process is
using a different block Ai,k.
• These blocks can be
systematically rotated
among the processes after
every submatrix
multiplication so that every
process gets a fresh Ai,k
after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a
way that each process multiplies its
local submatrices. This is done by
shifting all submatrices Ai,j to the left
(with wraparound) by i steps and all
submatrices Bi,j up (with
wraparound) by j steps.
• Perform local block multiplication.
• Each block of A moves one step left
and each block of B moves one step
up (again with wraparound).
• Perform next block multiplication,
add to partial result, repeat until all
blocks have been multiplied.
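The steps above map naturally onto MPI's Cartesian topology routines. The sketch below is illustrative (names are ours; it assumes p is a perfect square, √p divides n, and the caller passes a zero-initialized c):

#include <mpi.h>

/* Cannon's algorithm on a sqrt(p) x sqrt(p) process grid.
 * Each process holds one b x b block of A and B (b = n/sqrt(p))
 * in a[] and bb[], and accumulates its block of C in c[]. */
void cannon(int b, double *a, double *bb, double *c, MPI_Comm comm) {
    int p, rank, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;
    MPI_Comm_size(comm, &p);
    MPI_Dims_create(p, 2, dims);              /* {sqrt(p), sqrt(p)} */
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    int nb = b * b, left, right, up, down;

    /* Alignment: shift row i of A left by i, column j of B up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &right, &left);
    MPI_Sendrecv_replace(a, nb, MPI_DOUBLE, left, 0, right, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &down, &up);
    MPI_Sendrecv_replace(bb, nb, MPI_DOUBLE, up, 0, down, 0,
                         grid, MPI_STATUS_IGNORE);

    /* sqrt(p) compute-and-shift steps. */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    for (int step = 0; step < dims[0]; step++) {
        for (int i = 0; i < b; i++)           /* local block multiply */
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    c[i * b + j] += a[i * b + k] * bb[k * b + j];
        MPI_Sendrecv_replace(a, nb, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);  /* A one step left */
        MPI_Sendrecv_replace(bb, nb, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);  /* B one step up  */
    }
    MPI_Comm_free(&grid);
}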
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the
maximum distance over which a
block shifts is √p − 1, the two shift
operations require a total of
2(ts + tw n²/p) time.
• Each of the √p single-step shifts in
the compute-and-shift phase of the
algorithm takes ts + tw n²/p time.
• The computation time for multiplying
√p matrices of size (n/√p) x (n/√p)
is √p x (n/√p)³ = n³/p.
• The parallel time is approximately
TP = n³/p + 2√p ts + 2 tw n²/√p.
Matrix-Matrix Multiplication:
DNS (Dekel, Nassimi, and Sahni) Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication
algorithm as a cube. Matrices A
and B come in two orthogonal
faces and result C comes out the
other orthogonal face.
• Each internal node in the cube
represents a single add-multiply
operation (hence the Θ(n³) complexity).
• The DNS algorithm partitions this
cube using a 3-D block scheme.
[Figure: the n x n x n computation cube with axes i, j, and k.]
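Serially, the cube is just a reordering of the ordinary triple loop; the sketch below (ours) makes the one multiply-add per internal node (i, j, k) explicit:

/* The DNS cube view of C = A x B: node (i, j, k) performs the
 * single operation C[i][j] += A[i][k] * B[k][j].  Serially these
 * are the n^3 scalar multiply-adds of ordinary matrix
 * multiplication; DNS gives each (i, j, k) to its own process. */
void dns_cube_view(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}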
Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of
processors.
• Move the columns of A and the rows of B
into position and perform broadcasts.
• Each processor Pi,j,k computes a single
multiply: A[i, k] x B[k, j], a partial result for C[i, j].
• This is followed by an accumulation
along the k dimension.
• Since each add-multiply takes constant
time and the accumulation and broadcasts
take log n time, the total runtime is Θ(log n).
• This is not cost-optimal. It can be made
cost-optimal by using n / log n
processors along the direction of
accumulation.
[Figure: the planes k = 0, 1, 2, 3 of a 4 x 4 x 4 process mesh.]
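Worked check of the cost: with p = n³ processes, p x TP = n³ x Θ(log n) = Θ(n³ log n), which exceeds the Θ(n³) serial work. Using n/log n processors along the accumulation direction gives p = n² x (n/log n) = n³/log n, and then p x TP = (n³/log n) x Θ(log n) = Θ(n³), which is cost-optimal.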
Matrix-Matrix Multiplication:
DNS Algorithm
The vertical column of processes Pi,j,*
computes the dot product of row A[i, *]
and column B[*, j]. Therefore, rows of A
and columns of B need to be moved
appropriately so that each vertical
column of processes Pi,j,* has row A[i, *]
and column B[*, j]. More precisely,
process Pi,j,k should have A[i, k] and
B[k, j].
First, each column of A moves to a
different plane such that the j-th column
occupies the same position in the plane
corresponding to k = j as it initially did in
the plane corresponding to k = 0.
[Figure: a 4 x 4 plane of processes labelled (i, j); columns of A move to their k = j planes.]
Matrix-Matrix Multiplication:
DNS Algorithm
Now all the columns of A are
replicated n times in their
respective planes by a parallel
one-to-all broadcast along the
j axis.
Pi,0,j, Pi,1,j, ..., Pi,n-1,j receive a
copy of A[i, j] from Pi,j,j. At this
point, each vertical column of
processes Pi,j,* has row A[i, *].
More precisely, process Pi,j,k
has A[i, k].
Matrix-Matrix Multiplication:
DNS Algorithm
For matrix B, the communication steps
are similar, but the roles of i and j in
process subscripts are switched.
In the first one-to-one communication
step, B[i, j] is moved from Pi,j,0 to Pi,j,i.
[Figure: rows of B moving to their respective k = i planes.]
Matrix-Matrix Multiplication:
DNS Algorithm
Then it is broadcast from Pi,j,i among
P0,j,i, P1,j,i, ..., Pn-1,j,i.
At this point, each vertical column of
processes Pi,j,* has column B[*, j].
Now process Pi,j,k has B[k, j],
in addition to A[i, k].
[Figure: columns of B replicated within their planes by one-to-all broadcasts along the i axis.]
Matrix-Matrix Multiplication: DNS Algorithm
After these communication steps, A[i, k] and B[k, j] are multiplied at Pi,j,k. Now each element C[i, j]
of the product matrix is obtained by an all-to-one reduction along the k axis. During this step,
process Pi,j,0 accumulates the results of the multiplication from processes Pi,j,1, ..., Pi,j,n-1.
The DNS algorithm has three main communication steps: (1) moving the columns of A and the rows
of B to their respective planes, (2) performing one-to-all broadcast along the j axis for A, and along
the i axis for B, and (3) all-to-one reduction along the k axis. All these operations are performed
within groups of n processes and take time O(log n). Thus, the parallel run time for
multiplying two n x n matrices using the DNS algorithm on n³ processes is O(log n).
Matrix-Matrix Multiplication:
DNS Algorithm (using fewer than n³ processors)
• Assume that the number of processes p is
equal to q³ for some q < n.
• The two matrices are partitioned into blocks
of size (n/q) x (n/q).
• Each matrix can thus be regarded as a q x q
two-dimensional square array of blocks.
• The algorithm follows from the previous one,
except that, in this case, we operate on blocks
rather than on individual elements.
Matrix-Matrix Multiplication:
DNS Algorithm (using fewer than n³ processors)
• The first one-to-one communication step is performed for
both A and B, and takes ts + tw (n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw (n/q)²) log q
time for each matrix.
• The reduction takes (ts + tw (n/q)²) log q time.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by:
TP ≈ n³/p + ts log p + tw (n²/p^(2/3)) log p.