CZ4102 – High Performance Computing
Lecture 11:
Matrix Algorithms
- Dr Tay Seng Chuan
Reference:
“Introduction to Parallel Computing” – Chapter 8.
Topic Overview
• Parallel Algorithms for
- Matrix-Vector Multiplication
- Matrix-Matrix Multiplication
• Performance Analysis
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitions for parallel processing.
• The run-time performance of such algorithms depends on
the amount of overhead incurred relative to the
computation workload.
• As a rule of thumb, good speedup can be achieved if the
computation granularity outweighs overheads such as the
communication cost, consolidation cost (an algorithmic
penalty), data packaging, etc.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1 vector x
to yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and n² additions,
so the total workload is Θ(n²).
[Figure: A (n x n) x x (n x 1) = y (n x 1).]
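As a concrete reference point, here is a minimal serial implementation in C (a sketch; the 2 x 2 test values in main are illustrative only):

#include <stdio.h>

/* Serial matrix-vector multiplication y = A x.
 * A is n x n in row-major order; x and y have n elements.
 * The nested loops perform n^2 multiply-add pairs, which is
 * the Theta(n^2) serial workload quoted above. */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i * n + j] * x[j];
    }
}

int main(void) {
    double A[4] = {1, 2, 3, 4};           /* 2 x 2 example matrix */
    double x[2] = {1, 1}, y[2];
    matvec(2, A, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]); /* prints y = [3, 7] */
    return 0;
}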
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix A is partitioned among n processors,
with each processor storing a complete row of the
matrix.
• The n x 1 vector x is distributed such that each process
owns one of its elements.
[Figure: rowwise 1-D partitioning of A, x, and y; one row and one vector element per process.]
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p = n)
• Since each process starts with only one element of
vector x, an all-to-all broadcast is required to
distribute all the elements x[j] to all the processes.
• Process Pi now computes
y[i] = Σj A[i, j] x[j], summing over j = 0, 1, ..., n − 1.
• The all-to-all broadcast and the computation of y[i]
both take time Θ(n). Therefore, the parallel time
is Θ(n).
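A quick cost check (added for completeness): the process-time product is p x TP = n x Θ(n) = Θ(n²), which matches the Θ(n²) serial workload, so the 1-D formulation with p = n is cost-optimal.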
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning (p < n)
• Consider now the case when p < n and we use block 1-D
partitioning.
• Each process initially stores n/p complete rows of the
matrix, and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes
and involves messages of size n/p.
• This is followed by n/p local dot products, each of length n
(computation time = n/p x n = n²/p).
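A minimal MPI sketch of this rowwise algorithm (the function name and the assumption that p divides n are ours, not from the slides):

#include <mpi.h>
#include <stdlib.h>

/* Rowwise 1-D matrix-vector multiplication for p < n.
 * Each process holds n/p rows of A (local_A, row-major) and
 * n/p elements of x (local_x); it returns its n/p elements
 * of y in local_y.  Assumes p divides n. */
void rowwise_matvec(int n, const double *local_A, const double *local_x,
                    double *local_y, MPI_Comm comm) {
    int p;
    MPI_Comm_size(comm, &p);
    int nlocal = n / p;

    /* All-to-all broadcast of the vector: afterwards every
     * process owns all n elements of x.
     * Cost: ts log p + tw (n/p)(p - 1). */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(local_x, nlocal, MPI_DOUBLE,
                  x, nlocal, MPI_DOUBLE, comm);

    /* n/p local dot products of length n: n^2/p multiply-adds. */
    for (int i = 0; i < nlocal; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
    free(x);
}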
Recap: All-to-All Broadcast
Based on the Doubling-Up Approach
• For p processes and messages of size m, the all-to-all
broadcast takes T = ts log p + tw x m x (p − 1).
• Now m = n/p, we have
T = ts log p + tw x (n/p) x (p − 1) ~ ts log p + tw x n.
[Figure: all-to-all broadcast of vector elements.]
• Thus, the parallel run time of matrix-vector
multiplication based on rowwise 1-D
partitioning (p < n) is
TP = n²/p + ts log p + tw x n.
• This is cost-optimal.
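Multiplying the run time by p makes the claim explicit (a quick check): p x TP = n² + ts p log p + tw n p. The n² term equals the serial workload, and the overhead terms remain O(n²) as long as p = O(n), so the rowwise formulation with p < n is cost-optimal.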
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• The n x n matrix is partitioned among n² processors such
that each processor owns a single element.
• The n x 1 vector x is distributed only in the last column of
n processors.
[Figure: 2-D partitioning of A among p = n² processors; x resides in the last processor column.]
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• We must first align the vector
with the matrix appropriately.
• The first communication step
for the 2-D partitioning aligns
the vector x along the
principal diagonal of the
matrix.
• The second step copies the
vector elements from each
diagonal process to all the
processes in the
corresponding column using
n simultaneous broadcasts
among all processors in the
column.
• Finally, the result vector is
computed by performing an
all-to-one reduction along the
columns.
Matrix-Vector Multiplication:
2-D Partitioning (p = n²)
• Three basic communication operations are used in this algorithm: one-to-one
communication to align the vector along the main diagonal, one-to-all broadcast
of each vector element among the n processes of each column, and all-to-one
reduction in each row.
• These communications take Θ(log n) time. The computation time is Θ(1). The
parallel time of this algorithm is therefore Θ(log n) + Θ(1) = Θ(log n).
• The cost (process-time product) is n² x Θ(log n) = Θ(n² log n) > Θ(n²); hence, the
algorithm is not cost-optimal.
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• When using fewer than n² processors, each process owns an
(n/√p) x (n/√p) block of the matrix; p (i.e., √p x √p) processors are used.
• The vector is distributed in portions of n/√p elements in the last process-
column only.
• In this case, the message sizes for the alignment, broadcast, and reduction
are all n/√p.
• The computation is a product of an (n/√p) x (n/√p) submatrix with a
vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning (p < n²)
• The first alignment step takes time ts + tw n/√p.
• The broadcast and reduction each take time (ts + tw n/√p) log √p.
• The local matrix-vector products take time n²/p.
• The total time is TP ≈ n²/p + ts log p + tw (n/√p) log p.
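Placing the two run times side by side (a comparison added here, not on the slide) shows the payoff of 2-D partitioning:

1-D (p < n):   TP = n²/p + ts log p + tw n
2-D (p < n²):  TP = n²/p + ts log p + tw (n/√p) log p

The per-word communication term falls from tw n to tw (n/√p) log p, and the 2-D version can also employ up to n² processors instead of at most n.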
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense,
square matrices A and B to yield the product matrix
C = A x B.
• The serial complexity is O(n³).
[Figure: A (n x n) x B (n x n) = C (n x n).]
Matrix-Matrix Multiplication
• A useful concept in this case is called block operations.
In this view, an n x n matrix A can be regarded as a q x q
array of blocks Ai,j (0 ≤ i, j < q) such that each block is an
(n/q) x (n/q) submatrix.
• Computing C = A x B in this view performs q³ block
multiplications, each involving (n/q) x (n/q) matrices:
each of the q² result blocks Ci,j requires q block multiplications.
[Figure: an n x n matrix viewed as a q x q array of (n/q) x (n/q) blocks.]
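A serial C sketch of this block view (the routine name block_matmul is ours; assumes q divides n), making the q³ block multiplications explicit:

/* Blocked view of C = A x B for n x n row-major matrices.
 * b = n/q is the block size.  The three outer loops run over
 * block indices: q^3 block multiplications in total, each on
 * (n/q) x (n/q) submatrices. */
void block_matmul(int n, int q, const double *A, const double *B, double *C) {
    int b = n / q;                       /* assumes q divides n */
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int bi = 0; bi < q; bi++)
        for (int bj = 0; bj < q; bj++)
            for (int bk = 0; bk < q; bk++)
                /* C(bi,bj) += A(bi,bk) x B(bk,bj) on b x b blocks */
                for (int i = bi * b; i < (bi + 1) * b; i++)
                    for (int j = bj * b; j < (bj + 1) * b; j++)
                        for (int k = bk * b; k < (bk + 1) * b; k++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}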
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p
blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p)
each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block
Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k
and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast the blocks of A along rows and the
blocks of B along columns.
• Perform the local submatrix multiplications.
Matrix-Matrix Multiplication
• The two all-to-all broadcasts take time 2(ts log √p + tw (n²/p)(√p − 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p)
sized submatrices, taking √p x (n/√p)³ = n³/p time.
• The parallel run time is approximately TP = n³/p + ts log p + 2 tw n²/√p.
• The major drawback of this algorithm is that it is not memory-optimal.
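Why it is not memory-optimal (a one-line check): after the two all-to-all broadcasts, every process stores √p blocks of A and √p blocks of B, i.e. Θ(n²/√p) words each, so all p processes together hold Θ(n² √p) words, a factor of √p more than the Θ(n²) required to store the input matrices.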
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we
schedule the computations
of the processes of the
ith row such that, at any
given time, each process is
using a different block Ai,k.
• These blocks can be
systematically rotated
among the processes after
every submatrix
multiplication so that every
process gets a fresh Ai,k
after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a
way that each process multiplies its
local submatrices. This is done by
shifting all submatrices Ai,j to the left
(with wraparound) by i steps and all
submatrices Bi,j up (with
wraparound) by j steps.
• Perform local block multiplication.
• Each block of A moves one step left
and each block of B moves one step
up (again with wraparound).
• Perform next block multiplication,
add to partial result, repeat until all
blocks have been multiplied.
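The steps above map naturally onto MPI's Cartesian topology routines. The sketch below is illustrative (names are ours; it assumes p is a perfect square, √p divides n, and the caller passes a zero-initialized c):

#include <mpi.h>

/* Cannon's algorithm on a sqrt(p) x sqrt(p) process grid.
 * Each process holds one b x b block of A and B (b = n/sqrt(p))
 * in a[] and bb[], and accumulates its block of C in c[]. */
void cannon(int b, double *a, double *bb, double *c, MPI_Comm comm) {
    int p, rank, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;
    MPI_Comm_size(comm, &p);
    MPI_Dims_create(p, 2, dims);              /* {sqrt(p), sqrt(p)} */
    MPI_Cart_create(comm, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    int nb = b * b, left, right, up, down;

    /* Alignment: shift row i of A left by i, column j of B up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &right, &left);
    MPI_Sendrecv_replace(a, nb, MPI_DOUBLE, left, 0, right, 0,
                         grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &down, &up);
    MPI_Sendrecv_replace(bb, nb, MPI_DOUBLE, up, 0, down, 0,
                         grid, MPI_STATUS_IGNORE);

    /* sqrt(p) compute-and-shift steps. */
    MPI_Cart_shift(grid, 1, -1, &right, &left);
    MPI_Cart_shift(grid, 0, -1, &down, &up);
    for (int step = 0; step < dims[0]; step++) {
        for (int i = 0; i < b; i++)           /* local block multiply */
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    c[i * b + j] += a[i * b + k] * bb[k * b + j];
        MPI_Sendrecv_replace(a, nb, MPI_DOUBLE, left, 0, right, 0,
                             grid, MPI_STATUS_IGNORE);  /* A one step left */
        MPI_Sendrecv_replace(bb, nb, MPI_DOUBLE, up, 0, down, 0,
                             grid, MPI_STATUS_IGNORE);  /* B one step up  */
    }
    MPI_Comm_free(&grid);
}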
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the
maximum distance over which a
block shifts is √p − 1, the two shift
operations require a total of
2(ts + tw n²/p) time.
• Each of the √p single-step shifts in
the compute-and-shift phase of the
algorithm takes ts + tw n²/p time.
• The computation time for multiplying
√p matrices of size (n/√p) x (n/√p)
is √p x (n/√p)³ = n³/p.
• The parallel time is approximately
TP = n³/p + 2√p ts + 2 tw n²/√p.
Matrix-Matrix Multiplication:
DNS (Dekel, Nassimi, and Sahni) Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication
algorithm as a cube. Matrices A
and B come in two orthogonal
faces and result C comes out the
other orthogonal face.
• Each internal node in the cube
represents a single add-multiply
operation (hence the Θ(n³) complexity).
• The DNS algorithm partitions this
cube using a 3-D block scheme.
[Figure: the n x n x n computation cube with axes i, j, and k.]
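Serially, the cube is just a reordering of the ordinary triple loop; the sketch below (ours) makes the one multiply-add per internal node (i, j, k) explicit:

/* The DNS cube view of C = A x B: node (i, j, k) performs the
 * single operation C[i][j] += A[i][k] * B[k][j].  Serially these
 * are the n^3 scalar multiply-adds of ordinary matrix
 * multiplication; DNS gives each (i, j, k) to its own process. */
void dns_cube_view(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n * n; i++) C[i] = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}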
Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of
processors.
• Move the columns of A and the rows of B
into position and perform broadcasts.
• Each processor Pi,j,k computes a single
multiply: A[i, k] x B[k, j], a partial result for C[i, j].
• This is followed by an accumulation
along the k dimension.
• Since each add-multiply takes constant
time and the accumulation and broadcasts
take log n time, the total runtime is Θ(log n).
• This is not cost-optimal. It can be made
cost-optimal by using n / log n
processors along the direction of
accumulation.
[Figure: the planes k = 0, 1, 2, 3 of a 4 x 4 x 4 process mesh.]
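Worked check of the cost: with p = n³ processes, p x TP = n³ x Θ(log n) = Θ(n³ log n), which exceeds the Θ(n³) serial work. Using n/log n processors along the accumulation direction gives p = n² x (n/log n) = n³/log n, and then p x TP = (n³/log n) x Θ(log n) = Θ(n³), which is cost-optimal.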
Matrix-Matrix Multiplication:
DNS Algorithm
The vertical column of processes Pi,j,*
computes the dot product of row A[i, *]
and column B[*, j]. Therefore, rows of A
and columns of B need to be moved
appropriately so that each vertical
column of processes Pi,j,* has row A[i, *]
and column B[*, j]. More precisely,
process Pi,j,k should have A[i, k] and
B[k, j].
First, each column of A moves to a
different plane such that the j-th column
occupies the same position in the plane
corresponding to k = j as it initially did in
the plane corresponding to k = 0.
[Figure: a 4 x 4 plane of processes labelled (i, j); columns of A move to their k = j planes.]
Matrix-Matrix Multiplication:
DNS Algorithm
Now all the columns of A are
replicated n times in their
respective planes by a parallel
one-to-all broadcast along the
j axis.
Pi,0,j, Pi,1,j, ..., Pi,n-1,j receive a
copy of A[i, j] from Pi,j,j. At this
point, each vertical column of
processes Pi,j,* has row A[i, *].
More precisely, process Pi,j,k
has A[i, k].
Matrix-Matrix Multiplication:
DNS Algorithm
For matrix B, the communication steps
are similar, but the roles of i and j in
process subscripts are switched.
In the first one-to-one communication
step, B[i, j] is moved from Pi,j,0 to Pi,j,i.
[Figure: rows of B moving to their respective k = i planes.]
Matrix-Matrix Multiplication:
DNS Algorithm
Then it is broadcast from Pi,j,i among
P0,j,i, P1,j,i, ..., Pn-1,j,i.
At this point, each vertical column of
processes Pi,j,* has column B[*, j].
Now process Pi,j,k has B[k, j],
in addition to A[i, k].
[Figure: columns of B replicated within their planes by one-to-all broadcasts along the i axis.]
Matrix-Matrix Multiplication: DNS Algorithm
After these communication steps, A[i, k] and B[k, j] are multiplied at Pi,j,k. Now each element C[i, j]
of the product matrix is obtained by an all-to-one reduction along the k axis. During this step,
process Pi,j,0 accumulates the results of the multiplication from processes Pi,j,1, ..., Pi,j,n-1.
The DNS algorithm has three main communication steps: (1) moving the columns of A and the rows
of B to their respective planes, (2) performing one-to-all broadcast along the j axis for A, and along
the i axis for B, and (3) all-to-one reduction along the k axis. All these operations are performed
within groups of n processes and take time O(log n). Thus, the parallel run time for
multiplying two n x n matrices using the DNS algorithm on n³ processes is O(log n).
Matrix-Matrix Multiplication:
DNS Algorithm (using fewer than n³ processors)
• Assume that the number of processes p is
equal to q³ for some q < n.
• The two matrices are partitioned into blocks
of size (n/q) x (n/q).
• Each matrix can thus be regarded as a q x q
two-dimensional square array of blocks.
• The algorithm follows from the previous one,
except that, in this case, we operate on blocks
rather than on individual elements.
Matrix-Matrix Multiplication:
DNS Algorithm (using fewer than n³ processors)
• The first one-to-one communication step is performed for
both A and B, and takes ts + tw (n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw (n/q)²) log q
time for each matrix.
• The reduction takes (ts + tw (n/q)²) log q time.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by:
TP ≈ n³/p + ts log p + tw (n²/p^(2/3)) log p.