KEMBAR78
Parallel Computing with MPI | PDF | Parallel Computing | Message Passing Interface
0% found this document useful (0 votes)
44 views26 pages

Parallel Computing with MPI

This document summarizes a report on distributed computing. It discusses parallel computing concepts like speedup, efficiency, and scalability. It then describes an algorithm for parallel matrix-vector multiplication using row-block decomposition across multiple processors. Each processor is assigned a block of rows and computes its portion of the matrix-vector product independently. The document analyzes the complexity of the algorithm and communication costs.

Uploaded by

LopUi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
44 views26 pages

Parallel Computing with MPI

This document summarizes a report on distributed computing. It discusses parallel computing concepts like speedup, efficiency, and scalability. It then describes an algorithm for parallel matrix-vector multiplication using row-block decomposition across multiple processors. Each processor is assigned a block of rows and computes its portion of the matrix-vector product independently. The document analyzes the complexity of the algorithm and communication costs.

Uploaded by

LopUi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 26

Report on

Distributed Computing
Vũ Văn Thiê.u
Master student, Master of Science Program

1 Introduction
In the early 1970s, computers began to appear that consisted of a number of separate
processors operating in parallel. This type of computer is called a parallel computer.
The basic idea of parallel computer is that a number of processors work in cooperation
on a single task. The motivation is that if it takes one processor an amount of time
t to do a task, then p processors can do the task in time t/p. Only for very special
situations this perfect “speedup” can be achieved, however, and it is our goal to de-
vise algorithms that can take as much advantage as possible, for a given problem, of
multiple processors.
One of the most popular ways to implement a parallel program in such the computer
is to use MPI (Message Passing Interface) which is written in C or Fortran (see [5]).
In this report, we deal with the parallel algorithms, which associate to the particular
problems, using MPI specification .

2 Basic concepts of parallel computing


Speedup Ideally, we could solve a problem p times as fast on p processors as on a
single processor. This ideal is rarely achieved; what is achieved is called the relative
speed-up of the parallel algorithm defined by
execution time of parallel algorithm on a single processor
Sp = . (1)
execution time using p processors

Efficiency Closely related to speedup is relative efficiency defined by


Sp
Ep = . (2)
p
Speedup and efficiency are important internal performance metrics, in the sense
that they express how well an algorithm is parallelized and how well it scales to a
large amount of processors. Since Sp ≤ p, we have Ep ≤ 1 and an efficiency of Ep = 1
corresponds to a perfect speedup of Sp = p (see e.g. [1]).

1
Scalability A parallel program is said to be scalable if it maintains a high efficiency
while the number of processors is increasing.

3 The Parallel Matrix-Vector Product


3.1 The problem
We consider the matrix-vector product
y = A x,
where A is a n × n matrix, x and y are vectors of length n.
Let Aij (i = 1, · · · , n, j = 1, · · · , n) and xi (i = 1, · · · , n) denote the components
of matrix A and vector x, respectively. Then the resulting components of vector y are:
n
X
yi = Aij xj , i = 1, · · · , n.
j=1

3.2 The algorithm


There exist many ways to parallelize the above matrix-vector product. The most
simple way is by row-block decomposition. In other words, we decompose Pp n rows of
matrix A into p blocks. Each block consists of nk rows, so that n = k=1 nk . Each
of p processors is responsible for solving a subproblem on one block, i.e.,
n
X
zi = Aij xj , i = 1, · · · , nk
j=1

or
z = B x,
provided that the complete argument x has been copied into memory of this processor.
Here B is a nk × n matrix, z is a vector of length nk and is a part of the vector y.
Using this data decomposition, each processor can completely in parallel due to the
independence of calculating in different blocks.
After loading all the sub-vectors z into vector y, we have now a solution of our
problem. We remark that as the input vector x, the output vector y need to be dis-
tributed over all process because in many applications, for example in the iterative
solvers, this output is considered as an input argument for the next iteration. There-
fore, the full parallel matrix vector product should also contain a routine that first
gathers the argument vector into memory of all processors and then performs the
parallel computation.

3.3 Implementation
The program is written in a way that we can easily change the matrix dimension n and
the number of processors p (via MPI Comm size call). The call MPI Comm rank gives
an identity rank for each processor.

2
As discussed above, in the case that n/p is not an integer, the speed-up is not
optimized and is best if we have the minimum difference in the number of rows nk that
each processor receives. For that purpose, one can device a load-balancing algorithm
that enforces the maximum difference in the number of rows is one, i.e. processors
receive nk = ⌈n/p⌉ or nk = ⌈n/p⌉ − 1 rows, with ⌈ ⌉ is the ceiling function. In the C
code, nk and the global position of the first row (of the current block) are the variables
npoints and offset respectively, where
npoints = n/p + (n%p + p - rank -1)/p;
offset = (npoints > n/p) ? (npoints*rank) : (n-(p-rank)*npoints);
The global position of a row is found straightforwardly.
Before having an actual matrix-vector product, let us first generate the matrix and
vector elements locally: initialize in each processor the nk × n matrix submatrix (ma-
trix B) and subvector - a vector of length nk ; then by gather operation (MPI Allgather
call) we collect all distributed elements of subvector, via newvector, into vector -
a vector of length n (vector x), which will be contributed in all processors (using
MPI Bcast call). Now, each processor simply performs a matrix-vector product Bx.
The result vector res (vector z) is sent back to the root where the solution of our
problem is stored.

Remark We venture to give an optimized code, with respect to the memory usage
and the communication time, as follows:
- To collect all distributed elements of subvector into vector, we need not to use
the intermediate vector, i.e. newvector (which is of the length 2n), since we call
MPI Allgatherv instead of MPI Allgather function

MPI_Allgatherv(subvector, npoints, MPI_FLOAT,


vector,count, dsplm, MPI_FLOAT, MPI_COMM_WORLD);

The memory usage is then significantly reduced, even though two new vectors of
length p (which is usually very small compared to 2n), i.e. count and displm,
are used.
- Using MPI Allgatherv call,vector is available in every processor. We therefore
need not to Bcast vector to the root.
- The resulting vector res need not to be stored in every processor

float *res=NULL;
if (rank==0) res = (float *) malloc (n*sizeof(float));

Here we introduce in each processor a local vector subres (of length npoints)
to store the result of the local matrix-vector product.
- To send the resulting vector back to the root, we then use MPI Gatherv to collect
all elements of subres into res

MPI_Gatherv(subres, npoints, MPI_FLOAT,


res,count, dsplm, MPI_FLOAT, 0, MPI_COMM_WORLD);

3
In this way, unlike using a set of MPI Send and MPI Recv, the information is
transferred only once.

Complexity To investigate the performance of the algorithm, we shall have a look


at its complexity. Assuming that all floating point operations require the same time
τcalc . The computation time for calculating each component yi is approximated by
2 n τcalc (n multiplications and n − 1 additions). Thus, using the single process, the
total computation time is about T1 = 2 n2 τcalc .
Using the same argument as above, in the parallel version, the computation time
on processor k is 2 n nk τcalc . The total computation time is maxk=1,··· ,p (2 n nk τcalc ).
This is optimized if n/p = c ∈ Z+ and n1 = n2 = · · · = np = n/p. The computation
time using p processors is then 2 n (n/p) τcalc = 2 n2 /p τcalc . If we could neglect the
communication time, the resulting speed-up and efficiency (see (1) and (2)) would be
p and 1, respectively. However, this is rarely achieved due to all kind of overheads
introduced in the parallel program, such as load imbalance and the necessity to com-
municate between processors. On the other hand, in the case that n is not completely
divided by p (n/p 6= c ∈ Z+ ), the optimized speed-up is also not obtained. For that
reason, assuming that only overheads are due to communication between processors,
and that each processor spends a time Tcomm for communication. Then
Tp = 2n2 /p τcalc + Tcomm
(see section 3.2).
The communication time is mainly due to two resources:
- The Allgather operator transfers nk data elements of current processor to p − 1
remain processors.
- The Gather operator takes nk data elements of current processor to processor 0.
Assume that sending one element of the vector takes τcomm . The communication time
is then Tcomm = nk (p − 1)τcomm + nk (p − 1)τcomm = 2nk (p − 1)τcomm . Therefore,
2n2 n
Tp ≃ τcalc + 2 (p − 1)τcomm , (3)
p p
1
Sp = 1 p−1 τcomm , (4)
p + pn τcalc
1
Ep = p−1 τcomm . (5)
1+ n τcalc

Here τcomm /τcalc is actually a hardware parameter.

3.4 Results
To test the correctness of our program (see the selection of constant TEST in the code),
let A be an identity matrix
½
1, if i = j,
Aij =
0, otherwise.

4
Then y = Ax = x. This is indeed observed from result of our program:
n = 10, p = 1, rank = 0, npoints = 10, offset = 0
The elements of the 0-th submatrix are:
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1
The elements of the vector are:
0 1 2 3 4 5 6 7 8 9
The result vector is:
0 1 2 3 4 5 6 7 8 9
The execution time (using 1 processors) is: 0.001851 seconds

To measure the scalability of the program, before and after the performance, the
routine gettimeofday (or MPI Wtime) is used to calculate the execution time using
p processors. The execution time (seconds), the speedup and the efficiency of the
progam using different number of processors p for different values of n are plotted in
Figure 1. To have an insight of these results, let us look at the formulas (3), (4) and
(5). From (3) we see that Tp is proportional to n, while it is inversely proportional to
p. Therefore, Tp increases when n increases. This is clearly shown in Figure 1a. On
the other hand, Tp decreases if p increases. This is confirmed in Figure 1d.
From (4), we see that the speedup increases as either p or n increases. In other
word, the larger problem size the higher speedup (see Figure 1b), the larger number
of processors the higher speedup (see Figure 1e).
The formula (5) implies the dependence of the efficiency on the ratio of p and n.
Therefore, if we fix n while increase p, the efficiency reduces. Especially for small n,
the efficiency of algorithm using 2 processors is largest as shown in Figure 1c. If we
increase n, while fix p, the efficiency increases. That is why in Figure 1f we observe
that the two curves of n = 2000 and n = 3000 are higher than that of n = 1000. At
the same time, we would like to use large p to have high speedup, but we then face the
reduction of the efficiency when using the such large p. Therefore, it is shown that (see
[3]) each n has a optimal number of processor p∗ . When n increases, p∗ also increases.
We observe that p = 2 seems to be a good choice for n < 2000, p = 4 for n ≃ 3000,
p = 8, 16 might be a good choice for larger n.
From Figures 1b and 1c, we see that for small n, the speedup as well as the efficiency
are small. This is due to the fact that the communication time covers the computation
time. For larger n, the speedup and the efficiency are increasing and reach their
maximal values. These values are high (e.g. S2 ≃ 1.65, S4 ≃ 3.4, E2 ≃ E4 ≃ 0.85),
close to the theoretical values, and maintained as n is increasing. Moreover, when we
increase the number of processors p, the efficiency are maintained highly(≃ 0.79, for
n = 2000). This behavior shows the scalability of our parallel program.

5
Complexity as function of n. Complexity as function of p.

Execution time Execution time


2 4

1.5 3
a) d)
1 2

0.5 1

0 0
0 1000 2000 3000 5 10 15
Speedup Speedup
15 15

10 10
b) e)
5 5

0 0
0 1000 2000 3000 5 10 15
Effeciency Effeciency
1 1

0.8
c) f)
0.5
0.6

0 0.4
0 1000 2000 3000 5 10 15
Horizontal axis: n. Horizontal axis: p.
Solid line: p = 16; Dashed line: p = 8; Solid line: n = 3000; Dashed line: n = 2000;
Dash-dotted line: p = 4; Dotted line: p = 2; Dash-dotted line: n = 1000.;

Figure 1: Complexity as function of n and p.

6
3.5 Discussion
Beside the above row-block decomposition algorithm, we venture to give a parallel
algorithm, say the column decomposition algorithm, as follows:
Let us make a partition of the matrix A as p blocks of sub-matrices. Each block
consists of nk columns, i.e., A = [B1 , B2 , · · · , Bp ], where Bk is a n × nk matrix.
We divide the vector x into its sub-vectors, i.e., x = [x1 , x2 , · · · , xp ]′ , where xk is a
vector of length nk . Then in the parallel algorithm, the processor k simply performs
a sub-problem, a matrix-vector product Bk xk = z k . Thus, similar to the previous
row block decomposition, in each processor, we can generate locally the matrix Bk
and the vector xk . The resulting vector of length n, z k , is calculated locally. In
fact, the vector y which we need to compute, is a summation of p vectors of length
n, i.e. y = z 1 + z 2 + · · · + z p . Again, a parallel algorithm for adding these vectors
arises because the addition in different rows can be done independently. We notice
that, before the second parallel algorithm is done, the vectors z 1 , z 2 , · · · , z p should be
stored in the root as either a matrix or a vector of length np (for example, r(i−1)p+j =
zij , j = 1, 2, · · · , p, i = 1, 2, · · · , n).
Using the similar way of estimating the complexity of the previous algorithm, for
the current algorithm we consider two sequential parts:

- For the first part (local matrix-vector product), the computation time on pro-
cessor k is approximated by 2n nk .

- For the second part (vector-vector addition), we can use the so-called ‘fan-in’
algorithm (we refer to [1] for more detail discussion). In the processor k, we
need to calculate nk elements. Each of them is a summation of p values. The
computation time of such the calculation using ‘fan-in’ algorithm is approximated
by log2 p. Therefore, the total computation time for this second part is nk log2 p.

If the communication time can be omitted, the total execution time for the current
algorithm should be nk (2n + log2 p), which is larger than that of previous section. If p
is small compared to n, this value is comparable to what has been shown the previous
section.
In summary, we have studied the row-block decomposition parallel algorithm used
for calculating the matrix-vector product, in both theoretical and numerical points of
view. As illustrated in the above figure, this is a quite satisfactory as it is popularly
in used. In addition, we also gave the parallel algorithm which is a promising way of
calculating a matrix-vector product.

4 The vibrating string


4.1 The model
The wave equation, which is of the form

∂2ψ 2
2∂ ψ
= c , (6)
∂t2 ∂x2

7
comes from many real and interesting applications, such as the vibrations of a uniform
string, the sound waves in gases or liquids, and optical waves. As an example, we
study an isolated guitar string of length 1, with two fixed endpoints. We refer to [2]
for more details of the construction of the problem.
Briefly, we need to solve equation (6) to find out the vibration amplitude ψ(x, t)
of the string at position x (0 ≤ x ≤ 1) and time t (t ≥ 0). Since the string is fixed on
both endpoints, mathematically we have
ψ(0, t) = ψ(1, t) = 0. (7)
We suppose that the vibration amplitude is initialized as
ψ(x, 0) = f (x), (8)
where f (x) can be any smooth function that satisfies the boundary conditions (7). For
that purpose, we chose f (x) = sin(2kπx), k ∈ Z+ .
Analytically, using Fourier transformation (see [2]), our solution is of the form

X
ψ(x, t) = Bm sin(mπx) cos(mπct), (9)
m=1

where the coefficient Bm is determined by the initial condition as


Z 1
Bm = 2 f (x) sin(mπx)dx. (10)
0

For our particular initial value f (x) = sin(2kπx) we have Bm = 0, m 6= 2k and B2k = 1.
Therefore, the final solution is ψ(x, t) = sin(2kπx) cos(2kπct). Thus,
||ψ(x, t)|| ≤ || sin(2kπx)|| = ||f (x)|| = ||ψ(x, 0)||. (11)

4.2 Numerical approach


In order to find the numerical solution of our problem (6), we use an uniform dis-
cretization for the spatial domain as well as the temporal domain and approximate
the differential operators by the so-called finite difference expressions (see [2]). That
is, the spatial domain is discretized as xi = i∆x, i = 0, 1, · · · , N, and ∆x = 1/N ,
where N is number of grid points. The temporal domain is decomposed as tj = j∆t,
where ∆t is a predetermined time step. The spatial and temporal second derivatives
(at current position xi and time tj ) are respectively approximated by
∂ 2 ψ(xi , tj ) ψ(xi , tj−1 ) − 2ψ(xi , tj ) + ψ(xi , tj+1 )
≈ ,
∂t2 ∆t2
2
∂ ψ(xi , tj ) ψ(xi−1 , tj ) − 2ψ(xi , tj ) + ψ(xi+1 , tj )
≈ , (12)
∂x2 ∆x2
where ψ(x0 , ·) = ψ(xN , ·) = 0, ψ(·, t0 ) = f (·). By substituting the expression (12) into
(6), one can get the solution at the next point in time ψ(xi , tj+1 ), using the previous
calculated values, as
ψ(xi , tj+1 ) = 2ψ(xi , tj )−ψ(xi , tj−1 )+τ 2 (ψ(xi−1 , tj ) − 2ψ(xi , tj ) + ψ(xi+1 , tj )) , (13)

8
where τ = c∆t
∆x . In other words, to update the solution at a particular point in the
space we need only few local grid points (itself and 4 neighbors) of the current time
level. The scheme is therefore well parallelized.

Remark : Our discretization scheme is stable (i.e., a small local error does not lead
to a large cumulative error) if 0 ≤ τ ≤ 1 (see [2]). The time step is therefore restricted
by ∆t ≤ ∆x/c.

4.3 Parallel algorithm


Since we consider the initial value problem, the spatial decomposition is used. That is,
each processor gets a part of the string and simulates the time-behavior of this specific
part. The algorithm is as follows: We suppose that the processors are organized as
a ring/line topology. Then, each processor has its own left and right neighbors, say
left and right. In the beginning, a part of starting values is given to each processor.
Then, at each time-integration step, the processor exchanges the boundary points (the
most left and the most right points) with the left and right neighbors to performs the
update (13), while the internal points are fully parallelized.

Remark The temporal decomposition is not used here because the data elements in
processor k are only calculated if data elements in processor k − 1 are available (i.e.,
in fact we have no parallelization).

4.4 Implementation
Being the first step, each processor picks up the starting values, such as the environ-
ment (MPI Init) and its rank in the topology (MPI Comm size and MPI Comm rank).
Using the routine parseargs, the root receives all necessary parameters, e.g. number
of grid points N , the time interval, the time step etc. into variable parms and passes
those parameters to all processors. The parseargs function is used in the following
manner
if(parseargs(argc, argv)){
MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
exit(EXIT_FAILURE);
}

Here MPI Abort is used (not MPI Finalized) since the passing-parameters program
turns out to be unsuccessful. There are two ways of passing the parameters, namely
TrfArgs1 and TrfArg2. The former bcasts the parameters in a package (using the rou-
tine MPI Type struct). The latter bcasts one parameter by one parameter. Therefore,
TrfArg2 should be the cleaner and faster method because the package is considered
as a single datatype and the communication time is smaller. For example, for a set of
parameters, it turns out that the execution time using TrfArgs1 is 0.08 seconds, while
it is only 0.04 seconds for the program using TrfArgs2.
Now, to create a ring/line topology ring, we use the routine MPI Cart create. The
coordinate and the rank ID of the current process, which are respectively denoted by

9
coords and rank, are found by using the routines MPI Cart get and MPI Cart rank.
Then, the left (of coordinate coords-1) and the right(of coordinate coords+1) pro-
cessors, whose rank IDs are left and right, can be easily obtained
MPI_Cart_shift (ring, 0, 1, &left, &right);

Then, similar to the data decomposition mentioned in Section 3, each processor


gets a proportional amount of data points npoints, where the position of the first
data point in the global set is denoted by offset. Since only a part of global string
needs to be updated in one processor, we need to store only the parts of solution at
previous time levels ψ(· , tj−1 ) and ψ(· , tj ), as well as at next time level ψ(· , tj+1 ).
These vectors (of the length npoints) are denoted by oldval, curval and newval
respectively. At the beginning, oldval and curval are both initialized as f (x), which
is a SIN function. For the test purpose, we also implement a PLUCKED function as an
initial value (see Initialize).
Now, we are able to have the computation processed. The program requires niters
iterations to reach the end-time point (i.e. parms.stime). In a single iteration (see
routine compute), three steps need to be done: exchanging the boundary points; up-
dating the solution vector; and preparing for the next iteration.
- To exchange the boundary points, we denote the most left point in vector
curval (which is curval[0]) by l neighbor and the most right point (which
is curval[npoints-1]) by r neighbor. Then, using MPI Sendrecv, we send
r neighbor to process on right while receiving into l point from process on
left. On the other hand, we send l neighbor to process on left while receiv-
ing into r point from process on right. l point and r point will be used to
calculate newval.

/* Take the most left and most right points */

l_neighbor[0]=*(curval+0);
r_neighbor[0]=*(curval+npoints-1);

/* Send r_neighbor to process on right while receiving into l_point


from process on left. Then, send the l_neighbor to process on left,
while receiving into r_point from the process on right. */

if (SendAndRecv){
MPI_Sendrecv(&r_neighbor[0], 1, MPI_DOUBLE, right, rank,
&l_point[0], 1, MPI_DOUBLE, left, left, ring, &status);
MPI_Sendrecv(&l_neighbor[0], 1, MPI_DOUBLE, left, rank,
&r_point[0], 1, MPI_DOUBLE, right, right, ring, &status);
}else{
MPI_Send(&r_neighbor[0], 1, MPI_DOUBLE, right, rank, ring);
MPI_Recv(&l_point[0], 1, MPI_DOUBLE, left, left, ring, &status);
MPI_Send(&l_neighbor[0], 1, MPI_DOUBLE, left, rank, ring);
MPI_Recv(&r_point[0], 1, MPI_DOUBLE, right, right, ring, &status);
}

10
We note that one can use MPI Send and MPI Recv separately to exchange the
boundary points. However, to avoid deadlock MPI Sendrecv is preferred since
sending and receiving happen simultaneously.
- To calculate newval, we need to consider carefully whether or not the data point
is in the boundary of the domain. If so, the boundary condition should be used.
Otherwise, we update the array according to (13), with notice of using l point
and r point. The code is as follows:

for (i=0; i <npoints; i++){


if ((offset+i==GlbLeft)||(offset+i==GlbRight)){
*(newval+i) = 0.0; /*Fix the two end points of string*/
}else{
l = (i==0) ? l_point[0] : *(curval+i-1);
m = *(curval+i);
r = (i==npoints-1) ? r_point[0] : *(curval+i+1);
d = *(oldval+i);
*(newval+i) = 2.0*m - d + tau*tau*(l-2.0*m+r);
}
}

- Finally, replace oldval by curval and curval by newval.

for (i=0; i < npoints; i++) {


*(oldval+i) = *(curval+i);
*(curval+i) = *(newval+i);
}

Complexity As always, it is good practice to analyze the parallel program in term


of its complexity. Assuming that the time integration takes most of the execution time.
It is reasonable to just consider a single time integration (once of calling compute()).
Using the same notations as in Section 3.3, however, we use “the number of grid
points” N instead of using “the problem size” n. The required time to compute one
grid point is 8τcalc . Therefore,

T1 = 8N τcalc , (14)
T1
Tp = + Tcomm
p
8N τcalc
= + 2(τsetup + τexchange ), (15)
p
T1 1
Sp = = 1 1 τsetup 1 τexchange , (16)
Tp p + 4N τcalc + 4N τcalc
T1 1
Ep = = p τsetup p τexchange . (17)
Tp 1+ 4N τcalc + 4N τcalc

11
4.5 Results and discussion
To check the correctness of the program, let c = 1 and k = 1, our exact solution is
then of the form ψ(x, t) = sin(2πx) cos(2πt), which scales the
√initial value by a factor
cos(2πt) < 1. At t = 1/8, t = 1/2 and t = 1, the factors are 2/2 = 0.7071, −1 and 1
respectively. This is also numerically observed as shown in Figure 2.

1 0.8
’wave0.dat’ ’wave1.dat’

0.8
0.6

0.6
0.4
0.4

0.2
0.2

0 0

-0.2
-0.2

-0.4
-0.4
-0.6

-0.6
-0.8

-1 -0.8
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200

a) Initial value. b) t = 1/8.


1 1
’wave4.dat’ ’wave8.dat’

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0

-0.2 -0.2

-0.4 -0.4

-0.6 -0.6

-0.8 -0.8

-1 -1
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200

c) t = 1/2. d) t = 1.

Figure 2: The vibration amplitude at the particular points in time. ∆t = 0.005.

Moreover, as mentioned in the remark in section 4.2, our numerical scheme is only
stable if ∆t ≤ ∆x. We have tested the program with two different values of ∆t
while ∆x = 1/199 is unchanged. Indeed, it is shown that our method is stable with
∆t = 0.005 (Figure 2) while it is unstable with ∆t = 0.01 (Figure 3). From Figure 3a,
we observe that the amplitude is large (≃ 15), after a very short time interval, and
keep increasing (≃ 106 in Figure 3b). The solution is no longer bounded by the initial
value (as mentioned in Section 4.2).
Having the analysis given in the section 4.4, we now focus on the performance
of the algorithm. Again, to have an explanation of these results, we shall have a
look at the formulas presenting the complexity. From (15), the execution time (in
seconds) is proportional to the ratio of N and p. That is why any curve in Figure 4a
is monotonously increasing as N increases, while it is monotonously decreasing as P
increases (Figure 4d).
From Figures 4b and 4e we observe that the speedup increases if N or p increases.

12
15 6e+06
’wave3.dat’ ’wave4.dat’

5e+06

10
4e+06

3e+06
5
2e+06

1e+06
0
0

-1e+06
-5
-2e+06

-10 -3e+06

-4e+06

-15 -5e+06
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200

a) t = 0.15. b) t = 0.20.

Figure 3: Unstable behavior (solution is plotted point by point ). ∆t = 0.01

This is supported by the theoretical expression (16). However, the increment of p


does not means that we then also get a high efficiency, because the larger p the larger
communication time, which might be cover the computation time. In fact, the effi-
ciency depends on the ratio of p and N , as you can see from formula (17). Thus, each
N has its own optimal value p∗ . From Figure 4c, p∗ = 2 and p∗ = 4 seem to be a
good choice for our problem, because the efficiency are high (≃ 0.78) and maintain as
N increases. For larger p (p = 8 and p = 16) we have not the such high efficiency,
because the communication time covers the computation time. To prove that, inside
the program we implemented a small technique to compute the exchanged time. It
turns out that the exchange time is really large (about 2/3 of execution time, or even
more), especially for large p. That is why in Figure 4f the efficiency curves quickly
decrease and are small as p increases. Our conclusion is that we should use at most 4
processors for this particular application.

5 The Time Dependent Diffusion Equation


5.1 The model
Let us consider the two-dimensional time dependent diffusion equation
µ 2
∂2c

∂c ∂ c
=D + 2 , (18)
∂t ∂x2 ∂y
where the concentration c is a function of spatial variables x and y and time t. Without
loss of generality, we assume that 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. For the boundary
conditions, we assume that
c(x, y = 1, t) = 1, c(x, y = 0, t) = 0, (19)
c(x = 0, y, t) = c(x = 1, y, t). (20)
For the initial condition, we take
c(x, y, t = 0) = 0 for 0 ≤ x ≤ 1, 0 ≤ y < 1. (21)

13
Complexity as function of N . Complexity as function of p.

Execution time
Execution time
1500
1 1500
2 10000
1000 4 20000
a) 8 1000 40000
16 d)
500
500

0
1 2 3 4 0
Speedup 4 5 10 15
x 10
5 Speedup
5
4
4
b) 3
e) 3
2
2
1
1 2 3 4 1
Effeciency 4 5 10 15
x 10
0.8 Effeciency
1
0.6
c) 0.4
f) 0.5
0.2

0
1 2 3 4
4 0
x 10 5 10 15
Horizontal axis: N . Horizontal axis: p.

Figure 4: Complexity as function of N and p.

For this particular set of boundary and initial condition, our solution does not depends
on x and
lim c(y, t) = y. (22)
t→∞

5.2 Numerical approach


In order to find the numerical solution of the above problem, we use explicit discretiza-
tion in time and central discretization for the second order derivative operators in the
space (see [5] for more details). That is, the solution at position xl = l∆x, ym = m∆y
and tn+1 = (n + 1)∆t is calculated through the following formula

D∆t ¡ n
cn+1 n
cl−1,m + cnl+1,m + cnl,m−1 + cnl,m+1 − 4cnl,m ,
¢
l,m = cl,m + 2
(23)
∆x

14
where for simplicity we used the uniform discretizations and assuming ∆x = ∆y.
Using this scheme, to update the concentration at the next time step on a certain grid
point (l, m), we only need information about the concentration, on previous time step,
on that grid point and on the nearest neighbors l ± 1, m ± 1. The scheme is therefore
very suitable for parallel computing.

Remark Being the explicit discretization, our scheme is conditional stable. It is


shown that (see [4, 5]) the scheme is stable if

∆t 1
D ≤ . (24)
∆x2 4

5.3 The parallel algorithm and the complexity


In this section, we consider two methods to decompose the spatial domain. These
are a trip wise and a block wise decomposition. By “trip Pp wise” we mean that each
processor receives a block of Nk × N data elements ( k=1 Nk = N ) and simulates
the time-behavior of this specific part. “Block wise” decomposition means that each
processor receives √Np × √Np data elements and simulates the time-behavior of this part.
Similar to the algorithm given in Section 4.3, if we suppose that the processors are
organized in a certain topology, e.g. ring/line topology, each processor has its own up
and down neighbors in case of “trip wise” and together with left and right neighbors
in case of “block wise” decomposition. Having a part of starting values, each processor
joins in parallel processing. So, the processor need to exchange the boundary points
with its neighboring processors. The internal points are fully parallelized.
To have an insight of these decompositions, we shall have a look at their complex-
ities. Using all assumptions and notations given in Section 4.4, we firstly consider the
trip wise decomposition. In each time integration, the calculation is performed on a
grid of N × N points. So,
T1 = N 2 τcalc . (25)
Using trip wise decomposition, each process exchanges 2N boundary points with the
two neighbors. The execution time, the speedup and the efficiency are then

N2
Tp = τcalc + Tcomm
p
N2
= τcalc + 2(τsetup + N τexchange ), (26)
p
1
Sp = 1 2 τsetup 2 τexchange , (27)
p + N 2 τcalc + N τcalc
1
Ep = 2p τsetup 2p τexchange . (28)
1+ N 2 τcalc + N τcalc

Now, consider the block wise decomposition. The difference from the above method

is that the current process exchanges N/ p boundary points with each of 4 neighbors.

15
Therefore, the communication time is slightly different, that is,

N2
Tp = τcalc + Tcomm
p
N2 N
= τcalc + 4(τsetup + √ τexchange ), (29)
p p
1
Sp = 1 4 τsetup 4 τ , (30)
√ exchange
p + N 2 τcalc + N p τcalc
1
Ep = 4p τsetup

4 p τexchange
. (31)
1+ N 2 τcalc + N τcalc

In term of the communication time, the block wise decomposition is more efficient

than the other when p is large (i.e. 4N/ p < 2N ). However implementing the block
wise decomposition method requires a bit more complicated. For that reason, we take
the trip wise decomposition and implement in C using MPI library.

5.4 Implementation
In this section we discuss in details the algorithm using MPI specification.
As always, the environment is initialized using routines MPI Init, MPI Comm size
and MPI Comm rank. To create the virtual topology we use MPI Cart create. The
coordinate as well as the rank ID of the current processor are found by MPI Cart get
and MPI Cart rank. Because of the periodic boundary condition in the x-direction,
we use a decomposition in the y-direction and use the ring topology. In this way, each
processor is responsible for a part of vertical axis (along the whole horizontal axis).
For convenient with the algorithm given in Section 4.4, we use again left and right
to denote the rank ID of the upper processor and the lower processor. These rank IDs
are easily found by MPI Cart shift routine. The data decomposition, i.e. npoints
and offset, are similar to what we had in the previous sections.
In the beginning, the local solution vector curval, which is dynamically allocated,
is initialized according to (19) and (21). Then, in each iteration we perform one time
integration compute(). To update the (local) solution vector, the current processor
exchanges two arrays of length N with left and with right. Again, this is done
by using MPI Sendrecv call. The updating process is accompanied by considering
the boundary condition in the y-direction and the periodic boundary condition in the
x-direction. The full code is as follows:
void compute(void) {
MPI_Status status;
int i,j;
double l_point[N], r_point[N], l_neighbor[N], r_neighbor[N];
double l, r, u, d, m;

/* Take the first and the last rows in current processor */


for (j=0; j<N; j++){
l_neighbor[j]=*(curval+index(0,j));
r_neighbor[j]=*(curval+index(npoints-1,j));

16
}
/* Send r_neighbor to process on right while receiving into l_point
from process on left. Then, send the l_neighbor to process on left,
while receiving into r_point from the process on right. */
MPI_Sendrecv(&r_neighbor[0], N, MPI_DOUBLE, right, rank,
&l_point[0],N, MPI_DOUBLE, left, left, ring, &status);
MPI_Sendrecv(&l_neighbor[0], N, MPI_DOUBLE, left, rank,
&r_point[0],N, MPI_DOUBLE, right, right, ring, &status);
for (j=0; j<N; j++){
for (i=0; i <npoints; i++){
if (offset+i == GlbRight){
*(newval+index(i,j)) = 1.0; /*Boundary condition in y-direction */
}else{
if (offset+i == GlbLeft){
*(newval+index(i,j)) = 0.0;
}else{
u = (i==0) ? l_point[j] : *(curval+ index(i-1,j));
m = *(curval+index(i,j));
d = (i==npoints-1) ? r_point[j] : *(curval+index(i+1,j));
l = (j==0) ? *(curval+index(i,GlbRight-1)) : *(curval+index(i,j-1));
/* (because of periodicity in x-direction) */
r = (j==N-1) ? *(curval+index(i,GlbLeft+1)) : *(curval+index(i,j+1));
*(newval+index(i,j)) = *(curval+index(i,j)) +
D*dt/(dx*dx)*(l+r+u+d - 4.0*m);
}
}
}
}
/* Update the arrays */
for (j=0; j<N; j++){
for (i=0; i <npoints; i++){
*(curval+index(i,j)) = *(newval+index(i,j));
}
}
}

To analyze the data later, the output is written to a file after every ITV iterations.
It is shown that for our particular initial value, our solution does not depend on x.
Therefore, one can take a column from the concentration field, in the current process.
Then, that vector, which is denoted by subvector, is collected into the root using
MPI Gather. The output is printed to a file there. Although this I/O operator will
also induce overhead, it has not the negative effect on the efficiency of our parallel
program. In fact, we have tested the program with and without that I/O operator. It
turns out that the efficiency does not change, more or less. For example, for ∆x = 0.01,
if we do not use I/O operator then E2 = 0.38, E4 = 0.34, while if we use I/O operator
then E2 = 0.39, E4 = 0.33.

17
5.5 Results and discussion
As mentioned in (22), the numerical solution c tends to its exact value, i.e. y.
The experiment shows that after a time interval T = 10 the numerical solution
(which is written in Assignment3.txt file) differs from the exact value a very small
amount ||c − y||∞ = 2.38 · 10−8 (in my labtop, ||c − y||∞ = 2.75 · 10−14 ). Here ∆t =
0.0005, ∆x = 0.05 and D = 1.0 are used. This result implies the correctness of our
parallel simulation.
We now have a look at the performance of the algorithm. The results, which are
presented in Figure 5 show that the execution time increases if N increases (see Figure
5a) while its reduction follows the increment of p (Figure 5d). This is already predicted
in (26) and no longer surprise.

Execution time
Execution time
3000 2500
1
32
2 2000 64
2000 4
128
a) 8
d)
1500
256
16
1000 1000 512
1024
500
0 0
200 400 600 800 1000 5 10 15
Speedup
Speedup
15 15

10 10
b) e)
5 5

0 0
200 400 600 800 1000 5 10 15
Effeciency Effeciency
1 1

c)
0.5 f) 0.5

0 0
200 400 600 800 1000 5 10 15
Horizontal axis: N . Horizontal axis: p.

Figure 5: The complexity as function of N and p.

Similar to what discussed in Section 4.5, Figures 5b and 5e show that the speedup
increases if N or p increases (see (27)). Also, Figures 5c and 5f imply that p =
2 and p are the good choice for our problem because they give a higher efficiency.
Compare to the efficiency obtained in Section 4.5, the results of our problem seem

18
to be more efficient. This might due to the facts that: i) We run the program in
two different machine systems (Gene and Beowulf) which have different hardware
parameters τcomm /τcalc ; ii) If τsetup dominates τexchange then the higher efficiency
comes from factor N 2 (compare (17) and (28)).

6 The Time Independent Diffusion Equation


6.1 The model
Assuming that we are not interested in the time developing of the concentration c
appeared in Section 5, but only in the steady state (i.e. ∂c/∂t = 0), where the Laplace
equation
∂2c ∂2c
0= 2
+ 2 (32)
∂x ∂y
need to be solved. Assuming that we use the same initial condition and boundary
conditions of previous section. Using the central discretization for the second order
derivatives in space, (32) becomes
1
cl,m = (cl−1,m + cl+1,m + cl,m−1 + cl,m+1 ) . (33)
4
∆t 1
In fact, this system is derived from (23) by simply setting D ∆x 2 = 4 , in which the

maximum time step is allowed. This suggests that the efficient methods are needed to
solve system (33). In the coming sections, we will present the three popular methods to
solve that system. The general idea of these methods is to solve the system iteratively.

6.2 The Jacobi iteration


The Jacobi iterative method solves (33) in the way:
1¡ k
ck+1 cl−1,m + ckl+1,m + ckl,m−1 + ckl,m+1 ,
¢
l,m = (34)
4
where the superscript k denotes the k-th iteration. This updating process goes on,
until a certain stopping criterion is satisfied. That is, the solution is assumed to be
converged if
|ck+1 k
l,m − cl,m | < σ (35)
holds for all values (l, m), with some small number σ.
Similar to what mentioned in Section 5, our problem is well parallelized. One can
parallelize the Jacobi iteration together with the stopping criterion in the following
way
k = 0; GlobalError = σ;
While (GlobalError ≥ σ)
k = k + 1;
LocalError = 0.0;
For i = 1 to NumberOfErrorEvaluations do

19
Exchange the boundary points (the most upper and most
lower rows) in the current processor with the neighboring
processors;
For each point in the computational domain of the current
processor
- Update the value according to the boundary con-
ditions and Jacobi iteration;
- The LocalError is recalculated if the difference
between the old and the new values is not less
than σ;
EndFor
EndFor
Take the GlobalError as the maximum of all local errors which calcu-
lated in all processors;
EndWhile.
Here the local error (LocErr) is not necessary to be recalculated at every point in
the computational domain because we do care about the points where the stopping
criterion does not satisfy. Roughly, we may say LocErr is an indication of whether
or not the stopping criterion is valid in the current computational domain. This
modification therefore saves a bit computation time.
Next, the local errors reduce to the global error (GlbErr) once after a number
of error evaluations NEE, which is chosen so that it is small enough compared to the
expected number of iterations niter. Even though it then takes about NEE extra
iterations to get the required stopping criterion, the amount of communication are
reduced because the communication seems to be more expensive than the computation.
To support this, the parallel program is tested with two examples of NEE=1 and
NEE=10, while N = 20 is unchanged. It turns out that for NEE=1, niter=1225,
error= 2.39 · 10−5 , T4 = 2.99 and T8 = 4.90 (both in seconds); while for NEE=10,
niter=1240, error= 2.16 · 10−5 , T4 = 1.92 and T8 = 2.39 (seconds). The error in
the later example is slightly smaller than the former since the program is iterated in
a larger number of iterations. The execution time in the second example is indeed
smaller than in the first example.
The C code implementing the above algorithm is as follows:
GlbErr = sigma; iter = 0;
while (GlbErr >= sigma){
iter = iter +1;
LocErr = 0.0;
for (i=1; i<=NEE; i++) compute();
MPI_Allreduce (&LocErr, &GlbErr, 1, MPI_DOUBLE, MPI_MAX, ring);
}
if (rank==0) printf("Number of iterations %d\n",iter*NEE);

where the routine compute is almost the same as the code given in Section 5.4 (there
we set D∆t/∆x2 = 1/4), except that inside the updating-solution loop we add the
following commands

20
new = *(newval+index(i,j));
old = *(curval+index(i,j));
if (fabs(new-old) >= sigma) LocErr = fabs(new-old);
Having the above stopping criterion based on the certain tolerance σ, how does this
tolerance influence on the number of iterations as well as the quality of the solution,
which is measured by error = ||c − y||∞ ? The number of iterations and the quality of
the solution as function of σ are plotted in Figure 6.
4
0 x 10
10 3
N=40
N=80
−1
10 2.5

Number of iterations
−2
10 2
Error

−3
10 1.5

−4
10 1

−5
10 0.5

−6
10 −2 −4 −6 −8 0
10 10 10 10 2 3 4 5 6 7 8
σ −log(σ)

a) The error. b) Number of iterations.

Figure 6: Number of iterations and error as function of tolerance.

It is shown in Figure 6a that the error linearly depends on the tolerance used. Here
N = 40 is used.
The results in Figure 6b shows a linear dependence between the number of iteration
and log(σ) (the iteration profile is just a constant line). Here, two grid sizes N = 40
and N = 80 are used. The result suggests an N 2 dependence. As you can see from
Figure 6b, the algorithm using N = 80 requires approximately four times as many
number of iterations as using N = 40, while N is just doubled. This N 2 dependence
was also found from mathematics analysis (see [5, 4] and references therein). Finally,
we note that the number of iterations seems to be much larger than number of grid
points N 2 , e.g. for N = 80 niter= 25630 ≫ 6400 = 802 . Therefore, the direct
methods are preferred to solve our problem. In conclusion, Jacobi iteration should be
used as an example, a stepping stone of other efficient iterative methods, but is not
used in real applications.

6.3 Gauss-Seidel iterative method


An improvement over the Jacobi iteration is the Gauss-Seidel iteration, where during
the iteration a new value is used as soon as it has been calculated. Assumming that
the iteration proceeds along the row, the Gauss-Seidel iteration reads
1 ³ k+1 ´
ck+1
l,m = c + ck
l+1,m + ck+1
+ ck
l,m+1 . (36)
4 l−1,m l,m−1

Unlike Jacobi iteration, we no longer need two arrays to store the old results and the
new results, here we just use only one array curval and compute in place. This saves

21
an enormous amount of memory if N is large.
According to theory, it turns out that the Gauss-Seidel iterative method requires a
factor of two iterations less than Jacobi (see [5]). Also it is stated that we need to use
a special ordering of the unknowns on the grid to obtain an efficient algorithm because
the standard Gauss-Seidel algorithm is inherently serial. The simplest version of this
is the Red-Black algorithm.
The key idea is is to group the grid points into two groups, identified as RED and
BLACK. In our grid, the grid point (l, m) is colored as RED if l + m is even, as BLACK if
l + m is odd. Observe that the red nodes are surrounded and updated by black nodes
and vice-versa. Thus, all the red nodes are independent from each other. The same
thing holds for the black nodes. In term of parallelism, we can now first update all
red points in parallel, followed by a parallel update of all black points.
The computation is now split into two parts. Before each part, a communication
with neighboring processors is need. For instance, to update the red points, only the
black boundary points need to be exchanged. Therefore in each computation part, the
processor exchanges only a half of boundary points. This means that in comparison
with Jacobi iteration, we only double the setup times required for exchange operators,
but using the same sending times constant.
Similar to the parallel Jacobi algorithm, in the Gauss-Seidel version, a small mod-
ification is needed according to the splitting of the computation parts. That is,
k = 0; GlobalError = σ;
While (GlobalError ≥ σ)
k = k + 1;
LocalError = 0.0;
For i = 1 to NumberOfErrorEvaluations do
Exchange the black boundary points with the neighboring
processors;
For each red point in the current processor
- Update the value according to the boundary con-
ditions and Gauss-Seidel iteration;
- Recalculate LocalError if necessary;
EndFor
Exchange the red boundary points with the neighboring pro-
cessors;
For each black point in the current processor
- Update the value according to the boundary con-
ditions and Gauss-Seidel iteration;
- Recalculate LocalError if necessary;
EndFor
EndFor
Take the GlobalError as the maximum of all local errors;
EndWhile.
To have a short C code, in the program, we have defined RED=0, BLACK=1. The
variable color can be either RED or BLACK. In the computation domain of current

22
processor, a point (i,j) is colored as red if (offset+i)%2=j%2, while it is black if
(offset+i)%2=1-j%2. Then, if we want to update all points which are colored by
color, we need to consider whether or not some point (offset+i,j) has the expected
color. Thus, if j=((offset+i)+ color)%2 then that point has the color color (i.e.
will be updated), while if j=((offset+i)+(1-color))%2 then that point has a differed
color (i.e. will be used for updating). In this way, the two computation parts can be
implemented by one code (see compute()). In the main routine we then can call the
following set of commands
GlbErr = sigma; iter = 0; color = RED;
while (GlbErr >= sigma){
iter = iter +1;
LocErr = 0.0;
for (i=1; i<=NEE; i++){
for (j=1; j<=2; j++){
compute();
color = 1-color;
}
}
MPI_Allreduce (&LocErr, &GlbErr, 1, MPI_DOUBLE, MPI_MAX, ring);
}
if (rank==0) printf("Number of iterations %d\n",iter*NEE);
To demonstrate the above discussed advantages of Gauss-Seidel iteration, we have
taken N = 40 and measured the number of iterations needed for Gauss-Seidel and
compared to Jacobi. Number of iterations as function of the stopping criterion is
plotted in Figure 7a. The results shows that to obtain the same desired accuracy
the Gauss-Seidel iteration requires almost a factor of two iterations less than Jacobi.
Moreover, based on the results plotted in Figure 7b, we observe that when solving for
the same mesh size, the Gauss-Seidel method converges with half as many iterations
as needed for the Jacobi method. These results are closed to what predicted by theory.
Together with the results shown in Table 1, in which Gauss-Seidel iteration seems to be
4
x 10
8000 3
Jacobi Jacobi
Gauss−Seidel Gauss−Seidel
7000
2.5

6000
Number of iterations
Number of iterations

2
5000

4000 1.5

3000
1

2000
0.5
1000

0 0
2 3 4 5 6 7 8 10 20 30 40 50 60 70 80
−log(σ) Mesh points

a) As function of the stopping criterion. b) As function of the problem size.

Figure 7: Number of iterations for Jacobi and Gauss-Seidel as function of the stopping
criterion and of the problem size.

twice as more accurate as Jacobi iteration, we conclude that the parallelism available

23
in Jacobi iteration is now completely destroyed by the Gauss-Seidel iteration.

d=2 d=3 d=4 d=5 d=6 d=7 d=8


Jacobi 7.76e-1 4.43e-1 6.16e-2 6.16e-3 6.16e-4 6.16e-5 6.16e-6
Gauss-Seidel 6.95e-1 2.91e-1 3.08e-2 3.07e-3 3.08e-4 3.07e-5 3.08e-6

Table 1: Errors for Jacobi and Gauss-Seidel methods. d = − log(σ).

As always, it is important to have a real feeling about the complexity of the prob-
lem. For that reason, we plot in Figure 8 the execution time, the speedup and the
efficiency. This figure is closed to the similar figures in the previous sections. For ex-

Execution time Execution time


1500
p=1 1000 32
p=2 64
1000 p=4 128
p=8 256
a) p=16 d) 500 512
500 1024

0 0
200 400 600 800 1000 5 10 15
Speedup Speedup
10 10

b) 5 e) 5

0 0
200 400 600 800 1000 5 10 15
Effeciency Effeciency
1 1

c) 0.5
f) 0.5

0 0
200 400 600 800 1000 5 10 15
Horizontal axis: N . Horizontal axis: p.

Figure 8: The complexity as function of N and p.

ample, the results show that we obtain a high speedup and a high efficiency, especially
when N is large and p ≤ 8 (see Figures 8a, 8b and 8c) and increasing the number of
processors will increase the difficulty, both in term of execution time and the efficiency.
We therefore refer to the previous sections for more detail discussion.

24
6.4 Successive Over Relaxation (SOR) iteration
In this section we deal with a correction version of Gauss-Seidel method, that is, by
mixing the Gauss-Seidel result with the current value, i.e.
ω ³ k+1 ´
ck+1
l,m = c + ck
l+1,m + ck+1
+ ck k
l,m+1 + (1 − ω)cl,m , (37)
4 l−1,m l,m−1

where 1 < ω < 2.


It was shown that ([4]) SOR is very advantageous and gives much faster convergence
than Gauss-Seidel and, of course, Jacobi. The number of iterations strongly depends
on the value of ω. Such the interesting features of SOR motivate us to have a numerical
feeling of the problem. Therefore, we implement the program, then take N = 40 and
plot the number of iterations and the errors for the Jacobi, Gauss-Seidel and SOR
methods as function of the stopping criterion in Figure 9.
0
8000 10
Jacobi Jacobi
Gauss−Seidel Gauss−Seidel
7000 SOR ω=1.6
SOR ω=1.6
SOR ω=1.9 −2 SOR ω=1.9
6000 10
Number of iterations

5000
Error

−4
4000 10

3000
−6
2000 10

1000
−8
0 10 −2 −4 −6 −8
2 3 4 5 6 7 8 10 10 10 10
−log(σ) σ

a) The number of iterations. b) The error.

Figure 9: The number of iterations and the errors for the Jacobi, Gauss-Seidel and
SOR methods as function of the stopping criterion.

At the first sight, we observe from this figure that the method depends strongly
on the parameter ω, e.g. by comparing the number-of-iteration curves in the two
cases of ω = 1.6 and ω = 1.9. Furthermore, from Figure 9b we find in all methods a
linear relationship between the stoping criterion σ and the error. Then, the problem is
indeed enormously improved when using SOR, especially with ω = 1.9. The number
of iterations in this case is about six times smaller than that in Gauss-Seidel iteration
(see Figure 9a). Moreover, from Figure 9b we observe that error in SOR is much
smaller than in Jacobi and Gauss-Seidel. This further improves the efficiency of SOR.
For example, if we would like to have a maximal error (in L∞ norm) of 10−3 , SOR
requires σ = 10−3 , while Jacobi and Gauss-Seidel require σ = 10−6 . We now fully
understand why SOR is a practically useful iterative method and is applied regularly.

25
References
[1] G. Golub, J.M. Ortega, Scientific Computing: An Introduction with Parallel Com-
puting, Academic Press, 1993.

[2] A Case Study , The guitar String.

[3] A First Look at Performance.

[4] Numerical Solvers


[5] G.E. Karniadakis, R.M. Kirby II, Parallel Scientific Computing in C++ and MPI,
Cambrige University, 2003.

26

You might also like