Parallel Computing with MPI
Parallel Computing with MPI
Distributed Computing
Vũ Văn Thiê.u
Master student, Master of Science Program
1 Introduction
In the early 1970s, computers began to appear that consisted of a number of separate
processors operating in parallel. This type of computer is called a parallel computer.
The basic idea of parallel computer is that a number of processors work in cooperation
on a single task. The motivation is that if it takes one processor an amount of time
t to do a task, then p processors can do the task in time t/p. Only for very special
situations this perfect “speedup” can be achieved, however, and it is our goal to de-
vise algorithms that can take as much advantage as possible, for a given problem, of
multiple processors.
One of the most popular ways to implement a parallel program in such the computer
is to use MPI (Message Passing Interface) which is written in C or Fortran (see [5]).
In this report, we deal with the parallel algorithms, which associate to the particular
problems, using MPI specification .
1
Scalability A parallel program is said to be scalable if it maintains a high efficiency
while the number of processors is increasing.
or
z = B x,
provided that the complete argument x has been copied into memory of this processor.
Here B is a nk × n matrix, z is a vector of length nk and is a part of the vector y.
Using this data decomposition, each processor can completely in parallel due to the
independence of calculating in different blocks.
After loading all the sub-vectors z into vector y, we have now a solution of our
problem. We remark that as the input vector x, the output vector y need to be dis-
tributed over all process because in many applications, for example in the iterative
solvers, this output is considered as an input argument for the next iteration. There-
fore, the full parallel matrix vector product should also contain a routine that first
gathers the argument vector into memory of all processors and then performs the
parallel computation.
3.3 Implementation
The program is written in a way that we can easily change the matrix dimension n and
the number of processors p (via MPI Comm size call). The call MPI Comm rank gives
an identity rank for each processor.
2
As discussed above, in the case that n/p is not an integer, the speed-up is not
optimized and is best if we have the minimum difference in the number of rows nk that
each processor receives. For that purpose, one can device a load-balancing algorithm
that enforces the maximum difference in the number of rows is one, i.e. processors
receive nk = ⌈n/p⌉ or nk = ⌈n/p⌉ − 1 rows, with ⌈ ⌉ is the ceiling function. In the C
code, nk and the global position of the first row (of the current block) are the variables
npoints and offset respectively, where
npoints = n/p + (n%p + p - rank -1)/p;
offset = (npoints > n/p) ? (npoints*rank) : (n-(p-rank)*npoints);
The global position of a row is found straightforwardly.
Before having an actual matrix-vector product, let us first generate the matrix and
vector elements locally: initialize in each processor the nk × n matrix submatrix (ma-
trix B) and subvector - a vector of length nk ; then by gather operation (MPI Allgather
call) we collect all distributed elements of subvector, via newvector, into vector -
a vector of length n (vector x), which will be contributed in all processors (using
MPI Bcast call). Now, each processor simply performs a matrix-vector product Bx.
The result vector res (vector z) is sent back to the root where the solution of our
problem is stored.
Remark We venture to give an optimized code, with respect to the memory usage
and the communication time, as follows:
- To collect all distributed elements of subvector into vector, we need not to use
the intermediate vector, i.e. newvector (which is of the length 2n), since we call
MPI Allgatherv instead of MPI Allgather function
The memory usage is then significantly reduced, even though two new vectors of
length p (which is usually very small compared to 2n), i.e. count and displm,
are used.
- Using MPI Allgatherv call,vector is available in every processor. We therefore
need not to Bcast vector to the root.
- The resulting vector res need not to be stored in every processor
float *res=NULL;
if (rank==0) res = (float *) malloc (n*sizeof(float));
Here we introduce in each processor a local vector subres (of length npoints)
to store the result of the local matrix-vector product.
- To send the resulting vector back to the root, we then use MPI Gatherv to collect
all elements of subres into res
3
In this way, unlike using a set of MPI Send and MPI Recv, the information is
transferred only once.
3.4 Results
To test the correctness of our program (see the selection of constant TEST in the code),
let A be an identity matrix
½
1, if i = j,
Aij =
0, otherwise.
4
Then y = Ax = x. This is indeed observed from result of our program:
n = 10, p = 1, rank = 0, npoints = 10, offset = 0
The elements of the 0-th submatrix are:
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1
The elements of the vector are:
0 1 2 3 4 5 6 7 8 9
The result vector is:
0 1 2 3 4 5 6 7 8 9
The execution time (using 1 processors) is: 0.001851 seconds
To measure the scalability of the program, before and after the performance, the
routine gettimeofday (or MPI Wtime) is used to calculate the execution time using
p processors. The execution time (seconds), the speedup and the efficiency of the
progam using different number of processors p for different values of n are plotted in
Figure 1. To have an insight of these results, let us look at the formulas (3), (4) and
(5). From (3) we see that Tp is proportional to n, while it is inversely proportional to
p. Therefore, Tp increases when n increases. This is clearly shown in Figure 1a. On
the other hand, Tp decreases if p increases. This is confirmed in Figure 1d.
From (4), we see that the speedup increases as either p or n increases. In other
word, the larger problem size the higher speedup (see Figure 1b), the larger number
of processors the higher speedup (see Figure 1e).
The formula (5) implies the dependence of the efficiency on the ratio of p and n.
Therefore, if we fix n while increase p, the efficiency reduces. Especially for small n,
the efficiency of algorithm using 2 processors is largest as shown in Figure 1c. If we
increase n, while fix p, the efficiency increases. That is why in Figure 1f we observe
that the two curves of n = 2000 and n = 3000 are higher than that of n = 1000. At
the same time, we would like to use large p to have high speedup, but we then face the
reduction of the efficiency when using the such large p. Therefore, it is shown that (see
[3]) each n has a optimal number of processor p∗ . When n increases, p∗ also increases.
We observe that p = 2 seems to be a good choice for n < 2000, p = 4 for n ≃ 3000,
p = 8, 16 might be a good choice for larger n.
From Figures 1b and 1c, we see that for small n, the speedup as well as the efficiency
are small. This is due to the fact that the communication time covers the computation
time. For larger n, the speedup and the efficiency are increasing and reach their
maximal values. These values are high (e.g. S2 ≃ 1.65, S4 ≃ 3.4, E2 ≃ E4 ≃ 0.85),
close to the theoretical values, and maintained as n is increasing. Moreover, when we
increase the number of processors p, the efficiency are maintained highly(≃ 0.79, for
n = 2000). This behavior shows the scalability of our parallel program.
5
Complexity as function of n. Complexity as function of p.
1.5 3
a) d)
1 2
0.5 1
0 0
0 1000 2000 3000 5 10 15
Speedup Speedup
15 15
10 10
b) e)
5 5
0 0
0 1000 2000 3000 5 10 15
Effeciency Effeciency
1 1
0.8
c) f)
0.5
0.6
0 0.4
0 1000 2000 3000 5 10 15
Horizontal axis: n. Horizontal axis: p.
Solid line: p = 16; Dashed line: p = 8; Solid line: n = 3000; Dashed line: n = 2000;
Dash-dotted line: p = 4; Dotted line: p = 2; Dash-dotted line: n = 1000.;
6
3.5 Discussion
Beside the above row-block decomposition algorithm, we venture to give a parallel
algorithm, say the column decomposition algorithm, as follows:
Let us make a partition of the matrix A as p blocks of sub-matrices. Each block
consists of nk columns, i.e., A = [B1 , B2 , · · · , Bp ], where Bk is a n × nk matrix.
We divide the vector x into its sub-vectors, i.e., x = [x1 , x2 , · · · , xp ]′ , where xk is a
vector of length nk . Then in the parallel algorithm, the processor k simply performs
a sub-problem, a matrix-vector product Bk xk = z k . Thus, similar to the previous
row block decomposition, in each processor, we can generate locally the matrix Bk
and the vector xk . The resulting vector of length n, z k , is calculated locally. In
fact, the vector y which we need to compute, is a summation of p vectors of length
n, i.e. y = z 1 + z 2 + · · · + z p . Again, a parallel algorithm for adding these vectors
arises because the addition in different rows can be done independently. We notice
that, before the second parallel algorithm is done, the vectors z 1 , z 2 , · · · , z p should be
stored in the root as either a matrix or a vector of length np (for example, r(i−1)p+j =
zij , j = 1, 2, · · · , p, i = 1, 2, · · · , n).
Using the similar way of estimating the complexity of the previous algorithm, for
the current algorithm we consider two sequential parts:
- For the first part (local matrix-vector product), the computation time on pro-
cessor k is approximated by 2n nk .
- For the second part (vector-vector addition), we can use the so-called ‘fan-in’
algorithm (we refer to [1] for more detail discussion). In the processor k, we
need to calculate nk elements. Each of them is a summation of p values. The
computation time of such the calculation using ‘fan-in’ algorithm is approximated
by log2 p. Therefore, the total computation time for this second part is nk log2 p.
If the communication time can be omitted, the total execution time for the current
algorithm should be nk (2n + log2 p), which is larger than that of previous section. If p
is small compared to n, this value is comparable to what has been shown the previous
section.
In summary, we have studied the row-block decomposition parallel algorithm used
for calculating the matrix-vector product, in both theoretical and numerical points of
view. As illustrated in the above figure, this is a quite satisfactory as it is popularly
in used. In addition, we also gave the parallel algorithm which is a promising way of
calculating a matrix-vector product.
∂2ψ 2
2∂ ψ
= c , (6)
∂t2 ∂x2
7
comes from many real and interesting applications, such as the vibrations of a uniform
string, the sound waves in gases or liquids, and optical waves. As an example, we
study an isolated guitar string of length 1, with two fixed endpoints. We refer to [2]
for more details of the construction of the problem.
Briefly, we need to solve equation (6) to find out the vibration amplitude ψ(x, t)
of the string at position x (0 ≤ x ≤ 1) and time t (t ≥ 0). Since the string is fixed on
both endpoints, mathematically we have
ψ(0, t) = ψ(1, t) = 0. (7)
We suppose that the vibration amplitude is initialized as
ψ(x, 0) = f (x), (8)
where f (x) can be any smooth function that satisfies the boundary conditions (7). For
that purpose, we chose f (x) = sin(2kπx), k ∈ Z+ .
Analytically, using Fourier transformation (see [2]), our solution is of the form
∞
X
ψ(x, t) = Bm sin(mπx) cos(mπct), (9)
m=1
For our particular initial value f (x) = sin(2kπx) we have Bm = 0, m 6= 2k and B2k = 1.
Therefore, the final solution is ψ(x, t) = sin(2kπx) cos(2kπct). Thus,
||ψ(x, t)|| ≤ || sin(2kπx)|| = ||f (x)|| = ||ψ(x, 0)||. (11)
8
where τ = c∆t
∆x . In other words, to update the solution at a particular point in the
space we need only few local grid points (itself and 4 neighbors) of the current time
level. The scheme is therefore well parallelized.
Remark : Our discretization scheme is stable (i.e., a small local error does not lead
to a large cumulative error) if 0 ≤ τ ≤ 1 (see [2]). The time step is therefore restricted
by ∆t ≤ ∆x/c.
Remark The temporal decomposition is not used here because the data elements in
processor k are only calculated if data elements in processor k − 1 are available (i.e.,
in fact we have no parallelization).
4.4 Implementation
Being the first step, each processor picks up the starting values, such as the environ-
ment (MPI Init) and its rank in the topology (MPI Comm size and MPI Comm rank).
Using the routine parseargs, the root receives all necessary parameters, e.g. number
of grid points N , the time interval, the time step etc. into variable parms and passes
those parameters to all processors. The parseargs function is used in the following
manner
if(parseargs(argc, argv)){
MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
exit(EXIT_FAILURE);
}
Here MPI Abort is used (not MPI Finalized) since the passing-parameters program
turns out to be unsuccessful. There are two ways of passing the parameters, namely
TrfArgs1 and TrfArg2. The former bcasts the parameters in a package (using the rou-
tine MPI Type struct). The latter bcasts one parameter by one parameter. Therefore,
TrfArg2 should be the cleaner and faster method because the package is considered
as a single datatype and the communication time is smaller. For example, for a set of
parameters, it turns out that the execution time using TrfArgs1 is 0.08 seconds, while
it is only 0.04 seconds for the program using TrfArgs2.
Now, to create a ring/line topology ring, we use the routine MPI Cart create. The
coordinate and the rank ID of the current process, which are respectively denoted by
9
coords and rank, are found by using the routines MPI Cart get and MPI Cart rank.
Then, the left (of coordinate coords-1) and the right(of coordinate coords+1) pro-
cessors, whose rank IDs are left and right, can be easily obtained
MPI_Cart_shift (ring, 0, 1, &left, &right);
l_neighbor[0]=*(curval+0);
r_neighbor[0]=*(curval+npoints-1);
if (SendAndRecv){
MPI_Sendrecv(&r_neighbor[0], 1, MPI_DOUBLE, right, rank,
&l_point[0], 1, MPI_DOUBLE, left, left, ring, &status);
MPI_Sendrecv(&l_neighbor[0], 1, MPI_DOUBLE, left, rank,
&r_point[0], 1, MPI_DOUBLE, right, right, ring, &status);
}else{
MPI_Send(&r_neighbor[0], 1, MPI_DOUBLE, right, rank, ring);
MPI_Recv(&l_point[0], 1, MPI_DOUBLE, left, left, ring, &status);
MPI_Send(&l_neighbor[0], 1, MPI_DOUBLE, left, rank, ring);
MPI_Recv(&r_point[0], 1, MPI_DOUBLE, right, right, ring, &status);
}
10
We note that one can use MPI Send and MPI Recv separately to exchange the
boundary points. However, to avoid deadlock MPI Sendrecv is preferred since
sending and receiving happen simultaneously.
- To calculate newval, we need to consider carefully whether or not the data point
is in the boundary of the domain. If so, the boundary condition should be used.
Otherwise, we update the array according to (13), with notice of using l point
and r point. The code is as follows:
T1 = 8N τcalc , (14)
T1
Tp = + Tcomm
p
8N τcalc
= + 2(τsetup + τexchange ), (15)
p
T1 1
Sp = = 1 1 τsetup 1 τexchange , (16)
Tp p + 4N τcalc + 4N τcalc
T1 1
Ep = = p τsetup p τexchange . (17)
Tp 1+ 4N τcalc + 4N τcalc
11
4.5 Results and discussion
To check the correctness of the program, let c = 1 and k = 1, our exact solution is
then of the form ψ(x, t) = sin(2πx) cos(2πt), which scales the
√initial value by a factor
cos(2πt) < 1. At t = 1/8, t = 1/2 and t = 1, the factors are 2/2 = 0.7071, −1 and 1
respectively. This is also numerically observed as shown in Figure 2.
1 0.8
’wave0.dat’ ’wave1.dat’
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0 0
-0.2
-0.2
-0.4
-0.4
-0.6
-0.6
-0.8
-1 -0.8
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
-0.2 -0.2
-0.4 -0.4
-0.6 -0.6
-0.8 -0.8
-1 -1
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200
c) t = 1/2. d) t = 1.
Moreover, as mentioned in the remark in section 4.2, our numerical scheme is only
stable if ∆t ≤ ∆x. We have tested the program with two different values of ∆t
while ∆x = 1/199 is unchanged. Indeed, it is shown that our method is stable with
∆t = 0.005 (Figure 2) while it is unstable with ∆t = 0.01 (Figure 3). From Figure 3a,
we observe that the amplitude is large (≃ 15), after a very short time interval, and
keep increasing (≃ 106 in Figure 3b). The solution is no longer bounded by the initial
value (as mentioned in Section 4.2).
Having the analysis given in the section 4.4, we now focus on the performance
of the algorithm. Again, to have an explanation of these results, we shall have a
look at the formulas presenting the complexity. From (15), the execution time (in
seconds) is proportional to the ratio of N and p. That is why any curve in Figure 4a
is monotonously increasing as N increases, while it is monotonously decreasing as P
increases (Figure 4d).
From Figures 4b and 4e we observe that the speedup increases if N or p increases.
12
15 6e+06
’wave3.dat’ ’wave4.dat’
5e+06
10
4e+06
3e+06
5
2e+06
1e+06
0
0
-1e+06
-5
-2e+06
-10 -3e+06
-4e+06
-15 -5e+06
0 20 40 60 80 100 120 140 160 180 200 0 20 40 60 80 100 120 140 160 180 200
a) t = 0.15. b) t = 0.20.
13
Complexity as function of N . Complexity as function of p.
Execution time
Execution time
1500
1 1500
2 10000
1000 4 20000
a) 8 1000 40000
16 d)
500
500
0
1 2 3 4 0
Speedup 4 5 10 15
x 10
5 Speedup
5
4
4
b) 3
e) 3
2
2
1
1 2 3 4 1
Effeciency 4 5 10 15
x 10
0.8 Effeciency
1
0.6
c) 0.4
f) 0.5
0.2
0
1 2 3 4
4 0
x 10 5 10 15
Horizontal axis: N . Horizontal axis: p.
For this particular set of boundary and initial condition, our solution does not depends
on x and
lim c(y, t) = y. (22)
t→∞
D∆t ¡ n
cn+1 n
cl−1,m + cnl+1,m + cnl,m−1 + cnl,m+1 − 4cnl,m ,
¢
l,m = cl,m + 2
(23)
∆x
14
where for simplicity we used the uniform discretizations and assuming ∆x = ∆y.
Using this scheme, to update the concentration at the next time step on a certain grid
point (l, m), we only need information about the concentration, on previous time step,
on that grid point and on the nearest neighbors l ± 1, m ± 1. The scheme is therefore
very suitable for parallel computing.
∆t 1
D ≤ . (24)
∆x2 4
N2
Tp = τcalc + Tcomm
p
N2
= τcalc + 2(τsetup + N τexchange ), (26)
p
1
Sp = 1 2 τsetup 2 τexchange , (27)
p + N 2 τcalc + N τcalc
1
Ep = 2p τsetup 2p τexchange . (28)
1+ N 2 τcalc + N τcalc
Now, consider the block wise decomposition. The difference from the above method
√
is that the current process exchanges N/ p boundary points with each of 4 neighbors.
15
Therefore, the communication time is slightly different, that is,
N2
Tp = τcalc + Tcomm
p
N2 N
= τcalc + 4(τsetup + √ τexchange ), (29)
p p
1
Sp = 1 4 τsetup 4 τ , (30)
√ exchange
p + N 2 τcalc + N p τcalc
1
Ep = 4p τsetup
√
4 p τexchange
. (31)
1+ N 2 τcalc + N τcalc
In term of the communication time, the block wise decomposition is more efficient
√
than the other when p is large (i.e. 4N/ p < 2N ). However implementing the block
wise decomposition method requires a bit more complicated. For that reason, we take
the trip wise decomposition and implement in C using MPI library.
5.4 Implementation
In this section we discuss in details the algorithm using MPI specification.
As always, the environment is initialized using routines MPI Init, MPI Comm size
and MPI Comm rank. To create the virtual topology we use MPI Cart create. The
coordinate as well as the rank ID of the current processor are found by MPI Cart get
and MPI Cart rank. Because of the periodic boundary condition in the x-direction,
we use a decomposition in the y-direction and use the ring topology. In this way, each
processor is responsible for a part of vertical axis (along the whole horizontal axis).
For convenient with the algorithm given in Section 4.4, we use again left and right
to denote the rank ID of the upper processor and the lower processor. These rank IDs
are easily found by MPI Cart shift routine. The data decomposition, i.e. npoints
and offset, are similar to what we had in the previous sections.
In the beginning, the local solution vector curval, which is dynamically allocated,
is initialized according to (19) and (21). Then, in each iteration we perform one time
integration compute(). To update the (local) solution vector, the current processor
exchanges two arrays of length N with left and with right. Again, this is done
by using MPI Sendrecv call. The updating process is accompanied by considering
the boundary condition in the y-direction and the periodic boundary condition in the
x-direction. The full code is as follows:
void compute(void) {
MPI_Status status;
int i,j;
double l_point[N], r_point[N], l_neighbor[N], r_neighbor[N];
double l, r, u, d, m;
16
}
/* Send r_neighbor to process on right while receiving into l_point
from process on left. Then, send the l_neighbor to process on left,
while receiving into r_point from the process on right. */
MPI_Sendrecv(&r_neighbor[0], N, MPI_DOUBLE, right, rank,
&l_point[0],N, MPI_DOUBLE, left, left, ring, &status);
MPI_Sendrecv(&l_neighbor[0], N, MPI_DOUBLE, left, rank,
&r_point[0],N, MPI_DOUBLE, right, right, ring, &status);
for (j=0; j<N; j++){
for (i=0; i <npoints; i++){
if (offset+i == GlbRight){
*(newval+index(i,j)) = 1.0; /*Boundary condition in y-direction */
}else{
if (offset+i == GlbLeft){
*(newval+index(i,j)) = 0.0;
}else{
u = (i==0) ? l_point[j] : *(curval+ index(i-1,j));
m = *(curval+index(i,j));
d = (i==npoints-1) ? r_point[j] : *(curval+index(i+1,j));
l = (j==0) ? *(curval+index(i,GlbRight-1)) : *(curval+index(i,j-1));
/* (because of periodicity in x-direction) */
r = (j==N-1) ? *(curval+index(i,GlbLeft+1)) : *(curval+index(i,j+1));
*(newval+index(i,j)) = *(curval+index(i,j)) +
D*dt/(dx*dx)*(l+r+u+d - 4.0*m);
}
}
}
}
/* Update the arrays */
for (j=0; j<N; j++){
for (i=0; i <npoints; i++){
*(curval+index(i,j)) = *(newval+index(i,j));
}
}
}
To analyze the data later, the output is written to a file after every ITV iterations.
It is shown that for our particular initial value, our solution does not depend on x.
Therefore, one can take a column from the concentration field, in the current process.
Then, that vector, which is denoted by subvector, is collected into the root using
MPI Gather. The output is printed to a file there. Although this I/O operator will
also induce overhead, it has not the negative effect on the efficiency of our parallel
program. In fact, we have tested the program with and without that I/O operator. It
turns out that the efficiency does not change, more or less. For example, for ∆x = 0.01,
if we do not use I/O operator then E2 = 0.38, E4 = 0.34, while if we use I/O operator
then E2 = 0.39, E4 = 0.33.
17
5.5 Results and discussion
As mentioned in (22), the numerical solution c tends to its exact value, i.e. y.
The experiment shows that after a time interval T = 10 the numerical solution
(which is written in Assignment3.txt file) differs from the exact value a very small
amount ||c − y||∞ = 2.38 · 10−8 (in my labtop, ||c − y||∞ = 2.75 · 10−14 ). Here ∆t =
0.0005, ∆x = 0.05 and D = 1.0 are used. This result implies the correctness of our
parallel simulation.
We now have a look at the performance of the algorithm. The results, which are
presented in Figure 5 show that the execution time increases if N increases (see Figure
5a) while its reduction follows the increment of p (Figure 5d). This is already predicted
in (26) and no longer surprise.
Execution time
Execution time
3000 2500
1
32
2 2000 64
2000 4
128
a) 8
d)
1500
256
16
1000 1000 512
1024
500
0 0
200 400 600 800 1000 5 10 15
Speedup
Speedup
15 15
10 10
b) e)
5 5
0 0
200 400 600 800 1000 5 10 15
Effeciency Effeciency
1 1
c)
0.5 f) 0.5
0 0
200 400 600 800 1000 5 10 15
Horizontal axis: N . Horizontal axis: p.
Similar to what discussed in Section 4.5, Figures 5b and 5e show that the speedup
increases if N or p increases (see (27)). Also, Figures 5c and 5f imply that p =
2 and p are the good choice for our problem because they give a higher efficiency.
Compare to the efficiency obtained in Section 4.5, the results of our problem seem
18
to be more efficient. This might due to the facts that: i) We run the program in
two different machine systems (Gene and Beowulf) which have different hardware
parameters τcomm /τcalc ; ii) If τsetup dominates τexchange then the higher efficiency
comes from factor N 2 (compare (17) and (28)).
maximum time step is allowed. This suggests that the efficient methods are needed to
solve system (33). In the coming sections, we will present the three popular methods to
solve that system. The general idea of these methods is to solve the system iteratively.
19
Exchange the boundary points (the most upper and most
lower rows) in the current processor with the neighboring
processors;
For each point in the computational domain of the current
processor
- Update the value according to the boundary con-
ditions and Jacobi iteration;
- The LocalError is recalculated if the difference
between the old and the new values is not less
than σ;
EndFor
EndFor
Take the GlobalError as the maximum of all local errors which calcu-
lated in all processors;
EndWhile.
Here the local error (LocErr) is not necessary to be recalculated at every point in
the computational domain because we do care about the points where the stopping
criterion does not satisfy. Roughly, we may say LocErr is an indication of whether
or not the stopping criterion is valid in the current computational domain. This
modification therefore saves a bit computation time.
Next, the local errors reduce to the global error (GlbErr) once after a number
of error evaluations NEE, which is chosen so that it is small enough compared to the
expected number of iterations niter. Even though it then takes about NEE extra
iterations to get the required stopping criterion, the amount of communication are
reduced because the communication seems to be more expensive than the computation.
To support this, the parallel program is tested with two examples of NEE=1 and
NEE=10, while N = 20 is unchanged. It turns out that for NEE=1, niter=1225,
error= 2.39 · 10−5 , T4 = 2.99 and T8 = 4.90 (both in seconds); while for NEE=10,
niter=1240, error= 2.16 · 10−5 , T4 = 1.92 and T8 = 2.39 (seconds). The error in
the later example is slightly smaller than the former since the program is iterated in
a larger number of iterations. The execution time in the second example is indeed
smaller than in the first example.
The C code implementing the above algorithm is as follows:
GlbErr = sigma; iter = 0;
while (GlbErr >= sigma){
iter = iter +1;
LocErr = 0.0;
for (i=1; i<=NEE; i++) compute();
MPI_Allreduce (&LocErr, &GlbErr, 1, MPI_DOUBLE, MPI_MAX, ring);
}
if (rank==0) printf("Number of iterations %d\n",iter*NEE);
where the routine compute is almost the same as the code given in Section 5.4 (there
we set D∆t/∆x2 = 1/4), except that inside the updating-solution loop we add the
following commands
20
new = *(newval+index(i,j));
old = *(curval+index(i,j));
if (fabs(new-old) >= sigma) LocErr = fabs(new-old);
Having the above stopping criterion based on the certain tolerance σ, how does this
tolerance influence on the number of iterations as well as the quality of the solution,
which is measured by error = ||c − y||∞ ? The number of iterations and the quality of
the solution as function of σ are plotted in Figure 6.
4
0 x 10
10 3
N=40
N=80
−1
10 2.5
Number of iterations
−2
10 2
Error
−3
10 1.5
−4
10 1
−5
10 0.5
−6
10 −2 −4 −6 −8 0
10 10 10 10 2 3 4 5 6 7 8
σ −log(σ)
It is shown in Figure 6a that the error linearly depends on the tolerance used. Here
N = 40 is used.
The results in Figure 6b shows a linear dependence between the number of iteration
and log(σ) (the iteration profile is just a constant line). Here, two grid sizes N = 40
and N = 80 are used. The result suggests an N 2 dependence. As you can see from
Figure 6b, the algorithm using N = 80 requires approximately four times as many
number of iterations as using N = 40, while N is just doubled. This N 2 dependence
was also found from mathematics analysis (see [5, 4] and references therein). Finally,
we note that the number of iterations seems to be much larger than number of grid
points N 2 , e.g. for N = 80 niter= 25630 ≫ 6400 = 802 . Therefore, the direct
methods are preferred to solve our problem. In conclusion, Jacobi iteration should be
used as an example, a stepping stone of other efficient iterative methods, but is not
used in real applications.
Unlike Jacobi iteration, we no longer need two arrays to store the old results and the
new results, here we just use only one array curval and compute in place. This saves
21
an enormous amount of memory if N is large.
According to theory, it turns out that the Gauss-Seidel iterative method requires a
factor of two iterations less than Jacobi (see [5]). Also it is stated that we need to use
a special ordering of the unknowns on the grid to obtain an efficient algorithm because
the standard Gauss-Seidel algorithm is inherently serial. The simplest version of this
is the Red-Black algorithm.
The key idea is is to group the grid points into two groups, identified as RED and
BLACK. In our grid, the grid point (l, m) is colored as RED if l + m is even, as BLACK if
l + m is odd. Observe that the red nodes are surrounded and updated by black nodes
and vice-versa. Thus, all the red nodes are independent from each other. The same
thing holds for the black nodes. In term of parallelism, we can now first update all
red points in parallel, followed by a parallel update of all black points.
The computation is now split into two parts. Before each part, a communication
with neighboring processors is need. For instance, to update the red points, only the
black boundary points need to be exchanged. Therefore in each computation part, the
processor exchanges only a half of boundary points. This means that in comparison
with Jacobi iteration, we only double the setup times required for exchange operators,
but using the same sending times constant.
Similar to the parallel Jacobi algorithm, in the Gauss-Seidel version, a small mod-
ification is needed according to the splitting of the computation parts. That is,
k = 0; GlobalError = σ;
While (GlobalError ≥ σ)
k = k + 1;
LocalError = 0.0;
For i = 1 to NumberOfErrorEvaluations do
Exchange the black boundary points with the neighboring
processors;
For each red point in the current processor
- Update the value according to the boundary con-
ditions and Gauss-Seidel iteration;
- Recalculate LocalError if necessary;
EndFor
Exchange the red boundary points with the neighboring pro-
cessors;
For each black point in the current processor
- Update the value according to the boundary con-
ditions and Gauss-Seidel iteration;
- Recalculate LocalError if necessary;
EndFor
EndFor
Take the GlobalError as the maximum of all local errors;
EndWhile.
To have a short C code, in the program, we have defined RED=0, BLACK=1. The
variable color can be either RED or BLACK. In the computation domain of current
22
processor, a point (i,j) is colored as red if (offset+i)%2=j%2, while it is black if
(offset+i)%2=1-j%2. Then, if we want to update all points which are colored by
color, we need to consider whether or not some point (offset+i,j) has the expected
color. Thus, if j=((offset+i)+ color)%2 then that point has the color color (i.e.
will be updated), while if j=((offset+i)+(1-color))%2 then that point has a differed
color (i.e. will be used for updating). In this way, the two computation parts can be
implemented by one code (see compute()). In the main routine we then can call the
following set of commands
GlbErr = sigma; iter = 0; color = RED;
while (GlbErr >= sigma){
iter = iter +1;
LocErr = 0.0;
for (i=1; i<=NEE; i++){
for (j=1; j<=2; j++){
compute();
color = 1-color;
}
}
MPI_Allreduce (&LocErr, &GlbErr, 1, MPI_DOUBLE, MPI_MAX, ring);
}
if (rank==0) printf("Number of iterations %d\n",iter*NEE);
To demonstrate the above discussed advantages of Gauss-Seidel iteration, we have
taken N = 40 and measured the number of iterations needed for Gauss-Seidel and
compared to Jacobi. Number of iterations as function of the stopping criterion is
plotted in Figure 7a. The results shows that to obtain the same desired accuracy
the Gauss-Seidel iteration requires almost a factor of two iterations less than Jacobi.
Moreover, based on the results plotted in Figure 7b, we observe that when solving for
the same mesh size, the Gauss-Seidel method converges with half as many iterations
as needed for the Jacobi method. These results are closed to what predicted by theory.
Together with the results shown in Table 1, in which Gauss-Seidel iteration seems to be
4
x 10
8000 3
Jacobi Jacobi
Gauss−Seidel Gauss−Seidel
7000
2.5
6000
Number of iterations
Number of iterations
2
5000
4000 1.5
3000
1
2000
0.5
1000
0 0
2 3 4 5 6 7 8 10 20 30 40 50 60 70 80
−log(σ) Mesh points
Figure 7: Number of iterations for Jacobi and Gauss-Seidel as function of the stopping
criterion and of the problem size.
twice as more accurate as Jacobi iteration, we conclude that the parallelism available
23
in Jacobi iteration is now completely destroyed by the Gauss-Seidel iteration.
As always, it is important to have a real feeling about the complexity of the prob-
lem. For that reason, we plot in Figure 8 the execution time, the speedup and the
efficiency. This figure is closed to the similar figures in the previous sections. For ex-
0 0
200 400 600 800 1000 5 10 15
Speedup Speedup
10 10
b) 5 e) 5
0 0
200 400 600 800 1000 5 10 15
Effeciency Effeciency
1 1
c) 0.5
f) 0.5
0 0
200 400 600 800 1000 5 10 15
Horizontal axis: N . Horizontal axis: p.
ample, the results show that we obtain a high speedup and a high efficiency, especially
when N is large and p ≤ 8 (see Figures 8a, 8b and 8c) and increasing the number of
processors will increase the difficulty, both in term of execution time and the efficiency.
We therefore refer to the previous sections for more detail discussion.
24
6.4 Successive Over Relaxation (SOR) iteration
In this section we deal with a correction version of Gauss-Seidel method, that is, by
mixing the Gauss-Seidel result with the current value, i.e.
ω ³ k+1 ´
ck+1
l,m = c + ck
l+1,m + ck+1
+ ck k
l,m+1 + (1 − ω)cl,m , (37)
4 l−1,m l,m−1
5000
Error
−4
4000 10
3000
−6
2000 10
1000
−8
0 10 −2 −4 −6 −8
2 3 4 5 6 7 8 10 10 10 10
−log(σ) σ
Figure 9: The number of iterations and the errors for the Jacobi, Gauss-Seidel and
SOR methods as function of the stopping criterion.
At the first sight, we observe from this figure that the method depends strongly
on the parameter ω, e.g. by comparing the number-of-iteration curves in the two
cases of ω = 1.6 and ω = 1.9. Furthermore, from Figure 9b we find in all methods a
linear relationship between the stoping criterion σ and the error. Then, the problem is
indeed enormously improved when using SOR, especially with ω = 1.9. The number
of iterations in this case is about six times smaller than that in Gauss-Seidel iteration
(see Figure 9a). Moreover, from Figure 9b we observe that error in SOR is much
smaller than in Jacobi and Gauss-Seidel. This further improves the efficiency of SOR.
For example, if we would like to have a maximal error (in L∞ norm) of 10−3 , SOR
requires σ = 10−3 , while Jacobi and Gauss-Seidel require σ = 10−6 . We now fully
understand why SOR is a practically useful iterative method and is applied regularly.
25
References
[1] G. Golub, J.M. Ortega, Scientific Computing: An Introduction with Parallel Com-
puting, Academic Press, 1993.
26