Roman Trobec · Boštjan Slivnik 
Patricio Bulić · Borut Robič
Introduction
to Parallel
Computing
From Algorithms to Programming on
State-of-the-Art Platforms
Undergraduate Topics in Computer
Science
Series editor
Ian Mackie
Advisory Board
Samson Abramsky, University of Oxford, Oxford, UK
Chris Hankin, Imperial College London, London, UK
Mike Hinchey, University of Limerick, Limerick, Ireland
Dexter C. Kozen, Cornell University, Ithaca, USA
Andrew Pitts, University of Cambridge, Cambridge, UK
Hanne Riis Nielson, Technical University of Denmark, Kongens Lyngby, Denmark
Steven S. Skiena, Stony Brook University, Stony Brook, USA
Iain Stewart, University of Durham, Durham, UK
Undergraduate Topics in Computer Science (UTiCS) delivers high-quality
instructional content for undergraduates studying in all areas of computing and
information science. From core foundational and theoretical material to final-year
topics and applications, UTiCS books take a fresh, concise, and modern approach
and are ideal for self-study or for a one- or two-semester course. The texts are all
authored by established experts in their fields, reviewed by an international advisory
board, and contain numerous examples and problems. Many include fully worked
solutions.
Roman Trobec, Jožef Stefan Institute, Ljubljana, Slovenia
Patricio Bulić, University of Ljubljana, Ljubljana, Slovenia
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To all who make our lives worthwhile.
Preface
This monograph is an overview of practical parallel computing and starts with the
basic principles and rules which will enable the reader to design efficient parallel
programs for solving various computational problems on the state-of-the-art com-
puting platforms.
   The book, too, was written in parallel. The opening Chap. 1: “Why Do We Need
Parallel Programming” was shaped by all of us through instant communication
immediately after the idea of writing such a book came up. In fact, the first
chapter was an important motivation for our joint work. We spared no effort in
incorporating our teaching experience into this book.
   The book consists of three parts: Foundations, Programming, and Engineering,
each with a specific focus:
• Part I, Foundations, provides the motivation for embarking on a study of
  parallel computation (Chap. 1) and an introduction to parallel computing
  (Chap. 2) that covers parallel computer systems, the role of communication,
  complexity of parallel problem-solving, and the associated principles and laws.
• Part II, Programming, first discusses shared memory platforms and OpenMP
  (Chap. 3), then proceeds to message passing library (Chap. 4), and finally to
  massively parallel processors (Chap. 5). Each chapter describes the methodol-
  ogy and practical examples for immediate work on a personal computer.
• Part III, Engineering, illustrates the parallel solution of computational problems
  on three selected examples from three fields: Computing the number π (Chap. 6)
  from mathematics, Solving the heat equation (Chap. 7) from physics, and Seam
  carving (Chap. 8) from computer science. The book concludes with some final
  remarks and perspectives (Chap. 9).
   To enable readers to immediately start gaining practice in parallel computing,
Appendix A provides hints for making a personal computer ready to execute
parallel programs under Linux, macOS, and MS Windows.
   Specific contributions of the authors are as follows:
• Roman Trobec initiated the idea of writing a practical textbook, useful for
  students and programmers at both basic and advanced levels. He has contributed
  Chap. 4: “MPI Processes and Messaging”, Chap. 9: “Final Remarks and
  Perspectives”, and to chapters of Part III.
Part I Foundations
1 Why Do We Need Parallel Programming
  1.1 Why—Every Computer Is a Parallel Computer
  1.2 How—There Are Three Prevailing Types of Parallelism
  1.3 What—Time-Consuming Computations Can Be Sped up
  1.4 And This Book—Why Would You Read It?
2 Overview of Parallel Systems
  2.1 History of Parallel Computing, Systems and Programming
  2.2 Modeling Parallel Computation
  2.3 Multiprocessor Models
    2.3.1 The Parallel Random Access Machine
    2.3.2 The Local-Memory Machine
    2.3.3 The Memory-Module Machine
  2.4 The Impact of Communication
    2.4.1 Interconnection Networks
    2.4.2 Basic Properties of Interconnection Networks
    2.4.3 Classification of Interconnection Networks
    2.4.4 Topologies of Interconnection Networks
  2.5 Parallel Computational Complexity
    2.5.1 Problem Instances and Their Sizes
    2.5.2 Number of Processing Units Versus Size of Problem Instances
    2.5.3 The Class NC of Efficiently Parallelizable Problems
  2.6 Laws and Theorems of Parallel Computation
    2.6.1 Brent’s Theorem
    2.6.2 Amdahl’s Law
  2.7 Exercises
  2.8 Bibliographical Notes
Part II Programming
3 Programming Multi-core and Shared Memory Multiprocessors Using OpenMP
  3.1 Shared Memory Programming Model
  3.2 Using OpenMP to Write Multithreaded Programs
    3.2.1 Compiling and Running an OpenMP Program
    3.2.2 Monitoring an OpenMP Program
  3.3 Parallelization of Loops
    3.3.1 Parallelizing Loops with Independent Iterations
    3.3.2 Combining the Results of Parallel Iterations
    3.3.3 Distributing Iterations Among Threads
    3.3.4 The Details of Parallel Loops and Reductions
  3.4 Parallel Tasks
    3.4.1 Running Independent Tasks in Parallel
    3.4.2 Combining the Results of Parallel Tasks
  3.5 Exercises and Mini Projects
  3.6 Bibliographic Notes
4 MPI Processes and Messaging
  4.1 Distributed Memory Computers Can Execute in Parallel
  4.2 Programmer’s View
  4.3 Message Passing Interface
    4.3.1 MPI Operation Syntax
    4.3.2 MPI Data Types
    4.3.3 MPI Error Handling
    4.3.4 Make Your Computer Ready for Using MPI
    4.3.5 Running and Configuring MPI Processes
  4.4 Basic MPI Operations
    4.4.1 MPI_INIT (int *argc, char ***argv)
    4.4.2 MPI_FINALIZE ()
    4.4.3 MPI_COMM_SIZE (comm, size)
    4.4.4 MPI_COMM_RANK (comm, rank)
  4.5 Process-to-Process Communication
    4.5.1 MPI_SEND (buf, count, datatype, dest, tag, comm)
    4.5.2 MPI_RECV (buf, count, datatype, source, tag, comm, status)
    4.5.3 MPI_SENDRECV (sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status)
    4.5.4 Measuring Performances
In Part I, we first provide the motivation for delving into the realm of parallel com-
putation and especially of parallel programming. There are several reasons for doing
so: first, our computers are already parallel; secondly, parallelism can be of great
practical value when it comes to solving computationally demanding problems from
various areas; and finally, there is inertia in the design of contemporary computers
which keeps parallelism a key ingredient of future computers.
   The second chapter provides an introduction to parallel computing. It describes
different parallel computer systems and formal models for describing such systems.
Then, various patterns for interconnecting processors and memories are described
and the role of communication is emphasized. All these issues have impact on the
execution time required to solve computational problems. Thus, we introduce the
necessary topics of the parallel computational complexity. Finally, we present some
laws and principles that govern parallel computation.
1 Why Do We Need Parallel Programming
Chapter Summary
The aim of this chapter is to give a motivation for the study of parallel computing and
in particular parallel programming. Contemporary computers are parallel, and there
are various reasons for that. Parallelism comes in three different prevailing types
which share common underlying principles. Most importantly, parallelism can help
us solve demanding computational problems.
Nowadays, all computers are essentially parallel. This means that within every oper-
ating computer there always exist various activities which, one way or another, run
in parallel, at the same time. Parallel activities may arise and come to an end inde-
pendently of each other—or, they may be created purposely to involve simultaneous
performance of various operations whose interplay will eventually lead to the desired
result. Informally, parallelism is the existence of parallel activities within a
computer and their use in achieving a common goal. Parallelism is found on all levels
of a modern computer’s architecture:
During the last decades, many different parallel computing systems appeared on
the market. At first, they were sold as supercomputers dedicated to solving
specific scientific problems. Perhaps the best known are the computers made by Cray
and the Connection Machine series made by Thinking Machines Corporation. But as
mentioned above, parallelism has spread all the way down into the consumer market
and all kinds of handheld devices.
   Various parallel solutions gradually evolved into modern parallel systems that
exhibit at least one of the three prevailing types of parallelism:
• First, shared memory systems, i.e., systems with multiple processing units
  attached to a single memory.
• Second, distributed systems, i.e., systems consisting of many computer units,
  each with its own processing unit and its physical memory, that are connected
  with fast interconnection networks.
• Third, graphics processing units (GPUs) used as co-processors for solving
  general-purpose numerically intensive problems.
   Apart from the parallel computer systems that have become ubiquitous, extremely
powerful supercomputers continue to dominate the parallel computing achieve-
ments. Supercomputers can be found on the Top 500 list of the fastest computer
systems ever built and even today they are the joy and pride of the world superpow-
ers.
   But the underlying principles of parallel computing are the same regardless of
whether the top supercomputers or consumer devices are being programmed. The
programming principles and techniques gradually evolved during all these years.
Nevertheless, the design of parallel algorithms and parallel programming are still
considered to be an order of magnitude harder than the design of sequential algo-
rithms and sequential-program development.
   Corresponding to the three types of parallelism introduced above, there are three
different approaches to parallel programming: the threads model for shared memory
systems, the message passing model for distributed systems, and the stream-based
model for GPUs.
To see how parallelism can help you solve problems, it is best to look at examples.
In this section, we will briefly discuss the so-called n-body problem.
The n-body problem. The classical n-body problem is the problem of predicting the
individual motions of a group of objects that interact with each other by gravitation.
Here is a more accurate statement of the problem:
   While the classical n-body problem was motivated by the desire to understand
the motions of the Sun, Moon, planets, and the visible stars, it is nowadays used to
comprehend the dynamics of globular cluster star systems. In this case, the usual
Newtonian mechanics, which governs the motion of bodies, must be replaced by
Einstein’s general theory of relativity, which makes the problem even more difficult.
We will, therefore, refrain from dealing with this version of the problem and focus
on the classical version as introduced above and on the way it is solved on a parallel
computer.
   So how can we solve a given classical n-body problem? Let us first describe in
what form we expect the solution of the problem. As mentioned above, the classical
n-body problem assumes classical Newtonian mechanics, which we all learned in
school. Using Newtonian mechanics, a given instance of the n-body problem is described as
a particular system of 6n differential equations that, for each of n bodies, define its
location (x(t), y(t), z(t)) and momentum $(mv_x(t), mv_y(t), mv_z(t))$ at an instant t.
The solution of this system is the sought-for description of the evolution of the
n-body system at hand. Thus, the question of solvability of a particular classical
n-body problem boils down to the question of solvability of the associated system of
differential equations that are finally transformed into a system of linear equations.
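As an illustration of where the count 6n comes from, the system can be written in the standard Newtonian form below. This is a sketch in notation that is not the book's own: $\mathbf{r}_i$, $\mathbf{p}_i$ and $m_i$ are assumed symbols for the position, momentum and mass of body $i$, and $G$ is the gravitational constant.

```latex
\frac{d\mathbf{r}_i}{dt} = \frac{\mathbf{p}_i}{m_i},
\qquad
\frac{d\mathbf{p}_i}{dt} = \sum_{j \neq i}
  \frac{G\, m_i m_j \,(\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^{3}},
\qquad i = 1, \ldots, n .
```

Each of the two vector equations has three components, so every body contributes six scalar equations and the whole system has 6n of them.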
   Today, we know that
• If n = 2, the classical n-body problem always has an analytical solution, simply
   because the associated system of equations has an analytical solution.
• If n > 2, analytical solutions exist only for certain initial configurations of the n bodies.
• In general, however, n-body problems cannot be solved analytically.
   It follows that, in general, the n-body problem must be solved numerically, using
appropriate numerical methods for solving systems of differential equations.
   Can we always succeed in this? The numerical methods numerically integrate the
differential equations of motion. To obtain the solution, such methods require time
which grows proportionally to $n^2$. We say that the methods have time complexity of
the order $O(n^2)$. At first sight, this seems to be rather promising; however, there is
a large hidden factor in this $O(n^2)$. Because of this factor, only the instances of the
n-body problem with small values of n can be solved using these numerical methods.
To extend solvability to larger values of n, methods with smaller time complexity
must be found. One such is the Barnes–Hut method with time complexity $O(n \log n)$.
But, again, only the instances with limited (though larger) values of n can be solved.
For large values of n, numerical methods become prohibitively time-consuming.
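The quadratic growth can be seen in a direct-summation sketch of one force evaluation. The code below is only an illustration and not the method used in the book: the function names `accelerations` and `pair_interactions` are hypothetical, SI units are assumed, and no two bodies are assumed to share a position.

```python
import math

G = 6.674e-11  # gravitational constant in SI units (assumed unit system)


def accelerations(masses, positions):
    """Return the gravitational acceleration of every body by direct summation.

    Hypothetical helper: every body i visits every other body j, which is
    the source of the n * (n - 1) pair interactions per evaluation.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            r = math.sqrt(sum(c * c for c in d))
            factor = G * masses[j] / r**3
            for k in range(3):
                acc[i][k] += factor * d[k]
    return acc


def pair_interactions(n):
    """Number of ordered pairs (i, j), i != j, visited by accelerations()."""
    return n * (n - 1)
```

Doubling n roughly quadruples `pair_interactions(n)`, and a numerical integrator has to repeat such an evaluation at every time step.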
   Unfortunately, the values of n are in practice usually very large. Actually, they are
too large for the abovementioned numerical methods to be of any practical value.
   What can we do in this situation? Well, at this point, parallel computation enters
the stage. The numerical methods which we use for solving systems of differential
equations associated with the n-body problem are usually programmed for single-
processor computers. But if we have at our disposal a parallel computer with many
processors, it is natural to consider using all of them so that they collaborate and
jointly solve systems of differential equations. To achieve that, however, we must
answer several nontrivial questions: (i) How can we partition a given numerical
method into subtasks? (ii) Which subtasks should each processor perform? (iii) How
should each processor collaborate with other processors? And then, of course, (iv)
How will we code all of these answers in the form of a parallel program, a program
capable of running on the parallel computer and exploiting its resources?
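Questions (i) to (iii) can be made concrete with a very small block-partitioning sketch. It is an assumption-laden illustration, not the scheme used later in the book: `partition` and `run_in_blocks` are hypothetical names, the per-body work is passed in as an arbitrary function, and the workers cooperate only when their partial results are gathered at the end.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(n, p):
    """Split indices 0..n-1 into p contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(n, p)
    blocks, start = [], 0
    for w in range(p):
        size = base + (1 if w < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


def run_in_blocks(work, n, p):
    """Apply work(i) to every index i, giving block w to worker w.

    The subtasks are the blocks (question i), worker w handles block w
    (question ii), and the only collaboration is the final gathering of
    the partial result lists in block order (question iii).
    """
    blocks = partition(n, p)

    def run_block(block):
        return [work(i) for i in block]

    with ThreadPoolExecutor(max_workers=p) as pool:
        parts = list(pool.map(run_block, blocks))  # map keeps block order
    return [value for part in parts for value in part]
```

Threads are used here only to show the decomposition; in CPython they do not by themselves speed up CPU-bound work, which is one reason the later chapters turn to OpenMP, MPI, and GPUs.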
   The above questions are not easy, to be sure, but parallel algorithms have been
designed for the above numerical methods, and parallel programs have been written
that implement these algorithms on different parallel computers. For example, J. Dubinsky
et al. designed a parallel Barnes–Hut algorithm and parallel program which divides
the n-body system into independent rectangular volumes each of which is mapped
to a processor of a parallel computer. The parallel program was able to simulate
evolution of n-body systems consisting of n = 640,000 to n = 1,000,000 bodies. It
turned out that, for such systems, the optimal number of processing units was 64.
At that number, the processors were best load-balanced and communication between
them was minimal.
We believe that this book could provide the first step toward attaining
the ability to efficiently solve, on a parallel computer, not only the n-body problem
but also many other computational problems from a myriad of scientific and applied
fields whose high computational and/or data complexities make them virtually
intractable even on the fastest sequential computers.
2 Overview of Parallel Systems
Chapter Summary
In this chapter we overview the most important basic notions, concepts, and
theoretical results concerning parallel computation. We describe three basic models
of parallel computation, then focus on various topologies for interconnection of
parallel computer nodes. After a brief introduction to the analysis of parallel
computation complexity, we finally explain two important laws of parallel computation,
Amdahl’s law and Brent’s theorem.
1 There  are also other divisions that partition the class of all algorithms according to other criteria,
such as exact and non-exact algorithms; or deterministic and non-deterministic algorithms. However,
in this book we will not divide algorithms systematically according to these criteria.
© Springer Nature Switzerland AG 2018
R. Trobec et al., Introduction to Parallel Computing,
Undergraduate Topics in Computer Science,
https://doi.org/10.1007/978-3-319-98833-7_2
Tpar .
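Alternatively, we might choose the “performance” to mean how many times the
parallel execution of P on C(p) is faster than the sequential execution of P; this is
called the speedup of P on C(p),
$$S \overset{\text{def}}{=} \frac{T_{seq}}{T_{par}}.$$
Dividing the speedup by the number p of processing units gives the efficiency of P
on C(p),
$$E \overset{\text{def}}{=} \frac{S}{p}.$$
Since $T_{par} \le T_{seq} \le p \cdot T_{par}$, it follows that speedup is bounded above by p,
$$S \le p,$$
and efficiency is bounded above by
$$E \le 1.$$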
This means that, for any C and p, the parallel execution of P on C(p) can be at most
p times faster than the execution of P on a single processor. And the efficiency of the
parallel execution of P on C(p) can be at most 1. (This is when each processing unit
is continually engaged in the execution of P, thus contributing a $\frac{1}{p}$-th to its speedup.)
Later, in Sect. 2.5, we will add one more parameter to these definitions.
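A small numerical sketch may help to fix the two notions, taking speedup as the ratio of sequential to parallel execution time and efficiency as speedup divided by the number p of processing units. The function names and the timings used below are made up for illustration.

```python
def speedup(t_seq, t_par):
    """Speedup S: how many times the parallel run is faster than the sequential one."""
    return t_seq / t_par


def efficiency(t_seq, t_par, p):
    """Efficiency E: speedup divided by the number p of processing units."""
    return speedup(t_seq, t_par) / p


# Hypothetical measurement: 120 s sequentially, 20 s on p = 8 processing units.
s = speedup(120.0, 20.0)        # 6.0, below the upper bound p = 8
e = efficiency(120.0, 20.0, 8)  # 0.75, below the upper bound 1
```

In this made-up case each of the eight units contributes, on average, three quarters of what an ideally used unit would.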
   From the above definitions we see that both speedup and efficiency depend on
$T_{par}$, the parallel execution time of P on C(p). This raises new questions:
These are important general questions about parallel computation which must be
answered prior to embarking on a practical design and analysis of parallel algorithms.
The way to answer these questions is to appropriately model parallel computation.
Parallel computers vary greatly in their organization. We will see in the next section
that their processing units may or may not be directly connected one to another; some
of the processing units may share a common memory while the others may only own
local (private) memories; the operation of the processing units may be synchronized
by a common clock, or they may run each at its own pace. Furthermore, usually there
are architectural details and hardware specifics of the components, all of which show
up during the actual design and use of a computer. And finally, there are technological
differences, which manifest in different clock rates, memory access times etc. Hence,
the following question arises:
To answer the question, we apply ideas similar to those discovered in the case of
sequential computation. There, various models of computation were discovered.2
In short, the intention of each of these models was to abstract the relevant properties
of the (sequential) computation from the irrelevant ones.
   In our case, a model called the Random Access Machine (RAM) is particularly
attractive. Why? The reason is that RAM distills the important properties of the
general-purpose sequential computers, which are still extensively used today, and
which have actually been taken as the conceptual basis for modeling of parallel
computing and parallel computers. Figure 2.1 shows the structure of the RAM.
2 Some of these models of computation are the μ-recursive functions, recursive functions, λ-calculus,
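[Fig. 2.1: structure of the RAM, showing the processing unit with registers up to $r_n$ and the memory M with locations 0, 1, 2, ...]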
• The RAM consists of a processing unit and a memory. The memory is a potentially
  infinite sequence of equally sized locations $m_0, m_1, \ldots$. The index i is called
  the address of $m_i$. Each location is directly accessible by the processing unit: given
  an arbitrary i, reading from $m_i$ or writing to $m_i$ is accomplished in constant time.
  Registers are a sequence $r_1 \ldots r_n$ of locations in the processing unit. Registers are
  directly accessible. Two of them have special roles. Program counter pc ($= r_1$)
  contains the address of the location in the memory which contains the instruction
  to be executed next. Accumulator a ($= r_2$) is involved in the execution of each
  instruction. Other registers are given roles as needed. The program is a finite
  sequence of instructions (similar to those in real computers).
• Before the RAM is started, the following is done: (a) a program is loaded into
  successive locations of the memory starting with, say, $m_0$; (b) input data are written
  into empty memory locations, say after the last program instruction.
• From now on, the RAM operates independently in a mechanical stepwise fashion
  as instructed by the program. Let pc = k at the beginning of a step. (Initially,
  k = 0.) From the location $m_k$, the instruction I is read and started. At the same
  time, pc is incremented. So, when I is completed, the next instruction to be
  executed is in $m_{k+1}$, unless I was one of the instructions that change pc (e.g.
  jump instructions).
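The stepwise operation described above can be sketched as a toy interpreter. Everything specific in it is an assumption made for illustration: the instruction set (LOAD, ADD, STORE, JUMP, JZERO, HALT), the encoding of an instruction as an (opcode, operand) pair, and the use of a finite Python list in place of the potentially infinite memory.

```python
def run_ram(program, data, max_steps=10_000):
    """Run a toy RAM and return (accumulator, memory) when HALT is reached.

    The program occupies memory locations 0, 1, ... and the input data are
    written right after the last instruction, as in the description above.
    """
    memory = list(program) + list(data)
    pc = 0   # program counter (register r1 in the text)
    acc = 0  # accumulator (register r2 in the text)
    for _ in range(max_steps):
        opcode, operand = memory[pc]  # read the instruction stored at address pc
        pc += 1                       # pc is incremented while the instruction runs
        if opcode == "LOAD":
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "JUMP":
            pc = operand              # instructions like this one change pc
        elif opcode == "JZERO":
            if acc == 0:
                pc = operand
        elif opcode == "HALT":
            return acc, memory
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    raise RuntimeError("step limit reached before HALT")


# Four instructions occupy addresses 0..3; the data 2, 3, 0 land at 4, 5, 6.
program = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", None)]
acc, memory = run_ram(program, [2, 3, 0])
```

After the run, the accumulator and location 6 both hold the sum of the two input values.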
It turned out that finding an answer to this question is substantially more challenging
than it was in the case of sequential computation. Why? Since there are many ways
to organize parallel computers, there are also many ways to model them; and what is
difficult is to select a single model that will be appropriate for all parallel computers.
The Parallel Random Access Machine, in short PRAM model, has p processing
units that are all connected to a common unbounded shared memory (Fig. 2.2). Each
processing unit can, in one step, access any location (word) in the shared memory
by issuing a memory request directly to the shared memory.
   The PRAM model of parallel computation is idealized in several respects. First,
there is no limit on the number p of processing units, except that p is finite. Next,
also idealistic is the assumption that a processing unit can access any location in the
shared memory in one single step. Finally, for words in the shared memory it is only
assumed that they are of the same size; otherwise they can be of arbitrary finite size.
   Note that in this model there is no interconnection network for transferring mem-
ory requests and data back and forth between processing units and shared memory.
(This will radically change in the other two models, the LMM (see Sect. 2.3.2) and
the MMM (see Sect. 2.3.3)).
   However, the assumption that any processing unit can access any memory location
in one step is unrealistic. To see why, suppose that processing units $P_i$ and $P_j$
3 In fact, research is currently also being pursued in other, non-conventional directions, which
do not build on the RAM or any other conventional computational model (listed in the previous
footnote). Examples are dataflow computation and quantum computation.
 (i) which sorts of simultaneous accesses to the same location are allowed; and
(ii) the way in which unpredictability is avoided when simultaneously accessing the
     same location.
• EREW-PRAM. This is the most realistic of the three variations of the PRAM
  model. The EREW-PRAM model does not support simultaneous access to the
  same memory location; if such an attempt is made, the model stops executing
  its program. Accordingly, the implicit assumption is that programs running on
  EREW-PRAM never issue instructions that would simultaneously access the same
  location; that is, any access to any memory location must be exclusive. So the
  construction of such programs is the responsibility of algorithm designers.
• CREW-PRAM. This model supports simultaneous reads from the same memory
  location but requires exclusive writes to it. Again, the burden of constructing such
  programs is on the algorithm designer.
• CRCW-PRAM. This is the least realistic of the three versions of the PRAM model.
  The CRCW-PRAM model allows simultaneous reads from the same memory loca-
  tion, simultaneous writes to the same memory location, and simultaneous reads
  from and writes to the same memory location. However, to avoid unpredictable
  effects, different additional restrictions are imposed on simultaneous writes. This
  yields the following versions of the model CRCW-PRAM:
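The access rules of the three variations can be sketched as a check on the memory accesses issued in one step. This is an illustrative sketch only: the function name `allowed_models` and the triple encoding are hypothetical, a write is assumed to be exclusive only if no other processing unit touches the same location in that step, and the additional restrictions that CRCW-PRAM imposes on simultaneous writes are not modelled.

```python
from collections import defaultdict


def allowed_models(accesses):
    """Return the PRAM variations that permit the given accesses of one step.

    accesses is an iterable of (processor_id, kind, address) triples,
    where kind is "R" for a read and "W" for a write.
    """
    readers = defaultdict(set)
    writers = defaultdict(set)
    for proc, kind, addr in accesses:
        (readers if kind == "R" else writers)[addr].add(proc)
    addresses = set(readers) | set(writers)

    # EREW: every location is touched by at most one processing unit.
    erew = all(len(readers[a] | writers[a]) <= 1 for a in addresses)
    # CREW: many readers are fine, but a written location must not be
    # touched by any other processing unit (assumed reading of "exclusive").
    crew = all(
        not writers[a] or len(readers[a] | writers[a]) == 1 for a in addresses
    )

    models = []
    if erew:
        models.append("EREW")
    if crew:
        models.append("CREW")
    models.append("CRCW")  # allows all simultaneous accesses
    return models
```

For example, two processing units reading the same location are accepted by CREW-PRAM and CRCW-PRAM but would stop an EREW-PRAM program.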
The answer is yes, but not too much. The foggy “too much” is clarified in the next
Theorem, where CRCW-PRAM( p) denotes the CRCW-PRAM with p processing
units, and similarly for the EREW-PRAM( p). Informally, the theorem tells us that by
passing from the EREW-PRAM( p) to the “more powerful” CRCW-PRAM( p) the
parallel execution time of a parallel algorithm may reduce by some factor; however,
this factor is bounded above and, indeed, it is at most of the order O(log p).
The LMM model has p processing units, each with its own local memory (Fig. 2.4).
The processing units are connected to a common interconnection network. Each
processing unit can access its own local memory directly. In contrast, it can access
a non-local memory (i.e., local memory of another processing unit) only by sending
a memory request through the interconnection network.
   The assumption is that all local operations, including accessing the local memory,
take unit time. In contrast, the time required to access a non-local memory depends
on
Fig. 2.4 The LMM model of parallel computation has p processing units, each with its local memory.
Each processing unit directly accesses its local memory and can access another processing unit’s local
memory via the interconnection network
The MMM model (Fig. 2.5) consists of p processing units and m memory modules
each of which can be accessed by any processing unit via a common interconnection
network. There are no local memories to processing units. A processing unit can
access the memory module by sending a memory request through the interconnection
network.
   It is assumed that the processing units and memory modules are arranged in such
a way that—when there are no coincident accesses—the time for any processing
unit to access any memory module is roughly uniform. However, when there are
coincident accesses, the access time depends on
We have seen that both the LMM model and the MMM model explicitly use interconnection
networks to convey memory requests to the non-local memories (see Figs. 2.4
and 2.5). In this section we focus on the role of an interconnection network in a
2.4 The Impact of Communication                                                      19
multiprocessor model and its impact on the parallel time complexity of parallel
algorithms.
Since the dawn of parallel computing, the major hallmarks of a parallel system have
been the type of the central processing unit (CPU) and the interconnection network.
This is now changing. Recent experiments have shown that execution times
of most real world parallel applications are becoming more and more dependent on
the communication time rather than on the calculation time. So, as the number of
cooperating processing units or computers increases, the performance of intercon-
nection networks is becoming more important than the performance of the processing
unit. Specifically, the interconnection network has great impact on the efficiency and
scalability of a parallel computer on most real world parallel applications. In other
words, high performance of an interconnection network may ultimately reflect in
higher speedups, because such an interconnection network can shorten the overall
parallel execution time as well as increase the number of processing units that can
be efficiently exploited.
   The performance of an interconnection network depends on several factors. Three
of the most important are the routing, the flow-control algorithms, and the network
topology. Here routing is the process of selecting a path for traffic in an interconnec-
tion network; flow control is the process of managing the rate of data transmission
between two nodes to prevent a fast sender from overwhelming a slow receiver; and
network topology is the arrangement of the various elements, such as communica-
tion nodes and channels, of an interconnection network.
   For the routing and flow-control algorithms efficient techniques are already known
and used. In contrast, network topologies haven’t been adjusting to changes in tech-
nological trends as promptly as the routing and flow-control algorithms. This is one
reason that many network topologies which were discovered soon after the very birth
of parallel computing are still being widely used. Another reason is the freedom that
end users have when they are choosing the appropriate network topology for the
anticipated usage. (Due to modern standards, there is no such freedom in picking
or altering routing or flow-control algorithms). As a consequence, a further step in
performance increase can be expected to come from the improvements in the topol-
ogy of interconnection networks. For example, such improvements should enable
interconnection networks to dynamically adapt to the current application in some
optimal way.
•    node degree,
•    regularity,
•    symmetry,
•    diameter,
•    path diversity, and
•    expansion scalability.
In the following we define each of them and give comments where appropriate:
• In an interconnection network, there may exist multiple paths between two nodes.
  In such a case, the nodes can be connected in many ways. A packet starting at a source
  node will have at its disposal multiple routes to reach the destination node. The
  packet can take different routes (or even different continuations of a traversed part
  of a route) depending on the current situation in the network. An interconnection
  network that has high path diversity offers more alternatives when packets need
  to seek their destinations and/or avoid obstacles.
• Scalability is (i) the capability of a system to handle a growing amount of work,
  or (ii) the potential of the system to be enlarged to accommodate that growth. The
  scalability is important at every level. For example, the basic building block must
  be easily connected to other blocks in a uniform way. Moreover, the same building
  block must be used to build interconnection networks of different sizes, with only
  a small performance degradation for the maximum-size parallel computer. Inter-
  connection networks have important impact on scalability of parallel computers
  that are based on the LMM or MMM multiprocessor model. To appreciate that,
  note that scalability is limited if node degree is fixed.
• channel bandwidth,
• bisection bandwidth, and
• latency.
• Channel bandwidth, in short bandwidth, is the amount of data that is, or theo-
  retically could be, communicated through a channel in a given amount of time.
  In most cases, the channel bandwidth can be adequately determined by using a
  simple model of communication which advocates that the communication time
  tcomm , needed to communicate given data through the channel, is the sum ts + td
  of the start-up time ts , needed to set up the channel’s software and hardware, and
  the data transfer time td , where td = mtw , the product of the number of words
  making up the data, m, and the transfer time per one word, tw . Then the channel
  bandwidth is 1/tw .
• A given interconnection network can be cut into two (almost) equal-sized compo-
  nents. Generally, this can be done in many ways. Given a cut of the interconnection
  network, the cut-bandwidth is the sum of channel bandwidths of all channels con-
  necting the two components. The smallest cut-bandwidth is called the bisection
  bandwidth (BBW) of the interconnection network. The corresponding cut is the
  worst-case cut of the interconnection network. Occasionally, the bisection band-
  width per node (BBWN) is needed; we define it as BBW divided by |N |, the
  number of nodes in the network. Of course, both BBW and BBWN depend on
  the topology of the network and the channel bandwidths. All in all, increasing
22                                                       2   Overview of Parallel Systems
   The transfer of data from a source node to a destination node is measured in terms
of various units which are defined as follows:
These units are closely related to the bandwidth and to the latency of the network.
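The simple communication model used above to define the channel bandwidth (communication time as the start-up time plus the per-word transfer time) can be sketched in a few lines of Python. The function names and the numeric values are hypothetical and serve only as an illustration; t_s, t_w and m stand for the start-up time, the transfer time per word and the number of words.

```python
def comm_time(m, t_s, t_w):
    """Communication time t_comm = t_s + m * t_w for a message of m words."""
    return t_s + m * t_w


def channel_bandwidth(t_w):
    """Channel bandwidth 1 / t_w, in words per time unit."""
    return 1 / t_w


if __name__ == "__main__":
    # Hypothetical channel: start-up time 5 time units, 2 time units per word.
    t_s, t_w = 5, 2
    for m in (1, 10, 100):
        t = comm_time(m, t_s, t_w)
        # The effective rate m / t approaches the channel bandwidth as m grows,
        # because the start-up time is amortized over more words.
        print(m, t, m / t, channel_bandwidth(t_w))
```

In this sketch short messages are dominated by the start-up time, which is one reason why sending a few long messages tends to be cheaper than sending many short ones.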
Mapping Interconnection Networks into Real Space
An interconnection network of any given topology, even if defined in an abstract
higher-dimensional space, eventually has to be mapped into the physical, three-
dimensional (3D) space. This means that all the chips and printed-circuit boards
making up the interconnection network must be allocated physical places.
   Unfortunately, this is not a trivial task. The reason is that mapping usually has
to optimize certain, often contradicting, criteria while at the same time respecting
various restrictions. Here are some examples:
• One such restriction is that the numbers of I/O pins per chip or per printed-
  circuit board are bounded above. A usual optimization criterion is that, in order
  to prevent the decrease of data rate, cables be as short as possible. But due to
  significant sizes of hardware components and due to physical limitations of 3D-
  space, mapping may considerably stretch certain paths, i.e., nodes that are close
  in higher-dimensional space may be mapped to distant locations in 3D-space.
• We may want to map processing units that communicate intensively as close
  together as possible, ideally on the same chip. In this way we may minimize
  the impact of communication. Unfortunately, the construction of such optimal
  mappings is an NP-hard optimization problem.
• An additional criterion may be that the power consumption is minimized.
Interconnection networks can be classified into direct and indirect networks. Here
are the main properties of each kind.
Direct Networks
A network is said to be direct when each node is directly connected to its neighbors.
How many neighbors can a node have? In a fully connected network, each of the
n = |N | nodes is directly connected to all the other nodes, so each node has n − 1
neighbors. (See Fig. 2.6).
   Since such a network has $\frac{1}{2}n(n-1) = \Theta(n^2)$ direct connections, it can only be
used for building systems with small numbers n of nodes. When n is large, each node
is directly connected to a proper subset of other nodes, while the communication to
the remaining nodes is achieved by routing messages through intermediate nodes.
An example of such a direct interconnection network is the hypercube; see Fig. 2.13
on p. 28.
Indirect Networks
An indirect network connects the nodes through switches. Usually, it connects pro-
cessing units on one end of the network and memory modules on the other end of the
network. The simplest circuit for connecting processing units to memory modules is
the fully connected crossbar switch (Fig. 2.7). Its advantage is that it can establish
a connection between processing units and memory modules in an arbitrary way.
   At each intersection of a horizontal and vertical line is a crosspoint. A crosspoint
is a small switch that can be electrically opened (◦) or closed (•), depending on
whether the horizontal and vertical lines are to be connected or not. In Fig. 2.7 we
see eight crosspoints closed simultaneously, allowing connections between the pairs
(P1 , M1 ), (P2 , M3 ), (P3 , M5 ), (P4 , M4 ), (P5 , M2 ), (P6 , M6 ), (P7 , M8 ) and (P8 , M7 ) at
the same time. Many other combinations are also possible.
   Unfortunately, the fully connected crossbar is too complex to be used
for connecting large numbers of input and output ports. Specifically, the number of
crosspoints grows as pm, where p and m are the numbers of processing units and
memory modules, respectively. For p = m = 1000 this amounts to a million crosspoints,
which is not feasible. (Nevertheless, for medium-sized systems, a crossbar
design is workable, and small fully connected crossbar switches are used as basic
building blocks within larger switches and routers).
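A crossbar setting can be modeled as a set of closed crosspoints. The following Python sketch counts the crosspoints of a p × m crossbar and checks whether a setting is conflict-free. It assumes, for the purpose of this illustration only, that every processing unit and every memory module may take part in at most one connection at a time; the function names are hypothetical. The setting `FIG_2_7` lists the eight pairs named in the description of Fig. 2.7.

```python
def crosspoint_count(p, m):
    """Number of crosspoints in a crossbar with p processing units and m modules."""
    return p * m


def is_conflict_free(closed, p, m):
    """Check a set of closed crosspoints (unit, module), both numbered from 1.

    Assumption: each processing unit and each memory module is used by at
    most one connection at a time.
    """
    used_units, used_modules = set(), set()
    for unit, module in closed:
        if not (1 <= unit <= p and 1 <= module <= m):
            return False  # crosspoint outside the switch
        if unit in used_units or module in used_modules:
            return False  # two connections share a row or a column
        used_units.add(unit)
        used_modules.add(module)
    return True


# The pairs (P_i, M_j) listed for Fig. 2.7.
FIG_2_7 = [(1, 1), (2, 3), (3, 5), (4, 4), (5, 2), (6, 6), (7, 8), (8, 7)]

if __name__ == "__main__":
    print(is_conflict_free(FIG_2_7, 8, 8))
    print(crosspoint_count(1000, 1000))
```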
   This is why indirect networks connect the nodes through many switches. The
switches themselves are usually connected to each other in stages, using a regular
connection pattern between the stages. Such indirect networks are called the multi-
stage interconnection networks; we will describe them in more detail on p. 29.
Indirect networks can be further classified as follows:
• A non-blocking network can connect any idle source to any idle destination,
  regardless of the connections already established across the network. This is due
  to the network topology which ensures the existence of multiple paths between
  the source and destination.
• A blocking rearrangeable network can rearrange the connections that have
  already been established across the network in such a way that a new connection
  can be established. Such a network can establish all possible connections between
  inputs and outputs.
• In a blocking network, a connection that has been established across the network
  may block the establishment of a new connection between a source and desti-
  nation, even if the source and destination are both free. Such a network cannot
  always provide a connection between a source and an arbitrary free destination.
   The distinction between direct and indirect networks is less clear nowadays. Every
direct network can be represented as an indirect network since every node in the direct
network can be represented as a router with its own processing element connected
to other routers. However, for both direct and indirect interconnection networks, the
full crossbar, as an ideal switch, is the heart of the communications.
It is not hard to see that there exist many network topologies capable of intercon-
necting p processing units and m memory modules (see Exercises). However, not
every network topology is capable of conveying memory requests quickly enough
to efficiently back up parallel computation. Moreover, it turns out that the network
topology has a large influence on the performance of the interconnection network
and, consequently, of parallel computation. In addition, network topology may incur
considerable difficulties in the actual construction of the network and its cost.
    In the last few decades, researchers have proposed, analyzed, constructed, tested,
and used various network topologies. We now give an overview of the most notable or
popular ones: the bus, the mesh, the 3D-mesh, the torus, the hypercube, the multistage
network and the fat tree.
The Bus
This is the simplest network topology. See Fig. 2.8. It can be used in both local-
memory machines (LMMs) and memory-module machines (MMMs). In either case,
all processing units and memory modules are connected to a single bus. In each step,
at most one piece of data can be written onto the bus. This can be a request from a
processing unit to read or write a memory value, or it can be the response from the
processing unit or memory module that holds the value.
    When in a memory-module machine a processing unit wants to read a memory
word, it must first check to see if the bus is busy. If the bus is idle, the processing unit
puts the address of the desired word on the bus, issues the necessary control signals,
and waits until the memory puts the desired word on the bus. If, however, the bus
is busy when a processing unit wants to read or write memory, the processing unit
must wait until the bus becomes idle. This is where drawbacks of the bus topology
become apparent. If there is a small number of processing units, say two or three,
the contention for the bus is manageable; but for larger numbers of processing units,
say 32, the contention becomes unbearable because most of the processing units will
wait most of the time.
    To solve this problem we add a local cache to each processing unit. The cache
can be located on the processing unit board, next to the processing unit chip, inside
the processing unit chip, or some combination of all three. In general, caching is
not done on an individual word basis but on the basis of blocks that consist of, say,
64 bytes. When a word is referenced by a processing unit, the word’s entire block
is fetched into the local cache of the processing unit. After that many reads can be
satisfied out of the local cache. As a result, there will be less bus traffic, and the
system will be able to support more processing units.
    We see that the practical advantages of using buses are that (i) they are simple
to build, and (ii) it is relatively easy to develop protocols that allow processing units
to cache memory values locally (because all processing units and memory modules
can observe the traffic on the bus). The obvious disadvantage of using a bus is that
the processing units must take turns accessing the bus. This implies that as more
processing units are added to a bus, the average time to perform a memory access
grows proportionately with the number of processing units.
The Ring
The ring is among the simplest and the oldest interconnection networks. Given n
nodes, they are arranged in linear fashion so that each node has a distinct label i,
where 0 ≤ i ≤ n − 1. Every node is connected to two neighbors, one to the left and
one to the right. Thus, a node labeled i is connected to the nodes labeled (i + 1) mod n
and (i − 1) mod n (see Fig. 2.9). The ring is used in local-memory machines (LMMs).
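The neighbor rule of the ring translates directly into modular arithmetic. The following is a minimal Python sketch; the function name is hypothetical. It relies on the fact that Python's `%` operator returns a non-negative result for a positive modulus, so node 0 correctly wraps around to node n − 1.

```python
def ring_neighbors(i, n):
    """Return the (left, right) neighbors of node i in a ring of n nodes."""
    if not 0 <= i < n:
        raise ValueError("node label must satisfy 0 <= i <= n - 1")
    return ((i - 1) % n, (i + 1) % n)


if __name__ == "__main__":
    n = 8
    for i in range(n):
        print(i, ring_neighbors(i, n))
```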
2D-Mesh
A two-dimensional mesh is an interconnection network that can be arranged in
rectangular fashion, so that each switch in the mesh has a distinct label (i, j), where
0 ≤ i ≤ X − 1 and 0 ≤ j ≤ Y − 1. (See Fig. 2.10). The values X and Y determine
the lengths of the sides of the mesh. Thus, the number of switches in a mesh is XY.
Every switch, except those on the sides of the mesh, is connected to four neighbors:
one to the north, one to the south, one to the east, and one to the west. So a switch
labeled (i, j), where 0 < i < X − 1 and 0 < j < Y − 1, is connected to the switches
labeled (i, j + 1), (i, j − 1), (i + 1, j), and (i − 1, j).
   Meshes typically appear in local-memory machines (LMMs): a processing unit
(along with its local memory) is connected to each switch, so that remote memory
accesses are made by routing messages through the mesh.
Fig. 2.11 A 2D-torus. Each node represents a processor unit with local memory
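The neighbor rule of the 2D-mesh can be sketched as follows. Switches on the sides of the mesh simply have fewer neighbors, which the sketch handles by discarding labels that fall outside the mesh. The function names are hypothetical, and the link count is obtained by plain enumeration rather than by a closed formula.

```python
def mesh_neighbors(i, j, X, Y):
    """Neighbors of switch (i, j) in an X-by-Y mesh (no wrap-around links)."""
    candidates = [(i, j + 1), (i, j - 1), (i + 1, j), (i - 1, j)]
    return [(a, b) for (a, b) in candidates if 0 <= a < X and 0 <= b < Y]


def mesh_link_count(X, Y):
    """Count the links by summing node degrees; every link is counted twice."""
    total_degree = sum(
        len(mesh_neighbors(i, j, X, Y)) for i in range(X) for j in range(Y)
    )
    return total_degree // 2


if __name__ == "__main__":
    print(mesh_neighbors(1, 1, 3, 3))  # interior switch: four neighbors
    print(mesh_neighbors(0, 0, 3, 3))  # corner switch: two neighbors
    print(mesh_link_count(3, 3))
```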
2D-Torus (Toroidal 2D-Mesh)
In the 2D-mesh, the switches on the sides have no connections to the switches on
the opposite sides. The interconnection network that compensates for this is called
the toroidal mesh, or just torus when d = 2. (See Fig. 2.11). Thus, in a torus every
switch located at (i, j) is connected to four other switches, which are located at
(i, (j + 1) mod Y), (i, (j − 1) mod Y), ((i + 1) mod X, j) and ((i − 1) mod X, j).
   Toruses appear in local-memory machines (LMMs): to each switch is connected
a processing unit with its local memory. Each processing unit can access any remote
memory by routing messages through the torus.
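In a torus the wrap-around links remove the special case of switches on the sides, so the neighbor rule becomes uniform. A minimal Python sketch, with a hypothetical function name, is given below. Note that for X ≤ 2 or Y ≤ 2 some of the four returned labels coincide, so the sketch is meaningful mainly for X, Y ≥ 3.

```python
def torus_neighbors(i, j, X, Y):
    """The four neighbors of switch (i, j) in an X-by-Y torus."""
    return [
        (i, (j + 1) % Y),
        (i, (j - 1) % Y),
        ((i + 1) % X, j),
        ((i - 1) % X, j),
    ]


if __name__ == "__main__":
    # A corner of the underlying mesh now has four neighbors as well.
    print(torus_neighbors(0, 0, 4, 4))
```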
3D-Mesh and 3D-Torus
A three-dimensional mesh is similar to a two-dimensional one. (See Fig. 2.12). Now each
switch in a mesh has a distinct label (i, j, k), where 0 ≤ i ≤ X − 1, 0 ≤ j ≤ Y − 1,
and 0 ≤ k ≤ Z − 1. The values X, Y and Z determine the lengths of the sides of
the mesh, so the number of switches in it is XYZ. Every switch, except those on the
sides of the mesh, is now connected to six neighbors: one to the north, one to the
south, one to the east, one to the west, one up, and one down. Thus, a switch labeled
(i, j, k), where 0 < i < X − 1, 0 < j < Y − 1 and 0 < k < Z − 1, is connected to
the switches (i, j + 1, k), (i, j − 1, k), (i + 1, j, k), (i − 1, j, k), (i, j, k + 1) and
(i, j, k − 1). Such meshes typically appear in LMMs.
   We can expand a 3D-mesh into a toroidal 3D-mesh by adding edges that connect
nodes located at the opposite sides of the 3D-mesh. (Picture omitted). A switch
labeled (i, j, k) is connected to the switches ((i + 1) mod X, j, k), ((i − 1) mod X, j, k),
(i, (j + 1) mod Y, k), (i, (j − 1) mod Y, k), (i, j, (k + 1) mod Z) and (i, j, (k − 1) mod Z).
   3D-meshes and toroidal 3D-meshes are used in local-memory machines (LMMs).
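To see what the added wrap-around edges buy, one can compare the diameters of a 3D-mesh and a toroidal 3D-mesh of the same size. The Python sketch below does this by breadth-first search. It assumes the usual graph-theoretic meaning of diameter, namely the largest shortest-path distance (in hops) between any two switches; this definition and all function names are assumptions of the illustration.

```python
from collections import deque
from itertools import product


def neighbors_3d(node, dims, wrap):
    """Neighbors of a switch in a 3D-mesh (wrap=False) or 3D-torus (wrap=True)."""
    result = []
    for axis in range(3):
        for step in (1, -1):
            coord = node[axis] + step
            if wrap:
                coord %= dims[axis]
            elif not 0 <= coord < dims[axis]:
                continue  # no link beyond the side of a plain mesh
            neighbor = node[:axis] + (coord,) + node[axis + 1:]
            if neighbor != node:
                result.append(neighbor)
    return result


def diameter(dims, wrap):
    """Largest hop distance between any two switches, found by BFS from each one."""
    best = 0
    for source in product(*(range(d) for d in dims)):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in neighbors_3d(u, dims, wrap):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    for dims in ((3, 3, 3), (4, 4, 4)):
        print(dims, diameter(dims, wrap=False), diameter(dims, wrap=True))
```

The exhaustive search is only practical for small networks, but it makes no assumption about the regularity of the topology.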
Hypercube
A hypercube is an interconnection network that has $n = 2^b$ nodes, for some b ≥ 0.
(See Fig. 2.13). Each node has a distinct label consisting of b bits. Two nodes are
connected by a communication link if and only if their labels differ in precisely one
bit location. Hence, each node of a hypercube has $b = \log_2 n$ neighbors.
   Hypercubes are used in local-memory machines (LMMs).
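Because two hypercube nodes are linked exactly when their labels differ in one bit, neighbors can be generated by flipping each bit of a label in turn. The following Python sketch represents labels as integers in the range 0 to 2^b − 1; the function names are hypothetical.

```python
def hypercube_neighbors(label, b):
    """Neighbors of a node in a hypercube with 2**b nodes: flip one bit at a time."""
    if not 0 <= label < 2 ** b:
        raise ValueError("label must consist of b bits")
    return [label ^ (1 << k) for k in range(b)]


def are_connected(u, v):
    """Two nodes are linked if and only if their labels differ in exactly one bit."""
    return bin(u ^ v).count("1") == 1


if __name__ == "__main__":
    b = 3
    for label in range(2 ** b):
        bits = format(label, "0{}b".format(b))
        print(bits, [format(v, "0{}b".format(b)) for v in hypercube_neighbors(label, b)])
```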
 The k-ary d-Cube Family of Network Topologies
Interestingly, the ring, the 2D-torus, the 3D-torus, the hypercube, and many other
topologies all belong to one larger family of k-ary d-cube topologies.
   Given k ≥ 1 and d ≥ 1, the k-ary d-cube topology is a family of certain “gridlike”
topologies that share the fashion in which they are constructed. In other words, the
k-ary d-cube topology is a generalization of certain topologies. The parameter d is
called the dimension of these topologies and k is their side length, the number of
nodes along each of the d directions. The fashion in which the k-ary d-cube topology
is constructed is defined inductively (on the dimension d):
Fig. 2.14 A 4-stage interconnection network capable of connecting 8 processing units to 8 memory
modules. Each switch ◦ can establish a connection between arbitrary pair of input and output
channels
Fig. 2.15 A fat-tree. Each switch ◦ can establish a connection between arbitrary pair of incident
channels
T (n)
In Sect. 2.1, we defined the parallel execution time Tpar , speedup S, and efficiency
E of a parallel program P for solving a problem Π on a computer C( p) with p
processing units. Let us augment these definitions so that they will involve the size
n of the instances of Π . As before, the program P for solving Π and the computer
C( p) are tacitly understood, so we omit the corresponding indexes to simplify the
notation. We obtain the parallel execution time Tpar (n), speedup S(n), and efficiency
E(n) of solving Π ’s instances of size n:
$$S(n) \overset{\mathrm{def}}{=} \frac{T_{seq}(n)}{T_{par}(n)},$$
$$E(n) \overset{\mathrm{def}}{=} \frac{S(n)}{p}.$$
   So let us pick an arbitrary n and suppose that we are only interested in solving
instances of Π whose size is n. Now, if there are too few processing units in C( p), i.e.,
p is too small, the potential parallelism in the program P will not be fully exploited
during the execution of P on C( p), and this will reflect in low speedup S(n) of P.
Likewise, if C( p) has too many processing units, i.e., p is too large, some of the
processing units will be idling during the execution of the program P, and again this
will reflect in low speedup of P. This raises the following question that obviously
deserves further consideration:
It is reasonable to expect that the answer will depend somehow on the type of C,
that is, on the multiprocessor model (see Sect. 2.3) underlying the parallel computer
C. Until we choose the multiprocessor model, we may not be able to obtain answers
of practical value to the above question. Nevertheless, we can make some general
observations that hold for any type of C. First observe that, in general, if we let n
grow then p must grow too; otherwise, p would eventually become too small relative
to n, thus making C( p) incapable of fully exploiting the potential parallelism of P.
Consequently, we may view p, the number of processing units that are needed to
maximize speedup, to be some function of n, the size of the problem instance at
hand. In addition, intuition and practice tell us that a larger instance of a problem
requires at least as many processing units as required by a smaller one. In sum, we
can set
                                       p = f (n),
where f : N → N is some nondecreasing function, i.e., f (n) ≤ f (n + 1), for all n.
2.5 Parallel Computational Complexity                                                   33
    Second, let us examine how quickly f (n) can grow as n grows. Suppose that
 f (n) grows exponentially. Well, researchers have proved that if there are exponen-
tially many processing units in a parallel computer then this necessarily incurs long
communication paths between some of them. Since some communicating process-
ing units become exponentially distant from each other, the communication times
between them increase correspondingly and, eventually, blemish the theoretically
achievable speedup. The reason for all of that is essentially in our real, 3-dimensional
space, because
• each processing unit and each communication link occupies some non-zero vol-
  ume of space, and
• the diameter of the smallest sphere containing exponentially many processing
  units and communication links is also exponential.
p = poly(n),
    Combined with Theorem 2.1 this means that for p = poly(n) the execution of P on
EREW-PRAM( p) will be at most O(log n)-times slower than on CRCW-PRAM( p).
    But this also tells us that, when p = poly(n), choosing a model from the models
CRCW-PRAM( p), CREW-PRAM( p), and EREW-PRAM( p) to execute a program
affects the execution time of the program by a factor of the order O(log n), where n
is the size of the problem instances to be solved. In other words:
We usually write $\log^i n$ instead of $(\log n)^i$ to avoid clustering of parentheses. The sum
$a_k \log^k n + a_{k-1} \log^{k-1} n + \cdots + a_0$ is asymptotically bounded above by $O(\log^k n)$.
To see why, consider Exercises in Sect. 2.7.
We are ready to formally introduce the class of problems we are interested in.
Example 2.1 Suppose that we are given the problem Π ≡ “add n given numbers.”
Then π ≡ “add numbers 10, 20, 30, 40, 50, 60, 70, 80” is an instance of size(π ) = 8
of the problem Π. Let us now focus on all instances of size 8, that is, instances of
the form π ≡ “add numbers a1 , a2 , a3 , a4 , a5 , a6 , a7 , a8 .”
Fig. 2.16 Adding eight numbers in parallel with four processing units
   The fastest sequential algorithm for computing the sum a1 + a2 + a3 + a4 + a5 +
a6 + a7 + a8 requires Tseq (8) = 7 steps, with each step adding the next number to
the sum of the previous ones.
   In parallel, however, the numbers a1 , a2 , a3 , a4 , a5 , a6 , a7 , a8 can be summed in
just Tpar (8) = 3 parallel steps using 8/2 = 4 processing units which communicate in a
tree-like pattern as depicted in Fig. 2.16. In the first step, s = 1, each processing unit
adds two adjacent input numbers. In each next step, s ≥ 2, two adjacent previous
partial results are added to produce a new, combined partial result. This combining
of partial results in a tree-like manner continues until $2^{s+1} > 8$. In the first step,
s = 1, all of the four processing units are engaged in computation; in step s = 2, two
processing units (P3 and P4 ) start idling; and in step s = 3, three processing units
(P2 , P3 and P4 ) are idle.
   In general, instances π(n) of Π can be solved in parallel time $T_{par} = \lceil \log n \rceil = O(\log n)$
with $\lceil n/2 \rceil = O(n)$ processing units communicating in similar tree-like patterns.
Hence, Π ∈ NC and the associated speedup is $S(n) = \frac{T_{seq}(n)}{T_{par}(n)} = O\left(\frac{n}{\log n}\right)$.
   Notice that, in the above example, the efficiency of the tree-like parallel addition
of n numbers is quite low, $E(n) = O\left(\frac{1}{\log n}\right)$. The reason for this is obvious: only
half of the processing units engaged in a parallel step s will be engaged in the next
parallel step s + 1, while all the other processing units will be idling until the end of
computation. This issue will be addressed in the next section by Brent’s Theorem.
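The tree-like addition of Example 2.1 can be simulated sequentially to count the parallel steps and the processing units that are busy in each step. The Python sketch below is only an illustration: the function name is hypothetical, the sequential time is taken as n − 1 additions as in the example, and efficiency is computed as speedup divided by the number of processing units, which is assumed to be the definition in use.

```python
def tree_sum(values):
    """Simulate tree-like parallel addition.

    Returns (total, number of parallel steps, busy processing units per step).
    """
    level = list(values)
    if not level:
        raise ValueError("at least one number is required")
    busy = []
    while len(level) > 1:
        pairs = len(level) // 2
        # Each pair is added by one processing unit within the same parallel step.
        nxt = [level[2 * k] + level[2 * k + 1] for k in range(pairs)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired value is carried to the next step
        busy.append(pairs)
        level = nxt
    return level[0], len(busy), busy


if __name__ == "__main__":
    numbers = [10, 20, 30, 40, 50, 60, 70, 80]
    total, t_par, busy = tree_sum(numbers)
    t_seq = len(numbers) - 1        # sequential additions
    p = max(busy)                   # processing units needed in the first step
    speedup = t_seq / t_par
    efficiency = speedup / p
    print(total, t_par, busy, speedup, efficiency)
```

For the eight numbers of the example the simulation reports three parallel steps with four, two, and one busy processing units, which matches Fig. 2.16.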
In this section we describe Brent’s theorem, which is useful in estimating the
lower bound on the number of processing units that are needed to keep a given
parallel time complexity. Then we focus on Amdahl’s law, which is used for
predicting the theoretical speedup of a parallel program whose different parts allow
different speedups.
Tpar, M (P).
Let us now reduce the number of processing units of M to some fixed number
and denote the obtained machine with the reduced number of processing units by
R.
R is a PRAM of the same type as M which can use, in every step of its operation, at
most p processing units.
   Let us now run P on R. If p processing units cannot support, in every step of the
execution, all the potential parallelism of P, then the parallel runtime of P on R,
Tpar, R (P),
may be larger than Tpar, M (P). Now the question arises: Can we quantify Tpar, R (P)?
  The answer is given by Brent’s Theorem which states that
$$T_{par,R}(P) = O\left(\frac{W}{p} + T_{par,M}(P)\right).$$
2.6 Laws and Theorems of Parallel Computation                                             37
Proof Let $W_i$ be the number of P’s operations performed by M in the ith step and
$T := T_{par,M}(P)$. Then $\sum_{i=1}^{T} W_i = W$. To perform the $W_i$ operations of the ith step
of M, R needs $\lceil W_i / p \rceil$ steps. So the number of steps which R makes during its
execution of P is
$$T_{par,R}(P) = \sum_{i=1}^{T} \left\lceil \frac{W_i}{p} \right\rceil \le \sum_{i=1}^{T} \left( \frac{W_i}{p} + 1 \right) \le \frac{1}{p} \sum_{i=1}^{T} W_i + T = \frac{W}{p} + T_{par,M}(P).$$
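The counting argument in the proof can be checked numerically. In the Python sketch below a program is described only by the number of operations it performs in each parallel step on the unrestricted machine; the reduced machine with p processing units then needs the ceiling of W_i / p steps for the ith step. The function names are hypothetical, and the sample workload (4, 2 and 1 operations) is the tree-like addition of eight numbers.

```python
def steps_on_reduced_machine(work_per_step, p):
    """Steps needed with p processing units: sum of ceil(W_i / p) over all steps."""
    return sum(-(-w // p) for w in work_per_step)  # -(-w // p) is integer ceiling


def brent_bound(work_per_step, p):
    """The bound W / p + T, where W is total work and T the number of steps."""
    total_work = sum(work_per_step)
    steps = len(work_per_step)
    return total_work / p + steps


if __name__ == "__main__":
    work = [4, 2, 1]  # operations per step when adding eight numbers in a tree
    for p in (1, 2, 4):
        print(p, steps_on_reduced_machine(work, p), brent_bound(work, p))
```

With p = 2 the sketch needs 4 steps against a bound of 6.5, so halving the number of processing units costs only one additional step in this small case.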
Intuitively, we would expect that doubling the number of processing units should
halve the parallel execution time; and doubling the number of processing units again
should halve the parallel execution time once more. In other words, we would expect
that the speedup from parallelization is a linear function of the number of processing
units (see Fig. 2.17).
   However, linear speedup from parallelization is just a desirable optimum which
is not very likely to become a reality. Indeed, in reality very few parallel algorithms
achieve it. Most parallel programs have a speedup which is near-linear for small
numbers of processing elements, and then flattens out into a constant value for large
numbers of processing elements (see Fig. 2.18).
Setting the Stage
How can we explain this unexpected behavior? The clues for the answer will be
obtained from two simple examples.
     Note: P1 cannot be sped up by adding new processing units, because scanning the
     disk directory is an intrinsically sequential process. In contrast, P2 can be sped up
     by adding new processing units; for example, each file can be passed to a separate
     processing unit. In sum, a sequential program can be viewed as a sequence of two
     parts that differ in their parallelizability, i.e., amenability to parallelization.
• Example 2. Let P be as above. Suppose that the (sequential) execution of P takes
  20 min, where the following holds (see Fig. 2.19):
[Fig. 2.19: P as a sequence of parts P1 and P2 with sequential execution times Tseq (P1 ) and Tseq (P2 )]
Note: since only P2 can benefit from additional processing units, the parallel
execution time Tpar (P) of the whole P cannot be less than the time Tseq (P1 ) taken by the
non-parallelizable part P1 (that is, 2 min), regardless of the number of additional
processing units engaged in the parallel execution of P. In sum, if parts of a sequential
program differ in their potential parallelism, they differ in their potential speedups
from the increased number of processing units, so the speedup of the whole program
will depend on their sequential runtimes.
The clues that the above examples brought to light are recapitulated as follows: In
general, a program P executed by a parallel computer can be split into two parts,
• part P1 which does not benefit from multiple processing units, and
• part P2 which does benefit from multiple processing units;
• besides P2 ’s benefit, also the sequential execution times of P1 and P2 influence
  the parallel execution time of the whole P (and, consequently, P’s speedup).
Derivation
We will now assess quantitatively how the speedup of P depends on P1 ’s and P2 ’s
sequential execution times and their amenability to parallelization and exploitation
of multiple processing units.
   Let $T_{seq}(P)$ be the sequential execution time of P. Because $P = P_1 P_2$, a sequence
of parts $P_1$ and $P_2$, we have
\[
T_{seq}(P) = T_{seq}(P_1) + T_{seq}(P_2),
\]
where $T_{seq}(P_1)$ and $T_{seq}(P_2)$ are the sequential execution times of $P_1$ and $P_2$,
respectively (see Fig. 2.20).
   When we actually employ additional processing units in the parallel execution of
P, it is the execution of $P_2$ that is sped up by some factor $s > 1$, while the execution
of $P_1$ does not benefit from additional processing units. In other words, the execution
time of $P_2$ is reduced from $T_{seq}(P_2)$ to $\frac{1}{s} T_{seq}(P_2)$, while the execution time of $P_1$
remains the same, $T_{seq}(P_1)$. So, after the employment of additional processing units
the parallel execution time $T_{par}(P)$ of the whole program P is
\[
T_{par}(P) = T_{seq}(P_1) + \frac{1}{s} T_{seq}(P_2).
\]
The speedup S(P) of the whole program P can now be computed from the definition,
\[
S(P) = \frac{T_{seq}(P)}{T_{par}(P)}.
\]
We could stop here; however, it is usual to express S(P) in terms of b, the fraction
of $T_{seq}(P)$ during which parallelization of P is beneficial. In our case
\[
b = \frac{T_{seq}(P_2)}{T_{seq}(P)}.
\]
Plugging this into the expression for S(P), that is, using $T_{seq}(P_2) = b\,T_{seq}(P)$ and
$T_{seq}(P_1) = (1 - b)\,T_{seq}(P)$, we finally obtain Amdahl's Law
\[
S(P) = \frac{1}{1 - b + \frac{b}{s}}.
\]
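\[
S = \frac{1}{1 - b + \frac{b}{s}} = \frac{1}{0.3 + \frac{0.7}{s}} = \frac{1}{0.3 + \frac{0.7}{16}} = 2.91.
\]
\[
S = \frac{1}{1 - b + \frac{b}{s}} = \frac{1}{0.3 + \frac{0.7}{s}} = \frac{1}{0.3 + \frac{0.7}{32}} = 3.11,
\]
\[
S = \frac{1}{1 - b + \frac{b}{s}} = \frac{1}{0.3 + \frac{0.7}{s}} = \frac{1}{0.3 + \frac{0.7}{64}} = 3.22.
\]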
   Finally, if we double the number of processing units even to 128, the maximum
speedup we can achieve is
\[
S = \frac{1}{1 - b + \frac{b}{s}} = \frac{1}{0.3 + \frac{0.7}{s}} = \frac{1}{0.3 + \frac{0.7}{128}} = 3.27.
\]
  In this case doubling the processing power only slightly improves the speedup.
Therefore, using more processing units is not necessarily the optimal approach.
  Note that this complies with actual speedups of realistic programs as we have
depicted in Fig. 2.18.
 A Generalization of Amdahl’s Law
Until now we assumed that there are just two parts of a given program, of which
one cannot benefit from multiple processing units and the other can. We now assume
that the program is a sequence of three parts, each of which could benefit from
multiple processing units. Our goal is to derive the speedup of the whole program
when the program is executed by multiple processing units.
   So let $P = P_1 P_2 P_3$ be a program which is a sequence of three parts $P_1$, $P_2$, and
$P_3$. See Fig. 2.21. Let $T_{seq}(P_1)$ be the time during which the sequential execution of
P spends executing part $P_1$. Similarly we define $T_{seq}(P_2)$ and $T_{seq}(P_3)$. Then the
sequential execution time of P is
\[
T_{seq}(P) = T_{seq}(P_1) + T_{seq}(P_2) + T_{seq}(P_3).
\]
   But we want to run P on a parallel computer. Suppose that the analysis of P shows
that $P_1$ could be parallelized and sped up on the parallel machine by factor $s_1 > 1$.
Similarly, $P_2$ and $P_3$ could be sped up by factors $s_2 > 1$ and $s_3 > 1$, respectively.
So we parallelize P by parallelizing each of the three parts $P_1$, $P_2$, and $P_3$, and
run P on the parallel machine. The parallel execution of P takes $T_{par}(P)$ time,
where $T_{par}(P) = T_{par}(P_1) + T_{par}(P_2) + T_{par}(P_3)$. But $T_{par}(P_1) = \frac{1}{s_1} T_{seq}(P_1)$, and
similarly for $T_{par}(P_2)$ and $T_{par}(P_3)$. It follows that
\[
T_{par}(P) = \frac{1}{s_1} T_{seq}(P_1) + \frac{1}{s_2} T_{seq}(P_2) + \frac{1}{s_3} T_{seq}(P_3).
\]
Now the speedup of P can easily be computed from its definition, $S(P) = \frac{T_{seq}(P)}{T_{par}(P)}$.
   We can obtain a more informative expression for S(P). Let $b_1$ be the fraction of
$T_{seq}(P)$ during which the sequential execution of P executes $P_1$; that is, $b_1 = \frac{T_{seq}(P_1)}{T_{seq}(P)}$.
Similarly we define $b_2$ and $b_3$. Applying this in the definition of S(P) we obtain
\[
S(P) = \frac{T_{seq}(P)}{T_{par}(P)} = \frac{1}{\frac{b_1}{s_1} + \frac{b_2}{s_2} + \frac{b_3}{s_3}}.
\]
2.7 Exercises
 1. How many pairwise interactions must be computed when solving the n-body
    problem if we assume that interactions are symmetric?
 2. Give an intuitive explanation why $T_{par} \le T_{seq} \le p \cdot T_{par}$, where $T_{par}$ and $T_{seq}$ are
    the parallel and sequential execution times of a program, respectively, and p is
    the number of processing units used during the parallel execution.
 3. Can you estimate the number of different network topologies capable of inter-
    connecting p processing units Pi and m memory modules M j ? Assume that each
    topology should provide, for every pair (Pi ,M j ), a path between Pi and M j .
 4. Let P be an algorithm for solving a problem Π on CRCW-PRAM( p). Accord-
    ing to Theorem 2.1, the execution of P on EREW-PRAM( p) will be at most
    O(log p)-times slower than on CRCW-PRAM( p). Now suppose that p =
    poly(n), where n is the size of a problem instance. Prove that log p = O(log n).
In presenting the topics in this Chapter we have strongly leaned on Trobec et al. [26]
and Atallah and Blanton [3]. On the computational models of sequential computation
see Robič [22]. Interconnection networks are discussed in great detail in Dally and
Towles [6], Duato et al. [7], Trobec [25] and Trobec et al. [26]. The dependence
of execution times of real world parallel applications on the performance of the
interconnection networks is discussed in Grama et al. [12].
Part II
Programming
Chapter Summary
Of many different parallel and distributed systems, multi-core and shared memory
multiprocessors are most likely the easiest to program if only the right approach is
taken. In this chapter, programming such systems is introduced using OpenMP, a
widely used and ever-expanding application programming interface well suited for
the implementation of multithreaded programs. It is shown how the combination
of properly designed compiler directives and library functions can provide a pro-
gramming environment where the programmer can focus mostly on the program and
algorithms and less on the details of the underlying computer system.
Fig. 3.2 A parallel system with two quad-core CPUs supporting simultaneous multithreading con-
tains 16 logical cores all connected to the same memory
Fig. 3.3 Two examples of a race condition when two threads attempt to increase the value at the
same location in the main memory
Fig. 3.4 Preventing race conditions as illustrated in Fig. 3.3 using locking
as illustrated in Fig. 3.3. In such situations, the result is both incorrect and undefined:
in either case, the value in the memory will be increased by either 1 or 2 but not by
both 1 and 2.
    To avoid the race condition, exclusive access to the shared address in the main
memory must be ensured using some mechanism like locking using semaphores or
atomic access using read-modify-write instructions. If locking is used, each thread
must lock the access to the shared memory location before modifying it and unlock
it afterwards as illustrated in Fig. 3.4. If a thread attempts to lock something that the
other thread has already locked, it must wait until the other thread unlocks it. This
approach forces one thread to wait but guarantees the correct result.
    The peripheral devices are not shown in Figs. 3.1 and 3.2. It is usually assumed
that all threads can access all peripheral devices but it is again up to software to
resolve which thread can access each device at any given time.
        One such thing is OpenMP, a parallel programming environment best suited for
     writing parallel programs that are to be run on shared memory systems. It is not
     yet another programming language but an add-on to an existing language, usually
     Fortran or C/C++. In this book, OpenMP on top of C will be used.
        The application programming interface (API) of OpenMP is a collection of
     • compiler directives,
     • supporting functions, and
     • shell variables.
        OpenMP compiler directives tell the compiler about the parallelism in the source
     code and provide instructions for generating the parallel code, i.e., the
     multithreaded translation of the source code. In C/C++, directives are always expressed
     as #pragmas. Supporting functions enable programmers to exploit and control the
     parallelism during the execution of a program. Shell variables permit tuning of
     compiled programs to a particular parallel system.
     To illustrate different kinds of OpenMP API elements, we will start with a simple
     program in Listing 3.1.
         This program starts as a single thread that first prints out the salutation. Once
     the execution reaches the omp parallel directive, several additional threads are
     created alongside the existing one. All threads, the initial thread and the newly created
     threads, together form a team of threads. Each thread in the newly established team of
     threads executes the statement immediately following the directive: in this example
     it just prints out its unique thread number obtained by calling OpenMP function
     omp_get_thread_num. When all threads have done that, the threads created by the
     omp parallel directive are terminated and the program continues as a single
     thread that prints out a single newline character and terminates the program by
     executing return 0.
         To compile and run the program shown in Listing 3.1 using GNU GCC C/C++
     compiler, use the command-line option -fopenmp as follows:
3.2 Using OpenMP to Write Multithreaded Programs                                        51
Hello, world: 2 5 1 7 6 0 3 4
   Without OMP_NUM_THREADS being set, the program would set the number of
threads to match the number of logical cores threads can run on. For instance, on a
CPU with 2 cores and hyper-threading, 4 threads would be used and a permutation
of numbers from 0 to 3 would be printed out.
   Once the threads are started, it is up to a particular OpenMP implementation and
especially the underlying operating system to carry out scheduling and to resolve
competition for the single standard output the permutation is printed on. Hence, if
the program is run several times, a different permutation of thread numbers will most
likely be printed out each time. Try it.
52          3 Programming Multi-core and Shared Memory Multiprocessors Using OpenMP
During the design, development, and debugging of parallel programs, reasoning about
parallel algorithms and how to encode them better rarely suffices. To understand how
an OpenMP program actually runs on a multi-core system, it is best to monitor and
measure the performance of the program. Moreover, this is the simplest and the
most reliable way to know how many cores your program actually runs on.
   Let us use the program in Listing 3.2 as an illustration. The program starts several
threads, each of them printing out one Fibonacci number computed using the naive
and time-consuming recursive algorithm.
   On most operating systems, it is usually easy to measure the running time of a
program execution. For instance, compiling the above program and running it using
time utility on Linux as
yields some Fibonacci numbers, and then as the last line of output, the information
about the program’s running time:
     (See Appendix A for instructions on how to measure time and monitor the execution
     of a program on Linux, macOS and MS Windows.)
        The user and system time amount to the total time that all logical cores together
     spent executing the program. In the example above, the sum of the user and system
     time is bigger than the real time, i.e., the elapsed or wall-clock time. Hence, various
     parts of the program must have run on several logical cores simultaneously.
        Most operating systems provide system monitors that among other metrics show
     the amount of computation performed by individual cores. This might be very
     informative during OpenMP program development, but be careful as most system monitors
     report the overall load on an individual logical core, i.e., the load of all programs running
     on a logical core.
        Using a system monitor while the program shown in Listing 3.2 is run on an
     otherwise idle system, one can observe the load on individual logical cores during
     program execution. As threads finish one after another, one can observe how the
     load on individual logical cores drops as the execution proceeds. Toward the end of
     execution, with only one thread remaining, it can be seen how the operating system
     occasionally migrates the last thread from one logical core to another.
Listing 3.3 Printing out all integers from 1 to max in no particular order.
         The program in Listing 3.3 starts as a single initial thread. The value max is read
     and stored in variable max. The execution then reaches the most important part of
     the program, namely, the for loop which actually prints out the numbers (each
     preceded by the number of a thread that prints it out). But the omp parallel
     for directive in line 6 specifies that the for loop must be executed in parallel, i.e.,
     its iterations must be divided among and executed by multiple threads running on
     all available processing units. Hence, a number of slave threads is created, one per
     each available processing unit or as specified explicitly (minus one that the initial
     thread runs on). The initial thread becomes the master thread and together with the
     newly created slave threads the team of threads is formed. Then,
     • iterations of the parallel for loop are divided among threads where each iteration
       is executed by the thread it has been assigned to, and
     • once all iterations have been executed, all threads in the team are synchronized
       at the implicit barrier at the end of the parallel for loop and all slave threads are
       terminated.
     Finally, the execution proceeds sequentially and the master thread terminates the
     program by executing return 0. The execution of the program in Listing 3.3 is
     illustrated in Fig. 3.5.
         Several observations must be made regarding the program in Listing 3.3 (and
     the execution of parallel for loops in general). First, the program in Listing 3.3 does not
     specify how the iterations should be divided among threads (explicit scheduling
     of iterations will be described later). In such cases, most OpenMP implementations
     divide the entire iteration space into chunks where each chunk, containing a
     subinterval of all iterations, is executed by one thread. Note, however, that this need not
     be the case: if left unspecified, it is up to a particular OpenMP implementation to
     do as it likes.
3.3 Parallelization of Loops                                                                55
Fig. 3.5 Execution of the program for printing out integers as implemented in Listing 3.3
   Second, once the iteration space is divided into chunks, all iterations of an indi-
vidual chunk are executed sequentially, one iteration after another. And third, the
parallel for loop variable i is made private in each thread executing a chunk of
iterations as each thread must have its own copy of i. On the other hand, variable
max can be shared by all threads as it is set before and is only read within the parallel
region.
   However, the most important detail that must be paid attention to is that the overall
task of printing out all integers from 1 to max in no particular order can be divided
into N totally independent subtasks of almost the same size. In such cases, the
parallelization is trivial.
   As the access to the standard output is serialized, printing out integers is not
as parallel as it might seem. Therefore, an example of truly parallel
computation follows.
        The structure of function vectAdd is very similar to the program for printing
     out integers shown in Listing 3.3: a simple parallel for loop where the result of one
     iteration is completely independent of the results produced by other iterations. Moreover,
     different iterations access different array elements, i.e., they read from and write to
     completely different memory locations. Hence, no race conditions can occur.
    double *vectAdd(double *c, double *a, double *b, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
        return c;
    }
        Consider now printing out all pairs of integers from 1 to max in no particular
     order, something that calls for two nested for loops. As all iterations of both nested
     loops are independent, either loop can be parallelized while the other is not. This is
     achieved by placing the omp parallel for directive in front of the loop targeted
     for parallelization. For instance, the program with the outer for loop parallelized is
     shown in Listing 3.5.
     Listing 3.5 Printing out all pairs of integers from 1 to max in no particular order by parallelizing
     the outermost for loop only.
        Assume all pairs of integers from 1 to max are arranged in a square table. If 4
     threads are used and max = 6, each thread executing the parallelized outer for loop prints
     out a few lines of the table as illustrated in Fig. 3.6a. Note that the first two threads
     are assigned twice as much work as the other two threads which, if run on 4 logical
     cores, will have to wait idle until the first two complete as well.
        However, there are two other ways of parallelizing nested loops. First, the two
     nested for loops can be collapsed in order to be parallelized together using clause
     collapse(2) as shown in Listing 3.6.
        Because of the clause collapse(2) in line 6, the compiler merges the two
     nested for loops into one and parallelizes the resulting single loop. The outer for
     loop running from 1 to max and the max inner for loops running from 1 to max as
     well are replaced by a single loop running from 1 to max². All max² iterations are
     divided among available threads together. As only one loop is parallelized, i.e., the
     Fig. 3.6 Partition of the problem domain when all pairs of integers from 1 to 6 must be printed
     using 4 threads: a if only the outer for loop is parallelized, b if both for loops are parallelized
     together, and c if both for loops are parallelized separately
     Listing 3.6 Printing out all pairs of integers from 1 to max in no particular order by parallelizing
     both for loops together.
     one that comprises iterations of both nested for loops, the execution of the program
     in Listing 3.6 still follows the pattern illustrated in Fig. 3.5. For instance, if max = 6,
     all 36 iterations of the collapsed single loop are divided among 4 threads as shown
     in Fig. 3.6b. Compared with the program in Listing 3.5, the work is more evenly
     distributed among threads.
        The other method of parallelizing nested loops is by parallelizing each for loop
     separately as shown in Listing 3.7.
     Listing 3.7 Printing out all pairs of integers from 1 to max in no particular order by parallelizing
     each nested for loop separately.
   To have one parallel region within the other as shown in Listing 3.7 active at the
same time, nesting of parallel regions must be enabled first. This is achieved by calling
omp_set_nested(1) before the outer parallel region is entered or by setting OMP_NESTED to
true. (Since OpenMP 5.0, both are deprecated in favor of omp_set_max_active_levels
and OMP_MAX_ACTIVE_LEVELS.) Once nesting is activated, iterations of both loops are
executed in parallel separately as illustrated in Fig. 3.7. Compare Figs. 3.5 and 3.7 and
note how many more threads are created and terminated in the latter, i.e., if nested
loops are parallelized separately.
Fig. 3.7 The execution of the program for printing out all pairs of integers using separately paral-
lelized nested loops as implemented in Listing 3.7
    double **mtxMul(double **c, double **a, double **b, int n) {
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        return c;
    }
Listing 3.8 Matrix multiplication where the two outermost loops are parallelized together.
    double **mtxMul(double **c, double **a, double **b, int n) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            #pragma omp parallel for
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
        return c;
    }
     Listing 3.9 Matrix multiplication where the two outermost loops are parallelized separately.
Fig. 3.8 Conway’s Game of Life: a particular initial population turns into an oscillating one
  Writing functions for matrix multiplication where only one of the two outermost
for loops is parallelized, either outer or inner, is left as an exercise.
•   each live cell with fewer than two neighbors dies of underpopulation,
•   each live cell with two or three neighbors lives on,
•   each live cell with more than three neighbors dies of overpopulation, and
•   each dead cell with three neighbors becomes a live cell.
It is assumed that each cell has eight neighbors, four along its sides and four on its
corners.
    Once the initial generation is set, all the subsequent generations can be computed.
Sometimes the population of live cells dies out, sometimes it turns into a static colony,
other times it oscillates forever. Even more sophisticated patterns can appear
including traveling colonies and colony generators. Figure 3.8 shows an evolution of an
oscillating colony on the 10 × 10 plane.
    The program for computing Conway's Game of Life is too long to be included
entirely, but its core is shown in Listing 3.10. To understand it, observe the following:
  Except for the omp parallel for directive, the code in Listing 3.10 is the
same as if it was written for the sequential execution: the (outermost) while loop
     runs over all generations to be computed while the inner two loops are used to
     compute the next generation and store it in aux_plane given the current generation
     in plane. More precisely, the rules of the game are implemented in the switch
     statement in lines 6–11: the case for plane[i][j]==0 implements the rule for
     dead cells and the case for plane[i][j]==1 implements the rules for live cells.
     Once the new generation has been computed, the arrays are swapped so that the
     generation just computed becomes the current one.
        The omp parallel for directive in line 2 is used to specify that the iterations
     of the two for loops in lines 3–12 can be performed simultaneously. By inspecting
     the code, it becomes clear that just like in matrix multiplication every iteration of the
     outer for loop computes one row of the plane representing the next generation and
     that every iteration of the inner loop computes a single cell of the next generation.
     As array plane is read only and the (i, j)-th iteration of the collapsed loop is the
     only one writing to the (i, j)-th cell of array aux_plane, there can be no race
     conditions and there are no dependencies among iterations.
        The implicit synchronization at the end of the parallelized loop nest is crucial.
     Without synchronization, if the master thread performed the swap in line 13 before
     other threads finished the computation within both for loops, it would cause all
     other threads to mess up the computation profoundly.
        Finally, instead of parallelizing the two for loops together it is also possible
     to parallelize them separately just like in matrix multiplication. But the outermost
     loop, i.e., while loop, cannot be parallelized as every iteration (except the first one)
     depends on the result of the previous one.                                             
     In most cases, however, individual loop iterations aren't entirely independent as they
     are used to solve a single problem together and thus each iteration contributes its part
     to the combined solution. More often than not, partial results of different iterations
     must be combined together.
 1  while (gens-- > 0) {
 2      #pragma omp parallel for collapse(2)
 3      for (int i = 0; i < size; i++)
 4          for (int j = 0; j < size; j++) {
 5              int neighs = neighbors(plane, size, i, j);
 6              switch (plane[i][j]) {
 7                  case 0: aux_plane[i][j] = (neighs == 3);
 8                          break;
 9                  case 1: aux_plane[i][j] = (neighs == 2) || (neighs == 3);
10                          break;
11              }
12          }
13      char **tmp_plane = aux_plane; aux_plane = plane; plane = tmp_plane;
14  }
        If integers from the given interval are to be added instead of printed out, all subtasks
     must somehow cooperate to produce the correct sum. The first parallel solution that
     comes to mind is shown in Listing 3.11. It uses a single variable sum where the
     result is to be accumulated.
Listing 3.11 Summation of integers from a given interval using a single shared variable — wrong.
         Again, iterations of the parallel for loop are divided among multiple threads. In
     all iterations, threads use the same shared variable sum on both sides of assignment
     in line 8, i.e., they read from and write to the same memory location. As illustrated
     in Fig. 3.9 where every box containing =+ denotes the assignment sum = sum +
     i, the accesses to variable sum overlap and the program is very likely to encounter
     race conditions illustrated in Fig. 3.3.
         Indeed, if this program is run multiple times using several threads, it is very likely
     that it will not always produce the same result. In other words, from time to time it
     will produce the wrong result. Try it.
         To avoid race conditions, the assignment sum = sum + i can be put inside a
     critical section — a part of a program that is performed by at most one thread at
     a time. This is achieved by the omp critical directive which is applied to the
     statement or a block immediately following it. The program using critical sections
     is shown in Listing 3.12.
         The program works correctly because the omp critical directive performs
     locking around the code it contains, i.e., the code that accesses variable sum, as
Listing 3.12 Summation of integers from a given interval using a critical section — slow.
     illustrated in Fig. 3.4 and thus prevents race conditions. However, the use of critical
     sections in this program makes the program slow because at every moment at most
     one thread performs the addition and assignment while all other threads are kept
     waiting as illustrated in Fig. 3.10.
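The critical-section approach described above can be sketched as follows. This is only a minimal illustration, not a reproduction of Listing 3.12: the function name sum_critical and the use of long long for the accumulator are assumptions made here.

```c
#include <assert.h>

/* Sums the integers 1..max. The shared variable sum is updated inside a
   critical section, so at most one thread performs the update at a time.
   If the compiler ignores OpenMP pragmas, the loop simply runs sequentially. */
long long sum_critical(int max) {
    assert(max >= 0);
    long long sum = 0;
    #pragma omp parallel for
    for (int i = 1; i <= max; i++) {
        #pragma omp critical
        sum = sum + i;
    }
    return sum;
}
```

Every iteration has to acquire and release the lock, which is why this variant is correct but slow.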
         It is worth comparing the running times of the programs shown in Listings
     3.11 and 3.12. On a fast multi-core processor, max must be large, possibly large
     enough to make sum overflow, before the difference can be observed.
         Another way to avoid race conditions is to use atomic access to variables as shown
     in Listing 3.13.
         Although sum is a single variable shared by all threads in the team, the program
     computes the correct result as the omp atomic directive instructs the compiler
Listing 3.13 Summation of integers from a given interval using atomic variable access — faster.
         To prevent race conditions and to avoid locking or explicit atomic access to vari-
     ables at the same time, OpenMP provides a special operation called reduction. Using
     it, the program in Listing 3.11 is rewritten to the program shown in Listing 3.14.
     Listing 3.14 Summation of integers from a given interval using reduction — fast.
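For comparison, the atomic and the reduction variants might look as sketched below. These are not the book's Listings 3.13 and 3.14; the function names sum_atomic and sum_reduction and the long long accumulator are assumptions made for this illustration.

```c
#include <assert.h>

/* Variant 1: sum stays shared, but each update is performed atomically. */
long long sum_atomic(int max) {
    assert(max >= 0);
    long long sum = 0;
    #pragma omp parallel for
    for (int i = 1; i <= max; i++) {
        #pragma omp atomic
        sum += i;
    }
    return sum;
}

/* Variant 2: every thread accumulates into its own private copy of sum
   (initialized to 0 for +); the private copies are added to the original
   variable when the loop ends. */
long long sum_reduction(int max) {
    assert(max >= 0);
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= max; i++)
        sum = sum + i;
    return sum;
}
```

In the reduction variant the loop body contains no synchronization at all, which is the reason it is usually the fastest of the three correct versions.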
Fig. 3.12 Integrating $y = \sqrt{1 - x^2}$ numerically from 0 to 1
                                OpenMP: reduction
  Technically, reduction is yet another data sharing attribute, specified by the
       reduction(reduction-identifier : list)
  clause.
  For each variable in the list, a private copy is created in each thread of a parallel
  region, and initialized to a value specified by the reduction-identifier. At the end
  of a parallel region, the original variable is updated with values of all private
  copies using the operation specified by the reduction-identifier.
  The reduction-identifier may be +, -, *, &, |, ^, &&, ||, min, and max. For * and
  && the initial value is 1; for & it is the value with all bits set (~0); for min and
  max it is the largest and the smallest value of the variable’s type, respectively;
  for all other operations, the initial value is 0.
     rectangle, i.e., 1/N, multiplied by the function value computed in the left-end point
     of the interval, i.e., $\sqrt{1 - (i/N)^2}$. Thus,
     $$\int_0^1 \sqrt{1 - x^2}\,dx \approx \sum_{i=0}^{N-1} \frac{1}{N} \sqrt{1 - \left(\frac{i}{N}\right)^2}$$
 1        double x = 0.0;
 2        #pragma omp parallel for reduction(+:integral)
 3            for (int i = 0; i < intervals; i++) {
 4                double fx = sqrt(1.0 - x * x);
 5                integral = integral + fx * dx;
 6                x = x + dx;
 7            }
     Listing 3.16 Computing π by integrating $\sqrt{1 - x^2}$ from 0 to 1 using a non-parallelizable loop.
        This works well if the program is run by only one thread (set OMP_NUM_THREADS
     to 1), but produces the wrong result if multiple threads are used. The reason is that the
     iterations are no longer independent: the value of x is propagated from one iteration
     Fig. 3.13 Computing π by random shooting: different threads shoot independently, but the final
     result is a combination of all shots
     to another so the next iteration cannot be performed until the previous has been fin-
     ished. However, the omp parallel for directive in line 2 of Listing 3.16 states
     that the loop can and should be parallelized. The programmer unwisely requested
     the parallelization and took the responsibility for ensuring the loop can indeed be
     parallelized too lightly.                                                          
        From the parallel programming view, the program in Listing 3.17 is basically
     simple: num_shots shots are made within the parallel for loop in lines 17–23 and
     the number of hits is accumulated in a separate variable. Furthermore, it also
     resembles the program in Listing 3.14: the results of independent iterations are
     combined together to yield the final result (Fig. 3.13).
        The most intricate part of the program is generating random shots. The usual
     random generators, i.e., rand or random, are not reentrant or thread-safe: they
     should not be called in multiple threads because they use a single hidden state that
     is modified on each call regardless of the thread the call is made in. To avoid this
     problem, function rnd has been written: it takes a seed, modifies it, and returns a
     random value in the interval [0, 1). Hence, a distinct seed for each thread is created in
     lines 12–14 where OpenMP function omp_get_max_threads is used to obtain
     the number of future threads that will be used in the parallel for loop later on. Using
     these seeds, the program contains one distinct random generator for each thread.
        The rate of convergence toward π is much lower if random shooting is used
     instead of numerical integration. However, this example shows how simple it is to
     implement a wide class of Monte Carlo methods if the random generator is applied
     correctly: one must only run all random-based individual experiments, e.g., shots
     into [0, 1] × [0, 1] in lines 18–20, and aggregate the results, e.g., count the number
     of hits within the unit circle.
        As long as the number of individual experiments is known in advance and the
     complexity of individual experiments is approximately the same, the approach
     presented in Listing 3.17 suffices. Otherwise, a more sophisticated approach is needed,
     but more about that later.
        Before proceeding, we can rewrite the program in Listing 3.17 into a simpler one.
     By splitting the omp parallel and omp for directives we can define the thread-local
     seed inside the parallel region as shown in Listing 3.18.
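A minimal sketch of this idea follows. It is not the book's listing: the generator below is a hypothetical stand-in for rnd (a 64-bit linear congruential step), and the seed formula, the type uint64_t, and the names num_hits and pi_by_shooting are assumptions made here.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for rnd: advances the caller-owned seed and returns
   a value in [0, 1). It keeps no hidden state, so it is reentrant. */
double rnd(uint64_t *seed) {
    *seed = *seed * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*seed >> 11) / 9007199254740992.0;   /* divide by 2^53 */
}

/* Estimates pi by shooting num_shots points into [0, 1] x [0, 1]. */
double pi_by_shooting(long num_shots) {
    assert(num_shots > 0);
    long num_hits = 0;
    #pragma omp parallel
    {
        int t = 0;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
        /* Declared inside the parallel region, hence private to the thread. */
        uint64_t seed = 12345u + 1000003u * (uint64_t)t;
        #pragma omp for reduction(+:num_hits)
        for (long i = 0; i < num_shots; i++) {
            double x = rnd(&seed);
            double y = rnd(&seed);
            if (x * x + y * y < 1.0)
                num_hits++;
        }
    }
    return 4.0 * (double)num_hits / (double)num_shots;
}
```

Because each thread owns its seed, no array of seeds and no call to omp_get_max_threads is needed. The estimate depends on the number of threads, since each thread draws a different sequence.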
        Let us demonstrate that computing π by random shooting into [0, 1] × [0, 1] and
     counting the shots inside the unit circle can also be encoded differently as shown in
     Listing 3.19, but at its core it stays the same.
        Namely, the parallel regions, one for each available thread, specified by the omp
     parallel directive in line 13 are used instead of the parallel for loop (see also
     Listings 3.1 and 3.2). Within each parallel region, the seed for the thread-local random
     generator is generated in line 15. Then, the number of shots that must be carried
     out by the thread is computed in lines 16–18 and finally all shots are performed in
     a thread-local sequential while loop in lines 19–23. Unlike the iterations of the
     parallel for loop in Listing 3.18, the iterations of the while loop do not contain a
     call of function omp_get_thread_num. However, the aggregation of the results
     obtained by the parallel regions, i.e., the number of hits, is done using the reduction
     in the same way as in Listing 3.18.
     So far no attention has been paid to how iterations of a parallel loop, or of several
     collapsed parallel loops, are distributed among different threads in a single team
     of threads. However, OpenMP allows the programmer to specify several different
     iteration scheduling strategies.
        Consider computing the sum of integers from a given interval again using the
     program shown in Listing 3.14. This time, however, the program will be modified
     as shown in Listing 3.20. First, the schedule(runtime) clause is added to the
     omp for directive in line 8. It allows the iteration scheduling strategy to be defined
     once the program is started using the shell variable OMP_SCHEDULE. Second, in
     line 10, each iteration prints out the number of the thread that executes it. And third,
     different iterations take different amounts of time to execute as specified by the
     argument of function sleep in line 11: iterations 1, 2, and 3 require 2, 3, and 4 s,
     respectively, while all other iterations require just 1 s.
     Listing 3.20 Summation of integers from a given interval where iteration scheduling strategy is
     determined at runtime.
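The effect of schedule(runtime) can be explored with a sketch like the one below. It is a simplified variant rather than the book's listing: instead of printing and sleeping, it records in a caller-supplied array owner (an assumption made here, with at least max + 1 elements) which thread executed each iteration.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sums 1..max; owner[i] receives the number of the thread that ran iteration i.
   The scheduling strategy is taken from OMP_SCHEDULE when the program starts. */
long long sum_runtime_schedule(int max, int *owner) {
    assert(max >= 0 && owner != NULL);
    long long sum = 0;
    #pragma omp parallel for schedule(runtime) reduction(+:sum)
    for (int i = 1; i <= max; i++) {
        int t = 0;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
        owner[i] = t;       /* each iteration writes a distinct element */
        sum = sum + i;
    }
    return sum;
}
```

Running the same binary with, e.g., OMP_SCHEDULE=static,2 and then with OMP_SCHEDULE=dynamic,2 changes the contents of owner but never the sum.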
  1 s. Hence, thread T0 finishes much later than all other threads as can be seen in
  Fig. 3.14.
• If OMP_SCHEDULE=static,1 or OMP_SCHEDULE=static,2, the iterations
  are divided into chunks containing 1 or 2 iterations, respectively. Chunks
  are then assigned to threads in a round-robin fashion.
Fig. 3.14 A distribution of 14 iterations among 4 threads where iterations 1, 2 and 3 require more
time than other iterations, using static iteration scheduling strategy
Fig. 3.15 A distribution of 14 iterations among 4 threads where iterations 1, 2 and 3 require more
time than other iterations, using static,1 (left) and static,2 (right) iteration scheduling
strategies
Fig. 3.16 A distribution of 14 iterations among 4 threads where iterations 1, 2 and 3 require more
time than other iterations, using dynamic,1 (left) and dynamic,2 (right) iteration scheduling
strategies
     The scheduling of iterations is illustrated in Fig. 3.16: the overall running time is
     further reduced to 5 or 6 s, again depending on the chunk size. The overall running time
     of 5 s is the minimal possible as each thread performs the same amount of work.
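The running times quoted for the different strategies can be checked with a small sequential model. The sketch below is not OpenMP code; it only simulates the assignment of chunks under the assumptions that iterations are numbered 1 to n, that scheduling itself costs nothing, and that under dynamic scheduling the next chunk goes to the thread that becomes idle first (the lowest-numbered one on ties). All names are made up for this illustration.

```c
#include <assert.h>

#define MAX_THREADS 64

/* Cost model from the text: iterations 1, 2 and 3 take 2, 3 and 4 s,
   every other iteration takes 1 s. */
static int cost(int i) {
    return (i >= 1 && i <= 3) ? i + 1 : 1;
}

/* Total cost of the chunk that starts at iteration first (cut off at n). */
static int chunk_cost(int first, int chunk, int n) {
    int c = 0;
    for (int i = first; i < first + chunk && i <= n; i++)
        c += cost(i);
    return c;
}

static int longest_of(const int *busy, int threads) {
    int longest = 0;
    for (int t = 0; t < threads; t++)
        if (busy[t] > longest) longest = busy[t];
    return longest;
}

/* static,chunk: chunks are dealt out to threads round-robin. */
int static_makespan(int n, int threads, int chunk) {
    assert(n >= 0 && chunk > 0 && threads > 0 && threads <= MAX_THREADS);
    int busy[MAX_THREADS] = {0};
    int t = 0;
    for (int first = 1; first <= n; first += chunk) {
        busy[t] += chunk_cost(first, chunk, n);
        t = (t + 1) % threads;
    }
    return longest_of(busy, threads);
}

/* dynamic,chunk: the next chunk goes to the thread that is idle first. */
int dynamic_makespan(int n, int threads, int chunk) {
    assert(n >= 0 && chunk > 0 && threads > 0 && threads <= MAX_THREADS);
    int busy[MAX_THREADS] = {0};
    for (int first = 1; first <= n; first += chunk) {
        int idle = 0;
        for (int t = 1; t < threads; t++)
            if (busy[t] < busy[idle]) idle = t;
        busy[idle] += chunk_cost(first, chunk, n);
    }
    return longest_of(busy, threads);
}
```

For 14 iterations and 4 threads this model gives 6 s and 7 s for static,1 and static,2, and 5 s and 6 s for dynamic,1 and dynamic,2, the latter pair agreeing with the running times quoted above.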
        To produce Fig. 3.17, max_iters has been set to 100. Each point of the black
     region, i.e., within the Mandelbrot set, takes 100 iterations to compute. However, each
     point within the dark gray region requires more than 10 yet fewer than 100 iterations.
     Likewise, each point within the light gray region requires more than 5 and fewer than
     10 iterations and all the rest, i.e., points colored white, require at most 5 iterations
     each. As different points and thus different iterations of the collapsed for loops
     in lines 2 and 3 require significantly different amounts of computation, it matters
     what iteration scheduling strategy is used. Namely, if static,100 is used instead
     of simply static, the running time is reduced by approximately 30 percent; the
     choice of dynamic,100 reduces the running time even more. Run the program
     and measure its running time under different iteration scheduling strategies.
     The parallel for loop and reduction operation are so important in OpenMP pro-
     gramming that they should be studied and understood in detail.
        Let us return to the program for computing the sum of integers from 1 to max as
     shown in Listing 3.14. If it is assumed that T, the number of threads, divides max and
     the static iteration scheduling strategy is used, the program can be rewritten into
     the one shown in Listing 3.22. (See exercises for the case when T does not divide
     max.)
Listing 3.22 Implementing efficient summation of integers by hand using simple reduction.
        The initial thread first obtains T , the number of threads available (using OpenMP
     function omp_get_max_threads), and creates an array sums of variables used
     for summation within each thread. Although the array sums is going to be shared
     by all threads, each thread will access only one of its T elements.
Fig. 3.18 Computing the reduction in time $O(\log_2 T)$ using $T/2$ threads when $T = 12$
        On reaching the omp parallel region, the master thread creates (T − 1) slave threads
    to run alongside the master thread. Each thread, master or slave, first computes its
    subinterval (lines 11–12), initializes its local summation variable to 0 (line 13), and
    then executes its thread-local sequential for loop (lines 14–15). Once all threads
    have finished computing local sums, only the master thread is left alive. It adds the
    local summation variables and prints the result. The overall execution is performed
    as shown in Fig. 3.9. However, no race conditions appear because each thread uses
    its own summation variable, i.e., the t-th thread uses the t-th element sums[t] of
    array sums.
        From the implementation point of view, the program in Listing 3.22 uses array
    sums instead of thread-local summation variables and performs the reduction by
    the master thread only. Array sums is created by the master thread before creating
    slave threads so that the explicit reduction, which is performed in line 18 after the
    slave threads have been terminated and their local variables (t, lo, hi, and n) have
    been lost, can be implemented.
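The structure just described can be sketched as follows. This is an illustration under assumptions, not the book's Listing 3.22: the function name sum_by_hand is made up, and the subinterval bounds are computed with a formula that also works when the number of threads does not divide max (for the divisible case it yields equally long subintervals).

```c
#include <assert.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

long long sum_by_hand(int max) {
    assert(max >= 0);
    int T = 1;
#ifdef _OPENMP
    T = omp_get_max_threads();
#endif
    /* One summation variable per thread, created before the parallel region
       so that it outlives the threads' local variables. */
    long long *sums = calloc((size_t)T, sizeof *sums);
    assert(sums != NULL);
    #pragma omp parallel num_threads(T)
    {
        int t = 0, n = 1;
#ifdef _OPENMP
        t = omp_get_thread_num();
        n = omp_get_num_threads();   /* actual team size, at most T */
#endif
        long long lo = (long long)max * t / n + 1;      /* first integer of thread t */
        long long hi = (long long)max * (t + 1) / n;    /* last integer of thread t  */
        sums[t] = 0;
        for (long long i = lo; i <= hi; i++)
            sums[t] = sums[t] + i;
    }
    /* Explicit reduction performed by the master thread alone: O(T) additions. */
    long long sum = 0;
    for (int t = 0; t < T; t++)
        sum = sum + sums[t];
    free(sums);
    return sum;
}
```

Each thread writes only to its own element sums[t], so no locking or atomic access is needed.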
        Furthermore, the reduction is performed by adding local summation variables,
    i.e., elements of sums, one after another to variable sum. This takes $O(T)$ time
    and works fine if the number of threads is small, e.g., T = 4 or T = 8. However, if
    there are a few hundred threads, a solution shown in Listing 3.23 that works in time
    $O(\log_2 T)$ and produces the result in sums[0], is often preferred (unless the target
    system architecture requires an even more sophisticated method).
Listing 3.23 Implementing efficient summation of integers by hand using simple reduction.
    The idea behind the code shown in Listing 3.23 is illustrated in Fig. 3.18. In
    Listing 3.23, variable d contains the distance between elements of array sums being
    added, and as it doubles in each iteration, there are $\lceil \log_2 T \rceil$ iterations of the outer
    loop. Variable t denotes the left element of each pair being added in the inner loop.
    But as the inner loop is performed in parallel by at least $T/2$ threads which operate
    on distinct elements of array sums, all additions in the inner loop are performed
    simultaneously, i.e., in time $O(1)$.
       Note that either method used for computing the reduction uses (T − 1) additions.
    However, in the first method (line 18 of Listing 3.22) additions are performed one
after another while in the second method (Listing 3.23) certain additions can be
performed simultaneously.
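The logarithmic reduction can be sketched as a small function operating on the array of partial sums. This is a minimal illustration of the idea, assuming an array of T elements of type long long; the function name tree_reduce is made up here.

```c
#include <assert.h>

/* Adds up sums[0..T-1] in place; the total ends up in sums[0].
   d is the distance between the two elements of each pair, t is the left
   element of a pair. */
long long tree_reduce(long long *sums, int T) {
    assert(sums != 0 && T > 0);
    for (int d = 1; d < T; d = 2 * d) {
        /* Pairs (t, t + d) touch distinct elements, so the iterations of the
           inner loop are independent and may run simultaneously. */
        #pragma omp parallel for
        for (int t = 0; t < T - d; t = t + 2 * d)
            sums[t] = sums[t] + sums[t + d];
    }
    return sums[0];
}
```

For T = 12 the outer loop runs for d = 1, 2, 4 and 8, i.e., four times, and the inner loops perform 6 + 3 + 1 + 1 = 11 = T − 1 additions in total.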
Although most parallel programs spend most of their time running parallel loops,
this is not always the case. Hence, it is worth exploring how a program consisting of
different tasks can be parallelized.
As above, where the parallelization of loops that need not combine the results of their
iterations was explained first, we start with an explanation of tasks where cooperation
is not needed.
    Consider computing the sum of integers from 1 to max one more time. At the end
of the previous section, it was shown how iterations of a single parallel for loop are
distributed among threads. This time, however, the interval from 1 to max is split
into a number of mutually disjoint subintervals. For each subinterval, a task is used
that first computes the sum of all integers of the subinterval and then adds the sum of
the subinterval to the global sum.
    The idea is implemented as the program in Listing 3.24. For the sake of simplicity,
it is assumed that T , denoting the number of tasks and stored in variable tasks,
divides max.
    Computing the sum is performed in the parallel block in lines 9–25. The for
loop in line 12 creates all T tasks where each task is defined by the code in lines
13–23. Once the tasks are created, it is more or less up to OpenMP’s runtime system
to schedule tasks and execute them.
    The important thing, however, is that the for loop in line 12 is executed by only
one thread as otherwise each thread would create its own set of T tasks. This is
achieved by placing the for loop in line 12 under the OpenMP directive single.
    The OpenMP directive task in line 13 specifies that the code in lines 14–23 is
to be executed as a single task. The local sum is initialized to 0 and the subinterval
bounds are computed from the task number, i.e., t. The integers of the subinterval
are added up and the local sum is added to the global sum using an atomic section to
prevent a race condition between two different tasks.
    Note that when a new task is created, the execution of the task that created the
new task continues without delay; once created, the new task has a life of its own.
Namely, when the master thread in Listing 3.24 executes the for loop, it creates
one new task in each iteration, but the iterations (and thus creation of new tasks) are
executed one after another without waiting for the newly created tasks to finish (in
fact, it would make no sense at all to wait for them to finish). However, all tasks must
     finish before the parallel region can end. Hence, once the global sum is printed
     out in line 26 of Listing 3.24, all tasks have already finished.
        The difference between the approaches taken in the previous and this section can
     be told in yet another way. Namely, when iterations of a single parallel for loop
     are distributed among threads, tasks, one per thread, are created implicitly. But when
     a number of explicit tasks is used, the loop itself is split among tasks that are then
     distributed among threads.
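The task-based summation described in this section might be sketched as follows. It is not the book's Listing 3.24: the function name sum_with_tasks is an assumption, and the subinterval bounds are computed so that the sketch also works when the number of tasks does not divide max.

```c
#include <assert.h>

/* Sums 1..max using the given number of explicit tasks. */
long long sum_with_tasks(int max, int tasks) {
    assert(max >= 0 && tasks > 0);
    long long sum = 0;
    #pragma omp parallel
    {
        /* Only one thread creates the tasks; any thread may execute them. */
        #pragma omp single
        for (int t = 0; t < tasks; t++) {
            #pragma omp task shared(sum) firstprivate(t)
            {
                long long local_sum = 0;
                long long lo = (long long)max * t / tasks + 1;
                long long hi = (long long)max * (t + 1) / tasks;
                for (long long i = lo; i <= hi; i++)
                    local_sum = local_sum + i;
                /* Tasks cooperate only here, once per task. */
                #pragma omp atomic
                sum += local_sum;
            }
        }
    }   /* all tasks have finished when the parallel region ends */
    return sum;
}
```

The task that runs the for loop does not wait for the tasks it creates; the implicit barrier at the end of the parallel region guarantees that sum is complete before it is returned.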
                                  OpenMP: tasks
  A task is declared using the directive
     #pragma omp task [clause [[ ,] clause] …]
        structured-block
  The task directive creates a new task that executes structured-block. The new
  task can be executed immediately or can be deferred. A deferred task can be
  later executed by any thread in the team.
  The task directive can be further refined by a number of clauses, the most
  important being the following ones:
  • final(scalar-logical-expression) causes, if scalar-logical-expression eval-
    uates to true, that the created task does not generate any new tasks any more,
    i.e., the code of would-be-generated new subtasks is included in and thus
    executed within this task;
  • if([ task:]scalar-logical-expression) causes, if scalar-logical-expression
    evaluates to false, that an undeferred task is created, i.e., the created task
    suspends the creating task until the created task is finished.
  For other clauses, see the OpenMP specification.
  Converting a parallel for loop into a set of tasks is not very interesting and in
most cases does not help either. The real power of tasks, however, can be appreciated
when the number and the size of individual tasks cannot be known in advance. In
     Listing 3.25 Computing Fibonacci numbers using OpenMP’s tasks: smaller tasks, i.e., for smaller
     Fibonacci numbers are created first.
     other words, when the problem or the algorithm demands that tasks are created
     dynamically.
     Listing 3.26 Computing Fibonacci numbers using OpenMP’s tasks: smaller tasks, i.e., for smaller
     Fibonacci numbers are created last.
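Dynamically created tasks can be illustrated with the naive recursive computation of Fibonacci numbers. The sketch below is not one of the book's listings: the names fib and par_fib, the use of the final clause, and the cutoff value 20 are assumptions chosen for illustration. Here the task for the larger Fibonacci number is created first.

```c
#include <assert.h>

/* Naive recursion: each call spawns two child tasks and waits for both. */
long fib(int n) {
    if (n < 2) return n;
    long a, b;
    /* a and b are set by the child tasks, so they must be shared with them.
       Below the cutoff, final() makes all further tasks included ones, i.e.,
       they are executed immediately by the encountering thread. */
    #pragma omp task shared(a) final(n < 20)
    a = fib(n - 1);
    #pragma omp task shared(b) final(n < 20)
    b = fib(n - 2);
    /* Wait for both child tasks before using their results. */
    #pragma omp taskwait
    return a + b;
}

/* Entry point: the recursion is started by exactly one thread of the team. */
long par_fib(int n) {
    assert(n >= 0);
    long result = 0;
    #pragma omp parallel
    #pragma omp single
    result = fib(n);
    return result;
}
```

How many tasks are created is not known before the program runs; it depends on n and on the cutoff.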
     Listing 3.27 The parallel implementation of the Quicksort algorithm where each recursive call is
     performed as a new task.
         The partition part of the algorithm, implemented in lines 4–14 of Listing 3.27,
     is the same as in the sequential version. The recursive calls, though, are modified
    because they can be performed independently, i.e., at the same time. Each of the two
    recursive calls is therefore executed as its own task.
       However, no matter how efficient creating new tasks is, it takes time. Creating a
    new task only makes sense if the part of the table that must be sorted using a recursive
    call is big enough. In Listing 3.27, the clause final in lines 15 and 17 is used to
    prevent creating new tasks for parts of the table that contain fewer than 1000 elements.
    The threshold 1000 has been chosen by experience; choosing the best threshold depends
    on many factors (the number of elements, the time needed to compare two elements,
    the implementation of OpenMP’s tasks, …). The experimental way of choosing it
    shall be, to some extent, covered in the forthcoming chapters.
       There is an analogy with the sequential algorithm: recursion takes time as well
    and to speed up the sequential Quicksort algorithm, the insertion sort is used once
    the number of elements falls below a certain threshold.
       There should be no confusion about the arguments for function par_qsort .
    However, function par_qsort must be called within a parallel region by
    exactly one thread as shown in Listing 3.28.
        #pragma omp parallel
            #pragma omp single
                par_qsort(strings, 0, num_strings - 1, compare);
Listing 3.28 The call of the parallel implementation of the Quicksort algorithm.
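To make the structure concrete, here is a minimal sketch of a task-based Quicksort together with its call. It differs from the book's Listing 3.27 in several assumed details: it sorts an array of int rather than strings with a compare function, it uses a Hoare-style partition around the middle element, and the names par_qsort_int and sort_ints are made up.

```c
#include <assert.h>

static void swap_ints(int *a, int *b) { int tmp = *a; *a = *b; *b = tmp; }

/* Sorts table[lo..hi] (both bounds inclusive). */
void par_qsort_int(int *table, int lo, int hi) {
    if (lo >= hi) return;
    /* Sequential partition phase. */
    int pivot = table[lo + (hi - lo) / 2];
    int l = lo, h = hi;
    while (l <= h) {
        while (table[l] < pivot) l++;
        while (table[h] > pivot) h--;
        if (l <= h) {
            swap_ints(&table[l], &table[h]);
            l++;
            h--;
        }
    }
    /* The two parts are disjoint, so they can be sorted as independent tasks.
       For parts with fewer than 1000 elements no further tasks are generated. */
    #pragma omp task final(h - lo + 1 < 1000)
    par_qsort_int(table, lo, h);
    #pragma omp task final(hi - l + 1 < 1000)
    par_qsort_int(table, l, hi);
}

/* The recursion must be started inside a parallel region by exactly one thread. */
void sort_ints(int *table, int n) {
    assert(table != 0 && n >= 0);
    #pragma omp parallel
    #pragma omp single
    par_qsort_int(table, 0, n - 1);
}
```

No taskwait is needed here because the recursive calls return no value; the implicit barrier at the end of the parallel region ensures that all tasks have finished before sort_ints returns.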
       As the Quicksort algorithm itself is rather efficient, i.e., it runs in time O(n log n),
    a sufficient number of elements must be used to see that the parallel version actually
    outperforms the sequential one. The comparison of running times is summarized in
    Table 3.1. By comparing the running times of the sequential version with the parallel
    version running within a single thread, one can estimate the time needed to create
    and destroy OpenMP’s threads.
       Using 4 or 8 threads the parallel version is definitely faster, although the speedup
    is not proportional to the number of threads used. Note that the partition of the table
    in lines 4–14 of Listing 3.27 is performed sequentially and recall Amdahl's law.
     Table 3.1 The comparison of the running time of the sequential and parallel version of the
     Quicksort algorithm when sorting n random strings of max length 64 using a quad-core
     processor with multithreading
     n       seq         par (1 thread)   par (4 threads)   par (8 threads)
     10^5    0.05 s      0.07 s           0.04 s            0.04 s
     10^6    0.79 s      0.99 s           0.44 s            0.32 s
     10^7    11.82 s     12.47 s          4.27 s            3.57 s
     10^8    201.13 s    218.14 s         71.90 s           61.81 s
        Counting swaps during the partition phase in a sequential program is trivial. For
     instance, as shown in Listing 3.29, three new variables can be introduced, namely
     count, locount, and hicount that contain the number of swaps in the current
     partition phase and the total numbers of swaps in recursive calls, respectively. (In
     the sequential program, this could be done with a single counter, but having three
     counters instead is more appropriate for developing the parallel version.)
Listing 3.29 The call of the parallel implementation of the Quicksort algorithm.
        In the parallel version, the modification is not much harder, but a few things must
     be taken care of. First, as recursive calls in lines 16 and 18 of Listing 3.27 change
     to assignment statements in lines 19 and 21 of Listing 3.29, the values of variables
     locount and hicount are set in two newly created tasks and must, therefore, be
     shared among the creating and the created tasks. This is achieved using the shared
     clause in lines 18 and 20.
    Second, remember that once the new tasks in lines 18–19 and 20–21 are created,
the task that has just created them continues. To prevent it from computing the sum of
all three counters and returning the result when variables locount and hicount
might not have been set yet, the taskwait directive is used. It represents an explicit
barrier: all tasks created by the task executing it must finish before that task can
continue.
    At the end of the parallel section there is an implicit barrier at which all tasks
created within the parallel section must finish, just like all iterations of a parallel
loop must. Hence, in Listing 3.24, there is no need for an explicit barrier using
taskwait.
Exercises
 1. Modify the program in Listing 3.1 so that it uses a team of 5 threads within the par-
    allel region by default. Investigate how shell variables OMP_NUM_THREADS and
    OMP_THREAD_LIMIT influence the execution of the original and modified
    program.
 2. If run with one thread per logical core, threads started by the program in
    Listing 3.1 print out their thread numbers in random order while threads started
    by the program in Listing 3.2 always print out their results in the same order.
    Explain why.
 3. Suppose two 100 × 100 matrices are to be multiplied using 8 threads. How many
    dot products, i.e., operations performed by the innermost for loop, must each
    thread compute if different approaches to parallelizing the two outermost for
    loops of matrix multiplication illustrated in Fig. 3.6 are used?
 4. Draw a 3D graph with the size of the square matrix along one independent axis,
    e.g., from 1 to 100, and the number of available threads, e.g., from 1 to 16, along
    the other showing the ratio between the number of dot products computed by
    the most and the least loaded thread for different approaches to parallelizing the
    two outermost for loops of matrix multiplication illustrated in Fig. 3.6.
 5. Modify the programs for matrix multiplication based on different loop
    parallelization methods to compute $C = A \cdot B^T$ instead of $C = A \cdot B$. Compare the
    running time of the original and modified programs.
 6. Suppose 4 threads are being used when the program in Listing 3.20 is run with
    max = 20. Determine which iteration will be performed by which thread if static,1,
    static,2 or static,3 is used as the iteration scheduling strategy. Try without
    running the program first. (Assume that iterations 1, 2 and 3 require 2, 3 and 4
    units of time while all other iterations require just 1 unit of time.)
 7. Suppose 4 threads are being used when the program in Listing 3.20 is run with
    max = 20. Determine which iteration will be performed by which thread if
    dynamic,1, dynamic,2 or dynamic,3 is used as the iteration scheduling
    strategy. Is the solution uniquely defined? (Assume that iterations 1, 2 and 3
    require 2, 3 and 4 units of time while all other iterations require just 1 unit of
    time.)
 8. Modify lines 12 and 13 in Listing 3.22 so that the program works correctly even
    if T , the number of threads, does not divide max. The number of iterations of
    the for loop in lines 15 and 16 should not differ by more than 1 for any two
    threads.
 9. Modify the program in Listing 3.22 so that the modified program implements
    static,c iteration scheduling strategy instead of static as is the case in
    Listing 3.22. The chunk size c must be a constant declared in the program.
10. Modify the program in Listing 3.22 so that the modified program implements
    dynamic,c iteration scheduling strategy instead of static as is the case in
    Listing 3.22. The chunk size c must be a constant declared in the program.
    Hint: Use a shared counter of iterations that functions as a queue of not yet
    scheduled iterations outside the parallel section.
11. While computing the sum of all elements of sums in Listing 3.23, the program
    creates new threads within every iteration of the outer loop. Rewrite the code so
    that creation of new threads in every iteration of the outer loop is avoided.
12. Try rewriting the programs in Listings 3.25 and 3.26 using parallel for loops
    instead of OpenMP’s tasks to mimic the behavior of the original program as
    close as possible. Find out which iteration scheduling strategy should be used.
    Compare the running time of programs using parallel for loops with those that
    use OpenMP’s tasks.
13. Modify the program in Listing 3.27 so that it does not use final but works in
    the same way.
14. Check the OpenMP specification and rewrite the program in Listing 3.24 using
    the taskloop directive.
Mini Projects
P1. Write a multi-core program that uses CYK algorithm [13] to parse a string of
    symbols. The inputs are a context-free grammar G in Chomsky Normal Form
    and a string of symbols. At the end, the program should print yes if the string
    of symbols can be derived by the rules of the grammar and no otherwise.
     Write a sequential program (no OpenMP directives at all) as well. Compare the
     running time of the sequential program with the running time of the multi-core
     program and compute the speedup for different grammars and different string
     lengths.
     Hint: Observe that in the classical formulation of CYK algorithm the iterations
     of the outermost loop must be performed one after another but that iterations
     of the second outermost loop are independent and offer a good opportunity for
     parallelization.
P2. Write a multi-core program for the “all-pairs shortest paths” problem [5]. The
    input is a weighted graph with no negative cycles and the expected output are
    lengths of the shortest paths between all pairs of vertices (where the length of a
    path is a sum of weights along the edges that the path consists of).
    Write a sequential program (no OpenMP directives at all) as well. Compare the
    running time of the sequential program with the running time of the multi-core
    program and compute the speedup achieved
     1. for different number of cores and different number of threads per core, and
     2. for different number of vertices and different number of edges.
     Hint 1: Take the Bellman–Ford algorithm for all-pairs shortest paths [5] and
     consider its matrix multiplication formulation. For a graph $G = \langle V, E \rangle$ your
     program should achieve at least time $O(|V|^4)$, but you can do better and achieve
     time $O(|V|^3 \log_2 |V|)$. In neither case should you ignore the cache performance:
     allocate matrices carefully.
     Hint 2: Instead of using the Bellman–Ford algorithm, you can try parallelizing
     the Floyd–Warshall algorithm that runs in time $O(|V|^3)$ [5]. How fast is the
     program based on the Floyd–Warshall algorithm compared with the one that
     uses the $O(|V|^4)$ or $O(|V|^3 \log_2 |V|)$ Bellman–Ford algorithm?
The primary source of information, including all details of the OpenMP API, is the
OpenMP web site [20], where the complete specification [18] and a collection of
examples [19] are available. OpenMP version 4.5 is used in this book, as version 5.0
is still being worked on by the OpenMP Architecture Review Board. The summary
card for C/C++ is also available at the OpenMP web site.
    As standards and specifications are usually hard to read, one might consider a
book wholly dedicated to programming with OpenMP. Although relatively old and
thus lacking most of the modern OpenMP features, the book by Rohit Chandra
et al. [4] provides a nice introduction to the ideas on which OpenMP is based and
to the basic OpenMP constructs. A more recent and comprehensive description of
OpenMP, version 4.5, can be found in the book by Ruud van der Pas et al. [21].
4 MPI Processes and Messaging
Chapter Summary
Distributed memory computers cannot communicate through a shared memory.
Therefore, messages are used to coordinate parallel tasks that may run on
geographically distributed but interconnected processors. Processes, as well as their
management and communication, are well defined by a platform-independent
Message Passing Interface (MPI) specification. MPI is introduced from a practical
point of view, with a set of basic operations that enable the implementation of
parallel programs. We give simple example programs that help the reader get started
with MPI and serve as motivation for developing more complex applications.
We know from previous chapters that there are two main differences between the
shared memory and distributed memory computer architectures. The first difference
is in the price of communication: the time needed to exchange a certain amount of
data between two or more processors is shorter on shared memory computers, as
these can usually communicate much faster than distributed memory computers.
The second difference, which is in the number of processors that can cooperate
efficiently, is in favor of distributed memory computers. Usually, our primary choice
when computing complex tasks will be to engage a large number of the fastest
available processors, but the communication among them poses additional
limitations. Cooperation among processors implies communication or data exchange
among them. When the number of processors must be high (e.g., more than eight) to
reduce the execution time, the speed of communication becomes a crucial
performance factor.
    There is a significant difference in the speed of data movement between two
computing cores within a single multi-core computer, depending on the location of
the data to be communicated. This is because the data can be stored in registers,
cache memory, or system memory, whose access times can differ by up to two orders
of magnitude. The differences in communication speed are even more pronounced
between interconnected computers, again by orders of magnitude, but there they
depend on the technology and topology of the interconnection network and on the
geographical distance between the cooperating computers.
© Springer Nature Switzerland AG 2018
R. Trobec et al., Introduction to Parallel Computing,
Undergraduate Topics in Computer Science,
https://doi.org/10.1007/978-3-319-98833-7_4
   Taking into account the above facts, complex tasks can be executed efficiently
either (i) on a small number of extremely fast computers or (ii) on a large number of
potentially slower interconnected computers. In this chapter, we focus on the
presentation and usage of the Message Passing Interface (MPI), which enables
system-independent parallel programming. The well-established MPI standard¹
includes process creation and management, language bindings for C and Fortran,
point-to-point and collective communications, and group and communicator concepts.
Newer MPI standards aim to better support scalability in future extreme-scale
computing systems, because currently the only feasible option for increasing
the computing power is to increase the number of cooperating processors. Advanced
topics, such as one-sided communications, extended collective operations, process
topologies, and external interfaces, are also covered by these standards but are
beyond the scope of this book.
   The final goal of this chapter is to advise users how to employ the basic MPI
principles in the solution of complex problems with a large number of processes that
exchange application data through messages.
Programmers have to be aware that cooperation among processes implies data
exchange. The total execution time is consequently the sum of the computation and
communication times. Algorithms with only local communication between neighboring
processors are faster and more scalable than algorithms with global communication
among all processors. Therefore, the programmer’s view of a problem that
will be parallelized has to incorporate a wide range of aspects, e.g., data
independence, communication type and frequency, balancing the load among processors,
balancing between communication and computation, overlapping communication
and computation, synchronous or asynchronous program flow, stopping criteria, and
others.
   Most of the above issues that are related to communication are efficiently solved by
the MPI specification. Therefore, we will identify the mentioned aspects and describe
efficient solutions through the standardized MPI operations. The following sections
should not be considered an MPI reference guide or an MPI library implementation
manual. We will just try to raise the interest of readers through simple and
illustrative examples and show how some typical problems can be efficiently solved
with the MPI methodology.
¹ To avoid potential ambiguities, some segments of text are reproduced from A Message-Passing
Interface Standard (Version 3.1), © 1993, 1994, 1995, 1996, 1997, 2008, 2009, 2012, 2015, by
University of Tennessee, Knoxville, Tennessee.
The standardization effort of a message passing interface (MPI) library began in the
1990s and is one of the most successful projects in software standardization. Its
driving force was, from the beginning, the cooperation between academia and industry
that was established through the MPI standardization forum.
   The MPI library interface is a specification, not an implementation. MPI is
not a language, and all MPI “operations” are expressed as functions, subroutines, or
methods, according to the appropriate language bindings for C and Fortran, which
are a part of the MPI standard. The MPI standard defines the syntax and semantics of
library operations that support the message passing model, independently of the
programming language or compiler specification.
   Since the word “PARAMETER” is a keyword in the Fortran language, the MPI
standard uses the word “argument” to denote the arguments to a subroutine. It is
expected that C programmers will understand the word “argument”, which has no
specific meaning in C, as a “parameter”, which avoids unnecessary confusion
for Fortran programmers.
   An MPI program consists of autonomous processes that are able to execute their
own code in the sense of the multiple instruction multiple data (MIMD) paradigm. An
MPI “process” can be interpreted in this sense as a program counter that addresses
its program instructions in the system memory, which implies that the program
code executed by each process need not be the same.
   The processes communicate via calls to MPI communication operations,
independently of the operating system. MPI can be used in a wide range of programs
written in C or Fortran. Based on the MPI library specifications, several efficient
MPI library implementations have been developed, both open-source and
proprietary. The success of the project is evidenced by the coherent development
of parallel software projects that are portable between different computing
environments, e.g., parallel computers, clusters, and heterogeneous networks, and
scalable across a wide range of numbers of cooperating processors, from one to
millions. Finally, the MPI interface is designed for end users, parallel library
writers, and developers of parallel software tools.
   Any MPI program should contain operations that initialize the execution
environment and control the start-up and termination of all generated processes. MPI
processes can be collected into groups of a specific size, and each group can
communicate in its own context, where a message sent in a context can be received
only in the same context. A process group and a context together form an MPI
communicator. A process is identified by its rank in the group associated with a
communicator. There is a default communicator MPI_COMM_WORLD whose group
encompasses all initial processes, and whose context is default. Two essential
questions arise early in any MPI parallel program: “How many processes are
participating in the computation?” and “What are their identities?” Both questions
are answered by calling two specialized MPI operations.
   The basic MPI communication is characterized by two fundamental MPI operations,
MPI_SEND and MPI_RECV, which send and receive process data represented by
numerous data types. Besides transferring data, these two operations synchronize
the cooperating processes at the time instants where communication has to be
established, e.g., a process cannot proceed if the expected data has not arrived.
Further, sophisticated addressing is supported within a group of ranked processes
that are part of a communicator. A single program may use several communicators,
which manage common or separate MPI processes. This concept makes it possible to
use different MPI-based parallel libraries that communicate independently, without
interference, within a single parallel program.
   Even though most parallel algorithms can be implemented with just a few
MPI operations, the MPI-1 standard offers a set of more than 120 operations for
elegant and efficient programming, including operations for collective and
asynchronous communication in numerous topologies of interconnected computers. The
MPI library has been well documented from the beginning and is constantly evolving.
MPI-2 provides standardized process start-up, dynamic process creation and
management, improved data types, one-sided communication, and versatile input/output
operations. The MPI-3 standard introduces non-blocking collective communication,
which enables overlapping of communication and computation, and the MPI shared
memory (SHM) model, which enables efficient programming of hybrid architectures,
e.g., a network of multi-core computing nodes.
   Complete MPI is quite a large library with 128 MPI-1 operations, twice as
many in MPI-2, and even more in MPI-3. We will start with only six basic operations
and then add a few more from the complete MPI set for greater flexibility in parallel
programming. For the purposes of this textbook, however, one needs to master only
a few dozen MPI operations, which are described in more detail in the following
sections.
   Very well organized documentation can be found on several web pages, for example,
at http://www.mcs.anl.gov/research/projects/mpi/tutorial/mpiexmpl/contents.html,
with assignments, solutions, program output, and many useful hints and additional
links. The latest MPI standard and further information about MPI are available at
http://www.mpi-forum.org/.
   The “Hello World” program has been written in the C programming language; hence,
the three-line preamble should be commented out and replaced by int main(int argc,
char **argv) if a C++ compiler is used. The “Hello World” code looks like
standard C code with several additional lines with the MPI_ prefix, which are calls
to global MPI operations that are executed on all processes. Note that some MPI
operations that will be introduced later could be local, i.e., executed on a single
process.
   The “Hello World” code in Listing 4.1 is the same for all processes. It has to
be compiled only once to be executed on all active processes. Such a methodology
can simplify the development of parallel programs. Run the program with:
  $ mpiexec -n 3 MSMPIHello
from the Command Prompt of the host, in the directory where MSMPIHello.exe is
located. The program should output three “Hello World” messages, each with the
identification data of a process.
   All non-MPI procedures are local, e.g., printf in the above example. It runs on
each process and prints a separate “Hello World” notice. If one would prefer a
notice from a specific process only, e.g., process 0, an extra if(rank == 0)
statement should be inserted. Let us comment on the rest of the above code:
        The above MPI program, including the definition of variables, will be executed
     in all active processes. The number of processes will be determined by parameter
     -n of the MPI execution utility mpiexec, usually provided by the MPI library
     implementation.
   The printf function will run on each process, and each process will print a
separate “Hello World” notice. If all processes print the output, we expect size
lines with the “Hello World” notice, one from each process. Note that the order of
the printed notices is not known in advance, because there is no guarantee about
the ordering of the MPI processes. We will address this topic in more detail later
in this chapter. Note also that in this simple example no communication between
processes has been required.
      program hello_world
      include '/usr/include/mpif.h'
      integer ierr, num_procs, my_id

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, num_procs, ierr)
      print *, "Hello world from process ", my_id, " of ", num_procs
      call MPI_FINALIZE(ierr)
      stop
      end
Listing 4.2 “Hello world” MPI program OMPIHello.f in Fortran programming language.
   Note that the capitalized MPI_ prefix is used again in the names of MPI operations,
which are also capitalized in Fortran syntax, but a different header file, mpif.h, is
included. MPI operations return a status of execution success, i.e., ierr in the case
of the Fortran program.
• Function names are equal to the MPI definitions but with the MPI_ prefix and the
  first letter of the function name in uppercase, e.g., MPI_Finalize().
• The status of execution success of MPI operations is returned as an integer return
  code, e.g., ierr = MPI_Finalize(). The return code can be an error code
  or MPI_SUCCESS for successful completion, defined in the file mpi.h. Note
  that all predefined constants and types are fully capitalized.
• Operation arguments IN are usually passed by value, with the exception of the send
  buffer, which is determined by its initial address. All OUT and INOUT arguments
  are passed by reference (as pointers), e.g.,
  MPI_Comm_size (MPI_COMM_WORLD, &size).
MPI communication operations specify the message data length in terms of number
of data elements, not in terms of number of bytes. Specifying message data elements
is machine independent and closer to the application level. In order to retain machine
independent code, the MPI standard defines its own basic data types that can be used
for the specification of message data values, and correspond to the basic data types
of the host language.
   As MPI does not require that communicating processes use the same representation
of data, i.e., data types, it needs to keep track of possible data types through the
built-in basic MPI data types. For more specific applications, MPI offers operations
to construct custom data types, e.g., an array of (int, float) pairs, and many other
options. Even though the typecasting between a particular language and the MPI
library may represent a significant overhead, the portability of MPI programs
benefits significantly.
   Some basic MPI data types and the corresponding C or Fortran data types
are listed in Table 4.1. Details on advanced structured and custom data types can be
found in the aforementioned references.
Table 4.1 Some MPI data types corresponding to C and Fortran data types
MPI data type   C data type   MPI data type          Fortran data type
MPI_INT         int           MPI_INTEGER            INTEGER
MPI_SHORT       short int     MPI_REAL               REAL
MPI_LONG        long int      MPI_DOUBLE_PRECISION   DOUBLE PRECISION
MPI_FLOAT       float         MPI_COMPLEX            COMPLEX
MPI_DOUBLE      double        MPI_LOGICAL            LOGICAL
MPI_CHAR        char          MPI_CHARACTER          CHARACTER
MPI_BYTE        /             MPI_BYTE               /
MPI_PACKED      /             MPI_PACKED             /
   The data types MPI_BYTE and MPI_PACKED do not correspond to a C or a
Fortran data type. A value of type MPI_BYTE consists of a byte, i.e., 8 binary digits.
A byte is uninterpreted and is different from a character. Different machines may have
different representations for characters or may use more than one byte to represent
characters. On the other hand, a byte has the same binary value on all machines. If
the size and representation of data are known, the fastest way is the transmission of
raw data, for example, by using an elementary MPI data type MPI_BYTE.
   So far, the MPI communication operations have involved only buffers containing a
contiguous sequence of identical basic data types. Often, one wants to pass messages
that contain values with different data types, e.g., a number of integers followed by a
sequence of real numbers; or one wants to send noncontiguous data, e.g., a subblock
of a matrix. The type MPI_PACKED is maintained by the MPI_PACK and MPI_UNPACK
operations, which make it possible to pack different types of data into a contiguous
send buffer and to unpack them from a contiguous receive buffer.
   A more efficient alternative is the use of derived data types for the construction
of custom message data. Derived data types make it possible, in most cases, to avoid
explicit packing and unpacking, which saves memory and time. A user specifies
in advance the layout of the data to be sent or received, and the communication
library can directly access noncontiguous data. The simplest noncontiguous data
type is the vector type, constructed with MPI_Type_vector. For example, a
sender process has to communicate the main diagonal of an N × N array of integers,
declared as:
int matrix[N][N];
which is stored in a row-major layout. A derived data type diagonal
can be declared:
MPI_Datatype diagonal;
and constructed so that it specifies the main diagonal as a set of integers:
MPI_Type_vector (N, 1, N+1, MPI_INT, &diagonal);
where their count is N, block length is 1, and stride is N+1. Before it is used in
communication, the new type must be committed with MPI_Type_commit. The receiver
process receives the data as a contiguous block. There are further options that
enable the construction of sub-arrays, structured data, irregularly strided data, etc.
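To make the meaning of count, block length, and stride concrete, the following plain C sketch mimics the layout that such a vector type describes. It deliberately makes no MPI calls: the function name gather_vector_layout is hypothetical, and a non-negative stride measured in elements is assumed. It copies the selected elements into a contiguous array, which is conceptually what the receiver sees.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C emulation (an illustration only, not an MPI routine) of the layout
   described by a vector type with the given count, blocklength and stride:
   'count' blocks of 'blocklength' consecutive ints, where consecutive blocks
   start 'stride' elements apart. Returns the number of elements copied. */
size_t gather_vector_layout(const int *base, int count, int blocklength,
                            int stride, int *out)
{
    assert(base != NULL && out != NULL && stride >= 0);
    size_t copied = 0;
    for (int block = 0; block < count; block++) {
        const int *start = base + (size_t)block * (size_t)stride;
        for (int k = 0; k < blocklength; k++)
            out[copied++] = start[k];
    }
    return copied;
}
```

Called as gather_vector_layout(&matrix[0][0], N, 1, N + 1, diag), it visits the element offsets 0, N+1, 2(N+1), ..., that is, matrix[i][i] for i = 0, ..., N-1. With MPI itself, the sender would pass &matrix[0][0] with a count of 1 and the committed type diagonal to the send operation, and the receiver could post a receive for N elements of type MPI_INT.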
   If all data of an MPI program is specified by MPI types it will support data transfer
between processes on computers with different memory organization and different
interpretations of elementary data items, e.g., in heterogeneous platforms. The par-
allel programs, designed with MPI data types, can be easily ported even between
The MPI standard assumes a reliable and error-free underlying communication plat-
form; therefore, it does not provide mechanisms for dealing with failures in the
communication system. For example, a message sent is always received correctly,
and the user need not check for transmission errors, time-outs, or similar. Simi-
larly, MPI does not provide mechanisms for handling processor failures. A program
error can follow an MPI operation call with incorrect arguments, e.g., non-existing
destination in a send operation, exceeding available system resources, or similar.
   Most MPI operation calls return an error code that indicates the completion
status of the operation. Before the error value is returned, the current MPI error
handler is called, which, by default, aborts all MPI processes. However, MPI provides
mechanisms for users to change this default and to handle recoverable errors. One
can specify that no MPI error is fatal and handle the returned error codes with
custom error-handling routines.
In order to test the presented theory, we first need to install the necessary software
that will make our computer ready for running and testing MPI programs.
In Appendix A of this book, readers will find short instructions for the installation of
free MPI supporting software for Linux, macOS, or MS Windows-powered
computers. Besides a compiler for the selected programming language, an
implementation of the MPI standard is needed, together with a method for running
MPI programs. Please refer to the instructions in Appendix A and run your first
“Hello World” MPI program. Then you can proceed here in order to find some further
hints for running and testing simple MPI programs, either on a single multi-core
computer or on a set of interconnected computers.
Any MPI library will provide you with the mpiexec (or mpirun) program that
can launch one or more MPI applications on a single computer or on a set of
interconnected computers (hosts). The program has many options that are
standardized to some extent, but one is advised to check the actual program options
with mpiexec -help. The most common options are -n <num_processes>, -host,
and -machinefile. For example, the following command starts three processes on the
local computer:
$ mpiexec -n 3 MyMPIprogram
MPI will automatically distribute the processes among the available cores, which
can be specified by the option -cores <num_cores_per_host>. Alternatively, the
program can be launched on two interconnected computers, on each with four
processes, with:
Single Computer
The configuration file can be used for a specification of processes on a single computer
or on a set of interconnected computers. For each host, the number of processes to
be used on that host can be defined by a number that follows a computer name. For
example, on a computer with a single core, the following configuration file defines
four processes per computing core:
localhost 4
   If your computer has, for example, four computing cores, MPI processes will
be distributed among the cores automatically, or in a way specified by the user in
the MPI configuration file, which supports, in this case, the execution of the MPI
parallel program on a single computer. The configuration file could have the following
structure:
        localhost
        localhost
        localhost
        localhost
specifying that a single process will run on each computing core if mpiexec option
-n 4 is used, or two processes will run on each computing core if -n 8 is used, etc.
Note that there are further alternative options for configuring MPI processes that are
usually described in more detail in -help options of a specific MPI implementation.
   Your computer is now ready for coding and testing the more useful MPI programs
that will be discussed in the following sections. Before that, some further hints are
given for the execution of MPI programs on a set of interconnected computers.
Interconnected Computers
If you are testing your program on a computer network, you may select several
computers to run the defined processes and test your code. The configuration
file must be edited so that all cooperating computers are listed. Suppose that
four computers will cooperate, each with two computing cores. The configuration
file myhostsfile should contain the names or IP addresses of these computers, e.g.:
   computer_name1
   computer_name2
   192.20.301.77
   computer_name4
each on a separate line, with the first entry being the name of the local
host, i.e., the computer from which the MPI program will be started by mpiexec.
   Let us execute our MPI program MyMPIprogram on a set of computers in a
network, e.g., connected by Ethernet. Editing, compiling, and linking
are the same as in the case of a single computer. However, the MPI executable should
be available to all computers, e.g., by manually copying the MPI executable to
the same path on all computers or, more systematically, through a shared disk.
   On MS Windows, a service for managing the local MPI processes, e.g., the smpd
daemon, should be started with smpd -d on all cooperating computers before launching
MPI programs. The cooperating computers should have the same version of the MPI
library installed, and the compiled MPI executable should be compatible with the
computing platforms (32 or 64 bits) on all computers. The command from the master
host:
$ mpiexec -machinefile myhostsfile \\MasterHost\share\MyMPIprog
runs the program on a set of processes, possibly located on different
computers, as specified in the configuration file myhostsfile.
   Note also that the user must be granted rights for executing
programs on the selected computers. One needs a basic user account and
access to the MPI executable, which must be located on the same path on all
computers. In Linux, this can be accomplished automatically by placing the
executable in the /home/username/ directory. Finally, a method that allows automatic
login between the cooperating computers is needed, e.g., in Linux, SSH login
without a password.
4.4 Basic MPI Operations
Let us recall the presented issues in a more systematic way by a brief description of
four basic MPI operations. Two trivial operations without MPI arguments initiate
and shut down the MPI environment. The next two operations answer the questions
“How many processes will cooperate?” and “Which is my ID among them?” Note
that all four operations are called from all processes of the current communicator.
4.4.1 MPI_INIT (int *argc, char ***argv)
The operation initiates the MPI library and environment. The arguments argc and
argv are required in the C language binding only, where they are parameters of the
main C program.
4.4.2 MPI_FINALIZE ()
The operation shuts down the MPI environment. No MPI routine can be called before
MPI_INIT or after MPI_FINALIZE, with one exception MPI_INITIALIZED
(flag), which queries if MPI_INIT has been called.
4.4.3 MPI_COMM_SIZE (comm, size)
The operation determines the number of processes in the current communicator. The
input argument comm is the handle of the communicator; the output argument size
returned by the operation MPI_COMM_SIZE is the number of processes in the group
of comm. If comm is MPI_COMM_WORLD, then it represents the number of all active
MPI processes.
4.4.4 MPI_COMM_RANK (comm, rank)
The operation determines the identifier of the current process within a communicator.
The input argument comm is the handle of the communicator; the output argument
rank is the ID of the process within comm, which is in the range from 0 to size-1.
We know from previous chapters that a traditional process is associated with a pri-
vate program counter of its private address space. Processes may have multiple
program threads, associated with separate program counters, which share a single
process’ address space. The message passing model formalizes the communication
between processes that have separate address spaces. The process-to-process com-
munication has to implement two essential tasks: data movement and synchroniza-
tion of processes; therefore, it requires cooperation of sender and receiver processes.
Consequently, every send operation expects a pairing/matching receive operation.
The cooperation is not always apparent in the program, which may hinder the under-
standing of the MPI code.
   A schematic presentation of communication between the sender Process_0 and
the receiver Process_1 is shown in Fig. 4.1. In this case, optional intermediate
message buffers are used in order to enable the sender Process_0 to continue
immediately after it initiates the send operation. However, Process_0 will have to
wait for the return from the previous call before it can send a new message. On the
receiver side, Process_1 can do some useful work instead of idling while waiting for
the matching message. It is the communication system that must ensure that
the message is reliably transferred between both processes. If the processes
have been created on a single computer, the actual communication will probably be
implemented through shared memory. If the processes reside on two distant
computers, the actual communication might be performed through an existing
interconnection network using, e.g., the TCP/IP communication protocol.
   Although blocking send/receive operations provide a simple way of synchronizing
processes, they can introduce unnecessary delays in cases where the sender
and the receiver do not reach the communication point at the same real time. For
example, if Process_0 issues a send call significantly before the matching receive
call in Process_1, Process_0 will have to wait for the actual message data transfer.
In the same way, idling can occur if a process that produces many messages
is much faster than the consumer process. Message buffering may alleviate the idling
to some extent, but if the amount of data exceeds the capacity of the message buffer,
which can always happen, Process_0 will be blocked again.
   The next concern with blocking communication is deadlock. For example, if
Process_0 and Process_1 initiate their send calls at the same time and the messages
cannot be buffered, both will be blocked forever, each waiting for a matching
receive call. Fortunately, there are several ways of alleviating such situations,
which will be described in more detail near the end of Sect. 4.7.
[Figure: timeline of a message transfer. Process_0 asks “May I Send?”, Process_1
answers “Yes, go!”, and the message data travels as sub-messages data(1), data(2),
..., data(n) from the Process_0 buffer through the communication system to the
Process_1 buffer, while Process_0 waits on the receive-complete confirmation and
Process_1 waits on the expected matching message.]
Fig. 4.1 Communication between two processes awakes both of them while transferring data from
sender Process_0 to receiver Process_1, possibly with a set of shorter sub-messages
   Before an actual process-to-process transfer of data happens, several issues have
to be specified, e.g., how the message data will be described, how the processes will be
identified, how the receiver recognizes/screens messages, and when the operations
will complete. The MPI_SEND and MPI_RECV operations take care of
these issues.
The operation, invoked by a blocking call MPI_SEND in the sender process source,
will not complete until there is a matching MPI_RECV in the receiver process dest,
identified by a corresponding rank. The matching MPI_RECV will empty the send
buffer buf of the MPI_SEND. The MPI_SEND will return when the message
data has been delivered to the communication system and the send buffer buf of the
sender process source can be reused. The send buffer is specified by the following
arguments: buf - pointer to the send buffer, count - number of data items, and
datatype - type of data items. The receiver process is addressed by an envelope
that consists of the argument dest, which is the rank of the receiver process within all
processes in the communicator comm, and of a message tag.
   The message tags provide a mechanism for distinguishing between different messages
for the same receiver process, identified by the destination rank. The tag is
an integer in the range [0, UB], where UB is implementation dependent and can be found by
querying the predefined attribute key MPI_TAG_UB, defined in mpi.h. When a sender process has to send
several separate messages to a receiver process, the sender process will distinguish
them by using tags, which will allow the receiver process to efficiently screen its
messages.
4.5 Process-to-Process Communication                                              101
This operation waits until the communication system delivers a message with
matching datatype, source, tag, and comm. Messages are screened on the
receiving side based on a specific source, which is the rank of the sender process
within the communicator comm, or not screened on source at all by equating it
with MPI_ANY_SOURCE. The same screening is performed with tag or, if screening
on tag is not necessary, by using MPI_ANY_TAG instead. After the return from
MPI_RECV, the output buffer buf contains the received message data and can be used by the program.
    The number of received data items of datatype must be equal to or smaller than
count, which must be positive or zero. Receiving more data items results
in an error. In such cases, the output argument status contains further information
about the error. The arguments datatype, source, tag, and comm must match between
the sender process and the receiver process, and the receiver's count must be large
enough for the message, to initiate actual message passing. When a message, posted by a sender process, has been
collected by a receiver process, the message is said to be completed, and the program
flows of the receiver and the sender processes may continue.
    Most implementations of the MPI libraries copy the message data out of the user
buffer, which was specified in the MPI program, into some other intermediate system
or network buffer. When the user buffer can be reused by the application, the call to
MPI_SEND will return. This may happen before the matching MPI_RECV is called
or it may not, depending on the message data length.
        $ mpiexec -n 2 MPImessage
        MPI process 0 started...
        MPI process 1 started...
        Message of length 128 send to process 1.
        Message of length 128 returned to process 0.
        Message of length 256 send to process 1.
        Message of length 256 returned to process 0.
        Message of length 512 send to process 1.
        Message of length 512 returned to process 0.
        Message of length 1024 send to process 1.
        Message of length 1024 returned to process 0.
        Message of length 2048 send to process 1.
        Message of length 2048 returned to process 0.
        Message of length 4096 returned to process 0.
        Message of length 4096 send to process 1.
        Message of length 8192 returned to process 0.
   The program blocks at the message length 65536, which is related to
the capacity of the MPI data buffer in the actual MPI implementation. When the
message exceeds it, MPI_Send blocks in both processes and they enter a deadlock. If we
just change the order of MPI_Send and MPI_Recv, by commenting out lines 36–37 and
uncommenting lines 31–32 in the process with rank = 1, all expected messages up to the
length 16777216 are transferred correctly. Some further discussion about the reasons
for such a behavior will be provided later, in Sect. 4.7.
The MPI standard specifies several additional operations for message transfer that are
a combination of basic MPI operations. They are useful for writing more compact
programs. For example, the operation MPI_SENDRECV combines the sending of a message
to the destination process dest and the receiving of another message from the process
source in a single call, made in both the sender and the receiver process, but with two distinct
message buffers: sendbuf, which acts as an input, and recvbuf, which is an
output. Note that the buffers' sizes and data types can be different.
   The send-receive operation is particularly effective for executing a shift operation
across a chain of processes. If blocking MPI_SEND and MPI_RECV are used, then
one needs to order them correctly, for example, even processes send and then receive, while
odd processes receive first and then send, so as to prevent cyclic dependencies that may
lead to deadlocks. By using MPI_SENDRECV, the communication subsystem will
manage these issues on its own.
   There are further advanced communication operations that are a composition
of basic MPI operations. For example, MPI_SENDRECV_REPLACE (buf,
count, datatype, dest, sendtag, source, recvtag, comm,
status) operation implements the functionality of MPI_SENDRECV, but uses
only a single message buffer. The operation is therefore useful in cases with send
and receive messages of the same length and of the same data type.
      MPI_INIT,
      MPI_FINALIZE,
      MPI_COMM_SIZE,
      MPI_COMM_RANK,
      MPI_SEND,
      MPI_RECV,
      MPI_WTIME.
The elapsed time (wall clock) between two points in an MPI program can be measured
by using operation MPI_WTIME (). Its use is self-explanatory through a short
segment of an MPI program example:
   We are now ready to write a simple example of a useful MPI program that will
measure the speed of the communication channel between two processes. The program
is presented, in more detail, in the next example.
[Figure: Process_0 sends messages of increasing length from its buffer buf with MPI_SEND(buf...), and Process_1 receives them into its buffer buf with MPI_RECV (buf...).]
     spent on setting up the software and hardware of the message communication chan-
     nel, i.e., on the start-up time ts . On the other hand, with long messages, the data
     transfer time will dominate; hence, we can expect that the communication bandwidth
     will approach the theoretical value of the communication channel. Therefore,
     the length of messages will vary from just a few data items to very long messages.
     The test will be repeated nloop times, with shorter messages, in order to get more
     reliable average results.
        Considering the above methodology, an example of an MPI program, MSMPIbw.cpp,
     for measuring the communication bandwidth, is given in Listing 4.4. We have
     again a single program but slightly different codes for the sender and the receiver
     process. The essential part, message passing, starts in the sender process with a call to
     MPI_Send, which will be matched in the receiver process by a call to corresponding
     MPI_Recv.
          }
          for (k = 0; k < NUMBER_OF_TESTS; k++) {
            if (rank == 0) {
                t1 = MPI_Wtime();
                for (j = 0; j < nloop; j++) { // send message nloop times
                    MPI_Send(buf, n, MPI_DOUBLE, 1, k, MPI_COMM_WORLD);
                }
                t2 = (MPI_Wtime() - t1) / nloop;
            }
            else if (rank == 1) {
                for (j = 0; j < nloop; j++) { // receive message nloop times
                    MPI_Recv(buf, n, MPI_DOUBLE, 0, k, MPI_COMM_WORLD, &status);
                }
            }
          }
          if (rank == 0) { // calculate bandwidth
            double bandwidth;
            bandwidth = n * sizeof(double) * 1.0e-6 * 8 / t2; // in Mb/sec
            printf("\t%10d\t%10.8f\t%8.2f\n", n, t2, bandwidth);
          }
          free(buf);
       }
       MPI_Finalize();
       return 0;
   }
        The output of the MPI program from Listing 4.4, which has been executed on two
     processes, each running on one of two computer cores that communicate through
     the shared memory, is shown in Fig. 4.3a with a screenshot of rank = 0 process
     user terminal, and in Fig. 4.3b with a corresponding bandwidth graph. The results
     confirmed our expectations. The bandwidth is poor with short messages and reaches
     the whole capacity of the memory access with longer messages.
        If we assume that with very short messages, the majority of time is spent on the
     communication setup, we can read from Fig. 4.3a (first line of data) that the setup time
     was 0.18 µs. The setup time starts increasing when the messages become longer than
     16 doubles. A reason could be that up to this length the processes communicate through
     the fastest cache memory. Then the bandwidth increases until the message length of 512
     doubles. A reason for a drop at this length could be cache memory incoherences.
     The bandwidth then converges to 43 Gb/s, which could be a limit of cache memory
     access. If message lengths are increased above 524 thousand doubles, the bandwidth
     drops and stabilizes at around 17 Gb/s, possibly because of a
     limit in shared memory access. Note that the above figures are strongly related to a
     specific computer architecture and may therefore differ significantly among different
     computers.
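The relation between start-up time, message length, and measured bandwidth can be made concrete with a small calculation. The sketch below assumes a simple linear cost model (start-up time plus length divided by peak bandwidth), which is an idealization and does not reproduce the cache effects discussed above. The function names are assumptions; the bandwidth formula is the one used in the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Bandwidth in Mb/s for one message of n doubles that took
 * `seconds` to transfer (same formula as in the listing). */
double bandwidth_mbps(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return (double)n * sizeof(double) * 8.0 * 1.0e-6 / seconds;
}

/* Assumed linear cost model: start-up time ts (in seconds) plus the
 * pure transfer time at the peak bandwidth peak_mbps (in Mb/s). */
double model_time(size_t n, double ts, double peak_mbps)
{
    assert(ts >= 0.0 && peak_mbps > 0.0);
    return ts + (double)n * sizeof(double) * 8.0 / (peak_mbps * 1.0e6);
}
```

With the values quoted above, ts = 0.18 µs and a peak of 43 Gb/s (43000 Mb/s), the model gives roughly 350 Mb/s for a single double, because the start-up time dominates, and a value just below 43000 Mb/s for a million doubles, where the transfer time dominates.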
        You are encouraged to run the same program on your computer and compare the
     obtained results with the results from Fig. 4.3. You may also run the same program
     on two interconnected computers, e.g., by Ethernet or Wi-Fi, and try to explain the
     obtained differences in results, taking into account a limited speed of your connection.
     Note that the maximum message lengths n could be made shorter in the case of slower
     communication channels.
4.6 Collective MPI Communication                                                             107
[Figure 4.3: (a) screenshot of the terminal output; (b) log-log plot of bandwidth [Mb/s] versus message length [#doubles].]
Fig. 4.3 The bandwidth of a communication channel between two processes on a single computer
that communicate through shared memory. a) Message length, communication time, and bandwidth,
all in numbers; b) corresponding graph of the communication bandwidth
The communication operations, described in the previous sections, are called from
a single process, identified by a rank, which has to be explicitly expressed in the
MPI program, e.g., by a statement if(my_id == rank). The MPI collective
operations are called by all processes in a communicator. Typical tasks that can be
elegantly implemented in this way are as follows: global synchronization, reception
of a local data item from all cooperating processes in the communicator, and many
others, some of which are described in this section.
[Figure 4.4: all processes call MPI_BCAST; the data D0 from inbuf of the root process appears in outbuf of every process.]
Fig. 4.4 Root process broadcasts the data from its input buffer in the output buffers of all processes
Fig. 4.5 Root process gathers the data from input buffers of all processes in its output buffer
This operation works inversely to MPI_GATHER, i.e., it scatters data from inbuf of
process root to outbuf of all processes, including itself. Note that the
count outcnt and type outtype of the data in each of the receiver processes are
the same, so the data is scattered into equal segments.
    A schematic presentation of data relocation after the call to MPI_SCATTER is
shown in Fig. 4.6 for the case of three processes, where process with rank = 0
is the root process. Note again that all processes have to call MPI_SCATTER to
complete the requested data relocation.
    There are also more complex collective operations, e.g., MPI_GATHERV and
MPI_SCATTERV that allow a varying count of process data from each process and
permit some options for process data placement on the root process. Such extensions
are possible by changing the incnt and outcnt arguments from a single integer to
an array of integers, and by providing a new array argument displs for specifying
the displacement relative to root buffers at which to place the processes’ data.
[Figure 4.6: all processes call MPI_SCATTER; the segments D0, D1, D2 from inbuf of the root process are distributed to the output buffers of the three processes.]
Fig. 4.6 Root process scatters the data from its input buffer to output buffers of all processes
Instead of just relocating data between processes, MPI provides a set of operations
that perform several simple manipulations on the transferred data. These operations
represent a combination of collective communication and computational manipulation
in a single call and therefore simplify MPI programs.
   Collective MPI operations for data manipulation are based on the data reduction
paradigm, which involves reducing a set of numbers into a smaller set of numbers via
a data manipulation. For example, three pairs of numbers: {5, 1}, {3, 2}, {7, 6},
each representing the local data of a process, can be reduced to a pair of maximum
numbers, i.e., {7, 6}, or to a pair of per-element sums, i.e., {15, 9}, and in the same
way for other reduction operations defined by MPI:
Fig. 4.7 Root process collects the data from input buffers of all processes, performs per-element
MPI_SUM manipulation, and saves the result in its output buffer
     [Figure 4.8: two panels over the interval [0, 1]: (a) "Area of subdomains"; (b) "rank=0 (light gray) and rank=1 (dark gray)".]
     Fig. 4.8 a) Discretization of interval [0, 1] in 10 subintervals for numerical integration of quarter
     circle area; b) decomposition for two parallel processes: light gray subintervals are sub-domain of
     rank 0 process; dark gray subintervals of rank 1 process
        A simple case for two processes and ten intervals is shown in Fig. 4.8b. Five
     subintervals {1,3,5,7,9}, marked in light gray, are integrated by the rank 0 process and
     the other five subintervals {2,4,6,8,10}, marked in dark gray, are integrated by the rank 1
     process.
        An example of an MPI program that implements parallel computation of π, for
     an arbitrary p and N, in the C programming language, is given in Listing 4.5:
        Let us open a Terminal window and a Task Manager window (see Fig. 4.9), where
     we see that the computer used has four cores and eight logical processors, and that a
     background task utilizes 17% of the CPU. After running the compiled program for the calculation
     of π on a single process and with 10^9 intervals, the execution time is about 31.4 s and
     the CPU utilization increases to 30%. In the case of four processes, the execution time
     drops to 7.9 s and utilization increases to 70%. Running the program on 8 processes
     gives a further speedup, with the execution time of 5.1 s and CPU utilization of 100%,
     because all computational resources of the computer are now fully utilized. From the
     prints in the Terminal window, it is evident that the number π was calculated with
     similar accuracy in all cases. With our simple MPI program, we achieved a speedup
     a bit higher than 6 (31.4 s / 5.1 s ≈ 6.2), which is excellent!
     Fig. 4.9 Screenshots of Terminal window and Task Manager indicating timing of the program for
     calculation of π and the computer utilization history
           MPI_INIT, MPI_FINALIZE,
           MPI_COMM_SIZE, MPI_COMM_RANK,
           MPI_SEND, MPI_RECV,
           MPI_BARRIER,
           MPI_BCAST, MPI_GATHER, MPI_SCATTER,
           MPI_REDUCE, MPI_ALLREDUCE,
           MPI_WTIME, MPI_STATUS,
           MPI_INITIALIZED.
semantics: a communication does not complete at either end before both processes
rendezvous at the communication. A send executed in this mode is nonlocal, because
its completion requires the cooperation of the sender and receiver processes.
    The ready mode send may be started only if the matching receive has already been
called. Otherwise, the operation is erroneous and its outcome is undefined. On some
systems, this allows the removal of a handshake operation that is otherwise required,
which could result in improved performance. In a correct program, a ready send
can be replaced by a standard send with no effect on the program results, only
with a possible difference in performance.
    The receive call MPI_RECV is always blocking, because it returns only after the
receive buffer contains the expected received message.
Non-blocking Communication
Non-blocking send start calls are denoted by a leading letter I in the name of the MPI
operation. They can use the same four modes as blocking sends: standard, buffered,
synchronous, and ready, i.e., MPI_ISEND, MPI_IBSEND, MPI_ISSEND, and MPI_IRSEND.
Sends of all modes, except ready, can be started whether a matching receive
has been posted or not; a non-blocking ready send can be started only if a matching
receive is posted. In all cases, the non-blocking send start call is local, i.e., it returns
immediately, irrespective of the status of other processes. Non-blocking communication
calls immediately return request handles that can be waited on, or queried, by
specialized MPI operations that wait or test for their completion.
   The syntax of the non-blocking MPI operations is the same as in the standard
communication mode, e.g.:
MPI_ISEND (buf, count, datatype, dest, tag, comm,
request), or
MPI_IRECV (buf, count, datatype, source, tag, comm,
request),
except for an additional request handle that is used for later querying by send-complete
calls, e.g.:
MPI_WAIT (request, status), or
MPI_TEST (request, flag, status).
    A non-blocking standard send call MPI_ISEND initiates the send operation, but
does not complete it, in the sense that it will return before the message is copied out
of the send buffer. A later separate call is needed to complete the communication,
i.e., to verify that the data has been copied out of the send buffer. In the meantime,
a computation can run concurrently. In the same way, a non-blocking receive call
MPI_IRECV initiates the receive operation, but does not complete it. The call will
return before a message is stored into the receive buffer. A later separate call is needed
to verify that the data has been received into the receive buffer. While querying about
the reception of the complete message, a computation can run concurrently.
    We can expect that a non-blocking send MPI_ISEND immediately followed
by send-complete call MPI_WAIT is functionally equivalent to a blocking send
MPI_SEND. One can wait on multiple requests, e.g., in a master/slave MPI pro-
gram, where the master waits either for all or for some slaves’ messages, using MPI
operations:
4.7 Communication and Computation Overlap                                          117
We know from previous sections that after a call to a receive operation, e.g., MPI_RECV,
the process will wait patiently until a matching MPI_SEND is posted. If the
matching send is never posted, the receive operation will wait forever in a deadlock.
In practice, the program will become unresponsive until some time limit is exceeded,
or the operating system will report a crash. The above situation can appear if two
MPI_RECV calls are issued at approximately the same time, on two different processes
that mutually expect a matching send, and are waiting for matching messages that
will never be delivered. Such a situation is shown below with a segment from an MPI
program, in the C language, for the processes with rank = 0 and rank = 1, respectively:
   if (rank == 0) {
     MPI_Recv (rec_buf, count, MPI_BYTE, 1, tag, comm, &status);
     MPI_Send (send_buf, count, MPI_BYTE, 1, tag, comm);
   }
   if (rank == 1) {
     MPI_Recv (rec_buf, count, MPI_BYTE, 0, tag, comm, &status);
     MPI_Send (send_buf, count, MPI_BYTE, 0, tag, comm);
   }
   In the same way, if two blocking MPI_SENDs are issued at approximately the
same time, e.g., on the processes with rank = 0 and rank = 1, each followed
by a matching MPI_RECV, they will never finish if the MPI_SENDs are implemented
without buffers. Even if message buffering is implemented,
it will usually suffice only for shorter messages. With longer messages, a deadlock
can be expected when the buffer space is exhausted, which was already
demonstrated in Listing 4.3.
   The above situations are called “unsafe” because they depend on the implementa-
tion of the MPI communication operations and on the availability of system buffers.
The portability of such unsafe programs may be limited.
   Several solutions are available that can make an unsafe program “correct”. The
simplest approach is to order the communication operations more carefully.
For example, in the given example, the process with rank = 0 can call MPI_SEND
first. Consequently, after exchanging the order of the two lines in the program segment
for the process with rank = 0:
   if (rank == 0) {
     MPI_Send (send_buf, count, MPI_BYTE, 1, tag, comm);
     MPI_Recv (rec_buf, count, MPI_BYTE, 1, tag, comm, &status);
   }
   if (rank == 1) {
     MPI_Recv (rec_buf, count, MPI_BYTE, 0, tag, comm, &status);
     MPI_Send (send_buf, count, MPI_BYTE, 0, tag, comm);
   }
send and receive operations are automatically matched and deadlocks are avoided in
both processes.
   ...
   MPI_Request requests[2];
   ...
   if (rank == 0) {
     MPI_Irecv (rec_buf,count,MPI_BYTE,1,tag,comm,&requests[0]);
     MPI_Isend(send_buf,count,MPI_BYTE,1,tag,comm,&requests[1]);
   }
   else if (rank == 1) {
     MPI_Irecv (rec_buf,count,MPI_BYTE,0,tag,comm,&requests[0]);
     MPI_Isend(send_buf,count,MPI_BYTE,0,tag,comm,&requests[1]);
   }
   MPI_Waitall (2, requests, MPI_STATUSES_IGNORE);
The call to MPI_IRECV is issued first, which provides a receive data buffer that is
ready for the message that will arrive. This approach avoids extra memory copies of
data buffers, avoids deadlock situations and could therefore speed up the program
execution.
   Finally, the buffered send MPI_BSEND can be used, with explicit allocation
of a separate send buffer by MPI_BUFFER_ATTACH; however, this approach
needs extra memory.
     communication and calculation tasks could overlap, which could result in a shorter
     execution time.
        One way to implement the above task is to start a master process that will receive
     messages from all slave processes, and then proceed with its calculation work. The
     slave processes will send their messages and then start to calculate. The program
     runs until all communication and calculation are done. A simple demonstration code
     of overlapping communication and calculation is given in Listing 4.6.
          else {            // slave processes
                  start_c = MPI_Wtime();
  #if 1
                  MPI_Isend(buff, MSGSIZE,                 // non-blocking send
                            MPI_DOUBLE, master, tag, MPI_COMM_WORLD, &request);
  #endif
  #if 0
                  MPI_Send(buff, MSGSIZE,                  // blocking send
                            MPI_DOUBLE, master, tag, MPI_COMM_WORLD);
  #endif
                  comm_t = MPI_Wtime() - start_c;
                  start_w = MPI_Wtime();
                  work_r = other_work(p);
                  work_t = MPI_Wtime() - start_w;
                  MPI_Wait(&request, &status);            // block until Isend is done
          }
          run_time = MPI_Wtime() - start_p;
          printf("Rank\tComm[s]\tCalc[s]\tTotal[s]\tWork_result\n");
          printf("%d\t%e\t%e\t%e\t%e\t\n", myid, comm_t, work_t, run_time, work_r);
          fflush(stdout);                // to correctly finish all prints
          free(buff);
          MPI_Finalize();
  }
        The program from Listing 4.6 has to be executed with at least two processes: one
     master and one or more slaves. The non-blocking MPI_Isend call, in all processes,
     returns immediately to the next program statement without waiting for the commu-
     nication task to complete. This enables other_work to proceed without delay.
     Such a usage of non-blocking send (or receive), to avoid processor idling, has the
     effect of “latency hiding”, where MPI latency is the elapsed time for an operation,
     e.g., MPI_Isend, to complete. Note that we have used MPI_ANY_SOURCE in the
     master process to specify the message source. This enables an arbitrary arrival order of
     messages, instead of a predefined sequence of processes, which can further speed up
     the program execution.
        The output of this program should be as follows:
        $ mpiexec -n 2 MPIhiding
        Rank Comm[s]        Calc[s]                                  Total[s]                 Work_result
         1    2.210910e-04 1.776894e+00                              2.340671e+00             6.109991e-01
        Rank Comm[s]        Calc[s]                                  Total[s]                 Work_result
         0    1.692562e-05 1.747064e+00                              2.340667e+00             6.109991e-01
        Note that the total execution time is longer than the calculation time. The communication
     time is negligible, even though we have sent 100 million doubles. Please
     use blocking MPI communication, compare the execution time, and explain the differences.
     Please experiment with different numbers of processes, different message
     lengths, and different amounts of calculation, and explain the behavior of the execution
     time.
        The output of this program depends on the number of cooperating processes. For
     the case of 3 processes it could be as follows:
          $ mpiexec -n 3 MPIfairness
          Msg from 1 with tag 0
          Msg from 1 with tag 1
          Msg from 1 with tag 2
          Msg from 1 with tag 3
          Msg from 1 with tag 4
          Msg from 1 with tag 5
          Msg from 1 with tag 6
          Msg from 1 with tag 7
          Msg from 1 with tag 8
          Msg from 1 with tag 9
          Msg from 2 with tag 0
          Msg from 2 with tag 1
          Msg from 2 with tag 2
          Msg from 2 with tag 3
          Msg from 2 with tag 4
          Msg from 2 with tag 5
          Msg from 2 with tag 6
          Msg from 2 with tag 7
          Msg from 2 with tag 8
          Msg from 2 with tag 9
     We see that all messages from the process with rank 1 have been received first,
     even though the process with rank 2 has also attempted to send its messages, so the
     communication was unfair. The order of received messages, identified by tags, is the
     same as the order of sent messages, so the communication was non-overtaking.
     All communication operations introduced in previous sections have used the default
     communicator MPI_COMM_WORLD, which incorporates all processes involved and
     defines a default context. More complex parallel programs usually need more process
     groups and contexts to implement various forms of sequential or parallel decompo-
     sition of a program. Also, the cooperation of different software developer groups is
     much easier if they develop their software modules in distinct contexts. The MPI
library supports modular programming via its communicator mechanism that pro-
vides the “information hiding” and “local name space”, which are both needed in
modular programs.
   We know from previous sections that any MPI communication operation specifies
a communicator, which identifies a process group that can be engaged in the
communication and a context (tagging space) in which the communication occurs.
Different communicators can encapsulate the same or different process groups but always
with different contexts. The message context can be implemented as an extended
tag field, which makes it possible to distinguish between messages from different
contexts. A communication operation can receive a message only if it was sent in the
same context; therefore, MPI processes that run in different contexts cannot be
disturbed by unwanted messages.
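The matching rule described above can be sketched in plain C. This is only a model of the idea, not MPI code: the envelope structure and the function match_receive are hypothetical names introduced for illustration, and a real MPI library manages contexts internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical message envelope (an assumption for illustration,
   not an MPI data structure). */
typedef struct {
    int source;   /* rank of the sending process */
    int tag;      /* user-supplied message tag */
    int context;  /* context of the communicator used for sending */
} envelope;

/* Returns the index of the first pending message that matches the posted
   receive, or -1 if no pending message matches. A message sent in another
   context never matches, even if source and tag are equal. */
int match_receive(const envelope *pending, size_t n,
                  int source, int tag, int context)
{
    assert(pending != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (pending[i].context == context &&
            pending[i].source == source &&
            pending[i].tag == tag)
            return (int)i;
    }
    return -1;
}
```

In this model, two messages with the same source and tag but different contexts are kept apart: a receive posted in context 1 skips a pending message that was sent in context 2.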
   For example, in master–slave parallelization, the master process manages the tasks
for slave processes. To distinguish between master and slave tasks, statements like
if(rank==master) and if(rank>master) for ranks in the default communicator
MPI_COMM_WORLD can be used. Alternatively, the processes of the default
communicator can be split into two new sub-communicators, each with a different
group of processes. The first group of processes, possibly with a single process,
performs master tasks, and the second group of processes, possibly with a larger
number of processes, executes slave tasks. Note that both sub-communicators are
encapsulated into a new communicator, while the default communicator still exists.
A collective communication is now possible in the default communicator or in the
new communicator.
   In a further example, a sequentially decomposed parallel program is schematically
shown in Fig. 4.10. Each of the three vertical lines with blocks represents a single
process of the parallel program, i.e., P_0, P_1, and P_2. All three processes form
a single process group. The processes are decomposed in consecutive sequential
program modules shown with blocks. Process-to-process communication calls are
shown with arrows. In Fig. 4.10a, all processes and their program modules run in
the same context, while in Fig. 4.10b, program modules, encircled by dashed curves,
run in two different contexts that were obtained by a duplication of the default
communicator.
   Figure 4.10a shows that MPI processes P_0 and P_2 have finished sooner than
P_1. Dashed arrows denote messages that have been generated during subsequent
computation in P_0 and P_2. These messages could be accepted by the sequential
program module P1_mod_1 of MPI process P_1, which may NOT be correct.
A solution to this problem is shown in Fig. 4.10b. The program modules run here in two
different contexts, New_comm(1) and New_comm(2). The early messages will now be
accepted correctly by MPI receive operations in program module P1_mod_2
from communicator New_comm(2), which uses a distinct tag space that will
correctly match the problematic messages.
[Figure 4.10: panels (a) and (b) each show processes P_0, P_1, and P_2 with program modules P0_mod_1, P1_mod_1, P2_mod_1 followed by P0_mod_2, P1_mod_2, P2_mod_2; in panel (b), MPI_COMM_DUP creates the contexts New_comm(1) and New_comm(2).]
Fig. 4.10 Sequentially decomposed parallel program that runs on three processes. a) processes run
in the same context; b) processes run in two different contexts
  The MPI standard specifies several operations that support modular programming.
Two basic operations implement duplication or splitting of an existing communicator
comm.
MPI_COMM_DUP (comm, new_comm)
is executed by each process from the parent communicator comm. It creates a
new communicator new_comm comprising the same process group but a new context.
This mechanism supports sequential composition of MPI programs, as shown
in Fig. 4.10, by separating communication that is performed for different purposes.
Since all MPI communication is performed within a specified communicator,
MPI_COMM_DUP provides an effective way to create a new user-specified
communicator, e.g., for use by a specific program module or by a library, so that
messages belonging to different modules cannot interfere with each other.
MPI_COMM_SPLIT (comm, color, key, new_comm)
creates a new communicator new_comm from the initial communicator comm,
comprising disjoint subgroups of processes with optional reordering of their ranks. Each
subgroup contains all processes of the same color, which is a nonnegative integer
argument. The color can also be MPI_UNDEFINED; in this case, the calling process
will not be included in any of the new communicators. Within each subgroup, the
processes are ranked in the order defined by the value of the corresponding argument
key, i.e., a lower value of key implies a lower value of rank, while equal process keys
preserve the original order of ranks. A new sub-communicator is created for each
subgroup and returned in new_comm.
     Rank of process in MPI_COMM_WORLD:   P_0  P_1  P_2  P_3  P_4  P_5  P_6  P_7
     color = rank%2:                      c=0  c=1  c=0  c=1  c=0  c=1  c=0  c=1
     Process groups of the two new_comm:
       color 0:   P_0  P_2  P_4  P_6   with new ranks   r_0  r_1  r_2  r_3
       color 1:   P_1  P_3  P_5  P_7   with new ranks   r_0  r_1  r_2  r_3
     Fig. 4.11 Visualization of splitting the default communicator with eight processes into two
     sub-communicators with disjoint sets of four processes
     Consequently, we get two process groups with four processes per group. Note that
     the new process ranks in both groups are equal to {0 1 2 3}, because key = 0
     in all processes, and consequently, the original order of ranks remains the same. As
     an additional test, the master process calculates the sum of the processes’ ranks in
     each new group of the new communicator, using the MPI_REDUCE operation. In this
     simple example, the sum of ranks in both groups should be equal to 0 + 1 + 2 + 3 = 6.
     Listing 4.8 Splitting a default communicator in two process groups of a new communicator. First
     and second process groups include, respectively, processes with even and odd ranks from the default
     communicator.
   The output of the compiled program from Listing 4.8, after running it on eight
processes, should be similar to:
   $ mpiexec -n 8 MSMPIsplitt
   ’MPI_COMM_WORLD’ process rank/size      4/8   has   rank/size   2/4   in   ’new_comm’
   ’MPI_COMM_WORLD’ process rank/size      6/8   has   rank/size   3/4   in   ’new_comm’
   ’MPI_COMM_WORLD’ process rank/size      5/8   has   rank/size   2/4   in   ’new_comm’
   ’MPI_COMM_WORLD’ process rank/size      0/8   has   rank/size   0/4   in   ’new_comm’
   Sum of ranks in ’new_com’: 6
   ’MPI_COMM_WORLD’ process rank/size      7/8 has rank/size 3/4 in ’new_comm’
   ’MPI_COMM_WORLD’ process rank/size      1/8 has rank/size 0/4 in ’new_comm’
   Sum of ranks in ’new_com’: 6
   ’MPI_COMM_WORLD’ process rank/size      3/8 has rank/size 1/4 in ’new_comm’
   ’MPI_COMM_WORLD’ process rank/size      2/8 has rank/size 1/4 in ’new_comm’
   The above output confirms our expectations. We have two process groups in the
new communicator, each comprising four processes with ranks 0 to 3. Both sums of
ranks in process groups are 6, as expected.                                       
   As an exercise, suppose that we have seven processes in the default communicator
MPI_COMM_WORLD with initial ranks = {0 1 2 3 4 5 6}. Note that for this case,
the program should be executed with the mpiexec option -n 7. Let the color be
(rank >= 2) and the key be (rank <= 3), which results in process colors = {0 0 1
1 1 1 1} and keys = {1 1 1 1 0 0 0}. After a call to the MPI_COMM_SPLIT operation, two
process groups are created in new_comm, with two and five members, respectively.
Using the initial ranks to identify the processes, the processes in the new groups
are new_g1 = {0 1} and new_g2 = {2 3 4 5 6}.
   The new ranks of processes in both groups are determined according to the
corresponding values of keys. Aligning the initial rank and key, we see, for example,
that the process with initial rank = 0 is aligned with key = 1, the process with initial
rank = 4 is aligned with key = 0, etc. Now, the keys can be assigned to the process
groups as: key_g1 = {1 1} and key_g2 = {1 1 0 0 0}. Because smaller values of
keys correspond to smaller values of ranks, and because equal keys do not change
the original order of ranks, we get: rank_g1 = {0 1} and rank_g2 = {3 4 0 1 2}. For
example, the process with initial rank = 4 becomes a member of new_g2 with rank
= 0. The sums of ranks in the two groups of the new communicator are therefore 1 and
10, respectively. Feel free to adapt the MPI program MSMPIsplitt.cpp from
Listing 4.8 so that it implements the described example.
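The rank assignment in this exercise can be checked without an MPI installation. The following plain C sketch simulates only the ordering rule of MPI_COMM_SPLIT (lower key first, ties broken by the original rank); it is not MPI code, the function names split_ranks and rank_sum are assumptions introduced here, and the MPI_UNDEFINED color is deliberately not handled.

```c
#include <assert.h>

/* Simulates the rank assignment of MPI_COMM_SPLIT for n processes.
   color[i] and key[i] are the arguments supplied by the process with
   original rank i; new_rank[i] receives its rank within its subgroup. */
void split_ranks(int n, const int color[], const int key[], int new_rank[])
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int r = 0;
        for (int j = 0; j < n; j++) {
            if (color[j] != color[i])
                continue;                 /* j belongs to another subgroup */
            /* j precedes i if it has a smaller key, or an equal key
               and a smaller original rank */
            if (key[j] < key[i] || (key[j] == key[i] && j < i))
                r++;
        }
        new_rank[i] = r;
    }
}

/* Sum of the new ranks in the subgroup with color c
   (what a reduction with a sum operation would deliver). */
int rank_sum(int n, const int color[], const int new_rank[], int c)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (color[i] == c)
            sum += new_rank[i];
    return sum;
}
```

Calling split_ranks with n = 7, color[i] = (i >= 2), and key[i] = (i <= 3) reproduces rank_g1 = {0 1} and rank_g2 = {3 4 0 1 2}, and rank_sum returns 1 and 10 for colors 0 and 1.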
4.8 How Effective Are Your MPI Programs?
Already in simple MPI programs, one can analyze the speedup as a
function of the problem size and as a function of the number of cooperating processes.
   The parallelization of sequential problems can be guided by various methodologies
that provide the same quantitative results but differ in execution time or memory
requirements. Some parallelization approaches are better for a smaller number of
computing nodes and others for a larger number of nodes. We
are looking for an optimal solution that is simple, efficient, and scalable. A simple
parallelization methodology, proposed by Ian Foster in his famous book “Designing
and Building Parallel Programs”, is performed in four distinct stages: Partitioning,
Communication, Agglomeration, and Mapping (PCAM).
   In the first two stages, the sequential problem is decomposed into tasks that are
as small as possible, and the required communication among the tasks is identified. The
available parallel execution platform is ignored in these two stages, because the
aim is a maximal decomposition, with the final goal of improving the concurrency and
scalability of the discovered parallel algorithms.
   The third and fourth stages take into account the capabilities of the target parallel
computer. The identified fine-grained tasks have to be agglomerated to improve
performance and to reduce development costs. The last stage is devoted to the mapping
of tasks onto real computers, taking into account the locality of communication and the
balancing of the calculation load.
   The developed parallel program speedup and, consequently, its efficiency and
scalability depend mainly on the following three issues:
Test Questions
 1. True or false:
    (a) MPI is a message-passing library specification, not a language or compiler
    specification.
    (b) In the MPI model, processes communicate only by shared memory.
    (c) MPI is useful for an implementation of MIMD/SPMD parallelism.
    (d) A single MPI program is usually written so that it can run with a general
    number of processes.
    (e) It is necessary to specify explicitly which part of the MPI code will run with
    specific processes.
 2. True or false:
    (a) A group and context together form a communicator.
    (b) A default communicator MPI_COMM_WORLD contains in its group all initial
Mini Projects
P1. Implement an MPI program for a 2-D finite difference algorithm on a square
    domain with n × n = N points. Assume a 5-point stencil (the actual point and its
    four neighbors). Assume ghost boundary points in order to simplify the calculation
    in border points (all stencils, including boundary points, are equal). Compare
    the obtained results, after a specified number of iterations, on a single MPI
    process and on a parallel multi-core computer, e.g., with up to eight cores. Use the
    performance models for calculation and communication to explain your results.
    Plot the execution time as a function of the number of points N and as a function
    of the number of processes p for, e.g., 10^4 time steps.
P2. Use MPI point-to-point communication to implement the broadcast and reduce
    functions. Compare the performance of your implementation with that of the
    MPI global operations MPI_BCAST and MPI_REDUCE for different data sizes
    and different numbers of processes. Use data sizes of up to 10^4 doubles and up to
    the number of available processes. Plot and explain the obtained results.
P3. Implement the summation of four vectors, each of N doubles, with an algorithm
    similar to the reduction algorithm. The final sum should be available on all
    processes. Use four processes. Each of them will initially generate its own
    vector. Use MPI point-to-point communication to implement your version of
    the summation of the generated vectors. Test your program for small and large
    vectors. Comment on the results and compare the performance of your implementation
    with that of MPI_ALLREDUCE. Explain any differences.
The primary source of MPI information is the MPI Forum website,
https://www.mpi-forum.org/, where the complete MPI library specifications and documents
are available. This book mostly refers to the features of MPI Version 2.0; later
versions include more advanced options but are backward compatible
with MPI 2.0.
   Newer MPI standards [10] are trying to better support scalability in future
extreme-scale computing systems with advanced topics such as one-sided
communications, extended collective operations, process topologies, external interfaces,
etc. Advanced topics, e.g., a virtual shared memory emulation through so-called
MPI windows, which could simplify the programming and improve the execution
efficiency, are beyond the scope of this book and are well covered by the continually
evolving MPI standard, which should be the ultimate reference for enthusiastic
programmers.
   More demanding readers are advised to check several well-documented open-source
references for further reading, e.g., for the MPI standard [16], for MPI
implementations [1,2], and many other internet sources for advanced MPI programming.
   Note that besides the parallel algorithm, parallelization methodology [9], and
the computational performance of the cooperating computers, the parallel program
efficiency depends also on the topology and speed of the interconnection network
[26].
5 OpenCL for Massively Parallel Graphic Processors
Chapter Summary
This chapter will teach us how to program GPUs using OpenCL. Almost all desktop
computers ship with a quad-core processor and a GPU. Thus, we need a programming
environment in which a programmer can write programs and run them on either a
GPU or a quad-core CPU and a GPU. While CPUs are designed to handle complex
tasks, such as time slicing, branching, etc., GPUs only do one thing well. They
handle billions of repetitive low-level arithmetic operations. High-level languages,
such as CUDA and OpenCL, that target the GPUs directly, are available today so
GPU programming is rapidly becoming one of the mainstreams in the computer
science community.
       Now, suppose we are running the following fragment of code on a slimmed single-
    core CPU from Fig. 5.1b:
    The C code in Listing 5.1 implements vector addition of two floating-point vectors,
    each containing 128 elements. A slimmed CPU executes a single instruction stream
    obtained after the compilation of the program in Listing 5.1. A compiled fragment
    of the function VectorAdd that runs on a single-core CPU is presented in Fig. 5.2.
    With the first two instructions in Fig. 5.2, we clear the registers r2 and r3 (suppose
    r0 is a zero register). The register r2 is used to store the loop counter (tid from
    Listing 5.1), while the register r3 contains the offset in the vectors VecA and VecB.
    Within the loop L1, the CPU loads adjacent elements from the vectors VecA and VecB
    into the floating-point registers f1 and f2, adds them, and stores the result from the
    register f1 into the vector VecC. After that, we increment the offset in the register
    r3. Recall that the vectors contain floating-point numbers, which are represented
    with 32 bits (4 bytes); thus, the offset is incremented by 4. At the end of the loop,
    we increment the loop counter (variable tid) in the register r2, compare the loop
    counter with the value of 128 (the number of elements in each vector), and loop back
    if the counter is smaller than the length of the vectors VecA and VecB.
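The loop described above can be sketched in C as follows. This is not a reproduction of Listing 5.1; the function name vector_add_sketch is an assumption, and the explicit byte offset is written out only to mirror the roles of the registers r2 (loop counter) and r3 (offset) in the description. The sketch assumes that a float occupies 4 bytes, which holds on common platforms.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_LEN 128   /* number of elements in each vector, as in the text */

/* Adds VecA and VecB element by element and stores the result in VecC.
   tid plays the role of register r2, offset the role of register r3. */
void vector_add_sketch(const float *VecA, const float *VecB, float *VecC)
{
    assert(VecA != NULL && VecB != NULL && VecC != NULL);
    size_t offset = 0;                      /* byte offset into the vectors */
    for (int tid = 0; tid < VEC_LEN; tid++) {
        /* element address = vector base address + byte offset */
        const float *a = (const float *)((const char *)VecA + offset);
        const float *b = (const float *)((const char *)VecB + offset);
        float *c = (float *)((char *)VecC + offset);
        *c = *a + *b;                       /* load, add, store */
        offset += sizeof(float);            /* advance by 4 bytes */
    }
}
```

In ordinary C, one would simply write VecC[tid] = VecA[tid] + VecB[tid]; the compiler then generates the offset arithmetic on its own.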
        Instead of using one slimmed CPU core from Fig. 5.2, we can use two such
    cores. Why? If we use two CPU cores from Fig. 5.2, we will be able to execute
    two instruction streams fully in parallel (Fig. 5.3). A two-core CPU from Fig. 5.3
    replicates processing resources (Fetch/Decode logic, ALU, and execution context)
    and organizes them into two independent cores. When an application features two
instruction streams (i.e., two threads), a two-core CPU provides increased throughput
by simultaneously executing these instruction streams, one on each core. In the case of
vector addition from Listing 5.1, we can now run two threads, one on each core. In this
case, each thread will add 64 adjacent vector elements. Notice that both threads in
Fig. 5.3 have the same instruction stream but use different data. The first thread adds
the first 64 elements (the loop index tid in the register r2 iterates from 0 to 63),
while the second thread adds the last 64 elements (the loop index tid in the register
r2 iterates from 64 to 127).
Fig. 5.3 Two instruction streams (two threads) are executed fully in parallel on two CPU cores
Fig. 5.4 A GPU core with eight ALUs, eight execution contexts, and shared fetch/decode logic
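The division of work between the two threads can be sketched in C as below. The function names are assumptions, and the two ranges are processed one after the other here only to keep the sketch self-contained; on a two-core CPU, each call would run in its own thread on its own core.

```c
#include <assert.h>

#define VEC_LEN 128   /* number of elements in each vector */

/* Work of one thread: adds the elements with indices first..last. */
void vector_add_range(const float *VecA, const float *VecB, float *VecC,
                      int first, int last)
{
    assert(first >= 0 && first <= last && last < VEC_LEN);
    for (int tid = first; tid <= last; tid++)
        VecC[tid] = VecA[tid] + VecB[tid];
}

/* Same instruction stream, different data: each half of the vectors
   is assigned to one thread. */
void vector_add_two_threads(const float *VecA, const float *VecB, float *VecC)
{
    vector_add_range(VecA, VecB, VecC, 0, 63);     /* first thread  */
    vector_add_range(VecA, VecB, VecC, 64, 127);   /* second thread */
}
```

Because the two index ranges do not overlap, the two calls are independent and need no synchronization apart from waiting for both to finish.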
   We can achieve even higher performance by further replicating ALUs and
execution contexts as in Fig. 5.4. Instead of replicating the complete CPU core from
Fig. 5.2, we can replicate only the ALU and the execution context and leave the
fetch/decode logic shared among the ALUs. As the fetch/decode logic is shared, all ALUs
must execute the same operations contained in an instruction stream, but they can use
different input data. Figure 5.4 depicts such a core with eight ALUs, eight execution
contexts, and shared fetch/decode logic. Such a core usually implements additional
storage for data shared among the threads.
   On such a core, we can add eight adjacent vector elements in parallel using one
instruction stream. The instruction stream is now shared across threads with identical
program counters (PC). The same instruction is executed for each thread but on
different data. Thus, there is one ALU and one execution context per thread. Each
thread should now use its own ID (tid) to identify the data to be used in
instructions. The compiler for such a core should be able to translate the code
from Listing 5.1 into the assembly code from Fig. 5.4. When the first instruction is
fetched, it is dispatched to all eight ALUs within the core. Recall that each ALU has
its own set of registers (execution context), so each ALU would add its own tid to
its own register r2. The same also holds for the second and all following instructions
in the instruction stream. For example, the instruction
lfp f1,r3(vecA)
is executed on all ALUs at the same time. This instruction loads the element from
vector vecA at the address vecA+r3. Because the value in r3 is based on a
different tid, each ALU will operate on a different element from vector vecA. Most
modern GPUs use this approach, where the cores execute scalar instructions but one
instruction stream is shared across many threads.
    In this book, we will refer to the core from Fig. 5.4 as a Compute Unit (CU)
and to an ALU as a Processing Element (PE). Let us summarize the key features of
compute units. We can say that they are general-purpose processors, but they are
designed very differently from the general-purpose cores in CPUs: they support so-called
SIMD (Single Instruction Multiple Data) parallelism through replication of execution
units (ALUs) and corresponding execution contexts, they do not support branch
prediction or speculative execution, and they have less cache than general-purpose CPUs.
    We can further improve the execution speed of our vector addition problem by
replicating compute units. Figure 5.5 shows a GPU containing 16 compute units. Using
16 compute units as in Fig. 5.5, we can add 128 adjacent vector elements in parallel
using one instruction stream. Each CU executes the code snippet in Fig. 5.5, which
represents one thread. Let us suppose that we run 128 threads and each thread has
its own ID, tid, where tid is in the range 0 . . . 127. The first two instructions load
the thread ID tid into r3 and multiply it by 4 (in order to obtain the correct offset
in the floating-point vector). Now, the register r3 that belongs to each thread contains
the offset of the vector element that will be accessed in that thread. Each thread
then adds two adjacent elements of vecA and vecB and stores the result into the
corresponding element of vecC. Because each compute unit has eight processing
elements (128 processing elements in total), there is no need for the loop. Hopefully,
we are now able to understand the basic idea behind modern GPUs: use as many
ALUs as possible and let the ALUs execute the same instructions in lock-step, i.e.,
running the same instruction at the same time but on different data.
Fig. 5.5 Sixteen compute units, each containing eight processing elements and eight separate
contexts
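The loop-free, per-thread form of the vector addition can be emulated in C as follows. The function names and the mapping tid = cu * 8 + pe are assumptions made for this sketch; the nested loops only stand in for the 16 x 8 processing elements, which on a GPU would all execute the per-thread body at the same time.

```c
#include <assert.h>

#define NUM_CU     16   /* compute units, as in Fig. 5.5 */
#define PE_PER_CU   8   /* processing elements per compute unit */

/* Work of a single thread: exactly one element, selected by tid.
   No loop is needed inside the thread. */
void vector_add_thread(const float *vecA, const float *vecB, float *vecC,
                       int tid)
{
    assert(tid >= 0 && tid < NUM_CU * PE_PER_CU);
    vecC[tid] = vecA[tid] + vecB[tid];
}

/* Sequential emulation of 16 CUs with 8 PEs each (128 threads in total).
   The assignment of thread IDs to PEs is an assumed, illustrative mapping. */
void vector_add_gpu_emulated(const float *vecA, const float *vecB, float *vecC)
{
    for (int cu = 0; cu < NUM_CU; cu++)
        for (int pe = 0; pe < PE_PER_CU; pe++)
            vector_add_thread(vecA, vecB, vecC, cu * PE_PER_CU + pe);
}
```

The body of vector_add_thread corresponds to what a kernel expresses: the computation for one thread, with the index supplied by the execution environment instead of a loop counter.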
Modern GPUs comprise tens of compute units. The efficiency of wide SIMD
processing allows GPUs to pack many CU cores densely with processing elements.
For example, the NVIDIA GeForce GTX780 GPU contains 2304 processing ele-
ments. These processing elements are organized into 12 CU cores (192 PEs per CU).
All modern GPUs maintain large numbers of execution contexts on chip to provide
maximal memory latency-hiding ability. This represents a significant departure from
CPU designs, which attempt to avoid or minimize stalls primarily using large, low-
latency data caches and complicated out of order execution logic. Each CU contains
thousands of 32-bit registers that are used to store execution context and are evenly
allocated to threads (or PEs). Registers are both the fastest and most plentiful mem-
ory in the compute unit. As an example, CU in NVIDIA GeForce GTX780 (Kepler
microarchitecture) contains 65,536 (64 K) 32-bit registers. To achieve large-scale
multithreading, execution contexts must be compact. The number of thread contexts
supported by a CU core is limited by the size of on-chip execution context stor-
age. GPUs can manage many thread contexts (and provide maximal latency-hiding
ability) when threads use fewer resources. When threads require large amounts of
storage, the number of execution contexts (and latency-hiding ability) provided by
a GPU drops. Table 5.1 shows the structure of some of the modern NVIDIA GPUs.
The GPU device containing hundreds of simple processing elements is ideally suited
for computations that can be run in parallel. That is, data parallelism is optimally
handled on the GPU device. This typically involves arithmetic on large data sets
(such as vectors, matrices, and images), where the same operation can be performed
across thousands, if not millions, of data elements at the same time. To exploit such
a huge parallelism, the programmers should partition their programs into thousands
of threads and schedule them among compute units. To make it easier to switch to
OpenCL later in this chapter, we will now define and use the same thread terminology
as OpenCL does. In that sense, we will use the term work-item (WI) for a thread.
   Work-items (or threads) are actually scheduled among compute units in two steps,
which are given as follows:
Fig. 5.6 A programmer partitions a program into blocks of work-items (threads) called work-groups.
Work-groups execute independently from each other on CUs. Generally, a GPU with more CUs
will execute the program faster than a GPU with fewer CUs
                                            Warp
     A warp is a group of 32 work-items from the same work-group that are executed in
     parallel at the same time. Work-items in a warp execute in so-called lock-step.
     Each warp contains work-items with consecutive, increasing work-item IDs. Individual
     work-items composing a warp start together at the same program address, but they
     have their own instruction address counter and register state and are therefore free to
     branch and execute independently. However, the best performance is achieved when all
     work-items from the same warp execute the same instructions.
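Since a warp holds 32 work-items with consecutive IDs, the warp to which a work-item belongs follows from integer division. The C helpers below are a sketch with assumed names; they assume that local_id is the linear work-item ID within its work-group, and that a work-group whose size is not a multiple of 32 ends with a partially filled warp.

```c
#include <assert.h>

#define WARP_SIZE 32   /* work-items per warp, as stated in the text */

/* Index of the warp (within the work-group) that contains the work-item. */
int warp_of(int local_id)
{
    assert(local_id >= 0);
    return local_id / WARP_SIZE;
}

/* Position of the work-item inside its warp (0..31). */
int lane_of(int local_id)
{
    assert(local_id >= 0);
    return local_id % WARP_SIZE;
}

/* Number of warps needed for a work-group (rounded up; the last warp
   is assumed to be only partially filled if the size is not a
   multiple of 32). */
int warps_per_group(int group_size)
{
    assert(group_size > 0);
    return (group_size + WARP_SIZE - 1) / WARP_SIZE;
}
```

For a work-group of 256 work-items, this gives eight warps; work-items 0 to 31 form warp 0, work-items 32 to 63 form warp 1, and so on.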
    If processing elements within a CU remain idle during the period while a warp is
stalled, then a GPU is inefficiently utilized. Instead, GPUs maintain more execution
contexts on CU than they can simultaneously execute (recall that a huge register
file is used to store context for each work-item). In such a way, PEs can execute
instructions from active work-items when others are stalled. The execution context
(program counters, registers, etc.) for each warp processed by a CU is maintained
on-chip during the entire lifetime of the warp. Therefore, switching from one execution
context to another has no cost. Also, having multiple resident work-groups per CU
can help reduce idling in the case of barriers, as warps from different work-groups
do not need to wait for each other at barriers.
Fig. 5.7 Scheduling of warps within a compute unit. At every instruction issue time, a warp
scheduler selects a warp that has work-items ready to execute its next instruction. Each warp always
contains work-items with consecutive work-item IDs, but warps are executed out of order
      • Registers
      • Local Memory
      • Texture Memory
      • Constant Memory
      • Global Memory
Modern GPUs have several memories that can be accessed from a single work-item.
Memory hierarchy of a modern GPU is shown in Fig. 5.8. A memory hierarchy has a
number of levels of areas where work-items can place data. Each level has its latency
(i.e., access time) as shown in Table 5.2.
    The GPU has thousands of registers per compute unit (CU). The registers are at
the first and also the most preferable level, as their access time is 1 cycle. Recall
that the GPU dedicates real registers to each and every work-item. The number of
registers per work-item is calculated at compile time. Depending on the particular
microarchitecture of a CU, there are 16 K, 32 K, or 64 K registers for all work-items
within a CU. For example, with the Kepler microarchitecture, you get 64 K registers
per CU. If you decide to partition your program such that there are 256 work-items
per work-group and four work-groups per CU, you will get
65536 / (256*4) = 64 registers per work-item on a CU.
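The register budget in this example is a single integer division, which the following C helper writes out. The function name is an assumption; the sketch ignores any allocation granularity that a real device may impose.

```c
#include <assert.h>

/* Registers available to each work-item when the register file of one CU
   is shared evenly among all resident work-items. */
int registers_per_work_item(int registers_per_cu,
                            int work_items_per_group,
                            int groups_per_cu)
{
    assert(work_items_per_group > 0 && groups_per_cu > 0);
    return registers_per_cu / (work_items_per_group * groups_per_cu);
}
```

With 65536 registers, 256 work-items per work-group, and four work-groups per CU, the function returns 64; doubling the work-group size to 512 halves the budget to 32 registers per work-item.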
   Each CU contains a small amount (64 KB) of very fast on-chip memory that
can be accessed from the work-items running on that particular CU. It is mainly
used for data interchange within a work-group running on the CU. This memory is
called local or shared memory. Local memory acts as a user-controlled L1 cache.
Actually, on modern GPUs, this on-chip memory can be used as a user-controlled
local memory or as a standard hardware-controlled L1 cache. For example, on Kepler
CUs, this memory can be split into 48 KB of local memory and 16 KB of L1 cache. On CUs
with the Tesla microarchitecture, there is 16 KB of local memory and no L1 cache.
Local memory has around one-fifth of the speed of registers.
                                      Memory coalescing
      Coalesced memory access or memory coalescing refers to combining multiple memory
      accesses into a single transaction. Grouping of work-items into warps is not only relevant
      to computation, but also to global memory accesses. The GPU device coalesces global
      memory loads and stores issued by work-items of a warp into as few transactions as
      possible to minimize the DRAM bandwidth used. On recent GPUs, every successive 128
      bytes of memory (e.g., 32 single-precision words) can be accessed by a warp in a single
      transaction.
    The largest memory space on a GPU is the global memory. The global memory
space is implemented in high-speed GDDR, or graphics dynamic memory, which
achieves very high bandwidth but, like all memory, has a high latency. GPU global
memory is global because it’s accessible from both the GPU and the CPU. It can
actually be accessed from any device on the PCI-E bus. For example, the GeForce
GTX780 GPU has 3 GB of global memory implemented in GDDR5. Global memory
resides in device DRAM and is used for transfers between the host and device as
well as for the data input to and output from work-items running on CUs. Reads and
writes to global memory are always initiated from a CU and are always 128 bytes wide,
starting at an address aligned to a 128-byte boundary. The blocks of memory that are
accessed in one memory transaction are called segments. This has an extremely
important consequence. If two work-items of the same warp access two data items that
fall into the same 128-byte segment, the data is delivered in a single transaction. If, on
the other hand, a fetched segment contains data that no work-item requested, it is
read anyway and you (probably) waste bandwidth. And if two work-items from
the same warp access two data items that fall into two different 128-byte segments, two
memory transactions are required. The important thing to remember is that to ensure
memory coalescing, we want work-items from the same warp to access contiguous
elements in memory so as to minimize the number of required memory transactions.
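The effect of the access pattern on the number of transactions can be estimated with a small C model that counts the distinct 128-byte segments touched by one warp. This is a simplified sketch under stated assumptions, not a description of any particular device: the function name is hypothetical, every work-item reads one element, lane i reads the element at base + i * stride_elems * elem_bytes, and no element straddles a segment boundary.

```c
#include <assert.h>
#include <stdint.h>

#define WARP_SIZE      32    /* work-items per warp */
#define SEGMENT_BYTES 128    /* width of one memory transaction */

/* Counts the distinct 128-byte segments accessed by the 32 work-items of
   a warp, i.e., the number of memory transactions in this simple model. */
int transactions_for_warp(uint64_t base, uint64_t elem_bytes,
                          uint64_t stride_elems)
{
    assert(elem_bytes > 0 && elem_bytes <= SEGMENT_BYTES);
    uint64_t seen[WARP_SIZE];   /* segments already counted */
    int count = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t addr = base + (uint64_t)lane * stride_elems * elem_bytes;
        uint64_t seg = addr / SEGMENT_BYTES;      /* segment index */
        int found = 0;
        for (int k = 0; k < count; k++)
            if (seen[k] == seg) { found = 1; break; }
        if (!found)
            seen[count++] = seg;
    }
    return count;
}
```

In this model, 32 consecutive 4-byte elements starting at an aligned address need one transaction, the same access starting 64 bytes later needs two, and a stride of 32 elements (128 bytes) needs 32 transactions, one per work-item.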
    There are also two additional read-only memory spaces within global memory that
are accessible by all work-items: constant memory and texture memory. The constant
memory space resides in device memory and is cached. This is where constants
and program arguments are stored. Constant memory has two special properties: first,
it is cached, and second, it supports broadcasting a single value to all work-items
within a warp. This broadcast takes place in just a single cycle. Texture memory is
cached, so an image read costs one read from device memory only on a cache
miss; otherwise, it costs just one read from the texture cache. The texture cache is
optimized for 2D spatial access patterns, so work-items of the same warp that read
image addresses that are close together will achieve the best performance.

                                       OpenCL kernel
     Code that gets executed on a GPU device is called a kernel in OpenCL. Kernels are
     written in a C dialect, which is mostly straightforward C with a lot of built-in functions
     and additional data types. The body of a kernel function implements the computation
     to be completed by all work-items.
So far, we have learned how GPUs are built, what compute units and processing
elements are, how work-groups and work-items are scheduled on CUs, which memory
is present on a modern GPU, and what the memory hierarchy of a modern GPU is.
We have mentioned that the programmer is responsible for partitioning programs into
work-groups of work-items. In the following sections, we will learn what a
programmer’s view of a heterogeneous system is and how to use OpenCL to program
a GPU.
5.2.1 OpenCL
OpenCL (Open Computing Language) is the open, royalty-free standard for cross-
platform, parallel programming of diverse processors found in personal computers,
servers, mobile devices, and embedded platforms. OpenCL is a framework for writing
programs that execute across heterogeneous platforms consisting of central process-
ing units (CPUs), graphics processing units (GPUs), and other types of processors
or hardware accelerators. OpenCL specifies:
OpenCL defines the OpenCL C programming language that is used to write compute
kernels, the C-like functions that implement the task to be executed
by all work-items running on a GPU. Unfortunately, OpenCL has one significant
drawback: it is not easy to learn. Even the most introductory application is difficult
for a newcomer to grasp. Before jumping into OpenCL and taking advantage of its
parallel-processing capabilities, an OpenCL developer needs to clearly understand
three basic concepts: the heterogeneous system (also called the platform model), the
execution model, and the memory model.
A heterogeneous system (also called the platform model) consists of a single host
connected to one or more OpenCL devices (e.g., GPUs, FPGA accelerators, DSPs, or even
CPUs). The device is where the OpenCL kernels execute. A typical heterogeneous
system is shown in Fig. 5.9. An OpenCL program consists of a host program, which
runs on the host (typically a desktop computer with a general-purpose CPU),
and one or more kernels that run on the OpenCL devices. An OpenCL device
comprises several compute units. Each compute unit comprises tens or hundreds of
processing elements.
The OpenCL execution model defines how kernels execute. The most important
concept to understand is NDRange (N-Dimensional Range) execution. The host
program invokes a kernel over an index space. An example of an index space which
is easy to understand is a for loop in C. In the for loop defined by the statement
for(int i=0; i<5; i++), any statements within this loop will execute five
times, with i = 0, 1, 2, 3, 4. In this case, the index space of the loop is [0, 1, 2, 3, 4]. In
OpenCL, index space is called NDRange, and can have 1, 2, or 3 dimensions. OpenCL
kernel functions are executed exactly one time for each point in the NDRange index
space. This unit of work for each point in the NDRange is called a work-item.
Unlike for loops in C, where loop iterations are executed sequentially and in-order,
an OpenCL device is free to execute work-items in parallel and in any order. Recall
that work-items are not scheduled for execution individually onto OpenCL devices.
Instead, work-items are organized into work-groups, which are the unit of work
scheduled onto compute units. Because of this, work-groups also define the set of
work-items that may share data using local memory. Synchronization is possible
only between the work-items in a work-group.
     Work-items have unique global IDs from the index space. Work-items are further
grouped into work-groups and all work-items within a work-group are executed on
the same compute unit. Work-groups have a unique work-group ID and work-items
have a unique local ID within a work-group. NDRange defines the total number
of work-items that execute in parallel. This number is called global work size and
must be provided by a programmer before the kernel is submitted for execution. The
number of work-items within a work-group is called local work size. The programmer
may also set the local work size at runtime. Work-items within a work-group can
communicate with each other and we can synchronize them. In addition, work-items
within a work-group are able to share memory. Once the local work size has been
determined, the NDRange (global work size) is divided automatically into work-
groups, and the work-groups are scheduled for execution on the device.
     A kernel function is written on the host. The host program then compiles the
kernel and submits the kernel for execution on a device. The host program is thus
responsible for creating a collection of work-items, each of which uses the same
instruction stream defined by a single kernel. While the instruction stream is the
same, each work-item operates on different data. Also, the behavior of each work-
item may vary because of branch statements within the instruction stream.
     Figure 5.10 shows an example of NDRange where each small square represents
a work-item. NDRange in Fig. 5.10 is a two-dimensional index space of size (GX,
GY). Each work-item within this NDRange has its own global index (g_x, g_y). For
example, the shaded square has global index (10, 12). The work-items are grouped
into two-dimensional work-groups. Each work-group contains 64 work-items and
is of size (LX, LY). Each work-item within a work-group has a unique local index
(l_x, l_y). For example, the shaded square has local index (2, 4). Also, each work-group
has its own work-group index (w_x, w_y). For example, the work-group containing
the shaded square has work-group index (1, 1). And finally, the size of the NDRange
index space can be expressed with the number of work-groups in each dimension,
(WX, WY).
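The relation between the three kinds of indices can be checked with a few lines of plain C. The sketch below assumes work-groups of size (LX, LY) = (8, 8), which gives the 64 work-items per work-group stated above and is consistent with the indices of the shaded square. The names SplitIndex, split_global_index, and join_index are hypothetical helpers, not part of OpenCL. Inside a kernel, the corresponding values are returned by the built-in functions get_global_id, get_group_id, and get_local_id.

```c
#include <assert.h>
#include <stddef.h>

/* Work-group index and local index of a work-item in one dimension. */
typedef struct {
    size_t group;  /* work-group index, e.g., w_x */
    size_t local;  /* local index within the work-group, e.g., l_x */
} SplitIndex;

/* Split a global index into (work-group index, local index),
 * assuming a zero global offset. */
SplitIndex split_global_index(size_t global, size_t local_size)
{
    assert(local_size > 0);
    SplitIndex s;
    s.group = global / local_size;
    s.local = global % local_size;
    return s;
}

/* Inverse mapping: global = group * local_size + local. */
size_t join_index(size_t group, size_t local, size_t local_size)
{
    assert(local < local_size);
    return group * local_size + local;
}
```

With a local size of 8 in each dimension, the global index (10, 12) splits into work-group index (1, 1) and local index (2, 4), which matches the shaded square in the description of Fig. 5.10.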
Since the host and the OpenCL devices do not share a common memory address space,
the OpenCL memory model defines four regions of memory accessible
to work-items when executing a kernel. Figure 5.11 shows the regions of memory
accessible by the host and the compute device. OpenCL generalizes the different
types of memory available on a device into private memory, local memory, global
memory, and constant memory, as follows:
1. Private memory: a memory region that is private to a work-item. For example,
   on a GPU device this would be the registers within the compute unit.
2. Local memory: a memory region that is shared within a work-group. All work-
   items in the same work-group have both read and write access. On a GPU device,
   this is local memory within the compute unit.
3. Global memory: a memory region to which all work-items and work-groups
   have read and write access. It is visible to all work-items and all work-groups.
   On a GPU device, it is implemented in GDDR memory (e.g., GDDR5). This region of
   memory can be allocated only by the host during runtime.
4. Constant memory: a region of global memory that stays constant throughout
   the execution of the kernel. Work-items have only read access to this region. The
   host is permitted both read and write access.
When writing kernels in the OpenCL language, we must declare memory with
address space qualifiers to indicate whether the data resides in global (__global),
constant (__constant), or local (__local) memory; without a qualifier, variables
declared within a kernel default to private memory.
     We will start with a simple C program that adds the corresponding elements of two
     arrays (vectors) with N elements each. The sample C code for vector addition that is
     intended to run on a single-core CPU is shown in Listing 5.2.
     We compute the sum within a while loop. The index iGID ranges from 0
     to iNumElements - 1. In each iteration, we add elements a[iGID] and
     b[iGID] and place the result in c[iGID].
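A sequential function matching this description could look like the sketch below. The names VectorAdd, iGID, iNumElements, a, b, and c follow the text. The exact signature, the const qualifiers, and the assertion are assumptions, so this should be read as one possible form rather than the original listing.

```c
#include <assert.h>

/* Sequential element-by-element vector addition on the CPU:
 * c[i] = a[i] + b[i] for i = 0 .. iNumElements - 1. */
void VectorAdd(const float *a, const float *b, float *c, int iNumElements)
{
    assert(iNumElements >= 0);

    int iGID = 0;
    while (iGID < iNumElements) {
        /* add the elements at the same index */
        c[iGID] = a[iGID] + b[iGID];
        iGID++;
    }
}
```

The loop body depends only on iGID, so every iteration is independent of the others. This independence is what allows the iterations to be distributed over work-items later.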
        Now, we will try to implement the same problem using OpenCL and execute it on
     a GPU. We will use this simple problem of adding two vectors because the emphasis
     will be on getting familiar with OpenCL and not on solving the problem itself. We
     will show how to split the code into two parts: the kernel function and the host code.
     Kernel Function
     We can accomplish the same addition on a GPU. To execute the vector addition
     on a GPU device, we must write it as a kernel function. Each thread on the GPU
     device will then execute the same kernel function. The main idea is to replace
     loop iterations with kernel functions executing at each point in a problem domain.
     For example, we process vectors with iNumElements elements with one kernel
     invocation per element, i.e., with iNumElements threads (kernel
     executions). The OpenCL kernel is a code sequence that will be executed by every
     single thread running on a GPU. It is very similar in structure to a C function, but
     it has the qualifier __kernel. This qualifier alerts the compiler that a function is
     to be compiled to run on an OpenCL device instead of the host. The arguments are
     passed to a kernel as they are passed to any C function. The arguments in the global
     memory are described with the __global qualifier and the arguments in the shared
     (local) memory are described with the __local qualifier. These arguments should
     always be passed as pointers.
        As each thread executing the kernel function operates on its own data, there should
     be a way to identify the thread and link it with particular data. To determine the thread
     ID, we use the get_global_id function, which works for multiple dimensions.
        The kernel function should look similar to the function VectorAdd from
     Listing 5.2. If we assume that each work-item calculates one element of array c, the
     kernel function looks as shown in Listing 5.3.
     // OpenCL kernel function for element-by-element
     // vector addition
     __kernel void VectorAdd(
         __global float* a,
         __global float* b,
         __global float* c,
         int iNumElements)
     {
         // find my global index and handle the data at this index
         int iGID = get_global_id(0);

         if (iGID < iNumElements) {
             // add adjacent elements
             c[iGID] = a[iGID] + b[iGID];
         }
     }
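The bounds check in the kernel matters when more work-items are launched than there are vector elements. In OpenCL 1.x, a global work size that is launched with an explicit local work size has to be a multiple of that local size, so the host commonly rounds the global size up. The following plain C sketch emulates such a one-dimensional launch sequentially on the host. The functions round_up, vector_add_work_item, and emulate_ndrange are hypothetical names, and the parameter gid stands in for the value that get_global_id(0) would return on a device.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of local_size. */
size_t round_up(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Work done by one work-item; gid plays the role of get_global_id(0). */
void vector_add_work_item(size_t gid, const float *a, const float *b,
                          float *c, int iNumElements)
{
    /* Work-items in the padded part of the range do nothing. */
    if (gid < (size_t)iNumElements) {
        c[gid] = a[gid] + b[gid];
    }
}

/* Emulate a 1D NDRange launch one work-item after another.
 * Returns the global work size that was used. */
size_t emulate_ndrange(const float *a, const float *b, float *c,
                       int iNumElements, size_t local_size)
{
    assert(iNumElements >= 0);
    size_t global_size = round_up((size_t)iNumElements, local_size);

    for (size_t gid = 0; gid < global_size; gid++) {
        vector_add_work_item(gid, a, b, c, iNumElements);
    }
    return global_size;
}
```

A real device is free to run the work-items in parallel and in any order. The sequential loop is only a model of which indices are visited and why the guard keeps the padded work-items from writing past the end of c.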
Host Code
In developing an OpenCL project, the first step is to code the host application. The
host application runs on a user’s computer (the host) and dispatches kernels to
connected devices. The host application can be coded in C or C++. Because OpenCL
supports a wide range of heterogeneous platforms, the programmer must first
determine which OpenCL devices are connected to the platform. After discovering the
devices constituting the platform, the programmer chooses one or more devices on
which to run the kernel function. Only after that can the kernel function be compiled
and executed on the selected device. Thus, the kernel functions are compiled
at runtime and the compilation process is initiated from the host code.
   Before executing a kernel function, the host program for a heterogeneous system
must carry out the following steps:
1. Discover the OpenCL devices that constitute the heterogeneous system. The
   OpenCL abstraction of the heterogeneous system is represented with platform
   and devices. The platform consists of one or more devices capable of executing
   the OpenCL kernels.
2. Probe the characteristics of these devices so that the software (kernel functions)
   can adapt to the specific features.
3. Read the program source containing the kernel function(s) and compile the ker-
   nel(s) that will run on the selected device(s).
4. Set up memory objects on the selected device(s) that will hold the data for the
   computation.
5. Compile and run the kernel(s) on the selected device(s).
6. Collect the final result from device(s).
The host code can be very difficult to understand for the beginner, but we will soon
realize that a large part of the host code is repeated and can be reused in different
applications. Once we understand the host code, we will only devote our attention
to writing kernel functions. The above steps are accomplished through the following
series of calls to OpenCL API within the host code:
     // ***************************************************
     // STEP 1: Discover and initialize the devices
     // ***************************************************

     // Use clGetDeviceIDs() to retrieve the number of
     // devices present
     status = clGetDeviceIDs(
                  NULL,
                  CL_DEVICE_TYPE_ALL,
                  0,
                  NULL,
                  &numDevices);
     if (status != CL_SUCCESS)
     {
         printf("Error: Failed to create a device group!\n");
         return EXIT_FAILURE;
     }

     printf("The number of devices found = %d\n", numDevices);

     // Allocate enough space for each device
     devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
     // Fill in devices with clGetDeviceIDs()
     status = clGetDeviceIDs(
                  NULL,
                  CL_DEVICE_TYPE_ALL,
                  numDevices,
                  devices,
                  NULL);
     if (status != CL_SUCCESS)
     {
         printf("Error: Failed to create a device group!\n");
         return EXIT_FAILURE;
     }

     The first call is used to discover the number of devices present. This number is returned
     in the numDevices variable. On an Apple laptop with an Intel GPU, there are two
     discovered devices:
     The number of devices found = 2
     Once we know the number of devices, we make enough space in the devices buffer
     with malloc(), and then we make the second call to clGetDeviceIDs() to
     obtain the list of all devices in the devices buffer.
        We can get and print information about an OpenCL device with the
     clGetDeviceInfo() function.

           cl_int clGetDeviceInfo(
              cl_device_id device,
              cl_device_info param_name,
              size_t param_value_size,
              void *param_value,
              size_t *param_value_size_ret)
           clGetDeviceInfo returns CL_SUCCESS if the function is executed successfully.
           Parameters are:

        The sample code for printing information about discovered OpenCL devices is
     shown in Listing 5.6.
     The following is the output of the code in Listing 5.6 for an Apple laptop with an
     Intel GPU:
     === OpenCL devices found on platform: ===
       -- Device 0 --
       DEVICE_NAME = Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz
       DEVICE_VENDOR = Intel
       DEVICE_MAX_COMPUTE_UNITS = 8
       CL_DEVICE_MAX_WORK_GROUP_SIZE = 2200
       CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
       CL_DEVICE_MAX_WORK_ITEM_SIZES = 1024, 1, 1
       -- Device 1 --
       DEVICE_NAME = Iris Pro
       DEVICE_VENDOR = Intel
       DEVICE_MAX_COMPUTE_UNITS = 40
       CL_DEVICE_MAX_WORK_GROUP_SIZE = 1200
       CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS = 3
       CL_DEVICE_MAX_WORK_ITEM_SIZES = 512, 512, 512
        We can use this information about an OpenCL device later in our host program
     to automatically adapt the kernel to the specific features. In the above example, the
     device with index 1 is an Intel Iris Pro GPU. It has 40 compute units, each work-group
     can have up to 1200 work-items, the NDRange can have up to three dimensions,
     and the maximum size in each dimension is 512. We can also see that the device
     with index 0 is an Intel Core i7 CPU, which can also execute OpenCL code. It has
     eight compute units (four cores, two hardware threads per core).
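One simple way to use the queried limits is to clamp a preferred local work size to what the device reports. The sketch below is plain C and handles a one-dimensional NDRange only. The function choose_local_size is a hypothetical helper. Its two limit parameters are assumed to hold the values obtained from clGetDeviceInfo for CL_DEVICE_MAX_WORK_GROUP_SIZE and for the first entry of CL_DEVICE_MAX_WORK_ITEM_SIZES.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a preferred 1D local work size to the limits reported by a device.
 *   max_work_group_size: value of CL_DEVICE_MAX_WORK_GROUP_SIZE
 *   max_work_item_size:  first entry of CL_DEVICE_MAX_WORK_ITEM_SIZES */
size_t choose_local_size(size_t preferred,
                         size_t max_work_group_size,
                         size_t max_work_item_size)
{
    assert(max_work_group_size > 0 && max_work_item_size > 0);

    size_t local = preferred;
    if (local > max_work_group_size) {
        local = max_work_group_size;
    }
    if (local > max_work_item_size) {
        local = max_work_item_size;
    }
    if (local == 0) {
        local = 1;  /* a work-group needs at least one work-item */
    }
    return local;
}
```

With the limits printed above for the device with index 1 (1200 and 512), a preferred size of 256 is kept, while a preferred size of 1024 is reduced to 512.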
     2. Create a context
     Once we have discovered the available OpenCL devices on the platform and have
     obtained at least one device ID, we can create an OpenCL context. The context
     is used to group devices and memory objects together and to manage command
     queues, program objects, and kernel objects. An OpenCL context is created with one
     or more devices. The OpenCL runtime also uses contexts for executing
     kernels on one or more devices specified in the context. An OpenCL context is created
     with the clCreateContext function. Listing 5.7 shows a call to this function.
     // ***************************************************
     // STEP 2: Create a context
     // ***************************************************

     cl_context context = NULL;
     // Create a context using clCreateContext() and
     // associate it with the devices
     context = clCreateContext(
                   NULL,
                   numDevices,
                   devices,
                   NULL,
                   NULL,
                   &status);
     if (!context)
     {
         printf("Error: Failed to create a compute context!\n");
         return EXIT_FAILURE;
     }
     // ***************************************************
     // STEP 3: Create a command queue
     // ***************************************************

     cl_command_queue cmdQueue;
     // Create a command queue using clCreateCommandQueue(),
     // and associate it with the device you want to execute
     // on
     cmdQueue = clCreateCommandQueue(
                    context,
                    devices[1],   // GPU
                    CL_QUEUE_PROFILING_ENABLE,
                    &status);

     if (!cmdQueue)
     {
         printf("Error: Failed to create a command queue!\n");
         return EXIT_FAILURE;
     }
     // ***************************************************
     // STEP 4: Create the program object for a context
     // ***************************************************

     // ***************************************************
     // STEP 5: Build the program
     // ***************************************************

     ciErr = clBuildProgram(
                 cpProgram,
                 0,
                 NULL,
                 NULL,
                 NULL,
                 NULL);

     if (ciErr != CL_SUCCESS)
     {
         size_t len;
         char buffer[2048];

         printf("Error: Failed to build program executable!\n");
         clGetProgramBuildInfo(cpProgram,
                               devices[1],
                               CL_PROGRAM_BUILD_LOG,
                               sizeof(buffer),
                               buffer,
                               &len);
         printf("%s\n", buffer);
         exit(1);
     }

          The clCreateBuffer function creates a buffer object of size bytes within the context
          context using the flags flags. The pointer host_ptr points to buffer data that may
          already be allocated by the application on the host. The function returns a valid
          non-zero buffer object and sets errcode_ret to CL_SUCCESS if the buffer object is
          created successfully. Otherwise, it returns NULL and an error code in errcode_ret.
          The bit-field flags is used to specify allocation and usage information such as the
          memory arena that should be used to allocate the buffer object and how it will be used.
          The following are some of the possible values for flags:
          CL_MEM_READ_WRITE This flag specifies that the memory object will be read
            and written by a kernel. This is the default.
          CL_MEM_WRITE_ONLY This flag specifies that the memory object will be
            written but not read by a kernel. Reading from a buffer object created with
            CL_MEM_WRITE_ONLY inside a kernel is undefined.
          CL_MEM_READ_ONLY This flag specifies that the memory object is a read-only
            memory object when used inside a kernel. Writing to a buffer or image object created
            with CL_MEM_READ_ONLY inside a kernel is undefined.

     memory from code that executes on the device. Also, we pass these pointers as arguments
     to kernels, i.e., functions that execute on the device. To read from or write to device
     buffers from the host, we must use the dedicated OpenCL functions clEnqueueReadBuffer
     and clEnqueueWriteBuffer. Upon creation, the contents of the device buffers
     are undefined. We must explicitly fill the device buffers with our data from the host
     application. We will show this in the next subsection.
        Listing 5.11 shows how to create three device buffers: bufferA and bufferB
     that are read-only and are used to store input vectors; and bufferC that is write-only
     and used to store the result of vector addition.
     // ***************************************************
     // STEP 6: Create device buffers
     // ***************************************************

     cl_mem bufferA;   // Input array on the device
     cl_mem bufferB;   // Input array on the device
     cl_mem bufferC;   // Output array on the device
     // cl_mem noElements;

     // Size of data:
     size_t datasize = sizeof(cl_float) * iNumElements;

     // Use clCreateBuffer() to create a buffer object (d_A)
     // that will contain the data from the host array A
     bufferA = clCreateBuffer(
                   context,
                   CL_MEM_READ_ONLY,
                   datasize,
                   NULL,
                   &status);

     // Use clCreateBuffer() to create a buffer object (d_B)
     // that will contain the data from the host array B
     bufferB = clCreateBuffer(
                   context,
                   CL_MEM_READ_ONLY,
                   datasize,
                   NULL,
                   &status);

     // Use clCreateBuffer() to create a buffer object (d_C)
     // with enough space to hold the output data
     bufferC = clCreateBuffer(
                   context,
                   CL_MEM_WRITE_ONLY,
                   datasize,
                   NULL,
                   &status);
     // ***************************************************
     // STEP 7: Write host data to device buffers
     // ***************************************************
     // Use clEnqueueWriteBuffer() to write input array A to
     // the device buffer bufferA
     status = clEnqueueWriteBuffer(
                  cmdQueue,
                  bufferA,
                  CL_FALSE,
                  0,
                  datasize,
                  srcA,
                  0,
                  NULL,
                  NULL);

     // Use clEnqueueWriteBuffer() to write input array B to
     // the device buffer bufferB
     status = clEnqueueWriteBuffer(
                  cmdQueue,
                  bufferB,
                  CL_FALSE,
                  0,
                  datasize,
                  srcB,
                  0,
                  NULL,
                  NULL);

           This function creates a kernel object from a function kernel_name contained within
           a program object program with a successfully built executable. Refer to the OpenCL
           2.2 Specification for a more detailed description.
     // ***************************************************
     // STEP 8: Create and compile the kernel
     // ***************************************************

     // Create the kernel
     ckKernel = clCreateKernel(
                    cpProgram,
                    "VectorAdd",
                    &ciErr);
     if (!ckKernel || ciErr != CL_SUCCESS)
     {
         printf("Error: Failed to create compute kernel!\n");
         exit(1);
     }

          Arguments to the kernel are referred to by indices that go from 0 for the leftmost
          argument to n − 1, where n is the total number of arguments declared by a kernel. The
          argument index refers to the specific argument position of the kernel definition that
          must be set. The last two arguments of clSetKernelArg specify the size of the argument
          data and the pointer to the actual data that should be used as the argument value. If a
          kernel function argument is declared to be a pointer of a built-in or user-defined type
          with the __global or __constant qualifier, a buffer memory object must be used. Refer
          to the OpenCL 2.2 Specification for a more detailed description.
     // ***************************************************
     // STEP 9: Set the kernel arguments
     // ***************************************************
     // Set the argument values
     ciErr = clSetKernelArg(ckKernel,
                            0,
                            sizeof(cl_mem),
                            (void*)&bufferA);
     ciErr |= clSetKernelArg(ckKernel,
                             1,
                             sizeof(cl_mem),
                             (void*)&bufferB);
     ciErr |= clSetKernelArg(ckKernel,
                             2,
                             sizeof(cl_mem),
                             (void*)&bufferC);
     ciErr |= clSetKernelArg(ckKernel,
                             3,
                             sizeof(cl_int),
                             (void*)&iNumElements);

          cl_int clEnqueueNDRangeKernel (
             cl_command_queue command_queue,
             cl_kernel kernel,
             cl_uint work_dim,
             const size_t *global_work_offset,
             const size_t *global_work_size,
             const size_t *local_work_size,
             cl_uint num_events_in_wait_list,
             const cl_event *event_wait_list,
             cl_event *event)
    // ***************************************************
    // Start Core sequence... copy input data to GPU, compute,
    //          copy results back

    // ***************************************************
    // STEP 10: Enqueue the kernel for execution
    // ***************************************************
    // Launch kernel
    ciErr = clEnqueueNDRangeKernel(
                cmdQueue,
                ckKernel,
                1,
                NULL,
                &szGlobalWorkSize,
                &szLocalWorkSize,
                0,
                NULL,
                NULL);
    if (ciErr != CL_SUCCESS)
    {
        printf("Error launching kernel!\n");
    }

    // Wait for the commands to get serviced before
    // reading back results
    clFinish(cmdQueue);
    // ***************************************************
    // STEP 11: Read the output buffer back to the host
    // ***************************************************

    // Synchronous/blocking read of results
    ciErr = clEnqueueReadBuffer(
                cmdQueue,
                bufferC,
                CL_TRUE,
                0,
                datasize,
                srcC,
                0,
                NULL,
                NULL);
     The OpenCL standard does not specify how the abstract execution model provided
     by OpenCL is mapped to the hardware. We can enqueue any number of threads
     (work-items) and provide a work-group size (the number of work-items in a work-group),
     with at least the following constraints:
                                         Occupancy
      Occupancy is a ratio of active warps per compute unit to the maximum number of
      allowed warps. We should always keep the occupancy high because this is a way to
      hide latency when executing instructions. A compute unit should have a warp ready to
      execute in every cycle as this is the only way to keep hardware busy.
     hardware also limits the number of work-groups in a single launch (usually this is
     65535 in each NDRange dimension).
        In our previous example, we have vectors with 512 × 512 elements (iNumElements)
     and we have launched the same number of work-items (szGlobalWorkSize). As each
     work-group contains 512 work-items (szLocalWorkSize), we have 512 work-groups in a
     single launch. Is it possible to add larger vectors, and where is the limit? If the vectors
     were so long that one work-item per element exceeded the limits above, we would
     fail to launch a kernel with such a large number of work-items (work-groups). So
     how would we use a GPU to add two arbitrarily long vectors? First, we should limit
     the number of work-items and the number of work-groups. Secondly, one work-item
     should perform more than one addition. Let us first look at the new kernel function
     in Listing 5.17.
    // OpenCL Kernel Function for element by element
    //      vector addition of arbitrary long vectors
    __kernel void VectorAddArbitrary(
                        __global float* a,
                        __global float* b,
                        __global float* c,
                        int iNumElements
                        ) {

        // find my global index
        int iGID = get_global_id(0);

        while (iGID < iNumElements) {
            // add adjacent elements
            c[iGID] = a[iGID] + b[iGID];
            iGID += get_global_size(0);
        }
    }
     We used a while loop to iterate through the data (this kernel is very similar to the
     function from Listing 5.2). Rather than incrementing iGID by 1, a many-core GPU
     device can increment iGID by the number of work-items that we are using. We
     want each work-item to start on a different data index, so we use the work-item's
     global index:
     int iGID = get_global_id(0);
     After each thread finishes its work at the current index, we increment iGID by the
     total number of work-items in NDRange. This number is obtained from the function
     get_global_size(0):
     iGID += get_global_size(0);
     The only remaining piece is to fix the execution model in the host code. To ensure
     that we never launch too many work-groups and work-items, we will fix the number
     of work-groups to a small number, but still large enough to have a good occupancy.
     We will launch 512 work-groups with 256 work-items per work-group (thus the total
     number of work-items will be 131072). The only change in the host code is
     szLocalWorkSize = 256;
     szGlobalWorkSize = 512*256;
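     The stride pattern of the kernel can be checked on the host without a GPU. The following
     is a minimal sketch in plain C, not part of the original host code: the function names
     vector_add_work_item and vector_add_emulated are hypothetical, gid stands in for
     get_global_id(0) and gsize for get_global_size(0), and the work-items are simply run
     one after another instead of in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one work-item of VectorAddArbitrary on the host.
   gid plays the role of get_global_id(0), gsize of get_global_size(0). */
static void vector_add_work_item(const float *a, const float *b, float *c,
                                 size_t n, size_t gid, size_t gsize)
{
    /* Start at the work-item's own index and advance by the NDRange size. */
    for (size_t i = gid; i < n; i += gsize) {
        c[i] = a[i] + b[i];
    }
}

/* Emulates the whole NDRange by running the work-items sequentially.
   Every index 0..n-1 is visited by exactly one work-item, for any n. */
void vector_add_emulated(const float *a, const float *b, float *c,
                         size_t n, size_t global_size)
{
    assert(global_size > 0);
    for (size_t gid = 0; gid < global_size; gid++) {
        vector_add_work_item(a, b, c, n, gid, global_size);
    }
}
```

     Calling vector_add_emulated with global_size = 512 * 256 mirrors the launch configuration
     above. When n is smaller than global_size, the surplus work-items perform no additions.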
     We will now take a look at vector dot products. We will start with the simple version
     first to illustrate basic features of memory and work-item management in OpenCL
     programs. We will again recap the usage of NDRange and work-item ID. We will
     then analyze the performance of the simple version and extend it to a version that
     employs local memory.
         The computation of a vector dot product consists of two steps. First, we multiply
     corresponding elements of the two input vectors. This is very similar to vector
     addition but utilizes multiplication instead of addition. In the second step, we sum
     all the products instead of just storing them to an output vector. Each work-item
     multiplies a pair of corresponding vector elements and then moves on to its next pair.
     Because the result is the sum of all these pairwise products, each work-item keeps a
     running sum of its products. Just like in the addition example, the work-items
     increment their indices by the total number of work-items. The kernel function for the
     first step is shown in Listing 5.18.
    // OpenCL Kernel Function for Naive Dot Product
    __kernel void DotProductNaive(
                        __global float* a,
                        __global float* b,
                        __global float* c,
                        int iNumElements
                        ) {

        // find my global index
        int iGID = get_global_id(0);
        int index = iGID;

        // running sum of the products computed by this work-item
        float temp = 0.0f;

        while (index < iNumElements) {
            // multiply corresponding elements and accumulate
            temp += a[index] * b[index];
            index += get_global_size(0);
        }

        // store the partial sum of this work-item
        c[iGID] = temp;
    }
     Each element of the array c holds the sum of products obtained from one work-item,
     i.e., c[iGID] holds the sum of products obtained by the work-item with the
     global index iGID. After all work-items finish their work, we should sum all the
     elements of the vector c to produce a single scalar product. But how do we know
     when all work-items have finished their work? We need a mechanism to synchronize
     work-items. The only way to synchronize all work-items in NDRange is to wait
     for the kernel function to finish. After the kernel function finishes, we can read the
     results (vector c) from a GPU device and sum its elements on the host. The host code is
     very similar to the host code from Listing 5.4. We have to make the following two
     changes:
1. the vector c has a different size than vectors a and b. It has the same number of
   elements as the number of all work-items in NDRange (szGlobalWorkSize):
       // ***************************************************
       // STEP 6: Create device buffers
       // ***************************************************

       ...

       cl_mem bufferC; // Output array on the device

       // Size of data for bufferC:
       size_t datasize_c = sizeof(cl_float) * szGlobalWorkSize;

       ...

       // Use clCreateBuffer() to create a buffer object (d_C)
       // with enough space to hold the output data
       bufferC = clCreateBuffer(
                     context,
                     CL_MEM_READ_WRITE,
                     datasize_c,
                     NULL,
                     &status);
Listing 5.19 Create a buffer object C for naive implementation of vector dot product
2. after the kernel executes on a GPU device, we read vector c from device and
   serially sum all its elements to produce the final dot product:
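The two-stage scheme can be sketched in plain C. This is a host-only emulation under stated assumptions, not the original host code: the function names dot_partial_sums and sum_partial_results are hypothetical, the first function stands in for the kernel (one loop iteration per work-item), and the second is the serial summation the host would perform after reading c back from the device.

```c
#include <assert.h>
#include <stddef.h>

/* Stage 1 (device side, emulated): work-item gid accumulates the products of
   elements gid, gid + gsize, gid + 2*gsize, ... and stores the sum in c[gid]. */
void dot_partial_sums(const float *a, const float *b, float *c,
                      size_t n, size_t gsize)
{
    assert(gsize > 0);
    for (size_t gid = 0; gid < gsize; gid++) {
        float sum = 0.0f;
        for (size_t i = gid; i < n; i += gsize) {
            sum += a[i] * b[i];
        }
        c[gid] = sum;
    }
}

/* Stage 2 (host side): serially sum the partial results read back from the
   device. count corresponds to szGlobalWorkSize in the naive version. */
float sum_partial_results(const float *c, size_t count)
{
    float total = 0.0f;
    for (size_t i = 0; i < count; i++) {
        total += c[i];
    }
    return total;
}
```

Note that the host has to read and add szGlobalWorkSize values, which is why the data transfer back to the host matters for the performance of this version.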
      The function returns profiling information for the command associated with event if
      profiling is enabled. The first argument is the event being queried, and the second
      argument, param_name, is an enumeration value describing the query. The most often
      used param_name values are:
      Event objects are used to capture profiling information that measures the execution time
      of a command. Profiling of OpenCL commands can be enabled using a command-queue
      created with the CL_QUEUE_PROFILING_ENABLE flag set in the properties argument to
      clCreateCommandQueue. OpenCL devices are required to correctly track time
      across changes in device frequency and power states. Refer to the OpenCL™ 2.2
      Specification for a more detailed description.
   The code snippet from Listing 5.20 shows how to measure kernel execution time
using OpenCL profiling events.
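The profiling counters CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END are reported as 64-bit device timestamps in nanoseconds, so the execution time is their difference. The following is a small sketch of that arithmetic only; the helper name elapsed_ms is hypothetical, and the OpenCL calls that would supply the two timestamps are shown in a comment rather than executed.

```c
#include <assert.h>
#include <stdint.h>

/* In an OpenCL host program the two timestamps would typically be obtained as
 *
 *   cl_ulong start, end;
 *   clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
 *                           sizeof(cl_ulong), &start, NULL);
 *   clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
 *                           sizeof(cl_ulong), &end, NULL);
 *
 * after the command associated with event has completed.
 */

/* Converts a pair of device timestamps (nanoseconds) to elapsed milliseconds. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) / 1.0e6;
}
```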
This way, we can profile operations on both memory objects and kernels. Results for
dot product of two vectors of size 16777216 (512 × 512 × 64) on an Apple laptop
with an Intel GPU are as follows:
Host-device data transfer has much lower bandwidth than global memory access. So
we should perform as much computation on a GPU device as possible and read as
little data from a GPU device as possible. In this case, the threads should
cooperate to calculate the final sum. Work-items can safely cooperate through local
memory by means of synchronization. Local memory can be shared by all work-items
in a work-group. Local memory on a GPU is implemented on a compute device.
To allocate local memory, the __local address space qualifier is used in the variable
declaration. We will use a buffer in local memory named ProductsWG to store each
work-item's running sum. This buffer will store szLocalWorkSize products, so
each work-item in the work-group will have a place to store its temporary result.
Since the compiler will create a copy of the local variables for each work-group, we
need to allocate only enough memory such that each thread in the work-group has
an entry. It is relatively simple to declare local memory buffers, as we just pass local
arrays as arguments to the kernel:
__kernel void DotProductShared(
                 __global float* a,
                 __global float* b,
                 __global float* c,
                 __local float* ProductsWG,
                 int iNumElements)
                                    OpenCL: Barrier
      barrier(mem_fence_flag)
We then set the kernel argument with a value of NULL and a size equal to the size
we want to allocate for the argument (in bytes). Therefore, it should be as follows:
ciErr |= clSetKernelArg(ckKernel,
                        3,
                        sizeof(float) * szLocalWorkSize,
                        NULL);
Now, each work-item computes a running sum of the product of corresponding entries
in a and b. After reaching the end of the array, each thread stores its temporary sum
into the local memory (buffer ProductsWG):
// work-item global index
int iGID = get_global_id(0);
// work-item local index
int iLID = get_local_id(0);
float temp = 0.0;
while (iGID < iNumElements) {
   // multiply adjacent elements
   temp += a[iGID] * b[iGID];
   iGID += get_global_size(0);
}
//store the product in local memory
ProductsWG[iLID] = temp;
At this point, we need to sum all the temporary values we have placed in
ProductsWG. To do this, we will need some of the threads to read the values
that have been stored there. This is a potentially dangerous operation. We should
place a synchronization barrier to guarantee that all of these writes to the local buffer
ProductsWG complete before anyone tries to read from this buffer. The OpenCL
C language provides functions to allow synchronization of work-items. However,
as we mentioned, the synchronization can only occur between work-items in the
same work-group. To achieve that, OpenCL implements a barrier memory fence for
synchronization with the barrier() function. The function barrier() creates
a barrier that blocks the current work-item until all other work-items in the same
group have executed the barrier before allowing the work-item to proceed beyond the
barrier.
    Now that we have guaranteed that our local memory has been filled, we can sum
the values in it. We call the general process of taking an input array and performing
some computations that produce a smaller array of results a reduction. The naive
way to accomplish this reduction would be to have one thread iterate over the local
memory and calculate a running sum. This would take time proportional to the length
of the array. However, since we have hundreds of threads available to do our work,
we can do this reduction in parallel and take time that is proportional to the logarithm
of the length of the array. Figure 5.12 shows a summation reduction. The idea is that
each work-item adds two of the values in ProductsWG and stores the result back to
ProductsWG. Since each thread combines two entries into one, we complete the
first step with half as many entries as we started with. In the next step, we do the
same thing for the remaining half. We continue until we have the sum of every entry
in the first element of ProductsWG. The code for the summation reduction is
                                         Reduction
     In computer science, the reduction is a special type of operation that is commonly used
     in parallel programming to reduce the elements of an array into a single result.
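The halving scheme described above can be traced on the host with a short sequential sketch in plain C. It is an emulation under stated assumptions rather than kernel code: the function name reduce_sum is hypothetical, the inner loop stands for the work-items that would run in parallel, and the end of each outer iteration corresponds to the point where a barrier would be needed. The sketch assumes that the work-group size is a power of two, which the halving of the stride relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential emulation of the summation reduction over the local buffer of
   one work-group. After the call, products[0] holds the sum of all entries.
   wg_size must be a power of two. */
float reduce_sum(float *products, size_t wg_size)
{
    assert(wg_size > 0 && (wg_size & (wg_size - 1)) == 0);

    /* The stride i is halved in every step: wg_size/2, wg_size/4, ..., 1. */
    for (size_t i = wg_size / 2; i != 0; i /= 2) {
        /* Only the first i work-items are active in this step. Each one reads
           an entry from the upper half and adds it to its own entry, so no
           entry is read and written by different work-items in one step. */
        for (size_t lid = 0; lid < i; lid++) {
            products[lid] += products[lid + i];
        }
        /* On the GPU, a barrier would separate this step from the next. */
    }
    return products[0];
}
```

For a work-group of 256 work-items, the outer loop runs 8 times, compared with 255 additions performed one after another in the naive single-thread summation.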
After we have completed one step, we have the same restriction we did after com-
puting all the pairwise products. Before we can read the values we just stored
in ProductsWG, we need to ensure that every thread that needs to write to
ProductsWG has already done so. The barrier(CLK_LOCAL_MEM_FENCE)
after the assignment ensures this condition is met. It is important to note that when
using barrier, all work-items in the work-group must execute the barrier function. If
the barrier function is called within a conditional statement, it is important to ensure
that all work-items in the work-group enter the conditional statement to execute the
barrier. For example, the following code is an illegal use of barrier because the barrier
will not be encountered by all work-items:
if (iLID < i) {
   ProductsWG[iLID] += ProductsWG[iLID+i];
   barrier(CLK_LOCAL_MEM_FENCE);
}
Any work-item with the local index iLID greater than or equal to i will never exe-
cute the barrier(CLK_LOCAL_MEM_FENCE). Because of the guarantee that no
instruction after a barrier(CLK_LOCAL_MEM_FENCE) can be executed before
every work-item of the work-group has executed it, the hardware simply continues
to wait for these work-items. This effectively hangs the processor because it results
in the GPU waiting for something that will never happen. Such a kernel will actually
cause the GPU to stop responding, forcing you to kill your program.
    After termination of the summation reduction, each work-group has a single num-
ber remaining. This number is sitting in the first entry of the ProductsWG buffer and
is the sum of every pairwise product the work-items in that work-group computed.
We now store this single value to global memory and end our kernel:
if (iLID == 0) {
   c[iWGID] = ProductsWG[0];
}
     As there is only one element from ProductsWG that is transferred to global memory,
     only a single thread needs to perform this operation. Since each work-group writes
     exactly one value to the global array c, we can simply index it by iWGID, which is
     the work-group index.
        We are left with an array c, each entry of which contains the sum produced by
     one of the parallel work-groups. The last step of the dot product is to sum the entries
     of c. Because array c is relatively small, we return control to the host and let the
     CPU finish the final step of the addition, summing the array c.
        Listing 5.21 shows the entire kernel function for dot product using shared memory
     and summation reduction.
    // ***************************************************
    // OpenCL Kernel Function for dot product
    // using shared memory and summation reduction
    __kernel void DotProductShared(__global float* a,
                                   __global float* b,
                                   __global float* c,
                                   __local float* ProductsWG,
                                   int iNumElements)
    {

        // work-item global index
        int iGID = get_global_id(0);
        // work-item local index
        int iLID = get_local_id(0);
        // work-group index
        int iWGID = get_group_id(0);
        // how many work-items are in WG?
        int iWGS = get_local_size(0);

        float temp = 0.0;
        while (iGID < iNumElements) {
            // multiply adjacent elements
            temp += a[iGID] * b[iGID];
            iGID += get_global_size(0);
        }
        // store the product
        ProductsWG[iLID] = temp;

        // wait for all threads in WG:
        barrier(CLK_LOCAL_MEM_FENCE);

        // Summation reduction:
        int i = iWGS / 2;
        while (i != 0) {
            if (iLID < i) {
                ProductsWG[iLID] += ProductsWG[iLID + i];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            i = i / 2;
        }

        // store partial dot product into global memory:
        if (iLID == 0) {
            c[iWGID] = ProductsWG[0];
        }
    }
     Listing 5.21 Vector Dot Product Kernel - implementation using local memory and summation
     reduction
        In the host code for this example, we should create the bufferC memory object
     that will hold szGlobalWorkSize/szLocalWorkSize partial dot products.
     Listing 5.22 shows how to create bufferC.
    size_t datasize_c = sizeof(cl_float) * (szGlobalWorkSize / szLocalWorkSize);

    // Use clCreateBuffer() to create a buffer object (d_C)
    // with enough space to hold the output data
    bufferC = clCreateBuffer(
                  context,
                  CL_MEM_READ_WRITE,
                  datasize_c,
                  NULL,
                  &status);
Listing 5.22 Create bufferC for dot product using local memory
    // ***************************************************
    // STEP 9: Set the kernel arguments
    // ***************************************************
    // Set the Argument values
    ciErr = clSetKernelArg(ckKernel,
                           0,
                           sizeof(cl_mem),
                           (void*)&bufferA);
    ciErr |= clSetKernelArg(ckKernel,
                            1,
                            sizeof(cl_mem),
                            (void*)&bufferB);
    ciErr |= clSetKernelArg(ckKernel,
                            2,
                            sizeof(cl_mem),
                            (void*)&bufferC);
    ciErr |= clSetKernelArg(ckKernel,
                            3,
                            sizeof(float) * szLocalWorkSize,
                            NULL);
    ciErr |= clSetKernelArg(ckKernel,
                            4,
                            sizeof(cl_int),
                            (void*)&iNumElements);
Listing 5.23 Set kernel arguments for dot product using shared memory
     The argument with index 3 is used to create a local memory buffer of size
     sizeof(float) * szLocalWorkSize for each work-group. As this argument is
     declared in the kernel function with the __local qualifier, the last argument to
     clSetKernelArg must be NULL.
       Results for dot product of two vectors of size 67108864 (512 × 512 × 256) on an
     Apple laptop with an Intel GPU are
     This section describes a matrix multiplication application using OpenCL for GPUs in
     a step-by-step approach. We will start with the most basic (naive) version, where the
     focus will be on the code structure for the host application and the OpenCL GPU kernels.
     The naive implementation is rather straightforward, but it gives us a nice starting
     point for further optimization. For simplicity of presentation, we will consider only
     square matrices whose dimensions are integral multiples of 32 on a side. Matrix
     multiplication is a key building block for dense linear algebra, and the same pattern
     of computation is used in many other algorithms. We will start with the simple version
     first to illustrate basic features of memory and work-item management in OpenCL
     programs. After that, we will extend it to a version that employs local memory.
         Before starting, it is helpful to briefly recap how a matrix–matrix multiplication
     is computed. The element c_{i,j} of C is the dot product of the ith row of A and the
     jth column of B. The matrix multiplication of two square matrices is illustrated in
     Fig. 5.13. For example, the element c_{5,2} is the dot product of row 5 of A and
     column 2 of B.
         To implement matrix multiplication of two square matrices of dimension N × N,
     we will launch N × N work-items. Indexing of work-items in NDRange will
     correspond to 2D indexing of the matrices. Work-item (i, j) will calculate the element
     c_{i,j} using row i of A and column j of B. So, each work-item loads one row of matrix
     A and one column of matrix B from global memory, computes the dot product, and stores
     the result back to matrix C in global memory. The matrix A is, therefore, read N
     times from global memory, and B is read N times from global memory. The simple
     version of matrix multiplication can be implemented in the plain C language using
     three nested loops as in Listing 5.24. We assume data to be stored in row-major order
     (C-style).
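     As an illustration of the three nested loops, here is a minimal sketch in plain C. It is
     written under stated assumptions and is not necessarily identical to the listing referred
     to above: the function name matrixmul_host is hypothetical, the matrices are square with
     n rows, and they are stored in row-major order, so element (i, j) lives at index i * n + j.

```c
#include <assert.h>
#include <stddef.h>

/* Computes C = A * B for square n x n matrices stored in row-major order. */
void matrixmul_host(const float *A, const float *B, float *C, size_t n)
{
    /* The output must not alias the inputs, since C is written while
       A and B are still being read. */
    assert(C != A && C != B);

    for (size_t i = 0; i < n; i++) {          /* row of A and C    */
        for (size_t j = 0; j < n; j++) {      /* column of B and C */
            float dotprod = 0.0f;
            for (size_t k = 0; k < n; k++) {  /* walk row i of A and column j of B */
                dotprod += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = dotprod;
        }
    }
}
```

     The two outer loops enumerate the elements of C, which is exactly the index space that
     the N × N work-items cover in the OpenCL version. Only the innermost loop remains
     inside each work-item.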
            dotprod += matrixA[yGID * N + i] * matrixB[i * N + xGID];
        }
        matrixC[yGID * N + xGID] = dotprod;
    }
     Each work-item first discovers its global ID in the 2D NDRange. The index of the
     column xGID is obtained from the first dimension of NDRange using
     get_global_id(0). Similarly, the index of the row yGID is obtained from the second
     dimension of NDRange using get_global_id(1). After obtaining its global ID,
     each work-item computes the dot product of the yGID-th row of A and the xGID-th
     column of B. The dot product is stored to the element in the yGID-th row and the
     xGID-th column of C.
     The Host Code
     As we learned previously, the host code should probe for devices, create a context,
     create buffers, and compile the OpenCL program containing kernels. These steps are the
     same as in the vector addition example from Sect. 5.3.1. Assuming you have already
     initialized OpenCL, created the context and the queue, and created the appropriate
     buffers and memory copies, Listing 5.26 shows how to create the kernel for naive
     matrix multiplication.
    // ***************************************************
    // STEP 8: Create and compile the kernel
    // ***************************************************
    ckKernel = clCreateKernel(
                   cpProgram,
                   "matrixmulNaive",
                   &ciErr);
    if (!ckKernel || ciErr != CL_SUCCESS)
    {
        printf("Error: Failed to create compute kernel!\n");
        exit(1);
    }
Listing 5.26 Create and compile the kernel for naive matrix multiplication
Prior to launching the kernel, we should set the kernel arguments as in Listing 5.27.
    // ***************************************************
    // STEP 9: Set the kernel arguments
    // ***************************************************
    ciErr = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void*)&bufferA);
    ciErr |= clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void*)&bufferB);
    ciErr |= clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void*)&bufferC);
    ciErr |= clSetKernelArg(ckKernel, 3, sizeof(cl_int), (void*)&iRows);
     Listing 5.27 Set the kernel arguments for naive matrix multiplication
     Finally, we are ready to launch the kernel matrixmulNaive. Listing 5.28 shows
     how you launch the kernel.
    // *******************************************************
    // Start Core sequence... copy input data to GPU, compute,
    // copy results back

    // set and log Global and Local work size dimensions
    const cl_int iWI = 16;
    const size_t szLocalWorkSize[2] = { iWI, iWI };
    const size_t szGlobalWorkSize[2] = { iRows, iRows };
    cl_event kernelevent;

    // ***************************************************
    // STEP 10: Enqueue the kernel for execution
    // ***************************************************
    ciErr = clEnqueueNDRangeKernel(
        cmdQueue,
        ckKernel,
        2,
        NULL,
        szGlobalWorkSize,
        szLocalWorkSize,
        0,
        NULL,
        &kernelevent);
     As can be seen from the code in Listing 5.28, NDRange is of 2D size (iRows,
     iRows). That means that we launch (iRows, iRows) work-items. These work-
     items are further grouped into work-groups of dimension (16,16). If for exam-
     ple the size of matrices (iRows, iRows) is (4096, 4096), we launch 256 × 256
     work-groups. As the number of work-groups is probably larger than the number of
     compute-units present in a GPU, we keep all compute-units busy. Recall that we
     should always keep the occupancy high, because this is a way to hide latency when
     executing instructions.
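The work-group count follows directly from the global and local sizes passed to clEnqueueNDRangeKernel. The sketch below, in plain C, reproduces that arithmetic. The helper names are hypothetical and not part of OpenCL. The rounding helper is an assumption that goes beyond the listing: it shows one common way to handle a matrix dimension that is not a multiple of the work-group width.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups along one NDRange dimension.
   Assumes global_size is a multiple of local_size, as in the text. */
size_t work_groups_per_dim(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    return global_size / local_size;
}

/* Hypothetical helper: smallest multiple of local_size that is >= n.
   Only needed when n is not divisible by local_size; the kernel would then
   have to skip work-items whose global index is >= n. */
size_t round_up_to_multiple(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}
```

For iRows = 4096 and a local size of 16, work_groups_per_dim returns 256 per dimension, so 256 × 256 work-groups in total.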
        Execution time for naive matrix multiplication of two square matrices of size
     3584 × 3584 on an Apple laptop with an Intel GPU is
     Looking at the loop in the kernel code from Listing 5.25, we can notice that each work-
     item loads 2 × N elements from global memory—two for each iteration through the
     loop, one from the matrix A and one from the matrix B. Since accesses to global
     memory are relatively slow, this can slow down the kernel, leaving the work-items
     idle for hundreds of clock cycles, for each access. Also, we can notice that for each
     element of C in a row, we use the same row of A and that each work-item in a
     work-group uses the same columns of B.
    But not only are we accessing the GPU’s off-chip memory way too much, we do
not even care about memory coalescing! Assuming row-major order when storing
matrices in global memory, the elements from the matrix A are accessed with unit
stride, while elements from the matrix B are accessed with stride N.
    Recall from Sect. 5.1.4 that to ensure memory coalescing, we want work-items
from the same warp to access contiguous elements in memory so as to minimize the
number of required memory transactions. As work-items of the same warp access
32 contiguous floating-point elements from the same row of A, all these elements
fall into the same 128-byte segment and the data is delivered in a single transaction.
On the other hand, work-items of the same warp access 32 floating-point elements
from B that are 4N bytes apart, so for each element from the matrix B a new memory
transaction is needed. Although the GPU’s caches probably will help us out a bit,
we can get much more performance by manually caching sub-blocks of the matrices
(tiles) in the GPU’s on-chip local memory.
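The transaction counts quoted above can be checked with a small counting exercise. The following C sketch is a simplified model, not a description of any particular GPU: it assumes 4-byte floats, a warp of 32 work-items, 128-byte segments, and a segment-aligned first address. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE     32   /* work-items per warp, as in the text */
#define SEGMENT_BYTES 128  /* size of one memory segment, as in the text */

/* Count the distinct 128-byte segments touched when each of the 32
   work-items of a warp reads one float and consecutive work-items are
   stride_elems elements apart. */
int segments_touched(size_t stride_elems)
{
    assert(sizeof(float) == 4);  /* the 128-byte figure assumes 4-byte floats */
    size_t seen[WARP_SIZE];
    int count = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        size_t byte_addr = (size_t)lane * stride_elems * sizeof(float);
        size_t segment = byte_addr / SEGMENT_BYTES;
        int known = 0;
        for (int k = 0; k < count; k++)
            if (seen[k] == segment)
                known = 1;
        if (!known)
            seen[count++] = segment;
    }
    return count;
}
```

With unit stride the 32 reads fall into one segment. With a stride of N = 4096 elements every read lands in a different segment, so 32 transactions are needed.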
    In other words, one way to reduce the number of accesses to global memory is
to have the work-items load portions of matrices A and B into local memory, where
we can access them much more quickly. So we will use local memory to avoid non-
coalesced global memory access. Ideally, we would load both matrices entirely into
local memory, but unfortunately, local memory is a rather limited resource and cannot
hold two large matrices. Recall that older devices have 16 kB of local memory per
compute unit, and more recent devices have 48 kB of local memory per compute unit.
So we will content ourselves with loading portions of A and B into local memory as
needed, and making as much use of them as possible while they are there.
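To get a feeling for these limits, the C sketch below computes how much local memory a pair of square float tiles occupies and how wide a tile could be for a given budget. It assumes 4-byte floats and exactly two tiles per work-group, one for A and one for B. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory needed for one tile of A plus one tile of B. */
size_t tile_pair_bytes(size_t tile_width)
{
    assert(sizeof(float) == 4);
    return 2 * tile_width * tile_width * sizeof(float);
}

/* Largest tile width whose two tiles still fit into local_mem_bytes. */
size_t max_tile_width(size_t local_mem_bytes)
{
    size_t tw = 0;
    while (tile_pair_bytes(tw + 1) <= local_mem_bytes)
        tw++;
    return tw;
}
```

Two 16 × 16 tiles need only 2 kB, whereas two full 3584 × 3584 matrices would need roughly 100 MB. In practice the tile width is also bounded by the maximum work-group size of the device, since each tile element is loaded by one work-item.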
    Assume that we multiply two matrices as shown in Fig. 5.14. To calculate the
elements of the square submatrix C (tile C), we should multiply the corresponding
rows and columns of matrices A and B. Also, we can subdivide the matrices A and
B into submatrices (tiles) such as shown in Fig. 5.14. Now, we can multiply the
corresponding row and the column from the A and B tiles and sum up these partial
products. The process is shown in the lower part of Fig. 5.14. We can also observe
that the individual rows and columns in tiles A and B are accessed several times.
For example, in the 3 × 3 tiles from Fig. 5.14, all elements on the same row of the
tile C are computed using the same data of the A tiles and all elements on the same
column of the submatrix C are computed using the same data of the B tile. As the
tiles are in local memory, these accesses are fast.
    The idea of using tiles in matrix multiplication is as follows. The number of work-
items that we start is equal to the number of elements in the matrix. Each work-item
will be responsible for computing one element of the product matrix C. The index
of the element in the matrix is equal to the global index of a work-item in NDRange.
At the same time, we create as many work-groups as there are tiles. The number of
elements in a tile will be equal to the number of work-items in a work-group. This
means that the element index within a tile will be the same as the local index in
the work-group.
    For reference, consider the matrix multiplication in Fig. 5.15. All matrices are of
size 8 × 8, so we will have 64 work-items in NDRange. We divide matrices A, B and
C in non-overlapping sub-blocks (tiles) of size TW × TW, where TW = 4 as in
Fig. 5.15. Let us also suppose that tiles are indexed starting in the upper left corner.
Now, consider the element c5,2 in the matrix C, in Fig. 5.15. The element c5,2 falls
into the tile (0,1). The work-item responsible for computing the element c5,2 has the
global row index 5 and the global column index 2. Also, the same work-item has the
local row index 1 and local column index 2. This work-item computes the element
c5,2 in C by multiplying together row 5 in A, and column 2 in B, but it will do it
in pieces using tiles. As we already said, all work-items responsible for computing
elements of the same tile in the matrix C should be in the same work-group. Let us
explain this process for the work-item that computes the element c5,2. The work-item
should access the tiles (0,1) and (1,1) from A and tiles (0,0) and (1,0) from B. The
computation is performed in two steps. First, the work-item computes the dot product
between the row 1 from the tile (0,1) in A and the column 2 from the tile (0,0) in B.
In the second step, the same work-item computes the dot product between the row
1 from the tile (1,1) in A and the column 2 from the tile (1,0) in B. Finally, it adds
this dot product to the one computed in the first step.
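The mapping from a global index to a tile and a local index is plain integer division and remainder. The following C sketch, with a hypothetical struct and function name, illustrates it for the element c5,2 and TW = 4. In a kernel the same values are available from get_group_id and get_local_id.

```c
#include <assert.h>

/* Position of one element of C in the tiled layout. */
typedef struct {
    int tile_row, tile_col;   /* which tile the element falls into */
    int local_row, local_col; /* position of the element inside that tile */
} tile_pos;

tile_pos locate(int global_row, int global_col, int tile_width)
{
    assert(tile_width > 0);
    tile_pos p;
    p.tile_row  = global_row / tile_width;
    p.tile_col  = global_col / tile_width;
    p.local_row = global_row % tile_width;
    p.local_col = global_col % tile_width;
    return p;
}
```

For global row 5 and global column 2, locate returns tile row 1, tile column 0, local row 1 and local column 2. The tile of C is written as (0,1) in the text, which suggests that the column index is listed first there.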
         If we want to compute the first dot product as fast as possible, the elements from
     the row 1 from the tile (0,1) in A and the elements from the column 2 from the tile
     (0,0) in B should be in the local memory. The same is true for all rows in the tile
     (0,1) in A and all columns in the tile (0,0) in B because all work-items from the
     same work-group will access these elements concurrently. Also, once the first step
     is finished, the same will be true for the second step but this time for the tile (1,1) in
     A and the tile (1,0) in B.
         So, before every step, all work-items from the same work-group should perform
     a collaborative load of tiles A and B into local memory. This is performed in such
     a way that the work-item in the ith local row and the jth local column performs
     two loads from global memory per tile: the element with local index (i, j) from
     the corresponding tile in matrix A and the element with local index (i, j) from the
     corresponding tile in matrix B. Figure 5.15 illustrates this process. For example, the
     work item that computes the element c5,2 reads:
         Where is the benefit of using tiles? If we load the left-most (0,1) tile of matrix
     A into local memory, and the top-most (0,0) tile of matrix B into local
     memory, then we can compute the first TW × TW products and add them together
     just by reading from local memory. But here is the benefit: as long as we have those
     tiles in local memory, every work-item from the work-group computing a tile of
     C can compute that portion of its sum from the same data in local memory. When
     each work-item has computed this sum, we can load the next TW × TW tiles from
     A and B, and continue adding the term-by-term products to our value in C. And after
     all of the tiles have been processed, we will have computed our entries in C.
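The saving in global memory traffic can be quantified per work-item. The C sketch below counts loads under the assumptions already used in the text: square N × N matrices, one element of A and one of B loaded per loop iteration in the naive kernel, and one element of each tile loaded per tile step in the tiled scheme. The function names are hypothetical.

```c
#include <assert.h>

/* Global-memory loads issued by one work-item in the naive kernel:
   one element of A and one of B in each of the n loop iterations. */
long naive_loads_per_item(long n)
{
    return 2 * n;
}

/* Global-memory loads issued by one work-item in the tiled scheme:
   one element of A and one of B per tile step, n / tile_width steps. */
long tiled_loads_per_item(long n, long tile_width)
{
    assert(tile_width > 0 && n % tile_width == 0);
    return 2 * (n / tile_width);
}
```

For N = 3584 and 16 × 16 tiles this gives 7168 loads per work-item in the naive case and 448 in the tiled case, a reduction by a factor equal to the tile width.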
     The Tiled Multiplication Kernel
     The kernel code for the tiled matrix multiplication is shown in Listing 5.29.
    #define TILE_WIDTH 16

    // OpenCL Kernel Function for tiled matrix multiplication
    __kernel void matrixmulTiled(
        __global float *matrixA,
        __global float *matrixB,
        __global float *matrixC,
        int N) {

        // Local memory to fit the tiles
        __local float matrixAsub[TILE_WIDTH][TILE_WIDTH];
        __local float matrixBsub[TILE_WIDTH][TILE_WIDTH];

        // global thread index
        int xGID = get_global_id(0); // column in NDRange
        int yGID = get_global_id(1); // row in NDRange

        // local thread index
        int xLID = get_local_id(0); // column in tile
        int yLID = get_local_id(1); // row in tile

        float dotprod = 0.0f;

        for (int tile = 0; tile < N / TILE_WIDTH; tile++) {
            // Collaborative loading of tiles into local memory:
            // Load a tile of matrix A into local memory
            matrixAsub[yLID][xLID] =
                matrixA[yGID * N + (xLID + tile * TILE_WIDTH)];
            // Load a tile of matrix B into local memory
            matrixBsub[yLID][xLID] =
                matrixB[(yLID + tile * TILE_WIDTH) * N + xGID];

            // Synchronise to make sure the tiles are loaded
            barrier(CLK_LOCAL_MEM_FENCE);

            for (int i = 0; i < TILE_WIDTH; i++) {
                dotprod +=
                    matrixAsub[yLID][i] * matrixBsub[i][xLID];
            }

            // Wait for other work-items to finish
            // before loading next tile
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        matrixC[yGID * N + xGID] = dotprod;
    }
         Tiles are stored in matrixAsub and matrixBsub. Each work-item finds its
     global index and its local index. The outer loop goes through all the tiles necessary
     to calculate the products in C. Each work-item in the work-group in one iteration of
     the outer loop first reads its elements from the global memory and writes them to the
     tile element with its local index. After loading its elements, each work-item waits
     at the barrier until the tiles are loaded. Then, in the innermost loop, each work-item
     calculates the dot product between the row yLID from the tile matrixAsub and the
     column xLID from the tile matrixBsub. After that the work-item waits again at
     the barrier for the other work-items to finish their dot products. Then all work-items
     load next tiles and repeat the process.
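The load, synchronise, compute, synchronise rhythm of the kernel can be followed on a CPU without any OpenCL runtime. The C sketch below is a serial emulation and not the kernel itself. Loops over (gy, gx) stand in for work-groups, loops over (ly, lx) stand in for work-items, and the end of each loop nest plays the role of a barrier. All function names and the tile width TW = 4 are assumptions of the sketch.

```c
#include <assert.h>

#define TW 4  /* tile width of this sketch; the kernel above uses 16 */

/* Serial emulation of the tiled kernel for n x n row-major matrices. */
void matmul_tiled_emulated(const float *a, const float *b, float *c, int n)
{
    assert(n % TW == 0);
    float a_sub[TW][TW], b_sub[TW][TW];  /* stand-ins for the __local tiles */
    float acc[TW][TW];                   /* one running dot product per work-item */

    for (int gy = 0; gy < n / TW; gy++)
    for (int gx = 0; gx < n / TW; gx++) {
        for (int ly = 0; ly < TW; ly++)
            for (int lx = 0; lx < TW; lx++)
                acc[ly][lx] = 0.0f;

        for (int tile = 0; tile < n / TW; tile++) {
            /* collaborative load: each work-item copies one element per tile */
            for (int ly = 0; ly < TW; ly++)
                for (int lx = 0; lx < TW; lx++) {
                    int row = gy * TW + ly, col = gx * TW + lx;
                    a_sub[ly][lx] = a[row * n + (tile * TW + lx)];
                    b_sub[ly][lx] = b[(tile * TW + ly) * n + col];
                }
            /* partial dot products, reading only from the two tiles */
            for (int ly = 0; ly < TW; ly++)
                for (int lx = 0; lx < TW; lx++)
                    for (int i = 0; i < TW; i++)
                        acc[ly][lx] += a_sub[ly][i] * b_sub[i][lx];
        }

        for (int ly = 0; ly < TW; ly++)
            for (int lx = 0; lx < TW; lx++)
                c[(gy * TW + ly) * n + (gx * TW + lx)] = acc[ly][lx];
    }
}

/* Reference triple loop, used only to check the emulation. */
void matmul_naive(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float s = 0.0f;
            for (int k = 0; k < n; k++)
                s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}

/* Returns 1 if both versions agree on an 8 x 8 example, 0 otherwise. */
int emulation_matches_reference(void)
{
    enum { N = 8 };
    float a[N * N], b[N * N], c_tiled[N * N], c_ref[N * N];
    for (int i = 0; i < N * N; i++) {
        a[i] = (float)(i % 7);  /* small integers keep float sums exact */
        b[i] = (float)(i % 5);
    }
    matmul_tiled_emulated(a, b, c_tiled, N);
    matmul_naive(a, b, c_ref, N);
    for (int i = 0; i < N * N; i++)
        if (c_tiled[i] != c_ref[i])
            return 0;
    return 1;
}
```

Calling emulation_matches_reference() compares the tiled emulation with the plain triple loop on the 8 × 8, TW = 4 configuration discussed earlier.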
     The Host Code
     To implement tiling, we will leave our host code from the previous naive kernel
     intact. The only thing we should change is to create and compile the appropriate
     kernel function:
    // ***************************************************
    // STEP 8: Create and compile the kernel
    // ***************************************************
    ckKernel = clCreateKernel(
        cpProgram,
        "matrixmulTiled",
        &ciErr);
    if (!ckKernel || ciErr != CL_SUCCESS)
    {
        printf("Error: Failed to create compute kernel!\n");
        exit(1);
    }
Listing 5.30 Create and compile the kernel for tiled matrix multiplication
Note that it already uses 2D work-groups of 16 by 16. This means that the tiles are
also 16 by 16.
   Execution time for tiled matrix multiplication of two square matrices of size
3584 × 3584 on an Apple laptop with an Intel GPU is
5.4 Exercises
1. To verify that you understand how to control the argument definitions for a kernel,
   modify the kernel in Listing 5.3 so it adds four vectors together. Modify the host
   code to define four vectors and associate them with relevant kernel arguments.
   Read back the final result and verify that it is correct.
2. Use local memory to minimize memory movement costs and optimize perfor-
   mance of the matrix multiplication kernel in Listing 5.25. Modify the kernel so
   that each work-item copies its own row of A into local memory. Report kernel
   execution time.
3. Modify the kernel from the previous exercise so that each work-group collabo-
   ratively copies its own column of B into local memory. Report kernel execution
   time.
4. Write an OpenCL program that computes the Mandelbrot set. Start with the
   program in Listing 3.21.
5. Write an OpenCL program that computes π . Start with the program in List-
   ing 3.15. Hint: the parallelization is similar to the parallelization of a dot product.
6. Write an OpenCL program that transposes a matrix. Use local memory and
   collaborative reads to minimize memory movement costs and optimize performance
   of the kernel.
7. Given an input array {a0, a1, ..., an−1} in pointer d_a, write an OpenCL program
   that stores the reversed array {an−1, an−2, ..., a0} in pointer d_b. Use multiple
   work-groups. Try to reverse the data in local memory. Hint: using work-groups and local
   memory, reverse the data in array slices. Then, reverse the order of the slices in global memory.
8. Write an OpenCL program to detect edges in black and white images using the
   Sobel filter.
The aim of Part III is to explain why a parallel program can be more or less efficient.
Basic approaches are described for the performance evaluation and analysis of
parallel programs. Instead of analyzing complex applications, we focus on two simple
cases, i.e. a parallel computation of the number π by numerical integration, and
a solution of a simplified partial differential equation on a 1-D domain by an
explicit solution methodology. Both cases, already mentioned in previous chapters,
are simple, yet they incorporate most of the possible pitfalls that could arise
during their parallelization. The first case, the computation of π, requires only a little
communication among parallel tasks, while in the explicit solution of the PDE, each
process communicates with its neighbors in every time step.
   Besides these two cases, we also evaluate the Seam Carving algorithm in terms
of performance on a CPU and a GPU platform. Seam Carving is an image processing
algorithm in a 2-D domain and as such appropriate for implementation on GPU
platforms. It comprises a few steps, some of which cannot be effectively parallelized.
   Parallel programs run on adequate platforms, i.e. multi-core computers, intercon-
nected computers or computing clusters, and GPU accelerators. After an implemen-
tation of any parallel program, several questions remain to be answered, e.g.:
We will answer the above questions by running the programs with different parameters,
e.g. the size of the computation domain and the number of processors. We will
also follow the execution efficiency and the limitations that are specific to each of the
three parallel methodologies: OpenMP, MPI, and OpenCL.
   An electronic extension of the Engineering part will be permanently available on
the book's website, hosted on a Springer server. Our aim is that it becomes a lively forum of
readers, students, teachers and other developers. We expect your input in the form of
your own cases, solutions, comments, and proposals. Soon after the publication of
this book more complex cases will be provided, i.e. a numerical solution of a 2-D
Chapter Summary
In computing the number π by simple numerical integration, the focus is on the parallel
implementation on three different parallel architectures and programming
environments: OpenMP on a multicore processor, MPI on a cluster, and OpenCL on a
GPU. In all three cases a spatial domain decomposition is used for parallelization, but
differences in communication between parallel tasks and in combining the results of
these tasks are shown. Measurements of the running time and speed-up are included
to assist self-studying readers.
that is in a direct relation with the value of π . The numerical integration is performed
by calculation and summation of all N sub-interval areas. A sequential version of
the algorithm in pseudocode, which results in an approximate value of π , is given
below:
[Plot: run-time [s] (left axis) and absolute error (right axis) versus N, the number of sub-intervals, on logarithmic scales]
Fig. 6.1 Run-time and absolute error on a single MPI process in computation of π as a function of
number of sub-intervals N
intervals, where the impact of MPI program setup time, cache memory, or interactions
with operating system could be present. In the same way, the approximation error
becomes smaller and smaller, until the largest number of intervals, where a small jump
is present, possibly because of the limited precision of floating-point arithmetic.
   The next step is to find out the most efficient way to parallelize the problem, i.e., to
engage a greater number of cooperating processors in order to speed up the program
execution. Even though the sequential Algorithm 1 is very simple, it exhibits most
of the problems that also arise in more complex examples. First, the program needs
to distribute tasks among cooperating processors in a balanced way. A relatively
small portion of data should be communicated to cooperating processes, because the
processes will generate their local data by a common equation for a unit circle. All
processes have to implement their local computation of partial sums, and finally, the
partial results should be assembled, usually by a global communication, in a host
process to be available for users.
   Regarding sequential Algorithm 1, we see that the calculation of each sub-interval
area is independent, and consequently, the algorithm has a potential to be parallelized.
In order to make the calculation parallel, we will use a domain decomposition approach
and a master–slave implementation. Because all values of yi can be calculated locally
and because the domain decomposition is known explicitly, there is no need for a
massive data transfer between the master process and slave processes. The master
process will just broadcast the number of intervals. Then, the local integration will
run in parallel on all processes. Finally, the master process reduces the partial sums
into the final approximation of π . The parallelized algorithm is shown below:
    We have learned from this simple example that, besides the calculation, there
are other tasks to be done: (i) domain decomposition, (ii) distribution of the resulting
tasks, and (iii) assembling of the final result. These tasks are inherently sequential and
therefore limit the final speedup. We further see that not all processes are identical. Some
of the processes are slaves because they just calculate their portion of data. The
master process has to distribute the number of intervals and to gather and sum up the
6.1 OpenMP
[Plot: wall clock time in seconds versus number of threads, 1 to 16]
Fig. 6.2 Computing π using the numerical integration of the unit circle using 10^9 intervals on a
quad-core processor with hyperthreading
[Plot: wall clock time and ideal speedup, in seconds, versus number of threads, 1 to 16]
Fig. 6.3 Computing π using random shooting into the square [0, 1] × [0, 1] using 10^8 shots
   Although the wall clock time shown in Fig. 6.2 and Fig. 6.3 decreases with the
number of threads, the total amount of CPU time increases. This can be expected,
since more threads require more administrative tasks from the OpenMP run-time.
[Two plots: CPU time in seconds versus number of threads, 1 to 16]
Fig. 6.4 The total CPU time needed to compute π using numerical integration (left) and random
shooting (right)
6.2 MPI
An MPI C code for the parallel computation of π , together with some explanation
and comments, is provided in Chap. 4, Listing 4.5. We would like to test the behavior
of run-time as a function of the number of MPI processes p. On the test notebook
computer, two cores are present. Taking into account that four logical processors
are available, we could expect some speedup of the execution with up to four MPI
processes. With more than four processes, the run-time could start increasing because
of MPI overhead. We will test our program with up to eight processes. Starting the
same program on different numbers of processes can be accomplished by consecutive
mpiexec commands with an appropriate value of the parameter -n or by a simple bash
file that prepares the execution parameters, which are passed to the main program
through its argc and argv arguments.
    The behavior of the approximation error should be the same as in the case of a single
process. In the computation of π , the following numbers of sub-intervals have
been used: N = [5e9, 5e8, 5e7, 5e6] (5e9 is scientific notation for 5 · 10^9). Note
that such large numbers of sub-intervals were used because we want to have a
computationally complex task, even though the computation of sub-interval areas is quite
simple. Usually, in realistic tasks, there is much more computation and tasks
become complex automatically. Two smaller values of N have been used to test the
impact of the ratio between calculation and communication complexity on the program
execution. The obtained results for parallel run-times (RT) in seconds and speedups (SU)
in the computation of π on a notebook computer are shown in Fig. 6.5.
    We have first checked that the error in parallel approximation of π is the same as
in the case of a single process. The run-time behaves as expected, with the maximum
speedup of 2.6 with four processes and large N . With two processes the speedup
is almost 2, because the physical cores have been allocated. Up to four processes,
the speedup increases, but not ideally, because the logical processors provided by
hyperthreading cannot deliver the same performance as the physical cores. The
program is actually executed on a shared memory computer with potentially negli-
gible communication delays. However, if N is decreased, e.g., to 5e6 or more, the
[Plot: run-time [s] (RT, left axis) and speed-up (SU, right axis) versus p, the number of MPI processes, for N = 5e9, 5e8, 5e7 and 5e6]
Fig. 6.5 Parallel run-time (RT) and speedup (SU) in computation of π on a notebook computer for
p = [1, ..., 8] MPI processes and N = [5e9, 5e8, 5e7, 5e6] sub-intervals
Parallel program performance is tested on MPICH MPI with various options for
mpirun.mpich. First, we run np = 1...128 experiments, for 1 to 128 MPI
processes, with default parameters:
[Plot: run-time [s] (RT-D, left axis) and speed-up (SU-D, right axis) versus p, the number of MPI processes, for N = 5e9 and N = 5e6]
Fig. 6.6 Parallel run-time (RT-D) and speedup (SU-D) in computations of π on a cluster of 8
interconnected computers with total 32 cores and p = [1, ..., 128] MPI processes with default
parameters of mpirun.mpich
  Parameters N and np are provided from a bash file as N = [5e9, 5e8, 5e7, 5e6]
and np = [1, ..., 8]. The maximum number of MPI processes, i.e., 128, was
determined from the command line when running the bash file:
where data.txt is an output file for results. The obtained results for parallel run-
times, in seconds, with default parameters of mpirun (RT-D) and corresponding
speedups (SU-D), are shown in Fig. 6.6. For better visibility only two pairs of graphs
are shown, for the largest and the smallest N .
    Let us first look at the speedup for N = 5e9 intervals. We see that the speedup
increases up to 64 processes, where it reaches its maximal value of 32. For more
processes, it drops and fluctuates around 17. The situation is similar with a thousand
times smaller number of intervals, N = 5e6; however, the maximal speedup is only 5,
and for more than 64 processes there is no speedup. We expected this, because the
calculation load decreases with a smaller number of sub-intervals and the impact of
communication and operating system overheads prevails.
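The speedup values discussed above follow from the measured run-times in the usual way: the speedup is the single-process run-time divided by the run-time with p processes, and dividing the speedup by p gives the parallel efficiency. A minimal C sketch of this arithmetic is given here; the function names are hypothetical, and any run-times passed in are illustrative, since the text reports only the resulting speedups.

```c
#include <assert.h>

/* Speedup: run-time with one process divided by run-time with p processes. */
double speedup(double t1, double tp)
{
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency: speedup per process (1.0 would be ideal scaling). */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / (double)p;
}
```

For the N = 5e9 case, a speedup of 32 reached with 64 processes corresponds to an efficiency of 32/64 = 0.5, which is one way to express that the scaling is not ideal.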
    We further see that the speedup scales, but not ideally: 64 MPI processes are needed
to execute the program 32 times faster than a single process. The reason could be
in the allocation of processes among the physical and logical cores. Therefore, we
repeat the experiments with the mpirun parameter -bind-to core:1, which forces
just a single process to run on each core and possibly prevents the operating system
from moving processes between cores. The obtained parallel run-times in seconds (RT-B)
with processes bound to cores, and the corresponding speedups (SU-B), are shown in
Fig. 6.7. The remaining execution parameters are the same as in the previous experiment.
Fig. 6.7 Parallel run-time (RT-B) and speedup (SU-B) in computations of π on a cluster of 8
interconnected computers for p = [1, ..., 128] MPI processes, bound to cores (plotted series:
RT-B-5e9 [s], RT-B-5e6 [s], SU-B-5e9, SU-B-5e6; horizontal axis: p, number of MPI processes;
vertical axes: run-time [s] and speed-up)
   The bind parameter improves the execution performance with N = 5e9 intervals
in the sense that the speedup of 32 is achieved already with 32 processes, which is
ideal. But then the speedup falls abruptly by a factor of 2, possibly because, with
more than 32 MPI processes, some processing cores must manage two processes,
which slows down the whole program execution.
   We further see that with more than 64 processes the speedups fall significantly in
all tests, which is possibly a consequence of the inability to exploit hyper-threading.
When the number of processes exceeds the number of cores, several cores run more
than one process, which slows down the whole program by a factor of 2. Consequently,
the slope of the speedup curve beyond 32 processes is also reduced by a factor of 2.
With this reduced slope, the speedup reaches its second peak at 64 processes. Then
the speedup falls again to a value of approximately 22.
   The speedup with N = 5e6 intervals remains similar to that in the previous experiment
because of the lower computation load. Why the speedup behaves quite unstably in
some cases is a matter for a more detailed analysis. The reasons could be cache
misses, MPI overheads, collective communication delays, interaction with the
operating system, etc.
     208                                          6 Engineering: Parallel Computation of the Number π
6.3 OpenCL
     If we look at Algorithm 1, we can see that it is very similar to the dot product calculation
     covered in Sect. 5.3.4. We use the same principle: a buffer in local memory named
     LocalPiValues stores each work-item’s running sum of the π value. This buffer
     stores szLocalWorkSize partial π values, so each work-item in the work-group
     has a place to store its temporary result. Then, we use the principle of reduction to
     sum up all π values in the work-group, and store the final value of each work-group
     to an array in global memory. Because this array is relatively small, we return control
     to the host and let the CPU finish the final step of the addition. Listing 6.1 shows
     the kernel code for computing π.
    __kernel void CalculatePiShared(__global float *c,
                                    ulong iNumIntervals)
    {
        __local float LocalPiValues[256];   // work-group size = 256

        // work-item global index
        int iGID = get_global_id(0);
        // work-item local index
        int iLID = get_local_id(0);
        // work-group index
        int iWGID = get_group_id(0);
        // how many work-items are in WG?
        int iWGS = get_local_size(0);

        float x = 0.0f;
        float y = 0.0f;
        float pi = 0.0f;

        while (iGID < iNumIntervals) {
            // midpoint of sub-interval iGID (iGID starts at 0, hence +0.5)
            x = (1.0f / (float) iNumIntervals) * ((float) iGID + 0.5f);
            y = sqrt(1.0f - x * x);
            pi += 4.0f * (y / (float) iNumIntervals);
            iGID += get_global_size(0);
        }

        // store this work-item's partial sum
        LocalPiValues[iLID] = pi;
        // wait for all threads in WG:
        barrier(CLK_LOCAL_MEM_FENCE);

        // Summation reduction:
        int i = iWGS / 2;
        while (i != 0) {
            if (iLID < i) {
                LocalPiValues[iLID] += LocalPiValues[iLID + i];
            }
            barrier(CLK_LOCAL_MEM_FENCE);
            i = i / 2;
        }

        // store the work-group's partial sum into global memory:
        if (iLID == 0) {
            c[iWGID] = LocalPiValues[0];
        }
    }
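The reduction pattern used inside the work-group, and the final addition that is left to the host, can be checked on the CPU with a plain sequential emulation. The following C sketch is only an illustration under stated assumptions: the function names are hypothetical, the loop over `i` stands in for the work-items that run concurrently on the GPU, and the buffer length is assumed to be a power of two (as with the work-group size of 256 in the kernel).

```c
#include <assert.h>
#include <stddef.h>

/* Sequential emulation of the work-group tree reduction: in each round,
 * element i (for i < stride) absorbs element i + stride.
 * n must be a power of two; the result ends up in values[0]. */
float tree_reduce(float *values, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);
    for (size_t stride = n / 2; stride != 0; stride /= 2) {
        for (size_t i = 0; i < stride; i++) {
            values[i] += values[i + stride];
        }
    }
    return values[0];
}

/* Final host-side step: add up the per-work-group partial sums. */
float sum_partials(const float *partials, size_t num_groups)
{
    float total = 0.0f;
    for (size_t g = 0; g < num_groups; g++) {
        total += partials[g];
    }
    return total;
}
```

The halving of the stride is the reason the work-group size is best chosen as a power of two: with any other size, some elements would never be added in.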
Table 6.1 Experimental results for OpenCL π computation
No. of intervals    CPU time [s]    GPU time [s]    Speedup
10^6                0.01            0.0013          7.69
33 × 10^6           0.31            0.035           8.86
10^9                9.83            1.07            9.18
To analyze the performance of the OpenCL program for computing π, the sequential
version has been run on a quad-core processor Intel Core i7 6700HQ running at
2.2 GHz, while the parallel version has been run on an Intel Iris Pro 5200 GPU
running at 1.1 GHz. This is a small GPU integrated on the same chip as the CPU and
has only 40 processing elements. The results are presented in Table 6.1. We run the
kernel in NDrange of size:
Chapter Summary
This chapter presents a simplified model for a computer simulation of changes in
the temperature of a thin bar. The phenomenon is modelled by a heat equation PDE
dependent on space and time. Explicit finite differences in a 1-D domain are used
as a solution methodology. The parallelization of the problem, on both OpenMP and
MPI, leads to a program with significant communication requirements. The obtained
experimental measurements of program run-time are analysed in order to optimize
the performance of the parallel program.
Partial differential equations (PDE) are a useful tool for the description of natural
phenomena like heat transfer, fluid flow, mechanical stresses, etc. The phenomena
are described with spatial and time derivatives. For example, a temperature evolution
in a thin isolated bar of length L, with no heat sources, can be described by a PDE
of the following form:
$$\frac{\partial T(x,t)}{\partial t} = c\,\frac{\partial^2 T(x,t)}{\partial x^2}$$
where $T(x, t)$ is the unknown temperature at position x and time t, and c is the thermal
diffusivity constant, with typical values for metals of about $10^{-5}$ m$^2$/s. The PDE
says that the first derivative of the temperature T with respect to t is proportional to
the second derivative of T with respect to x, with c as the proportionality constant.
To fully determine the solution of the above PDE, we need the initial temperature of
the bar, a constant $T_0$ in our simplified case:
$$T(x, 0) = T_0.$$
Finally, fixed temperatures, independent of time, are imposed at both ends of the bar
as $T(0, t) = T_L$ and $T(L, t) = T_R$.
   In the case of strong solution methodology, the problem domain is discretized in
space and the derivatives are approximated by, e.g., finite differences. This results in
a sparse system matrix A, with three diagonals in the case of a 1-D domain or with five
diagonals in the 2-D case. To save memory, and because the matrix structure is known,
only the vectors with old and new temperatures are needed. The evolution in time
can be obtained by an explicit iterative calculation, e.g., the Euler method, based on
the extrapolation of the current solution and its derivatives into the next time-step,
respecting the boundary conditions. A solution developing in time can be obtained
by a simple matrix-vector multiplication. If only a stationary solution is desired, then
the time derivatives of the solution become zero, and the solution can be obtained
in a single step, through a solution of the resulting linear system, Au = b, where A
is the system matrix, u is a vector of unknown solution values in the discretization
nodes, and b is a vector of boundary conditions.
    We will simplify the problem by analyzing a 1-D domain only. Note that an extension
to a 2-D domain, i.e., a plate, is quite straightforward and can be left for a mini
project. For our simplified case, an isolated thin long bar, an exact analytic solution
exists: in the steady state, the temperature varies linearly between the two fixed
boundary temperatures. However, in real cases, with realistic domain geometry and
complex boundary conditions, an analytic solution may not exist, and an approximate
numerical solution is then the only option. To find the numerical solution, the domain
has to be discretized in space with $j = 1 \ldots N$ points. To simplify the problem, the
discretization points are equidistant, so $x_{j+1} - x_j = \Delta x$ is a constant. The
discretized temperatures $T_j(t)$ for $j = 2 \ldots (N-1)$ are the solution values in the
inner points, while $T_1 = T_L$ and $T_N = T_R$ are the boundary points with fixed
boundary conditions.
    Finally, we also have to discretize time in equal time-steps $t_{i+1} - t_i = \Delta t$.
Using finite difference approximations for the time and spatial derivatives, we obtain:
$$T_j(t_{i+1}) = T_j(t_i) + \frac{c\,\Delta t}{(\Delta x)^2}\left(T_{j-1}(t_i) - 2T_j(t_i) + T_{j+1}(t_i)\right).$$
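The intermediate step behind this update formula can be sketched as follows, assuming a forward difference in time and a central difference in space (the usual pairing for an explicit Euler scheme; the text does not spell the two approximations out):

```latex
\frac{\partial T}{\partial t}\bigg|_{x_j,\,t_i} \approx \frac{T_j(t_{i+1}) - T_j(t_i)}{\Delta t},
\qquad
\frac{\partial^2 T}{\partial x^2}\bigg|_{x_j,\,t_i} \approx
\frac{T_{j-1}(t_i) - 2T_j(t_i) + T_{j+1}(t_i)}{(\Delta x)^2}
```

Substituting both approximations into the PDE and solving for $T_j(t_{i+1})$ gives the explicit update, in which the new temperature in a point depends only on the old temperatures in that point and its two neighbors.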
   We start again with the validation of our program, on a notebook with a single
MPI process and with the code from Listing 7.2. The test parameters were set as
follows: p = 1, N = 30, nt = [1, ..., 1000], c = 9e-3, $T_0$ = 20, $T_L$ = 25, $T_R$ =
18, time = 60, L = 1. Because the exact solution is known, i.e., a line between $T_L$
and $T_R$, the maximal absolute error of the numerical solution was calculated to validate
the solution behavior. If the model and numerical method are correct, the error should
converge to zero. The resulting temperature evolution in time, on the simulated bar,
is shown in Fig. 7.1.
   The set of curves in Fig. 7.1 confirms that, in the initial time-steps, the numerical
method produces a solution close to the initial temperature. Then, the simulated
temperatures change as expected. As the number of time-steps nt increases, the
temperatures advance toward the correct result, which we know is a straight line
between the left and right boundary temperatures, i.e., in our case, between 25° and 18°.
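The validation described above can be reproduced with a short serial program. The following C sketch is not the code from Listing 7.2; it is a minimal illustration that assumes 0-based indexing, Δx = L/(N − 1), Δt = time/nt, and a hypothetical function name. It applies the explicit update repeatedly and returns the maximal absolute deviation from the straight line between the two boundary temperatures.

```c
#include <assert.h>
#include <stdlib.h>

/* Explicit Euler solver for the 1-D heat equation on n equidistant points.
 * Returns the maximal absolute deviation from the straight line between
 * t_left and t_right after nt time-steps, or -1.0 if the scheme would be
 * unstable or memory allocation fails. */
double heat_max_error(int n, int nt, double c, double length, double t_end,
                      double t_init, double t_left, double t_right)
{
    assert(n >= 3 && nt >= 1);
    double dx = length / (double)(n - 1);
    double dt = t_end / (double)nt;
    double k = c * dt / (dx * dx);      /* must stay below 0.5 for stability */
    if (k >= 0.5) return -1.0;

    double *t_old = malloc((size_t)n * sizeof *t_old);
    double *t_new = malloc((size_t)n * sizeof *t_new);
    if (t_old == NULL || t_new == NULL) { free(t_old); free(t_new); return -1.0; }

    for (int j = 0; j < n; j++) t_old[j] = t_init;
    t_old[0] = t_new[0] = t_left;               /* fixed boundary temperatures */
    t_old[n - 1] = t_new[n - 1] = t_right;

    for (int i = 0; i < nt; i++) {
        for (int j = 1; j < n - 1; j++)
            t_new[j] = t_old[j] + k * (t_old[j - 1] - 2.0 * t_old[j] + t_old[j + 1]);
        double *tmp = t_old; t_old = t_new; t_new = tmp;    /* swap buffers */
    }

    double max_err = 0.0;
    for (int j = 0; j < n; j++) {
        double exact = t_left + (t_right - t_left) * (double)j / (double)(n - 1);
        double err = t_old[j] - exact;
        if (err < 0.0) err = -err;
        if (err > max_err) max_err = err;
    }
    free(t_old);
    free(t_new);
    return max_err;
}
```

Called with the test parameters from the text, `heat_max_error(30, 1000, 9e-3, 1.0, 60.0, 20.0, 25.0, 18.0)`, the sketch should return a small error, roughly of the order of 0.01, which shrinks further if the simulated time is extended.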
[Fig. 7.1: temperature T [°] versus discretization point x, for x = 0 to 30]
   We have learned that, in the described solution methodology, the calculation of
a temperature $T_j$ in a discretization point depends only on the temperatures in the two
neighboring points; consequently, the algorithm has the potential to be parallelized.
We will again use domain decomposition; however, communication between
processes will be needed in each time-step, because the left and right discretization
points of a sub-domain are not immediately available to the neighboring processes. A
point-to-point communication is needed to exchange the boundaries of sub-domains. In
the 1-D case, the discretized bar is decomposed into a number of sub-domains equal to
the number of processes. All sub-domains should manage a similar number
of points, because an even load balance is desired. Some special treatment is needed
for calculation near the domain boundaries.
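One common way to obtain sub-domains with a similar number of points is to give every process the same base share and hand the remaining points, one each, to the lowest-ranked processes. The C sketch below illustrates this; the function names are hypothetical, and the scheme is an assumption made here for illustration (a program may instead simply prescribe the same number of points for every process).

```c
#include <assert.h>

/* Number of points owned by process `rank` when n points are split among
 * p processes as evenly as possible: the first (n % p) ranks get one extra. */
int subdomain_size(int n, int p, int rank)
{
    assert(p > 0 && rank >= 0 && rank < p && n >= 0);
    return n / p + (rank < n % p ? 1 : 0);
}

/* Global index (0-based) of the first point owned by process `rank`. */
int subdomain_start(int n, int p, int rank)
{
    assert(p > 0 && rank >= 0 && rank < p && n >= 0);
    int base = n / p;
    int extra = n % p;
    return rank * base + (rank < extra ? rank : extra);
}
```

For N = 30 points and p = 4 processes, the sub-domain sizes are 8, 8, 7, and 7, so no process manages more than one point above any other.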
   Regarding the communication load, some collective communication is needed at
the beginning and end of the calculation. Additionally, a point-to-point communica-
tion is needed in each time-step to exchange the temperature of sub-domain border
points. In our simplified case, the calculation will stop after a predefined number of
time-steps, hence, no communication is needed for this purpose. The parallelized
algorithm is shown below:
7.1 OpenMP
    #pragma omp parallel firstprivate(k)
    {
        double *To = Told;
        double *Tn = Tnew;
        while (k--) {
            #pragma omp for
            for (int i = 1; i <= n; i++) {
                Tn[i] = To[i]
                    + C * (To[i - 1] - 2.0 * To[i] + To[i + 1]);
            }
            double *T = To; To = Tn; Tn = T;
        }
    }
        It is worth examining the wall clock time of this algorithm first. Figure 7.2
     summarizes the wall clock time needed for $10^6$ iterations in $10^5$ points using a
     quad-core processor with multithreading. With more than 4 threads, nothing is gained
     in terms of wall clock time. As this is a floating-point-intensive application run on a
     processor with one floating-point unit per physical core, this can be expected: two
     threads running on two logical cores of the same physical core must share the same
     floating-point unit. This leads to more synchronization among threads and,
     consequently, to an increase in the total CPU time, as shown in Fig. 7.3.
Fig. 7.2 The wall clock time needed to compute $10^6$ iterations of 1-D heat transfer along a bar in
$10^5$ points (plotted series: wall clock time, ideal speedup; horizontal axis: number of threads;
vertical axis: seconds)
[Fig. 7.3; horizontal axis: number of threads]
        The interested reader might investigate how the wall clock time and the speedup
     change if the ratio between the number of points along the bar and the number
     of iterations in time changes. Namely, decreasing the number of points along the bar
     makes the synchronization at the barrier at the end of the parallel for loop relatively
     more expensive.
7.2 MPI
In the solution of the heat equation, the problem domain is discretized first. In our
simplified case, temperature diffusion in a thin bar is computed, which is modeled
by a 1-D domain. For efficient use of parallel computers, the problem domain must
be partitioned (decomposed) into sub-domains that are as equal as possible. The
number of sub-domains should be equal to the number of MPI processes p. We prescribe a certain
        // allocate local buffers and calculate new temperatures
        X = (double *) malloc((N + 2) * sizeof(double));    // N+2 point coordinates
        for (i = 0; i <= N + 1; i++)        // ghost points are in X[0] and X[N+1]
        {
            X[i] = ((double)(my_id * N + i - 1) * Length
                + (double)(num_p * N - my_id * N - i) * X_min)
                / (double)(num_p * N - 1);
        }
        T = (double *) malloc((N + 2) * sizeof(double));    // allocate
        T_new = (double *) malloc((N + 2) * sizeof(double));
        for (i = 1; i <= N; i++)
            T[i] = T_0;
        T[0] = 0.0; T[N + 1] = 0.0;
        delta_time = (time_max - time_min) / (double)(j_max - j_min);
        delta_X = (Length - X_min) / (double)(num_p * N - 1);

        cfl = c * delta_time / pow(delta_X, 2);     // check CFL
        if (my_id == 0)
            printf("    CFL stability condition value = %f\n", cfl);
        if (cfl >= 0.5)
        {
            if (my_id == 0)
                printf("    Computation cancelled: CFL condition failed.\n");
            return;
        }
        for (j = 1; j <= j_max; j++)        // compute T_new
        {
            time_new = ((double)(j - j_min) * time_max
                + (double)(j_max - j) * time_min)
                / (double)(j_max - j_min);
            if (0 < my_id)      // send T[1] to my_id-1 // replace with SendRecv?
            {
                tag = 1;
                MPI_Send(&T[1], 1, MPI_DOUBLE, my_id - 1, tag, MPI_COMM_WORLD);
            }
            if (my_id < num_p - 1)      // receive T[N+1] from my_id+1
            {
                tag = 1;
                MPI_Recv(&T[N + 1], 1, MPI_DOUBLE, my_id + 1, tag, MPI_COMM_WORLD, &status);
            }
            if (my_id < num_p - 1)      // send T[N] to my_id+1
            {
                tag = 2;
                MPI_Send(&T[N], 1, MPI_DOUBLE, my_id + 1, tag, MPI_COMM_WORLD);
            }
            if (0 < my_id)      // receive T[0] from my_id-1
            {
                tag = 2;
                MPI_Recv(&T[0], 1, MPI_DOUBLE, my_id - 1, tag, MPI_COMM_WORLD, &status);
            }
            for (i = 1; i <= N; i++)        // update temperatures
            {
                T_new[i] = T[i] + (delta_time * c / pow(delta_X, 2)) *
                                  (T[i - 1] - 2.0 * T[i] + T[i + 1]);
            }
            if (my_id == 0)         T_new[1] = T_L;     // update boundaries T with BC
            if (my_id == num_p - 1) T_new[N] = T_R;
            for (i = 1; i <= N; i++) T[i] = T_new[i];   // update inner T
        }
        free(T);
        free(T_new);
        free(X);
        return;
    }
    After the successful validation, already presented, the run-time behavior was
analyzed, again on a two-core notebook computer and on eight cluster computers.
To fulfil the CFL condition, the variables t, nt, $N_p$, and c have to be appropriately
selected. We set the number of all discretization points to N = [5e5, 5e4, 5e3],
and, therefore, the number of points per process $N_p$ is obtained by dividing N by the
number of processes p. For example, if N = 5e5 and p = 4, then $N_p$ = 1.25e5, etc. For
accurate timings and a balanced communication and computation load, nt was set to
1e4. To be sure about the CFL condition, the constant c is set to 1e-11. Other parameters
remain the same as in the PDE validation test. The parallel program run-time (RT) in
seconds and speedup (SU), as functions of the number of processes p = [1, ..., 8]
and discretization points, on a notebook computer, are shown in Fig. 7.4.
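The reason for the very small constant c becomes visible when the CFL value, c Δt/(Δx)², is evaluated for the chosen grid. The C sketch below does this arithmetic under stated assumptions: the function names are hypothetical, the bar starts at x = 0 so that Δx = L/(N − 1), and the values L = 1 and time = 60 are taken over from the validation test.

```c
#include <assert.h>

/* CFL value of the explicit scheme, c * dt / dx^2, for n points over a bar
 * of the given length and nt time-steps up to t_end. */
double cfl_number(double c, double t_end, int nt, double length, int n)
{
    assert(nt > 0 && n > 1);
    double dt = t_end / (double)nt;
    double dx = length / (double)(n - 1);
    return c * dt / (dx * dx);
}

/* The MPI listing cancels the computation when the value reaches 0.5. */
int cfl_is_stable(double cfl)
{
    return cfl < 0.5;
}
```

With N = 5e5, nt = 1e4, and c = 1e-11, the value is about 0.015, comfortably below the limit of 0.5. Keeping the validation value c = 9e-3 on the same grid would give a value of the order of 1e7, so the computation would be cancelled.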
    The obtained results bring two important messages. First, the maximum speedup
is only about two, which is smaller than in the case of the π calculation. The explanation
could be a smaller computation/communication time ratio: it seems that the time
spent on communication is almost the same as the time spent on calculation. Second,
with more than 4 MPI processes the speedup drops significantly, to below one, which is
even more pronounced with a smaller number of discretization points. The explanation
for such behavior is the relative amount of communication, which is performed
after each time-step.
Fig. 7.4 Parallel run-time (RT) and speedup (SU) of heat equation solution for N =
[5e5, 5e4, 5e3] and 1 to 8 MPI processes on a notebook computer (horizontal axis: p, number of
MPI processes; vertical axes: run-time [s] and speed-up)
    The next experiments were performed on eight computing cluster nodes with the same
approach as in Chap. 6 and with the same parameters as in the notebook test. In this
case, we use the mpirun parameter -bind-to core:1, which appeared more
promising in the previous tests. We can expect a high impact of communication load.
Even though the messages are short, with just a few doubles, the delay is significant,
mainly because of the communication start-up time. The run-time (RT-B) in seconds
and the corresponding speedup (SU-B), as functions of the number of MPI processes
p = [1, ..., 128] and the number of discretization points N = [5e5, 5e3], are shown
in Fig. 7.5.
Fig. 7.5 Parallel run-time (RT-B) and speedup (SU-B) of heat equation solution for N = [5e5, 5e3]
and 1 to 128 MPI processes bound to cores (horizontal axis: p, number of MPI processes; vertical
axes: run-time [s] and speed-up)
   The results are a surprise! The speedup is very unstable and approaches its maximum
value of 14 with about 30 processes. Then it drops to 6 and remains stable until
64 processes, followed by another drop to almost 0 with more than 64 processes. We
suspect that the problem is in communication.
   The first step occurs when the number of MPI processes increases from 28 to 29.
This step happens because one communication channel gets an additional burden.
Such behavior slows down the whole program because the remaining processes
are waiting.
   Then the processes from number 30 to 64 only add to the communication burden
of other neighboring nodes, which happens in parallel and, therefore, does not
additionally degrade the performance. Since the communication overhead at this number
of MPI processes easily overwhelms the calculation, the speedup seems constant from
there on.
Fig. 7.6 Parallel run-time (RT-R) and speedup (SU-R) of heat equation solution for N = [5e5, 5e3]
and 1 to 65 MPI processes on a ring interconnection topology and implemented by MPI_Sendrecv
(plotted series: RT-R-5e5 [s], RT-R-5e3 [s], SU-R-5e5, SU-R-5e3; horizontal axis: p, number of
MPI processes; vertical axes: run-time [s] and speed-up)
   The second step happens when the number of processes passes 64. Process 65 is assigned
to node 1. This makes it the ninth MPI process allocated on this node, which only
supports 8 threads. The ninth and the first process on node 1, therefore, have to share
the same core, which seems to be a recipe for abysmal performance. The speedup
drop is so overwhelming at this point because MPI uses busy waiting as a part of its
synchronous send and receive operations. Busy waiting means that processes are not
immediately switched out when waiting for MPI communication, since the operating
system does not see them as idle but rather as busy. Therefore, the waiting times for
MPI communication increase dramatically, and with them the execution times.
   Therefore, we repeat the experiment. Now the processors are connected in a
true physical ring topology with two communication ports per processor. Additionally,
we replace the MPI_Send and MPI_Recv pairs by the MPI_Sendrecv function.
We reduce the number of processes in this experiment to 64, because larger numbers
have proved useless. The run-time (RT-R) in seconds and the corresponding
speedup (SU-R), as functions of the number of MPI processes p = [1, ..., 64] and
the number of discretization points N = [5e5, 5e3], are shown in Fig. 7.6.
   We can notice several improvements now. With a larger number of discretization
points, the speedup is quite stable, but the maximum is not higher than in the previous
experiment from Fig. 7.5. With a smaller number of discretization points, a speedup is
detected only with up to four processes, because local memory communication is
used, which confirms that the communication load prevails in this case. Further
222                               7 Engineering: Parallel Solution of 1-D Heat Equation
Chapter Summary
In this chapter we present the parallelization of Seam Carving, a content-aware
image resizing algorithm. Effective parallelization of seam carving is a challenging
problem due to its complex computation model. Two main reasons prevent effective
parallelization: computation dependence and irregular memory access. We show
which parts of the original seam carving algorithm can be accelerated on a GPU and
which parts cannot, and how this affects the overall performance.
Seam carving is a content-aware image resizing technique where the image is reduced
in size by one pixel of width (or height) at a time. Seam carving attempts to reduce
the size of a picture while preserving the most interesting content of the image.
Seam carving was originally published in a 2007 paper by Shai Avidan and Ariel
Shamir. Ideally, one would remove the “lowest energy” pixels (where energy means
the amount of important information contained in a pixel) from the image to preserve
the most important features. However, that would create artifacts and not preserve
the rectangular shape of the image. To balance removing low-energy pixels against
minimizing artifacts, we remove exactly one pixel in each row (or column), where
every removed pixel must touch the removed pixel in the next row either via an edge
or a corner. Such a connected path of pixels is called a seam. If we are going to resize
the image horizontally, we need to remove one pixel from each row of the image. Our
goal is to find a path of connected pixels from the bottom of the image to the top.
By “connected”, we mean that we will never jump more than one pixel left or right
as we move up the image from row to row. A vertical seam in an image is a path
of pixels connected from the top to the bottom with exactly one pixel in each row.
By removing vertical seams iteratively, we can compress the image in the horizontal
direction. The seam carving method produces a resized image by searching for the
seam which has the lowest user-specified “image energy”. To shrink the image, the
lowest energy seam is removed and the process is repeated until the desired
dimensions are reached.
Seam Carving is a three-step process:
1. Assign an energy value to every pixel. This will define the important parts of the
   image that we want to preserve.
2. Find an 8-connected path of the pixels with the least energy. We use dynamic
   programming to calculate the costs of every potential path through the image.
3. Follow the cheapest path to remove one pixel from each row or column to resize
   the image.
Following these steps will shrink the image by one pixel. We can repeat the process
as many times as we want to resize the image as much as necessary. What we need
to do is implement a function which takes an image as an input and produces, as an
output, the image resized in one or two dimensions as expected by the user.
    Why is seam carving interesting for us? Effective parallelization of seam carving
is a challenging problem due to its complex computation model. Two main reasons
prevent effective parallelization: (1) Computation dependence: dynamic programming
is the key step in computing an optimal seam during image resizing and takes a large
fraction of the program execution time. It is very hard to parallelize dynamic
programming on GPU devices due to this computation dependency. (2) Intensive and
irregular memory access: computing the various intermediate results requires a large
number of irregular memory accesses. This worsens the program performance
significantly. In this chapter, we are not going to find or present a better algorithm
for seam carving that can be parallelized. We are just going to show which parts of
the original seam carving algorithm can be accelerated on GPU and which parts cannot,
and how this affects the overall performance. For illustration purposes, we will use
the cyclist image from Fig. 8.1 as an input image, which is to be made narrower with
seam carving.
Fig. 8.1 Original cyclist image to be made narrower with seam carving
Engineering:Parallel Implementation of Seam Carving                                                  225
What are the most important parts of an image? Which parts of a given image should
we eliminate first when resizing and which should we hold onto the longest? The
answer to these questions lies in the energy value of each pixel. The energy value
of a pixel is the amount of important information contained in that pixel. So, the
first step is to compute the energy value for every pixel, which is a measure of its
importance—the higher the energy, the less likely that the pixel will be included as
part of a seam and eventually removed.
    The simplest and frequently most effective measure of energy is the gradient of the
image. An image gradient highlights the parts of the original image that are changing
the most. It is calculated by looking at how similar each pixel is to its neighbors.
Large uniform areas of the image will have a low gradient (i.e., energy) value and
the more interesting areas (edges of objects or places with a lot of detail) will have
a high energy value. There exists a variety of energy measures (e.g., the Sobel operator).
In this book, which is not primarily devoted to image processing, we will use a very
simple energy function, although a number of other energy functions exist and may
work better. Let the color of each pixel (i, j) in the image be denoted by I (i, j). The
energy of the pixel (i, j) is given by the following equation:
\[ E(i, j) = |I(i, j) - I(i, j+1)| + |I(i, j) - I(i+1, j)| + |I(i, j) - I(i+1, j+1)| \tag{8.1} \]
A sequential algorithm for pixel energy calculation is given in Algorithm 5.
for i = 1 … R do
   for j = 1 … C do
      E(i, j) = |I (i, j) − I (i, j + 1)| + |I (i, j) − I (i + 1, j)| + |I (i, j) − I (i + 1, j + 1)|
   end for
end for
Fig. 8.2 a) Original black and white image. b) Energies of the pixels in the image
   We will illustrate this step on a simple example. Let us consider a black and white
image as in Fig. 8.2a and let us suppose that the color of black pixels is coded with the
value 1, and the color of white pixels is coded with 0. Figure 8.2b shows the energy map
for the image from Fig. 8.2a. Energies of the pixels in the last column and the last
row are computed with the assumption that the image is zero-padded. For example,
the energy of the first pixel in the fourth row (a black pixel) is, according to Eq. 8.1:
E(3, 0) = |1 − 1| + |1 − 0| + |1 − 1| = 1,
and the energy of the last pixel in the fifth row is:
E(4, 5) = |1 − 0| + |1 − 0| + |1 − 0| = 3.
  The resulting cyclist image of this step is shown in Fig. 8.3. We can see that large
uniform areas of the image have a low gradient value (black) and the more interesting
edges of objects have a high gradient value (white).
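The energy calculation of Eq. 8.1 with zero padding can be sketched in C as follows. This is only a minimal illustration and not the book's Listing 8.1: the function names simple_energy and pixel_or_zero, the row-major int array layout, and the 0-based indexing are our assumptions.

```c
#include <assert.h>
#include <stdlib.h>

/* Read pixel (i, j); anything outside the rows x cols image counts as 0
   (the zero padding assumed in the text). Hypothetical helper. */
static int pixel_or_zero(const int *img, int rows, int cols, int i, int j)
{
    return (i >= 0 && i < rows && j >= 0 && j < cols) ? img[i * cols + j] : 0;
}

/* Energy map following Eq. 8.1; img and energy are row-major
   rows x cols arrays (an assumed layout). */
void simple_energy(const int *img, int *energy, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int p = img[i * cols + j];
            energy[i * cols + j] =
                abs(p - pixel_or_zero(img, rows, cols, i, j + 1)) +     /* right    */
                abs(p - pixel_or_zero(img, rows, cols, i + 1, j)) +     /* below    */
                abs(p - pixel_or_zero(img, rows, cols, i + 1, j + 1));  /* diagonal */
        }
    }
}
```

For a black pixel (value 1) in the bottom-right corner, all three neighbors are padding, so its energy is 3, in line with the second worked example above.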
Now that we have calculated the energy value of each pixel, our next objective is to
find a path from the bottom of the image to the top of the image with the least energy.
The line must be 8-connected: this means that every pixel in the row must be touched by
the pixel in the next row either via an edge or corner. That would be a vertical seam
of minimum total energy. One way to do this would be to simply calculate the costs
of each possible path through the image one-by-one. Start with a single pixel in the
bottom row, navigate every path from there to the top of the image, and keep track of
the cost of each path as you go. But we will end up with thousands or millions of
possible paths. To overcome this, we can apply the dynamic programming method as
described in the paper by Avidan and Shamir. Dynamic programming lets us touch
each pixel in the image only once, aggregating the total cost as we go, in order to
calculate the final cost of an individual path. Once we have found the energy of every
pixel, we start at the bottom of the image and go up row by row, setting each element
in the row to the energy of the corresponding pixel plus the minimum energy of the
three possible path pixels “below” (the pixel directly below and the lower left and right
diagonal). Thus, we have to traverse the image from the bottom row to the first row
and compute the cumulative minimum energy M for all possible connected seams
for each pixel (i, j):
\[ M(i, j) = E(i, j) + \min\bigl(M(i+1, j-1),\; M(i+1, j),\; M(i+1, j+1)\bigr) \]
In the bottommost row the cumulative energy is equal to the pixel energy, i.e., M(i, j) =
E(i, j). A sequential algorithm for cumulative energy calculation is given in
Algorithm 6.
for j = 1 … C do
   M(R, j) = E(R, j)
end for
for i = R − 1 … 1 do
   for j = 1 … C do
      M(i, j) = E(i, j) + min(M(i + 1, j − 1), M(i + 1, j), M(i + 1, j + 1))
   end for
end for
   We will illustrate this step with Fig. 8.4. We start with the last (bottommost) row.
Cumulative energies in that row are the same as pixel energies. Then, we move up
to the fifth row. The cumulative energy of the fifth pixel in the fifth row is the sum of
its energy and the minimal cumulative energy of the three pixels below it:
M(4, 4) = 2 + min(1, 2, 0) = 2.
On the other hand, the cumulative energy of the last pixel in the fifth row is
M(4, 5) = 3 + min(2, 0, ∞) = 3.
As the element (5, 6) does not exist, we assume it has the maximal energy. In other
words, we ignore it. Once we have computed all the values M, we simply find the
lowest value of M in the top row and return the corresponding path as our minimum
energy vertical seam.
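The cumulative energy calculation can be sketched in C as follows. This is a minimal illustration under our own assumptions (0-based row-major arrays, the hypothetical names cumulative_energy, m_or_max and min3), not the book's Listing 8.2. Neighbors outside the image are given the value INT_MAX, which plays the role of ∞ in the example above.

```c
#include <assert.h>
#include <limits.h>

/* M(i, j), or INT_MAX for a column outside the image, so that
   out-of-range neighbours never win the minimum. Hypothetical helper. */
static int m_or_max(const int *m, int cols, int i, int j)
{
    return (j >= 0 && j < cols) ? m[i * cols + j] : INT_MAX;
}

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Fill m with cumulative minimum energies, bottom row first. */
void cumulative_energy(const int *energy, int *m, int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    /* Bottom row: cumulative energy equals pixel energy. */
    for (int j = 0; j < cols; j++)
        m[(rows - 1) * cols + j] = energy[(rows - 1) * cols + j];
    /* Remaining rows, from the second-to-last row up to the top. */
    for (int i = rows - 2; i >= 0; i--)
        for (int j = 0; j < cols; j++)
            m[i * cols + j] = energy[i * cols + j] +
                min3(m_or_max(m, cols, i + 1, j - 1),
                     m_or_max(m, cols, i + 1, j),
                     m_or_max(m, cols, i + 1, j + 1));
}
```

The pixel directly below always exists, so the minimum is never INT_MAX and the addition cannot overflow because of the padding value.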
Fig. 8.5 The calculated cumulative energy function of the cyclist image. Please note that due to
summation, almost all pixel values are greater than 255 and are represented only with least significant
eight bits in the image
   The effect of this step is easy to see in Fig. 8.5. Notice how the spots where the
gradient image was brightest are now the roots of inverted triangles of brightness as
the cost of those pixels propagates into all of the pixels within the range of the widest
possible paths upwards. For example, the brightest inverted triangle at the center of
the image (in the form of the cyclist) is created because the white edge at the horizon
propagates upwards. When we arrive at the top row, the lowest valued pixel will
necessarily be the root of the cheapest path through the image. Now, we are ready to
start removing seams.
The final step is to remove all of the pixels along the vertical seam. Due to the power
of dynamic programming, the process of actually removing seams is quite easy. All
we have to do to find the cheapest seam is to start with the lowest value M in the
top row and work our way down from there, selecting the cheapest of the three adjacent
pixels in the row below. Dynamic programming guarantees that the pixel with the
lowest value M will be the root of the cheapest connected path from there. Once we
have selected which pixels we want to remove, all that we have to do is go through
each row and shift the remaining pixels on the right side of the seam one position to
the left, and the image will be one pixel narrower. A sequential algorithm for seam
removal is given in Algorithm 7.
min = M(1, 1)
col = 1
for j = 2 … C do
   if M(1, j) < min then
       min = M(1, j)
       col = j
   end if
end for
for i = 1 … R do
   for j = col … C do
       I (i, j) = I (i, j + 1)
   end for
   if i < R then
       next = col
       if M(i + 1, col − 1) < M(i + 1, next) then
           next = col − 1
       end if
       if M(i + 1, col + 1) < M(i + 1, next) then
           next = col + 1
       end if
       col = next
   end if
end for
   We will illustrate this step with Fig. 8.6. We start with the top most row and
find the pixel with the smallest value M. In our case, this is the third pixel in the
first row with M(0, 2) = 0. Then, we select the pixel below that one with the
minimal M. In our case, this is the third pixel in the second row with M(1, 2) = 0.
We continue downwards and select the pixel below the current with the minimal
cumulative energy. This is the fourth pixel in the third row with M(2, 3) = 0. We
continue this process until the last row. The seam with minimal energy is depicted
in gray in Fig. 8.6b.
Fig. 8.6 a) Cumulative energies. b) The seam (in grey) with the minimal energy
Fig. 8.7 a) The first seam in the cyclist image. b) The first 50 seams in the cyclist image
Fig. 8.8 a) The image resized by removing 100 seams. b) The image resized by removing 350
seams
   Figure 8.7 shows the seams in the cyclist image, labeled with white pixels. The
very first vertical seam found in the cyclist image is depicted in Fig. 8.7a. It goes
through the darkest parts of the energy map from Fig. 8.3 and thus through the pixels
with the minimal amount of information. Figure 8.7b depicts the first 50 seams found
in the cyclist image. It can be observed how the seams are “avoiding” the regions with
the highest pixel energy and thus the highest amount of information.
   Once we have labeled a vertical seam, we go through the image and shift the
pixels that are located to the right of the vertical seam one position to the left. The new
image is one pixel narrower than the original. We repeat the whole process
for as many seams as we would like to remove. Figure 8.8 shows two cyclist images
reduced in size by (a) 100 pixels and (b) 350 pixels.
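Seam tracing and seam removal can be sketched in C as follows. This is a minimal illustration and not the book's Listings 8.3 and 8.4: the names find_seam, remove_seam and m_at, the separate stride (allocated row length) and cols (current width) parameters, and the 0-based indexing are our assumptions. Ties are resolved in favor of the pixel directly below.

```c
#include <assert.h>
#include <limits.h>

/* M(i, j), or INT_MAX when column j lies outside the current width. */
static int m_at(const int *m, int stride, int cols, int i, int j)
{
    return (j >= 0 && j < cols) ? m[i * stride + j] : INT_MAX;
}

/* Store in seam[i] the column of the seam pixel in row i, top to bottom. */
void find_seam(const int *m, int rows, int cols, int stride, int *seam)
{
    assert(rows > 0 && cols > 0 && cols <= stride);
    int col = 0;
    for (int j = 1; j < cols; j++)        /* minimum of the top row */
        if (m[j] < m[col])
            col = j;
    seam[0] = col;
    for (int i = 1; i < rows; i++) {
        int best = col;                   /* straight down by default */
        if (m_at(m, stride, cols, i, col - 1) < m_at(m, stride, cols, i, best))
            best = col - 1;
        if (m_at(m, stride, cols, i, col + 1) < m_at(m, stride, cols, i, best))
            best = col + 1;
        col = best;
        seam[i] = col;
    }
}

/* Shift the pixels right of the seam one position to the left, in place.
   Returns the new width; the row stride stays unchanged. */
int remove_seam(int *img, int rows, int cols, int stride, const int *seam)
{
    for (int i = 0; i < rows; i++)
        for (int j = seam[i]; j < cols - 1; j++)
            img[i * stride + j] = img[i * stride + j + 1];
    return cols - 1;
}
```

Keeping the stride fixed while the width shrinks mirrors the width and new_width arguments used by the functions in this chapter, so no reallocation is needed between seams.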
We will first present the CPU code for seam carving. Emphasis will be only on the
functions that implement the main operations of the seam carving algorithm. Other
helper functions and code are available on the book’s companion site.
Energy Calculation on CPU
Listing 8.1 shows how to calculate pixel energy following Algorithm 5. The function
simpleEnergyCPU reads the input PGM image input, calculates the energy for
every pixel and writes the energy to the corresponding pixel in the PGM image output.
The function in Listing 8.1 implements the image gradient, which highlights the parts
of the original image that are changing the most. The image gradient is calculated using
Eq. 8.1, which looks at how similar each pixel is to its neighbors. The third argument
new_width keeps track of the current image width.
Cumulative Energies on CPU
Now that we have calculated the energy value of each pixel, our goal is to find a path
of connected pixels from the bottom of the image to the top. As we previously said,
we are looking for a very specific path: the one whose pixels have the lowest total
value. In other words, we want to find the path of connected pixels from the bottom to
the top of the image that touches the darkest pixels in our gradient image. Listing 8.2
shows how to calculate cumulative pixel energy following Algorithm 6. The function
cumulativeEnergiesCPU reads the input PGM image input, which contains the
energy of pixels, and writes the cumulative energy to the corresponding pixel in the
PGM image output.
Using the dynamic programming approach, we start at the bottom and work our way
up, adding the cost of the cheapest neighbor below to each pixel. This way, we
accumulate cost as we go, setting the value of each pixel not just to its own cost,
but to the full cost of the cheapest path from there to the bottom of the image.
The helper function getPreviousMin() returns the minimal energy value from
the previous row. It contains a few compare statements to find the minimal value.
As we can see, each iteration of the outermost loop depends on the results from
the previous iterations, so the outermost loop cannot be parallelized; only the
iterations of the innermost loop are mutually independent and can be run concurrently.
Also, to find the minimal value from the previous row, we have to use conditional
statements in the helper function getPreviousMin(). We already know that these
statements will prevent the work-items from following the same execution path and
will thus prevent effective execution of warps.
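The book's getPreviousMin() is available on the companion site and is not reproduced here. The following is only a sketch of what such a helper might look like; the name previous_min, its parameter list and the row-major layout are our assumptions. Instead of testing for the image border with separate if statements, it clamps the neighbor columns to the valid range, so that at the border the clamped index repeats the pixel directly below, which cannot change the minimum.

```c
#include <assert.h>

/* Hypothetical stand-in for getPreviousMin(): the minimum of the (up to
   three) cumulative energies in row+1 that touch pixel (row, col).
   width is the allocated row length, new_width the current image width. */
int previous_min(const int *m, int row, int col,
                 int width, int height, int new_width)
{
    assert(row >= 0 && row + 1 < height);
    assert(col >= 0 && col < new_width && new_width <= width);

    const int *below = m + (row + 1) * width;

    /* Clamp neighbour columns to [0, new_width - 1]. */
    int left  = col > 0 ? col - 1 : col;
    int right = col < new_width - 1 ? col + 1 : col;

    int best = below[col];
    best = below[left]  < best ? below[left]  : best;
    best = below[right] < best ? below[right] : best;
    return best;
}
```

The conditional expressions only choose between two values that are already available, a form that compilers can often translate into select instructions instead of jumps. Whether this actually happens depends on the compiler and the device, so the sketch should not be read as a guaranteed cure for divergent execution paths.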
Seam Labeling and Removal on CPU
The process of labeling and removing a seam with minimal energy is quite easy. All
we have to do is to start with the darkest pixel, i.e., the pixel with minimal cumulative
energy, in the top row and work our way down from there, selecting the cheapest of
the three adjacent pixels in the row below and changing the color of the corresponding
pixel in the original image to white. Listing 8.3 shows how to color the seam with
minimal energy and Listing 8.4 shows how to remove the seam with minimal energy.
We can see in Listing 8.4 that seam removal starts with the loop in which we locate
the pixel with minimal cumulative energy, i.e., the first pixel in the vertical seam with
the minimal energy. Then we proceed to the loop nest. The outermost loop indexes
rows in the image. In each row we remove the seam pixel: we shift the pixels that
are located to the right of the vertical seam one position to the left. After that, we have
to find the column index of the seam pixel in the next row. We do this using the helper
function getNextMinColumn(). Like getPreviousMin() in the previous step, this
helper function uses conditional statements, which will prevent the work-items from
following the same execution path.
     1. Load image from a file into a buffer. For example, we can use grayscale PGM
        images, which are easy to handle.
     2. Transfer the image in the buffer to the device.
     3. Execute four kernels: the energy calculation kernel, the cumulative energy kernel,
        the seam labeling kernel, and the seam removal kernel.
     4. Read the resized image from GPU.
     As discussed before, seam carving consists of three steps. We will implement each
     step as one or more kernel functions. The complete OpenCL code for seam carving
     can be found on the book’s companion site.
Listing 8.5 shows the code for the energy calculation kernel.
__kernel void simpleEnergyGPU(
                __global int *imageIn,
                __global int *imageOut,
                int width,
                int height,
                int new_width) {

    // global thread index
    int columnGID = get_global_id(0);   // column in NDRange
    int rowGID = get_global_id(1);      // row in NDRange
    int tempPixel;
    int diffx, diffy, diffxy;

    diffx = abs_diff(imageIn[rowGID * width + columnGID],
                     imageIn[rowGID * width + columnGID + 1]);
    diffy = abs_diff(imageIn[rowGID * width + columnGID],
                     imageIn[(rowGID + 1) * width + columnGID]);
    diffxy = abs_diff(imageIn[rowGID * width + columnGID],
                      imageIn[(rowGID + 1) * width + columnGID + 1]);
    tempPixel = diffx + diffy + diffxy;

    if (tempPixel > 255)
        imageOut[rowGID * width + columnGID] = 255;
    else
        imageOut[rowGID * width + columnGID] = tempPixel;
}
   Each pixel also affects the energy of other adjacent pixels, so it will also be
read by other work-items from the global memory. That means that the same word
from the global memory will be accessed multiple times. Therefore, it makes sense
to first load a block of pixels and their neighbors into the local memory and only
then start the calculation of energy. The reader should add the code for collaborative
loading of the pixel block into local memory. Do not forget to wait for other work-items
at the barrier before starting to calculate the pixel energy.
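To get a feeling for how much global memory traffic such collaborative loading could save, consider the following rough estimate. It rests on our own assumptions: a square work-group of T × T work-items, one pixel per work-item, and no help from hardware caches. Without local memory, every work-item reads four distinct pixels (its own, the right, the lower, and the diagonal neighbor). With local memory, the work-group loads its T × T block plus one extra column and one extra row of neighbors only once.

```latex
\[
  \frac{\text{global reads without local memory}}{\text{global reads with local memory}}
  = \frac{4\,T^{2}}{(T+1)^{2}},
  \qquad
  T = 16:\quad \frac{4 \cdot 256}{17^{2}} = \frac{1024}{289} \approx 3.5 .
\]
```

The ratio approaches 4 for large work-groups. It counts memory requests only, so the measured gain may well be smaller.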
OpenCL Kernel Function: Seam Identification
Prior to calculating the cumulative energy of the pixels in one row, we should have
already calculated the cumulative energy of all the pixels in the previous row. Because
of this data dependency, we can only run as many work-items at a time as the number
of pixels in one row. When all work-items finish the calculation of the cumulative
energy in one row, they move on to the next row. One work-item will calculate the
cumulative energy of all pixels in the same column, but it will move to the next
row (pixel above) only when all other work-items have finished the computation in
the current row. Therefore, we need a way to synchronize work-items that calculate
cumulative energies. We can synchronize work-items in two different ways:
1. We can run all work-items in the same (only one) work-group. The advantage
   of this method is that we can synchronize all work-items using barriers. The
   disadvantage of this approach lies in the fact that only one block of work-items
   can be run, so only one compute unit on GPU will be active during this step. In this
   approach, we will enqueue one kernel. The parallel algorithm for the cumulative
   energy calculation is given in Algorithm 9.
Which of the two presented approaches is more appropriate depends on the size of the
problem. For smaller images, the first approach may be more appropriate, while the
second approach is more appropriate for images with very long rows since we can
employ more compute units.
        Listing 8.6 shows the code for the cumulative energy calculation kernel, which
     will be used for testing purposes in this book. The kernel does not use local memory
     and does not implement collaborative loading. The reader should implement this
     functionality and compare both kernels in terms of execution times. The reader should
     also implement the kernel for the second approach and measure the execution time.
__kernel void cumulativeEnergiesGPU(
                __global int *imageIn,
                __global int *imageOut,
                int width,
                int height,
                int new_width) {

    // global thread index
    int columnGID = get_global_id(0);   // column in NDRange

    // Start from the bottom-most row:
    for (int i = height - 2; i >= 0; i--) {
        imageOut[i * width + columnGID] = imageIn[i * width + columnGID] +
                getPreviousMinGPU(imageOut, i, columnGID,
                                  width, height, new_width);
        // Synchronise: the whole row must be written to global memory
        // before any work-item starts on the row above
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
       Listing 8.7 shows the code for the seam labeling kernel and Listing 8.8 shows the
     code for the seam removal kernel.
__kernel void getSeamGPU(
                __global int *imageIn,
                __global int *seamColumns,
                int width,
                int height,
                int new_width) {

    int column = 0;
    int minvalue = imageIn[0];

    // find the minimum in the topmost row (0) and
    //      return column index:
    for (int j = 1; j < new_width; j++) {
        if (imageIn[j] < minvalue) {
            column = j;
            minvalue = imageIn[j];
        }
    }

    // Start from the top-most row:
    for (int i = 0; i < height; i++) {
        column = getNextMinColumnGPU(imageIn, i,
                                     column, width,
                                     height, new_width);
        seamColumns[i] = column;
    }
}
__kernel void seamRemoveGPU(
                __global int *imageIn,
                __global int *imageOut,
                __global int *seamColumns,
                int width,
                int height,
                int new_width) {

    int iGID = get_global_id(0);   // row in NDRange

    // get the column index of the seam pixel in my row:
    int column = seamColumns[iGID];

    // make my row narrower:
    for (int k = column; k < new_width; k++) {
        imageOut[iGID * width + k] = imageOut[iGID * width + k + 1];
    }
}
   To analyze the performance of the seam carving program, the sequential version
has been run on a quad-core Intel Core i7 6700HQ processor running at 2.2 GHz,
while the parallel version has been run on an Intel Iris Pro 5200 GPU running at
1.1 GHz. This is a small GPU integrated on the same chip as the CPU and has only
40 processing elements. The results of seam carving for an image of size 512 × 320
are presented in Table 8.1.
   As can be seen from the measured execution times, noticeable acceleration is
achieved only for the first step. Although this step is embarrassingly parallel, we do
not achieve the ideal speedup. The reason is that when calculating the energies of
individual pixels, the work-items irregularly access the global memory and there is
no memory coalescing. The execution times could be reduced if the work-items used
local memory, as we did in matrix multiplication.
   At the second step, the speedup is barely noticeable. The first reason for this is the
data dependency between the individual rows. The second reason is, as before, irregular
access to global memory. The third factor that prevents effective parallelization is
the usage of conditional statements when searching for minimal elements in previous
rows. Here too, the times would be improved by using local memory.
   At the third step, we do not get any speedup at all, but almost a 10X slowdown! The
reason for such a slowdown lies in the fact that only one thread can be used to mark
the seam.
   And last but not least, the processing elements on the GPU run at a 2X lower
frequency than the CPU.
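How the per-step results combine into the speedup of the whole program can be sketched in C as follows. The functions step_speedup and overall_speedup are hypothetical helpers introduced only for illustration; they simply divide sequential by parallel execution times, per step and in total.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of a single step: sequential time divided by parallel time. */
double step_speedup(double t_cpu, double t_gpu)
{
    assert(t_cpu > 0.0 && t_gpu > 0.0);
    return t_cpu / t_gpu;
}

/* Speedup of the whole program: total sequential time divided by total
   parallel time, summed over all steps. */
double overall_speedup(const double *t_cpu, const double *t_gpu, size_t steps)
{
    double sum_cpu = 0.0, sum_gpu = 0.0;
    for (size_t s = 0; s < steps; s++) {
        sum_cpu += t_cpu[s];
        sum_gpu += t_gpu[s];
    }
    assert(sum_gpu > 0.0);
    return sum_cpu / sum_gpu;
}
```

As an invented example (these numbers are not the measurements from Table 8.1): if the three steps take 8, 2, and 1 time units on the CPU and 1, 2, and 8 units on the GPU, the first step is 8 times faster, the third step is 8 times slower, and the overall speedup is exactly 1. A slowdown in one poorly parallelizable step can thus cancel the gain of an embarrassingly parallel one.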
Chapter Summary
After reading the book, the reader will be able to start parallel programming on any
of the three main parallel platforms using the corresponding libraries OpenMP, MPI,
or OpenCL. Until a new production technology for significantly faster computing
devices is invented, such parallel programming will be, besides parallel algorithm
design, the only way to increase parallelism in almost any area of parallel
processing.
   Now that we have come to the end of the book, the reader should be well aware
that parallelism is ubiquitous in computing; it is present in hardware
devices, in computational problems, in algorithms, and in software on all levels.
   Consequently, many opportunities for improving the efficiency of parallel programs
are ever present. For example, theoreticians and scientists can search for and
design new, improved parallel algorithms; programmers can develop better tools for
compiling and debugging parallel programs; and cooperation with engineers can lead
to faster and more efficient parallel programs and hardware devices.
   This being so, it is our hope that our book will serve as the first step for a reader
who wishes to join this ever-evolving journey. We will be delighted if the book also
encourages the reader to delve further into the study and practice of parallel computing.
   As the reader now knows, the book provides many basic insights into parallel
computing. It focuses on three main parallel platforms: multi-core computers,
distributed computers, and massively parallel processors. In addition, it explicates
and demonstrates the use of the three main corresponding software libraries and
tools: OpenMP, MPI, and OpenCL. Furthermore, the book offers
hands-on practice and miniprojects so that the reader can gain experience.
   After reading the book, the reader may have become aware of the following
three general facts about the libraries and their use on parallel computers:
• OpenMP is relatively easy to use yet limited in the number of cooperating
  computers.
• MPI is harder to program and debug but, due to the excellent support and long
  tradition, manageable and not limited in the number of cooperating computers.
• Accelerators, programmed with OpenCL, are even more complex and usually
  tailored to specific problems. Nevertheless, users may benefit from excellent
  speedups of naturally parallel applications, and from low power consumption
  which results from massive parallelization with moderate system frequency.
    What about the near future? How will high-performance computers and the
corresponding programming tools develop? Currently, the only possibility
to increase computing power is to increase parallelism in algorithms and programs,
and to increase the number of cooperating processors, which are often supported by
massively parallel accelerators. Why is that so? The reason is that state-of-the-art
production technology is already faced with physical limits dictated by space (e.g.,
dimension of transistors) and time (e.g., system frequency) [8].
    Current high-performance computers, containing millions of cores, can execute
more than 10^17 floating point operations per second (100 petaFLOPS). According to
Moore’s law, the next challenge is to reach the exascale barrier in the next decade.
However, due to the abovementioned physical and technological limitations, the validity
of Moore’s law is questionable. So it seems that the most effective approach to future
parallel computing is an interplay of controlflow and dataflow paradigms, that is,
heterogeneous computing. But programming of heterogeneous computers is
still a challenging interdisciplinary task.
    In this book, we did not describe programming of such extremely high-performance
computers; rather, we described programming of, and trained the reader for, the
parallel computers at hand, e.g., our personal computers, computers in the
cloud, or in computing clusters. Fortunately, the approaches and methodology of
parallel programming are fairly independent of the complexity of computers.
    In summary, it looks like we cannot expect any significant shift in computing
performance until a new production technology for computing devices is invented.
Until then, the maximal exploitation of parallelism will be our delightful challenge.
Appendix
Hints for Making Your Computer a
Parallel Machine
Practical advice for the installation of the supporting software required for parallel
program execution on different operating systems is given. Note that this guide and
the Internet links can change in the future; therefore, always look for the up-to-date
solution proposed by the software tool providers.
A.1 Linux
OpenMP
OpenMP 4.5 has been a part of GNU GCC C/C++, the standard C/C++ compiler on
Linux, by default since GCC’s version 6, and thus it comes preinstalled on virtually
any recent mainstream Linux distribution. You can check the version of your GCC
C/C++ compiler by running the command
$ gcc --version
whose output contains the information about the version of the GCC C/C++ compiler
(6.3.0 in our example).
   The utility time can be used to measure the execution time of a given program.
Furthermore, Gnome’s System Monitor or the command-line utilities top (with
separate-cpu-states displayed—press 1 once top starts) and htop can be used to
monitor the load on individual logical cores.
MPI
The message passing interface (MPI) standard implementation may already be provided
as a part of the operating system, most often as MPICH [1] or Open MPI [2,10].
If it is not, it can usually be installed through the provided package management
system, for example, apt in Ubuntu.
   Make your local directory, e.g., with
>mkdir OMPI
Copy or retype the “Hello World” program from Sect. 4.3 in your editor and save
your code in file OMPIHello.c. Compile and link the program, with a setup for
maximal speed, using the MPI compiler wrapper (e.g., mpicc), which, besides
compiling, also links the appropriate MPI libraries with your program.
Note that on some systems an additional option -lm could be needed for correct
inclusion of all required files and libraries.
   Run the program with
>mpiexec -n 3 OMPIHello
   The output of the program should be in three lines, each line with a notice from
a separate process:
as the program has run on three processes, because the option -n 3 was used.
Note that the line order is arbitrary, because there is no rule about the MPI process
execution order. This issue is addressed in more detail in Chap. 4.
OpenCL
First of all, you need to download the newest drivers for your graphics card. This
is important because OpenCL will not work if you do not have drivers that support
OpenCL. To install OpenCL, you need to download an implementation of OpenCL.
The major graphics vendors NVIDIA, AMD, and Intel have all released implementations
of OpenCL for their GPUs. Besides the drivers, you should get the OpenCL
headers and libraries included in the OpenCL SDK from your favorite vendor. The
installation steps differ for each SDK and the OS you are running. Follow the
installation manual of the SDK carefully. For OpenCL headers and libraries, the main
options you can choose from are NVIDIA CUDA Toolkit, AMD APP SDK, or Intel
SDK for OpenCL. After the installation of drivers and SDK, you should include the
OpenCL header in your programs:
#include <CL/cl.h>
If the OpenCL header and library files are located in their proper folders, the following
command will compile an OpenCL program:
A.2 macOS
OpenMP
Unfortunately, the LLVM C/C++ compiler on macOS comes without OpenMP sup-
port (and the command gcc is simply a link to the LLVM compiler). To check your
C/C++ compiler, run
$ gcc --version
If the output reports an LLVM-based compiler (where some numbers might change from
one version to another), the compiler most likely does not support OpenMP. To use
OpenMP, you have to install the original GNU
GCC C/C++ compiler (use MacPorts or Homebrew, for instance), whose version
information states that this is indeed the GNU GCC C/C++ compiler (version 7.3.0
in our example).
   The running time can be measured in the same way as on Linux (see above).
Monitoring the load on individual cores can be performed using macOS’s Activity
Monitor (open its CPU Usage window) or htop (but not with macOS’s top).
MPI
In order to use MPI on macOS systems, XDeveloper and the GNU compiler must be
installed. Download Xcode from the Mac App Store and install it by double-clicking
the downloaded .dmg file. Use the command >mpiexec to check for an installed
implementation of the MPI library on your computer. Either a note that the program
is currently not installed or a help text will be printed.
   If the latest stable release of Open MPI is not present, download it, for example,
from the Open Source High Performance Computing website: https://www.open-mpi.org/.
To install Open MPI on your computer, first extract the downloaded archive
by typing the following command in your terminal (assuming that the latest stable
release is 3.0.1):
  Then, prepare the config.log file needed for the installation. The
config.log file collects information about your system:
   >cd openmpi-3.0.1
   >./configure --prefix=/usr/local
Finally, make the executables for installation and finalize the installation:
   >make all
   >sudo make install
   After successful installation of Open MPI, we start working by typing our first
program. Make your local directory, e.g., with >mkdir OMPI.
   Copy or retype the “Hello World” program from Sect. 4.3 in your editor and save
your code in file OMPIHello.c.
   Compile and link the program as on Linux (see Appendix A.1) and run it with
>mpiexec -n 3 OMPIHello
 The output of the program should be similar to the output of the “Hello World”
MPI program from Appendix A.1.
OpenCL
If you are using Apple Mac OS X, Apple’s OpenCL implementation should
already be installed on your system. Mac OS X 10.6 and later ship with a native
implementation of OpenCL. The implementation consists of the OpenCL application
programming interface, the OpenCL runtime engine, and the OpenCL compiler.
   OpenCL is fully supported by Xcode. If you use Xcode, all you need to do is to
include the OpenCL header file:
#include <OpenCL/opencl.h>.
A.3 MS Windows
OpenMP
There are several options for using OpenMP on Microsoft Windows. To follow the
examples in the book as closely as possible, it is best to use the Windows Subsystem
for Linux on Windows 10. If a Linux distribution brings a recent enough version of
the GNU GCC C/C++ compiler, e.g., Debian, one can compile OpenMP programs with it.
Furthermore, one can use the commands time, top, and htop to measure and monitor
programs.
   Another option is, of course, to use the Microsoft Visual C++ compiler, which has
supported OpenMP since 2005. Apart from using it from within Microsoft Visual
Studio, one can start x64 Native Tools Command Prompt for VS 2017 where
programs can be compiled and run as follows:
With PowerShell run in x64 Native Tools Command Prompt for VS 2017, programs
can be compiled and run as
   >   powershell
   >   cl /openmp /O2 fibonacci.c
   >   $env:OMP_NUM_THREADS=8
   >   ./fibonacci.exe
  Within PowerShell, the running time of a program can be measured using the
command Measure-Command as follows:
   Regardless of the compiler used, the execution of the programs can be monitored
using Task Manager (open the CPU tab within the Resource Monitor).
MPI
More detailed instructions for the installation of the software necessary for
compiling and running Microsoft MPI programs can be found, for example, on
https://blogs.technet.microsoft.com/windowshpc/2015/02/02/how-to-compile-and-run-a-simple-ms-mpi-program/.
A short summary is listed below:
  1. To include the proper header files, open Project Property pages and insert
     in C/C++/General under Additional Include Directories:
     $(MSMPI_INC);$(MSMPI_INC)\x64
     if a 64-bit solution will be built. Use ..\x86 for 32 bits.
  2. To set up the linker library, in Project Property pages insert in
     Linker/General under Additional Library Directories:
     $(MSMPI_LIB64)
     if a 64-bit platform will be used. Use $(MSMPI_LIB32) for 32 bits.
  3. In Linker/Input under Additional Dependencies add:
     msmpi.lib;
  4. Close the Project Property window and check in the main Visual Studio window
     that Release solution configuration is selected and select also a solution
     platform of your computer, e.g., x64.
• Copy or retype “Hello World” program from Sect. 4.3 and build the project.
• Open a Command prompt window, change directory to the folder where the project
  was built, e.g., ..\source\repos\MSMPIHello\x64\Debug and run the
  program from the command window with execute utility:
mpiexec -n 3 MSMPIHello
   that should result in the same output as in Appendix A.1, with three lines, each
   with a notice from a separate process.
OpenCL
First of all, you need to download the newest drivers for your graphics card. This
is important because OpenCL will not work if you do not have drivers that support
OpenCL. To install OpenCL, you need to download an implementation of OpenCL.
The major graphics vendors NVIDIA, AMD, and Intel have all released implementations
of OpenCL for their GPUs. Besides the drivers, you should get the OpenCL
headers and libraries included in the OpenCL SDK from your favorite vendor. The
installation steps differ for each SDK and the OS you are running. Follow the
installation manual of the SDK carefully. For OpenCL headers and libraries, the main
options you can choose from are NVIDIA CUDA Toolkit, AMD APP SDK, or Intel
SDK for OpenCL. After the installation of drivers and SDK, you should include the
OpenCL header in your programs:
#include <CL/cl.h>
If you are using Visual Studio 2013, you need to tell the compiler where the
OpenCL headers are located and tell the linker where to find the OpenCL .lib files.
References
14. Long, J.: Hands On OpenCL: An open source two-day lecture course for teaching and
    learning OpenCL. https://handsonopencl.github.io/ (2018). Accessed 25 Jun 2018
15. Khronos Group: OpenCL: The open standard for parallel programming of heterogeneous
    systems. https://www.khronos.org/opencl/f (2000–2018). Accessed 25 Jun 2018
16. MPI Forum: MPI: A message-passing interface standard (Version 3.1). Technical report,
    Knoxville (2015)
17. Munshi, A., Gaster, B., Mattson, T.G., Fung, J., Ginsburg, D.: OpenCL Programming Guide,
    1st edn. Addison-Wesley Professional, Reading (2011)
18. OpenMP architecture review board: OpenMP application programming interface, version 4.5,
    November 2015. http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf (1997–2015).
    Accessed 18 Dec 2017
19. OpenMP architecture review board: OpenMP application programming interface examples,
    version 4.5.0, November 2016. http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
    (1997–2016). Accessed 18 Dec 2017
20. OpenMP architecture review board: OpenMP. http://www.openmp.org (2012). Accessed 18
    Dec 2017
21. van der Pas, R., Stotzer, E., Terboven, C.: Using OpenMP - The Next Step: Affinity, Accelerators,
    Tasking, and SIMD (Scientific and Engineering Computation). The MIT Press, Cambridge
    (2017)
22. Robič, B.: The Foundations of Computability Theory. Springer, Berlin (2015)
23. Scarpino, M.: A gentle introduction to OpenCL.
    http://www.drdobbs.com/parallel/a-gentle-introduction-to-opencl/231002854.
    Accessed 10 Apr 2018
24. Scarpino, M.: OpenCL in action: how to accelerate graphics and computations.
    Manning Publications (2011). http://amazon.com/o/ASIN/1617290173/
25. Trobec, R.: Two-dimensional regular d-meshes. Parallel Comput. 26(13–14), 1945–1953 (2002)
26. Trobec, R., Vasiljević, R., Tomašević, M., Milutinović, V., Beivide, R., Valero, M.:
    Interconnection networks in petascale computer systems: a survey. ACM Comput.
    Surv. 49(3), 44:1–44:24 (2016)
Index
A
Amdahl’s Law, 37
Application Programming Interface (API), 50
Atomic access, 49, 64, 67
atomic, OpenMP directive, 67

B
Bandwidth
  bisection, 21
  channel, 21
  cut, 21
Barrier, 51
Blocking communication, 115
Brent’s Theorem, 36

C
Canonical form (of a loop), 55
collapse (OpenMP clause), 55
Communication and computation overlap, 114
Communication modes, 115
Core, 47
  logical, 48
critical, OpenMP directive, 65
Critical section, 63, 65

D
Data sharing, 56

E
Efficiency, 10
Engineering
  heat equation, 212
    MPI, 216
    OpenMP, 215

F
Features of message passing, 122
Features of MPI communication - MPI example, 122
final (OpenMP clause), 80
Final remarks, 241
  perspectives, 242
firstprivate (OpenMP clause), 56
Flow control, 19
Flynn’s taxonomy, 47
for, OpenMP directive, 55

G
GPU, 133
  compute unit, 137
  constant memory, 144
  global memory, 144
  local memory, 144
  memory coalescing, 144
  memory hierarchy, 142
  occupancy, 174
  processing element, 137
  registers, 143
  seam carving, 232
© Springer Nature Switzerland AG 2018                                                      253
R. Trobec et al., Introduction to Parallel Computing,
Undergraduate Topics in Computer Science,
https://doi.org/10.1007/978-3-319-98833-7
R
Race condition, 48
RAM, 11
read, atomic access, 67
Reduction, 65, 67
reduction (OpenMP clause), 67
Reentrant, 70
Routing, 19

S
Seam carving, 223
  CPU implementation, 233
  GPU, 232
  OpenCL implementation, 235
Shared memory multiprocessor, 47
shared (OpenMP clause), 56
single, OpenMP directive, 78

T
task, OpenMP directive, 78, 80
taskwait, OpenMP directive, 84
Team of threads, 50, 51, 54
Thread, 48
  master, 51, 54
  slave, 54
  team of, 54
Thread-safe, 70

U
update, atomic access, 67

W
Warp, 141
Work (of a program), 36, 43
write, atomic access, 67