Applied High-Performance Computing and Parallel
Programming
Presenter: Liangqiong Qu
Assistant Professor
The University of Hong Kong
Outline
▪ Introduction to MPI
▪ Parallel Execution in MPI
▪ Communicator and Rank
▪ MPI Blocking Point-to-Point Communication
▪ Beginner’s MPI Toolbox
▪ Examples
Review of Previous Lecture: Dominant Architectures of HPC Systems
• Shared-memory computers: A shared-memory parallel computer is a system in
which a number of CPUs work on a common, shared physical address space.
Shared-memory programming enables immediate access to all data from all
processors without explicit communication
• Distributed memory computers: A distributed-memory architecture is a system
where each processor or node has its own local memory, and they communicate
with each other through message passing.
• Hybrid (shared/distributed memory) computers: A hybrid architecture is a
system that combines the features of both shared-memory and distributed-memory
architectures.
Figure. Architecture of distributed-memory computers
Figure. Architecture of shared-memory computers
Distributed Memory and MPI
• Definition: A distributed-memory architecture is a system where
each processor or node has its own local memory, and they
communicate with each other through message passing.
• Features
• No global shared address space
• Data exchange and communication between processors is done
by explicitly passing messages through NIs (network interfaces)
• Programming
• No remote memory access on distributed-memory systems
• Requires ‘sending messages’ back and forth between processors
• Many free Message Passing Interface (MPI) libraries available
The Message Passing Paradigm
• A brief history of MPI: Before the 1990s, many libraries could facilitate building parallel
applications, but there was no standard, accepted way of doing it. At the Supercomputing
1992 conference, researchers gathered to define a standard interface for performing
message passing - the Message Passing Interface (MPI). This standard interface allows
programmers to write parallel applications that are portable to all major parallel
architectures.
• MPI is a widely accepted standard in HPC
• Process-based approach: All variables are local! No concept of shared memory.
• A basic principle of MPI: the same program runs on each processor/machine (SPMD). The
program is written in a sequential language such as C or Fortran.
• Data exchange between processes: Send/receive messages via MPI library calls
• No automatic workload distribution
The MPI Standard
• MPI forum – defines MPI standard / library subroutine interface
• Latest standard: MPI 4.1 (Nov. 2023), 1166 pages
• The first version, MPI 1.0, was released in 1994
• MPI 5.0 under development
• Members of MPI standard forum
• Application developers
• Research institutes & computing centers
• Manufacturers of supercomputers & software designers
• Successful free implementations (MPICH, OpenMPI, mvapich) and vendor
libraries (Intel, Cray, HP, …)
• Documents : https://www.mpi-forum.org/docs/
Serial Programming vs Parallel Programming (MPI) Terminologies
Table. Serial programming vs. parallel programming (MPI) terminology.
Parallel Execution in MPI
• Processes run throughout program execution
• Program startup
• MPI start mechanism:
• Launches tasks/processes
• Establishes communication context (“communicator”)
• MPI point-to-point communication
• between pairs of tasks/processes
• MPI collective communication:
• between all processes or a subgroup
• Program shutdown
• Clean shutdown by MPI
Figure. Timeline of tasks/processes (# 0-4) from program startup to program shutdown.
C Interface for MPI
• Required header files:
#include <mpi.h>
• Bindings:
• MPI function calls follow a
specific naming convention
error = MPI_Xxxxx(…);
• MPI constants (global/common):
all upper case in C
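For example (a minimal sketch of the convention; MPI_Barrier and MPI_SUCCESS are standard MPI names used here only for illustration):
/* fragment: to be placed between MPI_Init() and MPI_Finalize() */
int error;
error = MPI_Barrier(MPI_COMM_WORLD);   /* MPI_Xxxxx naming convention; returns an error code */
if (error != MPI_SUCCESS) {
    /* handle the error; constants such as MPI_SUCCESS and MPI_COMM_WORLD are all upper case */
}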
Initialization and Finalization
• Details of MPI startup are implementation defined
• First call in an MPI program: initialization of the parallel machine
int MPI_Init(int *argc, char ***argv);
MPI_Init() must be called in every MPI program, before any other MPI function.
• Last call: clean shutdown of the parallel machine
int MPI_Finalize();
Only the “master” process is guaranteed to continue after MPI_Finalize();
no other MPI routines may be called after it.
Communicator and Rank
• Key questions arise early in a parallel program: how many processes are participating,
and which one am I?
• MPI_Init() defines the “communicator” MPI_COMM_WORLD comprising all processes
• MPI uses objects called communicators and groups to define which collection of
processes may communicate with each other.
• Within a communicator, every process has its own unique integer identifier, its rank,
assigned by the system when the process initializes. A rank is sometimes also called a
“task ID”. Ranks are contiguous and begin at zero.
Communicator and Rank
• Communicator defines a set of processes (MPI_COMM_WORLD: all)
• The rank identifies each process within a communicator
• Obtain rank:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
• rank = 0,1,2,..., (number of processes in communicator - 1)
• One process may have different ranks if it belongs to different communicators
• Obtain the number of processes in a communicator:
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
Communicator and Rank: & and * in C-Programming (Background)
• A computer memory location has an address and holds some content. The address is a
numerical value (often expressed in hexadecimal), which is hard for programmers to
use directly.
• To ease the burden of programming with
numerical addresses, early programming
languages (such as C) introduced the concept
of variables.
• A variable is a named location that can store
a value of a particular type. Instead of
numerical addresses, names (or identifiers)
are attached to certain addresses.
Communicator and Rank: & and * in C-Programming (Background)
• When a variable is created in C, a memory address is assigned to the variable. The
memory address is the location of where the variable is stored on the computer.
• When we assign a value to the variable, it is stored in this memory address. To access
it, use the reference operator (&), and the result represents where the variable is stored:
• A pointer is a variable that stores the memory address of another variable as its value.
• A pointer variable points to a data type (like int) of the same type, and is created with
the * operator.
• You can also get the value of the variable the pointer points to, by using the * operator
(the dereference operator):
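For example (a short sketch; the variable names are illustrative):
#include <stdio.h>

int main(void) {
    int rank = 7;            /* an ordinary int variable                      */
    int *p = &rank;          /* p stores the address of rank (& = address-of) */

    printf("value of rank      : %d\n", rank);        /* 7                    */
    printf("address of rank    : %p\n", (void *)&rank);
    printf("value pointed by p : %d\n", *p);          /* 7 (* = dereference)  */

    *p = 42;                 /* writing through the pointer changes rank      */
    printf("rank is now        : %d\n", rank);        /* 42                   */
    return 0;
}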
Communicator and Rank: & and * in C-Programming (Background)
• In C, function arguments are passed by value by default. This means when you pass a
variable (e.g., rank) to a function, the function receives a copy of its value. That is, C
allocates new memory for the parameter inside the function.
• Any modifications to the parameter inside the function do not affect the original
variable outside.
Communicator and Rank: & and * in C-Programming (Background)
• In C, function arguments are passed by value by default. This means when you pass a
variable (e.g., rank) to a function, the function receives a copy of its value. That is, C
allocates new memory for the parameter inside the function.
• We need to pass a pointer instead, i.e., the memory address of the variable (see the sketch below).
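A short sketch contrasting pass-by-value with pass-by-pointer (function and variable names are illustrative):
#include <stdio.h>

/* Pass by value: the function works on a copy; the caller's variable is untouched. */
void set_by_value(int r) {
    r = 5;
}

/* Pass by pointer: the function writes through the address, so the caller sees the change.
   This is what MPI_Comm_rank(comm, &rank) relies on. */
void set_by_pointer(int *r) {
    *r = 5;
}

int main(void) {
    int rank = 0;
    set_by_value(rank);
    printf("after set_by_value  : %d\n", rank);   /* still 0 */
    set_by_pointer(&rank);
    printf("after set_by_pointer: %d\n", rank);   /* now 5  */
    return 0;
}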
Communicator and Rank: & and * in C-Programming (Background)
• When a variable is created in C, a memory address is assigned to the variable. The
memory address is the location of where the variable is stored on the computer.
• When we assign a value to the variable, it is stored in this memory address. To access
it, use the reference operator (&), and the result represents where the variable is stored:
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int size;
MPI_Comm_size(MPI_COMM_WORLD, &size);
The "&size" syntax is used to pass the address of the "size" variable to the function.
• The “&” symbol is used in MPI to pass the address of a variable to a function,
allowing the function to directly modify the value at that memory location. In MPI
functions, “&” (output arguments) is used more often than returning the result directly,
since the return value carries the error code (error = MPI_Xxxxx(…)).
General MPI Program Structure
• Header declaration
• Serial code
• Initialize MPI environment
• Launches tasks/processes
• Establishes communication context (“communicator”)
• Work and message passing calls
• Terminate the MPI environment
• Serial code
Step 1: Write the Code: MPI “Hello World!” in C
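A minimal “Hello World” program in C, consistent with the general structure above (the exact message text is illustrative):
#include <stdio.h>
#include <mpi.h>                                /* required MPI header            */

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                     /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* who am I?                      */
    MPI_Comm_size(MPI_COMM_WORLD, &size);       /* how many are we?               */

    printf("Hello World from rank %d of %d\n", rank, size);

    MPI_Finalize();                             /* terminate the MPI environment  */
    return 0;
}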
Step 2: Compiling the Code
• If you try to compile this code locally with gcc, you might run into problems
• Instead, compiling with MPI wrapper compilers
• Most MPI implementations provide wrapper scripts
• Such as mpicc (MPICH) or mpiicc (Intel MPI, which our HKU HPC system uses)
• They behave like normal compilers
Step 2: Compiling the Code
• If you try to compile this code locally with gcc, you might run into problems
• Instead, compiling with MPI wrapper compilers
• Most MPI implementations provide wrapper scripts
• Such as mpicc (MPICH) or mpiicc (Intel MPI, which our HKU HPC system uses)
• They behave like normal compilers
• Directly running mpicc -o hello_world hello_world.c outputs an error
Review of Lecture 2: Batch System for Running Jobs with HPC
HPC system module environment: knowledge of installed compilers essential
▪ module list
List currently loaded modules under your account
▪ module avail
List all the available modules in the HPC system
▪ module avail X
List all installed versions of modules matching “X”
▪ module load X (e.g., module load python)
Load a specific module X into your current account (e.g., load the module python/2.7.13)
▪ module unload X
Unload specific module X from your current account
Step 2: Compiling Code Basic: Load the Right Module for Compilers
• If you try to compile this code locally with gcc, you might run into problems
• Instead, compiling with MPI wrapper compilers
• Most MPI implementations provide wrapper scripts
• Such as mpicc (MPICH) or mpiicc (Intel MPI, which our HKU HPC system uses)
• They behave like normal compilers
• Directly running mpicc -o hello_world hello_world.c outputs errors
• We should load the MPI module into our current account
• Our HKU HPC system uses Intel MPI, so use
• module load impi
• Note: there are many compilers available; here we pick one for this particular HPC course
that works with MPI
Step 3: Running the Code
• If you try to compile this code locally with gcc, you might run into problems
• Instead, compiling with MPI wrapper compilers
• Most MPI implementations provide wrapper scripts
• Such as mpicc (MPICH) or mpiicc (Intel MPI, which our HKU HPC system uses)
• They behave like normal compilers
• Running
• Startup wrappers: mpirun or mpiexec
• mpirun -np 4 ./hello_world
• Details are implementation specific
Review of Lecture 2---Batch System for Running Jobs with HPC
Submitting job scripts
A job script must contain directives to inform the batch system about the
characteristics of the job. These directives appear as comments (#SBATCH) in the job
script and have to conform to the sbatch syntax
Step 3: Running the Code
• Preparing job scripts and running the code with the scheduler
• Example: Slurm, as used on our HKU HPC system
• mpirun and the scheduler distribute the executable to the right nodes
• After preparing the job script, submit it with the sbatch command: sbatch submit-hello.sh
Step 3: Running the Code
• Running the code
• Example to understand the distribution of the program
• E.g., executing the MPI program on 4 processors
• Normally the batch system handles the allocation
• Understanding the role of mpirun is important (the command below
runs hello_world with 4 processes)
mpirun -np 4 ./hello_world
Summarization of “Hello World!” in MPI
• All MPI programs begin with MPI_Init and end with
MPI_Finalize
• When a program is run with MPI, all the processes are
grouped in a communicator, MPI_COMM_WORLD
• Each statement executes independently in each process
Administration
• Assignment 1 has been released
- Due March 14, 2025, Friday, 11:59 PM
- Account information to access the HKU HPC system has already been sent to your
email this week.
- Important: The usage period for the first batch of accounts is from Feb. 27 to Mar. 12,
11:59 PM.
- You cannot access the HPC system after Mar. 12!
• We need a class representative to volunteer to attend the Staff-Student Consultative
Committee (SSCC) meeting for the 2nd semester of 2024-25! Thank you.
- Session 1: 2:30 - 3:40 p.m. on Wednesday, 26 March 2025 in Room 301, Run Run
Shaw Building
Take a break
MPI Point-to-Point Communication
▪ Sender
• Which processor is sending the message?
• Where is the data on the sending processor?
• What kind of data is being sent?
• How much data is there?
▪ Receiver
• Which processor is receiving the message?
• Where should the data be left on the receiving processor?
• How much data is the receiving processor prepared to accept?
▪ The sender and receiver must pass their information to MPI separately
MPI Point-to-Point Communication
▪ Processors communicate by sending and receiving messages
▪ MPI message: array of elements of a particular type
Figure. A message (an array of elements of a particular type) is sent from the sender (rank i) to the receiver (rank j).
▪ Data types
▪ Basic
▪ MPI derived types
Predefined Data Types in MPI (Selection)
• Data type matching: Same type
in send and receive call required
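A commonly used selection of predefined MPI datatypes and their C counterparts, for reference (the full table is in the MPI standard):
MPI_CHAR      char
MPI_INT       int
MPI_LONG      long
MPI_FLOAT     float
MPI_DOUBLE    double
MPI_BYTE      (raw bytes, no conversion)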
MPI Blocking Point-to-Point Communication
▪ Point-to-point: one sender, one receiver
• identified by rank
▪ Blocking: After the MPI call returns,
• the source process can safely modify the send buffer
• the receive buffer (on the destination process) contains the
entire message
• This is not the “standard” definition of “blocking”
Standard Blocking Send
int MPI_Send(void *buf, int count , MPI_Datatype datatype, int dest, int tag, MPI_Comm
comm)
buf address of send buffer
count # of elements
datatype MPI data type
dest destination rank
tag message tag
comm communicator
• void * (a “void pointer”) is a generic pointer that can hold the address of any data
type in C
Standard Blocking Send
int MPI_Send(void *buf, int count , MPI_Datatype datatype, int dest, int tag, MPI_Comm
comm)
buf address of send buffer
count # of elements
datatype MPI data type
dest destination rank
tag message tag
comm communicator
▪ At completion
• Send buffer can be reused as you see fit
• Status of destination is unknown - the message could be anywhere
Standard Blocking Send
int MPI_Send(void *buf, int count , MPI_Datatype datatype, int dest, int tag, MPI_Comm
comm)
Standard Blocking Receive
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status);
buf address of receive buffer
count maximum # of elements expected to be received
datatype MPI data type
source sending processor rank
tag message tag
comm communicator
status address of status object. It is a struct that you can access if
necessary to have more information on the message you just
received.
▪ At completion
• Message has been received successfully
• The message length, and possibly the tag and the sender (if wildcards were used), are not directly known; they can be retrieved from the status object
Source and Tag Wildcards
▪ In certain cases, we might want to allow receiving messages from any sender or
with any tag.
▪ MPI_Recv accepts wildcards for the source and tag arguments:
MPI_ANY_SOURCE,MPI_ANY_TAG
▪ MPI_ANY_SOURCE and MPI_ANY_TAG indicate that a message may be received from
any source / with any tag
▪ Actual source and tag values are available in the status object:
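For example (a minimal fragment; the buffer name and size are illustrative):
MPI_Status status;
int buf[10];

/* Accept a message from any sender, with any tag. */
MPI_Recv(buf, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);

int actual_source = status.MPI_SOURCE;   /* who actually sent it */
int actual_tag    = status.MPI_TAG;      /* which tag it carried */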
Received Message Length
▪ int MPI_Get_count(const MPI_Status *status, MPI_Datatype
datatype, int *count)
status address of status object
datatype MPI data type
count the address of the variable that will store the element count
after the function is executed
▪ Determines number of elements received
MPI_Status s;   /* filled in by a preceding MPI_Recv */
int count;
MPI_Get_count(&s, MPI_DOUBLE, &count);
Standard Blocking Receive
Requirements for Point-to-Point Communication
▪ For a communication to succeed (a matched pair is sketched after this list):
• The sender must specify a valid destination.
• The receiver must specify a valid source rank (or MPI_ANY_SOURCE).
• The communicator used by the sender and receiver must be the same (e.g.,
MPI_COMM_WORLD).
• The tags specified by the sender and receiver must match (or MPI_ANY_TAG for
receiver).
• The data types of the messages being sent and received must match.
• The receiver's buffer must be large enough to hold the received message.
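A minimal matched send/receive pair satisfying these requirements (ranks, tag value, and data are illustrative):
#include <stdio.h>
#include <mpi.h>

/* Run with at least 2 processes, e.g.:  mpirun -np 2 ./matched_pair */
int main(int argc, char **argv) {
    int rank;
    double x = 3.14, y = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* valid destination (1), tag 42, MPI_DOUBLE, communicator MPI_COMM_WORLD */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* matching source (0), tag (42), datatype, and communicator; buffer large enough */
        MPI_Recv(&y, 1, MPI_DOUBLE, 0, 42, MPI_COMM_WORLD, &status);
        printf("rank 1 received %f from rank 0\n", y);
    }

    MPI_Finalize();
    return 0;
}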
Beginner’s MPI Toolbox
• MPI_Init( ): Let's get going. Initializes the MPI execution environment.
• MPI_Comm_size( ): How many are we?
• MPI_Comm_rank( ): Who am I?
• MPI_Send( ): Send data to someone else.
• MPI_Recv( ): Receive data from someone/anyone.
• MPI_Get_count( ): How many items have I received?
• MPI_Finalize( ): Finish off. Terminates the MPI execution environment.
• Send/receive buffer may safely be reused after the call has completed
• MPI_Send() must specify a particular destination rank and tag; MPI_Recv() may use wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG)
Example 1. Exchanging Data with MPI Send/Receive (Pingpong.c)
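A sketch of a ping-pong style exchange consistent with the discussion on the following slides (message text, tag value, and buffer size are illustrative):
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, other, count;
    char sendbuf[64], recvbuf[64];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Run the ping-pong between ranks 0 and 1 only. */
    if (size >= 2 && rank < 2) {
        other = 1 - rank;   /* partner rank: source and dest are the same for each rank */
        sprintf(sendbuf, "hello from rank %d", rank);

        if (rank == 0) {
            MPI_Send(sendbuf, strlen(sendbuf) + 1, MPI_CHAR, other, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, 64, MPI_CHAR, other, 0, MPI_COMM_WORLD, &status);
        } else {
            MPI_Recv(recvbuf, 64, MPI_CHAR, other, 0, MPI_COMM_WORLD, &status);
            MPI_Send(sendbuf, strlen(sendbuf) + 1, MPI_CHAR, other, 0, MPI_COMM_WORLD);
        }

        /* How many chars did we really transfer? (see MPI_Get_count) */
        MPI_Get_count(&status, MPI_CHAR, &count);
        printf("rank %d got \"%s\" (%d chars) from rank %d, tag %d\n",
               rank, recvbuf, count, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}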
Example 1. Exchanging Data with MPI Send/Receive
• The MPI_Send() function is used to send
a certain number of elements of
some datatype to another MPI rank;
this routine blocks until the send
buffer can safely be reused (the message
has been buffered or received by the
destination process)
• The MPI_Recv() function is used to
receive a certain number of elements
of some datatype from another MPI
rank; this routine blocks until the
message has been received, and thus sent by
the source process
• This form of MPI communication is
called ‘blocking'
• int MPI_Recv(void *buf, int count, MPI_Datatype
datatype, int source, int tag, MPI_Comm comm,
MPI_Status *status);
Example 1. Exchanging Data with MPI Send/Receive
▪ Spend a bit of time to really
understand why source and
dest are equal here for each rank
Example 1. Exchanging Data with MPI Send/Receive
• MPI_Status is a variable that
includes a lot of information about
the corresponding MPI function call
• We use the MPI_Status in our
example to check how many chars
we really transferred, using the
MPI_Get_count() function
• As a simple debugging check, we can
verify that the MPI_Status
information about the source and tag of
the messages corresponds to what we
intended in our program
Example 1. Exchanging Data with MPI Send/Receive
• Load mpi into our account: module load impi
• Compiling with MPI wrapper compilers
• mpiicc pingpong.c -o pingpong
• Preparing batch script submit-pingpong.sh
• Submit the jobs: sbatch submit-pingpong.sh
• View the results
Example 1. Extension of Scontrol Command
• scontrol show job <job_id> is used to see the status of a specific job ID
Summary of Beginner’s MPI Toolbox
▪ Starting up and shutting down the “parallel program” with MPI_Init() and
MPI_Finalize()
▪ MPI task (“process”) identified by rank (MPI_Comm_rank() )
▪ Number of MPI tasks: MPI_Comm_size()
▪ Startup process is very implementation dependent
▪ Simple, blocking point-to-point communication with MPI_Send() and
MPI_Recv()
• “Blocking” == buffer can be reused as soon as call returns
▪ Message matching
▪ Timing functions
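The standard MPI wall-clock timer is MPI_Wtime(); a minimal fragment (to be placed between MPI_Init() and MPI_Finalize(), with stdio.h included for printf) looks like:
double t_start, t_end;

t_start = MPI_Wtime();        /* wall-clock time in seconds */
/* ... work to be timed, e.g. the send/receive exchange ... */
t_end = MPI_Wtime();

printf("elapsed time: %f s\n", t_end - t_start);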
Thank you very much for choosing this course!
Give us your feedback!
https://forms.gle/zDdrPGCkN7ef3UG5A