CSC4005
Parallel Programming
Tutorial 3
Introduction to MPI Programming
Yangzhixin Luo, 119010224@link.cuhk.edu.cn
Outline of Tutorial 3
• Cluster Topology
• MPI: Blocking vs Non-blocking
• Point-wise Communication
▪ Blocking Communication
▪ MPI Supported Datatypes
▪ Probing
▪ Immediate (Non-blocking) Communication
• Multipoint Communication
▪ Broadcast
▪ Scatter
▪ Gather
▪ Allgather
▪ Barrier
▪ More Information
• Debugging
Cluster Topology
MPI: Blocking vs Non-blocking
・Blocking communication
Blocking communication is done using MPI_Send() and MPI_Recv().
These functions do not return (i.e., they block) until their buffer is safe
to use again.
When MPI_Send() returns, the send buffer can be reused, either because MPI
copied it to an internal buffer or because the destination has received it.
Similarly, MPI_Recv() returns once the receive buffer has been filled with
valid data.
・Non-blocking communication
Non-blocking communication is done using MPI_Isend() and MPI_Irecv().
These functions return immediately (i.e., they do not block) even if the
communication is not finished yet. You must call MPI_Wait() or MPI_Test() to
confirm that the communication has finished before reusing the send buffer
or reading the receive buffer.
Blocking Point-wise Communication
MPI_Send(
void* data,
int count,
MPI_Datatype datatype,
int destination,
int tag,
MPI_Comm communicator);
Blocking Point-wise Communication
MPI_Recv(
void* data,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm communicator,
MPI_Status* status);
MPI Supported Datatypes
Blocking Point-wise Communication
Example 1
// Find out rank, size
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
    number = -1;
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n", number);
}
Blocking Point-wise Communication
Example 2
// rank and size are assumed to come from MPI_Comm_rank / MPI_Comm_size;
// data is assumed to be an int.
if (0 == rank) {
    data = 1999'12'08;
    for (int i = 1; i < size; ++i) {
        data = data + i;  // each rank receives a different, accumulated value
        std::cout << "sending " << data << " to " << i << std::endl;
        MPI_Send(&data, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::cout << "received " << data << " at " << rank << std::endl;
}
Blocking Point-wise Communication
Example 2 Output
Blocking Point-wise Communication
int MPI_Sendrecv(
const void *sendbuf,
int sendcount,
MPI_Datatype sendtype,
int dest,
int sendtag,
void *recvbuf,
int recvcount,
MPI_Datatype recvtype,
int source,
int recvtag,
MPI_Comm comm,
MPI_Status *status);
Blocking Point-wise Communication
Example 3
Blocking Point-wise Communication
Example 3 Output
Probing
MPI_Get_count(
MPI_Status* status,
MPI_Datatype datatype,
int* count);
MPI_Probe(
int source,
int tag,
MPI_Comm comm,
MPI_Status* status);
MPI_Status
If we pass an MPI_Status structure to the MPI_Recv function, it
will be populated with additional information about the receive
operation after it completes. The three primary pieces of information
include:
・The rank of the sender
The rank of the sender is stored in the MPI_SOURCE element of the
structure. That is, if we declare an MPI_Status stat variable, the rank can
be accessed with stat.MPI_SOURCE.
・The tag of the message
The tag of the message can be accessed by the MPI_TAG element of the
structure (similar to MPI_SOURCE).
・The length of the message
The length of the message does not have a predefined element in the
status structure. Instead, we have to find out the length of the message
with MPI_Get_count.
MPI_Get_count
Why would this information be necessary?
The MPI_Get_count function is used to determine how many elements
were actually received.
MPI_Recv can take MPI_ANY_SOURCE for the rank of the sender and
MPI_ANY_TAG for the tag of the message. In that case, the MPI_Status
structure is the only way to find out the actual sender and tag of the
message. Furthermore, the count passed to MPI_Recv is only an upper
bound: MPI_Recv receives the number of elements that were sent to it,
which may be fewer, and returns an error if more elements were sent
than the receive count allows.
MPI_Get_count Example
MPI_Probe
Instead of posting a receive and simply providing a really
large buffer to handle all possible sizes of messages, you can use
MPI_Probe to query the message size before actually receiving it.
MPI_Probe looks quite similar to MPI_Recv. In fact, you can
think of MPI_Probe as an MPI_Recv that does everything but
receive the message. Similar to MPI_Recv, MPI_Probe will block for
a message with a matching tag and sender. When the message is
available, it will fill the status structure with information. The user
can then use MPI_Recv to receive the actual message.
MPI_Probe Example
Immediate (Non-blocking) Point-wise
Communication
int MPI_Isend(
const void *buf,
int count,
MPI_Datatype datatype,
int dest,
int tag,
MPI_Comm comm,
MPI_Request *request);
Immediate (Non-blocking) Point-wise
Communication
int MPI_Irecv(
void *buf,
int count,
MPI_Datatype datatype,
int source,
int tag,
MPI_Comm comm,
MPI_Request *request);
Immediate (Non-blocking) Point-wise
Communication
MPI_Wait
MPI_Wait is a blocking call that returns only when a specified
operation has been completed (e.g., the send buffer is safe to
access). This call should be inserted at the point where the next
section of code depends on the buffer, because it forces the
process to block until the buffer is ready.
int MPI_Wait(
MPI_Request *request,
MPI_Status *status);
Immediate (Non-blocking) Point-wise
Communication
MPI_Test
MPI_Test is the non-blocking counterpart to MPI_Wait. Instead of
blocking until the specified operation is complete, this function returns
immediately with a flag that says whether the operation is complete
(true) or not (false). MPI_Test is basically a safe polling mechanism, which
means we can emulate blocking behavior by calling MPI_Test inside
a while-loop.
int MPI_Test(
MPI_Request *request,
int *flag,
MPI_Status *status);
Immediate (Non-blocking) Point-wise
Communication Example
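// send and recv are assumed to be std::vector<int> of length size;
// target and source are the ranks of the communication partners.
MPI_Request send_req, recv_req;
MPI_Isend(send.data(), size, MPI_INT, target, 0, MPI_COMM_WORLD, &send_req);
MPI_Irecv(recv.data(), size, MPI_INT, source, 0, MPI_COMM_WORLD, &recv_req);
int sflag = 0, rflag = 0;
// Poll until both the send and the receive have completed.
do {
    MPI_Test(&send_req, &sflag, MPI_STATUS_IGNORE);
    MPI_Test(&recv_req, &rflag, MPI_STATUS_IGNORE);
} while (!sflag || !rflag);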
Broadcast
A broadcast is one of the standard
collective communication techniques.
During a broadcast, one process sends the
same data to all processes in a
communicator. One of the main uses of
broadcasting is to send out user input to a
parallel program, or send out configuration
parameters to all processes.
In this example, process zero is the
root process, and it has the initial copy of
data. All of the other processes receive the
copy of data.
Broadcast
MPI_Bcast(
void* data,
int count,
MPI_Datatype datatype,
int root,
MPI_Comm communicator);
Broadcast Example
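#include <iostream>
#include <mpi.h>

int main(int argc, char **argv) {
    int data = 0;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (0 == rank) {
        std::cin >> data;  // only the root reads the input
    }
    // After this call, every rank holds the root's value of data.
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
    std::cout << data << std::endl;
    MPI_Finalize();
}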
MPI_Scatter
MPI_Scatter is a collective routine
that is very similar to MPI_Bcast.
MPI_Scatter involves a designated root
process sending data to all processes in a
communicator.
The primary difference between
MPI_Bcast and MPI_Scatter is small but
important. MPI_Bcast sends the same
piece of data to all processes while
MPI_Scatter sends chunks of an array to
different processes.
MPI_Scatter
MPI_Scatter(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
int root,
MPI_Comm communicator);
MPI_Gather
MPI_Gather is the inverse of MPI_Scatter.
Instead of spreading elements from one process to
many processes, MPI_Gather takes elements from
many processes and gathers them to one single
process. This routine is highly useful to many parallel
algorithms, such as parallel sorting and searching.
Similar to MPI_Scatter, MPI_Gather takes
elements from each process and gathers them to the
root process. The elements are ordered by the rank
of the process from which they were received.
MPI_Gather
MPI_Gather(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
int root,
MPI_Comm communicator);
MPI_Allgather
Given a set of elements distributed across
all processes, MPI_Allgather will gather all of the
elements to all the processes. In the most basic
sense, MPI_Allgather is an MPI_Gather followed
by an MPI_Bcast.
Just like MPI_Gather, the elements from
each process are gathered in order of their rank,
except this time the elements are gathered to all
processes. The function declaration for
MPI_Allgather is almost identical to MPI_Gather
with the difference that there is no root process
in MPI_Allgather.
MPI_Allgather
MPI_Allgather(
void* send_data,
int send_count,
MPI_Datatype send_datatype,
void* recv_data,
int recv_count,
MPI_Datatype recv_datatype,
MPI_Comm communicator);
MPI_Barrier
MPI_Barrier(MPI_COMM_WORLD);
MPI_Barrier
{
    std::this_thread::sleep_for(std::chrono::milliseconds{100} * rank);
    std::cout << "hi (no bar)" << std::endl;
}
MPI_Barrier(MPI_COMM_WORLD);
{
    std::this_thread::sleep_for(std::chrono::milliseconds{100} * rank);
    MPI_Barrier(MPI_COMM_WORLD);
    std::cout << "hi (bar)" << std::endl;
}
More Information
・Reduce:
https://mpitutorial.com/tutorials/mpi-reduce-and-allreduce/
・Group Division:
https://mpitutorial.com/tutorials/introduction-to-groups-and-communicators/
Debugging: Stacktrace
mpirun --timeout 5 --get-stack-traces ./main
Pass in Arguments
#include <iostream>
using namespace std;

int main(int argc, char** argv)
{
    cout << "You have entered " << argc
         << " arguments:" << "\n";

    for (int i = 0; i < argc; ++i)
        cout << argv[i] << "\n";
    return 0;
}
・Input:
./main CSC4005 Parallel Programming
・Output:
You have entered 4 arguments:
./main
CSC4005
Parallel
Programming
Pass in Arguments
・argc (ARGument Count) is an int that stores the number of
command-line arguments passed by the user, including the name of the
program. The value of argc should be nonnegative.
・argv (ARGument Vector) is an array of character pointers listing all
the arguments.
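Since argv[0] is the program name, the user-supplied arguments start at index 1. A minimal sketch of collecting only those arguments, assuming a hypothetical helper named user_arguments:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: copy the command-line arguments into strings,
// skipping argv[0] (the name of the program).
std::vector<std::string> user_arguments(int argc, char** argv) {
    assert(argc >= 0);
    std::vector<std::string> args;
    for (int i = 1; i < argc; ++i) {
        args.emplace_back(argv[i]);
    }
    return args;
}
```

For the invocation ./main CSC4005 Parallel Programming, argc is 4 and the helper returns the three strings after the program name.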