Applied High-Performance Computing and Parallel
Programming
                  Presenter: Liangqiong Qu
                     Assistant Professor
                 The University of Hong Kong
Outline
  ▪ Introduction to MPI
    ▪ Parallel Execution in MPI
    ▪ Communicator and Rank
    ▪ MPI Blocking Point-to-Point Communication
    ▪ Beginner’s MPI Toolbox
    ▪ Examples
Review of Previous Lecture: Dominant Architectures of HPC Systems
•   Shared-memory computers: A shared-memory parallel computer is a system in
    which a number of CPUs work on a common, shared physical address space.
    Shared-memory programming enables immediate access to all data from all
    processors without explicit communication
•   Distributed memory computers: A distributed-memory architecture is a system
    where each processor or node has its own local memory, and they communicate
    with each other through message passing.
•   Hybrid (shared/distributed memory) computers: A hybrid architecture is a
    system that combines the features of both shared-memory and distributed-memory
    architectures.
Figure. Architecture of shared-memory computers
Figure. Architecture of distributed-memory computers
Distributed Memory and MPI
• Definition: A distributed-memory architecture is a system where
  each processor or node has its own local memory, and they
  communicate with each other through message passing.
• Features
    • No global shared address space
    • Data exchange and communication between processors is done
      by explicitly passing messages through network interfaces (NIs)
• Programming
    • No remote memory access on distributed-memory systems
    • Processors must send messages back and forth to exchange data
    • Many free Message Passing Interface (MPI) libraries are available
The Message Passing Paradigm
 • A brief history of MPI: Before the 1990s, many libraries could facilitate building parallel
   applications, but there was no accepted standard way of doing it. At the Supercomputing
   1992 conference, researchers gathered and defined a standard interface for performing
   message passing: the Message Passing Interface (MPI). This standard interface allows
   programmers to write parallel applications that are portable to all major parallel
   architectures.
 • MPI is a widely accepted standard in HPC
 • Process-based approach: All variables are local! No concept of shared memory.
 • A basic principle of MPI: the same program runs on each processor/machine (SPMD). The
   program is written in a sequential language such as C or Fortran.
 • Data exchange between processes: Send/receive messages via MPI library calls
    • No automatic workload distribution
The MPI Standard
 • MPI forum – defines MPI standard / library subroutine interface
 • Latest standard: MPI 4.1 (Nov. 2023), 1166 pages
    • The first version, MPI 1.0, was released in 1994
    • MPI 5.0 under development
 • Members of MPI standard forum
    • Application developers
    • Research institutes & computing centers
    • Manufacturers of supercomputers & software designers
 • Successful free implementations (MPICH, Open MPI, MVAPICH) and vendor
   libraries (Intel, Cray, HP, …)
 • Documents: https://www.mpi-forum.org/docs/
Serial Programming vs Parallel Programming (MPI) Terminologies
       Serial Programming        Parallel Programming (MPI)
Parallel Execution in MPI
• Processes run throughout program execution
• MPI start mechanism:
  • Launches tasks/processes
  • Establishes communication context (“communicator”)
• MPI point-to-point communication:
  • between pairs of tasks/processes
• MPI collective communication:
  • between all processes or a subgroup
• Clean shutdown by MPI
Figure. Timeline from program startup to program shutdown, with parallel lanes numbered 0 to 4 (labeled “Thread #” in the figure)
C Interface for MPI
 • Required header files:
      # include <mpi.h>
 • Bindings:
    • MPI function calls follow a
      specific naming convention
           error = MPI_Xxxxx(…);
    • MPI constants (global/common):
      all upper case in C
Initialization and Finalization
 • Details of MPI startup are implementation defined
 • First call in MPI program: initialization of parallel machine
       int MPI_Init(int *argc, char ***argv)
     • Must be called in every MPI program, and must be called before any other
       MPI functions
 • Last call: clean shutdown of parallel machine
       int MPI_Finalize(void)
     • Only the “master” process (rank 0) is guaranteed to continue after finalize
     • No other MPI routines may be called after it
Communicator and Rank
• Key questions arise early in a parallel program: How many processes are participating,
  and which one am I?
• MPI_Init() defines the “communicator” MPI_COMM_WORLD comprising all processes
• MPI uses objects called communicators and groups to define which collection of
  processes may communicate with each other.
• Within a communicator, every process has its own unique, integer identifier rank
  assigned by the system when the process initializes. A rank is sometimes also called a
  “task ID”. Ranks are contiguous and begin at zero.
Communicator and Rank
• Communicator defines a set of processes (MPI_COMM_WORLD: all)
• The rank identifies each process within a communicator
   • Obtain rank:
         int rank;
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      • rank = 0,1,2,..., (number of processes in communicator - 1)
      • One process may have different ranks if it belongs to different communicators
• Obtain number of processes in communicator:
      int size;
      MPI_Comm_size(MPI_COMM_WORLD, &size);
Communicator and Rank: & and * in C-Programming (Background)
• A computer memory location has an address and holds a content. The address is a
  number (often expressed in hexadecimal), which is hard for programmers to
  use directly.
• To ease the burden of programming with numerical addresses, early programming
  languages (such as C) introduced the concept of variables.
• A variable is a named location that can store a value of a particular type. Instead of
  numerical addresses, names (or identifiers) are attached to certain addresses.
Communicator and Rank: & and * in C-Programming (Background)
• When a variable is created in C, a memory address is assigned to the variable. The
  memory address is the location where the variable is stored on the computer.
• When we assign a value to the variable, it is stored at this memory address. To obtain
  the address, use the address-of operator (&) on the variable, e.g. &x for a variable x.
• A pointer is a variable that stores the memory address of another variable as its value.
• A pointer is declared with the * operator and the type of the variable it points to
  (e.g. int *p points to an int).
• You can also get the value of the variable the pointer points to by using the * operator
  (the dereference operator), e.g. *p.
Communicator and Rank: & and * in C-Programming (Background)
• In C, function arguments are passed by value by default. This means when you pass a
  variable (e.g., rank) to a function, the function receives a copy of its value. That is, C
  allocates new memory for the parameter inside the function.
• Any modifications to the parameter inside the function do not affect the original
  variable outside.
• To let a function modify the caller's variable, we need to pass a pointer, i.e. the
  memory address of the variable.
Communicator and Rank: & and * in C-Programming (Background)
     int rank;
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     int size;
     MPI_Comm_size(MPI_COMM_WORLD, &size);
 • The "&size" syntax passes the address of the "size" variable to the function.
 • The “&” symbol is used in MPI to pass the address of a variable to a function,
   allowing the function to directly modify the value at that memory location. MPI
   functions typically deliver results this way, through pointer arguments, rather than
   through the return value, which carries the error code instead.
General MPI Program Structure
 • Header declarations
 • Serial code
 • Initialize MPI environment
    • Launches tasks/processes
    • Establishes communication context (“communicator”)
 • Work and message passing calls
 • Terminate the MPI environment
 • Serial code
Step 1: Write the Code: MPI “Hello World!” in C
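The listing for this slide did not survive in the text, so what follows is a minimal sketch of a typical MPI “Hello World!” and not necessarily the lecture's exact code. The file name hello_world.c is taken from the compile command used later; the wording of the printed message is an assumption.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    /* Initialize the MPI environment; must precede every other MPI call. */
    MPI_Init(&argc, &argv);

    /* Who am I, and how many processes are there in MPI_COMM_WORLD? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every process executes this statement independently. */
    printf("Hello World from rank %d of %d\n", rank, size);

    /* Clean shutdown; no MPI routine may be called after this. */
    MPI_Finalize();
    return 0;
}
```

When started with several processes, each one prints its own line, and the order of the lines is not guaranteed.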
Step 2: Compiling the Code
 • If you try to compile this code locally with plain gcc, you might run into problems
 • Instead, compile with an MPI wrapper compiler
    •   Most MPI implementations provide wrapper scripts
    •   Examples: mpicc (MPICH) or mpiicc (Intel MPI, which our HKU HPC system uses)
    •   They behave like normal compilers
Review of Lecture 2: Batch System for Running Jobs with HPC
 HPC system module environment: knowledge of the installed compilers is essential
   ▪ module list
   List the modules currently loaded under your account
   ▪ module avail
   List all the modules available on the HPC system
   ▪ module avail X
   List all installed versions of modules matching “X”
   ▪ module load X (e.g., module load python)
   Load a specific module X into your current account (e.g., load module python/2.7.13)
   ▪ module unload X
   Unload a specific module X from your current account
Step 2: Compiling Code Basic: Load the Right Module for Compilers
 • Running mpicc -o hello_world hello_world.c directly produces errors
 • The MPI module should first be loaded into our current account
 • Our HKU HPC system uses Intel MPI, so use
      • module load impi
      • Note: many compilers are available; here we pick one that works with MPI for
        our particular HPC course
Step 3: Running the Code
 •   Running
      • Startup wrappers: mpirun or mpiexec
         • mpirun -np 4 ./hello_world
      • Details are implementation specific
Review of Lecture 2---Batch System for Running Jobs with HPC
 Submitting job scripts
 A job script must contain directives to inform the batch system about the
 characteristics of the job. These directives appear as comments (#SBATCH) in the job
 script and have to conform to the sbatch syntax.
Step 3: Running the Code
•   Preparing job scripts and running the code with the scheduler
     •   Example: Slurm, as used on our HKU HPC system
     •   mpirun and the scheduler distribute the executable to the right nodes
     •   After preparing the job script, submit it with the sbatch command: sbatch submit-hello.sh
Step 3: Running the Code
•   Running the code
    •   Example to understand distribution of the program
    •   E.g., executing the MPI program on 4 processors
    •   Normally the batch system allocates the processors
    •   Understanding the role of mpirun is important (the command below
        runs hello_world with 4 processes)
         mpirun -np 4 ./hello_world
Summarization of “Hello World!” in MPI
  • All MPI programs begin with MPI_Init and end with
    MPI_Finalize
  • When a program is run with MPI, all the processes are
    grouped in a communicator, MPI_COMM_WORLD
  • Each statement executes independently in each process
Administration
• Assignment 1 has been released
    -   Due Friday, March 14, 2025, 11:59 PM
    -   Account information for accessing the HKU HPC system has already been sent to your
        email this week.
    -   Important: The first accounts can be used from Feb. 27 to Mar. 12,
        11:59 PM.
    -   You cannot access the HPC system after Mar. 12!
• We need a class representative to volunteer to attend the Staff-Student Consultative
   Committee (SSCC) meeting for the 2nd semester of 2024-25! Thank you.
    - Session 1: 2:30 - 3:40 p.m. on Wednesday, 26 March 2025 in Room 301, Run Run
      Shaw Building
Take a break
MPI Point-to-Point Communication
  ▪ Sender
     • Which processor is sending the message?
     • Where is the data on the sending processor?
     • What kind of data is being sent?
     • How much data is there?
  ▪ Receiver
     • Which processor is receiving the message?
     • Where should the data be left on the receiving processor?
     • How much data is the receiving processor prepared to accept?
  ▪ Sender and receiver must pass their information to MPI separately
MPI Point-to-Point Communication
▪ Processors communicate by sending and receiving messages
▪ MPI message: array of elements of a particular type
          rank i                                rank j
          Sender                                Receiver
▪ Data types
   ▪ Basic
   ▪ MPI derived types
Predefined Data Types in MPI (Selection)
 • Data type matching: the same type is required in the send and receive calls
MPI Blocking Point-to-Point Communication
    ▪ Point-to-point: one sender, one receiver
       • identified by rank
    ▪ Blocking: After the MPI call returns,
       • the source process can safely modify the send buffer
       • the receive buffer (on the destination process) contains the
         entire message
       • This is not the “standard” definition of “blocking”
Standard Blocking Send
 int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
              MPI_Comm comm)
 buf            address of send buffer
 count          # of elements
 datatype       MPI data type
 dest           destination rank
 tag            message tag
 comm           communicator
 • void * (void pointer) is a generic pointer that can hold the address of any data
   type in C
 ▪ At completion
    • Send buffer can be reused as you see fit
    • Status of destination is unknown - the message could be anywhere
Standard Blocking Receive
 int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
              MPI_Comm comm, MPI_Status *status)
 buf            address of receive buffer
 count          maximum # of elements expected to be received
 datatype       MPI data type
 source         rank of the sending process
 tag            message tag
 comm           communicator
 status         address of status object. It is a struct that you can access if
                necessary to get more information on the message you just
                received.
 ▪ At completion
    • Message has been received successfully
    • Message length, and possibly the tag and the sender (if wildcards were used),
      are still unknown until they are read from the status object
Source and Tag WildCards
  ▪ In certain cases, we might want to allow receiving messages from any sender or
    with any tag.
  ▪ MPI_Recv accepts wildcards for the source and tag arguments:
    MPI_ANY_SOURCE, MPI_ANY_TAG
  ▪ MPI_ANY_SOURCE and MPI_ANY_TAG indicate that a message from any source, or with
    any tag, should be received
  ▪ Actual source and tag values are available in the status object, in its fields
    MPI_SOURCE and MPI_TAG
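A wildcard receive might look like the sketch below, in which rank 0 collects one integer from every other rank in whatever order the messages arrive. The payload (the square of the rank) and the tag scheme (100 + rank) are arbitrary choices made for this illustration, not part of the lecture.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* Every non-zero rank sends one int to rank 0.
           Payload and tag are illustrative assumptions. */
        int payload = rank * rank;
        MPI_Send(&payload, 1, MPI_INT, 0, 100 + rank, MPI_COMM_WORLD);
    } else {
        /* Rank 0 accepts size-1 messages from any sender with any tag. */
        for (int i = 1; i < size; i++) {
            int value;
            MPI_Status status;
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            /* The status object reveals who actually sent the message. */
            printf("got %d from rank %d with tag %d\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Because the receive order depends on message arrival, the output lines can appear in a different order on each run.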
Received Message Length
  ▪ int MPI_Get_count(const MPI_Status *status, MPI_Datatype
    datatype, int *count)
  status       address of status object
  datatype     MPI data type
  count        the address of the variable that will store the element count
               after the function is executed
  ▪ Determines number of elements received
  MPI_Status s;   /* filled in by a preceding MPI_Recv(..., &s) */
  int count;
  MPI_Get_count(&s, MPI_DOUBLE, &count);
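To see MPI_Get_count in context, here is a small sketch in which the receiver offers a buffer for up to 10 doubles while the sender transmits only 3. The buffer capacity MAX_ELEMS and the data values are assumptions chosen for the example; it needs at least two processes.

```c
#include <mpi.h>
#include <stdio.h>

#define MAX_ELEMS 10   /* capacity of the receive buffer (illustrative) */

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        double data[3] = {1.0, 2.0, 3.0};
        MPI_Send(data, 3, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double buf[MAX_ELEMS];
        MPI_Status s;
        int count;

        /* The count argument is the buffer capacity, not the message length. */
        MPI_Recv(buf, MAX_ELEMS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &s);

        /* Ask the status object how many doubles actually arrived. */
        MPI_Get_count(&s, MPI_DOUBLE, &count);
        printf("rank 1 received %d doubles\n", count);
    }

    MPI_Finalize();
    return 0;
}
```

With two or more processes, rank 1 reports 3 doubles even though its buffer could hold 10.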
Requirements for Point-to-Point Communication
 ▪ For a communication to succeed:
    • The sender must specify a valid destination.
    • The receiver must specify a valid source rank (or MPI_ANY_SOURCE).
    • The communicator used by the sender and receiver must be the same (e.g.,
      MPI_COMM_WORLD).
    • The tags specified by the sender and receiver must match (or MPI_ANY_TAG for
      receiver).
    • The data types of the messages being sent and received must match.
    • The receiver's buffer must be large enough to hold the received message.
Beginner’s MPI Toolbox
   • MPI_Init( ): Let's get going. Initializes the MPI execution environment.
   • MPI_Comm_size( ): How many are we?
   • MPI_Comm_rank( ): Who am I?
   • MPI_Send( ): Send data to someone else.
   • MPI_Recv( ): Receive data from someone/anyone.
   • MPI_Get_count( ): How many items have I received?
   • MPI_Finalize( ): Finish off. Terminates the MPI execution environment.
   • Send/receive buffer may safely be reused after the call has completed
   • MPI_Send() must specify a concrete destination rank and tag; MPI_Recv() may use
     wildcards (MPI_ANY_SOURCE, MPI_ANY_TAG)
Example 1. Exchanging Data with MPI Send/Receive (Pingpong.c)
Example 1. Exchanging Data with MPI Send/Receive
 •   The MPI_Send() function is used to send a certain number of elements of
     some datatype to another MPI rank; this routine blocks until the send buffer
     can safely be reused, which does not guarantee that the destination process
     has already received the message
 •   The MPI_Recv() function is used to receive a certain number of elements
     of some datatype from another MPI rank; this routine blocks until the
     message has arrived in the receive buffer (and thus has been sent by
     the source process)
 •   This form of MPI communication is called “blocking”
 •   int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
     int tag, MPI_Comm comm);
 •   int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
     MPI_Comm comm, MPI_Status *status);
Example 1. Exchanging Data with MPI Send/Receive
 ▪ Take some time to really understand why source and dest are equal here for each rank
Example 1. Exchanging Data with MPI Send/Receive
 •   MPI_Status is a struct that holds information about the message received by
     the corresponding MPI function call
 •   We use the MPI_Status in our example to check how many chars were actually
     transferred, by using the MPI_Get_count() function
 •   As a simple debugging aid, we can check whether the source and tag recorded in
     the MPI_Status match what we intended in the program
Example 1. Exchanging Data with MPI Send/Receive
 • Load MPI into our account: module load impi
 • Compile with the MPI wrapper compiler
      • mpiicc pingpong.c -o pingpong
 •   Prepare the batch script submit-pingpong.sh
 •   Submit the job: sbatch submit-pingpong.sh
 •   View the results
Example 1. Extension of Scontrol Command
 • scontrol show job <job_id> is used to see the status of a specific job ID
Summary of Beginner’s MPI Toolbox
   ▪ Starting up and shutting down the “parallel program” with MPI_Init() and
     MPI_Finalize()
   ▪ MPI task (“process”) identified by rank (MPI_Comm_rank() )
   ▪ Number of MPI tasks: MPI_Comm_size()
   ▪ Startup process is very implementation dependent
   ▪ Simple, blocking point-to-point communication with MPI_Send() and
     MPI_Recv()
      • “Blocking” == buffer can be reused as soon as call returns
   ▪ Message matching
   ▪ Timing functions
Thank you very much for choosing this course!
           Give us your feedback!
  https://forms.gle/zDdrPGCkN7ef3UG5A