Parallel Architecture and Execution
Lecture 2
January 8, 2025
Parallel Application Performance
Depends on
• Parallel architecture
• Algorithm
2
Saturation – Example 1
#Processes Time (sec) Speedup Efficiency
1 0.025 1 1.00
2 0.013 1.9 0.95
4 0.010 2.5 0.63
8 0.009 2.8 0.35
12 0.007 3.6 0.30
3
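As a cross-check of the table above, the Speedup and Efficiency columns can be recomputed directly from the measured times (a small illustrative C snippet; only the timings from Example 1 are used):

#include <stdio.h>

/* Recompute the Speedup and Efficiency columns of Example 1:
   speedup = T(1) / T(P), efficiency = speedup / P. */
int main(void) {
    int    procs[]  = {1, 2, 4, 8, 12};
    double time_s[] = {0.025, 0.013, 0.010, 0.009, 0.007};
    int n = sizeof(procs) / sizeof(procs[0]);

    printf("%11s %10s %8s %11s\n", "#Processes", "Time (s)", "Speedup", "Efficiency");
    for (int i = 0; i < n; i++) {
        double speedup    = time_s[0] / time_s[i];
        double efficiency = speedup / procs[i];
        printf("%11d %10.3f %8.2f %11.2f\n", procs[i], time_s[i], speedup, efficiency);
    }
    return 0;
}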
Saturation – Example 2
4
Saturation – Example 3
Source: GGKK Chapter 5
5
Efficiency (Adding numbers)
[Figure: efficiency vs. problem size]
Source: GGKK Chapter 5
6
Limitations of Parallelization
• Overhead
  • E.g., communication
  • Over-decomposition (too little work per process/core)
• Idling
  • Load imbalance
  • Synchronization
• Serialization
7
Execution Profile
Execution Profile of a Hypothetical Parallel Program
Source: GGKK Chapter 5
8
Performance
• Sequential
  • Input size
• Parallel
  • Input size
  • Number of processing elements (PE)
  • Communication speed
  • …
9
Scaling Deep Learning Models
[Figure: problem size vs. number of PEs]
Source: Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes
“The fastest known sequential algorithm for a problem may be difficult or impossible to parallelize.” – GGKK
10
Parallelization
• Speedup: S = (sequential time) / (parallel time)
• Efficiency: E = S / P, where P is the number of processing elements
11
Sum of Numbers Speedup
Naïve parallelization method:
- Compute partial sums in parallel
- Send the partial results to one process (all-to-one)
- Compute the final result on that process

Speedup = N / (N/P + P + P)
12
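A minimal MPI sketch of this naïve scheme (my illustration, not code from the slides; the problem size N and the data are assumptions): each process sums its N/P elements, every non-root rank sends its partial sum to rank 0, and rank 0 adds up the P partial results.

#include <mpi.h>
#include <stdio.h>

/* Naive parallel sum: local compute (N/P), all-to-one sends (received
   serially at rank 0), then a final sum of the P partial results. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 1 << 20;                 /* assumed total problem size */
    long chunk = N / size;                  /* assume size divides N evenly */
    long lo = rank * chunk, hi = lo + chunk;

    long local = 0;
    for (long i = lo; i < hi; i++)          /* compute local partial sum */
        local += i;

    if (rank != 0) {
        MPI_Send(&local, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        long total = local, partial;
        for (int src = 1; src < size; src++) {   /* receive P-1 partial sums */
            MPI_Recv(&partial, 1, MPI_LONG, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("sum = %ld\n", total);
    }
    MPI_Finalize();
    return 0;
}

The serialized receives at rank 0 and the final summation of P partial results are what the two P terms in the denominator of the speedup expression account for.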
Parallel Sum (Optimized)
Source: GGKK Chapter 5
13
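The GGKK figure referenced here shows the partial sums being combined pairwise over log P steps. A minimal MPI sketch of the same idea (my illustration, with the same assumed local-sum setup as above), using MPI_Reduce, which typically performs such a tree-based combination internally:

#include <mpi.h>
#include <stdio.h>

/* Optimized parallel sum: partial sums are combined by MPI_Reduce,
   which can use a tree pattern taking O(log P) communication steps. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 1 << 20;                 /* assumed total problem size */
    long chunk = N / size;
    long lo = rank * chunk, local = 0;
    for (long i = lo; i < lo + chunk; i++)  /* compute local partial sum */
        local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %ld\n", total);

    MPI_Finalize();
    return 0;
}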
Sum of Numbers Speedup (Optimized)
S1 = N / (N/P + 2P)            E1 = ?

S2 = N / (N/P + 2 log P)       E2 = 1 / (2 P log P / N + 1)
14
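A quick check of how E2 follows from S2 (my algebra, not from the slide):

E2 = S2 / P = N / (N + 2 P log P) = 1 / (2 P log P / N + 1)

The 2 log P term reflects the tree-based sum: log P steps, each costing one communication plus one addition (unit costs assumed).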
Efficiency (Adding numbers)
Homework: Analyze the measured efficiency based on your derivation
Source: GGKK Chapter 5
15
A Limitation of Parallel Computing
Amdahl's Law:

Speedup S = 1 / ((1 − f) + f / P)

where f is the fraction of the code that is parallelizable and P is the number of processing elements.
16
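As an illustration (my numbers, not the slide's): with f = 0.95, the speedup can never exceed 1 / (1 − f) = 20, no matter how many PEs are used. A small C snippet tabulating S for a few values of P:

#include <stdio.h>

/* Amdahl's Law: S = 1 / ((1 - f) + f / P). */
double amdahl(double f, int P) {
    return 1.0 / ((1.0 - f) + f / P);
}

int main(void) {
    double f = 0.95;                     /* assumed parallelizable fraction */
    int    P[] = {1, 2, 8, 64, 1024};
    int n = sizeof(P) / sizeof(P[0]);
    for (int i = 0; i < n; i++)
        printf("P = %4d  ->  S = %.2f\n", P[i], amdahl(f, P[i]));
    /* As P grows, S approaches 1 / (1 - f) = 20. */
    return 0;
}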
Parallel Architecture
17
System Components
• Processor
• Memory
• Network
• Storage
NUMA
Source: https://www.sciencedirect.com/topics/computer-science/non-uniform-memory-access
18
Memory Hierarchy
A multicore SMP architecture
Image Source: The Art of Multiprocessor Programming – Herlihy, Shavit
19
Memory Access Times
Source: MIT CSAIL
20
Processor vs. Memory
“While clock rates of high-end processors have
increased at roughly 40% per year over the last decade,
DRAM access times have only improved at the rate of
roughly 10% per year over this interval.”
- Introduction to Parallel Computing by Ananth Grama
et al. (GGKK)
21
NUMA Nodes
Utility: lstopo (hwloc package)
AMD Bulldozer Memory Topology (Source: Wikipedia)
22
NUMA Node (Zoomed)
Utility: lstopo (hwloc package)
AMD Bulldozer Memory Topology (Source: Wikipedia)
23
Effective Memory Access Times
24
Memory Placement
Lepers et al., “Thread and Memory Placement on NUMA Systems: Asymmetry Matters”, USENIX ATC 2015.
25
Connect Multiple Compute Nodes
Interconnect
Source: hector.ac.uk
26
Parallel Programming Models
• Shared memory
• Distributed memory
27
Shared Memory
• Shared address space
• Time taken to access certain memory words is longer (NUMA)
• Need to worry about concurrent access
• Programming paradigms – Pthreads, OpenMP (see the sketch below)
[Figure: two threads sharing one address space]
28
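A minimal shared-memory sketch in C with OpenMP (my example; the array and its size are arbitrary): all threads read the same shared array, and the concurrent updates to the shared result are coordinated with a reduction clause.

#include <omp.h>
#include <stdio.h>

/* Shared address space: every thread sees the same array, but concurrent
   updates to shared data must be coordinated (here via a reduction). */
int main(void) {
    const int N = 1000000;
    static int a[1000000];
    for (int i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    #pragma omp parallel for reduction(+:sum)   /* threads share 'a'; each gets a private partial sum */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}

Compile with OpenMP enabled, e.g. gcc -fopenmp.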
Intel Processors (Latest)
https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html
29
Cluster of Compute Nodes
30
Message Passing
• Distinct address space per process
• Multiple processing nodes
• Basic operations are send and receive
31
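A minimal sketch of the two basic operations (my example, not from the slides): rank 0 sends one integer to rank 1, which receives it.

#include <mpi.h>
#include <stdio.h>

/* The two basic message-passing operations: send and receive. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes, e.g. mpirun -np 2.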
Interprocess Communication
32
Our Parallel World
[Figure: cores, processes, and memory on compute nodes]
Distributed memory programming
• Distinct address space
• Explicit communication
33
Distinct Process Address Space
Process 0: x = 1, y = 2   ...  x++  ...  print x, y   → prints 2, 2
Process 1: x = 10, y = 20 ...  x++  ...  print x, y   → prints 11, 20
34
Distinct Process Address Space
Process 0: x = 1, y = 2  ...  x++; y++  ...  print x, y   → prints 2, 3
Process 1: x = 1, y = 2  ...  y++       ...  print x, y   → prints 1, 3
35
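The same behaviour expressed as an MPI program (my sketch, not from the slides): each process has its own copies of x and y, so an update in one process is invisible to the others.

#include <mpi.h>
#include <stdio.h>

/* Each MPI process has its own address space, so x and y are per-process
   variables; changes in one process are not seen by the others. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x = 1, y = 2;
    if (rank == 0) { x++; y++; }   /* only process 0 increments both */
    else           { y++; }        /* other processes increment only y */

    printf("process %d: x = %d, y = %d\n", rank, x, y);
    /* Prints "x = 2, y = 3" on rank 0 and "x = 1, y = 3" elsewhere. */

    MPI_Finalize();
    return 0;
}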
Adapted from Neha Karanjkar’s slides
36
Our Parallel World
[Figure: cores and processes across compute nodes]
NO centralized server/master
37
Message Passing Interface
Message Passing Interface (MPI)
• Efforts began in 1991 by Jack Dongarra, Tony Hey, and David W. Walker
• Standard for message passing in a distributed memory environment
• MPI Forum in 1993
• Version 1.0: 1994
• Version 2.0: 1997
• Version 3.0: 2012
• Version 4.0: 2021
• Version 5.0 (under discussion)
39
MPI Implementations
“The MPI standard includes point-to-point message-passing,
collective communications, group and communicator concepts,
process topologies, environmental management, process
creation and management, one-sided communications,
extended collective operations, external interfaces, I/O, some
miscellaneous topics, and a profiling interface.” – MPI report
• MPICH (ANL)
• MVAPICH (OSU)
• Open MPI
• Intel MPI
• Cray MPI
40
Programming Environment
• Shell scripts (e.g. bash)
• ssh basics
• E.g. ssh -X
•…
• Mostly in C/C++
• Compilation, Makefiles, ...
• Linux environment variables
• PATH
• LD_LIBRARY_PATH
•…
41
MPI Installation – Laptop
• Linux or Linux VM on Windows
• apt/snap/yum/brew
• Windows
• No support
• https://www.mpich.org/documentation/guides/
42
MPI
• Standard for message passing
• Explicit communications
• Medium programming complexity
• Requires specifying the communication scope (communicator)
43
Simple MPI Code
44
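The code on this slide is not reproduced here; a minimal MPI "hello world" along the same lines (my sketch, saved as program.c to match the commands on the next slide) would look like this:

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI program: each process reports its rank and the total
   number of processes in MPI_COMM_WORLD. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}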
MPI Code Execution Steps
• Compile
• mpicc -o program.x program.c
• Execute
• mpirun -np 1 ./program.x (mpiexec -np 1 ./program.x)
• Runs 1 process on the launch/login node
• mpirun -np 6 ./program.x
• Runs 6 processes on the launch/login node
45
Output – Hello World
mpirun -np 20 ./program.x
(Each of the 20 processes prints its own hello line; the ordering of the lines across processes is nondeterministic.)
46