Lecture 3
Introduction to Parallelization
August 7, 2023
Logistics
• Office hours: 5 – 6 PM
– Chetna (RM) chetna@cse
– Muzafar (KD-213) muzafarwan@cse
– Vishal Singh (RM-505) vshlsng@cse
• Group formation
– Email by August 9
– Include names, roll numbers, email-ids
– vdeka@cse, chetna@cse
2
Parallelism Everywhere
Source: https://gilmour.com/
Why Parallel?
Task: Find the average age of Indians
India's 2020 population was estimated at 1,380,004,385 people at mid-year, according to UN data.
Time (1 human): > 40 years
Time (1 CPU): 10 s
Time (2 CPUs): 5 s
Time (4 CPUs): 3 s
Parallelism
A parallel computer is a collection of processing
elements that communicate and cooperate to solve
large problems fast.
– Almasi and Gottlieb (1989)
5
Why Fast?
6
Basic Computing Unit
[Figure: a processing unit (CPU/core) connected to memory – example: Intel Core i7. Courtesy: www.intel.com]
System – Simplified View
[Figure: CPU cores with fast memory (cache), main memory, and disk]
8
Multicore Era
• Intel 4004 (1971) – single core, single chip
• Cray X-MP (1982) – single core, multiple chips
• Hydra (1996) – multiple cores, single chip
• IBM POWER4 (2001) – multiple cores, multiple chips
9
Moore’s Law
The number of CPU cores per node has increased.
Gordon Moore
[Source: Wikipedia]
Parallel Computing
[Figure: a supercomputer/cluster/data center – compute nodes with disks connected by a network]
Network is the backbone for data communication
Parallel Computer
Compute nodes
Domain Decomposition
Compute nodes
Discretization
Gridded mesh for a global model [Credit: Tompkins, ICTP]
15
Data Bottleneck
[Figure: compute nodes reading/writing to shared storage; congestion builds up between the nodes and the storage]
Example: 'Age' is the data here.
Average – Serial vs. Parallel
Serial:
    for i = 1 to N
        sum += a[i]
    avg = sum / N
Parallel (P processes):
    for i = 1 to N/P
        sum += a[i]
    collect the partial sums and compute the average
17
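As a concrete reference point, here is a minimal C sketch of the serial version, assuming the ages are stored in an array a of length N (the function name serial_average is illustrative, not from the slides):

double serial_average(const int *a, long N)
{
    long long sum = 0;
    for (long i = 0; i < N; i++)
        sum += a[i];               // accumulate all ages
    return (double) sum / N;       // average over the whole array
}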
Parallel Computer
A parallel computer is a collection of processing
elements that communicate and cooperate to solve
large problems fast.
– Almasi and Gottlieb (1989)
18
Parallel Average
[Figure: four processes (0–3), each on its own core with its own memory]
Process 0: for i = 0 to N/P: sum += a[i]
Process 1: for i = N/P to 2N/P: sum += a[i]
Process 2: for i = 2N/P to 3N/P: sum += a[i]
Process 3: for i = 3N/P to N: sum += a[i]
19
Parallel Code Example
// local computation at every process/thread
for (i = N/P * id; i < N/P * (id + 1); i++)
    localsum += a[i]
// collect localsum at one of the ranks, add up, and compute the average
20
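A slightly fleshed-out C version of the same local computation, where id is the process/thread index and P the number of workers (both assumed to be supplied by the programming model; the function name local_sum is illustrative):

long long local_sum(const int *a, long N, int id, int P)
{
    long chunk = N / P;            // assume N is divisible by P for simplicity
    long begin = chunk * id;
    long end   = chunk * (id + 1);
    long long sum = 0;
    for (long i = begin; i < end; i++)
        sum += a[i];               // each worker sums only its own block
    return sum;                    // partial sums are combined afterwards
}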
Performance Measure
• Speedup: S_P = Time(1 processor) / Time(P processors)
• Efficiency: E_P = S_P / P
21
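Worked example (using the timings on the next slide): S_2 = 0.025 / 0.013 ≈ 1.9, so E_2 = 1.9 / 2 ≈ 0.95; with 8 processes, S_8 = 0.025 / 0.009 ≈ 2.8 and E_8 = 2.8 / 8 ≈ 0.35.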
Parallel Performance (Parallel Sum)
Parallel efficiency of summing 10^7 doubles
#Processes Time (sec) Speedup Efficiency
1 0.025 1 1.00
2 0.013 1.9 0.95
4 0.010 2.5 0.63
8 0.009 2.8 0.35
12 0.007 3.6 0.30
22
Ideal Speedup
[Figure: speedup vs. number of processors, with linear, superlinear, and sublinear speedup curves]
23
Scalability Bottleneck
Performance of weather simulation application
24
Programming
25
Parallel Programming Models
• Libraries – MPI, TBB, Pthreads, OpenMP, …
• New languages – Haskell, X10, Chapel, …
• Extensions – Coarray Fortran, UPC, Cilk, OpenCL, …
• Shared memory
– OpenMP, Pthreads, CUDA, …
• Distributed memory
– MPI, UPC, …
• Hybrid
– MPI + OpenMP, MPI + CUDA
26
Sharing Data
27
Parallel Programming Models
Shared memory programming – OpenMP, Pthreads
Distributed memory programming – MPI
[Figure: cores with caches; one process/thread runs on each core]
28
Shared Memory Programming
• Shared address space
• Memory access time can vary with the location of the data (NUMA: non-uniform memory access)
• Programming paradigms – Pthreads, OpenMP
• Need to worry about concurrent access
Memory
CPU cores
29
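To see why concurrent access matters, consider many threads updating one shared variable. A minimal sketch using OpenMP (introduced on the next slides; not code from the lecture): without protection the updates interleave and increments are lost, while an atomic update (or a critical section/reduction) keeps the result correct at some cost.

long shared_sum = 0;

void accumulate(const int *a, long N)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        #pragma omp atomic
        shared_sum += a[i];    // protected update; without the atomic this is a data race
    }
}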
Threads
From Tim Mattson’s slides
30
OpenMP (Open Multiprocessing)
• Standard for shared memory programming
– Compiler directives
– Runtime routines
– Environment variables
• OpenMP Architecture Review Board
• First released in Nov’97
• Current version 5.1 (Nov’20)
31
OpenMP Example
• Thread-based
• Fork-join model
#pragma omp parallel   // fork: spawn a default number of threads
{
    …
}                      // join
32
OpenMP
$ gcc -fopenmp -o foo foo.c
33
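The source of foo.c is not reproduced in these notes; a minimal OpenMP program of the kind compiled with the command above might look like this sketch:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // fork: one thread per core by default, or as set by OMP_NUM_THREADS
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // join: all threads synchronize here
    return 0;
}

Run as, for example, OMP_NUM_THREADS=4 ./foo to request four threads.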
OpenMP
34
OpenMP
35
Output
36
OpenMP – Parallel Sum
Work on distinct data concurrently
37
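One common way to express this in OpenMP (a sketch, not necessarily the exact code on the slide) is a reduction: each thread gets a private partial sum over its share of the iterations, and the partial sums are combined at the join, so no two threads update the same variable concurrently.

double parallel_average(const int *a, long N)
{
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];               // each thread works on distinct iterations
    return (double) sum / N;       // partial sums already combined by the reduction
}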
OpenMP Timing
double stime = omp_get_wtime();
#pragma omp parallel
{
…
}
double etime = omp_get_wtime();
38
Multiple Systems
[Figure: multiple systems, each with cores, caches, and processes]
39
Distributed Memory Systems
• Networked systems
• Distributed memory
• Local memory
• Remote memory
• Parallel file system
[Figure: a cluster of networked nodes, 64–192 GB RAM per node, each holding its own code and data]
40
MPI (Message Passing Interface)
• Standard for message passing in a distributed
memory environment (most widely used
programming model in supercomputers)
• Efforts began in 1991 by Jack Dongarra, Tony
Hey, and David W. Walker
• MPI Forum formed in 1993
– Version 1.0: 1994
– Version 4.0: 2021
41
Process - Distinct Address Space
[Figure: four processes, each on its own core with its own memory holding its local data]
42
Multiple Processes on a Single Node
From N. Karanjkar’s slides
43
Multiple Processes on Multiple Nodes
Node 1
Node 2
44
Communication using Messages
[Figure: four processes, each with its own memory and local data, exchanging messages]
Process 0: for i = 0 to N/P: sum += a[i]
Process 1: for i = N/P to 2N/P: sum += a[i]
Process 2: for i = 2N/P to 3N/P: sum += a[i]
Process 3: for i = 3N/P to N: sum += a[i]
45
Communication using Messages
[Figure: four processes, each with its own memory and local data]
Each process executes the same instruction stream (Instruction 1, Instruction 2, …) on its own local data – SIMD.
46
Message Passing
[Figure: timeline of a message sent from Process 0 to Process 1]
47
MPI Programming
48
MPI Programming
mpicc -o program.x program.c
49
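The source of program.c is not reproduced here; a minimal MPI program of the kind compiled with the command above might look like this sketch:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                  // start the MPI runtime
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // my id within the communicator
    MPI_Comm_size(MPI_COMM_WORLD, &size);    // total number of processes
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                          // shut down MPI
    return 0;
}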
Communication using Messages
[Figure: four processes, each with its own memory and local data]
Process 0: for i = 0 to N/P: sum += a[i]
Process 1: for i = N/P to 2N/P: sum += a[i]
Process 2: for i = 2N/P to 3N/P: sum += a[i]
Process 3: for i = 3N/P to N: sum += a[i]
for i = N/P * rank ; i < N/P * (rank+1) ; i++
localsum += a[i]
Collect localsum, add up at one of the ranks
50
Communication using Messages
[Figure: processes on separate cores and memories exchanging messages]
51
Simplest Communication Primitives
• MPI_Send
• MPI_Recv
52
MPI Programming
SENDER:
int MPI_Send (const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
RECEIVER:
int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
53
MPI Programming
char message[20];
int myrank;
MPI_Status status;
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
// Sender process
if (myrank == 0) /* code for process 0 */
{
    strcpy (message, "Hello, there");
    MPI_Send (message, strlen(message)+1, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
}
// Receiver process
else if (myrank == 1) /* code for process 1 */
{
    MPI_Recv (message, 20, MPI_CHAR, 0, 99, MPI_COMM_WORLD, &status);
    printf ("received :%s\n", message);
}
54
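To try this with two processes, something like mpirun -np 2 ./program.x works with most MPI installations (the exact launcher, e.g. mpiexec or srun, depends on the system).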
MPI – Parallel Sum
Assume the data array resides in the memory of process 0 initially
MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
// Sender process
if (myrank == 0) /* code for process 0 */
{
    for (int rank = 1; rank < size; rank++) {
        start = rank * (N / size);   // element index where this rank's block begins
        MPI_Send (data + start, N/size, MPI_INT, rank, 99, MPI_COMM_WORLD);
    }
}
else /* code for processes 1 … size-1 */
{
    MPI_Recv (data, N/size, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
}
55
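Once the blocks are distributed, each rank can sum its own portion and return the partial result to rank 0, roughly as in this sketch (rank 0 keeps the first block for itself; the message tag 100 and the variable names total/remote are illustrative):

// every rank (including 0) sums the N/size elements it now holds in data[0 .. N/size-1]
long long localsum = 0;
for (int i = 0; i < N/size; i++)
    localsum += data[i];

if (myrank == 0) {
    long long total = localsum, remote;
    for (int rank = 1; rank < size; rank++) {
        MPI_Recv (&remote, 1, MPI_LONG_LONG, rank, 100, MPI_COMM_WORLD, &status);
        total += remote;             // accumulate the partial sums
    }
    printf ("average = %f\n", (double) total / N);
} else {
    MPI_Send (&localsum, 1, MPI_LONG_LONG, 0, 100, MPI_COMM_WORLD);
}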
MPI Timing
double stime = MPI_Wtime();
…
…
…
double etime = MPI_Wtime();
56
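Each rank measures its own elapsed time (etime - stime); since ranks finish at slightly different moments, one common convention (a sketch, not from the slides; MPI_Reduce is not covered in this lecture) is to report the maximum across ranks:

double elapsed = etime - stime, maxtime;
MPI_Reduce (&elapsed, &maxtime, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (myrank == 0)
    printf ("time = %f s\n", maxtime);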
Parallelization
57
Interpolation
58
Interpolation
59
Range/Value Query
Input: File
Output: File
60
Query on a Million Processes
Compute nodes
Unstructured Mesh
Source: COMSOL
62
Unstructured Mesh
Obayashi et al., Multi-objective Design Exploration Using Efficient Global Optimization
63
Thank You
64