What is parallelism?
Victor Eijkhout
Fall 2022
Justification
Parallel computing has been a necessity for decades in computational
science. Here we discuss some of the basic concepts. Actual parallel
programming will be discussed in other lectures.
2
Basic concepts
3
1 The basic idea
Parallelism is about doing multiple things at once.
• Hardware: vector instructions, multiple cores, nodes in a cluster.
• Algorithm: can you think of examples?
4
2 Simple example
Summing two arrays together:
for (i=0; i<n; i++)
a[i] = b[i] + c[i];
Parallel: every processing element does
for ( i in my_subset_of_indices )
a[i] = b[i] + c[i];
Ideally, the time goes down linearly with the number of processors (see the sketch below).
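One concrete realization (a minimal sketch, not from the slides) uses OpenMP, which appears later in these lectures; the directive splits the iterations over the available threads.
void vector_add( double *a, const double *b, const double *c, int n ) {
#pragma omp parallel for    /* each thread gets a subset of the indices */
  for (int i=0; i<n; i++)
    a[i] = b[i] + c[i];
}
(Compile with an OpenMP flag, for instance -fopenmp.)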
5
3 Differences between operations
Vector addition:
for (i=0; i<n; i++)
  a[i] = b[i] + c[i];

Summation (reduction):
s = 0;
for (i=0; i<n; i++)
  s += x[i];
• Compare operation counts
• Compare behavior on single processor. What about multi-core?
• Other thoughts about parallel execution?
6
4 Summing
Naive algorithm:
s = 0;
for (i=0; i<n; i++)
  s += x[i];

Recoding:
for (s=2; s<2*n; s*=2)
  for (i=0; i<n; i+=s)
    x[i] += x[i+s/2];
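For reference, a minimal complete C version of the recoded summation (a sketch under assumptions: it bounds-checks so that n need not be a power of two, it destroys the input, and it leaves the result in x[0]). The point is that within each outer iteration the updates are independent and can be done in parallel.
double tree_sum( double *x, int n ) {
  for (int s=2; s<2*n; s*=2)        // distance between the elements being combined
    for (int i=0; i+s/2<n; i+=s)    // skip pairs that would run off the end
      x[i] += x[i+s/2];
  return x[0];
}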
7
5 And then there is hardware
Topology of the processors:
increasing distance between them puts a limit on parallel speedup
8
Theoretical concepts
9
Efficiency and scaling
10
6 Speedup
• Single processor time T1 , on p processors Tp
• speedup is Sp = T1 /Tp , with Sp ≤ p
• efficiency is Ep = Sp /p, 0 < Ep ≤ 1
But:
• Is T1 based on the same algorithm? The parallel code?
• Sometimes superlinear speedup.
• Is T1 measurable? Can the problem be run on a single
processor?
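Illustration (assumed numbers): if T1 = 100s and T8 = 20s, then S8 = 5 and E8 = 5/8 ≈ 0.63.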
11
7 Amdahl’s law
Let’s assume that part of the application can be parallelized, part not.
(Examples?)
• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
12
8 Amdahl’s law, analysis
• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
• T1 = (Fs + Fp )T1 = Fs T1 + Fp T1
• Amdahl’s law: Tp = Fs T1 + Fp T1 /p
• p → ∞: Tp ↓ Fs T1
• Speedup is limited by Sp ≤ 1/Fs ; efficiency is a decreasing function, Ep ∼ 1/p.
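Illustration (assumed numbers): with Fs = 0.05, the speedup can never exceed 1/Fs = 20; for p = 100 one gets Sp = 1/(0.05 + 0.95/100) ≈ 16.8, so Ep ≈ 0.17.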
Do you see problems with this?
13
9 Amdahl’s law with communication overhead
• Communication overhead independent of p: Tp = T1 (Fs + Fp /p) + Tc
• assume fully parallelizable: Fp = 1
• then Sp = T1 / (T1 /p + Tc )
• For reasonable speedup: Tc ≪ T1 /p, or p ≪ T1 /Tc :
number of processors limited by ratio of scalar execution time and
communication overhead
14
10 Gustafson’s law
Reconstruct the sequential execution from the parallel, then analyze
efficiency.
15
11 Gustafson’s law
• Let Tp = Fs + Fp ≡ 1
• then T1 = Fs + p · Fp
• Speedup:
  Sp = T1 /Tp = (Fs + p · Fp )/(Fs + Fp ) = Fs + p · Fp = p − (p − 1) · Fs
• Efficiency Ep = Sp /p = 1 − Fs (p − 1)/p is a slowly decreasing function of p
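Illustration (same assumed numbers as before): with Fs = 0.05 and p = 100, Gustafson gives Sp = 100 − 99 · 0.05 = 95.05, so Ep ≈ 0.95, in contrast with the Amdahl bound of 20.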
16
12 Scaling
• Amdahl’s law: strong scaling
same problem over increasing processors
• Often more realistic: weak scaling
increase problem size with number of processors,
for instance keeping memory constant
• Weak scaling goal: Ep > c for some constant c > 0
• example (below): dense linear algebra
17
13 Strong scaling
• Let M be the total memory needed for your problem.
• Let P be the number of processors
⇒ memory per processor is M /P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)
18
14 Weak scaling
• Let M be the memory per processor.
• Let P be the number of processors
⇒ total memory is M · P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)
19
15 Simulation scaling
• Assumption: simulated time S and running time T are kept constant, while the precision is increased
• m memory per processor, and P the number of processors;
  M = Pm total memory;
  d the number of space dimensions of the problem, typically 2 or 3;
  ∆x = 1/M^{1/d} grid spacing.
• stability:
  ∆t = ∆x = 1/M^{1/d}   hyperbolic case
  ∆t = ∆x² = 1/M^{2/d}   parabolic case
• With a simulated time S:
  k = S /∆t time steps.
20
16 Simulation scaling con’td
• Assume the time steps are parallelizable:
  T = k M /P = (S /∆t) m.
  Setting T /S = C, we find
  m = C ∆t,
  that is, memory per processor goes down:
  m = C ∆t = C · 1/M^{1/d}   hyperbolic case
  m = C ∆t = C · 1/M^{2/d}   parabolic case
• Substituting M = Pm, we find ultimately
  m = C · 1/P^{1/(d+1)}   hyperbolic case
  m = C · 1/P^{2/(d+2)}   parabolic case
21
Critical path analysis
22
17 Critical path
• The sequential fraction contains a critical path: a sequence of
operations that depend on each other.
• Example?
• T∞ = time with unlimited processors: length of critical path.
23
18 Brent’s theorem
Let m be the total number of tasks, p the number of processors, and t
the length of a critical path. Then the computation can be done in
Tp ≤ t + (m − t)/p .
• Time equals the length of the critical path . . .
• . . . plus the remaining work as parallel as possible.
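Illustration (assumed numbers): m = 1000 unit tasks with a critical path of length t = 10 on p = 100 processors gives Tp ≤ 10 + 990/100 = 19.9, while T1 = 1000 and T∞ = 10.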
24
Granularity
25
19 Definition
Definition: granularity is a measure of how much work (how many operations) can be
performed between synchronization points
26
20 Instruction level parallelism
a ← b+c
d ← e∗f
For the compiler / processor to worry about
27
21 Data parallelism
for (i=0; i<1000000; i++)
a[i] = 2*b[i];
• Array processors, vector instructions, pipelining, GPUs
• Sometimes harder to discover
• Often used mixed with other forms of parallelism
28
22 Task-level parallelism
Procedure SearchInTree(root):
  if optimal(root) then
    exit
  else
    parallel: SearchInTree(leftchild), SearchInTree(rightchild)
Unsynchronized tasks: fork-join
General scheduler:
  while there are tasks left do
    wait until a processor becomes inactive;
    spawn a new task on it
29
23 Conveniently parallel
Example: Mandelbrot set
Parameter sweep,
often best handled by external tools
30
24 Medium-grain parallelism
Mix of data parallel and task parallel
my_lower_bound = // some processor-dependent number
my_upper_bound = // some processor-dependent number
for (i=my_lower_bound; i<my_upper_bound; i++)
// the loop body goes here
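A hedged sketch of how the processor-dependent bounds might be computed for a block distribution; the names myrank and nprocs are illustrative, not from the slides.
void my_bounds( int myrank, int nprocs, int n, int *lo, int *hi ) {
  int blocksize = (n+nprocs-1)/nprocs;   // ceiling(n/nprocs)
  *lo = myrank*blocksize;
  *hi = *lo + blocksize;
  if (*hi>n) *hi = n;                    // the last processor may get fewer iterations
}
Calling my_bounds(myrank,nprocs,n,&my_lower_bound,&my_upper_bound) then supplies the bounds used in the loop above.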
31
LU factorization analysis
32
25 Algorithm
for k = 1, n − 1:
for i = k + 1 to n:
aik ← aik /akk
for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj
Can the k loop be done in parallel? The i , j loops?
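For concreteness, a minimal sequential C sketch of this algorithm (zero-based indexing, no pivoting; the array type is an assumption for illustration):
void lu_factor( int n, double a[n][n] ) {
  for (int k=0; k<n-1; k++) {
    for (int i=k+1; i<n; i++)
      a[i][k] = a[i][k] / a[k][k];          // multipliers: column k of L
    for (int i=k+1; i<n; i++)               // rank-1 update of the trailing block
      for (int j=k+1; j<n; j++)
        a[i][j] = a[i][j] - a[i][k]*a[k][j];
  }
}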
33
26 Dependent operations
a22 ← a22 − a21 · a11^{-1} · a12
· · ·
a33 ← a33 − a32 · a22^{-1} · a23
The second update needs the updated a22 , so these updates have to be done in sequence.
34
Exercise 1: Critical path
Follow this argument through. Argue that there is a non-trivial critical
path in the sense of the critical path section above. What is its length?
In the analysis of the critical path section, what does this critical path
imply for the minimum parallel execution time and bounds on
speedup?
35
27 Subblock update
for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj
How many processors can you use maximally in step k ?
36
Exercise 2: Parallel execution
Continue this reasoning. With p = n² processing elements each of the
(i , j ) updates in the subblock can be done simultaneously. To be
precise, how long does an arbitrary k iteration take? Summing over
all k, what are the resulting Tp , Sp , Ep ? How does this relate to the
bounds you derived above?
Also, with p = n processing elements you could let each row or
column of the subblock update be done in parallel. What is now the
time for the k-th outer iteration? What are the resulting Tp , Sp , Ep ?
37
28 Application scaling
Single processor.
Relating time and memory to problem size
T = (1/3) N³ /f ,   M = N².
where f is processor frequency.
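Illustration (assumed numbers): N = 10⁴ gives M = 10⁸ matrix elements (0.8 GB at 8 bytes per element) and roughly 3.3 · 10¹¹ operations divided by f.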
38
Exercise 3: Memory scaling, case 1: Faster
processor
Suppose you buy a processor twice as fast, and you want to do a
benchmark run that again takes time T . How much memory do you
need?
39
29 More processors
Keep frequency constant, but vary number of processors p:
T = (1/3) N³ /p ,   M = N².
Each processor now stores Mp = N 2 /p elements.
40
Exercise 4: Memory scaling, case 2: More
processors
Suppose you have a cluster with p processors, each with Mp memory,
can run a Gaussian elimination of an N × N matrix in time T :
T = (1/3) N³ /p ,   Mp = N² /p .
Now you extend the cluster to 2p processors of the same clock
speed, and you want to do a benchmark run, again taking time T . How
much memory does each node need?
Hint: for the extended cluster:
T ′ = (1/3) N ′³ /p ′ ,   Mp′ = N ′² /p ′ .
The question becomes to compute Mp′ under the given conditions.
41
The SIMD/MIMD/SPMD/SIMT model for parallelism
42
30 Flynn Taxonomy
Consider instruction stream and data stream:
• SISD: single instruction single data
used to be single processor, now single core
• MISD: multiple instruction single data
redundant computing for fault tolerance?
• SIMD: single instruction multiple data
data parallelism, pipelining, array processing, vector instructions
• MIMD: multiple instruction multiple data
independent processors, clusters, MPPs
43
31 SIMD
• Relies on streams of identical operations
• See pipelining
• Recurrences are hard to accommodate
44
32 SIMD: array processors
Technology going back to the 1980s: FPS, MasPar, Connection Machine, Goodyear.
Major advantage: simplification of the processor design.
45
33 SIMD as vector instructions
• Register width multiple of 8 bytes:
• simultaneous processing of more than one operand pair
• SSE: 2 operands,
• AVX: 4 or 8 operands
46
34 Controlling vector instructions
void func(float *restrict c, float *restrict a,
float *restrict b, int n)
{
#pragma vector always
for (int i=0; i<n; i++)
c[i] = a[i] * b[i];
}
This needs aligned data (posix_memalign)
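A hedged note: #pragma vector always is an Intel compiler hint, and the aligned allocation it benefits from could look as follows (a sketch, assuming 32-byte alignment for AVX-width loads):
#include <stdlib.h>
float *alloc_aligned_floats( size_t n ) {
  void *p = NULL;                                   // posix_memalign needs a void**
  if ( posix_memalign( &p, 32, n*sizeof(float) )!=0 )
    return NULL;                                    // allocation failed
  return (float*)p;
}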
47
35 New branches in the taxonomy
• SPMD: single program multiple data
the way clusters are actually used
• SIMT: single instruction multiple threads
the GPU model
48
36 MIMD becomes SPMD
• MIMD: independent processors, independent instruction streams,
independent data
• In practice very little true independence: usually the same
executable
Single Program Multiple Data
• Exceptional example: climate codes
• Old-style SPMD: cluster of single-processor nodes
• New-style: cluster of multicore nodes, ignore shared caches /
memory
• (We’ll get to hybrid computing in a minute)
49
37 GPUs and data parallelism
Lockstep in thread block,
single instruction model between streaming processors
(more about GPU threads later)
50
Characterization of parallelism by memory model
51
38 Major types of memory organization, classic
52
39 Major types of memory organization,
contemporary
53
40 Symmetric multi-processing
• The ideal case of shared memory:
every address equally accessible
• This hasn’t existed in a while
(Tim Mattson claims the Cray-2 was the last one)
• Danger sign: shared memory programming pretends that memory access is symmetric;
in fact it hides the real hardware behavior from you
54
41 SMP, bus design
• Bus: all processors on the same wires to memory
• Not very scalable: requires slow processors or cache memory
• Cache coherence easy by ‘snooping’
55
42 Non-uniform Memory Access
Memory is equally programmable, but not equally accessible
• Different caches, different affinity
• Distributed shared memory: network latency
• ScaleMP and other products claim otherwise (watch me not believe it)
56
43 Picture of NUMA
57
Interconnects and topologies, theoretical
concepts
58
44 Topology concepts
• Hardware characteristics
• Software requirement
• Design: how ‘close’ are processors?
59
45 Graph theory
• Degree: number of direct connections from one processor to others
• Diameter: maximum over all processor pairs of the shortest distance between them (measured in hops)
60
46 Bandwidth
• Bandwidth per wire is nice, adding over all wires is nice, but. . .
• Bisection width: minimum number of wires that must be cut to split the machine into two equal halves
• Bisection bandwidth: total bandwidth through such a bisection
61
47 Design 1: bus
Already discussed; simple design, does not scale very far
62
48 Design 2: linear arrays
• Degree 2, diameter P − 1, bisection width 1
• Scales nicely!
• but low bisection width
63
Exercise 5: Broadcast algorithm
Flip last bit, flip one before, . . .
64
49 Design 3: 2/3-D arrays
• Degree 2d, diameter ∼ P^{1/d}
• Natural design: nature is three-dimensional
• More dimensions: less contention;
the K computer’s network is 6-dimensional
65
50 Design 3: Hypercubes
66
51 Hypercube numbering
Naive numbering:
67
52 Gray codes
Embedding linear numbering in hypercube:
68
53 Binary reflected Gray code
Construction (each code word is a column of bits; the bar marks the reflection):
1D Gray code:                  0 1
1D code and reflection:        0 1 | 1 0
2D Gray code,
  append 0 and 1 bit:          0 1 | 1 0
                               0 0 | 1 1
2D code and reflection:        0 1 1 0 | 0 1 1 0
                               0 0 1 1 | 1 1 0 0
3D Gray code,
  append 0 and 1 bit:          0 1 1 0 | 0 1 1 0
                               0 0 1 1 | 1 1 0 0
                               0 0 0 0 | 1 1 1 1
69
54 Switching networks
• Solution to all-to-all connection
• (Real all-to-all too expensive)
• Typically layered
• Switching elements: easy to extend
70
55 Cross bar
Advantage: non-blocking
Disadvantage: cost
71
56 Butterfly exchange
Connects processors to a segmented pool of memory, or processors with
private memory to each other:
72
57 Building up butterflies
73
58 Uniform memory access
Contention possible
74
59 Route calculation
75
60 Fat Tree
76
61 Fat trees from switching elements
(Clos network)
77
62 Fat tree clusters
78
Exercise 6: Switch contention
Suppose the number of processors p is larger than the number of
wires w.
Write a simulation that investigates the probability of contention if you
send m ≤ w messages to distinct processors.
Can you do a statistical analysis, starting with a simple case?
79
63 Mesh clusters
80
64 Levels of locality
• Core level: private cache, shared cache
• Node level: NUMA
• Network: levels in the switch
81
Programming models
82
65 Shared vs distributed memory
programming
Different memory models lead to different questions:
• Shared memory: synchronization problems such as critical
sections
• Distributed memory: data motion
83
Thread parallelism
84
66 What is a thread
• Process: code, heap, stack
• Thread: same code but private program counter, stack, local
variables
• dynamically (even recursively) created: fork-join
Incremental parallelization!
85
67 Thread context
• Private data (stack, local variables) is called ‘thread context’
• Context switch: switch from one thread execution to another
• Context switches are expensive; alternative: hyperthreading
• Intel Xeon Phi: hardware support for 4 threads per core
• GPUs: fast context switching between many threads
86
68 Thread programming 1
Pthreads
pthread_t threads[NTHREADS];
printf("forking\n");
for (i=0; i<NTHREADS; i++)
if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
return i+1;
printf("joining\n");
for (i=0; i<NTHREADS; i++)
if (pthread_join(threads[i],NULL)!=0)
return NTHREADS+i+1;
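The fragment above omits the surrounding declarations. A minimal complete sketch could look as follows; the shared counter and the body of adder are assumptions for illustration, and they set up the race condition discussed on the next slide.
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
static int sum = 0;                     /* shared and unprotected: racy on purpose */

void *adder( void *arg ) {
  sum = sum + 1;                        /* read-modify-write, not atomic */
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  int i;
  printf("forking\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
      return i+1;
  printf("joining\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_join(threads[i],NULL)!=0)
      return NTHREADS+i+1;
  printf("sum = %d\n",sum);
  return 0;
}
(Link with the Pthreads library, for instance -lpthread.)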
87
69 Race conditions
Init: I=0
process 1: I=I+2
process 2: I=I+3
Scenario 1: both processes read I = 0 and compute 2 and 3 respectively;
  process 1 writes I = 2, then process 2 writes I = 3; result: I = 3.
Scenario 2: both processes read I = 0 and compute 2 and 3 respectively;
  process 2 writes I = 3, then process 1 writes I = 2; result: I = 2.
Scenario 3: process 1 reads I = 0, computes 2, writes I = 2;
  then process 2 reads I = 2, computes 5, writes I = 5; result: I = 5.
88
70 Dealing with atomic operations
Semaphores, locks, mutexes, critical sections, transactional memory
Software / hardware
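As a hedged sketch, one of the mechanisms above, a Pthreads mutex, applied to the racy update of the previous slides:
#include <pthread.h>

static int sum = 0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void *adder( void *arg ) {
  pthread_mutex_lock(&sum_lock);     /* only one thread at a time past this point */
  sum = sum + 1;
  pthread_mutex_unlock(&sum_lock);
  return NULL;
}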
89
71 Cilk
Sequential code:
int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += fib(n-1);
    rst += fib(n-2);
    return rst;
  }
}

Cilk code:
cilk int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += spawn fib(n-1);
    rst += spawn fib(n-2);
    sync;
    return rst;
  }
}
Sequential consistency: program output identical to sequential
90
72 OpenMP
• Directive based
• Parallel sections, parallel loops, tasks
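A minimal sketch of the directive-based style, applied to the earlier summation example; the reduction clause gives every thread a private partial sum and combines them at the end.
double omp_sum( const double *x, int n ) {
  double s = 0.;
#pragma omp parallel for reduction(+:s)
  for (int i=0; i<n; i++)
    s += x[i];
  return s;
}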
91
Distributed memory parallelism
92
73 Global vs local view
yi ← yi + xi−1   if i > 0
yi unchanged     if i = 0
• If I am processor 0 do nothing; otherwise receive an x element
from the left, add it to my y element.
• If I am the last processor do nothing; otherwise send my x
element to the right.
(Let’s think this through. . . )
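A hedged MPI sketch of this local view with blocking calls; rank and nprocs are assumed to come from the usual MPI_Comm_rank / MPI_Comm_size setup.
#include <mpi.h>

/* Sketch: y on each processor gets the x value of its left neighbour added. */
void shift_add( double *x, double *y, int rank, int nprocs, MPI_Comm comm ) {
  double x_from_left;
  if (rank > 0) {                       /* everyone but processor 0 receives */
    MPI_Recv(&x_from_left, 1, MPI_DOUBLE, rank-1, 0, comm, MPI_STATUS_IGNORE);
    *y += x_from_left;
  }
  if (rank < nprocs-1)                  /* everyone but the last processor sends */
    MPI_Send(x, 1, MPI_DOUBLE, rank+1, 0, comm);
}
Note that with blocking calls in this order the operation ripples through the processors one by one; that is exactly the kind of behavior the next slides examine.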
93
74 Global picture
94
75 Careful coding
95
76 Better approaches
• Non-blocking send/receive
• One-sided
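A sketch of the first option, non-blocking send/receive, under the same assumptions as before; both transfers are posted first and completed together, so neither processor waits for the other to reach a matching call.
#include <mpi.h>

void shift_add_nonblocking( double *x, double *y, int rank, int nprocs, MPI_Comm comm ) {
  double x_from_left = 0.;
  MPI_Request reqs[2]; int nreqs = 0;
  if (rank > 0)                         /* post the receive from the left neighbour */
    MPI_Irecv(&x_from_left, 1, MPI_DOUBLE, rank-1, 0, comm, &reqs[nreqs++]);
  if (rank < nprocs-1)                  /* post the send to the right neighbour */
    MPI_Isend(x, 1, MPI_DOUBLE, rank+1, 0, comm, &reqs[nreqs++]);
  MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
  if (rank > 0) *y += x_from_left;      /* data is only valid after the wait */
}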
96
Hybrid/heterogeneous parallelism
97
77 Hybrid computing
• Use MPI between nodes, OpenMP inside nodes
• alternative: ignore the shared memory and use MPI throughout
• you save: buffers and copying
• bundling communication, load spread
98
78 Using threads for load balancing
Dynamic scheduling gives load balancing
Hybrid is a possible improvement over strict MPI
99
79 Amdahl’s law for hybrid programming
• p nodes with c cores each
• Fp core-parallel fraction, assume full MPI parallel
• ideal speedup pc, running time T1 /(pc ), actually:
Tp,c = Fs T1 /p + Fp T1 /(pc) = (T1 /(pc)) (Fs c + Fp ) = (T1 /(pc)) (1 + Fs (c − 1)) .
• T1 /Tp,c ≈ p/Fs
• Original Amdahl: Sp < 1/Fs , hybrid programming Sp < p/Fs
100
Design patterns
101
80 Array of Structures
struct _Node { int number; double xcoord,ycoord; };
struct _Vector { double xtrans,ytrans; };
typedef struct _Node* Node;
typedef struct _Vector* Vector;

struct _Node *nodes = (struct _Node*) malloc( n_nodes*sizeof(struct _Node) );
102
81 Operations
Operate
void shift(Node the_point,Vector by) {
  the_point->xcoord += by->xtrans;
  the_point->ycoord += by->ytrans;
}
in a loop
for (i=0; i<n_nodes; i++) {
  shift(nodes+i,shift_vector);
}
103
82 Along come the 80s
Vector operations
node_numbers = (int*) malloc( n_nodes*sizeof(int) );
node_xcoords = // et cetera
node_ycoords = // et cetera
and you would iterate
for (i=0; i<n_nodes; i++) {
  node_xcoords[i] += shift_vector->xtrans;
  node_ycoords[i] += shift_vector->ytrans;
}
104
83 and the wheel of reinvention turns further
The original array-of-structures design was better for MPI in the 1990s,
until vector instructions (and GPUs) in the 2000s favored the structure-of-arrays layout again
105
84 Latency hiding
• Memory and network are slow: avoid having to wait for them
• Hardware magic: out-of-order execution, caches, prefetching
106
85 Explicit latency hiding
Matrix-vector product:
∀i ∈ Ip : yi = ∑j aij xj .
x needs to be gathered:
∀i ∈ Ip : yi = ( ∑j local + ∑j not local ) aij xj .
Overlap loads and local operations
Possible in MPI and Xeon Phi offloading,
very hard to do with caches
107
What’s left
108
86 Parallel languages
• Co-array Fortran: extensions to the Fortran standard
• X10
• Chapel
• UPC
• BSP
• MapReduce
• Pregel, . . .
109
87 UPC example
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
int i;
upc_forall(i=0; i<N; i++; i)
v1plusv2[i]=v1[i]+v2[i];
}
110
88 Co-array Fortran example
Explicit dimension for ‘images’:
Real,dimension(100),codimension[*] :: X
Real :: X(100)[*]
Real :: X(100,200)[10,0:9,*]
the number of images (the ‘*’ extent) is determined by the runtime environment
111
89 Grab bag of other approaches
• OS-based: data movement induced by cache misses
• Active messages: application level Remote Procedure Call
(see: Charm++)
112
Load balancing, locality, space-filling curves
113
90 The load balancing problem
• Application load can change dynamically
e.g., mesh refinement, time-dependent problems
• Splitting off and merging loads
• No real software support: write application anticipating load
management
• Initial balancing: graph partitioners
114
91 Load balancing and performance
• Assignment to arbitrary processor violates locality
• Need a dynamic load assignment scheme that preserves locality
under load migration
• Fairly easy for regular problems, for irregular?
115
Space-filling curves
116
92 Adaptive refinement and load assignment
117
93 Assignment through Space-Filling Curve
118
Domain partitioning by Fiedler vectors
119
94 Inspiration from physics
120
95 Graph laplacian
• Set Gij = −1 if there is an edge (i , j )
• Set Gii positive to give zero row sums
• The first eigenvalue is zero, with a positive eigenvector
• The second eigenvector has positive and negative components, dividing the graph in two
• The n-th eigenvector divides the graph in n parts
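A small worked illustration (assumed example, not from the slides): for a path graph with four nodes 1–2–3–4 the graph Laplacian is
G = [  1 −1  0  0 ]
    [ −1  2 −1  0 ]
    [  0 −1  2 −1 ]
    [  0  0 −1  1 ]
Its smallest eigenvalue is 0 with eigenvector (1, 1, 1, 1); the second eigenvector (the Fiedler vector) is approximately (0.92, 0.38, −0.38, −0.92), and its sign pattern splits the path into the two halves {1, 2} and {3, 4}.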
121
96 Fiedler in a picture
122