Lecture 4:
Parallel Programming Basics
CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)
ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel
export*uniform*float*sumall1( export*uniform*float*sumall2(
***uniform*int*N, ***uniform*int*N,
***uniform*float**x) ***uniform*float**x)
{ {
***uniform*float*sum*=*0.0f; ***uniform*float*sum;
***foreach*(i*=*0*...*N) ***float*partial*=*0.0f;
***{ ***foreach*(i*=*0*...*N)
******sum*+=*x[i]; ***{
***} ******partial*+=*x[i];
*** ***}
***return*sum;
} ***//*from*ISPC*math*library
***sum*=*reduceAdd(partial);
***
***return*sum;
}
sum is of type uniform float (one copy of variable for all program instances)
Undefined behavior: All program instances accumulate into sum in parallel
(read-modify-write operation must be atomic for correctness: it is not)
(CMU 15-418, Spring 2012)
ISPC discussion: sum “reduction”
Compute the sum of all array elements in parallel export*uniform*float*sumall2(
***uniform*int*N,
Each instance accumulates a private partial sum ***uniform*float**x)
{
(no communication) ***uniform*float*sum;
***float*partial*=*0.0f;
Partial sums are added together using the reduceAdd() ***foreach*(i*=*0*...*N)
cross-instance communication primitive. The result is the ***{
******partial*+=*x[i];
same for all instances (uniform) ***}
ISPC code at right will execute in a manner similar to ***//*from*ISPC*math*library
***sum*=*reduceAdd(partial);
handwritten C implementation below. ***
***return*sum;
}
const*int*N*=*1024;
float**x*=*new*float[N];
__mm256*partial*=*_mm256_broadcast_ss(0.0f);
//*populate*x
for*(int*i=0;*i<N;*i+=8)
***partial*=*_mm256_add_ps(partial,*_mm256_load_ps(&x[i]));
float*sum*=*0.f;
for*(int*i=0;*i<8;*i++)
***sum*+=*partial[i];
}
(CMU 15-418, Spring 2012)
Parallel programming basics
(CMU 15-418, Spring 2012)
Creating a parallel program
▪ Thought process:
- Identify work that can be performed in parallel
- Partition work (and associated data)
- Manage data access, communication, and synchronization
▪ Recall one** of our main goals is speedup:
For a fixed problem size:
Time (1 processor)
Speedup( P processors ) =
Time (P processors)
** Other goals include efficiency (cost, area, power, etc.), working on bigger problems than on a uniprocessor
(CMU 15-418, Spring 2012)
Steps in creating a parallel program
These steps are
Problem to solve performed by the
programmer and/or
Decomposition
system (compiler,
runtime, hardware)
Subproblems
(“tasks”)
Assignment
Threads **
Orchestration
Parallel program
(communicating
threads)
Mapping
Execution on
** textbook uses the term
parallel “processes”. We’re referring to
machine the same concept
(CMU 15-418, Spring 2012)
Decomposition
▪ Break up problem into tasks that can be carried out in parallel
- Need not happen statically
- Tasks can be identified as program executes
▪ Want to create enough tasks to keep all execution units on a
machine busy.
Key aspect of decomposition: identifying dependencies
(a lack of dependencies)
(CMU 15-418, Spring 2012)
Limited concurrency: Amdahl’s law
▪ Say you have a sequential program
▪ Let S = the fraction of sequential execution that is inherently
sequential
- Dependencies prevent parallel execution
▪ Then speedup ≤ /S
1
(CMU 15-418, Spring 2012)
Amdahl’s law example
▪ 2-phase computation on an N-by-N grid
- Phase 1: independent computation on each grid element
- Phase 2: compute sum of all cell values
- Real-life example: image processing
▪ Sequential implementation N
- Both phases take N2 time: total is 2N2
N
Parallelism
N2 N2
1
Execution time
(CMU 15-418, Spring 2012)
First attempt at parallelism (P processors)
▪ Strategy:
- Phase 1: execute in parallel
- time for phase 1: N2/P
- Phase 2: execute serially
- time for phase 2: N2 P Sequential program
Parallelism
▪ Overall performance:
N2 N2
- Speedup 1
Execution time
N2/P
P
Parallel program
-
Parallelism
Speedup ≤ 2
N2
1
Execution time
(CMU 15-418, Spring 2012)
Parallelizing phase 2
▪ Strategy:
- Phase 1: execute in parallel
- time for phase 1: N2/P
- Phase 2: execute partial summations in parallel, combine results serially
- time for phase 2: N2/P + P
▪ Overall performance:
- Speedup overhead:
combining the partial sums
N2/P N2/P + P
P
Parallel program
Parallelism
Note: speedup → P when N >> P
1
Execution time
(CMU 15-418, Spring 2012)
Amdahl’s law
▪ Let S = the fraction of sequential execution that is inherently sequential
▪ Max speedup on P processors given by:
speedup
S=0.01
Max Speedup
S=0.05
S=0.1
Processors
(CMU 15-418, Spring 2012)
Decomposition
▪ Who is responsible for performing decomposition?
- In many cases: the programmer
- Lots and lots of research on automatic decomposition of
sequential programs (very hard in general case)
- Compiler analyzes program, determines dependencies
- What if dependencies are data-dependent?
- Success with simple loops, loop nests
- The “magic parallelizing compiler” has never materialized
(CMU 15-418, Spring 2012)
Assignment
Problem to solve
Decomposition
Subproblems
(“tasks”)
Assignment
Threads
Orchestration
Parallel program
(communicating
threads)
Mapping
Execution on
parallel
machine
(CMU 15-418, Spring 2012)
Assignment
▪ Assigning tasks to threads
- Think of the threads as “workers”
▪ Goals: balance workload, reduce communication costs
▪ Can be performed statically, or dynamically during execution
▪ While programmer often responsible for decomposition many
languages/runtimes take responsibility for assignment.
(CMU 15-418, Spring 2012)
Assignment examples in ISPC
export*void*sinx( export*void*sinx(
***uniform*int*N, ***uniform*int*N,
***uniform*int*terms, ***uniform*int*terms,
***uniform*float**x, ***uniform*float**x,
***uniform*float**result) ***uniform*float**result)
{ {
***//*assumes*N*%*programCount*=*0 ***foreach*(i*=*0*...*N)
***for*(uniform*int*i=0;*i<N;*i+=programCount) ***{
***{ ****float*value*=*x[i];
****int*idx*=*i*+*programIndex; ****float*numer*=*x[i]***x[i]***x[i];
****float*value*=*x[idx]; ****uniform*int*denom*=*6;**//*3!
****float*numer*=*x[idx]***x[idx]***x[idx]; ****uniform*int*sign*=*X1;
****uniform*int*denom*=*6;**//*3!
****uniform*int*sign*=*X1; ****for*(uniform*int*j=1;*j<=terms;*j++)
****{*
****for*(uniform*int*j=1;*j<=terms;*j++) *******value*+=*sign***numer*/*denom
****{* *******numer**=*x[i]***x[i];
*******value*+=*sign***numer*/*denom *******denom**=*(j+3)***(j+4);
*******numer**=*x[idx]***x[idx]; *******sign**=*X1;
*******denom**=*(j+3)***(j+4); ******}
*******sign**=*X1; ******result[i]*=*value;
******} ***}
******result[i]*=*value; }
***}
}
Decomposition by loop iteration Decomposition by loop iteration
Programmer managed assignment: Foreach construct exposes independent tasks to system
Static assignment System-manages assignment of iterations to instances
Assign iterations to instances in interleaved fashion
(CMU 15-418, Spring 2012)
Orchestration
Problem to solve
Decomposition
Subproblems
(“tasks”)
Assignment
Threads
Orchestration
Parallel program
(communicating
threads)
Mapping
Execution on
parallel
machine
(CMU 15-418, Spring 2012)
Orchestration
▪ Involves:
- Structuring communication
- Adding synchronization to preserve dependencies
- Organizing data structures in memory, scheduling tasks
▪ Goals: reduce costs of communication/sync, preserve locality
of data reference, reduce overhead, etc.
▪ Machine details impact many of these decisions
- If synchronization is expensive, might use it more sparsely
(CMU 15-418, Spring 2012)
Mapping
Problem to solve
Decomposition
Subproblems
(“tasks”)
Assignment
Threads
Orchestration
Parallel program
(communicating
threads)
Mapping
Execution on
parallel
machine
(CMU 15-418, Spring 2012)
Mapping
▪ Mapping “threads” to execution units
▪ Usually a job for the OS
▪ Many mapping decisions are trivial in parallel programs
- Parallel application uses the entire machine
- So oversubscribing machine with multiple parallel apps is not common
▪ More interesting mapping decisions:
- Place related threads (cooperating threads) on the same processor
(maximize locality, data sharing, minimize costs of comm/sync)
- Mapping of ISPC instances to vector ALUs
(CMU 15-418, Spring 2012)
Decomposing/assigning computation or data?
Often, the reason a problem requires lots of computation (and needs to be parallelized) is
that it involves a lot of data.
I’ve described the process of parallelizing programs as an act of partitioning computation
Often equally valid to think of partitioning data. (computations go with the data)
But there are many computations where the correspondence between “tasks” and data is
less clear. In these cases it’s natural to think of partitioning computation.
(CMU 15-418, Spring 2012)
A parallel programming example
(CMU 15-418, Spring 2012)
Grid-based solver
▪ Solve partial differential equation on N+2 x N+2 grid
▪ Iterative solution
- Perform Gauss-Seidel sweeps over grid until convergence
N
A[i,j]*=*0.2***A[i,j]*+*A[i,jX1]*+*A[iX1,j]
**********************+*A[i,j+1]*+*A[i+1,j];*
(CMU 15-418, Spring 2012)
Grid solver algorithm
(generic syntax: to match textbook)
(CMU 15-418, Spring 2012)
Step 1: identify dependencies
(problem decomposition phase)
N
Each row element depends on element to left.
Each column depends on previous column.
N
...
...
(CMU 15-418, Spring 2012)
Step 1: identify dependencies
(problem decomposition phase)
N
Parallelism along the diagonals.
Good: parallelism exists!
Possible strategy:
1. Partition grid cells on a diagonal into tasks
2. Update values in parallel
N
3. When complete, move to next diagonal
...
...
Bad: hard to exploit
Early in computation: not much parallelism
Frequent synchronization (each diagonal)
(CMU 15-418, Spring 2012)
Key idea: change algorithm
▪ Change order grid cell cells are updated
▪ Iterates to (approximately) same solution, but converges to
solution differently
- Note: floating point values computed are different, but solution still converges
to within error threshold
▪ Domain knowledge: needed knowledge of Gauss-Seidel
iteration to realize this change is okay for application’s needs
(CMU 15-418, Spring 2012)
Exploit application knowledge
Reorder grid traversal: red-black coloring
Update all red cells in parallel
When done, update all black cells in
parallel
(dependency on red cells)
N
Repeat until convergence
(CMU 15-418, Spring 2012)
Assignment
Which is better? Does it matter?
(CMU 15-418, Spring 2012)
Consider dependencies (data flow)
1. Perform red update in parallel Compute red
2. Wait until all processors done Wait
3. Communicate updated red cells to other processors
4. Perform black update in parallel Compute black
5. Wait until all processors done Wait
6. Communicate updated black cells to other processors
7. Repeat
(CMU 15-418, Spring 2012)
Assignment
= data that must be sent to P2 each iteration
Blocked assignment requires less data to be communicated between processors
(CMU 15-418, Spring 2012)
Grid solver: data-parallel expression
To simplify code: we’ve dropped red-black separation, now ignoring dependencies (follows textbook section 2.3)
assignment:
specified explicitly
(block assignment)
decomposition:
tasks are individual
elements
Orchestration:
handled by system
(End of for_all block is implicit wait for all
workers before returning to sequential control)
(CMU 15-418, Spring 2012)
Shared address space solver
SPMD execution model
▪ Programmer responsible for synchronization Compute
▪ Common synchronization primitives: Wait
- Locks (mutual exclusion): only one thread in Compute
the critical region at a time
Wait
- Barriers: wait for threads to reach this point
(CMU 15-418, Spring 2012)
Barrier
▪ Barrier(nthreads)* Compute
▪ Barriers are a conservative way to express Barrier
dependencies
▪ Barriers divide computation into phases Compute
▪ All computations by all threads before the barrier Barrier
complete before any computation in any thread
after the barrier begins
(CMU 15-418, Spring 2012)
Shared address space solver (SPMD execution model)
Value of pid is different for each
SPMD instance: use value to
compute region of grid to work on
partial sum
Why are there so many barriers?
(CMU 15-418, Spring 2012)
Need for mutual exclusion
▪ Each thread executes
- load the value of diff into register r1
- add the register r2 to register r1
- store the value of register r1 into diff
▪ One possible interleaving: (let starting value of diff=0, r2=1)
T0 T1
r1*←*diff T0*reads*value*0
r1*←*diff T1*reads*value*0
r1*←*r1*+*r2 T0*sets*value*of*its*r1*to*1
r1*←*r1*+*r2 T1*sets*value*of*its*r1*to*1
diff*←r1 T0*stores*1*to*diff
diff*←r1 T0*stores*1*to*diff
▪ Need set of three instructions to be atomic
(CMU 15-418, Spring 2012)
Mechanisms for atomicity
▪ lock/unlock mutex variable around critical section
LOCK(mylock);
//*critical*section
UNLOCK(mylock);
▪ Some languages have first-class support
atomic*{
//*critical*section
}
▪ Intrinsics for hardware-supported atomic rd-modify-write operations
atomicAdd(x,*10);
▪ Access to critical section will be serialized across all threads
- High contention will cause performance problems (recall Amdahl’s Law)
- Note partial accumulation into private mydiff
(CMU 15-418, Spring 2012)
More on specifying dependencies
▪ Barriers: simple, but conservative (coarse granularity)
- Everything done up until now must finish, then before next phase
▪ Specifying specific dependencies can increase performance
(by revealing more parallelism)
- Example: two threads. One produces a result, the other consumes it.
T0 T1
//*produce*x,*then*let*T1*know
X*=*1;
flag*=*1;
while*(flag*==*0);
print*X;
▪ We just implemented a message queue
T0 T1
(CMU 15-418, Spring 2012)
Next time: message passing version
(CMU 15-418, Spring 2012)
Example application 1:
Modeling ocean currents
▪ Discretize ocean into slices represented as 2D grids
- Toy example today (grid solver) was taken from this case study
▪ Discretize time evolution: ∆t
▪ High accuracy simulation = small ∆t and high resolution grids
(CMU 15-418, Spring 2012)
Example application 2:
Galaxy evolution
▪ Represent galaxy as a bunch of particles (think: particle = star)
▪ Compute forces due to gravity
- Gravity has infinite extent: O(N2)
- But falls off with distance, so algorithm groups far away stars into aggregates
▪ N-body simulation: commonly used way to simulate fluids, molecular dynamics
(CMU 15-418, Spring 2012)
Example application 3:
Ray tracing
Screen
Camera
Image credit: NVIDIA Image Credit: Sony
(this image can be rendered at “interactive rates” but not real-time yet) (Cloudy With a Chance of Meatballs)
▪ Simulate propagation of light through scene to synthesize realistic images
▪ Compute amount of light traveling along rays
(CMU 15-418, Spring 2012)
Summary
▪ Amdahl’s Law
- Overall speedup limited by amount of serial execution in code
▪ Steps in creating a parallel program
- Decomposition, assignment, orchestrating, mapping
- We’ll talk a lot about making good decisions in each of these phases in coming
lectures (in practice, very inter-related)
▪ Focus today: identifying dependencies
▪ Focus soon: identifying locality
(CMU 15-418, Spring 2012)