Parallelization Principles
R. Govindarajan
SERC, IISc
govind@iisc.ac.in
1
Overview
§ Introduction
§ Parallelization Steps
§ Example
Ø Shared Address Space
Ø Distributed Address Space
Acknowledgments:
Slides for this tutorial are taken from presentation materials
available with the book “Parallel Computer Architecture: A
Hardware/Software Approach” (Culler, Singh and Gupta,
Morgan Kaufmann Pub.) and the associated course material.
They have been suitably adapted.
2
Parallel Programming
§ Shared, global address space, hence
called Shared Address Space
Ø Any processor can directly reference any
memory location
Ø Communication occurs implicitly as result of
loads and stores
§ Message Passing Architecture
Ø Memory is private to each node
Ø Processes communicate by messages
3
Definitions
§ Speedup = Exec. Time on Uniprocessor / Exec. Time on n processors
§ Efficiency = Speedup / n
§ Amdahl’s Law:
Ø For a program with a sequential fraction s,
speedup is limited to 1/s.
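As a worked illustration (the serial fraction value is mine, not from the slides), a small C function can evaluate the Amdahl bound for a given s and p:

#include <stdio.h>

/* Amdahl's law: with serial fraction s and p processors,
   speedup = 1 / (s + (1 - s)/p), approaching 1/s as p grows. */
double amdahl_speedup(double s, int p)
{
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void)
{
    /* e.g., s = 0.1: even 1024 processors give less than 10x */
    printf("p = 16:   speedup = %.2f\n", amdahl_speedup(0.1, 16));
    printf("p = 1024: speedup = %.2f\n", amdahl_speedup(0.1, 1024));
    return 0;
}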
4
Understanding Amdahl’s Law
Example: 2-phase calculation
Ø sweep over n x n grid and do some independent
computation
Ø sweep again and add each value to global sum
[Figure (a) Serial: concurrency profile over time; concurrency stays at 1 for the whole time n² + n²]
Ø Serial Execution Time = n² + n² = 2n²
5
Understanding Amdahl’s Law
Parallel Execution time:
Ø Time for first phase = n²/p
Ø Second phase serialized at global variable = n²
Ø Speedup = 2n²/(n² + n²/p), or at most 2
Improved Parallel Execution
Ø Localize the sum in p procs and then do a serial sum (see the sketch below)
Ø Speedup = 2n²/(2n²/p + p) ≈ p
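A minimal C sketch of this improvement, using OpenMP only as a convenient notation (the slides do not assume any particular threading library): each thread sums its share of the n² values into a private slot, and the p partial sums are combined serially afterwards instead of serializing every update at one shared global sum.

#include <omp.h>

/* grid[] holds the n*n values computed independently in phase 1.
   Phase 2: sum them without serializing at one shared variable. */
double grid_sum(const double *grid, int n, int p)
{
    double partial[64] = {0.0};        /* one private slot per thread; assumes p <= 64 */
    #pragma omp parallel num_threads(p)
    {
        int id = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n * n; i++)
            partial[id] += grid[i];    /* no lock: each thread owns its slot */
    }
    double sum = 0.0;
    for (int id = 0; id < p; id++)     /* short serial combine: O(p), not O(n²) */
        sum += partial[id];
    return sum;
}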
[Figures (b) Naïve Parallel and (c) Parallel: concurrency profiles over time. In (b), concurrency is p for the n²/p first phase, then 1 for the n² serial accumulation. In (c), concurrency is p for about 2n²/p, followed by a short serial combine of the p partial sums.]
6
Definitions
§ Task
ØArbitrary piece of work in parallel computation
ØExecuted sequentially; concurrency is only across
tasks
ØFine-grained vs. coarse-grained tasks
§ Process (thread)
ØAbstract entity that performs the tasks
ØCommunicate and synchronize to perform the tasks
§ Processor
Ø Physical engine on which process executes
7
Tasks involved in Parallelization
§ Identify work that can be done in parallel
Ø work includes computation, data access and I/O
§ Partition work and perhaps data among
processes
§ Manage data access, communication and
synchronization
8
Parallelizing Computation vs. Data
§ Computation is decomposed and assigned
(partitioned) – task decomposition
Ø Task graphs, synchronization among tasks
§ Partitioning Data is often a natural view
too – data or domain decomposition
ØComputation follows data: owner computes
ØGrid example; data mining;
9
Domain Decomposition: Example
§ Some computation
performed on all elts. of
the array
for i = 1 to m
  for j = 1 to n
    a[i,j] = a[i,j] + v[i]
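A C sketch of the owner-computes idea for this loop, with a block decomposition of rows among nprocs processes (the function name and the SPMD framing are mine, not from the slides):

/* Process pid (0 <= pid < nprocs) updates only the rows it owns.
   Assumes m is divisible by nprocs to keep the sketch short. */
void update_my_rows(double **a, const double *v, int m, int n,
                    int pid, int nprocs)
{
    int rows_per_proc = m / nprocs;
    int mybegin = 1 + pid * rows_per_proc;    /* rows are numbered 1..m */
    int myend   = mybegin + rows_per_proc - 1;
    for (int i = mybegin; i <= myend; i++)
        for (int j = 1; j <= n; j++)
            a[i][j] = a[i][j] + v[i];         /* same computation as above */
}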
10
Steps in Creating a Parallel Program
§ Decomposition of computation into tasks
§ Assignment of tasks to processes
§ Orchestration of data access, communication,
and synchronization.
§ Mapping processes to processors
11
Steps in Creating a Parallel Program
[Figure: Sequential computation → (Decomposition) → Tasks → (Assignment) → Processes p0–p3 → (Orchestration) → Parallel program → (Mapping) → Processors P0–P3. Decomposition and assignment together form the partitioning.]
12
Decomposition
§ Identify concurrency
§ Break up computation into tasks to be divided among
processes
ØTasks may become available dynamically
ØNo. of available tasks may vary with time
§ Goal: Expose available parallelism → enough tasks to
keep all processors busy
13
Assignment
§ Specifies how to group tasks together for a process
ØBalance workload, reduce communication and
management cost
§ Structured approaches usually work well
ØCode inspection (parallel loops) or understanding of
application
ØStatic versus dynamic assignment
§ Both decomposition and assignment are usually
independent of architecture or programming model
ØBut cost and complexity of using primitives may
affect decisions
14
Orchestration
§ Goals
ØReduce cost of communication and synch.
ØPreserve locality of data reference
ØSchedule tasks to satisfy dependences early
ØReduce overhead of parallelism management
§ Choices depend on Programming Model,
Communication abstraction, and efficiency of
primitives
§ Architecture should provide appropriate
primitives efficiently
15
Mapping
§ Two aspects:
ØWhich process runs on which particular processor?
ØWill multiple processes run on same processor?
§ Space-sharing
ØMachine divided into subsets, only one app at a time in a
subset
ØProcesses can be pinned to processors, or left to OS
§ System allocation
§ Real world
ØUser specifies some aspects, system handles some
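For the “pinned to processors” case, one concrete mechanism is the Linux sched_setaffinity call; this is only an example of a mapping primitive, not something the slides prescribe:

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process (or thread) to one CPU; Linux-specific sketch. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 means the caller */
}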
16
High-level Goals
Table 2.1 Steps in the Parallelization Process and Their Goals
§ Decomposition (architecture-dependent? mostly no)
Ø Expose enough concurrency, but not too much
§ Assignment (mostly no)
Ø Balance workload
Ø Reduce communication volume
§ Orchestration (yes)
Ø Reduce noninherent communication via data locality
Ø Reduce communication and synchronization cost as seen by the processor
Ø Reduce serialization at shared resources
Ø Schedule tasks to satisfy dependences early
§ Mapping (yes)
Ø Put related processes on the same processor if necessary
Ø Exploit locality in network topology
17
Example: Grid Solver
§ Gauss-Seidel (near-neighbor) sweeps to
convergence
Øinterior n x n points of (n+2) x (n+2) updated in each
sweep
Ødifference from previous value computed
Øaccumulate partial diffs into global diff at end of
every sweep
Ø check if it has converged to within a tolerance parameter
Ø if not, update the array and iterate
18
Grid solver (Simple Version)
for i = 1 to n
for j = 1 to n
{
B[i,j] = 0.2 * (A[i,j] +
A[i-1,j] + A[i+1,j]+
A[i,j-1] + A[i,j+1]);
diff += abs(B[i,j] - A[i,j]);
}
for i = 1 to n
for j = 1 to n
A[i,j] = B[i,j] ;
19
Sequential Version
1. int n; /*size of matrix: (n + 2-by-n + 2) elements*/
2. float **A, diff = 0;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
6. A ← malloc (a 2-d array of (n+2) x (n+2) doubles);
7. B ← malloc (a 2-d array of (n+2) x (n+2) doubles);
8. initialize(A); /*initialize the matrix A somehow*/
9. Solve (A); /*call the routine to solve equation*/
10. end main
20
Sequential Version (contd.)
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, done = 0;
14. float diff = 0, temp;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize maximum difference to 0*/
17. for i ← 1 to n do/*sweep over non-border points of grid*/
18. for j ← 1 to n do
19. B[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
20. A[i,j+1] + A[i+1,j]); /*compute average*/
21. diff += abs(B[i,j] - A[i,j]);
22. end for
23. end for
24. if (diff/(n*n) < TOL) then done = 1;
25. else Copy_Array (A ← B);
26. end while
27. end procedure
21
Decomposition & Assignment
for i = 1 to n
  for j = 1 to n
  {
    B[i,j] = 0.2 * (A[i,j] +
             A[i-1,j] + A[i+1,j] +
             A[i,j-1] + A[i,j+1]);
    diff += abs(B[i,j] - A[i,j]);
  }
for i = 1 to n
  for j = 1 to n
    A[i,j] = B[i,j];
§ Decomposition
Ø Both i and j loops can be parallelized – no data dependences
Ø Each grid point can be a task
Ø To compute diff, some coordination would be required!
§ Assignment
Ø Each grid point
Ø Each row or column
Ø A group of rows or columns
22
Grid solver (Update-in-place Version)
for i = 1 to n
for j = 1 to n
{
temp = A[i,j];
A[i,j] = 0.2 * (A[i,j] +
A[i-1,j] + A[i+1,j]+
A[i,j-1] + A[i,j+1]);
diff += abs(temp - A[i,j]);
}
23
Decomposition & Assignment
§ Decomposition
Ø Dependences across both the i and j loops
Ø Each grid point can be a task
Ø Would need point-to-point synchronization -- very expensive
§ Assignment
Ø Grid points along a diagonal form a task
Ø Restructure the loop, with global synchronization between diagonals (see the sketch below)
Ø Load imbalance: diagonals have different lengths
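A sketch of that diagonal (wavefront) restructuring in C, with OpenMP used only to mark the parallel loop (the function name and indexing are my illustration): all points on an anti-diagonal i + j = d are independent, and the serial loop over d provides the global synchronization.

#include <math.h>

double wavefront_sweep(double **A, int n)
{
    double diff = 0.0;
    for (int d = 2; d <= 2 * n; d++) {               /* diagonals done in order */
        int lo = (d - n > 1) ? d - n : 1;            /* keep (i, d-i) inside 1..n */
        int hi = (d - 1 < n) ? d - 1 : n;
        #pragma omp parallel for reduction(+:diff)   /* points on one diagonal */
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            double temp = A[i][j];
            A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j] +
                             A[i][j-1] + A[i][j+1]);
            diff += fabs(A[i][j] - temp);
        }
    }
    return diff;   /* note the load imbalance: diagonals near the corners are short */
}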
24
Exploiting Application Knowledge
§ Reorder grid traversal: red-black ordering
§ Red sweep and black sweep are each fully parallel
Ø Global synch between them (conservative but convenient)
§ Different ordering of updates: may converge slower
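One way to write the two sweeps is to select points by the parity of i + j (this parity-based version is my illustration; the pseudocode on the next slide uses stride-2 loops instead):

#include <math.h>

/* Sweep all points of one colour: red if (i + j) is even, black if odd.
   Points of one colour never neighbour each other, so the i loop (or the
   j loop) can safely be turned into a parallel forall. */
double sweep_colour(double **A, int n, int colour)      /* colour: 0 = red, 1 = black */
{
    double diff = 0.0;
    for (int i = 1; i <= n; i++) {
        int jstart = ((i + 1) % 2 == colour) ? 1 : 2;   /* first j of this colour in row i */
        for (int j = jstart; j <= n; j += 2) {
            double temp = A[i][j];
            A[i][j] = 0.2 * (A[i][j] + A[i-1][j] + A[i+1][j] +
                             A[i][j-1] + A[i][j+1]);
            diff += fabs(A[i][j] - temp);
        }
    }
    return diff;
}
/* per iteration: diff = sweep_colour(A, n, 0); barrier; diff += sweep_colour(A, n, 1); barrier */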
25
Red-Black Parallel Version
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, done = 0;
14. float diff = 0, temp;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize maximum difference to 0*/
17. forall i ← 1 to n step 2 do /*sweep black points of grid*/
18. forall j ← 2 to n step 2 do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. diff += abs(A[i,j] - temp);
23. end forall
24. end forall
24a. /* similarly, forall loops for the red points of the grid;
        ensure computation for all black points is complete first! */
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
26
Red-Black Parallel Version (contd.)
§ Decomposition into elements: degree of concurrency is
n²/2; 2 global synchronizations per n² computation
§ forall loop to express the parallelism
§ Too fine-grained parallelism ⇒ group tasks to form a
process
§ Decompose into rows? Computation vs.
communication overhead?
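A rough answer to the last question (a back-of-the-envelope estimate, not from the slides): with a block of n/p contiguous rows per process, computation per sweep ≈ n²/p points, while communication per sweep ≈ 2n boundary elements (one row to each neighbour), so the communication-to-computation ratio ≈ 2p/n, and the relative overhead shrinks as each process gets more rows.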
27
Assignment
§ Static assignment: decomposition into rows
Ø Block assignment of rows: rows i*(n/p), … , (i+1)*(n/p) - 1
are assigned to process i
Ø Cyclic assignment of rows: process i is assigned rows
i, i+p, i+2p, ...
§ Dynamic assignment
Ø get a row index, work on the row, get a new row, … (see the sketch below)
§ Concurrency? Volume of communication?
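A C sketch of the three policies (do_row() is a hypothetical helper that sweeps one row, next_row is a shared counter, and the atomic fetch-and-add is shown with a GCC builtin purely as an example; rows are indexed 0..n-1 in this sketch):

extern void do_row(int i);   /* hypothetical helper: sweep one row */
extern int next_row;         /* shared counter, initialised to 0 */

void block_rows(int pid, int p, int n)    /* rows pid*(n/p) .. (pid+1)*(n/p)-1 */
{
    for (int i = pid * (n / p); i <= (pid + 1) * (n / p) - 1; i++)
        do_row(i);
}

void cyclic_rows(int pid, int p, int n)   /* rows pid, pid+p, pid+2p, ... */
{
    for (int i = pid; i < n; i += p)
        do_row(i);
}

void dynamic_rows(int n)                  /* grab rows until none are left */
{
    for (;;) {
        int i = __sync_fetch_and_add(&next_row, 1);   /* atomic fetch-and-increment */
        if (i >= n) break;
        do_row(i);
    }
}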
28
Assignment (contd.)
[Figure: block assignment (each process gets a contiguous group of rows) vs. cyclic assignment (rows interleaved among P0, P1, P2, P3)]
29
Orchestration
§ Different for different programming
models/architectures
ØShared address space
§ Naming: global addr. space
§ Synch. through barriers and locks
ØDistributed Memory /Message passing
§ Non-shared address space
§ Send-receive messages + barrier for synch.
30
Shared Memory Version
1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/
2. float **A, diff = 0;
2a. LockDec (diff_lock);
2b. BarrierDec (barrier1);
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. read(nprocs);
6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles);
6a. Create (nprocs -1, Solve, A);
7. initialize(A); /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main
31
Shared Memory Version
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs) - 1;
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = diff = 0; /*initialize local difference to 0*/
16a. Barrier (barrier1, nprocs); /* why do we need this barrier? */
17. for i ← mybegin to myend do /*sweep over my rows of the grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. mydiff += abs(A[i,j] - temp);
23. end for
24. end for
24a. lock (diff_lock);
24b. diff += mydiff; /* or use Reduce (mydiff, diff) */
24c. unlock (diff_lock);
24d. barrier (barrier1, nprocs); /* why do we need this barrier? */
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
§ No red-black ordering; simply ignore dependences within a sweep
§ Simpler asynchronous version, may take longer to converge!
32
Shared Memory Program : Remarks
§ done condition evaluated redundantly by all
§ Each process has private mydiff variable
§ Most interesting special operations are for
synchronization, provided by LOCK-UNLOCK around a
critical section
ØSet of operations we want to execute atomically
Øaccumulations into shared diff have to be mutually
exclusive
§ Good global reduction? (one alternative is sketched below)
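One possible improvement, sketched with the slides' generic barrier primitive (names such as alldiff and the padding size are my assumptions): each process writes mydiff into its own padded slot of a shared array and, after a barrier, sums the p slots itself, replacing p serialized lock acquisitions with one barrier plus an O(p) read-only loop.

#define MAX_PROCS 64
/* one 64-byte slot per process, padded to avoid false sharing */
struct diff_slot { float val; char pad[60]; } alldiff[MAX_PROCS];

float reduce_diff(int pid, int nprocs, float mydiff)
{
    alldiff[pid].val = mydiff;          /* private slot: no lock needed */
    barrier(barrier1, nprocs);          /* barrier1 as declared in the main program (slide 31) */
    float diff = 0.0;
    for (int k = 0; k < nprocs; k++)    /* redundant but cheap O(p) sum */
        diff += alldiff[k].val;
    return diff;
}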
33
Message Passing Version
§ Cannot declare A to be global shared array
Øcompose it from per-process private arrays
Ø usually allocated in accordance with the assignment of
work -- the owner-computes rule
Ø a process assigned a set of rows allocates them locally
§ Structurally similar to SPMD Shared Memory
Version
§ Orchestration different
Ødata structures and data access/naming
Øcommunication
Øsynchronization
§ Ghost rows
34
Data Layout and Orchestration
Data partition allocated per processor
Add ghost rows to hold boundary data
Send edges to neighbors
Receive into ghost rows
Compute as in sequential program
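A sketch of this exchange written with MPI point-to-point calls (the slides use generic SEND/RECEIVE rather than MPI; myA has (n/nprocs)+2 rows of n+2 floats, with row 0 and row myend+1 as the ghost rows):

#include <mpi.h>

void exchange_ghost_rows(int n, int myend, int pid, int nprocs,
                         float myA[][n + 2])
{
    MPI_Status st;
    if (pid > 0)              /* up: send my first real row, receive into ghost row 0 */
        MPI_Sendrecv(&myA[1][0],         n + 2, MPI_FLOAT, pid - 1, 0,
                     &myA[0][0],         n + 2, MPI_FLOAT, pid - 1, 0,
                     MPI_COMM_WORLD, &st);
    if (pid < nprocs - 1)     /* down: send my last real row, receive into ghost row myend+1 */
        MPI_Sendrecv(&myA[myend][0],     n + 2, MPI_FLOAT, pid + 1, 0,
                     &myA[myend + 1][0], n + 2, MPI_FLOAT, pid + 1, 0,
                     MPI_COMM_WORLD, &st);
}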
35
Message Passing Version
1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/
2. float **myA;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. read (nprocs);
/* 6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles); */
6a. Create (nprocs -1, Solve, A);
/* 7. initialize(A); */ /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main
36
Message Passing Version
10. procedure Solve (A) /*solve the equation system*/
11. float A[n+2][n+2]; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. myend = (n/nprocs) ;
14b. myA = malloc (array of ((n/nprocs)+2) x (n+2) floats );
14c. If (pid == 0)
Initialize (A)
14d. GetMyArray (A, myA); /* get (n/nprocs) x (n+2) elts. from process 0 */
15. while (!done) { /*outermost loop over sweeps*/
16. mydiff = 0; /*initialize local difference to 0*/
16a. if (pid != 0) then
SEND (&myA[1,0] , n*sizeof(float), (pid-1), row);
16b. if (pid != nprocs-1) then
SEND (&myA[myend,0], n*sizeof(float), (pid+1), row);
16c. if (pid != 0) then
RECEIVE (&myA[0,0], n*sizeof(float), (pid -1), row);
16d. if (pid != nprocs-1) then
RECEIVE (&myA[myend+1,0], n*sizeof(float), (pid+1), row);
16e. ... ... ...
37
Message Passing Version – Solver
12. begin
… … …
15. while (!done) do /*outermost loop over sweeps*/
… … …
17. for i ← 1 to myend do/*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = myA[i,j]; /*save old value of element*/
20. myA[i,j] ← 0.2 * (myA[i,j] + myA[i,j-1] +myA[i-1,j] +
21. myA[i,j+1] + myA[i+1,j]); /*compute average*/
22. mydiff += abs(myA[i,j] - temp);
23. end for
24. end for
24a. if (pid != 0) then
24b. SEND (mydiff, sizeof(float), 0, DIFF);
24c. RECEIVE (done, sizeof(int), 0, DONE);
24d. else
24e. for k ← 1 to nprocs-1 do
24f. RECEIVE (tempdiff, sizeof(float), k, DIFF);
24g. mydiff += tempdiff;
24h. end for
24i. if (mydiff/(n*n) < TOL) then done = 1;
24j. for k ← 1 to nprocs-1 do
24k. SEND (done, sizeof(int), k, DONE);
24l. end for
26. end while
27. end procedure
38
Message Passing Version : Remarks
§ Communication in whole rows, not element at a time
§ Code similar, but indices/bounds in local rather than global
space
§ Synchronization through sends and receives
Ø Update of global diff and event synch for done condition
Ø Could implement locks and barriers with messages
§ Can use REDUCE and BROADCAST library calls to simplify
code (see the sketch below)
§ Communication done at beginning of iteration,
synchronization only between neighboring processes
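For example, the manual diff exchange (lines 24a–24k of the solver) collapses into one collective call; the MPI spelling below is my choice, since the slides only mention REDUCE and BROADCAST generically:

#include <mpi.h>

#define TOL 1e-5f   /* illustrative tolerance; the slides leave TOL unspecified */

/* Every process contributes its mydiff and gets the global sum back,
   so each one can evaluate the convergence test locally. */
int converged(float mydiff, int n)
{
    float diff;
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return (diff / ((float)n * n) < TOL);
}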
39
Orchestration: Summary
§ Shared address space
ØShared and private data explicitly separate
ØCommunication implicit in access patterns
ØSynchronization via atomic operations on shared data
ØSynchronization explicit and distinct from data
communication
§ Message passing
ØData distribution among local address spaces needed
ØNo explicit shared structures (implicit in comm. patterns)
ØCommunication is explicit
ØSynchronization implicit in communication (at least in
synch. case)
40