
Parallelization Principles

R. Govindarajan
SERC, IISc
govind@iisc.ac.in

1
Overview
§ Introduction
§ Parallelization Steps
§ Example
Ø Shared Address Space
Ø Distributed Address Space

Acknowledgments:

Slides for this tutorial are taken from the presentation materials available with the book "Parallel Computer Architecture: A Hardware/Software Approach" (Culler, Singh, and Gupta, Morgan Kaufmann) and the associated course material. They have been suitably adapted.
2
Parallel Programming

§ Shared Address Space Architecture
Ø Shared, global address space
Ø Any processor can directly reference any memory location
Ø Communication occurs implicitly as a result of loads and stores
§ Message Passing Architecture
Ø Memory is private to each node
Ø Processes communicate by messages

3
Definitions

§ Speedup = Exec. Time on Uniprocessor / Exec. Time on n processors

§ Efficiency = Speedup / n

§ Amdahl's Law:
Ø For a program in which a fraction s of the execution is sequential, speedup is limited to at most 1/s.
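
As a quick worked illustration, the small C program below evaluates the Amdahl bound 1/(s + (1-s)/n) for an assumed sequential fraction s = 0.1; the function and variable names are illustrative only.

#include <stdio.h>

/* Amdahl's law: speedup on n processors when a fraction s of the work
   is inherently sequential; 1/s is the limit as n grows without bound. */
static double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    double s = 0.1;                       /* assumed: 10% sequential part */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  speedup = %6.2f  efficiency = %.2f\n",
               n, amdahl_speedup(s, n), amdahl_speedup(s, n) / n);
    printf("limit as n grows: %.1f\n", 1.0 / s);
    return 0;
}

Even with 1024 processors the speedup stays below 10, which is the 1/s bound.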

4
Understanding Amdahl’s Law

Example: 2-phase calculation


Ø sweep over n x n grid and do some independent
computation
Ø sweep again and add each value to global sum
[Figure (a) Serial: concurrency profile over time; both phases run at concurrency 1, each taking n^2 time]

Ø Serial Execution Time = n^2 + n^2 = 2n^2


5
Understanding Amdahl’s Law

Parallel Execution time:
Ø Time for first phase = n^2/p
Ø Second phase serialized at global variable = n^2
Ø Speedup = 2n^2 / (n^2 + n^2/p), i.e. at most 2
Improved Parallel Execution
Ø Localize the sum in p procs and then do a serial sum
Ø Speedup = 2n^2 / (2n^2/p + p) ≈ p
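
A minimal sketch of this improved scheme, assuming OpenMP; the grid contents and the phase-1 computation are placeholders. The point is the structure: each of the p threads keeps a private partial sum, and only the final combination of p values is serial.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int n = 1024;
    double *grid = malloc((size_t)n * n * sizeof *grid);
    for (int i = 0; i < n * n; i++) grid[i] = 1.0;      /* stand-in data */

    int p = omp_get_max_threads();
    double *partial = calloc(p, sizeof *partial);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        #pragma omp for                 /* phase 1: independent work, n^2/p per thread */
        for (int i = 0; i < n * n; i++)
            grid[i] = grid[i] * grid[i];

        double local = 0.0;
        #pragma omp for                 /* phase 2: accumulate into a private sum */
        for (int i = 0; i < n * n; i++)
            local += grid[i];
        partial[tid] = local;           /* one write per thread */
    }

    double sum = 0.0;                   /* serial combination: only p additions */
    for (int t = 0; t < p; t++) sum += partial[t];
    printf("sum = %f\n", sum);
    free(grid); free(partial);
    return 0;
}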
[Figure: concurrency vs. time. (b) Naïve Parallel: first phase at concurrency p for n^2/p time, second phase serialized at concurrency 1 for n^2 time. (c) Parallel: both phases at concurrency p for n^2/p time each, followed by a short serial sum of p partial values]

6


Definitions
§ Task
ØArbitrary piece of work in parallel computation
ØExecuted sequentially; concurrency is only across
tasks
ØFine-grained vs. coarse-grained tasks

§ Process (thread)
ØAbstract entity that performs the tasks
ØCommunicate and synchronize to perform the tasks

§ Processor
Ø Physical engine on which process executes

7
Tasks involved in Parallelization

§ Identify work that can be done in parallel


Ø work includes computation, data access and I/O
§ Partition work and perhaps data among
processes
§ Manage data access, communication and
synchronization

8
Parallelizing Computation vs. Data

§ Computation is decomposed and assigned


(partitioned) – task decomposition
Ø Task graphs, synchronization among tasks
§ Partitioning data is often a natural view too – data or domain decomposition
Ø Computation follows data: owner computes
Ø Grid example; data mining

9
Domain Decomposition: Example

§ Some computation
performed on all elts. of
the array

for i=1 to m
for j= 1 to n
a[i,j] = a[i,j] + v[i]
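
A possible row-wise, owner-computes parallelization of this loop, sketched with OpenMP (an assumption; the slide itself shows only the sequential loop). m, n, a and v are assumed to be set up by the caller.

/* Row-wise domain decomposition: each thread owns a contiguous block of
   rows of a[][] and updates only those rows ("owner computes"). */
void add_rowwise(int m, int n, double a[m][n], const double v[m])
{
    #pragma omp parallel for schedule(static)   /* static = block of rows per thread */
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[i][j] += v[i];
}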

10
Steps in Creating a Parallel Program

§ Decomposition of computation into tasks


§ Assignment of tasks to processes
§ Orchestration of data access, communication,
and synchronization.
§ Mapping processes to processors

11
Steps in Creating a Parallel Program

[Figure: the four steps take a sequential computation to tasks (Decomposition), tasks to processes p0..p3 (Assignment), processes to a parallel program (Orchestration), and the parallel program onto processors P0..P3 (Mapping). Decomposition and Assignment together form the Partitioning step.]

12
Decomposition

§ Identify concurrency
§ Break up computation into tasks to be divided among
processes
ØTasks may become available dynamically
ØNo. of available tasks may vary with time
§ Goal: Expose available parallelism → enough tasks to keep all processors busy

13
Assignment

§ Specifies how to group tasks together for a process


ØBalance workload, reduce communication and
management cost

§ Structured approaches usually work well


ØCode inspection (parallel loops) or understanding of
application
ØStatic versus dynamic assignment
§ Both decomposition and assignment are usually
independent of architecture or programming model
ØBut cost and complexity of using primitives may
affect decisions

14
Orchestration

§ Goals
ØReduce cost of communication and synch.
ØPreserve locality of data reference
ØSchedule tasks to satisfy dependences early
ØReduce overhead of parallelism management

§ Choices depend on Programming Model,


Communication abstraction, and efficiency of
primitives
§ Architecture should provide appropriate
primitives efficiently
15
Mapping

§ Two aspects:
ØWhich process runs on which particular processor?
ØWill multiple processes run on same processor?

§ Space-sharing
ØMachine divided into subsets, only one app at a time in a
subset
ØProcesses can be pinned to processors, or left to OS
§ System allocation
§ Real world
ØUser specifies some aspects, system handles some

16
High-level Goals

Table 2.1 Steps in the Parallelization Process and Their Goals

Step            Architecture-Dependent?   Major Performance Goals
Decomposition   Mostly no                 Expose enough concurrency, but not too much
Assignment      Mostly no                 Balance workload
                                          Reduce communication volume
Orchestration   Yes                       Reduce noninherent communication via data locality
                                          Reduce communication and synchronization cost as seen by the processor
                                          Reduce serialization at shared resources
                                          Schedule tasks to satisfy dependences early
Mapping         Yes                       Put related processes on the same processor if necessary
                                          Exploit locality in network topology

17
Example: Grid Solver

§ Gauss-Seidel (near-neighbor) sweeps to


convergence
Øinterior n x n points of (n+2) x (n+2) updated in each
sweep
Ødifference from previous value computed
Øaccumulate partial diffs into global diff at end of
every sweep
Ø check if it has converged, to within a tolerance parameter
Ø if not converged, update the array and iterate

18
Grid solver (Simple Version)

for i = 1 to n
for j = 1 to n
{
B[i,j] = 0.2 * (A[i,j] +
A[i-1,j] + A[i+1,j]+
A[i,j-1] + A[i,j+1]);
diff += abs(B[i,j] - A[i,j]);
}
for i = 1 to n
for j = 1 to n
A[i,j] = B[i,j] ;

19
Sequential Version
1. int n; /*size of matrix: (n + 2-by-n + 2) elements*/
2. float **A, diff = 0;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
6. A ← malloc (a 2-d array of (n+2) x (n+2) doubles);
7. B ← malloc (a 2-d array of (n+2) x (n+2) doubles);
8. initialize(A); /*initialize the matrix A somehow*/
9. Solve (A); /*call the routine to solve equation*/
10. end main

20
Sequential Version (contd.)

10. procedure Solve (A) /*solve the equation system*/


11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, done = 0;
14. float diff = 0, temp;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize maximum difference to 0*/
17. for i ← 1 to n do/*sweep over non-border points of grid*/
18. for j ← 1 to n do
19. B[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
20. A[i,j+1] + A[i+1,j]); /*compute average*/
21. diff += abs(B[i,j] - A[i,j]);
22. end for
23. end for
24. if (diff/(n*n) < TOL) then done = 1;
25. else Copy_Array (A ← B)
26. end while
27. end procedure

21
Decomposition & Assignment

for i = 1 to n
  for j = 1 to n
  {
    B[i,j] = 0.2 * (A[i,j] +
             A[i-1,j] + A[i+1,j] +
             A[i,j-1] + A[i,j+1]);
    diff += abs(B[i,j] - A[i,j]);
  }
for i = 1 to n
  for j = 1 to n
    A[i,j] = B[i,j];

§ Decomposition
Ø Both i and j loops can be parallelized – no data dependences
Ø Each grid point can be a task
Ø To compute diff, some coordination would be required!
§ Assignment
Ø Each grid point
Ø Each row or column
Ø A group of rows or columns
22
Grid solver (Update-in-place Version)

for i = 1 to n
for j = 1 to n
{
temp = A[i,j];
A[i,j] = 0.2 * (A[i,j] +
A[i-1,j] + A[i+1,j]+
A[i,j-1] + A[i,j+1]);
diff += abs(temp - A[i,j]);
}

23
Decomposition & Assignment

§ Decomposition
Ø Dependence on both
i and j loops
Ø Each grid point can be
a task
Ø Need point-to-point
synchronization --
Very expensive
§ Assignment
Ø Grid points along
diagonal form a task
Ø Restructure loop and
global synchronization
Ø Load imbalance
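
One possible restructuring along these lines, sketched with OpenMP (an assumption): all interior points with i + j = d lie on one anti-diagonal and are mutually independent, so each diagonal becomes a parallel step, with a synchronization between consecutive diagonals. The short diagonals near the corners carry little work, which is the load imbalance noted above.

#include <math.h>

/* One update-in-place sweep, restructured as a wavefront over anti-diagonals.
   Returns the accumulated diff for the sweep; A is the (n+2) x (n+2) grid. */
float wavefront_sweep(int n, float A[n + 2][n + 2])
{
    float diff = 0.0f;
    for (int d = 2; d <= 2 * n; d++) {            /* anti-diagonal: i + j = d */
        int lo = d - n > 1 ? d - n : 1;           /* first interior row on this diagonal */
        int hi = d - 1 < n ? d - 1 : n;           /* last interior row on this diagonal */
        #pragma omp parallel for reduction(+:diff)
        for (int i = lo; i <= hi; i++) {          /* implicit barrier ends each diagonal */
            int j = d - i;
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                       + A[i][j - 1] + A[i][j + 1]);
            diff += fabsf(temp - A[i][j]);
        }
    }
    return diff;
}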
24
Exploiting Application Knowledge

§ Reorder grid traversal: red-black ordering
§ Red sweep and black sweep are each fully parallel
§ Global synch between them (conservative but convenient)
§ Different ordering of updates: may converge slower

25
Red-Black Parallel Version
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, done = 0;
14. float diff = 0, temp;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize maximum difference to 0*/
17. forall i ← 1 to n step 2 do/*sweep black points of grid*/
18. forall j ← 2 to n+1 step 2 do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. diff += abs(A[i,j] - temp);
23. end forall
24. end forall
24a. /* similarly, forall loop for red points of grid */
     /* Ensure computation for all black points is complete! */
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
26
Red-Black Parallel Version (contd.)
§ Decomposition into elements: degree of concurrency n^2/2; 2 global synchronizations per n^2 computation
§ forall loop to express the parallelism.
§ Too fine-grain parallelism ⇒ group tasks to form a
process.
§ Decompose into rows? Computation vs.
communication overhead?
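
For comparison, a hedged sketch of how the red half-sweep might look when the forall is expressed as an OpenMP loop over rows, so that a row (rather than a single point) is the unit of work; the red/black parity convention used here is an assumption.

#include <math.h>

/* One "red" half-sweep: a point (i, j) is treated as red when i + j is even.
   The matching black sweep is identical with the parity flipped. The implicit
   barrier at the end of the omp for is the global synch between the sweeps. */
float red_sweep(int n, float A[n + 2][n + 2])
{
    float diff = 0.0f;
    #pragma omp parallel for reduction(+:diff) schedule(static)
    for (int i = 1; i <= n; i++) {
        int j0 = (i % 2) ? 1 : 2;                 /* first column with (i + j) even */
        for (int j = j0; j <= n; j += 2) {
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                       + A[i][j - 1] + A[i][j + 1]);
            diff += fabsf(temp - A[i][j]);
        }
    }
    return diff;
}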

27
Assignment

§ Static assignment: decomposition into rows
Ø Block assignment of rows: rows i*(n/p), … , (i+1)*(n/p) - 1 are assigned to process i
Ø Cyclic assignment of rows: process i is assigned rows i, i+p, i+2p, ...
§ Dynamic assignment
Ø Get a row index, work on the row, get a new row, …
§ Concurrency? Volume of communication?
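
The index arithmetic behind the two static assignments, written out as small C helpers; the names are illustrative, rows are numbered 0 .. n-1 here, and n is assumed to be a multiple of nprocs, matching the block formula above.

/* Block: process pid owns rows pid*(n/nprocs) .. (pid+1)*(n/nprocs) - 1. */
int block_first(int pid, int n, int nprocs) { return pid * (n / nprocs); }
int block_last (int pid, int n, int nprocs) { return (pid + 1) * (n / nprocs) - 1; }

/* Cyclic: process pid owns rows pid, pid + nprocs, pid + 2*nprocs, ...,
   i.e. row i belongs to process i % nprocs. */
int cyclic_owner(int i, int nprocs) { return i % nprocs; }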

28
Assignment (contd.)

[Figure: grid rows labeled with owning processes P0-P3, illustrating block and cyclic row assignments]
29
Orchestration

§ Different for different programming


models/architectures
ØShared address space
§ Naming: global addr. space
§ Synch. through barriers and locks
Ø Distributed Memory / Message Passing
§ Non-shared address space
§ Send-receive messages + barrier for synch.

30
Shared Memory Version

1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/


2. float **A, diff = 0;
2a. LockDec (diff_lock);
2b. BarrierDec (barrier1);
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. Read (nprocs);
6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles);
6a. Create (nprocs -1, Solve, A);
7. initialize(A); /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main

31
Shared Memory Version
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
    /* No red-black; simply ignore dependences within a sweep. */
    /* Simpler asynchronous version, may take longer to converge! */
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs) - 1;
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = diff = 0; /*initialize local difference to 0*/
16a. Barrier (barrier1, nprocs); /* Why do we need this barrier? */
17. for i ← mybegin to myend do/*sweep over my rows of grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. mydiff += abs(A[i,j] - temp);
23. end for
24. end for
24a. lock (diff_lock);
24b. diff += mydiff; /* or use a Reduce (mydiff, diff) primitive */
24c. unlock (diff_lock);
24d. barrier (barrier1, nprocs); /* Why do we need this barrier? */
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
32
Shared Memory Program : Remarks

§ done condition evaluated redundantly by all


§ Each process has private mydiff variable
§ Most interesting special operations are for synchronization, provided by LOCK-UNLOCK around the critical section
ØSet of operations we want to execute atomically
Øaccumulations into shared diff have to be mutually
exclusive
§ Good global reduction?
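
A minimal pthreads sketch of that LOCK-UNLOCK critical section (an assumption; the slides use generic LockDec/lock/unlock primitives). Each thread folds its private mydiff into the shared diff exactly once per sweep.

#include <pthread.h>

static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays the role of LockDec(diff_lock) */
static float diff = 0.0f;                                      /* shared global difference */

void accumulate_diff(float mydiff)
{
    pthread_mutex_lock(&diff_lock);    /* enter critical section */
    diff += mydiff;                    /* accumulation is now mutually exclusive */
    pthread_mutex_unlock(&diff_lock);  /* leave critical section */
}

With p processes this still serializes p additions per sweep; a tree-structured or library reduction (the "good global reduction" question) cuts that serialization down.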

33
Message Passing Version

§ Cannot declare A to be global shared array


Øcompose it from per-process private arrays
Øusually allocated in accordance with the assignment of
work -- owner-compute rule
§ process assigned a set of rows allocates them locally
§ Structurally similar to SPMD Shared Memory
Version
§ Orchestration different
Ødata structures and data access/naming
Øcommunication
Øsynchronization
§ Ghost rows
34
Data Layout and Orchestration

Data partition allocated per processor


Add ghost rows to hold boundary data
Send edges to neighbors
Receive into ghost rows
Compute as in sequential program
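
A hedged MPI sketch of this exchange (MPI itself is an assumption; the slides use generic SEND/RECEIVE). Each process swaps its edge rows with its neighbours, receiving into the ghost rows, before computing as in the sequential program. myA is stored here as a flat array of (rows + 2) x (n + 2) floats, with rows = n/nprocs local rows plus two ghost rows.

#include <mpi.h>

void exchange_ghost_rows(float *myA, int rows, int n, int pid, int nprocs)
{
    int up   = (pid == 0)          ? MPI_PROC_NULL : pid - 1;   /* no neighbour: transfer becomes a no-op */
    int down = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;
    int w = n + 2;                                               /* elements per row, incl. boundary columns */

    /* send my first real row up, receive the lower neighbour's edge into the bottom ghost row */
    MPI_Sendrecv(&myA[1 * w],          w, MPI_FLOAT, up,   0,
                 &myA[(rows + 1) * w], w, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my last real row down, receive the upper neighbour's edge into the top ghost row */
    MPI_Sendrecv(&myA[rows * w],       w, MPI_FLOAT, down, 1,
                 &myA[0],              w, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Pairing each send with the matching receive in MPI_Sendrecv avoids the deadlock risk of all processes blocking in SEND at once.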
35
Message Passing Version

1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/


2. float **myA;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. read (nprocs);
/* 6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles); */
6a. Create (nprocs -1, Solve, A);
/* 7. initialize(A); */ /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main

36
Message Passing Version
10. procedure Solve (A) /*solve the equation system*/
11. float A[n+2][n+2]; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. myend = (n/nprocs) ;
14b. myA = malloc (array of ((n/nprocs)+2) x (n+2) floats );
14c. if (pid == 0)
        Initialize (A)
14d. GetMyArray (A, myA); /* get (n/nprocs) x (n+2) elts. from process 0 */
15. while (!done) { /*outermost loop over sweeps*/
16. mydiff = 0; /*initialize local difference to 0*/
16a. if (pid != 0) then
SEND (&myA[1,0] , n*sizeof(float), (pid-1), row);
16b. if (pid != nprocs-1) then
SEND (&myA[myend,0], n*sizeof(float), (pid+1), row);
16c. if (pid != 0) then
RECEIVE (&myA[0,0], n*sizeof(float), (pid -1), row);
16d. if (pid != nprocs-1) then
RECEIVE (&myA[myend+1,0], n*sizeof(float), (pid+1), row);
16e. ... ... ...
37
Message Passing Version – Solver
12. begin
… … …
15. while (!done) do /*outermost loop over sweeps*/
… … …
17. for i ← 1 to myend do/*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = myA[i,j]; /*save old value of element*/
20. myA[i,j] ← 0.2 * (myA[i,j] + myA[i,j-1] +myA[i-1,j] +
21. myA[i,j+1] + myA[i+1,j]); /*compute average*/
22. mydiff += abs(myA[i,j] - temp);
23. end for
24. end for
24a. if (pid != 0) then
24b.   SEND (mydiff, sizeof(float), 0, DIFF);
24c.   RECEIVE (done, sizeof(int), 0, DONE);
24d. else
24e.   for k ← 1 to nprocs-1 do
24f.     RECEIVE (tempdiff, sizeof(float), k, DIFF);
24g.     mydiff += tempdiff;
24h.   end for
24i.   if (mydiff/(n*n) < TOL) then done = 1;
24j.   for k ← 1 to nprocs-1 do
24k.     SEND (done, sizeof(int), k, DONE);
26. end while
27. end procedure
38
Message Passing Version : Remarks

§ Communication in whole rows, not element at a time


§ Code similar, but indices/bounds in local rather than global
space
§ Synchronization through sends and receives
Ø Update of global diff and event synch for done condition
Ø Could implement locks and barriers with messages
§ Can use REDUCE and BROADCAST library calls to simplify code (see the sketch after this list)
§ Communication done at beginning of iteration,
synchronization only between neighboring processes
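
For example, the explicit diff/done exchange of the previous slides could collapse to a single collective, sketched here with MPI (an assumption; the slides speak only of generic REDUCE and BROADCAST calls). MPI_Allreduce returns the global sum on every process, so no separate broadcast of done is needed.

#include <mpi.h>

/* Convergence test using a collective: every process contributes its local
   mydiff, receives the global sum, and evaluates the test itself. */
int converged(float mydiff, int n, float tol)
{
    float diff;
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return diff / ((float)n * n) < tol;
}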

39
Orchestration: Summary
§ Shared address space
ØShared and private data explicitly separate
ØCommunication implicit in access patterns
ØSynchronization via atomic operations on shared data
ØSynchronization explicit and distinct from data
communication

§ Message passing
ØData distribution among local address spaces needed
ØNo explicit shared structures (implicit in comm. patterns)
ØCommunication is explicit
ØSynchronization implicit in communication (at least in
synch. case)

40
