
Parallelization Principles

R. Govindarajan
SERC, IISc
govind@iisc.ac.in

1
Overview
§ Introduction
§ Parallelization Steps
§ Example
Ø Shared Address Space
Ø Distributed Address Space

Acknowledgments:

Slides for this tutorial are taken from the presentation materials available with the book "Parallel Computer Architecture: A Hardware/Software Approach" (Culler, Singh, and Gupta, Morgan Kaufmann) and the associated course material. They have been suitably adapted.
2
Parallel Programming

§ Shared Address Space Architecture
Ø Shared, global address space
Ø Any processor can directly reference any memory location
Ø Communication occurs implicitly as a result of loads and stores
§ Message Passing Architecture
Ø Memory is private to each node
Ø Processes communicate by messages

3
Definitions

§ Speedup = Exec. Time on Uniprocessor / Exec. Time on n processors

§ Efficiency = Speedup / n

§ Amdahl's Law:
Ø For a program in which a fraction s of the execution is sequential, speedup is limited to at most 1/s.
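
As a quick worked illustration, the small C program below evaluates the Amdahl bound 1/(s + (1-s)/n) for an assumed sequential fraction s = 0.1; the function and variable names are illustrative only.

#include <stdio.h>

/* Amdahl's law: speedup on n processors when a fraction s of the work
   is inherently sequential; 1/s is the limit as n grows without bound. */
static double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    double s = 0.1;                       /* assumed: 10% sequential part */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  speedup = %6.2f  efficiency = %.2f\n",
               n, amdahl_speedup(s, n), amdahl_speedup(s, n) / n);
    printf("limit as n grows: %.1f\n", 1.0 / s);
    return 0;
}

Even with 1024 processors the speedup stays below 10, which is the 1/s bound.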

4
Understanding Amdahl’s Law

Example: 2-phase calculation


Ø sweep over n x n grid and do some independent
computation
Ø sweep again and add each value to global sum
[Figure (a) Serial: concurrency profile over time; both phases run at concurrency 1, each taking n^2 time]

Ø Serial Execution Time = n^2 + n^2 = 2n^2


5
Understanding Amdahl’s Law

Parallel Execution time:
Ø Time for first phase = n^2/p
Ø Second phase serialized at global variable = n^2
Ø Speedup = 2n^2 / (n^2 + n^2/p), i.e. at most 2
Improved Parallel Execution
Ø Localize the sum in p procs and then do a serial sum
Ø Speedup = 2n^2 / (2n^2/p + p) ≈ p
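
A minimal sketch of this improved scheme, assuming OpenMP; the grid contents and the phase-1 computation are placeholders. The point is the structure: each of the p threads keeps a private partial sum, and only the final combination of p values is serial.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const int n = 1024;
    double *grid = malloc((size_t)n * n * sizeof *grid);
    for (int i = 0; i < n * n; i++) grid[i] = 1.0;      /* stand-in data */

    int p = omp_get_max_threads();
    double *partial = calloc(p, sizeof *partial);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();

        #pragma omp for                 /* phase 1: independent work, n^2/p per thread */
        for (int i = 0; i < n * n; i++)
            grid[i] = grid[i] * grid[i];

        double local = 0.0;
        #pragma omp for                 /* phase 2: accumulate into a private sum */
        for (int i = 0; i < n * n; i++)
            local += grid[i];
        partial[tid] = local;           /* one write per thread */
    }

    double sum = 0.0;                   /* serial combination: only p additions */
    for (int t = 0; t < p; t++) sum += partial[t];
    printf("sum = %f\n", sum);
    free(grid); free(partial);
    return 0;
}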
[Figure: concurrency vs. time. (b) Naïve Parallel: first phase at concurrency p for n^2/p time, second phase serialized at concurrency 1 for n^2 time. (c) Parallel: both phases at concurrency p for n^2/p time each, followed by a short serial sum of p partial values]

6


Definitions
§ Task
ØArbitrary piece of work in parallel computation
ØExecuted sequentially; concurrency is only across
tasks
ØFine-grained vs. coarse-grained tasks

§ Process (thread)
ØAbstract entity that performs the tasks
ØCommunicate and synchronize to perform the tasks

§ Processor
Ø Physical engine on which process executes

7
Tasks involved in Parallelization

§ Identify work that can be done in parallel


Ø work includes computation, data access and I/O
§ Partition work and perhaps data among
processes
§ Manage data access, communication and
synchronization

8
Parallelizing Computation vs. Data

§ Computation is decomposed and assigned


(partitioned) – task decomposition
Ø Task graphs, synchronization among tasks
§ Partitioning data is often a natural view too – data or domain decomposition
Ø Computation follows data: owner computes
Ø Grid example; data mining

9
Domain Decomposition: Example

§ Some computation
performed on all elts. of
the array

for i=1 to m
for j= 1 to n
a[i,j] = a[i,j] + v[i]
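
A possible row-wise, owner-computes parallelization of this loop, sketched with OpenMP (an assumption; the slide itself shows only the sequential loop). m, n, a and v are assumed to be set up by the caller.

/* Row-wise domain decomposition: each thread owns a contiguous block of
   rows of a[][] and updates only those rows ("owner computes"). */
void add_rowwise(int m, int n, double a[m][n], const double v[m])
{
    #pragma omp parallel for schedule(static)   /* static = block of rows per thread */
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            a[i][j] += v[i];
}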

10
Steps in Creating a Parallel Program

§ Decomposition of computation into tasks


§ Assignment of tasks to processes
§ Orchestration of data access, communication,
and synchronization.
§ Mapping processes to processors

11
Steps in Creating a Parallel Program

[Figure: the four steps take a sequential computation to tasks (Decomposition), tasks to processes p0..p3 (Assignment), processes to a parallel program (Orchestration), and the parallel program onto processors P0..P3 (Mapping). Decomposition and Assignment together form the Partitioning step.]

12
Decomposition

§ Identify concurrency
§ Break up computation into tasks to be divided among
processes
ØTasks may become available dynamically
ØNo. of available tasks may vary with time
§ Goal: Expose available parallelism → enough tasks to keep all processors busy

13
Assignment

§ Specifies how to group tasks together for a process


ØBalance workload, reduce communication and
management cost

§ Structured approaches usually work well


ØCode inspection (parallel loops) or understanding of
application
ØStatic versus dynamic assignment
§ Both decomposition and assignment are usually
independent of architecture or programming model
ØBut cost and complexity of using primitives may
affect decisions

14
Orchestration

§ Goals
ØReduce cost of communication and synch.
ØPreserve locality of data reference
ØSchedule tasks to satisfy dependences early
ØReduce overhead of parallelism management

§ Choices depend on Programming Model,


Communication abstraction, and efficiency of
primitives
§ Architecture should provide appropriate
primitives efficiently
15
Mapping

§ Two aspects:
ØWhich process runs on which particular processor?
ØWill multiple processes run on same processor?

§ Space-sharing
ØMachine divided into subsets, only one app at a time in a
subset
ØProcesses can be pinned to processors, or left to OS
§ System allocation
§ Real world
ØUser specifies some aspects, system handles some

16
High-level Goals

Table 2.1 Steps in the Parallelization Process and Their Goals

Step            Architecture-Dependent?   Major Performance Goals
Decomposition   Mostly no                 Expose enough concurrency, but not too much
Assignment      Mostly no                 Balance workload
                                          Reduce communication volume
Orchestration   Yes                       Reduce noninherent communication via data locality
                                          Reduce communication and synchronization cost as seen by the processor
                                          Reduce serialization at shared resources
                                          Schedule tasks to satisfy dependences early
Mapping         Yes                       Put related processes on the same processor if necessary
                                          Exploit locality in network topology

17
Example: Grid Solver

§ Gauss-Seidel (near-neighbor) sweeps to


convergence
Øinterior n x n points of (n+2) x (n+2) updated in each
sweep
Ødifference from previous value computed
Øaccumulate partial diffs into global diff at end of
every sweep
Ø check if it has converged, to within a tolerance parameter
Ø if not converged, update the array and iterate

18
Grid solver (Simple Version)

for i = 1 to n
for j = 1 to n
{
B[i,j] = 0.2 * (A[i,j] +
A[i-1,j] + A[i+1,j]+
A[i,j-1] + A[i,j+1]);
diff += abs(B[i,j] - A[i,j]);
}
for i = 1 to n
for j = 1 to n
A[i,j] = B[i,j] ;

19
Sequential Version
1. int n; /*size of matrix: (n + 2-by-n + 2) elements*/
2. float **A, diff = 0;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
6. A ← malloc (a 2-d array of (n+2) x (n+2) doubles);
7. B ← malloc (a 2-d array of (n+2) x (n+2) doubles);
8. initialize(A); /*initialize the matrix A somehow*/
9. Solve (A); /*call the routine to solve equation*/
10. end main

20
Sequential Version (contd.)

10. procedure Solve (A) /*solve the equation system*/


11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, done = 0;
14. float diff = 0, temp;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize maximum difference to 0*/
17. for i ← 1 to n do/*sweep over non-border points of grid*/
18. for j ← 1 to n do
19. B[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
20. A[i,j+1] + A[i+1,j]); /*compute average*/
21. diff += abs(B[i,j] - A[i,j]);
22. end for
23. end for
24. if (diff/(n*n) < TOL) then done = 1;
25. else Copy_Array (A ← B)
26. end while
27. end procedure

21
Decomposition & Assignment

for i = 1 to n
  for j = 1 to n
  {
    B[i,j] = 0.2 * (A[i,j] +
             A[i-1,j] + A[i+1,j] +
             A[i,j-1] + A[i,j+1]);
    diff += abs(B[i,j] - A[i,j]);
  }
for i = 1 to n
  for j = 1 to n
    A[i,j] = B[i,j];

§ Decomposition
Ø Both i and j loops can be parallelized – no data dependences
Ø Each grid point can be a task
Ø To compute diff, some coordination would be required!
§ Assignment
Ø Each grid point
Ø Each row or column
Ø A group of rows or columns
22
Grid solver (Update-in-place Version)

for i = 1 to n
for j = 1 to n
{
temp = A[i,j];
A[i,j] = 0.2 * (A[i,j] +
A[i-1,j] + A[i+1,j]+
A[i,j-1] + A[i,j+1]);
diff += abs(temp - A[i,j]);
}

23
Decomposition & Assignment

§ Decomposition
Ø Dependence on both
i and j loops
Ø Each grid point can be
a task
Ø Need point-to-point
synchronization --
Very expensive
§ Assignment
Ø Grid points along
diagonal form a task
Ø Restructure loop and
global synchronization
Ø Load imbalance
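
One possible restructuring along these lines, sketched with OpenMP (an assumption): all interior points with i + j = d lie on one anti-diagonal and are mutually independent, so each diagonal becomes a parallel step, with a synchronization between consecutive diagonals. The short diagonals near the corners carry little work, which is the load imbalance noted above.

#include <math.h>

/* One update-in-place sweep, restructured as a wavefront over anti-diagonals.
   Returns the accumulated diff for the sweep; A is the (n+2) x (n+2) grid. */
float wavefront_sweep(int n, float A[n + 2][n + 2])
{
    float diff = 0.0f;
    for (int d = 2; d <= 2 * n; d++) {            /* anti-diagonal: i + j = d */
        int lo = d - n > 1 ? d - n : 1;           /* first interior row on this diagonal */
        int hi = d - 1 < n ? d - 1 : n;           /* last interior row on this diagonal */
        #pragma omp parallel for reduction(+:diff)
        for (int i = lo; i <= hi; i++) {          /* implicit barrier ends each diagonal */
            int j = d - i;
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                       + A[i][j - 1] + A[i][j + 1]);
            diff += fabsf(temp - A[i][j]);
        }
    }
    return diff;
}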
24
Exploiting Application Knowledge

§ Reorder grid traversal: red-black ordering
§ Red sweep and black sweep are each fully parallel
§ Global synch between them (conservative but convenient)
§ Different ordering of updates: may converge slower

25
Red-Black Parallel Version
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, done = 0;
14. float diff = 0, temp;
15. while (!done) do /*outermost loop over sweeps*/
16. diff = 0; /*initialize maximum difference to 0*/
17. forall i ← 1 to n step 2 do/*sweep black points of grid*/
18. forall j ← 2 to n+1 step 2 do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. diff += abs(A[i,j] - temp);
23. end forall
24. end forall
24a. /* similarly, forall loop for red points of grid */
     /* Ensure computation for all black points is complete! */
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
26
Red-Black Parallel Version (contd.)
§ Decomposition into elements: degree of concurrency n^2/2; 2 global synchronizations per n^2 computation
§ forall loop to express the parallelism.
§ Too fine-grain parallelism ⇒ group tasks to form a
process.
§ Decompose into rows? Computation vs.
communication overhead?
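
For comparison, a hedged sketch of how the red half-sweep might look when the forall is expressed as an OpenMP loop over rows, so that a row (rather than a single point) is the unit of work; the red/black parity convention used here is an assumption.

#include <math.h>

/* One "red" half-sweep: a point (i, j) is treated as red when i + j is even.
   The matching black sweep is identical with the parity flipped. The implicit
   barrier at the end of the omp for is the global synch between the sweeps. */
float red_sweep(int n, float A[n + 2][n + 2])
{
    float diff = 0.0f;
    #pragma omp parallel for reduction(+:diff) schedule(static)
    for (int i = 1; i <= n; i++) {
        int j0 = (i % 2) ? 1 : 2;                 /* first column with (i + j) even */
        for (int j = j0; j <= n; j += 2) {
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i - 1][j] + A[i + 1][j]
                                       + A[i][j - 1] + A[i][j + 1]);
            diff += fabsf(temp - A[i][j]);
        }
    }
    return diff;
}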

27
Assignment

§ Static assignment: decomposition into rows
Ø Block assignment of rows: rows i*(n/p), … , (i+1)*(n/p) - 1 are assigned to process i
Ø Cyclic assignment of rows: process i is assigned rows i, i+p, i+2p, ...
§ Dynamic assignment
Ø Get a row index, work on the row, get a new row, …
§ Concurrency? Volume of communication?
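
The index arithmetic behind the two static assignments, written out as small C helpers; the names are illustrative, rows are numbered 0 .. n-1 here, and n is assumed to be a multiple of nprocs, matching the block formula above.

/* Block: process pid owns rows pid*(n/nprocs) .. (pid+1)*(n/nprocs) - 1. */
int block_first(int pid, int n, int nprocs) { return pid * (n / nprocs); }
int block_last (int pid, int n, int nprocs) { return (pid + 1) * (n / nprocs) - 1; }

/* Cyclic: process pid owns rows pid, pid + nprocs, pid + 2*nprocs, ...,
   i.e. row i belongs to process i % nprocs. */
int cyclic_owner(int i, int nprocs) { return i % nprocs; }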

28
Assignment (contd.)

[Figure: grid rows labeled with owning processes P0-P3, illustrating block and cyclic row assignments]
29
Orchestration

§ Different for different programming


models/architectures
ØShared address space
§ Naming: global addr. space
§ Synch. through barriers and locks
Ø Distributed Memory / Message Passing
§ Non-shared address space
§ Send-receive messages + barrier for synch.

30
Shared Memory Version

1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/


2. float **A, diff = 0;
2a. LockDec (diff_lock);
2b. BarrierDec (barrier1);
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. Read (nprocs);
6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles);
6a. Create (nprocs -1, Solve, A);
7. initialize(A); /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main

31
Shared Memory Version
10. procedure Solve (A) /*solve the equation system*/
11. float **A; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
    /* No red-black; simply ignore dependences within a sweep. */
    /* Simpler asynchronous version, may take longer to converge! */
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. mybegin = 1 + (n/nprocs)*pid;
14b. myend = mybegin + (n/nprocs) - 1;
15. while (!done) do /*outermost loop over sweeps*/
16. mydiff = diff = 0; /*initialize local difference to 0*/
16a. Barrier (barrier1, nprocs); /* Why do we need this barrier? */
17. for i ← mybegin to myend do/*sweep over my rows of grid*/
18. for j ← 1 to n do
19. temp = A[i,j]; /*save old value of element*/
20. A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
21. A[i,j+1] + A[i+1,j]); /*compute average*/
22. mydiff += abs(A[i,j] - temp);
23. end for
24. end for
24a. lock (diff_lock);
24b. diff += mydiff; /* or use a Reduce (mydiff, diff) primitive */
24c. unlock (diff_lock);
24d. barrier (barrier1, nprocs); /* Why do we need this barrier? */
25. if (diff/(n*n) < TOL) then done = 1;
26. end while
27. end procedure
32
Shared Memory Program : Remarks

§ done condition evaluated redundantly by all


§ Each process has private mydiff variable
§ Most interesting special operations are for synchronization, provided by LOCK-UNLOCK around the critical section
ØSet of operations we want to execute atomically
Øaccumulations into shared diff have to be mutually
exclusive
§ Good global reduction?
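
A minimal pthreads sketch of that LOCK-UNLOCK critical section (an assumption; the slides use generic LockDec/lock/unlock primitives). Each thread folds its private mydiff into the shared diff exactly once per sweep.

#include <pthread.h>

static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;  /* plays the role of LockDec(diff_lock) */
static float diff = 0.0f;                                      /* shared global difference */

void accumulate_diff(float mydiff)
{
    pthread_mutex_lock(&diff_lock);    /* enter critical section */
    diff += mydiff;                    /* accumulation is now mutually exclusive */
    pthread_mutex_unlock(&diff_lock);  /* leave critical section */
}

With p processes this still serializes p additions per sweep; a tree-structured or library reduction (the "good global reduction" question) cuts that serialization down.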

33
Message Passing Version

§ Cannot declare A to be global shared array


Øcompose it from per-process private arrays
Øusually allocated in accordance with the assignment of
work -- owner-compute rule
§ process assigned a set of rows allocates them locally
§ Structurally similar to SPMD Shared Memory
Version
§ Orchestration different
Ødata structures and data access/naming
Øcommunication
Øsynchronization
§ Ghost rows
34
Data Layout and Orchestration

Data partition allocated per processor


Add ghost rows to hold boundary data
Send edges to neighbors
Receive into ghost rows
Compute as in sequential program
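
A hedged MPI sketch of this exchange (MPI itself is an assumption; the slides use generic SEND/RECEIVE). Each process swaps its edge rows with its neighbours, receiving into the ghost rows, before computing as in the sequential program. myA is stored here as a flat array of (rows + 2) x (n + 2) floats, with rows = n/nprocs local rows plus two ghost rows.

#include <mpi.h>

void exchange_ghost_rows(float *myA, int rows, int n, int pid, int nprocs)
{
    int up   = (pid == 0)          ? MPI_PROC_NULL : pid - 1;   /* no neighbour: transfer becomes a no-op */
    int down = (pid == nprocs - 1) ? MPI_PROC_NULL : pid + 1;
    int w = n + 2;                                               /* elements per row, incl. boundary columns */

    /* send my first real row up, receive the lower neighbour's edge into the bottom ghost row */
    MPI_Sendrecv(&myA[1 * w],          w, MPI_FLOAT, up,   0,
                 &myA[(rows + 1) * w], w, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my last real row down, receive the upper neighbour's edge into the top ghost row */
    MPI_Sendrecv(&myA[rows * w],       w, MPI_FLOAT, down, 1,
                 &myA[0],              w, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

Pairing each send with the matching receive in MPI_Sendrecv avoids the deadlock risk of all processes blocking in SEND at once.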
35
Message Passing Version

1. int n, nprocs; /* matrix: (n + 2-by-n + 2) elts.*/


2. float **myA;
3. main()
4. begin
5. read(n) ; /*read input parameter: matrix size*/
5a. read (nprocs);
/* 6. A ← g_malloc (a 2-d array of (n+2) x (n+2) doubles); */
6a. Create (nprocs -1, Solve, A);
/* 7. initialize(A); */ /*initialize the matrix A somehow*/
8. Solve (A); /*call the routine to solve equation*/
8a. Wait_for_End (nprocs-1);
9. end main

36
Message Passing Version
10. procedure Solve (A) /*solve the equation system*/
11. float A[n+2][n+2]; /*A is an (n + 2)-by-(n + 2) array*/
12. begin
13. int i, j, pid, done = 0;
14. float mydiff, temp;
14a. myend = (n/nprocs) ;
14b. myA = malloc (array of ((n/nprocs)+2) x (n+2) floats );
14c. if (pid == 0)
        Initialize (A)
14d. GetMyArray (A, myA); /* get (n/nprocs) x (n+2) elts. from process 0 */
15. while (!done) { /*outermost loop over sweeps*/
16. mydiff = 0; /*initialize local difference to 0*/
16a. if (pid != 0) then
SEND (&myA[1,0] , n*sizeof(float), (pid-1), row);
16b. if (pid != nprocs-1) then
SEND (&myA[myend,0], n*sizeof(float), (pid+1), row);
16c. if (pid != 0) then
RECEIVE (&myA[0,0], n*sizeof(float), (pid -1), row);
16d. if (pid != nprocs-1) then
RECEIVE (&myA[myend+1,0], n*sizeof(float), (pid+1), row);
16e. ... ... ...
37
Message Passing Version – Solver
12. begin
… … …
15. while (!done) do /*outermost loop over sweeps*/
… … …
17. for i ← 1 to myend do/*sweep for all points of grid*/
18. for j ← 1 to n do
19. temp = myA[i,j]; /*save old value of element*/
20. myA[i,j] ← 0.2 * (myA[i,j] + myA[i,j-1] +myA[i-1,j] +
21. myA[i,j+1] + myA[i+1,j]); /*compute average*/
22. mydiff += abs(myA[i,j] - temp);
23. end for
24. end for
24a. if (pid != 0) then
24b.   SEND (mydiff, sizeof(float), 0, DIFF);
24c.   RECEIVE (done, sizeof(int), 0, DONE);
24d. else
24e.   for k ← 1 to nprocs-1 do
24f.     RECEIVE (tempdiff, sizeof(float), k, DIFF);
24g.     mydiff += tempdiff;
24h.   end for
24i.   if (mydiff/(n*n) < TOL) then done = 1;
24j.   for k ← 1 to nprocs-1 do
24k.     SEND (done, sizeof(int), k, DONE);
26. end while
27. end procedure
38
Message Passing Version : Remarks

§ Communication in whole rows, not element at a time


§ Code similar, but indices/bounds in local rather than global
space
§ Synchronization through sends and receives
Ø Update of global diff and event synch for done condition
Ø Could implement locks and barriers with messages
§ Can use REDUCE and BROADCAST library calls to simplify code (see the sketch after this list)
§ Communication done at beginning of iteration,
synchronization only between neighboring processes
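
For example, the explicit diff/done exchange of the previous slides could collapse to a single collective, sketched here with MPI (an assumption; the slides speak only of generic REDUCE and BROADCAST calls). MPI_Allreduce returns the global sum on every process, so no separate broadcast of done is needed.

#include <mpi.h>

/* Convergence test using a collective: every process contributes its local
   mydiff, receives the global sum, and evaluates the test itself. */
int converged(float mydiff, int n, float tol)
{
    float diff;
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return diff / ((float)n * n) < tol;
}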

39
Orchestration: Summary
§ Shared address space
ØShared and private data explicitly separate
ØCommunication implicit in access patterns
ØSynchronization via atomic operations on shared data
ØSynchronization explicit and distinct from data
communication

§ Message passing
ØData distribution among local address spaces needed
ØNo explicit shared structures (implicit in comm. patterns)
ØCommunication is explicit
ØSynchronization implicit in communication (at least in
synch. case)

40
