What is parallelism?
Victor Eijkhout
Fall 2022
Justification
Parallel computing has been a necessity for decades in computational
science. Here we discuss some of the basic concepts. Actual parallel
programming will be discussed in other lectures.
2
Basic concepts
3
1 The basic idea
Parallelism is about doing multiple things at once.
• Hardware: vector instructions, multiple cores, nodes in a cluster.
• Algorithm: can you think of examples?
4
2 Simple example
Summing two arrays together:
for (i=0; i<n; i++)
a[i] = b[i] + c[i];
Parallel: every processing element does
for ( i in my_subset_of_indices )
a[i] = b[i] + c[i];
Ideally, the time goes down linearly with the number of processors (see the sketch below).
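One concrete realization (a minimal sketch, not from the slides) uses OpenMP, which appears later in these lectures; the directive splits the iterations over the available threads.
void vector_add( double *a, const double *b, const double *c, int n ) {
#pragma omp parallel for    /* each thread gets a subset of the indices */
  for (int i=0; i<n; i++)
    a[i] = b[i] + c[i];
}
(Compile with an OpenMP flag, for instance -fopenmp.)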
5
3 Differences between operations
Vector addition:
for (i=0; i<n; i++)
  a[i] = b[i] + c[i];

Summation (reduction):
s = 0;
for (i=0; i<n; i++)
  s += x[i];
• Compare operation counts
• Compare behavior on single processor. What about multi-core?
• Other thoughts about parallel execution?
6
4 Summing
Naive algorithm:
s = 0;
for (i=0; i<n; i++)
  s += x[i];

Recoding:
for (s=2; s<2*n; s*=2)
  for (i=0; i<n; i+=s)
    x[i] += x[i+s/2];
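For reference, a minimal complete C version of the recoded summation (a sketch under assumptions: it bounds-checks so that n need not be a power of two, it destroys the input, and it leaves the result in x[0]). The point is that within each outer iteration the updates are independent and can be done in parallel.
double tree_sum( double *x, int n ) {
  for (int s=2; s<2*n; s*=2)        // distance between the elements being combined
    for (int i=0; i+s/2<n; i+=s)    // skip pairs that would run off the end
      x[i] += x[i+s/2];
  return x[0];
}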
7
5 And then there is hardware
Topology of the processors:
increasing distance between them puts a limit on parallel speedup
8
Theoretical concepts
9
Efficiency and scaling
10
6 Speedup
• Single processor time T1 , on p processors Tp
• speedup is Sp = T1 /Tp , with Sp ≤ p
• efficiency is Ep = Sp /p, 0 < Ep ≤ 1
But:
• Is T1 based on the same algorithm? The parallel code?
• Sometimes superlinear speedup.
• Is T1 measurable? Can the problem be run on a single
processor?
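Illustration (assumed numbers): if T1 = 100s and T8 = 20s, then S8 = 5 and E8 = 5/8 ≈ 0.63.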
11
7 Amdahl’s law
Let’s assume that part of the application can be parallelized, part not.
(Examples?)
• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
12
8 Amdahl’s law, analysis
• Fs sequential fraction, Fp parallelizable fraction
• Fs + Fp = 1
• T1 = (Fs + Fp )T1 = Fs T1 + Fp T1
• Amdahl’s law: Tp = Fs T1 + Fp T1 /p
• p → ∞: Tp ↓ Fs T1
• Speedup is limited by Sp ≤ 1/Fs ; efficiency is a decreasing function, Ep ∼ 1/p.
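Illustration (assumed numbers): with Fs = 0.05, the speedup can never exceed 1/Fs = 20; for p = 100 one gets Sp = 1/(0.05 + 0.95/100) ≈ 16.8, so Ep ≈ 0.17.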
Do you see problems with this?
13
9 Amdahl’s law with communication overhead
• Communication overhead independent of p: Tp = T1 (Fs + Fp /p) + Tc
• assume fully parallelizable: Fp = 1
• then Sp = T1 / (T1 /p + Tc )
• For reasonable speedup: Tc ≪ T1 /p, or p ≪ T1 /Tc :
number of processors limited by ratio of scalar execution time and
communication overhead
14
10 Gustafson’s law
Reconstruct the sequential execution from the parallel, then analyze
efficiency.
15
11 Gustafson’s law
• Let Tp = Fs + Fp ≡ 1
• then T1 = Fs + p · Fp
• Speedup:
  Sp = T1 /Tp = (Fs + p · Fp )/(Fs + Fp ) = Fs + p · Fp = p − (p − 1) · Fs
• Efficiency Ep = Sp /p = 1 − Fs (p − 1)/p is a slowly decreasing function of p
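Illustration (same assumed numbers as before): with Fs = 0.05 and p = 100, Gustafson gives Sp = 100 − 99 · 0.05 = 95.05, so Ep ≈ 0.95, in contrast with the Amdahl bound of 20.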
16
12 Scaling
• Amdahl’s law: strong scaling
same problem over increasing processors
• Often more realistic: weak scaling
increase problem size with number of processors,
for instance keeping memory constant
• Weak scaling goal: Ep > c for some constant c > 0
• example (below): dense linear algebra
17
13 Strong scaling
• Let M be the total memory needed for your problem.
• Let P be the number of processors
⇒ memory per processor is M /P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)
18
14 Weak scaling
• Let M be the memory per processor.
• Let P be the number of processors
⇒ total memory is M · P
• What is limP →∞ EP ?
(Note that implicitly Ep = E (P , M ).)
19
15 Simulation scaling
• Assumption: simulated time S and running time T are kept constant, while the precision is increased
• m memory per processor, and P the number of processors;
  M = Pm total memory;
  d the number of space dimensions of the problem, typically 2 or 3;
  ∆x = 1/M^{1/d} grid spacing.
• stability:
  ∆t = ∆x = 1/M^{1/d}   hyperbolic case
  ∆t = ∆x² = 1/M^{2/d}   parabolic case
• With a simulated time S:
  k = S /∆t time steps.
20
16 Simulation scaling con’td
• Assume the time steps are parallelizable:
  T = k M /P = (S /∆t) m.
  Setting T /S = C, we find
  m = C ∆t,
  that is, memory per processor goes down:
  m = C ∆t = C · 1/M^{1/d}   hyperbolic case
  m = C ∆t = C · 1/M^{2/d}   parabolic case
• Substituting M = Pm, we find ultimately
  m = C · 1/P^{1/(d+1)}   hyperbolic case
  m = C · 1/P^{2/(d+2)}   parabolic case
21
Critical path analysis
22
17 Critical path
• The sequential fraction contains a critical path: a sequence of
operations that depend on each other.
• Example?
• T∞ = time with unlimited processors: length of critical path.
23
18 Brent’s theorem
Let m be the total number of tasks, p the number of processors, and t
the length of a critical path. Then the computation can be done in
Tp ≤ t + (m − t)/p .
• Time equals the length of the critical path . . .
• . . . plus the remaining work as parallel as possible.
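Illustration (assumed numbers): m = 1000 unit tasks with a critical path of length t = 10 on p = 100 processors gives Tp ≤ 10 + 990/100 = 19.9, while T1 = 1000 and T∞ = 10.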
24
Granularity
25
19 Definition
Definition: granularity is a measure of how much work (how many operations) can be
performed between synchronization points
26
20 Instruction level parallelism
a ← b+c
d ← e∗f
For the compiler / processor to worry about
27
21 Data parallelism
for (i=0; i<1000000; i++)
a[i] = 2*b[i];
• Array processors, vector instructions, pipelining, GPUs
• Sometimes harder to discover
• Often used mixed with other forms of parallelism
28
22 Task-level parallelism
Procedure SearchInTree(root):
  if optimal(root) then
    exit
  else
    parallel: SearchInTree(leftchild), SearchInTree(rightchild)
Unsynchronized tasks: fork-join
General scheduler:
  while there are tasks left do
    wait until a processor becomes inactive;
    spawn a new task on it
29
23 Conveniently parallel
Example: Mandelbrot set
Parameter sweep,
often best handled by external tools
30
24 Medium-grain parallelism
Mix of data parallel and task parallel
my_lower_bound = // some processor-dependent number
my_upper_bound = // some processor-dependent number
for (i=my_lower_bound; i<my_upper_bound; i++)
// the loop body goes here
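A hedged sketch of how the processor-dependent bounds might be computed for a block distribution; the names myrank and nprocs are illustrative, not from the slides.
void my_bounds( int myrank, int nprocs, int n, int *lo, int *hi ) {
  int blocksize = (n+nprocs-1)/nprocs;   // ceiling(n/nprocs)
  *lo = myrank*blocksize;
  *hi = *lo + blocksize;
  if (*hi>n) *hi = n;                    // the last processor may get fewer iterations
}
Calling my_bounds(myrank,nprocs,n,&my_lower_bound,&my_upper_bound) then supplies the bounds used in the loop above.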
31
LU factorization analysis
32
25 Algorithm
for k = 1, n − 1:
for i = k + 1 to n:
aik ← aik /akk
for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj
Can the k loop be done in parallel? The i , j loops?
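For concreteness, a minimal sequential C sketch of this algorithm (zero-based indexing, no pivoting; the array type is an assumption for illustration):
void lu_factor( int n, double a[n][n] ) {
  for (int k=0; k<n-1; k++) {
    for (int i=k+1; i<n; i++)
      a[i][k] = a[i][k] / a[k][k];          // multipliers: column k of L
    for (int i=k+1; i<n; i++)               // rank-1 update of the trailing block
      for (int j=k+1; j<n; j++)
        a[i][j] = a[i][j] - a[i][k]*a[k][j];
  }
}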
33
26 Dependent operations
a22 ← a22 − a21 · a11^{-1} · a12
· · ·
a33 ← a33 − a32 · a22^{-1} · a23
The second update needs the updated a22 , so these updates have to be done in sequence.
34
Exercise 1: Critical path
Follow this argument through. Argue that there is a non-trivial critical
path in the sense of the critical path section above. What is its length?
In the analysis of the critical path section, what does this critical path
imply for the minimum parallel execution time and bounds on
speedup?
35
27 Subblock update
for i = k + 1 to n:
for j = k + 1 to n:
aij ← aij − aik ∗ akj
How many processors can you use maximally in step k ?
36
Exercise 2: Parallel execution
Continue this reasoning. With p = n² processing elements each of the
(i , j ) updates in the subblock can be done simultaneously. To be
precise, how long does an arbitrary k iteration take? Summing over
all k, what are the resulting Tp , Sp , Ep ? How does this relate to the
bounds you derived above?
Also, with p = n processing elements you could let each row or
column of the subblock update be done in parallel. What is now the
time for the k-th outer iteration? What are the resulting Tp , Sp , Ep ?
37
28 Application scaling
Single processor.
Relating time and memory to problem size
T = (1/3) N³ /f ,   M = N².
where f is processor frequency.
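Illustration (assumed numbers): N = 10⁴ gives M = 10⁸ matrix elements (0.8 GB at 8 bytes per element) and roughly 3.3 · 10¹¹ operations divided by f.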
38
Exercise 3: Memory scaling, case 1: Faster
processor
Suppose you buy a processor twice as fast, and you want to do a
benchmark run that again takes time T . How much memory do you
need?
39
29 More processors
Keep frequency constant, but vary number of processors p:
T = (1/3) N³ /p ,   M = N².
Each processor now stores Mp = N 2 /p elements.
40
Exercise 4: Memory scaling, case 2: More
processors
Suppose you have a cluster with p processors, each with Mp memory,
can run a Gaussian elimination of an N × N matrix in time T :
T = (1/3) N³ /p ,   Mp = N² /p .
Now you extend the cluster to 2p processors of the same clock
speed, and you want to do a benchmark run, again taking time T . How
much memory does each node need?
Hint: for the extended cluster:
T ′ = (1/3) N ′³ /p ′ ,   Mp′ = N ′² /p ′ .
The question becomes to compute Mp′ under the given conditions.
41
The SIMD/MIMD/SPMD/SIMT model for parallelism
42
30 Flynn Taxonomy
Consider instruction stream and data stream:
• SISD: single instruction single data
used to be single processor, now single core
• MISD: multiple instruction single data
redundant computing for fault tolerance?
• SIMD: single instruction multiple data
data parallelism, pipelining, array processing, vector instructions
• MIMD: multiple instruction multiple data
independent processors, clusters, MPPs
43
31 SIMD
• Relies on streams of identical operations
• See pipelining
• Recurrences are hard to accommodate
44
32 SIMD: array processors
Technology going back to the 1980s: FPS, MasPar, Connection Machine, Goodyear.
Major advantage: simplification of the processor design.
45
33 SIMD as vector instructions
• Register width multiple of 8 bytes:
• simultaneous processing of more than one operand pair
• SSE: 2 operands,
• AVX: 4 or 8 operands
46
34 Controlling vector instructions
void func(float *restrict c, float *restrict a,
float *restrict b, int n)
{
#pragma vector always
for (int i=0; i<n; i++)
c[i] = a[i] * b[i];
}
This needs aligned data (posix_memalign)
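A hedged note: #pragma vector always is an Intel compiler hint, and the aligned allocation it benefits from could look as follows (a sketch, assuming 32-byte alignment for AVX-width loads):
#include <stdlib.h>
float *alloc_aligned_floats( size_t n ) {
  void *p = NULL;                                   // posix_memalign needs a void**
  if ( posix_memalign( &p, 32, n*sizeof(float) )!=0 )
    return NULL;                                    // allocation failed
  return (float*)p;
}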
47
35 New branches in the taxonomy
• SPMD: single program multiple data
the way clusters are actually used
• SIMT: single instruction multiple threads
the GPU model
48
36 MIMD becomes SPMD
• MIMD: independent processors, independent instruction streams,
independent data
• In practice very little true independence: usually the same
executable
Single Program Multiple Data
• Exceptional example: climate codes
• Old-style SPMD: cluster of single-processor nodes
• New-style: cluster of multicore nodes, ignore shared caches /
memory
• (We’ll get to hybrid computing in a minute)
49
37 GPUs and data parallelism
Lockstep in thread block,
single instruction model between streaming processors
(more about GPU threads later)
50
Characterization of parallelism by memory model
51
38 Major types of memory organization, classic
52
39 Major types of memory organization,
contemporary
53
40 Symmetric multi-processing
• The ideal case of shared memory:
every address equally accessible
• This hasn’t existed in a while
(Tim Mattson claims the Cray-2 was the last one)
• Danger sign: shared memory programming pretends that memory access is symmetric;
in fact it hides the real hardware behavior from you
54
41 SMP, bus design
• Bus: all processors on the same wires to memory
• Not very scalable: requires slow processors or cache memory
• Cache coherence easy by ‘snooping’
55
42 Non-uniform Memory Access
Memory is equally programmable, but not equally accessible
• Different caches, different affinity
• Distributed shared memory: network latency
• ScaleMP and other products claim otherwise (watch me not believe it)
56
43 Picture of NUMA
57
Interconnects and topologies, theoretical
concepts
58
44 Topology concepts
• Hardware characteristics
• Software requirement
• Design: how ‘close’ are processors?
59
45 Graph theory
• Degree: number of direct connections from one processor to others
• Diameter: maximum over all processor pairs of the shortest distance between them (measured in hops)
60
46 Bandwidth
• Bandwidth per wire is nice, adding over all wires is nice, but. . .
• Bisection width: minimum number of wires that must be cut to split the machine into two equal halves
• Bisection bandwidth: total bandwidth through such a bisection
61
47 Design 1: bus
Already discussed; simple design, does not scale very far
62
48 Design 2: linear arrays
• Degree 2, diameter P − 1, bisection width 1
• Scales nicely!
• but low bisection width
63
Exercise 5: Broadcast algorithm
Flip last bit, flip one before, . . .
64
49 Design 3: 2/3-D arrays
• Degree 2d, diameter ∼ P^{1/d}
• Natural design: nature is three-dimensional
• More dimensions: less contention;
the K computer’s network is 6-dimensional
65
50 Design 3: Hypercubes
66
51 Hypercube numbering
Naive numbering:
67
52 Gray codes
Embedding linear numbering in hypercube:
68
53 Binary reflected Gray code
Construction (each code word is a column of bits; the bar marks the reflection):
1D Gray code:                  0 1
1D code and reflection:        0 1 | 1 0
2D Gray code,
  append 0 and 1 bit:          0 1 | 1 0
                               0 0 | 1 1
2D code and reflection:        0 1 1 0 | 0 1 1 0
                               0 0 1 1 | 1 1 0 0
3D Gray code,
  append 0 and 1 bit:          0 1 1 0 | 0 1 1 0
                               0 0 1 1 | 1 1 0 0
                               0 0 0 0 | 1 1 1 1
69
54 Switching networks
• Solution to all-to-all connection
• (Real all-to-all too expensive)
• Typically layered
• Switching elements: easy to extend
70
55 Cross bar
Advantage: non-blocking
Disadvantage: cost
71
56 Butterfly exchange
Connects processors to a segmented pool of memory, or processors with
private memory to each other:
72
57 Building up butterflies
73
58 Uniform memory access
Contention possible
74
59 Route calculation
75
60 Fat Tree
76
61 Fat trees from switching elements
(Clos network)
77
62 Fat tree clusters
78
Exercise 6: Switch contention
Suppose the number of processors p is larger than the number of
wires w.
Write a simulation that investigates the probability of contention if you
send m ≤ w messages to distinct processors.
Can you do a statistical analysis, starting with a simple case?
79
63 Mesh clusters
80
64 Levels of locality
• Core level: private cache, shared cache
• Node level: NUMA
• Network: levels in the switch
81
Programming models
82
65 Shared vs distributed memory
programming
Different memory models lead to different questions:
• Shared memory: synchronization problems such as critical
sections
• Distributed memory: data motion
83
Thread parallelism
84
66 What is a thread
• Process: code, heap, stack
• Thread: same code but private program counter, stack, local
variables
• dynamically (even recursively) created: fork-join
Incremental parallelization!
85
67 Thread context
• Private data (stack, local variables) is called ‘thread context’
• Context switch: switch from one thread execution to another
• Context switches are expensive; alternative: hyperthreading
• Intel Xeon Phi: hardware support for 4 threads per core
• GPUs: fast context switching between many threads
86
68 Thread programming 1
Pthreads
pthread_t threads[NTHREADS];
printf("forking\n");
for (i=0; i<NTHREADS; i++)
if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
return i+1;
printf("joining\n");
for (i=0; i<NTHREADS; i++)
if (pthread_join(threads[i],NULL)!=0)
return NTHREADS+i+1;
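The fragment above omits the surrounding declarations. A minimal complete sketch could look as follows; the shared counter and the body of adder are assumptions for illustration, and they set up the race condition discussed on the next slide.
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
static int sum = 0;                     /* shared and unprotected: racy on purpose */

void *adder( void *arg ) {
  sum = sum + 1;                        /* read-modify-write, not atomic */
  return NULL;
}

int main(void) {
  pthread_t threads[NTHREADS];
  int i;
  printf("forking\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_create(threads+i,NULL,&adder,NULL)!=0)
      return i+1;
  printf("joining\n");
  for (i=0; i<NTHREADS; i++)
    if (pthread_join(threads[i],NULL)!=0)
      return NTHREADS+i+1;
  printf("sum = %d\n",sum);
  return 0;
}
(Link with the Pthreads library, for instance -lpthread.)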
87
69 Race conditions
Init: I=0
process 1: I=I+2
process 2: I=I+3
Scenario 1: both processes read I = 0 and compute 2 and 3 respectively;
  process 1 writes I = 2, then process 2 writes I = 3; result: I = 3.
Scenario 2: both processes read I = 0 and compute 2 and 3 respectively;
  process 2 writes I = 3, then process 1 writes I = 2; result: I = 2.
Scenario 3: process 1 reads I = 0, computes 2, writes I = 2;
  then process 2 reads I = 2, computes 5, writes I = 5; result: I = 5.
88
70 Dealing with atomic operations
Semaphores, locks, mutexes, critical sections, transactional memory
Software / hardware
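As a hedged sketch, one of the mechanisms above, a Pthreads mutex, applied to the racy update of the previous slides:
#include <pthread.h>

static int sum = 0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

void *adder( void *arg ) {
  pthread_mutex_lock(&sum_lock);     /* only one thread at a time past this point */
  sum = sum + 1;
  pthread_mutex_unlock(&sum_lock);
  return NULL;
}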
89
71 Cilk
Sequential code:
int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += fib(n-1);
    rst += fib(n-2);
    return rst;
  }
}

Cilk code:
cilk int fib(int n){
  if (n<2) return 1;
  else {
    int rst=0;
    rst += spawn fib(n-1);
    rst += spawn fib(n-2);
    sync;
    return rst;
  }
}
Sequential consistency: program output identical to sequential
90
72 OpenMP
• Directive based
• Parallel sections, parallel loops, tasks
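A minimal sketch of the directive-based style, applied to the earlier summation example; the reduction clause gives every thread a private partial sum and combines them at the end.
double omp_sum( const double *x, int n ) {
  double s = 0.;
#pragma omp parallel for reduction(+:s)
  for (int i=0; i<n; i++)
    s += x[i];
  return s;
}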
91
Distributed memory parallelism
92
73 Global vs local view
yi ← yi + xi−1   if i > 0
yi unchanged     if i = 0
• If I am processor 0 do nothing; otherwise receive an x element
from the left, add it to my y element.
• If I am the last processor do nothing; otherwise send my x
element to the right.
(Let’s think this through. . . )
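A hedged MPI sketch of this local view with blocking calls; rank and nprocs are assumed to come from the usual MPI_Comm_rank / MPI_Comm_size setup.
#include <mpi.h>

/* Sketch: y on each processor gets the x value of its left neighbour added. */
void shift_add( double *x, double *y, int rank, int nprocs, MPI_Comm comm ) {
  double x_from_left;
  if (rank > 0) {                       /* everyone but processor 0 receives */
    MPI_Recv(&x_from_left, 1, MPI_DOUBLE, rank-1, 0, comm, MPI_STATUS_IGNORE);
    *y += x_from_left;
  }
  if (rank < nprocs-1)                  /* everyone but the last processor sends */
    MPI_Send(x, 1, MPI_DOUBLE, rank+1, 0, comm);
}
Note that with blocking calls in this order the operation ripples through the processors one by one; that is exactly the kind of behavior the next slides examine.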
93
74 Global picture
94
75 Careful coding
95
76 Better approaches
• Non-blocking send/receive
• One-sided
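A sketch of the first option, non-blocking send/receive, under the same assumptions as before; both transfers are posted first and completed together, so neither processor waits for the other to reach a matching call.
#include <mpi.h>

void shift_add_nonblocking( double *x, double *y, int rank, int nprocs, MPI_Comm comm ) {
  double x_from_left = 0.;
  MPI_Request reqs[2]; int nreqs = 0;
  if (rank > 0)                         /* post the receive from the left neighbour */
    MPI_Irecv(&x_from_left, 1, MPI_DOUBLE, rank-1, 0, comm, &reqs[nreqs++]);
  if (rank < nprocs-1)                  /* post the send to the right neighbour */
    MPI_Isend(x, 1, MPI_DOUBLE, rank+1, 0, comm, &reqs[nreqs++]);
  MPI_Waitall(nreqs, reqs, MPI_STATUSES_IGNORE);
  if (rank > 0) *y += x_from_left;      /* data is only valid after the wait */
}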
96
Hybrid/heterogeneous parallelism
97
77 Hybrid computing
• Use MPI between nodes, OpenMP inside nodes
• alternative: ignore the shared memory and use MPI throughout
• you save: buffers and copying
• bundling communication, load spread
98
78 Using threads for load balancing
Dynamic scheduling gives load balancing
Hybrid is a possible improvement over strict MPI
99
79 Amdahl’s law for hybrid programming
• p nodes with c cores each
• Fp core-parallel fraction, assume full MPI parallel
• ideal speedup pc, running time T1 /(pc ), actually:
Tp,c = Fs T1 /p + Fp T1 /(pc) = (T1 /(pc)) (Fs c + Fp ) = (T1 /(pc)) (1 + Fs (c − 1)) .
• T1 /Tp,c ≈ p/Fs
• Original Amdahl: Sp < 1/Fs , hybrid programming Sp < p/Fs
100
Design patterns
101
80 Array of Structures
struct _Node { int number; double xcoord,ycoord; };
struct _Vector { double xtrans,ytrans; };
typedef struct _Node* Node;
typedef struct _Vector* Vector;

struct _Node *nodes = (struct _Node*) malloc( n_nodes*sizeof(struct _Node) );
102
81 Operations
Operate
void shift(Node the_point,Vector by) {
  the_point->xcoord += by->xtrans;
  the_point->ycoord += by->ytrans;
}
in a loop
for (i=0; i<n_nodes; i++) {
  shift(nodes+i,shift_vector);
}
103
82 Along come the 80s
Vector operations
node_numbers = (int*) malloc( n_nodes*sizeof(int) );
node_xcoords = // et cetera
node_ycoords = // et cetera
and you would iterate
for (i=0; i<n_nodes; i++) {
  node_xcoords[i] += shift_vector->xtrans;
  node_ycoords[i] += shift_vector->ytrans;
}
104
83 and the wheel of reinvention turns further
The original array-of-structures design was better for MPI in the 1990s,
until vector instructions (and GPUs) in the 2000s favored the structure-of-arrays layout again
105
84 Latency hiding
• Memory and network are slow: avoid having to wait for them
• Hardware magic: out-of-order execution, caches, prefetching
106
85 Explicit latency hiding
Matrix-vector product:
∀i ∈ Ip : yi = ∑j aij xj .
x needs to be gathered:
∀i ∈ Ip : yi = ( ∑j local + ∑j not local ) aij xj .
Overlap loads and local operations
Possible in MPI and Xeon Phi offloading,
very hard to do with caches
107
What’s left
108
86 Parallel languages
• Co-array Fortran: extensions to the Fortran standard
• X10
• Chapel
• UPC
• BSP
• MapReduce
• Pregel, . . .
109
87 UPC example
#define N 100*THREADS
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
int i;
upc_forall(i=0; i<N; i++; i)
v1plusv2[i]=v1[i]+v2[i];
}
110
88 Co-array Fortran example
Explicit dimension for ‘images’:
Real,dimension(100),codimension[*] :: X
Real :: X(100)[*]
Real :: X(100,200)[10,0:9,*]
the number of images (the ‘*’ extent) is determined by the runtime environment
111
89 Grab bag of other approaches
• OS-based: data movement induced by cache misses
• Active messages: application level Remote Procedure Call
(see: Charm++)
112
Load balancing, locality, space-filling curves
113
90 The load balancing problem
• Application load can change dynamically
e.g., mesh refinement, time-dependent problems
• Splitting off and merging loads
• No real software support: write application anticipating load
management
• Initial balancing: graph partitioners
114
91 Load balancing and performance
• Assignment to arbitrary processor violates locality
• Need a dynamic load assignment scheme that preserves locality
under load migration
• Fairly easy for regular problems, for irregular?
115
Space-filling curves
116
92 Adaptive refinement and load assignment
117
93 Assignment through Space-Filling Curve
118
Domain partitioning by Fiedler vectors
119
94 Inspiration from physics
120
95 Graph laplacian
• Set Gij = −1 if there is an edge (i , j )
• Set Gii positive to give zero row sums
• The first eigenvalue is zero, with a positive eigenvector
• The second eigenvector has positive and negative components, dividing the graph in two
• The n-th eigenvector divides the graph in n parts
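A small worked illustration (assumed example, not from the slides): for a path graph with four nodes 1–2–3–4 the graph Laplacian is
G = [  1 −1  0  0 ]
    [ −1  2 −1  0 ]
    [  0 −1  2 −1 ]
    [  0  0 −1  1 ]
Its smallest eigenvalue is 0 with eigenvector (1, 1, 1, 1); the second eigenvector (the Fiedler vector) is approximately (0.92, 0.38, −0.38, −0.92), and its sign pattern splits the path into the two halves {1, 2} and {3, 4}.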
121
96 Fiedler in a picture
122