Parallel Computing
Parallel performance
Thorsten Grahs, 01.06.2015
Overview
Performance
Speedup & efficiency
Amdahl's Law
Gustafson's Law
Efficiency and scalability metrics
Timing models
Examples: Inner product
Parallel Performance
Primary purpose of parallelization
The primary purpose of parallelizing a system or a program is performance.
So what is performance?
Usually it is about one of the following
Reducing the total time it takes to compute a single result
Increasing the rate at which a series of results can be
computed
Reducing the power consumption of a computation
Speed up by parallelization
Question
Could one build a TeraFlop computer by using 1000
GigaFlop machines?
Yes, . . . but
Technical restrictions
Influenced by CPU speed, memory, network
Not all program parts can be parallelized
Speed-Up?
Efficiency?
Parallel run time
Parallel execution time T_p(n)
Time between the start of the parallel program and the end of all involved parallel processes.
p: # processors
n: model size
The run time of a parallel program (distributed system) consists of
Local computation
Computation of one processor on its local data
Data exchange
Necessary communication for the data exchange
Waiting time
e.g. due to unbalanced loads on the processors
Synchronisation
Coordination between the involved processes
Cost of a parallel program
Costs C_p(n)
The costs (work, processor-time product) of a parallel program are defined as
C_p(n) = p · T_p(n)
The costs C_p(n)
take into account the time that all involved processors need to fulfil their tasks,
are a measure of the work performed by all processors,
are optimal when the parallel program performs as many operations as the fastest sequential algorithm with run time T(n), i.e.
C_p(n) = T(n).
Speed-up
Speed-up S_p(n)
The speed-up of a parallel program with run time T_p(n) is defined as
S_p(n) = T(n) / T_p(n)
T(n): run time of the optimal serial implementation.
A measure for
the comparison between serial and parallel implementation,
the relative gain in speed.
Ideal case (linear speed-up): T = p · T_p, i.e. S_p = p
Efficiency
Efficiency E_p(n)
The efficiency of a parallel program is defined as
E_p(n) = S_p(n) / p = T(n) / (p · T_p(n)) = T(n) / C_p(n)
A measure for the proportion of the run time that a processor spends on computation which is also present in the serial program.
Low efficiency = large parallel overhead
Linear speed-up S_p = p corresponds to E_p = 1
E_100(n) = 0.4 (i.e. S_100(n) = 40 on 100 processors) means that each of the 100 processors spends 60% of its time on parallel overhead such as communication
Amdahl's objection
Gene Amdahl, 1967
"... the effort expended on achieving high parallel processing rates is wasted unless it is accompanied by achievements in sequential processing rates of nearly the same magnitude."
What does this mean?
Inclusion of the sequential proportions
In general, a parallel algorithm always has inherent sequential components. This has implications for the achievable speed-up.
We assume
f: relative proportion of the serial tasks
1 - f: relative proportion of ideally parallelizable tasks
Amdahl's Law
Gene Amdahl (born 1922)
Computer pioneer & businessman
Famous for his work on large-capacity computers at IBM
Founder of the Amdahl Corp.
Assume
T(n) = T_ser + T_par          sequential execution
T_p(n) = T_ser + T_par / p    parallel execution
Amdahl's Law
S_p(n) = (T_ser + T_par) / (T_ser + T_par / p)
Amdahl's Law | Example
Assume
a program needs 10 hours on a single processor,
one hour is sequential, the remaining 9 hours can be parallelized.
Consequences
The program cannot run in less than one hour.
The speed-up is at most 10.
Speed-up is limited by the sequential part.
Note: the problem size is fixed!
Amdahl's Law: relative formulation
If an implementation has a
relative proportion f (0 ≤ f ≤ 1) of inherently serial tasks,
relative proportion 1 - f of ideally parallelizable tasks,
the overall run time of the program consists of the
run time for the serial part: f · T(n)
run time for the parallel part: (1 - f)/p · T(n)
Amdahl's Law (relative)
S_p(n) = T(n) / (f · T(n) + (1 - f)/p · T(n)) = 1 / (f + (1 - f)/p) ≤ 1/f
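As a small numerical illustration (a minimal sketch, not part of the original slides), the relative formulation can be evaluated directly; the serial fraction f = 0.1 corresponds to the 10-hour example above:

#include <stdio.h>

/* Amdahl's Law in the relative formulation: S_p = 1 / (f + (1 - f)/p) */
static double amdahl_speedup(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / (double)p);
}

int main(void)
{
    const double f = 0.1;                   /* serial fraction: 1 of 10 hours */
    const int procs[] = {1, 10, 100, 1000};

    for (int i = 0; i < 4; i++)
        printf("p = %4d  S_p = %6.2f\n", procs[i], amdahl_speedup(f, procs[i]));
    /* S_p approaches the limit 1/f = 10 for large p */
    return 0;
}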
Amdahl's Law: Example (figure)
Amdahl's Law: Speed-up (figure)
Amdahl's Law: Consequence
The serial part is excluded from massively parallel speed-up.
For a sequential proportion of 1%
the maximal achievable speed-up is 100,
regardless of how many processors are used.
For a sequential proportion of 20%
the maximal achievable speed-up is 5,
regardless of how many processors are used.
Conclusion
Massively parallel systems were for a long time considered inefficient and only interesting from a theoretical point of view.
Further sources of limitations
Further delays due to
Load balancing
i.e. unequal work packages (loads)
Communication
e.g. waiting for data/processes
Parallelization-induced communication
Consider the communication time T_pc(n) due to the necessary data exchange between the participating processors:
T_p(n) = T(n) · [f + (1 - f)/p] + T_pc(n)
T_pc(n) is monotonically increasing with # p.
Communication Master-Worker
Assumptions
Load balancing is excellent.
Communication takes place between master and workers.
T_pc(n) is a linear function of p.
A processor does not send messages to itself.
T_2c(n): minimal communication time (master & one worker), i.e.
T_pc(n) = T_2c(n) · (p - 1)
With
r = T_c(2) / T(1)    (min. comm. time / serial comp. time)
one obtains
S(p) = 1 / (f - r + (1 - f)/p + r·p)
Comm. Master-Worker: Maximal speed-up
New situation
For a large number of processors p the speed-up can decline.
Maximal speed-up
S(p*) = 1 / (f - r + 2·sqrt((1 - f)·r))    with    p* = sqrt((1 - f)/r)
Pay-off area
Only for p ≪ p* does S(p) ≈ p hold. This means that parallel performance can be assured only for processor numbers much smaller than p*.
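A minimal numerical sketch of this model (the values of f and r below are illustrative assumptions, not from the slides); it evaluates S(p) including the communication term and the optimum p*:

#include <stdio.h>
#include <math.h>   /* sqrt; compile e.g. with: cc file.c -lm */

/* Master-worker speed-up with a communication term linear in p */
static double mw_speedup(double f, double r, int p)
{
    return 1.0 / (f - r + (1.0 - f) / (double)p + r * (double)p);
}

int main(void)
{
    const double f = 0.05;   /* assumed serial fraction                            */
    const double r = 0.001;  /* assumed ratio: min. comm. time / serial comp. time */

    for (int p = 1; p <= 2048; p *= 2)
        printf("p = %4d  S = %7.2f\n", p, mw_speedup(f, r, p));
    printf("p* = %.1f  S(p*) = %.2f\n",
           sqrt((1.0 - f) / r),
           1.0 / (f - r + 2.0 * sqrt((1.0 - f) * r)));
    return 0;
}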
Scaling
An example
One decorator wallpapers a room in 60 minutes.
Two decorators wallpaper a room in 30 minutes.
How long does it take when 60 decorators work in the room?
Remedy
Use the 60 decorators for a hotel with 60 rooms ...
Problem scaling
With increasing # processors the problem size should be
increased, in order to guarantee parallel
performance/efficiency.
Scalability
Saturation of speed-up
According to Amdahl's Law, for a fixed problem size n and an increasing number of processors p there is a saturation of the speed-up.
60 decorators in one room ...
In scientific computing one is quite often interested in solving a bigger problem in the same time.
Growing problem size n combined with an increasing # processors p.
Instead of decorating one room, 60 decorators can renovate the whole hotel (60 rooms).
This behaviour (speed-up for increasing n) is not captured by Amdahl's Law!
Scaled speed-up
Assumption
The sequential proportion of a parallel program decreases with increasing model size, i.e. it is not a constant fraction of the total computation as assumed in Amdahl's Law.
For each number of processors p the maximum speed-up S_p(n) ≤ p can be achieved,
namely by a correspondingly large model size.
The behaviour of the run time T for larger problem sizes and a correspondingly increased number of processors is described by Gustafson's Law.
Gustafson's Law
John Gustafson (born 1955)
"... speed-up should be measured by scaling the problem to the number of processors, not by fixing the problem size."
J. L. Gustafson: Reevaluating Amdahl's Law, Comm. of the ACM, 31:532-533, 1988
This implies:
inclusion of the model size n,
sequential proportion f constant.
Serial run time
T_s(n) = f + n·(1 - f)
Parallel run time
T_p(n) = f + n·(1 - f)/p
Gustafson's Law: scaled speed-up
Gustafson's Law (scaled speed-up)
If the problem size increases while the serial portion grows slowly or remains fixed, speed-up grows as workers are added:
S_p(n) = T_s(n) / T_p(n)
       = (f + n·(1 - f)) / (f + n·(1 - f)/p)
       = p·(f + n·(1 - f)) / (f·p + n·(1 - f))
lim_{n→∞} S_p(n) = p
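A short sketch (with an assumed serial portion f, not from the slides) that evaluates the scaled speed-up for n = p, i.e. when the problem grows with the number of processors:

#include <stdio.h>

/* Gustafson's scaled speed-up: S_p(n) = (f + n(1-f)) / (f + n(1-f)/p) */
static double gustafson_speedup(double f, double n, int p)
{
    return (f + n * (1.0 - f)) / (f + n * (1.0 - f) / (double)p);
}

int main(void)
{
    const double f = 0.1;    /* assumed serial portion */

    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  n = p  S_p = %7.2f\n", p, gustafson_speedup(f, (double)p, p));
    /* with n = p the speed-up keeps growing roughly like (1-f)·p;
       there is no saturation as in Amdahl's Law with fixed n */
    return 0;
}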
Gustafson's Law: constant run time
Constant run time
To keep the run time constant while the problem size increases, one has to choose n = p. This means: for double the model size (e.g. # grid points) one has to double the # processors.
Karp-Flatt metric
Both Amdahl's and Gustafson's Laws ignore the parallel overhead of the parallelization.
This may lead to an overestimated speedup.
Karp & Flatt therefore introduce an
experimentally determined serial fraction
Karp-Flatt metric
f = (1/S - 1/p) / (1 - 1/p)
S: measured speedup
p: # processors
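The metric is straightforward to compute from measurements; a minimal sketch in C (the sample measurement S = 4.7 on p = 8 is taken from the examples on the following slides):

#include <stdio.h>

/* Experimentally determined serial fraction (Karp-Flatt metric):
   f = (1/S - 1/p) / (1 - 1/p) for measured speedup S on p processors. */
static double karp_flatt(double S, int p)
{
    return (1.0 / S - 1.0 / (double)p) / (1.0 - 1.0 / (double)p);
}

int main(void)
{
    printf("S = 4.7 on p = 8 processors: f = %.3f\n", karp_flatt(4.7, 8));
    return 0;
}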
Karp-Flatt metric
Advantage
Takes into account parallel overhead
Detects other sources of overhead or inefficiency
ignored in the speedup model:
Process start-up time
Process synchronization time
Imbalanced workload
Architectural overhead
Karp-Flatt metric Example 1
Measured speedup
p   2    3    4    5    6    7    8
S   1.8  2.5  3.1  3.6  4.0  4.4  4.7
What is the primary reason for a speedup of only 4.7 on 8 CPUs?
Karp-Flatt metric
f   0.1  0.1  0.1  0.1  0.1  0.1  0.1
Since f is constant (i.e. not increasing with growing # p),
the limiting constraint is limited opportunity for parallelism
(a large fraction that is inherently sequential).
Karp-Flatt metric Example 2
Measured speedup
p   2    3    4    5    6    7    8
S   1.9  2.6  3.2  3.7  4.1  4.5  4.7
What is the primary reason for a speedup of only 4.7 on 8 processors?
Karp-Flatt metric
f   0.070  0.075  0.080  0.085  0.090  0.095  0.100
Since f is steadily increasing, parallel overhead is the primary reason (e.g. time spent in process start-up, communication, synchronization, or an architectural constraint).
Scalability again
Scalability
The scalability of a parallel system
(i.e. parallel program executing on a parallel computer)
is a measure of its ability to increase performance as the
number of processors increases.
Speedup (and hence efficiency) is typically an increasing
function of the problem size.
This is called the Amdahl effect.
In order to maintain efficiency when processors are added,
we can increase the problem size.
This idea is formulated by the
isoefficiency relation.
Efficiency relation
T_O: the total overhead, i.e. the total amount of time spent by all processes doing work not done by the sequential algorithm.
T_S: the sequential execution time. We have
p·T_P = T_S + T_O   ⇒   T_P = (T_S + T_O) / p
Speed-up
S = T_S / T_P = p·T_S / (T_S + T_O)
Efficiency
E = S / p = T_S / (T_S + T_O) = 1 / (1 + T_O/T_S)
Isoefficiency relation
E(n, p) ≤ 1 / (1 + T_O(n, p)/T_S(n))
⇒  T_O(n, p)/T_S(n) ≤ (1 - E(n, p)) / E(n, p)
⇒  T_S(n) ≥ E(n, p) / (1 - E(n, p)) · T_O(n, p)
If we wish to maintain a constant level of efficiency,
(1 - E(n, p)) / E(n, p) has to be a constant.
Consequently, the isoefficiency relation simplifies to
T_S(n) ≥ C·T_O(n, p)
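A brief worked example (the functional forms are assumptions for illustration, not from the slides): suppose the sequential time is T_S(n) = c1·n and the total overhead of a reduction-style algorithm is T_O(n, p) = c2·p·log2 p. The isoefficiency relation T_S(n) ≥ C·T_O(n, p) then demands
c1·n ≥ C·c2·p·log2 p,
i.e. the problem size n must grow at least like p·log2 p to keep the efficiency constant.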
Scalability function
Suppose a parallel system has the isoefficiency relation
n ≥ f(p)
M(n) is the memory required to store a problem of size n.
From this we can estimate the required amount of memory, i.e.
M(n) ≥ M(f(p)).
Thus the function
M(f(p)) / p
shows how the memory usage per processor must increase to maintain the same efficiency.
This function is called the scalability function.
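Continuing the worked example from the isoefficiency slide (again an assumption for illustration): with f(p) = C'·p·log2 p and a memory requirement M(n) = c·n, the scalability function becomes
M(f(p)) / p = c·C'·log2 p,
i.e. the memory per processor has to grow logarithmically with p; the system is scalable, but not perfectly scalable.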
Meaning of scalability function
To maintain the same efficiency when increasing p, we must increase n.
The maximum problem size is limited by the maximum available amount of memory, which is linear in p.
The scalability function shows how the memory usage per processor must grow to maintain efficiency.
If the scalability function is constant, the parallel system is perfectly scalable.
Performance models
Analysing performance
Amdahl's & Gustafson's Laws make simple statements about the behaviour of a model with changing
# processors p
model/problem size n
There is a need to analyse the behaviour of parallel algorithms in more detail.
Performance models
Abstract machine models for the design and performance analysis of parallel algorithms.
They allow detailed statements about the run-time behaviour (time, memory, efficiency, ...).
Timing model
Timing model (Van de Velde)
Simple performance model for run time analysis
Assumptions
All tasks run independently from each other and start simultaneously.
All processors are identical.
All arithmetic operations take the same time t_a.
Messages are exchanged in data words of unit length (16 bit).
Communication and calculation do not overlap.
Communication tasks do not interfere with each other.
No global, only point-to-point communication.
Timing model: Modelling data exchange
Data exchange
Linear model for the message exchange (send & receive) of a message M of length l (i.e. consisting of l data words):
t_k(l) = α + β·l
α: start-up time (latency). Here: the time needed to send a message, i.e.
pack the message,
send the message,
& unpack it on the recipient's side.
β: reciprocal bandwidth of the bus/network
M: message of length l to be sent
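A small numerical sketch of this model (the values for t_a, α and β are illustrative assumptions only):

#include <stdio.h>

/* Linear communication model: t_k(l) = alpha + beta * l */
static double t_comm(double alpha, double beta, int l)
{
    return alpha + beta * (double)l;
}

int main(void)
{
    const double ta    = 1.0e-9;   /* assumed time per arithmetic operation [s] */
    const double alpha = 1.0e-6;   /* assumed start-up time (latency) [s]       */
    const double beta  = 1.0e-8;   /* assumed time per data word [s]            */

    for (int l = 1; l <= 4096; l *= 8)
        printf("l = %5d words: t_k = %.3e s (= %.0f arithmetic ops)\n",
               l, t_comm(alpha, beta, l), t_comm(alpha, beta, l) / ta);
    return 0;
}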
Timing model: Modelling data exchange
In general α, β ≫ t_a
t_a: arithmetic time,
i.e. the time for processing one standard arithmetic operation
Properties
The communication time depends linearly on the message length l.
The communication time depends on the system/hardware (network).
The slope of the line is the reciprocal bandwidth β.
Timing model: Example (figure: time response depending on the message length)
Example: Inner product
for(s=0.,i=0;i<N;i++) s += x[i] * y[i];
N: vector length (problem size)
p: # processors
n = N/p: uniform distribution of the vectors over the processors
Constraint
s should be known on all processors.
Serial work
T_1 = 2·N·t_a
To consider for parallelization
Distribution of data
Distribution of work
Algorithm
Example: Inner product Version I
Version I: local inner products

s=0.;
for(i=0;i<n;i++) s += x[i] * y[i];
for(i=0;i<p;i++) if(i != myself) send(s to proc i);
for(i=0;i<p;i++) if(i != myself) s += receive(from proc i);

2·n·t_a          accumulation of the local sum
(p - 1)·t_k(1)   send and receive of the (p - 1) subtotals
(p - 1)·t_a      computation of the total sum

T_p = 2·n·t_a + (p - 1)·t_k(1) + (p - 1)·t_a
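For comparison, a possible MPI realisation of Version I could look as follows (a sketch under the assumption that MPI is already initialised and the local vector parts x and y of length n are already distributed; the function name is made up for illustration):

#include <mpi.h>

/* Version I: every process computes its local inner product and exchanges
   the partial sum with every other process ((p-1) sends and receives). */
double inner_product_v1(const double *x, const double *y, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double s = 0.0;
    for (int i = 0; i < n; i++)          /* local inner product: 2·n·t_a */
        s += x[i] * y[i];

    /* send the own partial sum to all other processes ... */
    for (int i = 0; i < size; i++)
        if (i != rank)
            MPI_Send(&s, 1, MPI_DOUBLE, i, 0, comm);

    /* ... and accumulate the (p-1) partial sums of the others.
       Note: this relies on the tiny single-double messages being buffered
       ("eager" protocol); a robust code would use non-blocking calls or
       simply MPI_Allreduce. */
    for (int i = 0; i < size; i++) {
        if (i != rank) {
            double t;
            MPI_Recv(&t, 1, MPI_DOUBLE, i, 0, comm, MPI_STATUS_IGNORE);
            s += t;
        }
    }
    return s;
}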
Example: Inner product Version I
Speed-up
S(p) = T_1 / T_p
     = 2N·t_a / [(2n + p - 1)·t_a + (p - 1)·t_k(1)]
     = 2N / [(2n + p - 1) + (p - 1)·t_r]
with t_r = t_k(1) / t_a (smaller is better).
Speed-up Version I
S(p) = 2N / [(2n + p - 1) + (p - 1)·t_r]
Example: Inner product Version II
Version II: recursive doubling
Pairs of local sums are formed recursively.
2·n·t_a          formation of the local sums
t_a + 2·t_k(1)   receiving and adding two partial sums (per level)
log2 p           # inner nodes (propagation levels)
T_p = 2·n·t_a + log2 p · [t_a + 2·t_k(1)]
Example: Inner product Version II
Speed-up
S(p) = T_1 / T_p
     = 2N·t_a / [(2n + log2 p)·t_a + 2·log2 p·t_k(1)]
     = 2N / [2n + log2 p·(1 + 2·t_r)]
with t_r = t_k(1) / t_a (smaller is better).
Speed-up Version II
S(p) = 2N / [2n + log2 p·(1 + 2·t_r)]
Only logarithmic dependence on the number of processors!
Example: Inner product Version III
Version III: Butterfly

s=0.;
for(i=0; i<n; i++) s += x[i] * y[i];
for(i=0; i<(int)log2(p); i++) {
  k = 1<<i;       /* bit shift to the left, i.e. the sequence 1,2,4,... */
  j = myself^k;   /* exclusive OR */
  send(s to proc j);
  s += recv(from proc j);
}

log2(p) steps
In each step, different pairs of processors exchange their data s.
Each processor forms the subtotal.
Algorithm known from the Fast Fourier Transform (FFT).
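A possible MPI sketch of the butterfly exchange (assuming p is a power of two and the local parts of length n are already distributed; MPI_Sendrecv replaces the explicit send/receive pair of the pseudocode, and the function name is illustrative):

#include <mpi.h>

/* Version III: butterfly summation of the local inner products.
   After log2(p) exchange steps every process holds the global sum. */
double inner_product_butterfly(const double *x, const double *y, int n, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * y[i];                  /* local inner product */

    for (int k = 1; k < size; k <<= 1) {   /* k = 1, 2, 4, ...: log2(p) steps */
        int partner = rank ^ k;            /* pairing via exclusive OR        */
        double t;
        MPI_Sendrecv(&s, 1, MPI_DOUBLE, partner, 0,
                     &t, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        s += t;                            /* both partners add the subtotal  */
    }
    return s;
}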
Example: Inner product Version III
Version III: Butterfly
Pairs of local sums are formed recursively.
2·n·t_a        formation of the local sums
t_a + t_k(1)   summation of two subtotals (per step)
log2 p         # inner nodes (steps)
Notation: s(0:3) means the sum over s(0), ..., s(3), i.e. Σ_{i=0}^{3} s(i)
T_p = 2·n·t_a + log2 p·(t_a + t_k(1))
Example: Inner product Version III
Speed-up
S(p) = T_1 / T_p
     = 2N·t_a / [(2n + log2 p)·t_a + log2 p·t_k(1)]
     = 2N / [2n + log2 p·(1 + t_r)]
with t_r = t_k(1) / t_a (smaller is better).
Speed-up Version III
S(p) = 2N / [2n + log2 p·(1 + t_r)]
Only logarithmic dependence on the number of processors!
Half the communication time compared to recursive doubling.
Example: Inner product Comparison
Comparison of Versions I, II & III for N = 100,000 (figure)
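The comparison shown in the figure can be reproduced from the model formulas; a small sketch (the ratio t_r is an illustrative assumption, N = 100,000 as on the slide):

#include <stdio.h>
#include <math.h>   /* log2; compile e.g. with: cc file.c -lm */

int main(void)
{
    const double N  = 100000.0;  /* problem size as in the comparison       */
    const double tr = 50.0;      /* assumed ratio t_k(1)/t_a (illustrative) */

    printf("    p   Version I  Version II  Version III\n");
    for (int p = 1; p <= 1024; p *= 2) {
        double n  = N / p;                 /* local vector length */
        double l2 = log2((double)p);
        double s1 = 2.0 * N / (2.0 * n + p - 1 + (p - 1) * tr);
        double s2 = 2.0 * N / (2.0 * n + l2 * (1.0 + 2.0 * tr));
        double s3 = 2.0 * N / (2.0 * n + l2 * (1.0 + tr));
        printf("%5d  %10.2f  %10.2f  %11.2f\n", p, s1, s2, s3);
    }
    return 0;
}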
Example: Inner product Conclusions
A simple performance model is helpful:
it can be used to analyse different realisations of parallel algorithms,
it takes the network speed into account.
Local inner products (Version I)
The linear dependence of the run time on the processor number p prevents massive parallelism.
Recursive doubling (Version II)
The logarithmic dependence of the run time leads to better speed-up.
Butterfly (Version III)
In addition, lower communication overhead.
Further readings
Literature
G. M. Amdahl: Validity of the single-processor approach to achieving large scale computing capabilities. In Proc. AFIPS Spring Joint Computer Conference, AFIPS Press, 1967, pp. 483-485.
J. L. Gustafson: Reevaluating Amdahl's Law. Comm. of the ACM, 31:532-533, 1988.
E. F. Van de Velde: Concurrent Scientific Computing. Springer Texts in Applied Mathematics 16, 1994.
M. McCool, A. D. Robison, J. Reinders: Structured Parallel Programming. Morgan Kaufmann, 2012.