Introduction to HPC - 2
Parallel Architectures and Reference Models
Francesco Moscato
HIGH PERFORMANCE COMPUTING
MSc in Computer Engineering (LM Ingegneria Informatica)
Università degli Studi di Salerno
e-mail: fmoscato@unisa.it
Outline
1 Parallelization and Communications
2 Performance Models
3 Overview
4 Questions
Parallel Algorithms, Architectures and Node Topology
Algorithm Design
Algorithms perform differently on different architectures and node topologies.
Algorithm −→ Architecture
You can design your algorithm first, and then choose the architecture and node topology that suit it best.
Of course: you have to configure and tune.
Architecture −→ Algorithm
You can design your algorithm depending on the architecture and node topology you have.
Of course: this is a constraint that can limit the algorithm's speedup.
Of course: you have to configure and tune.
Sum Algorithm
• What if we have a shared-memory, multi-thread architecture?
• What if we have distributed memory with a message-passing approach?
• And what about message passing with a different node topology?
Sum on Multi-core / Shared Memory
• All is fine if we have enough memory.
• All is fine if we have enough “cores”.
• Operating systems (the scheduler) virtualize processors.
• What if A is 42 Petabytes? (For the case where A does fit, see the threaded sketch below.)
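To make the shared-memory case concrete, here is a minimal sketch in C with POSIX threads: each thread sums its own chunk of A into a private slot, and the main thread combines the partial sums. This is only an illustration under stated assumptions (A fits in memory; N, NTHREADS and the toy input are placeholders), not the course's reference implementation.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static int A[N];                        /* shared input vector         */
    static long long partial[NTHREADS];     /* one private slot per thread */

    static void *sum_chunk(void *arg) {
        long t = (long)arg;
        long lo = t * (N / NTHREADS);
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        long long s = 0;
        for (long i = lo; i < hi; i++)      /* each thread sums its own chunk */
            s += A[i];
        partial[t] = s;                     /* no conflicts: disjoint slots   */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1;    /* toy input */
        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&th[t], NULL, sum_chunk, (void *)t);
        long long sum = 0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(th[t], NULL);
            sum += partial[t];                    /* combine partial sums */
        }
        printf("sum = %lld\n", sum);              /* expected: N */
        return 0;
    }

Compile with e.g. gcc -pthread; the chunked partitioning is exactly the work-splitting question raised in the next slides.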
Performance Issues
Performance
Scheduling (context switches), hardware architecture, caching mechanisms and cache sizes, and the interface to memory (as well as other hardware properties) all influence performance.
Massive Data and Computation
Massive computation and/or massive data require different architectures: they do not fit into a “single” CPU core.
Sum on MIMD Architectures
• Suppose we have a bus-based network with all-to-all connections.
• Input data exist somewhere ... they have to be routed to the nodes.
• Intermediate data have to be routed and managed to build the result.
• We can share the input if it is huge.
• What about work partitioning?
Some Issues on MIMD Architectures
• Bus: the example shows an all-to-all connection based on a TCP/IP network. This is obviously a cheap solution, but it has some performance problems. We can rely on faster, more reliable networks (Infiniband, Myrinet, etc.).
• Input data: we still need to distribute the input data among the nodes. However, the collection of input data can be shared among those same nodes.
• Partial results: algorithms may require distribution or routing of partial (and final) results.
• Communication: it takes time.
Sum with a Ring Bus
• On a ring bus, we need more effort to route the input data (distribution of the data is O(N), where N is the number of chunks; see the cost sketch below).
• More effort is needed to route partial results, too.
• Communication here requires more time.
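A back-of-the-envelope estimate, assuming the per-message cost model Tmsg = ts + L·tw introduced later in this lecture: the source node can inject only one chunk at a time, so scattering N chunks over the ring costs on the order of

\[ T_{scatter} \approx (N - 1)\,(t_s + L\,t_w) = O(N) \]

Pipelining and the ring distance only add lower-order terms to this estimate.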
The Need for a Model
A Model
We need a model to analyze parallel/distributed algorithms and their execution on different architectures.
Analysis
We need a model general enough to analyze parallel algorithms and their application to parallel architectures.
Evaluation
We need methods and techniques to evaluate real performance and improvements at run-time on HPC infrastructures.
An Abstract Model for Message Passing
Let's start with an abstract model with:
• nodes for tasks, edges for communication links
• send/receive of messages
• spawning and termination of tasks.
Design Methodology (by Foster)
• Partitioning: define a large number of small tasks (divide). Both data and computation are decomposed. We isolate tasks that can be executed in parallel, or data that can be processed in parallel. Practical issues (number of processors, network, etc.) are ignored at this stage.
• Communication: this includes messages for coordination (control flow) as well as for data flow.
• Agglomeration: tasks and communication structures are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are agglomerated into larger ones in order to improve performance and reduce costs.
• Mapping: each task is assigned to a processor in a network. Goals include maximizing CPU utilization and minimizing communication latency, delays and costs. Complex scheduling and load-balancing algorithms can be applied here (i.e. mapping can be dynamic).
Partitioning
• It relies on identifying independent tasks.
• Data decomposition (it is good to have many independent tasks on small chunks of data).
• Domain-based decomposition depends on the nature of the problem (e.g. a 3-D array can be decomposed into 1-D slices).
• Functional decomposition splits the computation according to features of the problem.
Communication
• Local vs global: with local communication, tasks communicate only with a small set of other tasks; global communication requires each task to communicate with many (possibly all) other tasks.
• Structured vs unstructured: structured communication has regular communication graphs (e.g. a tree, a grid, etc.); unstructured communication has arbitrary graphs.
• Static vs dynamic: in static communication, the interactions among tasks are known at design time; in dynamic communication, the communication graphs change depending on data and on the tasks' behavior and results.
• Synchronous vs asynchronous: in the former, a consumer gets data when a producer generates them; in the latter, a consumer can get data without the cooperation of the producer.
Agglomeration
• Agglomerating tasks reduces communication overhead.
• Sending/receiving data takes time (creation of messages, use of the proper data structures in the operating system; sends may be blocking, receives may require active waits, etc.).
• The less data we transmit, the faster the algorithm.
• The more tasks we have, the higher the cost of their creation (resources, time, etc.).
Mapping
• Place tasks that can execute concurrently on different processors.
• Place tasks that communicate a lot on the same processor, in order to improve locality.
• The mapping problem is NP-complete.
• Dynamic load balancing can be applied.
Speedup
Speedup
Speedup is a measure of the relative performance of two algorithms/systems when processing the same problem.
Example
Saying that an algorithm/system has a speedup of 10 on 12 processors (nodes) means that the algorithm/program executed on one processor (node) is ten times slower than the same algorithm/program executed on 12 processors (nodes).
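In symbols (the standard definition, consistent with the relative speedup Sr = T1/TP used later in this lecture): if T1 is the execution time on one processor (node) and TP the execution time on P processors (nodes),

\[ S = \frac{T_1}{T_P} \]

so the example above corresponds to S = 10 with P = 12.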
Amdahl's Law
The concept of speedup is connected to Amdahl's Law.
Amdahl’s Law
Theoretical Speedup
Amdahl's Law gives the theoretical speedup in the latency of execution of a task with a fixed workload. It expresses how much performance improves when the resources of the executing system improve (i.e. increase in number).
Sequential Part
It assumes that a program has a part that cannot be partitioned to execute in parallel (i.e. a sequential part). Since the sequential part cannot be executed in parallel, the maximum theoretical speedup depends on the fraction of the whole program that this sequential part represents.
Amdahl’s Law
Speedup
If the sequential part of an algorithm accounts for 1/s of the whole program execution time, then the maximum speedup that can be achieved on a parallel computer is s.
Example
If the sequential part is 5 percent (1/s = 1/20), then the maximum speedup is 20.
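The same statement in formula form (a standard restatement, with f = 1/s the sequential fraction of the execution time):

\[ S(P) = \frac{1}{f + (1 - f)/P} \;\xrightarrow{P \to \infty}\; \frac{1}{f} = s \]

With f = 0.05 the limit is 20, matching the example.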
Theoretical and Real Speedups
Measuring and Evaluating Speedup
When performing experiments to evaluate real speedup, performance and speedup are characterized by sentences like: “The algorithm implemented on parallel system X achieves a speedup of 10.8 on 12 processors with a problem size N = 100.”
Narrow-Region Characterization
This information is not “good”. What if N changes? What if communication costs change? What if the number of processors changes?
Algorithm Complexity and Speedup
Speedup Changes with Complexity
Let us assume that the optimal sequential algorithm has an asymptotic time complexity of O(N + N²), and let T be the execution-time function on P processors for a problem of size N.
Three Examples
Let us assume that we have 3 algorithms that, somewhere in the parameter space, have a speedup of 10.8 on P = 12.
Algorithm Complexity and Speedup
Example
1. T = N + N²/P: this partitions the O(N²) component, but replicates the O(N) part.
2. T = (N + N²)/P + 100: this partitions the whole algorithm, but introduces an additional cost of 100.
3. T = (N + N²)/P + 0.6P²: this introduces an additional cost of 0.6P².
Speedup
The speedup with N = 100, P = 12 is almost the same (around 10.8) for the three algorithms, but ...
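Before looking at the plots, a quick numeric check of that claim: a small C sketch evaluating S = T1/TP for the three execution-time models at N = 100, P = 12, taking T1 = N + N² as the optimal sequential time.

    #include <stdio.h>

    int main(void) {
        double N = 100.0, P = 12.0;
        double T1 = N + N * N;                 /* optimal sequential time      */
        double T[3] = {
            N + N * N / P,                     /* 1: replicates the O(N) part  */
            (N + N * N) / P + 100.0,           /* 2: fixed additional cost     */
            (N + N * N) / P + 0.6 * P * P      /* 3: cost growing with P^2     */
        };
        for (int i = 0; i < 3; i++)
            printf("algorithm %d: speedup = %.2f\n", i + 1, T1 / T[i]);
        return 0;
    }

It prints roughly 10.82, 10.73 and 10.88: nearly identical at this (N, P), yet the three curves diverge for other values, as the plots below show.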
Introduction to HPC - 2 – Francesco Moscato 22/46
Parallelization and Communications Perfomance Models Overview Questions
Example: Speedup Plotted with gnuplot, N = 100

[Figure: log-log plot of speedup vs. number of processors (1 to 1000) for N + N²/P, (N + N²)/P + 100, (N + N²)/P + 0.6P² and the ideal speedup, at N = 100; the point (P = 12, S = 10.82) is marked.]
Example: Speedup Plotted with gnuplot, N = 1000

[Figure: the same log-log plot for N = 1000, with processors ranging from 1 to 10000; the point (P = 12, S = 10.82) is marked.]
Asymptotic Analysis
Asymptotic Analysis
It is asymptotic ... and this can be a problem, because it ignores lower-order terms in formulas.
Example
The asymptotic cost of an algorithm running in 10N + N log N is O(N log N), which ignores the term 10N; yet that term is significant for N < 1024 (10N ≥ N log₂ N exactly when log₂ N ≤ 10, i.e. N ≤ 2¹⁰), i.e. the ignored term takes the longer time (or space) there. This is a problem if we divide data into little chunks, or if the input is “short”.
Communication Time
In this example we are not considering communication time.
The Need for a Model
Other Models
The model based on observations and the asymptotic analysis are not accurate enough to explain observed values or to predict future behavior.
Performance Model
A performance model should consider: the size of the problem; the number of tasks; the number of processors; communication; the operating system, libraries and other software issues; the hardware ...

T = f(N, P, U, ...)
Execution Time
Definition
The time elapsed between the start of the first processor and the completion of the last processor.
Components
It has three components: computation time (Tcomp), communication time (Tcomm) and idle time (Tidle), for each processor executing the parallel program:

T = Tcomp + Tcomm + Tidle
Introduction to HPC - 2 – Francesco Moscato 27/46
Parallelization and Communications Perfomance Models Overview Questions
Example of Execution Time

[Figure: per-processor timeline of computation, communication and idle phases.]
Execution Time
For a balanced distribution of processes:

\[ T = \frac{1}{P}\,(T_{comp} + T_{comm} + T_{idle}) = \frac{1}{P}\sum_{i=1}^{P}\left(T_{comp}^{i} + T_{comm}^{i} + T_{idle}^{i}\right) \]
Computation Time
It depends on:
• the problem size (N)
• the number of tasks or processors
• the memory system, cache configuration and performance
• the processor architecture, instruction set, pipeline, etc.
Performance Changes
Be aware: computation time changes when the computing architecture changes (you cannot simply scale up by comparing GFLOPS).
Communication
Sending a message needs:
• setting up the communication (ts)
• the time to send a chunk of data (say, a word) (tw)
• sending as many words as the message requires (say, L)
Sending a Message
Tmsg = ts + L · tw
Inter- vs Intra-Processor
We have to consider the time both for inter-processor and for intra-processor communication. We assume here that they are comparable.
Sending a Message
• Many network protocols show irregularities in the first stages of a communication.
• tw changes depending on the word size.
• tw also depends on the hardware, the operating system, the implementation of the communication protocols, etc. (see the measurement sketch below).
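For this reason, ts and tw are usually estimated empirically rather than taken from data sheets. A minimal MPI ping-pong sketch of such a measurement (only a sketch: MPI is introduced later in the course, and the message sizes and repetition count are illustrative placeholders):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 1000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int bytes = 8; bytes <= (1 << 20); bytes *= 2) {
            char *buf = malloc(bytes);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int r = 0; r < REPS; r++) {
                if (rank == 0) {          /* rank 0 sends, then waits for the echo */
                    MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {   /* rank 1 echoes the message back */
                    MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time */
            if (rank == 0)
                printf("%8d bytes: %.3f us\n", bytes, t * 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

Run with two processes (e.g. mpirun -np 2). Fitting the measured times against Tmsg = ts + L·tw over the message sizes gives ts as the intercept and tw as the slope.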
Idle time
• Lack of computation and/or data.
• Waiting for synchronization.
• It can be avoided by overlapping communication and computation (computing during communication, or while waiting for a message), and by creating multiple tasks with proper scheduling.
Efficiency and Speedup
Efficiency
It can provide a useful measure of parallel algorithm quality.
Relative Efficiency
\[ E_r = \frac{T_1}{P\,T_P} \]
T1 is the time to execute the program on one processor; TP is the time to execute the parallel program on P processors.
Relative Speedup
\[ S_r = P\,E_r = \frac{T_1}{T_P} \]
Relative Speedup
Example
\[ E_r = \frac{T_1}{P\,T_P};\qquad S_r = P\,E_r = \frac{T_1}{T_P} \]
Example 1: T1 = 10000 s; TP = 20 s: for P = 1000, Sr = 500.
Example 2: T1 = 1000 s; TP = 5 s: for P = 1000, Sr = 200.
A Problem with the Relative Speedup Metric
The second algorithm is better over the whole range 1–1000 (5 s instead of 20 s in absolute time), even though its speedup is lower.
We should have an algorithm-independent metric other than execution time.
Not-Ideal Efficiency and Speedup
Overhead
Parallelization introduces an overhead (communication, system calls and resources to create and manage parallel tasks, parallel-library overhead, etc.).
Modification to TP
\[ T_P = \frac{T_1}{P} + T_{overhead} \]
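Substituting this into the definition of relative efficiency (plain algebra on the formulas above):

\[ E_r = \frac{T_1}{P\,T_P} = \frac{T_1}{T_1 + P\,T_{overhead}} = \frac{1}{1 + P\,T_{overhead}/T_1} \]

Efficiency therefore decays as the aggregate overhead P·Toverhead grows relative to T1.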
Amdahl’s Law again
Evaluation
Suppose we have a sequential program that we are able to parallelize at 90%, that the parallelization is perfect (it scales linearly with P), and that T1 = 20 seconds.
Evaluation
\[ T_P = 0.9\,\frac{T_1}{P} + 0.1\,T_1 = \frac{18}{P} + 2 \]
\[ S = \frac{T_1}{0.9\,T_1/P + 0.1\,T_1} = \frac{20}{18/P + 2} \]
Amdahl's law states that the speedup is bounded by the serial execution time divided by the time of the part of the program that cannot be parallelized:
\[ S \le \frac{T_1}{0.1\,T_1} = \frac{20}{2} = 10 \]
Scalability
Idea
Scalability is the property of dealing with ever-increasing problem sizes.
Efficiency and Scalability
If we find that E stays constant when the size of the problem (and the number of processors) increases, then the program is scalable.
Scalability
Example
Let T1 = k seconds, let the problem size be N = k (yes ... the same), and let TP = k/P + 1.
\[ E = \frac{k}{P\,(k/P + 1)} = \frac{k}{k + P} \]
Let us increase k and P by two factors (α and β respectively):
\[ E = \frac{\alpha k}{\alpha k + \beta P} \]
If α = β = a:
\[ \frac{a k}{a\,(k + P)} = \frac{k}{k + P} \]
In this case ... the program is scalable.
Homework
Parallelize (shared memory and message passing) the following algorithm, and evaluate its performance, efficiency, speedup and scalability (a sequential C baseline is sketched below):

Parallel Generation and Sum

    Let A[N] be a vector of N random integer values
    // Shuffle A[N] with the Fisher-Yates algorithm:
    for i: N downto 1 do
        p = random number in [1..i]
        swap A[i], A[p]
    done
    // Sum the elements of A
    sum = 0
    for i: 1 to N do
        sum = sum + A[i]
    done
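As a starting point, here is a minimal sequential C baseline of the algorithm above (a sketch under simple assumptions: a fixed N, rand() as the random source with its modulo bias ignored; the shared-memory and message-passing parallelizations are the actual exercise):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000

    int main(void) {
        int *A = malloc(N * sizeof *A);
        if (!A) return 1;
        srand((unsigned)time(NULL));

        /* generate N random integer values */
        for (long i = 0; i < N; i++)
            A[i] = rand() % 1000;

        /* Fisher-Yates shuffle (0-indexed variant) */
        for (long i = N - 1; i > 0; i--) {
            long p = rand() % (i + 1);            /* p in [0..i] */
            int tmp = A[i]; A[i] = A[p]; A[p] = tmp;
        }

        /* sum the elements of A */
        long long sum = 0;
        for (long i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %lld\n", sum);
        free(A);
        return 0;
    }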
Main Frameworks/APIs/Tools for Parallel Programming
In this course...
• OpenMP
• MPI
• CUDA
• OpenCL
Many others exist in the literature ...
OpenMP
• Shared-memory parallel systems
• An open API specification [1] (it is not a framework or a tool/compiler)
• Implemented and supported by many tools [2]
• Defines a set of compiler directives, run-time routines and environment variables
• Based on a multi-thread model
• Cooperation is through shared memory (and thread-based synchronization directives)
• It allows for transparent scalability
• Programming model based on compiler directives (e.g. #pragma; see the sketch below)

[1] https://www.openmp.org
[2] https://www.openmp.org//resources/openmp-compilers-tools/
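A minimal illustration of the directive-based model, on the usual sum example (a sketch, not from the official specification examples; compile with e.g. gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static int A[N];
        for (int i = 0; i < N; i++) A[i] = 1;   /* toy input */

        long long sum = 0;
        /* the directive splits the loop among threads; reduction(+:sum)
           combines the per-thread partial sums at the end */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %lld (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }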
Message Passing Interface (MPI)
• A library specification for message passing [3]
• Designed for high performance on massively parallel clusters
• Many implementations (e.g. the Open MPI library)
• Based on the definition of processes and on message passing among processes
• It contains all the primitives needed to manage inter-process communication (send and receive, topology definition, packet management, etc.; see the sketch below)

[3] https://www.mcs.anl.gov/research/projects/mpi/
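A minimal illustration of the process/message model, again on the sum example (a sketch: every process sums a strided share of a toy input and MPI_Reduce combines the partial sums on rank 0; run with e.g. mpirun -np 4):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each process sums its own strided share of the input */
        long long local = 0;
        for (long i = rank; i < N; i += size)
            local += 1;                          /* toy data: every element is 1 */

        long long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %lld\n", total);     /* expected: N */
        MPI_Finalize();
        return 0;
    }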
CUDA
• NVIDIA platform and library [4]
• Applies general-purpose parallel programming to GPUs
• C/C++ libraries
• Based on the concept of computation kernels, which are spawned on groups of threads over shared memory
• Manages communication and synchronization

[4] https://developer.nvidia.com/cuda-zone
OpenCL
• An open standard for parallel programming across heterogeneous devices (CPU, GPU, FPGA, ...) [5]
• Many partners (Apple, Nvidia, Intel, Arm, AMD, Samsung, Sony ...)
• Task-based and data-based parallelism
• An abstract programming model to support multiple platforms
• A private/local/global/shared memory model
• A macro-based programming model (C language)

[5] https://www.khronos.org/opencl/
Any Questions?

[image from: https://pigswithcrayons.com/illustration/dd-players-strategy-guide-illustrations/]