Introduction to HPC - 2

Parallel Architectures and Reference Models

Francesco Moscato

HIGH PERFORMANCE COMPUTING


LM Ingegneria Informatica
Università degli Studi di Salerno
e-mail: fmoscato@unisa.it

Outline

1 Parallelization and Communications

2 Performance Models

3 Overview

4 Questions


Parallel Algorithms, Architectures and Node Topology


Algorithm Design
Algorithms have different performance on different architectures
and node topologies.

Algorithm −→ Architecture
You can design your algorithm first and then choose the best-suited
architecture and node topology.
Of course: you have to configure and tune.

Architecture −→ Algorithm
You can design your algorithm depending on the architecture and
node topology you have.
Of course: this is a constraint that can limit algorithm speedup.
Of course: you have to configure and tune.

Sum Algorithm

• What if we have a
Shared-Memory with
Multi-Thread Architecture?
• What if we have a Distributed
Memory with Message Passing
Approach?
• And what about message
passing with different node
topologies?


Sum on Multi-core / Shared Memory

• All ok if we have enough
memory
• All ok if we have enough
“cores”
• Operating systems (scheduler)
virtualize processors.
• What if A is 42 Petabytes?


Performance issues

Performance
Scheduling (context switches), hardware architecture, caching
mechanisms and cache sizes, the interface to memory (as well as other
hardware properties) influence performance.

Massive Data and Computation

Massive computation and/or massive data require different
architectures: they do not fit into a “single” CPU core.


Sum on MIMD architectures


• If we have a Bus-based
network with all-to-all
connections
• Input data exist somewhere
... they have to be routed to
the nodes
• Intermediate data have to
be routed and managed to
build the result
• We can share the input if it is
huge.
• What about work
partitioning?


Some Issues on MIMD Architectures

• Bus: the example shows an all-to-all connection based on a
TCP/IP network. This is obviously a cheap solution, but it has
some performance problems. We can rely on faster and more
reliable networks (InfiniBand, Myrinet, etc.)
• Input Data: We still need to distribute input data
among the nodes. However, the collection of input data can be
shared among those nodes.
• Partial Results: Algorithms may require distribution or routing
of partial (and final) results.
• Communication: it requires time


Sum With Ring Bus

• On a Ring Bus, we need
more effort to route input
data (distribution of data is
O(N), where N is the
number of chunks)
• More effort too for routing
of partial results.
• Communication here
requires more time.


The Need for a Model

A Model
We need a model to analyze parallel/distributed algorithms and
their execution on different architectures

Analysis
We need a model general enough to analyze parallel algorithms
and their application to parallel architectures

Evaluation
We need methods and techniques to evaluate real performance
and improvements at run-time on HPC infrastructures.


An Abstract Model for Message Passing

Let's start with an abstract model with:
• nodes for tasks, edges for
communication links
• send-receive of messages
• spawn and termination of
tasks.


Design Methodology (By Foster)

• Partitioning: Define a large number of small
tasks (divide). Both Data and Computation are
decomposed. We isolate tasks that can be
executed in parallel, or data that can be
processed in parallel. Practical issues are
ignored (number of processors, network etc.)
• Communication: This includes messages for
coordination (control flow) and data flow as
well.
• Agglomeration: tasks and communication
structures are evaluated with respect to
performance requirements and implementation
costs. If necessary, tasks can be agglomerated
into larger ones in order to improve performance
and reduce costs.
• Mapping: Each Task is assigned to a processor
in a network. Goals include maximizing CPU
utilization and minimizing communication
latency, delays and costs. Complex scheduling
and load balancing algorithms can be applied
here (i.e. mapping can be dynamic).


Partitioning

• Depends on independent tasks


• Decomposition on data (it is good to have many independent
tasks on small chunks of data)
• Domain-based decomposition depends on the nature of the
problem (e.g. in a 3-D array, you can decompose into 1-D
slices)
• Functional decomposition splits the computation by feature
(feature-based decomposition)


Communication

• local vs global: With local communication, tasks communicate
only with a small set of other tasks; global communication requires
each task to communicate with many (possibly all) other tasks
• structured vs unstructured: structured communication has regular
comm graphs (e.g. a tree, a grid etc.); unstructured
communications have arbitrary graphs.
• static vs dynamic: in static communication, interaction of tasks
are known at design time; in dynamic communication,
communication graphs change depending on data and on tasks
behaviors and results.
• synchronous vs asynchronous: in the first a consumer gets data
when a producer generates it; in the second kind of communication,
a consumer can get data without the cooperation of a producer.


Agglomeration

• Agglomerating tasks reduces communication overhead
• sending / receiving data requires time (creation of messages,
use of proper data structures in the Operating System, sends
may be blocking or receives may require active waits, etc.)
• the less data we transmit, the faster the algorithm
• the more tasks we have, the higher the cost of their
creation (resources, time etc.)


Mapping

• place tasks that can execute concurrently on different
processors;
• place tasks that communicate a lot on the same processor in
order to improve locality.
• The mapping problem is NP-complete
• Dynamic load balancing


Speedup

Speedup
is a measure of relative performance of two Algorithms/systems
when processing the same problem.

Example
Saying that an algorithm/system on 12 processors (nodes) has a
speedup of 10 means that the algorithm/program executed on
one processor (node) is ten times slower than the same
algorithm/program executed on 12 processors (nodes).

Amdahl’s Law
The concept of speedup is connected to Amdahl’s Law


Amdahl’s Law

Theoretical SpeedUp
Amdahl’s Law gives the theoretical speedup in the latency of
execution of a task with a fixed workload. It expresses how much
performance improves when the resources of the executing system
improve (i.e. increase in number).

Sequential part
It assumes that a program has a part that cannot be partitioned to
execute in parallel (i.e. a sequential part). Since the sequential
part cannot be executed in parallel, the maximum theoretical
speedup depends on the fraction of the whole program that is
sequential.


Amdahl’s Law

Speedup
If the sequential part of an algorithm accounts for 1/s of the whole
program execution time, then the maximum speedup that can be
achieved on a parallel computer is s.

Example
if the sequential part is 5 percent, then the maximum speedup is 20
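
For reference, the same law in closed form (a standard formulation, added here; it is not on the original slide): writing f = 1/s for the sequential fraction and P for the number of processors,

S(P) = \frac{1}{f + (1 - f)/P}, \qquad \lim_{P \to \infty} S(P) = \frac{1}{f} = s

With f = 0.05 the limit is 1/0.05 = 20, matching the example above.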


Theoretical and Real Speedups

Measuring and Evaluating Speedup


When performing experiments to evaluate real speedup,
performance and speedup are characterized by sentences like:
The Algorithm implemented on parallel system X achieves a
speedup of 10.8 on 12 processors with a problem size N = 100.

Narrow region characterization


This information is not “good”. What if N changes? What if
communication costs change? What if the number of processors
changes?


Algorithm Complexity and Speedup

Speedup changes with complexity

Let us assume that we have an asymptotic time complexity (for
the optimal sequential algorithm) of O(N + N^2), with an
execution time function T on a number of processors P and a
problem size of N

Three Examples
Let us assume that we have 3 algorithms that, at some point,
achieve a speedup of 10.8 on P = 12


Algorithm Complexity and Speedup

Example
1 T = N + N^2/P: this partitions the O(N^2) component, but
replicates the O(N) part.
2 T = (N + N^2)/P + 100: this partitions the whole algorithm
but introduces an additional cost of 100;
3 T = (N + N^2)/P + 0.6P^2: this introduces an additional cost
of 0.6P^2

Speedup
Speedup with N=100, P=12 is almost the same (10.82) for the
three algorithms but ...
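
A quick way to check these numbers is to evaluate the three time models directly (a minimal C sketch; it takes the sequential time to be N + N^2, as the complexity above suggests, and uses the N and P of this example):

#include <stdio.h>

/* the three parallel execution-time models from the example above */
static double t1(double N, double P) { return N + N * N / P; }
static double t2(double N, double P) { return (N + N * N) / P + 100.0; }
static double t3(double N, double P) { return (N + N * N) / P + 0.6 * P * P; }

int main(void) {
    double N = 100.0, P = 12.0;
    double Tseq = N + N * N;   /* sequential time: N + N^2 */
    printf("speedup 1: %.2f\n", Tseq / t1(N, P));
    printf("speedup 2: %.2f\n", Tseq / t2(N, P));
    printf("speedup 3: %.2f\n", Tseq / t3(N, P));
    return 0;
}

All three come out close to 10.8 for N = 100 and P = 12, but they diverge quickly as N or P changes, as the plots on the next slides show.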


Example: Speedup (GNU)Plotted with N=100

[Plot: speedup vs. number of processors (log-log scale) for N = 100, comparing N + N^2/P, (N + N^2)/P + 100, (N + N^2)/P + 0.6P^2 and the ideal speedup; the point where speedup ≈ 10.82 at P = 12 is marked.]

Example: Speedup (GNU)Plotted with N=1000

[Plot: speedup vs. number of processors (log-log scale) for N = 1000, with the same three time models and the ideal speedup; the point where speedup ≈ 10.82 at P = 12 is marked for reference.]

Asymptotic Analysis

Asymptotic Analysis
It is Asymptotic ... And this can be a Problem because it ignores
lower-order terms in formulas

Example
The (asymptotic) cost of an algorithm taking 10N + N log N ignores the
10N term, which is the dominant one for N < 1024 (i.e. most of the
time (or space) is taken by the ignored term). This is a problem if
we divide data into small chunks, or if the input is “short”

Communication time
In this example we are not considering Communication time.
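
A small numeric sketch of the point above (base-2 logarithm assumed, which is what the N < 1024 threshold implies):

#include <stdio.h>
#include <math.h>

int main(void) {
    /* compare the "ignored" 10N term with the N*log2(N) term */
    double sizes[] = { 64, 256, 1024, 4096, 65536 };
    for (int i = 0; i < 5; i++) {
        double N = sizes[i];
        printf("N = %6.0f: 10N = %8.0f, N*log2(N) = %8.0f\n",
               N, 10.0 * N, N * log2(N));
    }
    return 0;
}

For N below 1024 the 10N term is the larger of the two, so ignoring it misrepresents the cost when data are split into small chunks.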


The Need for a Model

Other Model
The model based on observations and the asymptotic analysis are
not accurate enough to explain observed values or to
predict future behaviour.

Performance Model
Performance model should consider: Dimension of the Problem;
Number of Tasks; Number of Processors; Communication;
Operating System, Libraries and other software issues; Hardware ...

T = f (N, P, U, · · · )


Execution Time

Definition
The time elapsed between the start of the first processor and the
completion of the last one

Components
It has three components: Computation Time (Tcomp),
Communication Time (Tcomm) and Idle Time (Tidle), for
each processor executing the parallel program.

T = Tcomp + Tcomm + Tidle


Example of Execution Time


Execution Time

For a balanced distribution of processes:

T = (1/P) (Tcomp + Tcomm + Tidle)
  = (1/P) Σ_{i=1}^{P} (T^i_comp + T^i_comm + T^i_idle)


Computation Time

Depends on :
• Problem Size (N)
• Number of tasks or processors
• Memory System, Cache Configuration and Performances
• Processor Architecture, Instructions Set, Pipeline etc.

Performance Changes
Be aware: computation time changes when the computing
architecture changes (you cannot simply scale up by comparing GFLOPS)


Communication
Sending a Message needs:
• Setup Communication (ts )
• Time to send a chunk of data (let’s say a word) (tw )
• Sending as many words as we need to transmit the message
(e.g. L)

Sending a Message

Tmsg = ts + L · tw

Inter-Intra Processor
We have to consider time both in the case of Inter-processors as
well as of Intra-processors communication. We assume here they
are comparable
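
A small sketch of what the model implies for message aggregation (the ts and tw values are assumptions, purely illustrative; real values must be measured on the target network):

#include <stdio.h>

/* cost model from the slide: Tmsg = ts + L * tw */
static double t_msg(double ts, double tw, long L) { return ts + L * tw; }

int main(void) {
    double ts = 1e-4;    /* assumed setup time: 100 microseconds   */
    double tw = 1e-8;    /* assumed per-word transfer time: 10 ns  */
    long   L  = 100000;  /* total number of words to transmit      */

    /* one large message vs. 100 small ones carrying the same data */
    double one_msg  = t_msg(ts, tw, L);
    double many_msg = 100 * t_msg(ts, tw, L / 100);

    printf("1 message of %ld words   : %.6f s\n", L, one_msg);
    printf("100 messages of %ld words: %.6f s\n", L / 100, many_msg);
    return 0;
}

With these assumed numbers the setup time dominates when the same data are split into many small messages, which is exactly the overhead that agglomeration tries to reduce.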

Sending a Message

• Many network protocols
have irregularities in the first
stages of communication
• tw changes depending on
word size
• tw also depends on
hardware, Operating
systems, implementation of
communication protocols
etc.


Idle time

• lack of computation and/or data
• waiting for synchronization
• can be avoided by overlapping communication and computation
(doing useful work during communication or while waiting for a
message), by creating multiple tasks and with proper scheduling


Efficiency and Speedup


Efficiency
Can provide a useful measure of parallel algorithm quality

Relative Efficiency

Er = T1 / (P · TP)

T1 is the time to execute the program on one processor; TP is the
time to execute the parallel program on P processors

Relative Speedup

Sr = P Er


Relative Speedup

Example
Er = T1 / (P · TP); Sr = P · Er

Example 1: P = 1000; T1 = 10000 s; TP = 20 s: for P = 1000,
Sr = 500
Example 2: P = 1000; T1 = 1000 s; TP = 5 s: for P = 1000,
Sr = 200

Problem in relative Speedup Metric


The second algorithm is faster (in absolute execution time) over the
whole range P = 1–1000, even though its relative speedup is lower.
We should use an algorithm-independent metric rather than each
program's own execution time


Not-Ideal Efficiency and Speedup

Overhead
Parallelization introduces an overhead (communication, system calls
and resources to create and manage parallel tasks, parallel library
overhead, etc.)

Modification to TP

TP = T1/P + Toverhead


Amdahl’s Law again


Evaluation
Suppose we have a sequential program that we are able to
parallelize at 90%, that the parallelization is perfect (it scales
linearly with P), and that T1 = 20 seconds.

Evaluation

TP = 0.9 T1/P + 0.1 T1 = 18/P + 2

S = T1 / (0.9 T1/P + 0.1 T1) = 20 / (18/P + 2)

Amdahl’s law states that the speedup is at most the serial
execution time divided by the time of the part that cannot be
parallelized:

S ≤ T1 / (0.1 T1) = 20/2 = 10
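
A quick numerical check (a sketch using the 90/10 split and T1 = 20 s from the example above):

#include <stdio.h>

int main(void) {
    double T1 = 20.0;   /* sequential execution time (seconds)  */
    double f  = 0.1;    /* fraction that cannot be parallelized */

    for (int P = 1; P <= 1024; P *= 4) {
        double TP = (1.0 - f) * T1 / P + f * T1;   /* 18/P + 2 */
        printf("P = %4d  TP = %6.3f s  S = %5.2f\n", P, TP, T1 / TP);
    }
    return 0;
}

The speedup climbs towards, but never exceeds, the bound of 10 derived above.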

Scalability

Idea
Scalability is the property of dealing with ever-increasing problem
size

Efficiency and Scalability


If, when we increase the size of the problem (and the number of
processors), E remains constant, then the program is scalable


Scalability

Example
Let T1 = k seconds; N = k (yes ... the same) is the problem
size; TP = k/P + 1

E = k / (P (k/P + 1)) = k / (k + P)

Let us increase k and P by two factors (α and β respectively):

E = αk / (αk + βP)

if α = β = a:

E = ak / (a(k + P)) = k / (k + P)

In this case ... the program is scalable


Homework
Parallelize (Shared Memory and Message Passing) and evaluate
the performance, efficiency, speedup and scalability of the following
algorithm:
Parallel Generation and Sum

Let A[N] be a vector of N random integer values
// Shuffle A[N] with the Fisher-Yates algorithm:
for i: N downto 1 do
    p = random number in [1..i]
    swap A[i], A[p]
done
// Sum the elements of A
sum = 0
for i: 1 to N do
    sum = sum + A[i]
done


Main Frameworks/API/Tools for Parallel Programming

In this course...
• OpenMP
• MPI
• CUDA
• OpenCL
Many others in literature ...


OpenMP

• Shared-memory parallel systems
• open API specification [1] (it is not a framework or a
tool/compiler)
• Implemented and supported by many tools [2]
• defines a set of compiler directives, run-time routines and
environment variables
• based on a multi-thread model
• cooperation is through shared memory (and thread-based
synchronization directives)
• it allows for transparent scalability
• programming model based on compiler directives (e.g.
#pragma); a minimal sketch follows below
[1] https://www.openmp.org
[2] https://www.openmp.org//resources/openmp-compilers-tools/
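
A minimal OpenMP sketch of the parallel sum discussed in the first part of the lecture (the array size and its contents are illustrative; compile with an OpenMP-capable compiler, e.g. gcc -fopenmp):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long N = 10000000;          /* problem size (illustrative)   */
    long *A = malloc(N * sizeof *A);
    for (long i = 0; i < N; i++)
        A[i] = i % 100;               /* stand-in for the input values */

    long sum = 0;
    /* each thread sums a chunk of A; the partial sums are combined
       by the reduction clause */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += A[i];

    printf("sum = %ld using up to %d threads\n", sum, omp_get_max_threads());
    free(A);
    return 0;
}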
Introduction to HPC - 2 – Francesco Moscato 42/46
Parallelization and Communications Perfomance Models Overview Questions

Message Passing Interface (MPI)

• Library specification for message passing [3]
• designed for high performance on massively parallel clusters
• many implementations (e.g. the Open MPI library)
• based on process definition and message passing among
processes.
• it contains all the directives to manage inter-process
communication (send and receive, topology definition,
packet management etc.); a minimal sketch follows below

[3] https://www.mcs.anl.gov/research/projects/mpi/
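
A minimal MPI sketch of the same sum, with each process summing its own chunk and a reduction combining the partial results (illustrative only; compile with mpicc and run with, e.g., mpirun -np 4 ./sum):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 10000000;                  /* problem size (illustrative)  */
    long chunk = N / size;                    /* contiguous chunk per process */
    long lo = rank * chunk;
    long hi = (rank == size - 1) ? N : lo + chunk;

    long local = 0;
    for (long i = lo; i < hi; i++)
        local += i % 100;                     /* stand-in for A[i] */

    long total = 0;
    /* combine the partial sums on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %ld computed by %d processes\n", total, size);

    MPI_Finalize();
    return 0;
}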

CUDA

• NVIDIA platform and libraries [4]
• applies general-purpose parallel programming to GPUs
• C / C++ libraries
• based on the concept of compute kernels, spawned over
groups of threads that can share memory
• manages communication and synchronization

[4] https://developer.nvidia.com/cuda-zone

OpenCL

• Open standard for parallel programming across
heterogeneous devices (CPU, GPU, FPGA, ...) [5]
• many partners (Apple, Nvidia, Intel, Arm, AMD, Samsung,
Sony ...)
• task-based and data-based parallelism
• an abstract programming model to support multiple platforms
• private/local/global/shared memory model
• macro-based programming model (C language)
[5] https://www.khronos.org/opencl/

Any Questions?

Image from: https://pigswithcrayons.com/illustration/dd-players-strategy-guide-illustrations/
