Introduction to HPC - 2
Parallel Architectures and Reference Models
Francesco Moscato
HIGH PERFORMANCE COMPUTING
MSc in Computer Engineering (LM Ingegneria Informatica)
Università degli Studi di Salerno
e-mail: fmoscato@unisa.it
Outline
1 Parallelization and Communications
2 Performance Models
3 Overview
4 Questions
Parallel Algorithms, Architectures and Node Topology
Algorithm Design
Algorithms perform differently on different architectures and node topologies.
Algorithm −→ Architecture
You can design your algorithm first, and then choose the architecture and node topology that suit it best.
Of course: you have to configure and tune.
Architecture −→ Algorithm
You can design your algorithm depending on the architecture and node topology you have.
Of course: this is a constraint that can limit the algorithm's speedup.
Of course: you have to configure and tune.
Sum Algorithm
• What if we have a shared-memory, multi-thread architecture?
• What if we have distributed memory with a message-passing approach?
• And what about message passing with a different node topology?
Sum on Multi-core / Shared Memory
• All is fine if we have enough memory.
• All is fine if we have enough “cores”.
• Operating systems (the scheduler) virtualize processors.
• What if A is 42 Petabytes? (For the case where A does fit, see the threaded sketch below.)
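To make the shared-memory case concrete, here is a minimal sketch in C with POSIX threads: each thread sums its own chunk of A into a private slot, and the main thread combines the partial sums. This is only an illustration under stated assumptions (A fits in memory; N, NTHREADS and the toy input are placeholders), not the course's reference implementation.

    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static int A[N];                        /* shared input vector         */
    static long long partial[NTHREADS];     /* one private slot per thread */

    static void *sum_chunk(void *arg) {
        long t = (long)arg;
        long lo = t * (N / NTHREADS);
        long hi = (t == NTHREADS - 1) ? N : lo + N / NTHREADS;
        long long s = 0;
        for (long i = lo; i < hi; i++)      /* each thread sums its own chunk */
            s += A[i];
        partial[t] = s;                     /* no conflicts: disjoint slots   */
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) A[i] = 1;    /* toy input */
        pthread_t th[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&th[t], NULL, sum_chunk, (void *)t);
        long long sum = 0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(th[t], NULL);
            sum += partial[t];                    /* combine partial sums */
        }
        printf("sum = %lld\n", sum);              /* expected: N */
        return 0;
    }

Compile with e.g. gcc -pthread; the chunked partitioning is exactly the work-splitting question raised in the next slides.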
Performance Issues
Performance
Scheduling (context switches), hardware architecture, caching mechanisms and cache sizes, and the interface to memory (as well as other hardware properties) all influence performance.
Massive Data and Computation
Massive computation and/or massive data require different architectures: they do not fit into a “single” CPU core.
Sum on MIMD Architectures
• Suppose we have a bus-based network with all-to-all connections.
• Input data exist somewhere ... they have to be routed to the nodes.
• Intermediate data have to be routed and managed to build the result.
• We can share the input if it is huge.
• What about work partitioning?
Some Issues on MIMD Architectures
• Bus: the example shows an all-to-all connection based on a TCP/IP network. This is obviously a cheap solution, but it has some performance problems. We can rely on faster, more reliable networks (Infiniband, Myrinet, etc.).
• Input data: we still need to distribute the input data among the nodes. However, the collection of input data can be shared among those same nodes.
• Partial results: algorithms may require distribution or routing of partial (and final) results.
• Communication: it takes time.
Sum with a Ring Bus
• On a ring bus, we need more effort to route the input data (distribution of the data is O(N), where N is the number of chunks; see the cost sketch below).
• More effort is needed to route partial results, too.
• Communication here requires more time.
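A back-of-the-envelope estimate, assuming the per-message cost model Tmsg = ts + L·tw introduced later in this lecture: the source node can inject only one chunk at a time, so scattering N chunks over the ring costs on the order of

\[ T_{scatter} \approx (N - 1)\,(t_s + L\,t_w) = O(N) \]

Pipelining and the ring distance only add lower-order terms to this estimate.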
The Need for a Model
A Model
We need a model to analyze parallel/distributed algorithms and their execution on different architectures.
Analysis
We need a model general enough to analyze parallel algorithms and their application to parallel architectures.
Evaluation
We need methods and techniques to evaluate real performance and improvements at run-time on HPC infrastructures.
An Abstract Model for Message Passing
Let's start with an abstract model with:
• nodes for tasks, edges for communication links
• send/receive of messages
• spawning and termination of tasks.
Design Methodology (by Foster)
• Partitioning: define a large number of small tasks (divide). Both data and computation are decomposed. We isolate tasks that can be executed in parallel, or data that can be processed in parallel. Practical issues (number of processors, network, etc.) are ignored at this stage.
• Communication: this includes messages for coordination (control flow) as well as for data flow.
• Agglomeration: tasks and communication structures are evaluated with respect to performance requirements and implementation costs. If necessary, tasks are agglomerated into larger ones in order to improve performance and reduce costs.
• Mapping: each task is assigned to a processor in a network. Goals include maximizing CPU utilization and minimizing communication latency, delays and costs. Complex scheduling and load-balancing algorithms can be applied here (i.e. mapping can be dynamic).
Partitioning
• It relies on identifying independent tasks.
• Data decomposition (it is good to have many independent tasks on small chunks of data).
• Domain-based decomposition depends on the nature of the problem (e.g. a 3-D array can be decomposed into 1-D slices).
• Functional decomposition splits the computation according to features of the problem.
Communication
• Local vs global: with local communication, tasks communicate only with a small set of other tasks; global communication requires each task to communicate with many (possibly all) other tasks.
• Structured vs unstructured: structured communication has regular communication graphs (e.g. a tree, a grid, etc.); unstructured communication has arbitrary graphs.
• Static vs dynamic: in static communication, the interactions among tasks are known at design time; in dynamic communication, the communication graphs change depending on data and on the tasks' behavior and results.
• Synchronous vs asynchronous: in the former, a consumer gets data when a producer generates them; in the latter, a consumer can get data without the cooperation of the producer.
Agglomeration
• Agglomerating tasks reduces communication overhead.
• Sending/receiving data takes time (creation of messages, use of the proper data structures in the operating system; sends may be blocking, receives may require active waits, etc.).
• The less data we transmit, the faster the algorithm.
• The more tasks we have, the higher the cost of their creation (resources, time, etc.).
Mapping
• Place tasks that can execute concurrently on different processors.
• Place tasks that communicate a lot on the same processor, in order to improve locality.
• The mapping problem is NP-complete.
• Dynamic load balancing can be applied.
Speedup
Speedup
Speedup is a measure of the relative performance of two algorithms/systems when processing the same problem.
Example
Saying that an algorithm/system has a speedup of 10 on 12 processors (nodes) means that the algorithm/program executed on one processor (node) is ten times slower than the same algorithm/program executed on 12 processors (nodes).
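In symbols (the standard definition, consistent with the relative speedup Sr = T1/TP used later in this lecture): if T1 is the execution time on one processor (node) and TP the execution time on P processors (nodes),

\[ S = \frac{T_1}{T_P} \]

so the example above corresponds to S = 10 with P = 12.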
Amdahl's Law
The concept of speedup is connected to Amdahl's Law.
Amdahl’s Law
Theoretical Speedup
Amdahl's Law gives the theoretical speedup in the latency of execution of a task with a fixed workload. It expresses how much performance improves when the resources of the executing system improve (i.e. increase in number).
Sequential Part
It assumes that a program has a part that cannot be partitioned to execute in parallel (i.e. a sequential part). Since the sequential part cannot be executed in parallel, the maximum theoretical speedup depends on the fraction of the whole program that this sequential part represents.
Amdahl’s Law
Speedup
If the sequential part of an algorithm accounts for 1/s of the whole program execution time, then the maximum speedup that can be achieved on a parallel computer is s.
Example
If the sequential part is 5 percent (1/s = 1/20), then the maximum speedup is 20.
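The same statement in formula form (a standard restatement, with f = 1/s the sequential fraction of the execution time):

\[ S(P) = \frac{1}{f + (1 - f)/P} \;\xrightarrow{P \to \infty}\; \frac{1}{f} = s \]

With f = 0.05 the limit is 20, matching the example.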
Theoretical and Real Speedups
Measuring and Evaluating Speedup
When performing experiments to evaluate real speedup, performance and speedup are characterized by sentences like: “The algorithm implemented on parallel system X achieves a speedup of 10.8 on 12 processors with a problem size N = 100.”
Narrow-Region Characterization
This information is not “good”. What if N changes? What if communication costs change? What if the number of processors changes?
Algorithm Complexity and Speedup
Speedup Changes with Complexity
Let us assume that the optimal sequential algorithm has an asymptotic time complexity of O(N + N²), and let T be the execution-time function on P processors for a problem of size N.
Three Examples
Let us assume that we have 3 algorithms that, somewhere in the parameter space, have a speedup of 10.8 on P = 12.
Algorithm Complexity and Speedup
Example
1. T = N + N²/P: this partitions the O(N²) component, but replicates the O(N) part.
2. T = (N + N²)/P + 100: this partitions the whole algorithm, but introduces an additional cost of 100.
3. T = (N + N²)/P + 0.6P²: this introduces an additional cost of 0.6P².
Speedup
The speedup with N = 100, P = 12 is almost the same (around 10.8) for the three algorithms, but ...
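Before looking at the plots, a quick numeric check of that claim: a small C sketch evaluating S = T1/TP for the three execution-time models at N = 100, P = 12, taking T1 = N + N² as the optimal sequential time.

    #include <stdio.h>

    int main(void) {
        double N = 100.0, P = 12.0;
        double T1 = N + N * N;                 /* optimal sequential time      */
        double T[3] = {
            N + N * N / P,                     /* 1: replicates the O(N) part  */
            (N + N * N) / P + 100.0,           /* 2: fixed additional cost     */
            (N + N * N) / P + 0.6 * P * P      /* 3: cost growing with P^2     */
        };
        for (int i = 0; i < 3; i++)
            printf("algorithm %d: speedup = %.2f\n", i + 1, T1 / T[i]);
        return 0;
    }

It prints roughly 10.82, 10.73 and 10.88: nearly identical at this (N, P), yet the three curves diverge for other values, as the plots below show.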
Introduction to HPC - 2 – Francesco Moscato 22/46
Parallelization and Communications Perfomance Models Overview Questions
Example: Speedup Plotted with gnuplot, N = 100

[Figure: log-log plot of speedup vs. number of processors (1 to 1000) for N + N²/P, (N + N²)/P + 100, (N + N²)/P + 0.6P² and the ideal speedup, at N = 100; the point (P = 12, S = 10.82) is marked.]
Example: Speedup Plotted with gnuplot, N = 1000

[Figure: the same log-log plot for N = 1000, with processors ranging from 1 to 10000; the point (P = 12, S = 10.82) is marked.]
Asymptotic Analysis
Asymptotic Analysis
It is asymptotic ... and this can be a problem, because it ignores lower-order terms in formulas.
Example
The asymptotic cost of an algorithm running in 10N + N log N is O(N log N), which ignores the term 10N; yet that term is significant for N < 1024 (10N ≥ N log₂ N exactly when log₂ N ≤ 10, i.e. N ≤ 2¹⁰), i.e. the ignored term takes the longer time (or space) there. This is a problem if we divide data into little chunks, or if the input is “short”.
Communication Time
In this example we are not considering communication time.
The Need for a Model
Other Models
The model based on observations and the asymptotic analysis are not accurate enough to explain observed values or to predict future behavior.
Performance Model
A performance model should consider: the size of the problem; the number of tasks; the number of processors; communication; the operating system, libraries and other software issues; the hardware ...

T = f(N, P, U, ...)
Execution Time
Definition
The time elapsed between the start of the first processor and the completion of the last processor.
Components
It has three components: computation time (Tcomp), communication time (Tcomm) and idle time (Tidle), for each processor executing the parallel program:

T = Tcomp + Tcomm + Tidle
Introduction to HPC - 2 – Francesco Moscato 27/46
Parallelization and Communications Perfomance Models Overview Questions
Example of Execution Time

[Figure: per-processor timeline of computation, communication and idle phases.]
Execution Time
For a balanced distribution of processes:

\[ T = \frac{1}{P}\,(T_{comp} + T_{comm} + T_{idle}) = \frac{1}{P}\sum_{i=1}^{P}\left(T_{comp}^{i} + T_{comm}^{i} + T_{idle}^{i}\right) \]
Computation Time
It depends on:
• the problem size (N)
• the number of tasks or processors
• the memory system, cache configuration and performance
• the processor architecture, instruction set, pipeline, etc.
Performance Changes
Be aware: computation time changes when the computing architecture changes (you cannot simply scale up by comparing GFLOPS).
Communication
Sending a message needs:
• setting up the communication (ts)
• the time to send a chunk of data (say, a word) (tw)
• sending as many words as the message requires (say, L)
Sending a Message
Tmsg = ts + L · tw
Inter- vs Intra-Processor
We have to consider the time both for inter-processor and for intra-processor communication. We assume here that they are comparable.
Sending a Message
• Many network protocols show irregularities in the first stages of a communication.
• tw changes depending on the word size.
• tw also depends on the hardware, the operating system, the implementation of the communication protocols, etc. (see the measurement sketch below).
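For this reason, ts and tw are usually estimated empirically rather than taken from data sheets. A minimal MPI ping-pong sketch of such a measurement (only a sketch: MPI is introduced later in the course, and the message sizes and repetition count are illustrative placeholders):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define REPS 1000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int bytes = 8; bytes <= (1 << 20); bytes *= 2) {
            char *buf = malloc(bytes);
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int r = 0; r < REPS; r++) {
                if (rank == 0) {          /* rank 0 sends, then waits for the echo */
                    MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                } else if (rank == 1) {   /* rank 1 echoes the message back */
                    MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
                    MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t = (MPI_Wtime() - t0) / (2.0 * REPS);   /* one-way time */
            if (rank == 0)
                printf("%8d bytes: %.3f us\n", bytes, t * 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

Run with two processes (e.g. mpirun -np 2). Fitting the measured times against Tmsg = ts + L·tw over the message sizes gives ts as the intercept and tw as the slope.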
Idle time
• Lack of computation and/or data.
• Waiting for synchronization.
• It can be avoided by overlapping communication and computation (computing during communication, or while waiting for a message), and by creating multiple tasks with proper scheduling.
Efficiency and Speedup
Efficiency
It can provide a useful measure of parallel algorithm quality.
Relative Efficiency
\[ E_r = \frac{T_1}{P\,T_P} \]
T1 is the time to execute the program on one processor; TP is the time to execute the parallel program on P processors.
Relative Speedup
\[ S_r = P\,E_r = \frac{T_1}{T_P} \]
Relative Speedup
Example
\[ E_r = \frac{T_1}{P\,T_P};\qquad S_r = P\,E_r = \frac{T_1}{T_P} \]
Example 1: T1 = 10000 s; TP = 20 s: for P = 1000, Sr = 500.
Example 2: T1 = 1000 s; TP = 5 s: for P = 1000, Sr = 200.
A Problem with the Relative Speedup Metric
The second algorithm is better over the whole range 1–1000 (5 s instead of 20 s in absolute time), even though its speedup is lower.
We should have an algorithm-independent metric other than execution time.
Not-Ideal Efficiency and Speedup
Overhead
Parallelization introduces an overhead (communication, system calls and resources to create and manage parallel tasks, parallel-library overhead, etc.).
Modification to TP
\[ T_P = \frac{T_1}{P} + T_{overhead} \]
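Substituting this into the definition of relative efficiency (plain algebra on the formulas above):

\[ E_r = \frac{T_1}{P\,T_P} = \frac{T_1}{T_1 + P\,T_{overhead}} = \frac{1}{1 + P\,T_{overhead}/T_1} \]

Efficiency therefore decays as the aggregate overhead P·Toverhead grows relative to T1.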
Amdahl’s Law again
Evaluation
Suppose we have a sequential program that we are able to parallelize at 90%, that the parallelization is perfect (it scales linearly with P), and that T1 = 20 seconds.
Evaluation
\[ T_P = 0.9\,\frac{T_1}{P} + 0.1\,T_1 = \frac{18}{P} + 2 \]
\[ S = \frac{T_1}{0.9\,T_1/P + 0.1\,T_1} = \frac{20}{18/P + 2} \]
Amdahl's law states that the speedup is bounded by the serial execution time divided by the time of the part of the program that cannot be parallelized:
\[ S \le \frac{T_1}{0.1\,T_1} = \frac{20}{2} = 10 \]
Scalability
Idea
Scalability is the property of dealing with ever-increasing problem sizes.
Efficiency and Scalability
If we find that E stays constant when the size of the problem (and the number of processors) increases, then the program is scalable.
Scalability
Example
Let T1 = k seconds, let the problem size be N = k (yes ... the same), and let TP = k/P + 1.
\[ E = \frac{k}{P\,(k/P + 1)} = \frac{k}{k + P} \]
Let us increase k and P by two factors (α and β respectively):
\[ E = \frac{\alpha k}{\alpha k + \beta P} \]
If α = β = a:
\[ \frac{a k}{a\,(k + P)} = \frac{k}{k + P} \]
In this case ... the program is scalable.
Homework
Parallelize (shared memory and message passing) the following algorithm, and evaluate its performance, efficiency, speedup and scalability (a sequential C baseline is sketched below):

Parallel Generation and Sum

    Let A[N] be a vector of N random integer values
    // Shuffle A[N] with the Fisher-Yates algorithm:
    for i: N downto 1 do
        p = random number in [1..i]
        swap A[i], A[p]
    done
    // Sum the elements of A
    sum = 0
    for i: 1 to N do
        sum = sum + A[i]
    done
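As a starting point, here is a minimal sequential C baseline of the algorithm above (a sketch under simple assumptions: a fixed N, rand() as the random source with its modulo bias ignored; the shared-memory and message-passing parallelizations are the actual exercise):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 1000000

    int main(void) {
        int *A = malloc(N * sizeof *A);
        if (!A) return 1;
        srand((unsigned)time(NULL));

        /* generate N random integer values */
        for (long i = 0; i < N; i++)
            A[i] = rand() % 1000;

        /* Fisher-Yates shuffle (0-indexed variant) */
        for (long i = N - 1; i > 0; i--) {
            long p = rand() % (i + 1);            /* p in [0..i] */
            int tmp = A[i]; A[i] = A[p]; A[p] = tmp;
        }

        /* sum the elements of A */
        long long sum = 0;
        for (long i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %lld\n", sum);
        free(A);
        return 0;
    }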
Main Frameworks/APIs/Tools for Parallel Programming
In this course...
• OpenMP
• MPI
• CUDA
• OpenCL
Many others exist in the literature ...
OpenMP
• Shared-memory parallel systems
• An open API specification [1] (it is not a framework or a tool/compiler)
• Implemented and supported by many tools [2]
• Defines a set of compiler directives, run-time routines and environment variables
• Based on a multi-thread model
• Cooperation is through shared memory (and thread-based synchronization directives)
• It allows for transparent scalability
• Programming model based on compiler directives (e.g. #pragma; see the sketch below)

[1] https://www.openmp.org
[2] https://www.openmp.org//resources/openmp-compilers-tools/
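A minimal illustration of the directive-based model, on the usual sum example (a sketch, not from the official specification examples; compile with e.g. gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static int A[N];
        for (int i = 0; i < N; i++) A[i] = 1;   /* toy input */

        long long sum = 0;
        /* the directive splits the loop among threads; reduction(+:sum)
           combines the per-thread partial sums at the end */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += A[i];

        printf("sum = %lld (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }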
Message Passing Interface (MPI)
• A library specification for message passing [3]
• Designed for high performance on massively parallel clusters
• Many implementations (e.g. the Open MPI library)
• Based on the definition of processes and on message passing among processes
• It contains all the primitives needed to manage inter-process communication (send and receive, topology definition, packet management, etc.; see the sketch below)

[3] https://www.mcs.anl.gov/research/projects/mpi/
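A minimal illustration of the process/message model, again on the sum example (a sketch: every process sums a strided share of a toy input and MPI_Reduce combines the partial sums on rank 0; run with e.g. mpirun -np 4):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each process sums its own strided share of the input */
        long long local = 0;
        for (long i = rank; i < N; i += size)
            local += 1;                          /* toy data: every element is 1 */

        long long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %lld\n", total);     /* expected: N */
        MPI_Finalize();
        return 0;
    }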
CUDA
• NVIDIA platform and library [4]
• Applies general-purpose parallel programming to GPUs
• C/C++ libraries
• Based on the concept of computation kernels, which are spawned on groups of threads over shared memory
• Manages communication and synchronization

[4] https://developer.nvidia.com/cuda-zone
OpenCL
• An open standard for parallel programming across heterogeneous devices (CPU, GPU, FPGA, ...) [5]
• Many partners (Apple, Nvidia, Intel, Arm, AMD, Samsung, Sony ...)
• Task-based and data-based parallelism
• An abstract programming model to support multiple platforms
• A private/local/global/shared memory model
• A macro-based programming model (C language)

[5] https://www.khronos.org/opencl/
Any Questions?

[image from: https://pigswithcrayons.com/illustration/dd-players-strategy-guide-illustrations/]