Parallel Architecture and Execution
Lecture 2
January 8, 2025
Parallel Application Performance
Depends on
• Parallel architecture
• Algorithm
2
Saturation – Example 1
#Processes Time (sec) Speedup Efficiency
1 0.025 1 1.00
2 0.013 1.9 0.95
4 0.010 2.5 0.63
8 0.009 2.8 0.35
12 0.007 3.6 0.30
3
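As a cross-check of the table above, the Speedup and Efficiency columns can be recomputed directly from the measured times (a small illustrative C snippet; only the timings from Example 1 are used):

#include <stdio.h>

/* Recompute the Speedup and Efficiency columns of Example 1:
   speedup = T(1) / T(P), efficiency = speedup / P. */
int main(void) {
    int    procs[]  = {1, 2, 4, 8, 12};
    double time_s[] = {0.025, 0.013, 0.010, 0.009, 0.007};
    int n = sizeof(procs) / sizeof(procs[0]);

    printf("%11s %10s %8s %11s\n", "#Processes", "Time (s)", "Speedup", "Efficiency");
    for (int i = 0; i < n; i++) {
        double speedup    = time_s[0] / time_s[i];
        double efficiency = speedup / procs[i];
        printf("%11d %10.3f %8.2f %11.2f\n", procs[i], time_s[i], speedup, efficiency);
    }
    return 0;
}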
Saturation – Example 2
4
Saturation – Example 3
Source: GGKK Chapter 5
5
Efficiency (Adding numbers)
[Figure: efficiency vs. problem size]
Source: GGKK Chapter 5
6
Limitations of Parallelization
• Overhead
  • E.g., communication
  • Over-decomposition (too little work per process/core)
• Idling
  • Load imbalance
  • Synchronization
• Serialization
7
Execution Profile
Execution Profile of a Hypothetical Parallel Program
Source: GGKK Chapter 5
8
Performance
• Sequential
  • Input size
• Parallel
  • Input size
  • Number of processing elements (PE)
  • Communication speed
  • …
9
Scaling Deep Learning Models
[Figure: problem size vs. number of PEs]
Source: Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes
“The fastest known sequential algorithm for a problem may be difficult or impossible to parallelize.” – GGKK
10
Parallelization
• Speedup: S = (sequential time) / (parallel time)
• Efficiency: E = S / P, where P is the number of processing elements
11
Sum of Numbers Speedup
Naïve parallelization method:
- Compute partial sums in parallel
- Send the partial results to one process (all-to-one)
- Compute the final result on that process

Speedup = N / (N/P + P + P)
12
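A minimal MPI sketch of this naïve scheme (my illustration, not code from the slides; the problem size N and the data are assumptions): each process sums its N/P elements, every non-root rank sends its partial sum to rank 0, and rank 0 adds up the P partial results.

#include <mpi.h>
#include <stdio.h>

/* Naive parallel sum: local compute (N/P), all-to-one sends (received
   serially at rank 0), then a final sum of the P partial results. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 1 << 20;                 /* assumed total problem size */
    long chunk = N / size;                  /* assume size divides N evenly */
    long lo = rank * chunk, hi = lo + chunk;

    long local = 0;
    for (long i = lo; i < hi; i++)          /* compute local partial sum */
        local += i;

    if (rank != 0) {
        MPI_Send(&local, 1, MPI_LONG, 0, 0, MPI_COMM_WORLD);
    } else {
        long total = local, partial;
        for (int src = 1; src < size; src++) {   /* receive P-1 partial sums */
            MPI_Recv(&partial, 1, MPI_LONG, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            total += partial;
        }
        printf("sum = %ld\n", total);
    }
    MPI_Finalize();
    return 0;
}

The serialized receives at rank 0 and the final summation of P partial results are what the two P terms in the denominator of the speedup expression account for.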
Parallel Sum (Optimized)
Source: GGKK Chapter 5
13
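The GGKK figure referenced here shows the partial sums being combined pairwise over log P steps. A minimal MPI sketch of the same idea (my illustration, with the same assumed local-sum setup as above), using MPI_Reduce, which typically performs such a tree-based combination internally:

#include <mpi.h>
#include <stdio.h>

/* Optimized parallel sum: partial sums are combined by MPI_Reduce,
   which can use a tree pattern taking O(log P) communication steps. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long N = 1 << 20;                 /* assumed total problem size */
    long chunk = N / size;
    long lo = rank * chunk, local = 0;
    for (long i = lo; i < lo + chunk; i++)  /* compute local partial sum */
        local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %ld\n", total);

    MPI_Finalize();
    return 0;
}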
Sum of Numbers Speedup (Optimized)
S1 = N / (N/P + 2P)            E1 = ?

S2 = N / (N/P + 2 log P)       E2 = 1 / (2 P log P / N + 1)
14
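A quick check of how E2 follows from S2 (my algebra, not from the slide):

E2 = S2 / P = N / (N + 2 P log P) = 1 / (2 P log P / N + 1)

The 2 log P term reflects the tree-based sum: log P steps, each costing one communication plus one addition (unit costs assumed).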
Efficiency (Adding numbers)
Homework: Analyze the measured efficiency based on your derivation
Source: GGKK Chapter 5
15
A Limitation of Parallel Computing
Amdahl's Law:

Speedup S = 1 / ((1 − f) + f / P)

where f is the fraction of the code that is parallelizable and P is the number of processing elements.
16
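As an illustration (my numbers, not the slide's): with f = 0.95, the speedup can never exceed 1 / (1 − f) = 20, no matter how many PEs are used. A small C snippet tabulating S for a few values of P:

#include <stdio.h>

/* Amdahl's Law: S = 1 / ((1 - f) + f / P). */
double amdahl(double f, int P) {
    return 1.0 / ((1.0 - f) + f / P);
}

int main(void) {
    double f = 0.95;                     /* assumed parallelizable fraction */
    int    P[] = {1, 2, 8, 64, 1024};
    int n = sizeof(P) / sizeof(P[0]);
    for (int i = 0; i < n; i++)
        printf("P = %4d  ->  S = %.2f\n", P[i], amdahl(f, P[i]));
    /* As P grows, S approaches 1 / (1 - f) = 20. */
    return 0;
}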
Parallel Architecture
17
System Components
• Processor
• Memory
• Network
• Storage
NUMA
Source: https://www.sciencedirect.com/topics/computer-science/non-uniform-memory-access
18
Memory Hierarchy
A multicore SMP architecture
Image Source: The Art of Multiprocessor Programming – Herlihy, Shavit
19
Memory Access Times
Source: MIT CSAIL
20
Processor vs. Memory
“While clock rates of high-end processors have
increased at roughly 40% per year over the last decade,
DRAM access times have only improved at the rate of
roughly 10% per year over this interval.”
- Introduction to Parallel Computing by Ananth Grama
et al. (GGKK)
21
NUMA Nodes
Utility: lstopo (hwloc package)
AMD Bulldozer Memory Topology (Source: Wikipedia)
22
NUMA Node (Zoomed)
Utility: lstopo (hwloc package)
AMD Bulldozer Memory Topology (Source: Wikipedia)
23
Effective Memory Access Times
24
Memory Placement
Lepers et al., “Thread and Memory Placement on NUMA Systems: Asymmetry Matters”, USENIX ATC 2015.
25
Connect Multiple Compute Nodes
Interconnect
Source: hector.ac.uk
26
Parallel Programming Models
• Shared memory
• Distributed memory
27
Shared Memory
• Shared address space
• Time taken to access certain memory words is longer (NUMA)
• Need to worry about concurrent access
• Programming paradigms – Pthreads, OpenMP (see the sketch below)
[Figure: two threads sharing one address space]
28
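A minimal shared-memory sketch in C with OpenMP (my example; the array and its size are arbitrary): all threads read the same shared array, and the concurrent updates to the shared result are coordinated with a reduction clause.

#include <omp.h>
#include <stdio.h>

/* Shared address space: every thread sees the same array, but concurrent
   updates to shared data must be coordinated (here via a reduction). */
int main(void) {
    const int N = 1000000;
    static int a[1000000];
    for (int i = 0; i < N; i++) a[i] = 1;

    long sum = 0;
    #pragma omp parallel for reduction(+:sum)   /* threads share 'a'; each gets a private partial sum */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld (max threads = %d)\n", sum, omp_get_max_threads());
    return 0;
}

Compile with OpenMP enabled, e.g. gcc -fopenmp.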
Intel Processors (Latest)
https://www.intel.com/content/www/us/en/products/docs/processors/xeon/5th-gen-xeon-scalable-processors.html
29
Cluster of Compute Nodes
30
Message Passing
• Distinct address space per process
• Multiple processing nodes
• Basic operations are send and receive
31
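A minimal sketch of the two basic operations (my example, not from the slides): rank 0 sends one integer to rank 1, which receives it.

#include <mpi.h>
#include <stdio.h>

/* The two basic message-passing operations: send and receive. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with at least two processes, e.g. mpirun -np 2.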
Interprocess Communication
32
Our Parallel World
[Figure: cores, processes, and memory on compute nodes]
Distributed memory programming
• Distinct address space
• Explicit communication
33
Distinct Process Address Space
Process 0: x = 1, y = 2   ...  x++  ...  print x, y   → prints 2, 2
Process 1: x = 10, y = 20 ...  x++  ...  print x, y   → prints 11, 20
34
Distinct Process Address Space
Process 0: x = 1, y = 2  ...  x++; y++  ...  print x, y   → prints 2, 3
Process 1: x = 1, y = 2  ...  y++       ...  print x, y   → prints 1, 3
35
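The same behaviour expressed as an MPI program (my sketch, not from the slides): each process has its own copies of x and y, so an update in one process is invisible to the others.

#include <mpi.h>
#include <stdio.h>

/* Each MPI process has its own address space, so x and y are per-process
   variables; changes in one process are not seen by the others. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int x = 1, y = 2;
    if (rank == 0) { x++; y++; }   /* only process 0 increments both */
    else           { y++; }        /* other processes increment only y */

    printf("process %d: x = %d, y = %d\n", rank, x, y);
    /* Prints "x = 2, y = 3" on rank 0 and "x = 1, y = 3" elsewhere. */

    MPI_Finalize();
    return 0;
}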
Adapted from Neha Karanjkar’s slides
36
Our Parallel World
[Figure: cores and processes across compute nodes]
NO centralized server/master
37
Message Passing Interface
Message Passing Interface (MPI)
• Efforts began in 1991 by Jack Dongarra, Tony Hey, and David W. Walker
• Standard for message passing in a distributed memory environment
• MPI Forum in 1993
• Version 1.0: 1994
• Version 2.0: 1997
• Version 3.0: 2012
• Version 4.0: 2021
• Version 5.0 (under discussion)
39
MPI Implementations
“The MPI standard includes point-to-point message-passing,
collective communications, group and communicator concepts,
process topologies, environmental management, process
creation and management, one-sided communications,
extended collective operations, external interfaces, I/O, some
miscellaneous topics, and a profiling interface.” – MPI report
• MPICH (ANL)
• MVAPICH (OSU)
• Open MPI
• Intel MPI
• Cray MPI
40
Programming Environment
• Shell scripts (e.g. bash)
• ssh basics
• E.g. ssh -X
•…
• Mostly in C/C++
• Compilation, Makefiles, ...
• Linux environment variables
• PATH
• LD_LIBRARY_PATH
•…
41
MPI Installation – Laptop
• Linux or Linux VM on Windows
• apt/snap/yum/brew
• Windows
• No support
• https://www.mpich.org/documentation/guides/
42
MPI
• Standard for message passing
• Explicit communications
• Medium programming complexity
• Requires specifying the communication scope (communicator)
43
Simple MPI Code
44
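The code on this slide is not reproduced here; a minimal MPI "hello world" along the same lines (my sketch, saved as program.c to match the commands on the next slide) would look like this:

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI program: each process reports its rank and the total
   number of processes in MPI_COMM_WORLD. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}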
MPI Code Execution Steps
• Compile
• mpicc -o program.x program.c
• Execute
• mpirun -np 1 ./program.x (mpiexec -np 1 ./program.x)
• Runs 1 process on the launch/login node
• mpirun -np 6 ./program.x
• Runs 6 processes on the launch/login node
45
Output – Hello World
mpirun -np 20 ./program.x
(Each of the 20 processes prints its own hello line; the ordering of the lines across processes is nondeterministic.)
46