PARALLEL COMPUTING (BCS702)
NOTES - Module 2
MODULE-2: Syllabus
GPU programming, Programming hybrid systems, MIMD systems, GPUs,
Performance – Speedup and efficiency in MIMD systems, Amdahl’s law,
Scalability in MIMD systems, Taking timings of MIMD programs, GPU
performance.
GPU programming
GPU programming is a type of heterogeneous computing, where code is
written for both the CPU (host) and the GPU (device). The GPU cannot run
an operating system or access files directly, so the CPU handles memory
allocation, data transfer, and initiates execution on the GPU.
Since CPU and GPU have separate memory, the host program must
manage data transfers between them. The GPU executes many threads in
parallel, grouped into SIMD (Single Instruction, Multiple Data) groups.
Each GPU processor has fast private memory (like a cache) for its threads.
Threads in a SIMD group may not always execute the same path due to
branching, which causes some threads to idle—this is inefficient and
should be minimized.
For example, consider the code below: threads with rank_in_gp < 16 execute the
first branch while the other threads in the group stay idle; then the reverse
happens while the second branch executes. This serializes execution and reduces
GPU efficiency.
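A minimal sketch of such a branch is given below. Only the rank_in_gp < 16 test comes from the text; the SIMD group size of 32 and the work inside each branch are assumptions made purely for illustration.
if (rank_in_gp < 16)
    x = a + b;   /* executed by threads 0-15 while threads 16-31 idle */
else
    x = a - b;   /* executed by threads 16-31 while threads 0-15 idle */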
To maximize performance, programmers should:
• minimize branching within SIMD groups,
• use many thread groups to keep the GPU busy, and
• rely on the GPU’s hardware scheduler, which switches between ready thread
groups with minimal overhead.
Thus efficient GPU programming requires understanding parallelism,
memory management, and thread scheduling, making it very different
from traditional CPU programming.
Programming hybrid systems
While it's possible to program clusters of multicore systems using a hybrid
approach—shared-memory APIs within nodes and distributed-memory APIs
between nodes—this method is complex and used mainly for applications
needing maximum performance. To reduce complexity, most programmers
prefer using a single distributed-memory API for both intra-node and inter-
node communication.
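As a rough illustration of the hybrid approach (a sketch only; it assumes an MPI library built with thread support and an OpenMP-capable compiler), each MPI process below handles the distributed-memory level between nodes, while OpenMP threads share memory within a node:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int provided, rank;

    /* Distributed-memory level: one MPI process per node */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory level: OpenMP threads within the node */
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        printf("MPI process %d, OpenMP thread %d\n", rank, thread);
    }

    MPI_Finalize();
    return 0;
}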
MIMD systems
Parallel I/O is a complex topic, often avoided in basic parallel
programming due to its difficulty and scope. Most programs in this
context perform minimal I/O, easily handled by standard C functions
like printf, scanf, etc. However, using these functions in parallel
environments (multiple processes or threads) can lead to
nondeterministic behavior.
• Processes do not share stdin, stdout, or stderr, so their I/O behavior
can be unpredictable.
• Threads share standard I/O streams, but simultaneous access can result
in mixed or unordered output.
• In many systems, only one process or thread (usually
process/thread 0) is allowed to read from stdin.
• Multiple processes/threads writing to stdout can cause interleaved
or scrambled output, especially during debugging.
Best Practices for Parallel I/O:
• In distributed-memory programs, only process 0 should read from
stdin.
• In shared-memory programs, only thread 0 (master) should read
from stdin.
• Output to stdout or stderr should preferably be handled by a single
thread/process to maintain order.
• For debugging, multiple threads/processes may print to stdout,
but should always include their ID or rank.
• Only one thread/process should access any given file (other than the
standard streams); if several threads/processes need file I/O, each should
use its own separate file for reading/writing.
This structured approach helps manage I/O in parallel systems more
predictably and avoids common pitfalls.
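The sketch below illustrates these practices in an MPI setting (MPI is an assumption here; the advice applies to any distributed-memory API): only process 0 reads from stdin and broadcasts the value, and debugging output always carries the process rank.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only process 0 reads from stdin ... */
    if (rank == 0) {
        printf("Enter n: ");
        fflush(stdout);
        scanf("%d", &n);
    }
    /* ... and the value is distributed to the other processes */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Debug output includes the rank, since output order is nondeterministic */
    printf("Process %d received n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}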
GPUs
In most GPU programs, all input and output (I/O) is handled by the host
(CPU) code, which runs as a single thread/process. Therefore, standard C
I/O functions like printf and scanf work just like in serial C programs.
An exception occurs during GPU debugging. In this case:
• GPU threads can write to stdout, but the output order is
nondeterministic, similar to MIMD programs.
• GPU threads do not have access to stderr, stdin, or secondary storage
(e.g., files).
Hence, for most practical purposes, I/O is centralized in the host, while
GPU-side I/O is limited and mainly used for debugging.
Speedup and efficiency in MIMD systems
The main performance goal of a parallel program is to divide work
equally among all cores without adding extra overhead. If this is
perfectly achieved using p cores, the program can run p times faster than
its serial version, a condition called linear speedup:
Tparallel = Tserial / p
However, perfect linear speedup is rare due to overheads introduced by
parallelism:
• In shared-memory systems, synchronization mechanisms like mutexes
add overhead and can serialize parts of the code.
• In distributed-memory systems, network communication between
processes is slower than local memory access.
These overheads increase with more threads/processes. Thus, while we aim
for linear speedup, actual speedup is often less. The speedup of a parallel
program is defined as:
S = Tserial / Tparallel
This metric helps evaluate how effectively a parallel program utilizes
multiple cores.
The value S/p is called the efficiency of the parallel program, which
indicates the average speedup achieved per processor. If we substitute the
formula for S, we see that the efficiency is
E = S / p = (Tserial / Tparallel) / p = Tserial / (p × Tparallel)
Efficiency can be thought of as the fraction of the parallel run-time that’s
spent, on average, by each core working on solving the
original problem. The remainder of the parallel run-time is the parallel
overhead.
Many parallel programs are developed by explicitly dividing the work of
the serial program among the processes/threads and adding in the
necessary “parallel overhead,” such as mutual exclusion or communication.
Therefore, if Toverhead denotes this parallel overhead, it’s often the case that
Tparallel = Tserial / p + Toverhead
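As a small numerical illustration (the numbers are chosen only for the arithmetic): if Tserial = 20 seconds, p = 4, and Toverhead = 1 second, then Tparallel = 20/4 + 1 = 6 seconds, so the speedup is S = 20/6 ≈ 3.3 and the efficiency is E = S/p ≈ 0.83, rather than the linear values S = 4 and E = 1.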
Figure 2.18 shows how the speedup varies when the problem size is halved and
doubled, and Figure 2.19 shows the corresponding variation in efficiency.
When reporting speedup and efficiency, there are two common
approaches to choosing the serial runtime Tserial:
1. Use the runtime of the fastest serial program on the fastest
processor available (maybe with a different algorithm).
2. Use the runtime of the serial version of the same parallel program
running on a single processor of the parallel system.
Most researchers prefer the second approach because it reflects the
efficiency of the parallel system's core utilization more realistically. For
example, comparing a parallel shell sort program’s speedup using the
serial shell sort runtime on the same system is preferred over comparing
it to a different sorting algorithm on a faster machine.
Amdahl’s law
It states, roughly, that unless virtually all of a serial program is
parallelized, the possible speedup is going to be very limited, regardless of
the number of cores available.
For example, if 90% of a program is parallelized perfectly, the parallel run
time with p cores is
Tparallel = (0.9 × Tserial)/p + 0.1 × Tserial
so the speedup is S = Tserial / Tparallel = 1 / (0.9/p + 0.1). Even with an
infinite number of cores, the run time cannot be less than the serial part
(0.1 × Tserial), which limits the speedup to at most 10.
In general, if
r = the fraction of the program that is inherently serial,
p = the number of processors (cores), and
S = the speedup with p processors, then
S ≤ 1 / ((1 − r)/p + r)
For infinitely large values of p, this gives S ≤ 1/r.
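As a quick check of the formula (illustrative numbers only): with r = 0.1, p = 10 gives S ≤ 1/(0.9/10 + 0.1) ≈ 5.3, p = 100 gives S ≤ 1/(0.9/100 + 0.1) ≈ 9.2, and no number of cores can push the speedup beyond 1/0.1 = 10.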
Key points:
• No matter how many cores you use, speedup is capped by the
serial portion.
• Even a small serial fraction severely limits maximum
speedup.
• However, Amdahl’s Law doesn’t consider increasing problem
size. Gustafson’s Law suggests that for larger problems, the serial
fraction effectively decreases, allowing better speedup.
• Many real-world scientific programs achieve very large
speedups. A modest speedup of 5 to 10 is often sufficient and
worthwhile.
Scalability in MIMD systems
Scalability refers to how well a parallel program improves in
performance when the system’s resources (like number of
cores/processes) are increased.
A parallel program is scalable if, when we increase the number of
processes/threads, we can also increase the problem size in such a way that
the efficiency (E) remains constant.
Efficiency formula:
Given:
• Tserial = n (problem size = n)
• Tparallel = n/p + 1
Efficiency is:
E = Tserial / (p × Tparallel) = n / (p × (n/p + 1)) = n / (n + p)
If we increase the number of processes to kp and the problem size to kn, the
efficiency becomes:
E = kn / (kn + kp) = n / (n + p)
Thus, the efficiency remains constant when the problem size increases
proportionally with the number of processes/threads, so the program is scalable.
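As a numerical illustration (hypothetical values): with n = 1000 and p = 10, E = 1000/1010 ≈ 0.99; scaling both by k = 2 to n = 2000 and p = 20 gives E = 2000/2020 ≈ 0.99, so the efficiency is unchanged.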
Types of Scalability:
• Strong Scalability: efficiency stays constant as the number of
processes/threads increases, without increasing the problem size.
• Weak Scalability: efficiency stays constant only if the problem size
increases proportionally with the number of processes/threads.
Taking timings of MIMD programs
Measuring performance in parallel (MIMD) programs involves
understanding what kind of timing is meaningful and how to take it
accurately.
Development timing is used to debug or optimize specific parts of the code
(e.g., the time spent waiting for messages), whereas performance timing is
used to report the overall performance, typically as a single value (the
total execution time of the computation). Wall clock time (real elapsed time)
is preferred over CPU time, because CPU time ignores idle/waiting time (e.g.,
time spent waiting for a message in MPI); wall clock time reflects the actual
performance as seen by the user.
Wall clock time is measured using API-specific timing functions such as
MPI_Wtime() or omp_get_wtime().
Pseudocode example:
double start, finish;
start = Get_current_time();
/* Code to time */
finish = Get_current_time();
printf("Elapsed time = %e seconds\n", finish - start);
In parallel programs, every thread/process may report a different elapsed
time. To get a single, meaningful result, we synchronize all processes/threads
with a barrier, measure each one's elapsed time, and use Global_max() to find
the maximum elapsed time, which represents the total run time.
Example code using Barrier():
Barrier();                       /* synchronize all processes/threads */
my_start = Get_current_time();
/* Parallel code */
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;
global_elapsed = Global_max(my_elapsed);   /* maximum of the elapsed times */
if (my_rank == 0)
    printf("Elapsed time = %e seconds\n", global_elapsed);
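The pseudocode above is API-neutral. As one concrete possibility (a sketch, not the only way), an MPI version of the same pattern could use MPI_Barrier, MPI_Wtime, and MPI_Reduce with MPI_MAX in place of Barrier(), Get_current_time(), and Global_max():
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int my_rank;
    double my_start, my_finish, my_elapsed, global_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* start all processes together */
    my_start = MPI_Wtime();

    /* Parallel code to time */

    my_finish = MPI_Wtime();
    my_elapsed = my_finish - my_start;

    /* The run time is the maximum of the individual elapsed times */
    MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE, MPI_MAX,
               0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("Elapsed time = %e seconds\n", global_elapsed);

    MPI_Finalize();
    return 0;
}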
Execution times may vary across runs due to other system activity. Instead of
reporting the average or median, we usually report the minimum run time, on
the assumption that it best reflects the program's ideal performance. As
practical considerations, we avoid running more threads/processes than cores
(to reduce context-switching overhead), and we exclude I/O operations from
performance timings, since these programs are not optimized for
high-performance I/O.
GPU performance
When evaluating MIMD (Multiple Instruction, Multiple Data) programs, we
usually compare parallel performance to serial performance on the same type
of CPU core. However, this model doesn't work well for GPUs, because GPU
cores are inherently parallel, unlike conventional CPU cores.
So, measuring efficiency or linear speedup of GPU programs compared
to CPU serial programs is not meaningful. Similarly, formal scalability
(keeping efficiency fixed) doesn't directly apply to GPUs, though
informal scalability (performance improves with more GPU cores) is
still used.
If the serial part of a GPU program runs on the CPU, Amdahl’s law can
still apply: if a fraction r of the code is serial, the maximum speedup is
≤ 1/r. But just as with MIMD programs, this bound depends on the problem
size; the serial part may become negligible as the problem size grows.
Since GPU programs start and stop on the CPU, we typically time them with CPU
timers (e.g., omp_get_wtime() or MPI_Wtime()) placed around the GPU calls. If
we want to time only the GPU kernel execution, we use GPU-specific timers.
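For instance, a host-side timing sketch using omp_get_wtime() might look like the following; run_kernel_and_wait() is a hypothetical placeholder for launching the kernel(s) and waiting for the device to finish, not a real API call:
#include <omp.h>
#include <stdio.h>

/* Hypothetical stand-in for launching GPU kernel(s) and waiting for the
   device to finish; a real program would call its GPU API here. */
static void run_kernel_and_wait(void) {
    /* ... kernel launch + device synchronization would go here ... */
}

int main(void) {
    double start, finish;

    start = omp_get_wtime();     /* CPU timer started on the host     */
    run_kernel_and_wait();       /* GPU work happens inside this call */
    finish = omp_get_wtime();    /* CPU timer stopped on the host     */

    printf("Elapsed time = %e seconds\n", finish - start);
    return 0;
}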