PARALLEL COMPUTING (BCS702)
NOTES - Module 2
MODULE-2: Syllabus
GPU programming, Programming hybrid systems, MIMD systems, GPUs,
Performance – Speedup and efficiency in MIMD systems, Amdahl’s law,
Scalability in MIMD systems, Taking timings of MIMD programs, GPU
performance.
GPU programming
GPU programming is a type of heterogeneous computing, where code is
written for both the CPU (host) and the GPU (device). The GPU cannot run
an operating system or access files directly, so the CPU handles memory
allocation, data transfer, and initiates execution on the GPU.
Since CPU and GPU have separate memory, the host program must
manage data transfers between them. The GPU executes many threads in
parallel, grouped into SIMD (Single Instruction, Multiple Data) groups.
Each GPU processor has fast private memory (like a cache) for its threads.
Threads in a SIMD group may not always execute the same path due to
branching, which causes some threads to idle—this is inefficient and
should be minimized.
For example, consider the code below: threads with rank_in_gp < 16 execute the
first branch while the other threads in the group stay idle; then the reverse
happens while the second branch executes. This serializes execution and reduces
GPU efficiency.
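A minimal sketch of such a branch is given below. Only the rank_in_gp < 16 test comes from the text; the SIMD group size of 32 and the work inside each branch are assumptions made purely for illustration.
if (rank_in_gp < 16)
    x = a + b;   /* executed by threads 0-15 while threads 16-31 idle */
else
    x = a - b;   /* executed by threads 16-31 while threads 0-15 idle */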
To maximize performance, programmers should:
• minimize branching within SIMD groups,
• use many thread groups to keep the GPU busy, and
• rely on the GPU’s hardware scheduler, which switches between ready thread
groups with minimal overhead.
Thus efficient GPU programming requires understanding parallelism,
memory management, and thread scheduling, making it very different
from traditional CPU programming.
Programming hybrid systems
While it's possible to program clusters of multicore systems using a hybrid
approach—shared-memory APIs within nodes and distributed-memory APIs
between nodes—this method is complex and used mainly for applications
needing maximum performance. To reduce complexity, most programmers
prefer using a single distributed-memory API for both intra-node and inter-
node communication.
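As a rough illustration of the hybrid approach (a sketch only; it assumes an MPI library built with thread support and an OpenMP-capable compiler), each MPI process below handles the distributed-memory level between nodes, while OpenMP threads share memory within a node:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int provided, rank;

    /* Distributed-memory level: one MPI process per node */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory level: OpenMP threads within the node */
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        printf("MPI process %d, OpenMP thread %d\n", rank, thread);
    }

    MPI_Finalize();
    return 0;
}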
MIMD systems
Parallel I/O is a complex topic, often avoided in basic parallel
programming due to its difficulty and scope. Most programs in this
context perform minimal I/O, easily handled by standard C functions
like printf, scanf, etc. However, using these functions in parallel
environments (multiple processes or threads) can lead to
nondeterministic behavior.
• Processes do not share stdin, stdout, or stderr, so their I/O behavior
can be unpredictable.
• Threads share standard I/O streams, but simultaneous access can result
in mixed or unordered output.
• In many systems, only one process or thread (usually
process/thread 0) is allowed to read from stdin.
• Multiple processes/threads writing to stdout can cause interleaved
or scrambled output, especially during debugging.
Best Practices for Parallel I/O:
• In distributed-memory programs, only process 0 should read from
stdin.
• In shared-memory programs, only thread 0 (master) should read
from stdin.
• Output to stdout or stderr should preferably be handled by a single
thread/process to maintain order.
• For debugging, multiple threads/processes may print to stdout,
but should always include their ID or rank.
• Only one thread/process should access any given file (other than the
standard streams); if several threads/processes need file I/O, each should
use its own separate file for reading/writing.
This structured approach helps manage I/O in parallel systems more
predictably and avoids common pitfalls.
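The sketch below illustrates these practices in an MPI setting (MPI is an assumption here; the advice applies to any distributed-memory API): only process 0 reads from stdin and broadcasts the value, and debugging output always carries the process rank.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int rank, n = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Only process 0 reads from stdin ... */
    if (rank == 0) {
        printf("Enter n: ");
        fflush(stdout);
        scanf("%d", &n);
    }
    /* ... and the value is distributed to the other processes */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Debug output includes the rank, since output order is nondeterministic */
    printf("Process %d received n = %d\n", rank, n);

    MPI_Finalize();
    return 0;
}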
GPUs
In most GPU programs, all input and output (I/O) is handled by the host
(CPU) code, which runs as a single thread/process. Therefore, standard C
I/O functions like printf and scanf work just like in serial C programs.
An exception occurs during GPU debugging. In this case:
• GPU threads can write to stdout, but the output order is
nondeterministic, similar to MIMD programs.
• GPU threads do not have access to stderr, stdin, or secondary storage
(e.g., files).
Hence, for most practical purposes, I/O is centralized in the host, while
GPU-side I/O is limited and mainly used for debugging.
Speedup and efficiency in MIMD systems
The main performance goal of a parallel program is to divide work
equally among all cores without adding extra overhead. If this is
perfectly achieved using p cores, the program can run p times faster than
its serial version, a condition called linear speedup:
Tparallel = Tserial / p
However, perfect linear speedup is rare due to overheads introduced by
parallelism:
• In shared-memory systems, synchronization mechanisms like mutexes
add overhead and can serialize parts of the code.
• In distributed-memory systems, network communication between
processes is slower than local memory access.
These overheads increase with more threads/processes. Thus, while we aim
for linear speedup, actual speedup is often less. The speedup of a parallel
program is defined as:
S = Tserial / Tparallel
This metric helps evaluate how effectively a parallel program utilizes
multiple cores.
The value S/p is called the efficiency of the parallel program, which
indicates the average speedup achieved per processor. If we substitute the
formula for S, we see that the efficiency is
E = S / p = (Tserial / Tparallel) / p = Tserial / (p × Tparallel)
Efficiency can be thought of as the fraction of the parallel run-time that’s
spent, on average, by each core working on solving the
original problem. The remainder of the parallel run-time is the parallel
overhead.
Many parallel programs are developed by explicitly dividing the work of
the serial program among the processes/threads and adding in the
necessary “parallel overhead,” such as mutual exclusion or communication.
Therefore, if Toverhead denotes this parallel overhead, it’s often the case that
Tparallel = Tserial / p + Toverhead
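As a small numerical illustration (the numbers are chosen only for the arithmetic): if Tserial = 20 seconds, p = 4, and Toverhead = 1 second, then Tparallel = 20/4 + 1 = 6 seconds, so the speedup is S = 20/6 ≈ 3.3 and the efficiency is E = S/p ≈ 0.83, rather than the linear values S = 4 and E = 1.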
Figure 2.18 shows how the speedup varies when the problem size is halved and
doubled, and Figure 2.19 shows the corresponding variation in efficiency.
When reporting speedup and efficiency, there are two common
approaches to choosing the serial runtime Tserial:
1. Use the runtime of the fastest serial program on the fastest
processor available (maybe with a different algorithm).
2. Use the runtime of the serial version of the same parallel program
running on a single processor of the parallel system.
Most researchers prefer the second approach because it reflects the
efficiency of the parallel system's core utilization more realistically. For
example, comparing a parallel shell sort program’s speedup using the
serial shell sort runtime on the same system is preferred over comparing
it to a different sorting algorithm on a faster machine.
Amdahl’s law
It states, roughly, that unless virtually all of a serial program is
parallelized, the possible speedup is going to be very limited, regardless of
the number of cores available.
For example, if 90% of a program is parallelized perfectly, the parallel run
time with p cores is
Tparallel = (0.9 × Tserial)/p + 0.1 × Tserial
so the speedup is S = Tserial / Tparallel = 1 / (0.9/p + 0.1). Even with an
infinite number of cores, the run time cannot be less than the serial part
(0.1 × Tserial), which limits the speedup to at most 10.
In general, if
r = the fraction of the program that is inherently serial,
p = the number of processors (cores), and
S = the speedup with p processors, then
S ≤ 1 / ((1 − r)/p + r)
For infinitely large values of p, this gives S ≤ 1/r.
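As a quick check of the formula (illustrative numbers only): with r = 0.1, p = 10 gives S ≤ 1/(0.9/10 + 0.1) ≈ 5.3, p = 100 gives S ≤ 1/(0.9/100 + 0.1) ≈ 9.2, and no number of cores can push the speedup beyond 1/0.1 = 10.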
Key points:
• No matter how many cores you use, speedup is capped by the
serial portion.
• Even a small serial fraction severely limits maximum
speedup.
• However, Amdahl’s Law doesn’t consider increasing problem
size. Gustafson’s Law suggests that for larger problems, the serial
fraction effectively decreases, allowing better speedup.
• Many real-world scientific programs achieve very large
speedups. A modest speedup of 5 to 10 is often sufficient and
worthwhile.
Scalability in MIMD systems
Scalability refers to how well a parallel program improves in
performance when the system’s resources (like number of
cores/processes) are increased.
A parallel program is scalable if, when we increase the number of
processes/threads, we can also increase the problem size in such a way that
the efficiency (E) remains constant.
Efficiency formula:
Given:
• Tserial = n (problem size = n)
• Tparallel = n/p + 1
Efficiency is:
E = Tserial / (p × Tparallel) = n / (p × (n/p + 1)) = n / (n + p)
If we increase the number of processes to kp and the problem size to kn, the
efficiency becomes:
E = kn / (kn + kp) = n / (n + p)
Thus, the efficiency remains constant when the problem size increases
proportionally with the number of processes/threads, so the program is scalable.
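As a numerical illustration (hypothetical values): with n = 1000 and p = 10, E = 1000/1010 ≈ 0.99; scaling both by k = 2 to n = 2000 and p = 20 gives E = 2000/2020 ≈ 0.99, so the efficiency is unchanged.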
Types of Scalability:
• Strong Scalability: efficiency stays constant as the number of
processes/threads increases, without increasing the problem size.
• Weak Scalability: efficiency stays constant only if the problem size
increases proportionally with the number of processes/threads.
Taking timings of MIMD programs
Measuring performance in parallel (MIMD) programs involves
understanding what kind of timing is meaningful and how to take it
accurately.
Development timing is used to debug or optimize specific parts of the code
(e.g., the time spent waiting for messages), whereas performance timing is
used to report the overall performance, typically as a single value (the
total execution time of the computation). Wall clock time (real elapsed time)
is preferred over CPU time, because CPU time ignores idle/waiting time (e.g.,
time spent waiting for a message in MPI); wall clock time reflects the actual
performance as seen by the user.
Wall clock time is measured using API-specific timing functions such as
MPI_Wtime() or omp_get_wtime().
Pseudocode example:
double start, finish;
start = Get_current_time();
/* Code to time */
finish = Get_current_time();
printf("Elapsed time = %e seconds\n", finish - start);
In parallel programs, every thread/process may report a different elapsed
time. To get a single, meaningful result, we synchronize all processes/threads
with a barrier, measure each one's elapsed time, and use Global_max() to find
the maximum elapsed time, which represents the total run time.
Example code using Barrier():
Barrier();                       /* synchronize all processes/threads */
my_start = Get_current_time();
/* Parallel code */
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;
global_elapsed = Global_max(my_elapsed);   /* maximum of the elapsed times */
if (my_rank == 0)
    printf("Elapsed time = %e seconds\n", global_elapsed);
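The pseudocode above is API-neutral. As one concrete possibility (a sketch, not the only way), an MPI version of the same pattern could use MPI_Barrier, MPI_Wtime, and MPI_Reduce with MPI_MAX in place of Barrier(), Get_current_time(), and Global_max():
#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    int my_rank;
    double my_start, my_finish, my_elapsed, global_elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    MPI_Barrier(MPI_COMM_WORLD);       /* start all processes together */
    my_start = MPI_Wtime();

    /* Parallel code to time */

    my_finish = MPI_Wtime();
    my_elapsed = my_finish - my_start;

    /* The run time is the maximum of the individual elapsed times */
    MPI_Reduce(&my_elapsed, &global_elapsed, 1, MPI_DOUBLE, MPI_MAX,
               0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("Elapsed time = %e seconds\n", global_elapsed);

    MPI_Finalize();
    return 0;
}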
Execution times may vary across runs due to other system activity. Instead of
reporting the average or median, we usually report the minimum run time, on
the assumption that it best reflects the program's ideal performance. As
practical considerations, we avoid running more threads/processes than cores
(to reduce context-switching overhead), and we exclude I/O operations from
performance timings, since these programs are not optimized for
high-performance I/O.
GPU performance
When evaluating MIMD (Multiple Instruction, Multiple Data) programs, we
usually compare parallel performance to serial performance on the same type
of CPU core. However, this model doesn't work well for GPUs, because GPU
cores are inherently parallel, unlike conventional CPU cores.
So, measuring efficiency or linear speedup of GPU programs compared
to CPU serial programs is not meaningful. Similarly, formal scalability
(keeping efficiency fixed) doesn't directly apply to GPUs, though
informal scalability (performance improves with more GPU cores) is
still used.
If the serial part of a GPU program runs on the CPU, Amdahl’s law can
still apply: if a fraction r of the code is serial, the maximum speedup is
≤ 1/r. But just as with MIMD programs, this bound depends on the problem
size; the serial part may become negligible as the problem size grows.
Since GPU programs start and stop on the CPU, we typically time them with CPU
timers (e.g., omp_get_wtime() or MPI_Wtime()) placed around the GPU calls. If
we want to time only the GPU kernel execution, we use GPU-specific timers.
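For instance, a host-side timing sketch using omp_get_wtime() might look like the following; run_kernel_and_wait() is a hypothetical placeholder for launching the kernel(s) and waiting for the device to finish, not a real API call:
#include <omp.h>
#include <stdio.h>

/* Hypothetical stand-in for launching GPU kernel(s) and waiting for the
   device to finish; a real program would call its GPU API here. */
static void run_kernel_and_wait(void) {
    /* ... kernel launch + device synchronization would go here ... */
}

int main(void) {
    double start, finish;

    start = omp_get_wtime();     /* CPU timer started on the host     */
    run_kernel_and_wait();       /* GPU work happens inside this call */
    finish = omp_get_wtime();    /* CPU timer stopped on the host     */

    printf("Elapsed time = %e seconds\n", finish - start);
    return 0;
}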