Parallel Computing IA1
MODULE-1
Introduction to parallel programming, Parallel hardware and parallel software –Classifications of parallel
computers, SIMD systems, MIMD systems, Interconnection networks, Cache coherence, Shared-memory vs.
distributed-memory, Coordinating the processes/threads, Shared-memory, Distributed-memory.
1. Explain the classification of memory with a schematic diagram and relate it to the suitable system
programming APIs used for shared memory and distributed-memory systems.
Parallel computers are primarily classified based on how their processing cores access memory. This leads
to two main types of memory systems: shared memory systems and distributed memory systems.
1. Shared Memory Systems:
• Concept: In a shared memory system, all processing cores (or processors) have access to a common
block of memory locations. Cores coordinate their work by modifying these shared memory
locations. This means communication between cores happens implicitly by reading from or writing
to shared data structures.
• Suitable APIs: For programming shared memory systems, threads are commonly used.
o OpenMP (Open Multi-Processing): A popular API for shared-memory multiprocessing
programming in C, C++, and Fortran. It uses compiler directives to parallelise regions of
code, allowing threads to share memory.
o Pthreads (POSIX Threads): A standard API for creating and managing threads. It provides
explicit control over thread creation, synchronisation (e.g., mutexes, semaphores), and
memory sharing.
• Why these APIs? These APIs are suitable because they allow programmers to define shared
variables that all threads can access, making implicit communication straightforward.
Synchronisation mechanisms (like mutexes) are provided to manage concurrent access to shared data
and prevent race conditions.
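As a brief illustration, here is a minimal OpenMP sketch (a hypothetical example, not taken from the source) in which all threads read a shared array and a reduction safely combines their partial results into a shared variable:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1000 };
        double a[N], sum = 0.0;
        for (int i = 0; i < N; i++) a[i] = 1.0;   /* shared data, initialized serially */

        /* The threads divide the iterations among themselves; reduction(+:sum)
           synchronises the updates to the shared variable sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }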
2. Distributed Memory Systems:
• Concept: In a distributed memory system, each processing core has its own private memory. Cores
cannot directly access each other's memory. Instead, they coordinate their work by explicitly
communicating across a network. Communication typically involves sending messages between
processors.
• Suitable APIs: For programming distributed memory systems, processes are commonly used.
o MPI (Message-Passing Interface): This is the most widely used API for message-passing. It
provides functions for sending and receiving messages between processes, allowing explicit
communication and synchronisation.
o PGAS (Partitioned Global Address Space) Languages: These languages offer a
programming model that provides some shared-memory concepts over distributed-memory
hardware. Programmers can control the distribution of data in shared structures, ensuring that
local data access is prioritised to improve performance.
• Why these APIs? These APIs are necessary because each process operates on its private memory.
Programmers must explicitly specify when data needs to be exchanged between processes (e.g.,
using Send and Receive functions in MPI).
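As an illustration, a minimal MPI sketch (hypothetical values, not from the source) in which process 1 explicitly sends a value from its private memory and process 0 explicitly receives it:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int my_rank, data;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);   /* each process learns its rank */

        if (my_rank == 1) {
            data = 42;                             /* lives in process 1's private memory */
            MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (my_rank == 0) {
            MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 0 received %d from process 1\n", data);
        }

        MPI_Finalize();
        return 0;
    }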
Schematic Diagrams:
• Shared-memory system: multiple cores connected through an interconnect to a single, common block of
main memory.
• Distributed-memory system: each core paired with its own private memory, with the core–memory pairs
connected to one another by a communication network.
2. Explain in detail about Instruction-Level Parallelism to demonstrate how hazards can be reduced
in execution.
Instruction-Level Parallelism (ILP) refers to the execution of multiple instructions from a single
instruction stream concurrently. The sources mention multiple issue and pipelining as forms of parallel
hardware that achieve this by allowing different functional units to execute simultaneously. This form of
parallelism is typically not visible to the programmer and is treated as an extension to the basic von
Neumann model.
While the sources describe pipelining and multiple issue as ways to keep multiple functional units busy
simultaneously, they do not explicitly define execution "hazards" (data, control, or structural hazards) or
detail the techniques used to reduce them (such as forwarding, stalling, or branch prediction).
3. Explain Flynn’s taxonomy for the classification of parallel computers.
Flynn’s taxonomy classifies parallel computers based on the number of instruction streams and data
streams they can manage simultaneously. While Flynn's taxonomy primarily classifies systems rather than
memory directly, the type of system strongly dictates the memory architecture.
The four classifications in Flynn's taxonomy are:
• SISD (Single Instruction, Single Data): the classical von Neumann system, in which a single instruction
stream operates on a single data stream at a time.
• SIMD (Single Instruction, Multiple Data): a single instruction stream is applied to multiple data streams
simultaneously, as in vector processors and GPUs.
• MISD (Multiple Instruction, Single Data): multiple instruction streams operate on a single data stream;
this category has few practical examples.
• MIMD (Multiple Instruction, Multiple Data): multiple independent instruction streams operate on multiple
data streams, as in multicore processors and clusters.
4. Explain the concept of SIMD to explain the role of vector processors in improving performance.
Concept of SIMD (Single Instruction, Multiple Data): SIMD systems are a class of parallel computers
designed to execute a single instruction stream on multiple data streams simultaneously.
• An abstract SIMD system consists of a single control unit that broadcasts instructions to multiple
datapaths. Each datapath then applies the same instruction to its local data item, or it remains idle.
• This architecture is highly efficient for data-parallel problems, such as performing the same
operation on many elements of large arrays.
• However, its efficiency can degrade if conditional operations (e.g., if statements) cause some
datapaths to be idle, as all datapaths must execute the same instruction or wait. SIMD datapaths
operate synchronously and do not have individual instruction storage.
Role of Vector Processors in Improving Performance: Vector processors are a significant type of SIMD
system that excel at operating on arrays or "vectors" of data, unlike conventional CPUs that operate on
individual "scalar" data elements. They are designed with specific features to leverage SIMD parallelism for
performance improvement:
1. Vector Registers: These registers are designed to store an entire vector of operands and can operate
on their contents simultaneously. This means a single operation can affect multiple data elements
stored within the register at once.
2. Vectorized and Pipelined Functional Units: Vector processors have functional units optimised to
apply the same operation to each element within a vector. These units are also pipelined, allowing
continuous flow of data through the processing stages.
3. Vector Instructions: These are special instructions that operate on entire vectors rather than scalars.
For example, a simple loop like for (i=0; i<n; i++) x[i] += y[i]; can be executed with a single load, add,
and store instruction for a block of vector_length elements, significantly reducing the total number of
instructions compared to a conventional system that requires these three operations for each element.
This dramatically cuts down instruction fetch and decode overhead.
4. Interleaved Memory: To keep the vector functional units supplied with data, vector processors use
interleaved memory. This system consists of multiple independent "banks" that can be accessed in
parallel or rapid succession, allowing continuous loading and storing of vector elements with little to
no delay.
5. Strided Memory Access and Hardware Scatter/Gather: Vector systems include special hardware
to accelerate accessing vector elements that are at fixed intervals (strided access) or irregular
intervals (scatter/gather). This ensures efficient data movement for various data patterns.
Overall, vector processors improve performance for several reasons:
• They process multiple data items with a single instruction (SIMD nature), leading to high
computational throughput.
• They effectively exploit memory bandwidth, ensuring that every data item loaded from memory is
actually used, which is more efficient than cache-based systems where an entire cache line might be
loaded but not all items used.
• Vectorizing compilers are very good at identifying and transforming serial code (especially loops)
into vector instructions, making them easy to use for many applications.
For many applications involving large-scale data manipulation (e.g., scientific simulations), vector
processors offer substantial speedups by reducing instruction overhead and maximizing data processing per
clock cycle.
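As an illustration, a minimal C sketch of the loop discussed above, written so that a vectorizing compiler can emit vector instructions (the omp simd directive is an assumption, used here only as a portable hint; a good vectorizing compiler will often vectorize the plain loop on its own):

    #include <stddef.h>

    /* Adds y to x element-wise; the compiler can turn the loop body into
       vector load / add / store instructions covering several elements at once. */
    void vector_add(double *restrict x, const double *restrict y, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            x[i] += y[i];
    }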
5. With a neat diagram explain the structure of UMA and NUMA architectures.
UMA and NUMA are classifications of shared-memory systems that describe how multiple processors or
cores access the main memory.
1. UMA (Uniform Memory Access) Architecture:
• Structure: In a UMA system, all processing cores in the system have uniform access time to any
location in the main memory. This is typically achieved by connecting all processors directly to a
common main memory module through an interconnection network.
• Characteristics:
o Uniform Access Time: The time taken for any core to access any memory location is
roughly the same.
o Easier Programming: Programmers do not need to consider different memory access times,
simplifying software development.
o Scalability Limitations: As the number of processors increases, the shared interconnect
(e.g., a bus) can become a bottleneck, leading to contention and performance degradation.
• Diagram:
This diagram typically shows multiple processors or cores connected to a single block of main memory via a
single interconnect.
2. NUMA (Nonuniform Memory Access) Architecture:
• Structure: In a NUMA system, the memory is physically distributed among different processing
nodes, but it presents a single, shared address space to the processors. Each processor (or group of
processors) has its local memory that it can access quickly. Accessing memory attached to other
processors (remote memory) is slower. This is often implemented where each processor has a direct
connection to a block of main memory, and processors access each other's memory blocks through
special hardware.
• Characteristics:
o Nonuniform Access Time: Access to local memory is significantly faster than access to
remote memory.
o Programming Complexity: Programmers must be aware of data locality to place data in
local memory for better performance, increasing programming complexity.
o Better Scalability: NUMA systems can scale to use larger amounts of memory and support
more processors than UMA systems because the memory bus is not a single point of
contention for all memory accesses.
• Diagram:
This diagram typically illustrates multiple processor-memory nodes interconnected, showing that each
processor has its own local memory segment, and it can access other segments but with a higher latency.
6. Explain with neat diagram the distributed memory interconnects such as ring and toroidal mesh to
explain how communication efficiency is improved.
Distributed Memory Interconnects are networks that connect processor-memory pairs in distributed-
memory systems. The design and performance of these interconnects are crucial, as a slow interconnect can
severely degrade the overall performance of parallel programs. Ring and toroidal mesh are examples of
direct interconnects, where each switch is directly connected to a processor-memory pair, and the switches
are also connected to each other.
1. Ring Interconnect:
• Structure: A ring interconnect connects processors (or nodes, each consisting of a processor and its
local memory) in a circular fashion. Each node is connected to exactly two neighbours, forming a
closed loop [36, Fig 2.8a]. Data flows in one or both directions around the ring.
• Diagram:
This diagram shows switches (circles) and processors (squares) arranged in a ring, with bidirectional
links.
• Communication Efficiency:
o Multiple Simultaneous Communications: A ring is an improvement over a simple bus
system because it allows multiple communications to occur simultaneously, as long as they
do not conflict over the same link. For example, in an 8-node ring, it's possible for four
communications to happen concurrently if arranged carefully.
o Limitations: Despite allowing simultaneous communications, it's easy to create scenarios
where some processors must wait for others to complete their transmissions, which can still
limit overall efficiency.
o Bisection Width: The bisection width of a ring is typically 2. This measure represents the
minimum number of links that must be cut to divide the network into two equal halves. A
small bisection width indicates potential bottlenecks for communications that must cross the
divide.
2. Toroidal Mesh Interconnect:
• Structure: A toroidal mesh is a two-dimensional grid of processors (or nodes) where the connections
"wrap around" from one edge to the opposite edge, forming a torus (like a donut shape). Each
internal node is connected to four neighbours (North, South, East, West), and the wrap-around
connections connect nodes at the edges [36, Fig 2.8b].
• Communication Efficiency Improvement:
o Higher Connectivity: The toroidal mesh offers significantly greater connectivity than a
simple ring. Each switch typically supports five links (one to the processor, four to other
switches), compared to three links in a ring.
o Increased Simultaneous Communications: It's easier to achieve more simultaneous
communication patterns with a mesh than with a ring. This means that more messages can be
in transit concurrently without conflicting, leading to better overall throughput.
• Diagram:
This diagram shows processors and switches arranged in a grid, with connections that wrap around
the edges.
o Improved Bisection Width: For a square two-dimensional toroidal mesh with p = q² nodes
(where q is even), the bisection width is 2√p. For example, a 16-node (4×4) mesh would have
a bisection width of 2√16 = 8. This is substantially higher than a ring's bisection width of 2,
indicating that the toroidal mesh can support much more "cross-sectional" communication,
thus improving communication efficiency for large-scale data exchanges.
o Cost: While more expensive than a ring due to more complex switches and more links (2p
links for p processors compared to p links in a ring), the increased communication efficiency
often justifies the cost for larger parallel systems.
In summary, the toroidal mesh improves communication efficiency over a ring by providing higher
connectivity and a greater bisection width, enabling more simultaneous data transfers and reducing the
likelihood of communication bottlenecks.
7. With neat diagrams explain the working of a 3-dimensional hypercube network and generic
indirect networks for processor interconnection.
1. Three-Dimensional Hypercube Network:
• Concept: A hypercube is a highly connected direct interconnect that is built inductively: a one-dimensional
hypercube is two nodes joined by a link, and a d-dimensional hypercube is formed by joining the
corresponding switches of two (d−1)-dimensional hypercubes. A three-dimensional hypercube therefore
has 2³ = 8 nodes, arranged like the corners of a cube.
• Working: Each switch in a d-dimensional hypercube is connected to its processor and to d other switches,
so in a 3-D hypercube every node has three neighbours, and a message needs at most three hops to reach
any other node. The bisection width is p/2, which gives far better cross-sectional communication than a
ring or toroidal mesh, at the cost of more complex (and more expensive) switches.
• Diagram: This diagram typically shows eight processor–switch pairs placed at the corners of a cube, with a
link along each edge of the cube.
2. Generic Indirect Networks:
• Concept: Indirect interconnects offer an alternative to direct interconnects. The key difference is
that in an indirect interconnect, the switches are not necessarily directly connected to a processor.
Instead, they form a separate switching network that processes use to communicate.
• Working/Structure:
o They are often shown with unidirectional links (data flows in one direction only) [47, Fig
2.13].
o Each processor typically has an outgoing link to the switching network and an incoming link
from the switching network [47, Fig 2.13].
o The switching network routes data between processors. This provides flexibility in
communication but can introduce more hops and potential for contention depending on the
network's design.
• Diagram: Refer to Figure 2.13, for a generic indirect network. This diagram illustrates a collection
of processors connected to a central switching network.
8. Explain with a neat diagram to explain the working of a Crossbar and an Omega network.
1. Crossbar Network:
• Concept: A crossbar is an indirect switched interconnect in which a grid of switches connects every
input line to every output line, so any processor can be connected to any other processor (or memory
module).
• Working/Structure: The switches are configured on demand; as long as two processors do not try to
communicate with the same destination at the same time, all processors can communicate simultaneously.
• Communication Efficiency and Cost: Because all distinct simultaneous communications are possible, there
is no contention inside the network itself. The drawback is cost: for p inputs/outputs a crossbar requires on
the order of p² switches, which becomes prohibitively expensive for large p.
• Diagram: This diagram typically shows processors along one side and a square grid of switches, with one
switch at every intersection of an input line and an output line.
2. Omega Network:
• Concept: An Omega network is another type of indirect switched interconnect, often considered a
more cost-effective alternative to a full crossbar for large systems.
• Working/Structure:
o It is constructed using multiple stages of smaller switches, typically 2x2 crossbar switches
[Fig 2.15, Fig 2.16].
o These stages are interconnected in a specific pattern to route messages from any input to any
output.
• Communication Efficiency:
o Unlike a crossbar, an Omega network does not allow all possible simultaneous
communications. There can be conflicts. For example, if processor 0 sends a message to
processor 6, processor 1 might not be able to simultaneously send a message to processor 7
because their paths would conflict at an internal switch.
o Cost: The main advantage of an Omega network is its lower cost compared to a crossbar. For
p inputs/outputs, an Omega network uses (p/2) · log₂(p) of the 2×2 crossbar switches, for a total of
2p · log₂(p) simple switches, which for large p is significantly less than the p² switches required by a crossbar.
o Bisection Width: The bisection width of an Omega network is p/2.
• Diagrams:
o For the Omega network structure: Figure 2.15
o For the internal 2x2 switch: Figure 2.16
9. Explain the concept of cache coherence and its approaches to explain how inconsistent data can
arise in a multiprocessor system with diagram.
Diagram: This diagram shows a shared-memory system with two cores, each having its own private data
cache, connected to main memory.
Concept of Cache Coherence: In a shared-memory multiprocessor system, each CPU core often has its
own private data cache. While caches significantly speed up memory access for individual cores, they
introduce a problem called cache coherence.
• Problem: The cache coherence problem arises when multiple caches store copies of the same shared
variable. If one core updates its copy of the variable in its cache, other cores that also have a copy of
that variable in their caches will not "see" this update. This leads to inconsistent data across the
system, where different cores hold different values for the same memory location.
• Why it arises: Programmers do not have direct control over CPU caches. The system hardware
manages them. This means a programmer cannot ensure that when one core updates a cached
variable, the cached values held by other processors are also updated.
How Inconsistent Data Can Arise (Example): Consider a shared-memory system with two cores, each
with its private cache [55, Fig 2.17]. Let x be a shared variable initialized to 2.
• Scenario:
o Time 0: Core 0 executes y0 = x; and Core 1 executes y1 = 3*x;. Both cores might load x=2 into
their respective caches. So, y0 becomes 2 and y1 becomes 6.
o Time 1: Core 0 updates x = 7;. This update happens in Core 0's cache.
o Time 2: Core 1 executes z1 = 4*x;.
• Inconsistency: If Core 1 still has the old value x=2 in its cache and doesn't "see" Core 0's update to
x=7, then z1 will incorrectly become 4 * 2 = 8. The expected value, reflecting the update by Core 0,
would be 4 * 7 = 28.
• This problem occurs regardless of whether a write-through (main memory updated) or write-back
(main memory not immediately updated) cache policy is used. In write-through, main memory is
updated, but Core 1's cache copy is still stale. In write-back, the update might not even be in main
memory for Core 1 to potentially see.
False Sharing: A related issue that can degrade performance (though not cause incorrect results) is false
sharing. Since caches operate on cache lines (blocks of memory) rather than individual variables, if two
cores access different, unrelated variables that happen to reside in the same cache line, that cache line will
be repeatedly invalidated and fetched by both cores. This happens even if the cores are not truly sharing the
data items but only the cache line they are part of, leading to many unnecessary main memory accesses and
poor performance. It can be reduced by using temporary local storage.
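As an illustration, a hypothetical OpenMP sketch of false sharing and of the fix with temporary local storage described above (the array size and names are invented for the example; counts[] is assumed to be zeroed by the caller):

    #include <omp.h>

    #define N 10000000

    /* Suffers from false sharing: counts[0] and counts[1] are different variables,
       but they usually sit in the same cache line, so every update by one thread
       invalidates the other thread's cached copy of that line. */
    void sum_false_sharing(long counts[2], const int *data) {
        #pragma omp parallel num_threads(2)
        {
            int me = omp_get_thread_num();
            for (long i = me; i < N; i += 2)
                counts[me] += data[i];
        }
    }

    /* Avoids it: each thread accumulates into a private local variable and
       writes to the shared array only once, at the end. */
    void sum_local_storage(long counts[2], const int *data) {
        #pragma omp parallel num_threads(2)
        {
            int me = omp_get_thread_num();
            long local = 0;
            for (long i = me; i < N; i += 2)
                local += data[i];
            counts[me] = local;
        }
    }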
10. Explain how nondeterminism can lead to race conditions in MIMD systems and how mutex locks
prevent them.
Nondeterminism in MIMD Systems:
• Concept: In Multiple Instruction, Multiple Data (MIMD) systems, particularly those where
processors execute asynchronously (at their own pace), nondeterminism is likely to occur. A
computation is nondeterministic if, for the same input, it can produce different outputs on different
runs.
• Cause: This happens because independent threads or processes execute statements at varying
relative rates from run to run. The exact order in which operations complete cannot be predicted.
• Example: If two threads print a private variable, the output order can vary between runs, or even
interleave, because the operating system schedules their output operations unpredictably.
How Nondeterminism Leads to Race Conditions:
• Problem: While nondeterminism isn't always harmful (like varying print order), it can be disastrous
when threads or processes attempt to access and modify shared resources simultaneously. This can
lead to incorrect program results.
• Race Condition Definition: A race condition occurs when multiple threads or processes try to
access a shared resource at the same time, and the outcome of the computation depends on the
unpredictable order (which thread "wins the race") in which these accesses occur.
• Example (x += my_val):
o Suppose two threads (Thread 0 and Thread 1) want to add their private values ( my_val) to a
shared variable x, initially 0.
o The operation x += my_val is not atomic; it involves multiple steps:
1. Load x into a register.
2. Load my_val into a register.
3. Add my_val to x in the register.
4. Store the result back into x in memory.
o Race Scenario:
▪ Core 0 loads x = 0.
▪ Core 1 loads x = 0.
▪ Core 0 adds my_val = 7 to its register copy of x (result 7).
▪ Core 1 adds my_val = 19 to its register copy of x (result 19).
▪ Core 0 stores x = 7 to memory.
▪ Core 1 stores x = 19 to memory.
o Incorrect Result: The final value of x becomes 19, losing Thread 0's update, instead of the
correct sum 7 + 19 = 26. This unpredictable outcome is a race condition.
• Atomic Operations and Critical Sections: To prevent race conditions, the operations on shared
resources must be atomic (appearing to complete without interruption from other threads). This is
achieved by defining a critical section: a block of code that only one thread can execute at a time.
• Mutex (Mutual Exclusion Lock): The most common mechanism for ensuring mutual exclusion in
critical sections is a mutex lock (or simply "lock").
o A mutex is a special object, supported by hardware, that acts as a guard for a critical section.
o Working:
1. Before entering a critical section, a thread must obtain (lock) the mutex by calling a
Lock function.
2. If the mutex is already owned by another thread, the calling thread will wait until it
becomes available.
3. Once the thread successfully obtains the lock, it can execute the code in the critical
section (e.g., x += my_val).
4. After completing the critical section, the thread must relinquish (unlock) the mutex
by calling an Unlock function.
o Prevention: This ensures that only one thread can execute the code within the critical section
at any given time, thereby preventing race conditions. The order in which threads acquire the
lock is not predetermined, but the access to the shared resource is serialized.
• Example with Mutex:
    my_val = Compute_val(my_rank);
    Lock(&add_my_val_lock);    // Acquire lock
    x += my_val;               // Critical section
    Unlock(&add_my_val_lock);  // Release lock
This guarantees that the x += my_val operation from one thread completes fully before another thread
can start its x += my_val operation, thus ensuring the correct sum.
• Consideration: Using mutexes enforces serialization of the critical section. To maintain good
parallel performance, critical sections should be as few and as short as possible.
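As a concrete illustration, a minimal Pthreads sketch of the pattern above (the lock name mirrors the pseudocode; the thread values 7 and 19 are the ones used in the race-condition example):

    #include <pthread.h>
    #include <stdio.h>

    int x = 0;                                   /* shared variable */
    pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

    void *thread_work(void *arg) {
        int my_val = *(int *)arg;                /* private value for this thread */

        pthread_mutex_lock(&add_my_val_lock);    /* acquire lock */
        x += my_val;                             /* critical section */
        pthread_mutex_unlock(&add_my_val_lock);  /* release lock */
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        int v0 = 7, v1 = 19;

        pthread_create(&t0, NULL, thread_work, &v0);
        pthread_create(&t1, NULL, thread_work, &v1);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        printf("x = %d\n", x);                   /* always 26, never 7 or 19 */
        return 0;
    }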
11. Explain message-passing and partitioned global address space languages in distributed-memory
systems to show how two processes can exchange data without deadlock.
In distributed-memory systems, each processing core has its own private memory and cannot directly
access another core's memory. Therefore, processes must explicitly communicate to exchange data. Two
common programming models for this are message-passing and Partitioned Global Address Space (PGAS)
languages.
1. Message-Passing:
• Concept: Message-passing APIs provide functions for processes to explicitly send and receive data
to and from each other. Processes identify each other by ranks (e.g., 0, 1, ..., p-1). MPI (Message-
Passing Interface) is the most widely used API for this.
• How two processes exchange data (without deadlock in a simple scenario):
o A message-passing program typically involves a Send function and a Receive function.
o Example Pseudocode:
if (my_rank == 1) {
sprintf(message, "greetings from process 1");
Send(message, MSG_CHAR, 100, 0); // Send to rank 0
} else if (my_rank == 0) {
Receive(message, MSG_CHAR, 100, 1); // Receive from rank 1
printf("Process 0 > Received: %s\n", message);
}
o Working: In this example, Process 1 creates a message and calls Send to transmit it to Process
0. Process 0 simultaneously calls Receive to wait for a message from Process 1.
o Blocking Behavior and Deadlock Avoidance (Simple Case):
▪ The most common behavior for Receive is to block (wait) until the message is fully
received.
▪ For Send, it can either block until the Receive operation has started receiving the data,
or it can copy the message into internal storage and return immediately.
▪ In the simple example above, a matching Send and Receive pair, where one process
sends and the other receives, allows for data exchange. If both Send and Receive are
blocking calls, they will wait for each other. This explicit pairing ensures that data is
exchanged correctly without a deadlock in this straightforward one-to-one
communication. For more complex patterns, careful ordering of Send and Receive
operations is crucial to avoid deadlocks (e.g., if Process A sends to B and B sends to
A, both might block indefinitely if they both try to send before anyone receives).
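For the two-way exchange just mentioned, one common MPI idiom (not described in the source) is MPI_Sendrecv, which pairs the send and the receive in a single call so that neither process can block forever waiting for the other; a minimal sketch, assuming exactly two processes:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int my_rank, partner, send_val, recv_val;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        partner  = 1 - my_rank;        /* process 0 pairs with process 1 and vice versa */
        send_val = 100 + my_rank;

        /* The combined send/receive lets the MPI library schedule the transfers,
           avoiding the deadlock that can occur if both processes call a blocking
           Send before either calls Receive. */
        MPI_Sendrecv(&send_val, 1, MPI_INT, partner, 0,
                     &recv_val, 1, MPI_INT, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("Process %d received %d from process %d\n", my_rank, recv_val, partner);
        MPI_Finalize();
        return 0;
    }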
2. Partitioned Global Address Space (PGAS) Languages:
• Concept: PGAS languages aim to combine the programming convenience of shared-memory models
with the underlying performance characteristics of distributed-memory hardware. They provide a
single global address space, but the programmer is given tools to control the distribution of data
within this shared space.
• How they work for data exchange:
o Addressing the "Remote Access" Problem: A naive approach to a global address space on
distributed memory would lead to unpredictable and poor performance because accessing
"remote memory" (memory attached to another core) is vastly slower than accessing "local
memory" (memory attached to the executing core).
o Programmer Control: PGAS languages solve this by making the programmer aware of data
locality.
▪ Private variables are allocated in the local memory of the core where the process
runs.
▪ The distribution of shared data structures (like arrays) is explicitly controlled by
the programmer. This means the programmer knows which parts of a shared array are
in which process's local memory.
o Data Exchange without Explicit Deadlock: Instead of explicit Send/Receive calls, a process
can access a shared variable. If that variable is in another process's "local" memory (but within
the global address space), the system handles the communication (often referred to as one-
sided communication or remote memory access). The programmer, by carefully arranging
data, ensures that operations primarily access local memory for performance, implicitly
exchanging data when remote access is necessary.
o Example (Vector Addition):
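(A sketch of the kind of code this example refers to, written in the pseudocode style used elsewhere in these notes; shared and private are PGAS qualifiers rather than standard C, and the element bounds are placeholders.)

    shared double x[n], y[n];        /* distributed across the processes' local memories */
    private int i, my_first_element, my_last_element;

    /* my_first_element and my_last_element are chosen so that the assigned
       block of x and y lies in this process's local memory. */
    for (i = my_first_element; i <= my_last_element; i++)
        x[i] += y[i];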
If the assigned elements of x and y are allocated so they reside in the local memory of the
core running the process, this code will be very fast. The "exchange" of data is conceptually
handled by the global address space, with the programmer ensuring efficient local access.
o While PGAS languages simplify the communication model by providing a global view, they
still require careful programming to ensure data locality and avoid slow remote memory
accesses for good performance.
12. What is parallel computing? Where are parallel programs needed, and why are they better than
sequential programs?
What is Parallel Computing? Parallel computing involves performing multiple computations or tasks
simultaneously using multiple processing units (like CPU cores or GPUs). The goal is to solve a single
larger problem faster or handle larger datasets by dividing the work among these units, allowing them to
execute parts of the problem in parallel. This is a departure from the traditional von Neumann model of
sequential computation, which executes one instruction and computes one data value at a time.
Where Parallel Programs Are Needed to Write: Parallel programs are essential in modern computing for
several key reasons:
1. Exploiting Modern Hardware: Almost all modern computing devices, from mobile phones and
tablets to desktops and servers, now use multicore processors. To get the most performance out of
these systems, programs must be written to use multiple cores simultaneously.
2. Increased Application Performance: We can no longer rely solely on hardware advancements and
compilers to automatically provide steady increases in application performance for single-core
execution. To achieve routine increases in application speed and power, software developers must
learn to write parallel applications.
3. Solving Large-Scale Problems: Parallel computing is crucial for problems that demand vast
amounts of data or computation. Examples include:
o Scientific simulations (e.g., climate modeling, molecular dynamics).
o Weather forecasting.
o Large-scale engineering models.
o Image and video processing (heavily uses GPUs which employ SIMD parallelism).
o Many common applications like Excel, Photoshop, and web browsers (Chrome) already use
multiple cores to improve responsiveness and performance.
4. Handling Data-Parallel Workloads: For problems that involve applying the same operation to
large sets of data (e.g., adding two large arrays), parallel computing, particularly SIMD systems like
vector processors and GPUs, is highly efficient.
Why Parallel Programs Are Better than Sequential Programs:
1. Speedup and Faster Execution: The most significant advantage is increased speed. By dividing a
task among multiple processors, a parallel program can theoretically complete the work much faster
than a sequential program. If a serial program takes $T_{serial}$ time, a perfectly parallelized program on p
cores could run in $T_{serial}/p$ time, achieving linear speedup.
2. Ability to Solve Larger and More Complex Problems: Parallel systems, especially distributed-
memory ones, can aggregate memory and computational power far beyond what a single machine
can offer. This allows them to tackle problems that would be impossible for a sequential program due
to time or memory constraints. For many problems, as the problem size increases, the portion that
can be parallelized grows faster than the inherently serial portion, leading to significant real-world
speedups (Gustafson's Law).
3. Efficient Resource Utilization: Parallel programs aim to keep multiple processing units busy,
leading to better overall utilization of the available hardware resources (reflected in high efficiency).
While there are overheads associated with parallelism (like communication and synchronization), for
suitable problems, the benefits of concurrent execution far outweigh these costs.
4. Responsiveness in Interactive Applications: In user-facing applications, parallelism can make
software more responsive by performing background tasks concurrently with user interaction.
13. Describe the classifications of parallel computers and explain hardware multithreading with an
example.
Classifications of Parallel Computers: Parallel computers are classified in several ways, primarily based
on their instruction and data streams, and how their cores access memory.
1. Flynn’s Taxonomy: This classification categorizes computers based on the number of instruction
streams (IS) and data streams (DS) they can handle simultaneously: SISD (the classical von Neumann
machine), SIMD (e.g., vector processors and GPUs), MISD (of little practical importance), and MIMD
(e.g., multicore processors and clusters of computers).
2. Memory Access Classification: This classification distinguishes parallel computers based on how their
processing cores access memory:
• Shared-Memory Systems:
o All processing cores can access a common block of memory locations. Cores coordinate
their work by implicitly modifying shared data structures. Examples include multicore
processors and UMA/NUMA systems.
• Distributed-Memory Systems:
o Each processing core has its own, private memory. Cores communicate and coordinate their
work by explicitly sending messages across an interconnection network. Clusters of
computers are a common example.
Hardware Multithreading is a technique where the hardware is designed to allow multiple threads to share
the execution resources of a single core. This helps keep the processing units busy and improves overall
throughput. The source primarily discusses Simultaneous Multithreading (SMT). For example, with SMT,
if one thread stalls while waiting for data from memory, the core can issue instructions from another ready
thread in the same cycle, so the core's functional units are not left idle.
1. Explain the roles of the CPU (host) and the GPU (device) in GPU programming, and justify why
minimizing communication between them is essential.
• CPU (Host): This is the general-purpose processor that runs the operating system and manages
system services. In GPU programming, the CPU is responsible for:
o Allocating and initializing memory on both the CPU and the GPU.
o Launching GPU programs (often called kernels).
o Collecting and outputting the results after the GPU has finished its computation.
• GPU (Device): This is the highly parallel, specialized processor designed for compute-intensive
work. It does not usually run an operating system or directly manage system services like disk I/O.
Why Minimizing Communication is Essential: The critical reason to minimize communication (data
transfer) between the CPU and GPU is due to their separate memory spaces and the inherent latency
involved in moving data between these distinct components.
1. Separate Memory: The memory for the CPU host and the GPU memory are physically separate.
This means that any data needed by the GPU that originates from the CPU (or vice-versa) must be
explicitly copied across the system's interconnect (e.g., PCIe bus).
2. High Latency and Limited Bandwidth: While not explicitly detailed as "slow" in every instance,
the architectural separation implies that data transfer between the CPU's main memory and the
GPU's global memory is significantly slower than operations performed locally on either device.
Moving large amounts of data back and forth incurs substantial latency (time for data to start
arriving) and can consume significant bandwidth (rate of data transfer) that could otherwise be used
for computation.
3. Overhead of Data Transfer: Every time data is moved, there's an overhead. For example, the CPU
must initiate the transfer, and the GPU must receive it. This overhead reduces the actual time
available for parallel computation on the GPU.
Impact on Performance: If a GPU program frequently transfers data between the CPU and GPU, the
performance gains from the GPU's massive parallel processing power can be severely negated. The GPU
may spend a large portion of its time waiting for data to arrive from the CPU or waiting for its results to be
transferred back. This is especially problematic for applications that have a low "compute-to-data" ratio,
where the amount of computation per data item is small compared to the data transfer cost.
Strategy to Minimize Communication: To limit this overhead, a GPU program should:
• Transfer necessary data to the GPU once at the beginning of the computation.
• Perform as much computation as possible entirely on the GPU.
• Transfer results back to the CPU only when necessary at the end. This strategy ensures that the
GPU's powerful parallel capabilities are primarily used for computation rather than waiting for data.
2. Write the given rules to decide which process or thread should handle standard input in a parallel
program running on a shared-memory MIMD system.
In parallel programs, especially in shared-memory MIMD systems, managing standard input (stdin) can be
complex due to the potential for nondeterminism when multiple threads try to access it simultaneously. To
address these issues and ensure predictable behaviour, specific rules are followed for handling I/O:
For a parallel program running on a shared-memory MIMD system, the rule for handling standard input
(stdin) is: only thread 0 (a single, designated thread) should access stdin; the other threads should not
attempt to read from it.
This rule ensures that a single, designated thread is responsible for reading input, preventing
nondeterministic behaviour or conflicts that could arise if multiple threads attempted to read from the same
input stream simultaneously.
Additionally, while not directly about standard input, other relevant I/O rules for parallel programs
(including shared-memory systems) are:
• All processes/threads can access stdout and stderr, but because the order of output is nondeterministic, in
most cases only a single process/thread (again, usually thread 0) should write to stdout, the exception being
output for debugging.
• Debug output should always include the rank or id of the process/thread that generated it.
• Only a single process/thread should attempt to access any given file other than stdin, stdout, and stderr;
for example, each process/thread can open and use its own, separate file.
3. Explain the concepts of speedup and efficiency to determine processor utilization in a parallel
program, and show how speedup and efficiency vary with different problem sizes.
In parallel programming, speedup and efficiency are key metrics used to evaluate the performance and
processor utilization of a parallel program.
1. Speedup (S):
• Concept: Speedup measures how much faster a parallel program runs compared to its corresponding
serial (sequential) program. It is the ratio of the time taken by the best serial program ($T_{serial}$)
to the time taken by the parallel program ($T_{parallel}$): $S = T_{serial} / T_{parallel}$.
• Interpretation:
o If a program achieves linear speedup, it means $S = p$, where $p$ is the number of
processors/cores used. This is the ideal scenario where a program runs $p$ times faster on
$p$ cores.
o In practice, perfect linear speedup is rarely achieved because parallel programs introduce
overhead, such as communication between processes, synchronization (e.g., using mutexes),
and load imbalance. These overheads are not present in serial programs and tend to increase
as the number of processors ($p$) increases.
2. Efficiency (E):
• Concept: Efficiency measures how well the available processors/cores are being utilized by the
parallel program: $E = S / p = T_{serial} / (p \cdot T_{parallel})$. It indicates the average fraction of
time each core spends doing useful work on the original problem, rather than dealing with parallel overhead.
• Interpretation:
o Efficiency ranges from 0 to 1 (or 0% to 100%). An efficiency of 1 (or 100%) indicates perfect
utilization, where there is no parallel overhead.
o If $E < 1$, it means some time is being wasted due to parallel overhead.
o Efficiency can be seen as the average utilization of the parallel cores on solving the problem,
with the remainder of the parallel run-time being the parallel overhead.
3. Effect of Problem Size:
Speedup and efficiency are not constant; they also depend on the problem size.
• Increasing Problem Size: When the problem size is increased (while keeping the number of
processors fixed), both the speedup and efficiency generally increase.
o Reason: This often happens because the time spent on the original, useful computation
($T_{serial}$) grows much faster than the time spent on parallel overhead
($T_{overhead}$). The overheads (like communication setup or synchronization) might be
relatively fixed or grow slowly, becoming a smaller fraction of the total execution time as the
problem size increases. Therefore, the cores spend a larger proportion of their time doing
useful work, leading to better utilization.
o Example (Table 2.5): The sources provide an example where doubling the problem size
significantly improves efficiency, from 0.68 for the original size to 0.89 with 16 cores.
• Decreasing Problem Size: Conversely, when the problem size is decreased, the speedup and
efficiency generally decrease.
o Reason: The overheads become a more significant portion of the total execution time,
leading to lower processor utilization and less impressive speedups.
o Example (Table 2.5): Halving the problem size in the example shows a drop in efficiency
from 0.68 to 0.39 with 16 cores.
Diagrams: Typical plots show speedup and efficiency versus the number of cores for the halved, original,
and doubled problem sizes; the curves for the larger problem sizes lie closer to the ideal (linear speedup,
efficiency close to 1).
4. Explain the concept of scalability to explain how increasing the number of processors in a MIMD
system affects program performance.
Concept of Scalability: Informally, a parallel program is scalable if, by increasing the power of the system
it runs on (e.g., increasing the number of cores), we can obtain speedups over its performance on a less
powerful system.
More formally, in discussions of MIMD parallel program performance, scalability has a specific
definition: a parallel program is scalable if, when the number of processes/threads is increased, we can find
a corresponding rate of increase in the problem size such that the program always has the same efficiency.
How Increasing the Number of Processors Affects Program Performance (and Scalability):
Increasing the number of processors ($p$) in an MIMD system typically affects program performance in the
following ways:
1. Potential for Higher Speedup: With more processors, more work can be done concurrently, leading
to faster execution and higher speedup. Ideally, doubling the processors might double the speedup.
2. Increased Parallel Overhead: However, simply increasing processors does not guarantee linear
speedup. Parallel overheads (like communication, synchronization, and load imbalance) usually
increase as the number of processors grows.
o More processors often mean more communication to exchange data.
o More processors often mean more contention for shared resources or more complex
synchronization mechanisms.
o It becomes harder to perfectly balance the workload across a very large number of processors.
3. Impact on Efficiency: Because parallel overheads tend to increase with $p$, the efficiency ($E$) of
a parallel program usually decreases as the number of processors increases for a fixed problem
size. This means each processor spends a smaller fraction of its time doing useful work on the
original problem and more time on overhead.
Scalability and Performance Relationship: The concept of scalability explains how this trade-off between
increased processors and increased overhead can be managed to maintain performance.
• Weak Scalability: This is when the efficiency of a program can be kept constant by increasing the
problem size at the same rate as the number of processes/threads. This is a more common and
realistic form of scalability in large-scale parallel computing.
o This implies that for a parallel program to maintain its performance characteristics
(efficiency) as we add more processors, the workload must also grow proportionally. This
allows the useful computation time to increase, making the overheads a relatively smaller
portion of the total time.
• Strong Scalability: This is a much rarer and more difficult type of scalability to achieve. A program
is strongly scalable if its efficiency remains fixed even when the number of processes/threads
increases without changing the problem size. This is rare because overheads almost always grow to
some extent as more processors are introduced, making it difficult to maintain constant efficiency for
a fixed amount of work.
In essence, scalability helps us understand that merely adding more processors to an MIMD system doesn't
automatically improve performance linearly. For sustained performance gains, especially in terms of
efficiency, the problem size often needs to be considered and sometimes scaled along with the number of
processors to offset the increasing parallel overheads.
5. Interpret the role of barriers and elapsed times in timing parallel programs and justify with reasons
why the minimum run-time is preferred.
When timing parallel programs, especially in MIMD (Multiple Instruction, Multiple Data) systems, it is
crucial to accurately measure the performance of the parallelised sections of code. This involves
understanding the roles of barriers and elapsed times, and why minimum run-time is typically reported.
1. Role of Barriers:
• Purpose: In parallel programs, different processes or threads execute asynchronously (at their own
pace). To get an accurate measurement of how long a parallel section of code takes, all
processes/threads need to start timing that section at roughly the same moment.
• Working: A barrier is a synchronisation primitive. When a parallel program calls a barrier function
(e.g., Barrier() in pseudocode), all participating processes/threads must reach this point in the code
before any of them are allowed to proceed past it. While an ideal barrier would have all processes
return simultaneously, it usually guarantees that all have at least started the call when the first one
returns.
• Benefit: By placing a barrier just before the start timer call, we ensure that all processes/threads begin
the timed computation together, leading to a more consistent and meaningful measurement of the
parallel work.
2. Role of Elapsed Times:
• Type of Time: When evaluating parallel programs, we are primarily interested in "wall clock" time
(also known as real time or elapsed time), not "CPU time".
o Wall clock time measures the actual time that passes from the start of a code section to its
finish, as if measured by a stopwatch. This includes both the time spent actively computing
and any time spent waiting (e.g., for messages, for synchronisation, or for other threads to
finish).
o CPU time only accounts for the time when the processor is actively executing instructions,
ignoring idle or waiting periods. For parallel programs, waiting is a significant and real cost,
so excluding it would give a misleading picture of performance.
• Measurement: Each individual process/thread records its own start time (my_start) and end time
(my_finish) for the code section it executes, calculating its my_elapsed time.
• Global Measurement: Since processes/threads might finish at different times, the actual parallel
run-time is the time from when the first process/thread began execution to when the last
process/thread finished. This is captured by taking the maximum of all the individual my_elapsed
times using a global_max function (or similar collective operation). This maximum value is then
reported as the overall elapsed time for the parallel section.
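As an illustration, a hedged MPI sketch of the timing pattern just described (the work being timed is only a placeholder loop):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        double my_start, my_finish, my_elapsed, elapsed;
        int my_rank;
        volatile double s = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        MPI_Barrier(MPI_COMM_WORLD);             /* all processes start timing together */
        my_start = MPI_Wtime();                  /* wall clock time, not CPU time */
        for (long i = 0; i < 10000000; i++)      /* placeholder for the timed parallel work */
            s += (double)i;
        my_finish  = MPI_Wtime();
        my_elapsed = my_finish - my_start;

        /* The parallel run-time is the elapsed time of the slowest process. */
        MPI_Reduce(&my_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (my_rank == 0)
            printf("Elapsed time = %e seconds\n", elapsed);

        MPI_Finalize();
        return 0;
    }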
3. Why the Minimum Run-Time Is Preferred:
• Variability: When a parallel program is run multiple times, even with the same input and on the
same system, the measured elapsed times are likely to vary.
• External Factors: This variability is often due to external factors such as operating system
activities, background processes, network fluctuations, or other users on a shared system. These
external events can only slow down a program's execution; they cannot make it run faster than its
inherent best possible speed.
• Best-Case Performance: Therefore, reporting the minimum run-time among several executions is
preferred. This minimum time provides the most optimistic and realistic measure of the program's
best performance, representing its speed when external interferences are minimised. Reporting a
mean or median time would include the negative impact of these external slowdowns, which do not
reflect the program's optimal capability.
6. Explain GPU performance to show whether the evaluation measures used in MIMD systems, such
as speedup, efficiency, and scalability, are applicable to GPU or not.
Evaluating GPU performance requires a different perspective than traditional MIMD (Multiple Instruction,
Multiple Data) systems, primarily because GPUs are fundamentally different in their architecture and
operating principles.
1. Speedup:
• MIMD Systems: For MIMD systems, speedup is typically calculated by comparing the parallel
execution time on p cores to the execution time of the best serial program run on a single core of the
same design. Linear speedup ($S=p$) is the ideal target.
• GPU Performance: While it is common to report speedups of GPU programs over serial CPU
programs or even parallel MIMD programs, the concept of "linear speedup" in the MIMD sense is
not directly applicable to GPUs.
o Reason: GPU cores are inherently parallel and fundamentally different from conventional
CPU cores. They are designed for massive data-parallel tasks using a SIMD-like model.
Directly comparing them as if they were simple extensions of a single CPU core for linear
speedup calculation is not meaningful.
• Applicability: No, the MIMD concept of linear speedup relative to a same-type serial core is not
applicable. General speedup numbers are still reported but should be interpreted carefully.
2. Efficiency:
• MIMD Systems: Efficiency measures how well the available processors are utilised, calculated as
$E = S/p$. It represents the average fraction of time each core spends on useful work, assuming
comparison against a single core of the same type.
• GPU Performance: Similar to speedup, efficiency is ordinarily not used in discussions of GPU
performance in the same way it is for MIMD systems.
o Reason: Since GPU cores are inherently parallel and distinct from serial CPU cores, defining
what "100% utilisation" means for a single GPU core in the context of a serial benchmark
doesn't make sense. The concept of comparing "parallel vs. serial speed per core" as a
measure of efficiency becomes problematic for GPUs.
• Applicability: No, efficiency is generally not a meaningful evaluation measure for GPUs.
3. Scalability:
• MIMD Systems (Formal Definition): A parallel program is formally scalable if, when the number
of processes/threads is increased, a corresponding rate of increase in the problem size can be found
such that the program always maintains the same efficiency.
• GPU Performance: Because the formal definition of efficiency (as used in MIMD systems) does
not apply to GPUs, the formal definition of scalability for MIMD programs cannot be applied to
GPU programs either.
o Informal Usage: However, the informal usage of scalability is routinely applied to GPUs.
A GPU program is considered scalable if, by increasing the size or power of the GPU (e.g.,
adding more cores, increasing memory bandwidth), one can obtain speedups over the
program's performance on a smaller or less powerful GPU.
• Applicability: The formal MIMD definition is not applicable. The informal understanding of
scalability (better performance with more GPU resources) is widely used.
4. Amdahl's Law:
• MIMD Systems: Amdahl's Law states that if a fraction 'r' of a serial program is inherently
unparallelised, the maximum possible speedup is limited to $1/r$, regardless of the number of
processors.
• GPU Performance: Amdahl's Law can be applied to GPU programs if the inherently serial part
of the program is executed on a conventional, serial CPU processor. In this scenario, the resulting
upper bound on possible speedup will be the same as for an MIMD program.
o Caveats: The same caveats that apply to Amdahl's Law in MIMD systems also apply to
GPUs: the serial fraction often decreases with increasing problem size, and many GPU
programs still achieve huge speedups. Even small speedups can be valuable.
• Applicability: Yes, if the serial portion runs on a CPU.
In summary, while absolute speedup figures are often reported for GPUs, the nuanced MIMD measures of
linear speedup, efficiency, and formal scalability are generally not directly applicable due to the inherent
architectural differences between CPUs and GPUs. Amdahl's Law, however, can be relevant when
considering the serial CPU component of a heterogeneous CPU-GPU application.
7. Explain the concepts of speedup and efficiency to decide which system is better: System A: 8
processors, execution time = 40s (sequential time = 160s) System B: 4 processors, execution time = 60s
(sequential time = 160s)
To determine which system is "better," we need to calculate the speedup and efficiency for both systems
using the given concepts ($S = T_{serial}/T_{parallel}$, $E = S/p$).
• System A: $S_A = 160 / 40 = 4$; $E_A = 4 / 8 = 0.50$ (50%).
• System B: $S_B = 160 / 60 \approx 2.67$; $E_B = 2.67 / 4 \approx 0.667$ (about 66.7%).
Comparison & Decision
• System A: Higher speedup (4 vs ≈2.67) → finishes the job faster (40s vs 60s)
• System B: Higher efficiency (≈66.7% vs 50%) → processors are better utilised
• System A (8 processors, 40s execution) achieves a higher speedup of 4, meaning it completes the
task in less absolute time (40s).
• System B (4 processors, 60s execution) achieves a higher efficiency of approximately 66.7%,
meaning its processors are, on average, more effectively utilised per core.
• If the objective is to achieve the fastest possible execution time for the given problem, System A is
better because it finishes the task in 40 seconds, which is faster than System B's 60 seconds. This is
often the primary concern in high-performance computing.
• If the objective is to maximise the utilisation of each individual processor (i.e., get the most
"work" out of each core you've invested in), then System B is better due to its higher efficiency.
In most performance evaluations, raw speedup (faster completion time) is prioritised, making System A
the generally "better" choice for this particular problem, as it delivers the result faster.
8. Develop a program with a runtime of 80 seconds, where 75% can be parallelized, by using
Amdahl’s Law to calculate the maximum speedup achievable on 6 cores and 20 cores.
Amdahl's Law is a formula that describes the theoretical maximum speedup of a parallel program given the
proportion of the program that can be parallelised.
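Applying Amdahl's Law to the given numbers (this worked calculation is not in the source and assumes the
25% serial fraction gets no benefit from the extra cores), the speedup on p cores is bounded by
$S(p) \le 1 / ((1 - f) + f/p)$ with parallel fraction $f = 0.75$:
• On 6 cores: $S(6) = 1 / (0.25 + 0.75/6) = 1 / 0.375 \approx 2.67$, so the 80-second program could run in
about $80 \times 0.375 = 30$ seconds.
• On 20 cores: $S(20) = 1 / (0.25 + 0.75/20) = 1 / 0.2875 \approx 3.48$, i.e. roughly $80 \times 0.2875 = 23$
seconds.
• As $p \to \infty$, the speedup is bounded above by $1 / 0.25 = 4$, so no matter how many cores are used the
program can never run more than four times faster than the serial version.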