
Parallel Computing (BCS702)

MODULE-1
Introduction to parallel programming, Parallel hardware and parallel software – Classifications of parallel
computers, SIMD systems, MIMD systems, Interconnection networks, Cache coherence, Shared-memory vs.
distributed-memory, Coordinating the processes/threads, Shared-memory, Distributed-memory.

1. Explain the classification of memory with a schematic diagram and relate it to the suitable system
programming APIs used for shared memory and distributed-memory systems.

Parallel computers are primarily classified based on how their processing cores access memory. This leads
to two main types of memory systems: shared memory systems and distributed memory systems.

1. Shared Memory Systems:

• Concept: In a shared memory system, all processing cores (or processors) have access to a common
block of memory locations. Cores coordinate their work by modifying these shared memory
locations. This means communication between cores happens implicitly by reading from or writing
to shared data structures.
• Suitable APIs: For programming shared memory systems, threads are commonly used.
o OpenMP (Open Multi-Processing): A popular API for shared-memory multiprocessing
programming in C, C++, and Fortran. It uses compiler directives to parallelise regions of
code, allowing threads to share memory.
o Pthreads (POSIX Threads): A standard API for creating and managing threads. It provides
explicit control over thread creation, synchronisation (e.g., mutexes, semaphores), and
memory sharing.
• Why these APIs? These APIs are suitable because they allow programmers to define shared
variables that all threads can access, making implicit communication straightforward.
Synchronisation mechanisms (like mutexes) are provided to manage concurrent access to shared data
and prevent race conditions.
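To make the shared-memory model concrete, here is a minimal OpenMP sketch in C (an illustration, not taken from the notes; the array size and values are arbitrary). All threads read and write the same arrays, so no explicit communication calls are needed:

#include <stdio.h>

#define N 1000

int main(void) {
    static double x[N], y[N];                /* shared by every thread */
    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* The compiler directive splits the loop among threads; communication
       is implicit because all threads see the same memory locations.     */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] += y[i];

    printf("x[%d] = %f\n", N - 1, x[N - 1]); /* compile with -fopenmp */
    return 0;
}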
2. Distributed Memory Systems:

• Concept: In a distributed memory system, each processing core has its own private memory. Cores
cannot directly access each other's memory. Instead, they coordinate their work by explicitly
communicating across a network. Communication typically involves sending messages between
processors.
• Suitable APIs: For programming distributed memory systems, processes are commonly used.
o MPI (Message-Passing Interface): This is the most widely used API for message-passing. It
provides functions for sending and receiving messages between processes, allowing explicit
communication and synchronisation.
o PGAS (Partitioned Global Address Space) Languages: These languages offer a
programming model that provides some shared-memory concepts over distributed-memory
hardware. Programmers can control the distribution of data in shared structures, ensuring that
local data access is prioritised to improve performance.
• Why these APIs? These APIs are necessary because each process operates on its private memory.
Programmers must explicitly specify when data needs to be exchanged between processes (e.g.,
using Send and Receive functions in MPI).

Schematic Diagrams:

• Shared-Memory System: A collection of autonomous processors connected to a memory system via an
interconnection network. Diagram: Refer Figure 2.3.
• Distributed-Memory System: Each processor is paired with its own private memory, and these
processor-memory pairs communicate over an interconnection network. Diagram: Refer Figure 2.4.

2. Explain in detail about Instruction-Level Parallelism to demonstrate how hazards can be reduced
in execution.

Instruction-Level Parallelism (ILP) refers to the execution of multiple instructions from a single
instruction stream concurrently. The sources mention multiple issue and pipelining as forms of parallel
hardware that achieve this by allowing different functional units to execute simultaneously. This form of
parallelism is typically not visible to the programmer and is treated as an extension to the basic von
Neumann model.

While the provided sources acknowledge pipelining and multiple issue as ways to execute functional units
simultaneously, they do not explicitly define or detail "hazards" in execution or specific techniques for their
reduction (such as data hazards, control hazards, or structural hazards, and solutions like forwarding,
stalling, or branch prediction).

However, the concept of Simultaneous Multithreading (SMT) is presented as a variation on fine-grained
multithreading that attempts to exploit superscalar processors. SMT allows multiple threads to make use of
multiple functional units simultaneously. By designating "preferred" threads (those with many instructions
ready to execute), SMT can somewhat reduce the problem of thread slowdown. In a broader sense, this
reduction in thread slowdown can be seen as mitigating issues that would otherwise cause execution delays,
similar to how hazard reduction improves efficiency in traditional ILP by keeping functional units busy and
preventing stalls. Although not directly termed "hazard reduction," the principle of better utilising functional
units to avoid slowdown aligns with the goals of reducing pipeline hazards.

3. Describe the different memory systems based on Flynn’s taxonomy.

Flynn’s taxonomy classifies parallel computers based on the number of instruction streams and data
streams they can manage simultaneously. While Flynn's taxonomy primarily classifies systems rather than
memory directly, the type of system strongly dictates the memory architecture.
The four classifications in Flynn's taxonomy are:

1. SISD (Single Instruction Stream, Single Data Stream):


o Description: This represents a classical von Neumann system. It executes a single instruction
at a time and processes a single data value at a time. This is a serial computing model, not
typically considered parallel.
o Memory System: A single processor accesses a single block of main memory. There are no
complexities of shared or distributed memory access as there is only one computational unit.
2. SIMD (Single Instruction Stream, Multiple Data Streams):
o Description: SIMD systems operate on multiple data streams by applying the same
instruction to multiple data items simultaneously. An abstract SIMD system has a single
control unit broadcasting instructions to multiple datapaths, where each datapath applies the
instruction to its data or remains idle.
o Memory System: SIMD systems typically work with data-parallelism, where large arrays
of data are divided among processors. Memory access needs to be highly efficient for vector
operations.
▪ Vector Processors: A key example of SIMD systems. They feature interleaved
memory consisting of multiple banks that can be accessed independently to load/store
successive elements of a vector with minimal delay. They also support strided
memory access and hardware scatter/gather for efficient access to irregularly spaced
data.
▪ GPUs (Graphics Processing Units): Modern GPUs leverage SIMD parallelism,
incorporating a large number of datapaths on each processing core. GPUs can use
shared or distributed memory; larger systems often use both, with several cores
having access to common memory blocks, and different blocks communicating over a
network. GPUs also have specific memory hierarchies, including global memory
(large, slow, shared) and shared memory (small, fast, private to each processor, acting
as a programmer-managed cache).
3. MIMD (Multiple Instruction Streams, Multiple Data Streams):
o Description: MIMD systems support multiple simultaneous instruction streams operating on
multiple data streams. They consist of a collection of fully independent processing units or
cores, each with its own control unit and datapath. MIMD systems are usually asynchronous,
meaning processors operate at their own pace.
o Memory System: MIMD systems are primarily divided into two main types based on their
memory architecture:
▪ Shared-Memory MIMD Systems: A collection of autonomous processors connected
to a memory system, where each processor can access every memory location.
Processors communicate implicitly by accessing shared data structures. Examples
include multicore processors and systems with UMA (Uniform Memory Access) and
NUMA (Nonuniform Memory Access) architectures. (See Question 5 for details on
UMA/NUMA).
▪ Distributed-Memory MIMD Systems: Each processor is paired with its own private
memory. These processor-memory pairs communicate explicitly over an
interconnection network, typically by sending messages. Examples include clusters of
computers.

4. Explain the concept of SIMD to explain the role of vector processors in improving performance.

Concept of SIMD (Single Instruction, Multiple Data): SIMD systems are a class of parallel computers
designed to execute a single instruction stream on multiple data streams simultaneously.

• An abstract SIMD system consists of a single control unit that broadcasts instructions to multiple
datapaths. Each datapath then applies the same instruction to its local data item, or it remains idle.
• This architecture is highly efficient for data-parallel problems, such as performing the same
operation on many elements of large arrays.
• However, its efficiency can degrade if conditional operations (e.g., if statements) cause some
datapaths to be idle, as all datapaths must execute the same instruction or wait. SIMD datapaths
operate synchronously and do not have individual instruction storage.

Role of Vector Processors in Improving Performance: Vector processors are a significant type of SIMD
system that excel at operating on arrays or "vectors" of data, unlike conventional CPUs that operate on
individual "scalar" data elements. They are designed with specific features to leverage SIMD parallelism for
performance improvement:

1. Vector Registers: These registers are designed to store an entire vector of operands and can operate
on their contents simultaneously. This means a single operation can affect multiple data elements
stored within the register at once.
2. Vectorized and Pipelined Functional Units: Vector processors have functional units optimised to
apply the same operation to each element within a vector. These units are also pipelined, allowing
continuous flow of data through the processing stages.
3. Vector Instructions: These are special instructions that operate on entire vectors rather than scalars.
For example, a simple loop like for (i=0; i<n; i++) x[i] += y[i]; can be executed with a single load, add,
and store instruction for a block of vector_length elements, significantly reducing the total number of
instructions compared to a conventional system that requires these three operations for each element.
This dramatically cuts down instruction fetch and decode overhead.
4. Interleaved Memory: To keep the vector functional units supplied with data, vector processors use
interleaved memory. This system consists of multiple independent "banks" that can be accessed in
parallel or rapid succession, allowing continuous loading and storing of vector elements with little to
no delay.
5. Strided Memory Access and Hardware Scatter/Gather: Vector systems include special hardware
to accelerate accessing vector elements that are at fixed intervals (strided access) or irregular
intervals (scatter/gather). This ensures efficient data movement for various data patterns.
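As a small illustration (not from the notes) of the kind of loop these features target, the element-wise addition above can be written so that a vectorizing compiler, or an explicit #pragma omp simd hint, maps it onto vector loads, adds, and stores; the restrict qualifiers are an added assumption that the arrays do not overlap:

#include <stddef.h>

/* One vector instruction handles vector_length elements at a time instead of
   a separate load/add/store per element.                                     */
void vec_add(double *restrict x, const double *restrict y, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        x[i] += y[i];
}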

Performance Improvement: Vector processors improve performance significantly because:

• They process multiple data items with a single instruction (SIMD nature), leading to high
computational throughput.
• They effectively exploit memory bandwidth, ensuring that every data item loaded from memory is
actually used, which is more efficient than cache-based systems where an entire cache line might be
loaded but not all items used.
• Vectorizing compilers are very good at identifying and transforming serial code (especially loops)
into vector instructions, making them easy to use for many applications.

For many applications involving large-scale data manipulation (e.g., scientific simulations), vector
processors offer substantial speedups by reducing instruction overhead and maximizing data processing per
clock cycle.

5. With a neat diagram explain the structure of UMA and NUMA architectures.

UMA and NUMA are classifications of shared-memory systems that describe how multiple processors or
cores access the main memory.

1. UMA (Uniform Memory Access) Architecture:

• Structure: In a UMA system, all processing cores in the system have uniform access time to any
location in the main memory. This is typically achieved by connecting all processors directly to a
common main memory module through an interconnection network.
• Characteristics:
o Uniform Access Time: The time taken for any core to access any memory location is
roughly the same.
o Easier Programming: Programmers do not need to consider different memory access times,
simplifying software development.
o Scalability Limitations: As the number of processors increases, the shared interconnect
(e.g., a bus) can become a bottleneck, leading to contention and performance degradation.
• Diagram:

This diagram typically shows multiple processors or cores connected to a single block of main memory via a
single interconnect.

2. NUMA (Nonuniform Memory Access) Architecture:

• Structure: In a NUMA system, the memory is physically distributed among different processing
nodes, but it presents a single, shared address space to the processors. Each processor (or group of
processors) has its local memory that it can access quickly. Accessing memory attached to other
processors (remote memory) is slower. This is often implemented where each processor has a direct
connection to a block of main memory, and processors access each other's memory blocks through
special hardware.
• Characteristics:
o Nonuniform Access Time: Access to local memory is significantly faster than access to
remote memory.
o Programming Complexity: Programmers must be aware of data locality to place data in
local memory for better performance, increasing programming complexity.
o Better Scalability: NUMA systems can scale to use larger amounts of memory and support
more processors than UMA systems because the memory bus is not a single point of
contention for all memory accesses.
• Diagram:

This diagram typically illustrates multiple processor-memory nodes interconnected, showing that each
processor has its own local memory segment, and it can access other segments but with a higher latency.
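The notes stop at the architectural description, but as an illustration of the data-locality point: on Linux-like systems that use a first-touch page-placement policy (an assumption about the platform), a common NUMA-aware idiom in C/OpenMP is to initialize arrays with the same threads and schedule that will later compute on them, so each thread's pages end up in its local memory:

/* Sketch: parallel first-touch initialization (compile with -fopenmp). */
void numa_friendly_init_and_add(double *a, double *b, long n) {
    /* First touch: each thread initializes (and therefore "owns") its chunk,
       so the pages are placed on that thread's local NUMA node.             */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; }

    /* Same static schedule, so each thread now works on data in its local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] += b[i];
}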

6. Explain with neat diagram the distributed memory interconnects such as ring and toroidal mesh to
explain how communication efficiency is improved.

Distributed Memory Interconnects are networks that connect processor-memory pairs in distributed-
memory systems. The design and performance of these interconnects are crucial, as a slow interconnect can
severely degrade the overall performance of parallel programs. Ring and toroidal mesh are examples of
direct interconnects, where each switch is directly connected to a processor-memory pair, and the switches
are also connected to each other.

1. Ring Interconnect:

• Structure: A ring interconnect connects processors (or nodes, each consisting of a processor and its
local memory) in a circular fashion. Each node is connected to exactly two neighbours, forming a
closed loop [36, Fig 2.8a]. Data flows in one or both directions around the ring.
• Diagram:

This diagram shows switches (circles) and processors (squares) arranged in a ring, with bidirectional
links.

• Communication Efficiency:
o Multiple Simultaneous Communications: A ring is an improvement over a simple bus
system because it allows multiple communications to occur simultaneously, as long as they
do not conflict over the same link. For example, in an 8-node ring, it's possible for four
communications to happen concurrently if arranged carefully.
o Limitations: Despite allowing simultaneous communications, it's easy to create scenarios
where some processors must wait for others to complete their transmissions, which can still
limit overall efficiency.
o Bisection Width: The bisection width of a ring is 2. This measure represents the
minimum number of links that must be cut to divide the network into two equal halves. A
small bisection width indicates potential bottlenecks for communications that must cross the
divide.

2. Toroidal Mesh Interconnect:

• Structure: A toroidal mesh is a two-dimensional grid of processors (or nodes) where the connections
"wrap around" from one edge to the opposite edge, forming a torus (like a donut shape). Each
internal node is connected to four neighbours (North, South, East, West), and the wrap-around
connections connect nodes at the edges [36, Fig 2.8b].
• Communication Efficiency Improvement:
o Higher Connectivity: The toroidal mesh offers significantly greater connectivity than a
simple ring. Each switch typically supports five links (one to the processor, four to other
switches), compared to three links in a ring.
o Increased Simultaneous Communications: It's easier to achieve more simultaneous
communication patterns with a mesh than with a ring. This means that more messages can be
in transit concurrently without conflicting, leading to better overall throughput.
• Diagram:

This diagram shows processors and switches arranged in a grid, with connections that wrap around
the edges.

o Improved Bisection Width: For a square two-dimensional toroidal mesh with p = q^2 nodes
(where q is even), the bisection width is 2√p. For example, a 16-node (4x4) mesh would have
a bisection width of 2√16 = 8. This is substantially higher than a ring's bisection width of 2,
indicating that the toroidal mesh can support much more "cross-sectional" communication,
thus improving communication efficiency for large-scale data exchanges.
o Cost: While more expensive than a ring due to more complex switches and more links (2p
links for p processors compared to p links in a ring), the increased communication efficiency
often justifies the cost for larger parallel systems.

In summary, the toroidal mesh improves communication efficiency over a ring by providing higher
connectivity and a greater bisection width, enabling more simultaneous data transfers and reducing the
likelihood of communication bottlenecks.
7. With neat diagrams explain the working of a 3-dimensional hypercube network and generic
indirect networks for processor interconnection.

1. 3-Dimensional Hypercube Network (Direct Interconnect):

• Concept: A hypercube is a highly connected type of direct interconnect used in distributed-memory
systems. In a direct interconnect, each switch is directly connected to a processor-memory pair, and
the switches are also connected to each other. Hypercubes are built inductively, meaning they are
constructed by combining smaller hypercubes.
• Working/Structure:
o A one-dimensional hypercube is simply two processors directly connected [Fig 2.12a].
o A two-dimensional hypercube is built by taking two one-dimensional hypercubes and
joining their "corresponding" switches [Fig 2.12b]. Imagine two lines of two nodes each, then
connecting the ends of one line to the ends of the other to form a square.
o A three-dimensional hypercube is constructed from two two-dimensional hypercubes by
joining their "corresponding" switches [Fig 2.12c]. Picture two squares, one above the other,
with each corner of the top square connected to the corresponding corner of the bottom
square, forming a cube.
• Characteristics:
o A hypercube of dimension d has p = 2^d nodes (processors).
o Each switch in a d-dimensional hypercube is directly connected to its associated processor
and to d other switches. For a 3-dimensional hypercube, each switch connects to 3 other
switches and its processor.
o Connectivity: The bisection width of a hypercube is p/2. This is significantly higher than a
ring or toroidal mesh, meaning it can support a large amount of simultaneous communication
across the network.
o Cost: Hypercubes are more expensive to build than toroidal meshes because their switches
need to support more connections (d+1 wires for a processor and d switches).
• Diagram: Refer to Figure 2.12(c) , for a three-dimensional hypercube. The full figure (a), (b), and
(c) shows the inductive build-up.
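The inductive construction has a convenient binary description that is standard for hypercubes (though not spelled out in these notes): number the p = 2^d nodes in binary; two nodes are neighbours exactly when their numbers differ in one bit. A small C sketch that lists a node's neighbours:

#include <stdio.h>

/* Flipping bit i of a node's number gives its neighbour along dimension i. */
void hypercube_neighbours(unsigned node, unsigned d) {
    for (unsigned i = 0; i < d; i++)
        printf("node %u <-> node %u  (dimension %u)\n", node, node ^ (1u << i), i);
}

int main(void) {
    hypercube_neighbours(5, 3);   /* node 101 in a 3-dimensional (8-node) hypercube */
    return 0;
}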
2. Generic Indirect Networks for Processor Interconnection:

• Concept: Indirect interconnects offer an alternative to direct interconnects. The key difference is
that in an indirect interconnect, the switches are not necessarily directly connected to a processor.
Instead, they form a separate switching network that processes use to communicate.
• Working/Structure:
o They are often shown with unidirectional links (data flows in one direction only) [47, Fig
2.13].
o Each processor typically has an outgoing link to the switching network and an incoming link
from the switching network [47, Fig 2.13].
o The switching network routes data between processors. This provides flexibility in
communication but can introduce more hops and potential for contention depending on the
network's design.
• Diagram: Refer to Figure 2.13, for a generic indirect network. This diagram illustrates a collection
of processors connected to a central switching network.

8. Explain with a neat diagram to explain the working of a Crossbar and an Omega network.
1. Crossbar Network:

• Concept: A crossbar is a powerful switched interconnect that allows simultaneous communication
among multiple devices. It can be used in both shared-memory systems (with bidirectional links) and
distributed-memory systems (often with unidirectional links).
• Working/Structure:
o A crossbar consists of a grid of switches connecting inputs to outputs.
o For a shared-memory crossbar (e.g., connecting processors P to memory modules M): it
uses internal switches at the intersections of processor lines and memory lines [34, Fig 2.7a].
Each internal switch can be configured to connect a processor line to a memory line [34, Fig
2.7b].
o Simultaneous Communication: If there are at least as many memory modules as processors,
a crossbar allows multiple processors to access different memory modules simultaneously
without conflict. Conflict only occurs if two processors try to access the same memory
module at the exact same time.
o For a distributed-memory crossbar (often with unidirectional links): each processor has an
outgoing link to the crossbar and an incoming link from the crossbar [48, Fig 2.14]. This
allows any processor to send a message to any other processor simultaneously, as long as no
two processors try to send a message to the same destination processor.
o Cost: Crossbars are much faster than buses, but their cost is relatively high. A crossbar
connecting p inputs to p outputs generally requires p^2 switches, making large crossbars very
expensive and often impractical for very large systems.
• Diagrams:
o For a shared-memory crossbar: Refer Figure 2.7(a).
o For a distributed-memory crossbar: Refer Figure 2.14.

2. Omega Network:

• Concept: An Omega network is another type of indirect switched interconnect, often considered a
more cost-effective alternative to a full crossbar for large systems.
• Working/Structure:
o It is constructed using multiple stages of smaller switches, typically 2x2 crossbar switches
[Fig 2.15, Fig 2.16].
o These stages are interconnected in a specific pattern to route messages from any input to any
output.
• Communication Efficiency:
o Unlike a crossbar, an Omega network does not allow all possible simultaneous
communications. There can be conflicts. For example, if processor 0 sends a message to
processor 6, processor 1 might not be able to simultaneously send a message to processor 7
because their paths would conflict at an internal switch.
o Cost: The main advantage of an Omega network is its lower cost compared to a crossbar. For
p inputs/outputs, an Omega network uses (p/2) * log₂(p) of the 2x2 crossbar switches, for a total of
2p * log₂(p) simple switches, which is significantly less than the p^2 switches required by a crossbar.
o Bisection Width: The bisection width of an Omega network is p/2.
• Diagrams:
o For the Omega network structure: Figure 2.15
o For the internal 2x2 switch: Figure 2.16
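To make the cost comparison concrete, here is a small sketch (not from the notes) that evaluates the switch counts quoted above, p^2 for a crossbar versus (p/2)·log₂(p) 2x2 units (2p·log₂(p) simple switches) for an Omega network:

#include <stdio.h>
#include <math.h>

int main(void) {               /* compile with -lm */
    for (int p = 8; p <= 1024; p *= 4) {
        double stages    = log2((double)p);
        double crossbar  = (double)p * p;          /* simple switches in a crossbar   */
        double omega_2x2 = (p / 2.0) * stages;     /* 2x2 crossbar units in an Omega  */
        double omega_sw  = 2.0 * p * stages;       /* simple switches in an Omega     */
        printf("p = %4d : crossbar = %8.0f, omega = %6.0f units (%7.0f switches)\n",
               p, crossbar, omega_2x2, omega_sw);
    }
    return 0;
}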

9. Explain the concept of cache coherence and its approaches to explain how inconsistent data can
arise in a multiprocessor system with diagram.

Diagram: This diagram shows a shared-memory system with two cores, each having its own private data
cache, connected to main memory.

Concept of Cache Coherence: In a shared-memory multiprocessor system, each CPU core often has its
own private data cache. While caches significantly speed up memory access for individual cores, they
introduce a problem called cache coherence.

• Problem: The cache coherence problem arises when multiple caches store copies of the same shared
variable. If one core updates its copy of the variable in its cache, other cores that also have a copy of
that variable in their caches will not "see" this update. This leads to inconsistent data across the
system, where different cores hold different values for the same memory location.
• Why it arises: Programmers do not have direct control over CPU caches. The system hardware
manages them. This means a programmer cannot ensure that when one core updates a cached
variable, the cached values held by other processors are also updated.

How Inconsistent Data Can Arise (Example): Consider a shared-memory system with two cores, each
with its private cache [55, Fig 2.17]. Let x be a shared variable initialized to 2.

• Scenario:
o Time 0: Core 0 executes y0 = x; and Core 1 executes y1 = 3*x;. Both cores might load x=2 into
their respective caches. So, y0 becomes 2 and y1 becomes 6.
o Time 1: Core 0 updates x = 7;. This update happens in Core 0's cache.
o Time 2: Core 1 executes z1 = 4*x;.
• Inconsistency: If Core 1 still has the old value x=2 in its cache and doesn't "see" Core 0's update to
x=7, then z1 will incorrectly become 4 * 2 = 8. The expected value, reflecting the update by Core 0,
would be 4 * 7 = 28.
• This problem occurs regardless of whether a write-through (main memory updated) or write-back
(main memory not immediately updated) cache policy is used. In write-through, main memory is
updated, but Core 1's cache copy is still stale. In write-back, the update might not even be in main
memory for Core 1 to potentially see.

Approaches to Ensuring Cache Coherence:

1. Snooping Cache Coherence:


o Idea: This approach is common in bus-based shared-memory systems. When a core updates a
shared variable in its cache, it broadcasts this information (or the cache line containing the
variable) across the shared bus.
o Working: Other cores "snoop" (monitor) the bus. If a core sees that a cache line it holds has
been updated by another core, it marks its own copy of that cache line as invalid. The next
time that core needs the variable, it will have to fetch the updated value from main memory
or another cache.
o Requirement: The interconnect must support broadcasts from each processor to all others.
o Scalability Issue: Snooping is not scalable to very large systems because broadcasts become
very expensive and slow in large networks.
2. Directory-Based Cache Coherence:
o Idea: This approach uses a directory (a data structure) to store the status of each cache line
in the system.
o Working: The directory is typically distributed, with each core/memory pair responsible for
the part of the directory that tracks cache lines in its local memory.
▪ When a cache line is read into a core's cache, the directory entry for that line is
updated to indicate which core(s) have a copy.
▪ When a shared variable is updated, the directory is consulted. Only the cache
controllers of the specific cores that have a copy of that variable's cache line are
contacted to invalidate their copies. This avoids costly broadcasts.
o Characteristics: Requires substantial additional storage for the directory. However, it is
more scalable than snooping because only relevant cores are contacted upon an update,
reducing network traffic.

False Sharing: A related issue that can degrade performance (though not cause incorrect results) is false
sharing. Since caches operate on cache lines (blocks of memory) rather than individual variables, if two
cores access different, unrelated variables that happen to reside in the same cache line, that cache line will
be repeatedly invalidated and fetched by both cores. This happens even if the cores are not truly sharing the
data items but only the cache line they are part of, leading to many unnecessary main memory accesses and
poor performance. It can be reduced by using temporary local storage.
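A minimal C/OpenMP sketch (not from the notes) of how false sharing arises and how the temporary-local-storage fix removes it; the chunking scheme and counting task are illustrative assumptions:

#include <omp.h>

#define NTHREADS 4
#define N (1 << 20)

long counts[NTHREADS];   /* distinct variables, but adjacent: they share cache lines */

/* Each thread counts the even values in its own chunk of data. Every increment
   writes the shared cache line holding counts[], invalidating it in the other
   cores' caches (false sharing).                                               */
void count_even_false_sharing(const int *data) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int me = omp_get_thread_num(), chunk = N / NTHREADS;
        for (int i = me * chunk; i < (me + 1) * chunk; i++)
            if (data[i] % 2 == 0)
                counts[me]++;
    }
}

/* Same computation, but each thread accumulates in a private local variable and
   writes counts[me] only once, so the line is not bounced between caches.      */
void count_even_local(const int *data) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int  me = omp_get_thread_num(), chunk = N / NTHREADS;
        long local = 0;
        for (int i = me * chunk; i < (me + 1) * chunk; i++)
            if (data[i] % 2 == 0)
                local++;
        counts[me] = local;
    }
}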

10. Explain how non determinism can lead to race condition in MIMD systems and how mutex locks
prevent them.

Nondeterminism in MIMD Systems:

• Concept: In Multiple Instruction, Multiple Data (MIMD) systems, particularly those where
processors execute asynchronously (at their own pace), nondeterminism is likely to occur. A
computation is nondeterministic if, for the same input, it can produce different outputs on different
runs.
• Cause: This happens because independent threads or processes execute statements at varying
relative rates from run to run. The exact order in which operations complete cannot be predicted.
• Example: If two threads print a private variable, the output order can vary between runs, or even
interleave, because the operating system schedules their output operations unpredictably.
How Nondeterminism Leads to Race Conditions:

• Problem: While nondeterminism isn't always harmful (like varying print order), it can be disastrous
when threads or processes attempt to access and modify shared resources simultaneously. This can
lead to incorrect program results.
• Race Condition Definition: A race condition occurs when multiple threads or processes try to
access a shared resource at the same time, and the outcome of the computation depends on the
unpredictable order (which thread "wins the race") in which these accesses occur.
• Example (x += my_val):
o Suppose two threads (Thread 0 and Thread 1) want to add their private values ( my_val) to a
shared variable x, initially 0.
o The operation x += my_val is not atomic; it involves multiple steps:
1. Load x into a register.
2. Load my_val into a register.
3. Add my_val to x in the register.
4. Store the result back into x in memory.
o Race Scenario:
▪ Core 0 loads x = 0.
▪ Core 1 loads x = 0.
▪ Core 0 adds my_val = 7 to its register copy of x (result 7).
▪ Core 1 adds my_val = 19 to its register copy of x (result 19).
▪ Core 0 stores x = 7 to memory.
▪ Core 1 stores x = 19 to memory.
o Incorrect Result: The final value of x becomes 19, losing Thread 0's update, instead of the
correct sum 7 + 19 = 26. This unpredictable outcome is a race condition.

How Mutex Locks Prevent Race Conditions:

• Atomic Operations and Critical Sections: To prevent race conditions, the operations on shared
resources must be atomic (appearing to complete without interruption from other threads). This is
achieved by defining a critical section: a block of code that only one thread can execute at a time.
• Mutex (Mutual Exclusion Lock): The most common mechanism for ensuring mutual exclusion in
critical sections is a mutex lock (or simply "lock").
o A mutex is a special object, supported by hardware, that acts as a guard for a critical section.
o Working:
1. Before entering a critical section, a thread must obtain (lock) the mutex by calling a
Lock function.
2. If the mutex is already owned by another thread, the calling thread will wait until it
becomes available.
3. Once the thread successfully obtains the lock, it can execute the code in the critical
section (e.g., x += my_val).
4. After completing the critical section, the thread must relinquish (unlock) the mutex
by calling an Unlock function.
o Prevention: This ensures that only one thread can execute the code within the critical section
at any given time, thereby preventing race conditions. The order in which threads acquire the
lock is not predetermined, but the access to the shared resource is serialized.
• Example with Mutex:

my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);      // Acquire lock
x += my_val;                 // Critical section
Unlock(&add_my_val_lock);    // Release lock

This guarantees that the x += my_val operation from one thread completes fully before another thread
can start its x += my_val operation, thus ensuring the correct sum.
• Consideration: Using mutexes enforces serialization of the critical section. To maintain good
parallel performance, critical sections should be as few and as short as possible.
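A concrete version of this pattern using the Pthreads API from Question 1 (a sketch, not taken from the notes; the per-thread values are arbitrary):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

long x = 0;                                       /* shared variable  */
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

void *add_my_val(void *arg) {
    long my_val = (long)arg + 7;                  /* stand-in for Compute_val(my_rank) */

    pthread_mutex_lock(&x_lock);                  /* acquire lock     */
    x += my_val;                                  /* critical section */
    pthread_mutex_unlock(&x_lock);                /* release lock     */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long r = 0; r < NTHREADS; r++)
        pthread_create(&tid[r], NULL, add_my_val, (void *)r);
    for (int r = 0; r < NTHREADS; r++)
        pthread_join(tid[r], NULL);
    printf("x = %ld\n", x);                       /* always 7+8+9+10 = 34 */
    return 0;
}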

11. Explain message-passing and partitioned global address space languages in distributed-memory
systems to show how two processes can exchange data without deadlock.

In distributed-memory systems, each processing core has its own private memory and cannot directly
access another core's memory. Therefore, processes must explicitly communicate to exchange data. Two
common programming models for this are message-passing and Partitioned Global Address Space (PGAS)
languages.

1. Message-Passing:

• Concept: Message-passing APIs provide functions for processes to explicitly send and receive data
to and from each other. Processes identify each other by ranks (e.g., 0, 1, ..., p-1). MPI (Message-
Passing Interface) is the most widely used API for this.
• How two processes exchange data (without deadlock in a simple scenario):
o A message-passing program typically involves a Send function and a Receive function.
o Example Pseudocode:

// Process 1 sends a message to Process 0
char message[100];
int my_rank;

my_rank = get_rank();  // Each process gets its unique rank

if (my_rank == 1) {
    sprintf(message, "greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);     // Send to rank 0
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);  // Receive from rank 1
    printf("Process 0 > Received: %s\n", message);
}

o Working: In this example, Process 1 creates a message and calls Send to transmit it to Process
0. Process 0 simultaneously calls Receive to wait for a message from Process 1.
o Blocking Behavior and Deadlock Avoidance (Simple Case):
▪ The most common behavior for Receive is to block (wait) until the message is fully
received.
▪ For Send, it can either block until the Receive operation has started receiving the data,
or it can copy the message into internal storage and return immediately.
▪ In the simple example above, a matching Send and Receive pair, where one process
sends and the other receives, allows for data exchange. If both Send and Receive are
blocking calls, they will wait for each other. This explicit pairing ensures that data is
exchanged correctly without a deadlock in this straightforward one-to-one
communication. For more complex patterns, careful ordering of Send and Receive
operations is crucial to avoid deadlocks (e.g., if Process A sends to B and B sends to
A, both might block indefinitely if they both try to send before anyone receives).
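For reference, a minimal MPI rendering of the pseudocode above (a sketch; MPI_Send and MPI_Recv are the standard blocking calls, and the tag value 0 is arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    char message[100];
    int  my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    if (my_rank == 1) {
        sprintf(message, "greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }

    MPI_Finalize();
    return 0;
}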

2. Partitioned Global Address Space (PGAS) Languages:

• Concept: PGAS languages aim to combine the programming convenience of shared-memory models
with the underlying performance characteristics of distributed-memory hardware. They provide a
single global address space, but the programmer is given tools to control the distribution of data
within this shared space.
• How they work for data exchange:
o Addressing the "Remote Access" Problem: A naive approach to a global address space on
distributed memory would lead to unpredictable and poor performance because accessing
"remote memory" (memory attached to another core) is vastly slower than accessing "local
memory" (memory attached to the executing core).
o Programmer Control: PGAS languages solve this by making the programmer aware of data
locality.
▪ Private variables are allocated in the local memory of the core where the process
runs.
▪ The distribution of shared data structures (like arrays) is explicitly controlled by
the programmer. This means the programmer knows which parts of a shared array are
in which process's local memory.
o Data Exchange without Explicit Deadlock: Instead of explicit Send/Receive calls, a process
can access a shared variable. If that variable is in another process's "local" memory (but within
the global address space), the system handles the communication (often referred to as one-
sided communication or remote memory access). The programmer, by carefully arranging
data, ensures that operations primarily access local memory for performance, implicitly
exchanging data when remote access is necessary.
o Example (Vector Addition):

shared int n = ...;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

// ... initialize my_first_element, my_last_element based on process rank ...

// If the x and y elements for each process are in its local memory
for (i = my_first_element; i <= my_last_element; i++)
    x[i] += y[i];

If the assigned elements of x and y are allocated so they reside in the local memory of the
core running the process, this code will be very fast. The "exchange" of data is conceptually
handled by the global address space, with the programmer ensuring efficient local access.

o While PGAS languages simplify the communication model by providing a global view, they
still require careful programming to ensure data locality and avoid slow remote memory
accesses for good performance.

12. What is parallel computing? Explain where parallel programs need to be written and briefly why
they are better than sequential programs.

What is Parallel Computing? Parallel computing involves performing multiple computations or tasks
simultaneously using multiple processing units (like CPU cores or GPUs). The goal is to solve a single
larger problem faster or handle larger datasets by dividing the work among these units, allowing them to
execute parts of the problem in parallel. This is a departure from the traditional von Neumann model of
sequential computation, which executes one instruction and computes one data value at a time.

Where Parallel Programs Are Needed to Write: Parallel programs are essential in modern computing for
several key reasons:

1. Exploiting Modern Hardware: Almost all modern computing devices, from mobile phones and
tablets to desktops and servers, now use multicore processors. To get the most performance out of
these systems, programs must be written to use multiple cores simultaneously.
2. Increased Application Performance: We can no longer rely solely on hardware advancements and
compilers to automatically provide steady increases in application performance for single-core
execution. To achieve routine increases in application speed and power, software developers must
learn to write parallel applications.
3. Solving Large-Scale Problems: Parallel computing is crucial for problems that demand vast
amounts of data or computation. Examples include:
o Scientific simulations (e.g., climate modeling, molecular dynamics).
o Weather forecasting.
o Large-scale engineering models.
o Image and video processing (heavily uses GPUs which employ SIMD parallelism).
o Many common applications like Excel, Photoshop, and web browsers (Chrome) already use
multiple cores to improve responsiveness and performance.
4. Handling Data-Parallel Workloads: For problems that involve applying the same operation to
large sets of data (e.g., adding two large arrays), parallel computing, particularly SIMD systems like
vector processors and GPUs, is highly efficient.

Why Parallel Programs Are Better Than Sequential Programs:

1. Speedup and Faster Execution: The most significant advantage is increased speed. By dividing a
task among multiple processors, a parallel program can theoretically complete the work much faster
than a sequential program. If a serial program takes time $T_{serial}$, a perfectly parallelized program on $p$
cores could run in $T_{serial}/p$ time, achieving linear speedup.
2. Ability to Solve Larger and More Complex Problems: Parallel systems, especially distributed-
memory ones, can aggregate memory and computational power far beyond what a single machine
can offer. This allows them to tackle problems that would be impossible for a sequential program due
to time or memory constraints. For many problems, as the problem size increases, the portion that
can be parallelized grows faster than the inherently serial portion, leading to significant real-world
speedups (Gustafson's Law).
3. Efficient Resource Utilization: Parallel programs aim to keep multiple processing units busy,
leading to better overall utilization of the available hardware resources (reflected in high efficiency).
While there are overheads associated with parallelism (like communication and synchronization), for
suitable problems, the benefits of concurrent execution far outweigh these costs.
4. Responsiveness in Interactive Applications: In user-facing applications, parallelism can make
software more responsive by performing background tasks concurrently with user interaction.

13. Describe the classifications of parallel computers and explain hardware multithreading with an
example.

Classifications of Parallel Computers: Parallel computers are classified in several ways, primarily based
on their instruction and data streams, and how their cores access memory.

1. Flynn’s Taxonomy: This classification categorizes computers based on the number of instruction
streams (IS) and data streams (DS) they can handle simultaneously:

• SISD (Single Instruction Stream, Single Data Stream):


o A classical von Neumann system. It executes one instruction at a time on one data item at a
time. This is a serial computer.
• SIMD (Single Instruction Stream, Multiple Data Streams):
o These systems apply the same instruction to multiple data items simultaneously. They
have a single control unit broadcasting instructions to multiple datapaths. Examples include
vector processors and Graphics Processing Units (GPUs).
• MIMD (Multiple Instruction Streams, Multiple Data Streams):
o These systems support multiple independent instruction streams operating on multiple
data streams simultaneously. They consist of multiple independent processing units (cores),
each with its own control unit and datapath. MIMD systems are typically asynchronous.

2. Memory Access Classification: This classification distinguishes parallel computers based on how their
processing cores access memory:
• Shared-Memory Systems:
o All processing cores can access a common block of memory locations. Cores coordinate
their work by implicitly modifying shared data structures. Examples include multicore
processors and UMA/NUMA systems.
• Distributed-Memory Systems:
o Each processing core has its own, private memory. Cores communicate and coordinate their
work by explicitly sending messages across an interconnection network. Clusters of
computers are a common example.

Hardware Multithreading with an Example (Simultaneous Multithreading):

Hardware Multithreading is a technique where the hardware is designed to allow multiple threads to share
the execution resources of a single core. This helps keep the processing units busy and improve overall
throughput. The source primarily discusses Simultaneous Multithreading (SMT).

• Explanation of Simultaneous Multithreading (SMT):


o SMT is a variation of fine-grained multithreading that is designed to exploit superscalar
processors.
o Superscalar processors have multiple functional units (e.g., for arithmetic, loading data,
storing data) that can operate in parallel. SMT allows multiple threads to make use of these
multiple functional units concurrently.
o The goal is to prevent processor stalls. If one thread encounters a delay (e.g., waiting for data
from memory), another thread can use the available functional units, keeping the processor
busy.
o By designating "preferred" threads (those with many instructions ready to execute), SMT can
somewhat reduce the problem of thread slowdown.
o This form of parallelism is usually not visible to the programmer; the hardware manages it
automatically to improve core utilization.
• Example (from GPU context, which heavily uses hardware multithreading): While the source
mentions SMT generally, the most detailed example of hardware multithreading in action is given in
the context of GPUs.
o GPUs heavily rely on hardware multithreading to avoid stalls on memory accesses.
o Working Principle: A GPU processor can run hundreds or thousands of threads. These
threads are organized into groups (e.g., SIMD groups or warps).
o When one group of threads (or a single thread within that group) is stalled (e.g., waiting for
data to be fetched from slow global memory, or for a previous instruction to complete), the
hardware scheduler can immediately switch to and execute instructions from another
ready thread group.
o Benefit: This rapid context switching, managed by low-overhead hardware schedulers, keeps
the numerous functional units of the GPU busy. It effectively hides the latency of memory
accesses or other delays, leading to much higher throughput than if the processor had to wait
for the stalled thread to resume. This is a concrete example of how hardware multithreading
ensures continuous utilization of parallel hardware.
MODULE-2
GPU programming, Programming hybrid systems, MIMD systems, GPUs, Performance – Speedup and
efficiency in MIMD systems, Amdahl’s law, Scalability in MIMD systems, Taking Timings of MIMD programs,
GPU performance.
1. Apply the principles of GPU programming to explain why minimizing communication between
CPU and GPU is essential for better performance.

GPU Programming Principle: GPU programming is fundamentally heterogeneous programming
because it involves coordinating two different types of processors: a CPU (host) and a GPU (device).

• CPU (Host): This is the general-purpose processor that runs the operating system and manages
system services. In GPU programming, the CPU is responsible for:
o Allocating and initializing memory on both the CPU and the GPU.
o Launching GPU programs (often called kernels).
o Collecting and outputting the results after the GPU has finished its computation.
• GPU (Device): This is the highly parallel, specialized processor designed for compute-intensive
work. It does not usually run an operating system or directly manage system services like disk I/O.

Why Minimizing Communication is Essential: The critical reason to minimize communication (data
transfer) between the CPU and GPU is due to their separate memory spaces and the inherent latency
involved in moving data between these distinct components.

1. Separate Memory: The memory for the CPU host and the GPU memory are physically separate.
This means that any data needed by the GPU that originates from the CPU (or vice-versa) must be
explicitly copied across the system's interconnect (e.g., PCIe bus).
2. High Latency and Limited Bandwidth: While not explicitly detailed as "slow" in every instance,
the architectural separation implies that data transfer between the CPU's main memory and the
GPU's global memory is significantly slower than operations performed locally on either device.
Moving large amounts of data back and forth incurs substantial latency (time for data to start
arriving) and can consume significant bandwidth (rate of data transfer) that could otherwise be used
for computation.
3. Overhead of Data Transfer: Every time data is moved, there's an overhead. For example, the CPU
must initiate the transfer, and the GPU must receive it. This overhead reduces the actual time
available for parallel computation on the GPU.

Impact on Performance: If a GPU program frequently transfers data between the CPU and GPU, the
performance gains from the GPU's massive parallel processing power can be severely negated. The GPU
may spend a large portion of its time waiting for data to arrive from the CPU or waiting for its results to be
transferred back. This is especially problematic for applications that have a low "compute-to-data" ratio,
where the amount of computation per data item is small compared to the data transfer cost.

Therefore, efficient GPU programming aims to:

• Transfer necessary data to the GPU once at the beginning of the computation.
• Perform as much computation as possible entirely on the GPU.
• Transfer results back to the CPU only when necessary at the end. This strategy ensures that the
GPU's powerful parallel capabilities are primarily used for computation rather than waiting for data.
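A hedged host-side sketch of this strategy in C using the CUDA runtime API (the kernel itself is hypothetical and its CUDA launch is only indicated in a comment; buffer sizes are arbitrary):

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
    const int    n     = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);          /* host (CPU) buffer   */
    float *d_x = NULL;                            /* device (GPU) buffer */
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    cudaMalloc((void **)&d_x, bytes);

    /* 1. Copy the input to the GPU once, at the start. */
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    /* 2. Do as much work as possible on the GPU, reusing d_x across many
          kernel launches without touching host memory, e.g.
          my_kernel<<<grid, block>>>(d_x, n);   (hypothetical kernel)     */

    /* 3. Copy the result back once, at the end. */
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    free(h_x);
    return 0;
}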

2. Write the given rules to decide which process or thread should handle standard input in a parallel
program running on a shared-memory MIMD system.
In parallel programs, especially in shared-memory MIMD systems, managing standard input (stdin) can be
complex due to the potential for nondeterminism when multiple threads try to access it simultaneously. To
address these issues and ensure predictable behaviour, specific rules are followed for handling I/O:

For a parallel program running on a shared-memory MIMD system, the rule for handling standard input
(stdin) is:

• Only the master thread or thread 0 will access stdin.

This rule ensures that a single, designated thread is responsible for reading input, preventing
nondeterministic behaviour or conflicts that could arise if multiple threads attempted to read from the same
input stream simultaneously.

Additionally, while not directly about standard input, other relevant I/O rules for parallel programs
(including shared-memory systems) are:

• All processes/threads can access stdout and stderr for output.


• However, due to the nondeterministic order of output to stdout, in most cases, only a single
process/thread will be used for all output to stdout (except for debugging purposes).
• When debugging, output should always include the rank or ID of the process/thread generating the
output.
• Only a single process/thread will attempt to access any single file other than stdin, stdout, or stderr.
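A small C/OpenMP sketch (not from the notes) of the stdin rule: only the master thread (thread 0) reads, and a barrier makes sure the value is available before any thread uses it. Printing from every thread here is purely debugging-style illustration, which is why the thread ID is included:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 0;                        /* shared: visible to all threads */

    #pragma omp parallel
    {
        #pragma omp master            /* only thread 0 touches stdin    */
        {
            printf("Enter n: ");
            scanf("%d", &n);
        }
        #pragma omp barrier           /* no thread uses n before it is read */

        printf("Thread %d sees n = %d\n", omp_get_thread_num(), n);
    }
    return 0;
}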

3. Explain the concepts of speedup and efficiency to determine processor utilization in a parallel
program, and show how speedup and efficiency vary with different problem sizes.

In parallel programming, speedup and efficiency are key metrics used to evaluate the performance and
processor utilization of a parallel program.

1. Speedup (S):

• Concept: Speedup measures how much faster a parallel program runs compared to its corresponding
serial (sequential) program. It is the ratio of the time taken by the best serial program ($T_{serial}$)
to the time taken by the parallel program ($T_{parallel}$).
• Interpretation:
o If a program achieves linear speedup, it means $S = p$, where $p$ is the number of
processors/cores used. This is the ideal scenario where a program runs $p$ times faster on
$p$ cores.
o In practice, perfect linear speedup is rarely achieved because parallel programs introduce
overhead, such as communication between processes, synchronization (e.g., using mutexes),
and load imbalance. These overheads are not present in serial programs and tend to increase
as the number of processors ($p$) increases.

2. Efficiency (E):

• Concept: Efficiency measures how well the available processors/cores are being utilized by the
parallel program. It indicates the average fraction of time each core spends doing useful work on the
original problem, rather than dealing with parallel overhead.
• Interpretation:
o Efficiency ranges from 0 to 1 (or 0% to 100%). An efficiency of 1 (or 100%) indicates perfect
utilization, where there is no parallel overhead.
o If $E < 1$, it means some time is being wasted due to parallel overhead.
o Efficiency can be seen as the average utilization of the parallel cores on solving the problem,
with the remainder of the parallel run-time being the parallel overhead.
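In symbols, using the notation above, the standard definitions are $S = T_{serial} / T_{parallel}$ and $E = S / p = T_{serial} / (p \cdot T_{parallel})$. Linear speedup corresponds to $S = p$, i.e., $E = 1$.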

How Speedup and Efficiency Vary with Different Problem Sizes:

Speedup and efficiency are not constant; they also depend on the problem size.

• Increasing Problem Size: When the problem size is increased (while keeping the number of
processors fixed), both the speedup and efficiency generally increase.
o Reason: This often happens because the time spent on the original, useful computation
($T_{serial}$) grows much faster than the time spent on parallel overhead
($T_{overhead}$). The overheads (like communication setup or synchronization) might be
relatively fixed or grow slowly, becoming a smaller fraction of the total execution time as the
problem size increases. Therefore, the cores spend a larger proportion of their time doing
useful work, leading to better utilization.
o Example (Table 2.5): The sources provide an example where doubling the problem size
significantly improves efficiency, from 0.68 for the original size to 0.89 with 16 cores.
• Decreasing Problem Size: Conversely, when the problem size is decreased, the speedup and
efficiency generally decrease.

o Reason: The overheads become a more significant portion of the total execution time,
leading to lower processor utilization and less impressive speedups.
o Example (Table 2.5): Halving the problem size in the example shows a drop in efficiency
from 0.68 to 0.39 with 16 cores.

Diagrams:

• Refer to Table 2.4 for an example of speedups and efficiencies.
• Refer to Table 2.5 for an example of speedups and efficiencies on different problem sizes.
• Refer to Figure 2.18 for a plot showing speedups of a parallel program on different problem sizes.
• Refer to Figure 2.19 for a plot showing efficiencies of a parallel program on different problem sizes.

4. Explain the concept of scalability to explain how increasing the number of processors in a MIMD

system affects program performance.

Concept of Scalability: Informally, a parallel program is scalable if, by increasing the power of the system
it runs on (e.g., increasing the number of cores), we can obtain speedups over its performance on a less
powerful system.

More formally, in discussions of MIMD parallel program performance, scalability has a specific
definition:

• A program is considered scalable if, when the number of processes/threads is increased, a
corresponding rate of increase in the problem size can be found such that the program always
maintains the same efficiency (E).

How Increasing the Number of Processors Affects Program Performance (and Scalability):

Increasing the number of processors ($p$) in an MIMD system typically affects program performance in the
following ways:
1. Potential for Higher Speedup: With more processors, more work can be done concurrently, leading
to faster execution and higher speedup. Ideally, doubling the processors might double the speedup.
2. Increased Parallel Overhead: However, simply increasing processors does not guarantee linear
speedup. Parallel overheads (like communication, synchronization, and load imbalance) usually
increase as the number of processors grows.
o More processors often mean more communication to exchange data.
o More processors often mean more contention for shared resources or more complex
synchronization mechanisms.
o It becomes harder to perfectly balance the workload across a very large number of processors.
3. Impact on Efficiency: Because parallel overheads tend to increase with $p$, the efficiency ($E$) of
a parallel program usually decreases as the number of processors increases for a fixed problem
size. This means each processor spends a smaller fraction of its time doing useful work on the
original problem and more time on overhead.

Scalability and Performance Relationship: The concept of scalability explains how this trade-off between
increased processors and increased overhead can be managed to maintain performance.

• Weak Scalability: This is when the efficiency of a program can be kept constant by increasing the
problem size at the same rate as the number of processes/threads. This is a more common and
realistic form of scalability in large-scale parallel computing.
o This implies that for a parallel program to maintain its performance characteristics
(efficiency) as we add more processors, the workload must also grow proportionally. This
allows the useful computation time to increase, making the overheads a relatively smaller
portion of the total time.
• Strong Scalability: This is a much rarer and more difficult type of scalability to achieve. A program
is strongly scalable if its efficiency remains fixed even when the number of processes/threads
increases without changing the problem size. This is rare because overheads almost always grow to
some extent as more processors are introduced, making it difficult to maintain constant efficiency for
a fixed amount of work.
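For a concrete sketch, consider a simple hypothetical cost model (not measured data): suppose T_serial = n and T_parallel = n/p + 1, where the "+1" represents a fixed parallel overhead. Then S = n / (n/p + 1) and E = S/p = n / (n + p). If p is doubled while n stays fixed, the efficiency drops from n/(n + p) to n/(n + 2p), so the program is not strongly scalable under this model. If, however, n is doubled whenever p is doubled, the efficiency becomes 2n/(2n + 2p) = n/(n + p), which is unchanged, so the program is weakly scalable.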

In essence, scalability helps us understand that merely adding more processors to an MIMD system doesn't
automatically improve performance linearly. For sustained performance gains, especially in terms of
efficiency, the problem size often needs to be considered and sometimes scaled along with the number of
processors to offset the increasing parallel overheads.

5. Interpret the role of barriers and elapsed times in timing parallel programs and justify with reasons
why the minimum run-time is preferred.

When timing parallel programs, especially in MIMD (Multiple Instruction, Multiple Data) systems, it is
crucial to accurately measure the performance of the parallelised sections of code. This involves
understanding the roles of barriers and elapsed times, and why minimum run-time is typically reported.

1. Role of Barriers:

• Purpose: In parallel programs, different processes or threads execute asynchronously (at their own
pace). To get an accurate measurement of how long a parallel section of code takes, all
processes/threads need to start timing that section at roughly the same moment.
• Working: A barrier is a synchronisation primitive. When a parallel program calls a barrier function
(e.g., Barrier() in pseudocode), all participating processes/threads must reach this point in the code
before any of them are allowed to proceed past it. While an ideal barrier would have all processes
return simultaneously, it usually guarantees that all have at least started the call when the first one
returns.
• Benefit: By placing a barrier just before the start timer call, we ensure that all processes/threads begin
the timed computation together, leading to a more consistent and meaningful measurement of the
parallel work.
2. Role of Elapsed Times:

• Type of Time: When evaluating parallel programs, we are primarily interested in "wall clock" time
(also known as real time or elapsed time), not "CPU time".
o Wall clock time measures the actual time that passes from the start of a code section to its
finish, as if measured by a stopwatch. This includes both the time spent actively computing
and any time spent waiting (e.g., for messages, for synchronisation, or for other threads to
finish).
o CPU time only accounts for the time when the processor is actively executing instructions,
ignoring idle or waiting periods. For parallel programs, waiting is a significant and real cost,
so excluding it would give a misleading picture of performance.
• Measurement: Each individual process/thread records its own start time (my_start) and end time
(my_finish) for the code section it executes, calculating its my_elapsed time.
• Global Measurement: Since processes/threads might finish at different times, the actual parallel
run-time is the time from when the first process/thread began execution to when the last
process/thread finished. This is captured by taking the maximum of all the individual my_elapsed
times using a global_max function (or similar collective operation). This maximum value is then
reported as the overall elapsed time for the parallel section.
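A minimal MPI sketch of this timing pattern is shown below (the workload in do_work is a hypothetical stand-in, and MPI_Reduce with the MPI_MAX operation plays the role of the global_max function):

/* timing.c -- compile: mpicc timing.c -o timing ; run: mpiexec -n 4 ./timing */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for the parallel section being timed. */
static double do_work(void) {
    double sum = 0.0;
    for (long i = 1; i <= 10000000L; i++)
        sum += 1.0 / (double)i;
    return sum;
}

int main(int argc, char* argv[]) {
    int my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* Barrier: every process starts timing at (roughly) the same moment. */
    MPI_Barrier(MPI_COMM_WORLD);
    double my_start = MPI_Wtime();            /* wall clock time, not CPU time */

    double result = do_work();

    double my_finish  = MPI_Wtime();
    double my_elapsed = my_finish - my_start;

    /* The parallel run-time is the elapsed time of the slowest process. */
    double elapsed = 0.0;
    MPI_Reduce(&my_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (my_rank == 0)
        printf("Elapsed time = %e seconds (checksum %f)\n", elapsed, result);

    MPI_Finalize();
    return 0;
}

In practice, this timed section would be run several times and only the minimum of the reported elapsed times kept, for the reasons given in point 3 below.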

3. Justification for Preferring Minimum Run-time:

• Variability: When a parallel program is run multiple times, even with the same input and on the
same system, the measured elapsed times are likely to vary.
• External Factors: This variability is often due to external factors such as operating system
activities, background processes, network fluctuations, or other users on a shared system. These
external events can only slow down a program's execution; they cannot make it run faster than its
inherent best possible speed.
• Best-Case Performance: Therefore, the minimum run-time over several executions is reported. Because outside events can only add time, the run with the least interference is the closest estimate of the program's inherent speed. A mean or median, by contrast, folds in these external slowdowns, which say more about the system's load during the experiment than about the program's optimal capability.

6. Explain GPU performance to show whether the evaluation measures used in MIMD systems, such
as speedup, efficiency, and scalability, are applicable to GPU or not.

Evaluating GPU performance requires a different perspective than traditional MIMD (Multiple Instruction,
Multiple Data) systems, primarily because GPUs are fundamentally different in their architecture and
operating principles.

1. Speedup:

• MIMD Systems: For MIMD systems, speedup is typically calculated by comparing the parallel
execution time on p cores to the execution time of the best serial program run on a single core of the
same design. Linear speedup ($S=p$) is the ideal target.
• GPU Performance: While it is common to report speedups of GPU programs over serial CPU
programs or even parallel MIMD programs, the concept of "linear speedup" in the MIMD sense is
not directly applicable to GPUs.
o Reason: GPU cores are inherently parallel and fundamentally different from conventional
CPU cores. They are designed for massive data-parallel tasks using a SIMD-like model.
Directly comparing them as if they were simple extensions of a single CPU core for linear
speedup calculation is not meaningful.
• Applicability: No, the MIMD concept of linear speedup relative to a same-type serial core is not
applicable. General speedup numbers are still reported but should be interpreted carefully.
2. Efficiency:

• MIMD Systems: Efficiency measures how well the available processors are utilised, calculated as
$E = S/p$. It represents the average fraction of time each core spends on useful work, assuming
comparison against a single core of the same type.
• GPU Performance: Similar to speedup, efficiency is ordinarily not used in discussions of GPU
performance in the same way it is for MIMD systems.
o Reason: Since GPU cores are inherently parallel and distinct from serial CPU cores, defining
what "100% utilisation" means for a single GPU core in the context of a serial benchmark
doesn't make sense. The concept of comparing "parallel vs. serial speed per core" as a
measure of efficiency becomes problematic for GPUs.
• Applicability: No, efficiency is generally not a meaningful evaluation measure for GPUs.

3. Scalability:

• MIMD Systems (Formal Definition): A parallel program is formally scalable if, when the number
of processes/threads is increased, a corresponding rate of increase in the problem size can be found
such that the program always maintains the same efficiency.
• GPU Performance: Because the formal definition of efficiency (as used in MIMD systems) does
not apply to GPUs, the formal definition of scalability for MIMD programs cannot be applied to
GPU programs either.
o Informal Usage: However, the informal usage of scalability is routinely applied to GPUs.
A GPU program is considered scalable if, by increasing the size or power of the GPU (e.g.,
adding more cores, increasing memory bandwidth), one can obtain speedups over the
program's performance on a smaller or less powerful GPU.
• Applicability: The formal MIMD definition is not applicable. The informal understanding of
scalability (better performance with more GPU resources) is widely used.

4. Amdahl's Law:

• MIMD Systems: Amdahl's Law states that if a fraction 'r' of a serial program is inherently
unparallelised, the maximum possible speedup is limited to $1/r$, regardless of the number of
processors.
• GPU Performance: Amdahl's Law can be applied to GPU programs if the inherently serial part
of the program is executed on a conventional, serial CPU processor. In this scenario, the resulting
upper bound on possible speedup will be the same as for an MIMD program.
o Caveats: The same caveats that apply to Amdahl's Law in MIMD systems also apply to
GPUs: the serial fraction often decreases with increasing problem size, and many GPU
programs still achieve huge speedups. Even small speedups can be valuable.
• Applicability: Yes, if the serial portion runs on a CPU.
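As a hypothetical illustration: if 10% of a program's run-time must remain serial and is executed on the host CPU, Amdahl's Law bounds the speedup of the combined CPU-and-GPU program at 1/0.10 = 10, no matter how many GPU cores work on the remaining 90%.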

In summary, while absolute speedup figures are often reported for GPUs, the nuanced MIMD measures of
linear speedup, efficiency, and formal scalability are generally not directly applicable due to the inherent
architectural differences between CPUs and GPUs. Amdahl's Law, however, can be relevant when
considering the serial CPU component of a heterogeneous CPU-GPU application.

7. Explain the concepts of speedup and efficiency to decide which system is better:
System A: 8 processors, execution time = 40s (sequential time = 160s)
System B: 4 processors, execution time = 60s (sequential time = 160s)

To determine which system is "better," we need to calculate the speedup and efficiency for both systems
using the given concepts.
Calculations (Speedup S = T_serial / T_parallel, Efficiency E = S / p):

• System A: S = 160 / 40 = 4, E = 4 / 8 = 0.50 (50%)
• System B: S = 160 / 60 ≈ 2.67, E ≈ 2.67 / 4 ≈ 0.67 (about 67%)

Decision: Which System is Better?

• System A (8 processors, 40s execution) achieves a higher speedup of 4, meaning it completes the
task in less absolute time (40s).
• System B (4 processors, 60s execution) achieves a higher efficiency of approximately 67%, meaning its processors are, on average, more effectively utilised per core.

The "better" system depends on the primary goal:

• If the objective is to achieve the fastest possible execution time for the given problem, System A is
better because it finishes the task in 40 seconds, which is faster than System B's 60 seconds. This is
often the primary concern in high-performance computing.
• If the objective is to maximise the utilisation of each individual processor (i.e., get the most
"work" out of each core you've invested in), then System B is better due to its higher efficiency.

In most performance evaluations, raw speedup (faster completion time) is prioritised, making System A
the generally "better" choice for this particular problem, as it delivers the result faster.
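The same arithmetic can be checked with a short C sketch (the report helper is purely illustrative, not part of any standard API):

#include <stdio.h>

/* Illustrative helper: speedup S = T_serial / T_parallel, efficiency E = S / p. */
static void report(const char* name, double t_serial, double t_parallel, int p) {
    double s = t_serial / t_parallel;
    double e = s / p;
    printf("%s: speedup = %.2f, efficiency = %.1f%%\n", name, s, 100.0 * e);
}

int main(void) {
    report("System A", 160.0, 40.0, 8);   /* speedup 4.00, efficiency 50.0% */
    report("System B", 160.0, 60.0, 4);   /* speedup 2.67, efficiency 66.7% */
    return 0;
}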

8. Develop a program with a runtime of 80 seconds, where 75% can be parallelized, by using
Amdahl’s Law to calculate the maximum speedup achievable on 6 cores and 20 cores.

Amdahl's Law is a formula that describes the theoretical maximum speedup of a parallel program given the
proportion of the program that can be parallelised.
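A sketch of the calculation from the given figures (T_serial = 80 s, parallel fraction 0.75, so serial fraction r = 0.25):

• Amdahl's Law gives the speedup on p cores as S(p) = 1 / (r + (1 - r)/p) = 1 / (0.25 + 0.75/p).
• On 6 cores: S = 1 / (0.25 + 0.75/6) = 1 / 0.375 ≈ 2.67, i.e. a run-time of about 20 + 60/6 = 30 seconds.
• On 20 cores: S = 1 / (0.25 + 0.75/20) = 1 / 0.2875 ≈ 3.48, i.e. a run-time of about 20 + 60/20 = 23 seconds.
• However many cores are used, the speedup can never exceed 1/r = 1/0.25 = 4.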
