QNS. Parallel Computing

Short

1. What is parallel computing? Who is using parallel computing?

Ans: In very simple terms, it is the use of multiple resources, in this case, processors, to
solve a problem. This type of programming takes a problem, breaks it down into a series of
smaller steps, delivers instructions, and processors execute the solutions at the same time.
It is also a form of programming that offers the same results as concurrent programming
but in less time and with more efficiency.

Many computers, such as laptops and personal desktops, use this programming in their
hardware to ensure that tasks are quickly completed in the background. This type of
programming can be used for everything from science and engineering to retail and
research. Its most common use from a societal perspective is in web search engines,
applications, and multimedia technologies.

Parallel computing is utilized in a wide range of fields and applications, including:

1. Scientific Research: Researchers in fields such as physics, chemistry, and biology often
use parallel computing to simulate complex systems, run experiments, and analyze large
datasets.

2. Weather Forecasting: Meteorologists use parallel computing to process vast amounts of atmospheric data and create more accurate weather models.

3. Financial Services: The finance industry employs parallel computing for risk analysis,
algorithmic trading, and market simulations, allowing for faster transactions and
analysis.

4. Machine Learning and Artificial Intelligence: Training complex models, such as deep
neural networks, often involves parallel processing to handle large datasets and perform
numerous calculations simultaneously.

5. Image and Signal Processing: Applications such as video encoding, medical imaging, and
real-time signal processing benefit from parallel computation to speed up the
processing time.

6. Engineering and Computational Fluid Dynamics: Engineers use parallel computing to model and simulate fluid flows, structural analysis, and other complex physical systems.

7. Large-Scale Data Analysis: Businesses and organizations analyze big data using parallel
computing to gain insights faster and make data-driven decisions.
2. Explain Gustafson–Barsis's law with a suitable example and compare it with Amdahl's law.
Ans: In computer architecture, Gustafson's law (or Gustafson–Barsis's law) gives the
theoretical speedup in latency of the execution of a task at fixed execution time that can be
expected of a system whose resources are improved.
➢ It is named after computer scientist John L. Gustafson and his colleague Edwin H. Barsis, and was presented in the article "Reevaluating Amdahl's Law" in 1988.
➢ Gustafson estimated the speedup S gained by using N processors (instead of just one) for a task with a serial fraction s (which does not benefit from parallelism) as follows: S = N + (1 − N)s
➢ Using different variables, Gustafson's law can be formulated the following way: S_latency = 1 − p + sp,
Where,
S_latency = the theoretical speedup in latency of the execution of the whole task;
s = the speedup in latency of the execution of the part of the task that benefits from the improvement of the resources of the system;
p = the percentage of the execution workload of the whole task concerning the part that benefits from the improvement of the resources of the system before the improvement.

Example: An application executes on 64 processors, spending 5% of the total time on non-parallelizable computations. What is the scaled speedup?

Solution:

Here number of processors (N) = 64

Serial fraction (s) = 5% = 5/100 = 0.05

Scaled speedup (S) =?

We know that

S = N + (1 − N)s = N − (N − 1)s

= 64 − (64 − 1) × 0.05

= 64 − 3.15 = 60.85 ans.

Note: Under Gustafson–Barsis's law, the speedup keeps growing as the number of processors increases, whereas under Amdahl's law the speedup is bounded by the serial fraction. For scaled problems, Gustafson's law is therefore the more optimistic model.
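A quick numeric check of the scaled-speedup formula, as a minimal Python sketch (the function name is illustrative):

def gustafson_speedup(n_procs, serial_fraction):
    # Scaled speedup S = N - (N - 1)s from Gustafson-Barsis's law
    return n_procs - (n_procs - 1) * serial_fraction

print(gustafson_speedup(64, 0.05))  # 60.85, matching the worked example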
3. Why are message passing models used? What is included and not included in the MPI standard? Also write the advantages and disadvantages.
Ans: In computer science, message passing is a technique for invoking behavior (i.e., running a program) on a computer.

➢ Message passing model allows multiple processes to read and write data to the message
queue without being connected to each other.
➢ Messages are stored on the queue until their recipient retrieves them.
➢ Message queues are quite useful for interprocess communication and are used by most
operating systems.
➢ A diagram that demonstrates message passing model of process communication is given as
follows

Fig: message passing model

➢ In the above diagram, both the processes P1 and P2 can access the message queue and store
and retrieve data.

The standard includes:

1. Point-to-point communication
2. Datatypes
3. Collective operations
4. Process groups
5. Communication contexts
6. Process topologies
7. Environment management and inquiry
8. The info object
9. Process creation and management
10. One-sided communication
11. External interfaces
12. Parallel file I/O
13. Language bindings for Fortran and C
14. Tool support.
The standard does not specify:

1. Operations that require more operating system support than is currently standard. For example: interrupt-driven receives, remote execution, or active messages.
2. Program construction tools
3. Debugging facilities.
Advantages of Message Passing Model
Some of the advantages of message passing model are given as follows:

• The message passing model is much easier to implement than the shared memory model.
• It is easier to build parallel hardware using message passing model as it is quite tolerant
of higher communication latencies.

Disadvantage of Message Passing Model


The message passing model has slower communication than the shared memory model because
the connection setup takes time.
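As a concrete illustration of the model, a minimal point-to-point exchange using mpi4py (assuming mpi4py and an MPI runtime are installed; run with something like mpiexec -n 2 python script.py):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": 42}, dest=1, tag=0)   # process 0 sends a message
elif rank == 1:
    data = comm.recv(source=0, tag=0)           # process 1 retrieves it
    print("received:", data)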

4. What is data parallelism? Explain machine model and CU/PU overlap.


Ans: Data parallelism is parallelization across multiple processors in parallel computing
environments. It focuses on distributing the data across different nodes, which operate on the
data in parallel. It can be applied on regular data structures like arrays and matrices by
working on each element in parallel.
It contrasts with task parallelism as another form of parallelism. In a multiprocessor system
executing a single set of instructions (SIMD), data parallelism is achieved when each
processor performs the same task on different distributed data.
CU/PU Overlap
Control Unit (CU):
• Directs the operation of the processor.
• Fetches, decodes, and orchestrates instruction execution without performing
computations.
Processing Unit (PU):
• Often refers to the Arithmetic Logic Unit (ALU).
• Performs actual arithmetic and logical operations based on instructions from the CU.
Overlap:
• Instruction Execution: CU fetches and decodes instructions, signaling the PU to execute
them.
• Data Flow Management: CU manages data transfer between memory and PU.
• Pipeline Architecture: Allows overlapping instruction stages for greater efficiency.
• Control Signals: CU generates signals that direct PU operations.
5. Explain divide and conquer technique. Write the features of hypercube.

Ans: Divide and Conquer Technique

Definition:
Divide and conquer is an algorithm design paradigm used to solve complex problems by
breaking them down into smaller, more manageable subproblems. The key steps involved in
this technique are:

1. Divide: Split the original problem into smaller subproblems that are similar to the
original problem but smaller in size.

2. Conquer: Solve each of the subproblems recursively. If the subproblems are small
enough, solve them directly (base case).

3. Combine: Merge the solutions of the subproblems to form the solution to the original
problem.

Features:

• Recursive Nature: The divide and conquer approach often employs recursion, where a
function calls itself with smaller inputs.

• Efficiency: Many divide and conquer algorithms have improved time complexity compared to their naive counterparts, often achieving logarithmic or linearithmic time complexity (e.g., O(n log n)).

• Parallelism: The subproblems can often be solved in parallel, making this approach
suitable for parallel computing environments.

• Base Case: The algorithm must define a base case where the problem is small enough to
be solved directly without further division.

• Examples: Common algorithms that use the divide and conquer technique include
Merge Sort, Quick Sort, Binary Search, and the Fast Fourier Transform (FFT).
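As a small illustration of the three steps, a recursive merge sort in Python (a standard textbook formulation; the naming is illustrative):

def merge_sort(a):
    if len(a) <= 1:                 # base case: trivially sorted
        return a
    mid = len(a) // 2
    left = merge_sort(a[:mid])      # divide + conquer the left half
    right = merge_sort(a[mid:])     # divide + conquer the right half
    merged, i, j = [], 0, 0         # combine: merge the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 7]))  # [1, 2, 5, 7, 9]

Note that the two recursive calls are independent of each other, which is exactly what makes divide and conquer amenable to parallel execution.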

Features of Hypercube

A hypercube is a geometric shape that generalizes the concept of a cube to higher dimensions.
The features of a hypercube include:

1. Dimensions: An n-dimensional hypercube (also known as an n-cube) has 2^n vertices. For example, a 3-dimensional hypercube (a cube) has 8 vertices, while a 4-dimensional hypercube (tesseract) has 16 vertices.

2. Edges and Faces: An n-dimensional hypercube has n·2^(n−1) edges and 2^(n−k)·C(n, k) k-dimensional faces. For instance, a 3-cube has 12 edges and 6 faces.

3. Symmetry: Hypercubes exhibit a high degree of symmetry. All vertices, edges, and faces
are equivalent, which means they can be transformed into each other through rotations
and reflections.

4. Vertex Connectivity: Each vertex in an n-dimensional hypercube is connected to n other vertices. This property makes hypercubes useful in network topology, as they provide efficient communication paths.

5. Binary Representation: The vertices of an n-cube can be represented as n-bit binary numbers. Each dimension corresponds to a bit, and moving from one vertex to an adjacent one can be seen as flipping one bit.

6. Applications: Hypercubes are utilized in various fields, including computer science (for
parallel processing architectures), mathematics (in topology and combinatorics), and
data structures (for organizing multi-dimensional data).

7. Distance Metric: The distance between any two vertices in a hypercube can be
measured using the Hamming distance, which counts the number of differing bits in
their binary representations.
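For example, the Hamming distance between two vertex labels can be computed by XOR-ing them and counting the set bits (plain Python sketch):

def hamming_distance(u, v):
    # Number of differing bits between vertex labels u and v
    return bin(u ^ v).count("1")

print(hamming_distance(0b000, 0b101))  # 2: the vertices are 2 hops apart in a 3-cube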

6. What is NUMA memory architecture? Differentiate one-to-all broadcast and all-to-one reduction.

Ans: NUMA stands for Non-Uniform Memory Access and is a special type of shared memory architecture where access times to different memory locations by a processor may vary, as may access times to the same memory location by different processors.

Parallel algorithms often require a single process to send identical data to all other processes
or to a subset of them. This operation is known as one-to-all broadcast. Initially, only the source
process has the data of size m that needs to be broadcast. At the termination of the procedure,
there are p copies of the initial data – one belonging to each process. The dual of one-to-all
broadcast is all-to-one reduction. In an all-to-one reduction operation, each of the p
participating processes starts with a buffer M containing m words. The data from all processes
are combined through an associative operator and accumulated at a single destination process
into one buffer of size m. Reduction can be used to find the sum, product, maximum, or
minimum of sets of numbers – the ith word of the accumulated M is the sum, product,
maximum, or minimum of the ith words of each of the original buffers.
One-to-all broadcast and all-to-one reduction are two fundamental communication patterns in
parallel computing and distributed systems. Here’s a concise differentiation between the two:

One-to-All Broadcast

Definition:
In a one-to-all broadcast, a single source node sends a message to all other nodes in the
network or system.

Characteristics:

• Source: There is one sender (the source) that transmits data.

• Destinations: The message is received by all nodes in the group.

• Use Case: Useful when a single piece of information needs to be disseminated to all
participants, such as configuration updates or notifications.

• Communication Pattern: Typically involves a tree or a mesh network to efficiently distribute the message to all nodes.

• Example: A leader node in a distributed system sending a command or update to all worker nodes.

All-to-One Reduction

Definition:
In an all-to-one reduction, multiple source nodes send their data to a single destination node,
which aggregates or reduces the data.

Characteristics:

• Sources: There are multiple senders, each providing their own data.

• Destination: All data is sent to a single receiver (the reduction node).

• Use Case: Commonly used for operations that require combining data, such as summing
values, finding maximums, or calculating averages across multiple nodes.

• Communication Pattern: Often involves gathering data from various nodes to a central
node, which may require synchronization and data aggregation.

• Example: Worker nodes in a parallel computation sending their results back to a master
node for final processing.
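Both patterns map directly onto MPI collective operations; a minimal mpi4py sketch (assuming an MPI runtime, launched with several processes):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One-to-all broadcast: root 0 sends the same object to every process
config = {"step": 1} if rank == 0 else None
config = comm.bcast(config, root=0)

# All-to-one reduction: every process contributes a value; root 0 gets the sum
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of ranks:", total)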
6. What is stored memory model? Explain its use.

Ans: The stored memory model, also known as the stored-program architecture, is a computer
architecture in which both program instructions and data are stored in the same memory
space. This model allows the CPU to access both instructions and data using the same memory
addressing mechanism.

Key Features of the Stored Memory Model

1. Unified Memory Space: - Both program instructions and data reside in the same memory
space, allowing for easier management and access.

2. Program Instructions as Data: - Programs can be treated as data, enabling powerful features
such as self-modifying code and dynamic loading of programs.

3. Sequential Execution: - The CPU fetches instructions sequentially from memory, executes
them, and can modify the instruction pointer to jump to different parts of the program.

4. Flexibility: - The model allows for more flexible programming techniques, such as recursion
and the use of function pointers.

Uses of the Stored Memory Model

1. General-Purpose Computing: Most modern computers (including desktops, laptops, and servers) use the stored memory model, enabling a wide variety of applications, from word processing to complex scientific computations.

2. Dynamic Program Behavior: - The ability to treat instructions as data allows programs to
modify their own behavior at runtime, which is useful in applications like interpreters, just-in-
time (JIT) compilers, and dynamic programming languages.

3. Operating Systems: - Operating systems utilize the stored memory model to manage tasks,
run applications, and handle system calls, as they can load and execute programs from memory
dynamically.

4. Embedded Systems: - Many embedded systems use the stored memory model for flexibility
in programming and ease of updates, allowing firmware to be modified or updated without
changing hardware.

5. Software Development: The model facilitates the development of high-level programming languages and compilers, which can generate machine code that can be executed directly from memory.
7. What is PRAM model? What are the subclasses of PRAM?

Ans: A natural extension of the serial model of computation (the Random Access Machine, or
RAM) consists of p processors and a global memory of unbounded size that is uniformly
accessible to all processors. All processors access the same address space. Processors share a
common clock but may execute different instructions in each cycle. This ideal model is also
referred to as a parallel random access machine (PRAM). Since PRAMs allow concurrent access
to various memory locations, depending on how simultaneous memory accesses are handled,
PRAMs can be divided into four subclasses.

1. Exclusive-read, exclusive-write (EREW) PRAM: In this class, access to a memory location is exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM model, affording minimum concurrency in memory access.

2. Concurrent-read, exclusive-write (CREW) PRAM: In this class, multiple read accesses to a memory location are allowed. However, multiple write accesses to a memory location are serialized.

3. Exclusive-read, concurrent-write (ERCW) PRAM: Multiple write accesses are allowed to a memory location, but multiple read accesses are serialized.

4. Concurrent-read, concurrent-write (CRCW) PRAM: This class allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model.

Allowing concurrent read access does not create any semantic discrepancies in the program.
However, concurrent write access to a memory location requires arbitration. Several protocols
are used to resolve concurrent writes. The most frequently used protocols are as follows:

• Common, in which the concurrent write is allowed if all the values that the processors are
attempting to write are identical.

• Arbitrary, in which an arbitrary processor is allowed to proceed with the write operation and
the rest fail.

• Priority, in which all processors are organized into a predefined prioritized list, and the
processor with the highest priority succeeds and the rest fail.

• Sum, in which the sum of all the quantities is written (the sum-based write conflict resolution
model can be extended to any associative operator defined on the quantities being written).
8. Differentiate between parallel and serial computing.

Ans: Serial computing executes a program as a single stream of instructions, one after another, on a single processor. Parallel computing decomposes a problem into subtasks that are executed simultaneously on multiple processors. The main differences are:

1. Execution: Serial computing performs one instruction at a time; parallel computing performs many instructions at once on different processing units.

2. Hardware: Serial computing needs only a single CPU; parallel computing needs multiple cores, processors, or machines connected by an interconnection network.

3. Performance: Serial execution time grows with the problem size; parallel execution can reduce wall-clock time, limited by the serial fraction (Amdahl's law) and by communication overhead.

4. Programming: Serial programs are simpler to write and debug; parallel programs must handle decomposition, synchronization, and communication.

5. Suitability: Serial computing suits small or inherently sequential problems; parallel computing suits large, decomposable problems such as simulations and big-data analysis.
9. Explain the sieve of Eratosthenes algorithm. Use this algorithm to find all the prime numbers between 1 and 20.

Ans: The sieve of Eratosthenes is an ancient algorithm for finding all prime numbers up to any given
limit. It does so by iteratively marking as composite (i.e., not prime) the multiples of each prime,
starting with the first prime number, 2. It is one of the most efficient ways to find all of the smaller
primes. It may be used to find primes in arithmetic progressions.

Algorithm:

1. Create a list of consecutive integers from 2 through n: (2, 3, 4, ..., n).
2. Initially, let p equal 2, the smallest prime number.
3. Enumerate the multiples of p by counting in increments of p from 2p to n, and mark them in the
list (these will be 2p, 3p, 4p, ...; the p itself should not be marked).
4. Find the smallest number in the list greater than p that is not marked. If there was no such
number, stop. Otherwise, let p now equal this new number (which is the next prime), and repeat
from step 3.
5. When the algorithm terminates, the numbers remaining not marked in the list are all the primes
below n.

Example: To find all the prime numbers less than or equal to 20, proceed as follows.

Steps of the Sieve of Eratosthenes


1. Create a List:
o Create a list of consecutive integers from 2 to n (in this case, 20).
o Example: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
2. Initialize the First Prime:
o Start with the first number in the list (2). This is the first prime number.
3. Eliminate Multiples:
o Remove all multiples of the current prime from the list. For 2, remove 4, 6, 8, 10, 12, 14,
16, 18, and 20.
o The list now looks like this: [2, 3, 5, 7, 9, 11, 13, 15, 17, 19]
4. Repeat:
o Move to the next number in the list that has not been removed (3) and repeat the process.
o Remove multiples of 3: 6, 9, 12, 15, and 18.
o The list now looks like this: [2, 3, 5, 7, 11, 13, 17, 19]
5. Continue Until the Square Root:
o Continue this process for the next prime (5). Its multiples within range (10, 15, 20) have already been removed.
o The list still looks like this: [2, 3, 5, 7, 11, 13, 17, 19]
o Since √20 ≈ 4.47, we can stop once the current prime exceeds 4.47, so sieving with 2 and 3 was already sufficient.
6. Remaining Numbers:
o The remaining numbers in the list are all the prime numbers up to 20.
Result
Using the Sieve of Eratosthenes, the prime numbers between 1 and 20 are:
2, 3, 5, 7, 11, 13, 17, 19
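A direct Python implementation of the steps above (a straightforward sketch):

def sieve_of_eratosthenes(n):
    # Return all primes <= n by marking multiples of each prime
    is_prime = [True] * (n + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= n:                      # stop once p exceeds sqrt(n)
        if is_prime[p]:
            for m in range(p * p, n + 1, p):
                is_prime[m] = False        # mark multiples as composite
        p += 1
    return [i for i, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(20))  # [2, 3, 5, 7, 11, 13, 17, 19]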
10. What is parallel reduction? Explain its types.

Ans: It is an approach which works by using half the number of threads of the elements in the
dataset. Every thread calculates the minimum of its own element and some other element. The
resultant element is forwarded to the next round. The number of threads is then reduced by
half and the process repeated until there is just a single element remaining, which is the result of the operation.

There are several types of parallel reduction, based on the operation being performed:

1. Sum Reduction:

o This is the most common type of reduction, where the goal is to compute the
sum of a set of numbers. For example, summing an array of integers can be done
in parallel by dividing the array into segments, summing each segment, and then
summing the results of those segments.

2. Product Reduction:

o Similar to sum reduction, but instead of adding numbers, the goal is to compute
the product of a set of numbers. This can be useful in various mathematical
computations.

3. Maximum/Minimum Reduction:

o This type of reduction finds the maximum or minimum value in a dataset. Each
processor can find the maximum or minimum in its chunk, and then the results
are combined to find the overall maximum or minimum.

4. Logical Reduction:

o This reduction is used to compute logical operations (AND, OR, etc.) across a set
of boolean values. For example, determining if any element in a boolean array is
true can be done using parallel OR reduction.

5. Custom Reduction:

o In addition to standard operations, custom reduction operations can be defined based on specific application requirements. This allows for flexibility in how data is combined.
Example of Sum Reduction

Here’s a simple example of how sum reduction might work in a parallel computing
environment:

1. Input Array: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

2. Divide the Array: Split into chunks: [1, 2, 3, 4], [5, 6, 7, 8], [9, 10]

3. Local Sum: Each processor computes the sum of its chunk:

o Processor 1: 1 + 2 + 3 + 4 = 10

o Processor 2: 5 + 6 + 7 + 8 = 26

o Processor 3: 9 + 10 = 19

4. Global Sum: Combine the results:

o Final Sum = 10 + 26 + 19 = 55
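The same two-level sum (local sums, then a global sum) can be sketched with Python's multiprocessing module, with chunk sizes chosen to mirror the example:

from multiprocessing import Pool

def parallel_sum(chunks):
    # Local reduction: each worker process sums one chunk
    with Pool(processes=len(chunks)) as pool:
        local_sums = pool.map(sum, chunks)
    # Global reduction: combine the partial results
    return sum(local_sums)

if __name__ == "__main__":
    data = list(range(1, 11))
    chunks = [data[0:4], data[4:8], data[8:10]]
    print(parallel_sum(chunks))  # 55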

Benefits of Parallel Reduction

• Efficiency: By dividing the workload, parallel reduction can significantly speed up computations, especially for large datasets.

• Scalability: It can easily scale with the number of processors, allowing for better
resource utilization.

• Reduced Latency: Parallel processing minimizes the time taken to compute results by
leveraging multiple execution units.
11. What are the criteria that are used to evaluate the cost and performance of
static interconnection networks?

Ans: Evaluating the cost and performance of static interconnection networks, which are often used in
parallel computing and distributed systems, involves several criteria. These criteria help determine how
well the network meets the requirements of the applications it serves. Here are the key criteria used for
evaluation:

1. Latency

• Definition: The time it takes for a message to travel from the source node to the
destination node.

• Importance: Low latency is crucial for applications that require real-time processing or
frequent communication between nodes. It impacts the overall responsiveness of the
system.

2. Throughput

• Definition: The amount of data that can be transmitted through the network in a given
time period, usually measured in bits per second (bps).

• Importance: High throughput is essential for applications that involve large data
transfers or require high bandwidth, such as scientific simulations and data analytics.

3. Scalability

• Definition: The ability of the network to maintain performance levels as the number of
nodes increases.

• Importance: A scalable network can efficiently handle the addition of new nodes
without significant degradation in performance, making it suitable for growing
applications.

4. Cost

• Definition: The financial expense associated with building and maintaining the network,
including hardware, software, and operational costs.

• Importance: Cost is a critical factor for organizations when designing systems, as it impacts budget considerations and return on investment.
5. Reliability

• Definition: The ability of the network to operate correctly and consistently over time,
including the capacity to recover from failures.

• Importance: Reliable networks ensure that data is transmitted without loss and that the
system remains operational, which is vital for mission-critical applications.

6. Fault Tolerance

• Definition: The network's ability to continue functioning in the presence of faults or failures in some of its components.

• Importance: A fault-tolerant design can prevent system-wide failures and maintain service availability, which is especially important in large-scale systems.

7. Network Topology

• Definition: The arrangement of nodes and connections in the network, which can affect
performance characteristics.

• Importance: Different topologies (e.g., mesh, torus, tree) have distinct performance
implications, such as routing efficiency and fault tolerance.

8. Bandwidth

• Definition: The maximum rate of data transfer across a network path.

• Importance: Sufficient bandwidth is necessary to support the data transfer needs of applications, particularly those that require high-volume communication.

9. Congestion Control

• Definition: Mechanisms in place to manage network traffic and avoid bottlenecks.

• Importance: Effective congestion control is essential to maintain performance levels and prevent delays during peak usage times.

10. Flexibility and Adaptability

• Definition: The ability of the network to adapt to changing workloads or application requirements.

• Importance: Flexible networks can optimize performance based on current demands, making them suitable for dynamic environments.
12. Explain the features of parallel object oriented programming.

Ans: Parallel object-oriented programming (POOP) combines the principles of object-oriented programming (OOP) with parallel computing techniques. This approach allows developers to create software that can efficiently utilize multiple processing units to perform tasks simultaneously. Here are the key features of parallel object-oriented programming:

1. Encapsulation

• Definition: Encapsulation is the bundling of data (attributes) and methods (functions) that
operate on the data into a single unit called an object.

• Feature in Parallelism: In parallel programming, encapsulation allows for the creation of independent objects that can operate concurrently. Each object can maintain its own state and behavior, which can be executed in parallel without interfering with other objects.

2. Inheritance

• Definition: Inheritance is a mechanism that allows a new class to inherit properties and
behaviors (methods) from an existing class.

• Feature in Parallelism: In parallel programming, inheritance can be used to create specialized objects that can execute parallel tasks. For example, a base class can define a general parallel processing interface, while derived classes implement specific parallel algorithms.

3. Polymorphism

• Definition: Polymorphism allows objects of different classes to be treated as objects of a common superclass. It enables methods to be defined in a way that allows different implementations based on the object's class.

• Feature in Parallelism: Polymorphism facilitates the use of different parallel algorithms or strategies by allowing the same interface to be used for different types of parallel processing. This makes it easier to switch between implementations without changing the overall structure of the program.

4. Concurrency Control

• Definition: Concurrency control refers to techniques used to manage the execution of multiple
threads or processes to ensure that they do not interfere with each other.

• Feature in Parallelism: In POOP, concurrency control mechanisms (such as locks, semaphores, and monitors) can be encapsulated within objects. This allows objects to manage their own state and interactions safely when accessed by multiple threads.
5. Distributed Objects

• Definition: Distributed objects are objects that can communicate and interact with one another
over a network.

• Feature in Parallelism: POOP supports the creation of distributed systems where objects reside
on different nodes in a network. This allows for parallel processing across multiple machines,
leveraging network resources for computation.

6. Task Decomposition:- Definition: Task decomposition involves breaking down a complex problem
into smaller, manageable tasks that can be executed in parallel.

• Feature in Parallelism: In POOP, objects can represent distinct tasks or components of a larger
problem. This modularity allows for easier distribution and parallel execution of tasks among
multiple processors or cores.

7. Data Sharing and Communication :- Definition: Data sharing and communication refer to the
mechanisms by which objects exchange information and synchronize their operations.

• Feature in Parallelism: POOP provides various communication methods (such as message passing, shared memory, or remote procedure calls) that facilitate interaction between objects in parallel execution environments.

8. Abstraction:- Definition: Abstraction is the process of simplifying complex systems by modeling classes based on the essential properties and behaviors of the objects.

• Feature in Parallelism: POOP allows developers to create abstract representations of parallel processes, making it easier to design and implement parallel algorithms without getting bogged down in implementation details.

9. Frameworks and Libraries:- Definition: Frameworks and libraries provide pre-defined structures and
tools that simplify the development of parallel applications.

• Feature in Parallelism: Many object-oriented programming languages offer libraries and frameworks specifically designed for parallel programming (e.g., OpenMP, MPI, and Java's Fork/Join framework). These tools leverage OOP principles to facilitate parallel task management and synchronization.

10. Dynamic Behavior:- Definition: Dynamic behavior refers to the ability of objects to change their
state or behavior at runtime.

• Feature in Parallelism: In POOP, objects can adapt their parallel processing strategies based on
runtime conditions, such as available resources or workload characteristics. This flexibility
enhances performance and resource utilization.
13. Explain Amdahl's law with a suitable example. Why are actual speedups always less than Amdahl's law predicts?

➢ Ans: In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved.
➢ It is named after computer scientist Gene Amdahl, and was presented at the AFIPS
(American Federation of Information Processing Societies) Spring Joint Computer
Conference in 1967.
➢ Amdahl's law is often used in parallel computing to predict the theoretical speedup
when using multiple processors.

Example:
Let a program have 40 percent of its code enhanced (so fe = 0.4) to run 2.3 times faster (so fi = 2.3). What is the overall system speedup S?
Solution:
Step 1: Set up the equation:
S = ((1 − fe) + (fe / fi))^−1
Step 2: Plug in values and solve:
S = ((1 − 0.4) + (0.4 / 2.3))^−1
  = (0.6 + 0.174)^−1
  = 1 / 0.774
  = 1.292
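The same computation, as a small Python helper (the function name is illustrative):

def amdahl_speedup(f_enhanced, speedup_enhanced):
    # Overall speedup S = 1 / ((1 - fe) + fe / fi)
    return 1.0 / ((1.0 - f_enhanced) + f_enhanced / speedup_enhanced)

print(round(amdahl_speedup(0.4, 2.3), 3))  # 1.292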

Amdahl’s law uses two factors to find speedup from some enhancement

1. Fraction enhanced:
➢ It is the fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement.
➢ For example: If 10 seconds of the execution time of a program that takes 40
seconds in total can use an enhancement, the fraction is 10/40. This obtained value
is Fraction Enhanced.
➢ Fraction enhanced is always less than 1.
2. Speedup enhanced:
➢ The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program.
➢ For example: If the enhanced mode takes, say 3 seconds for a portion of the
program, while it is 6 seconds in the original mode, the improvement is 6/3. This
value is Speedup enhanced.
➢ Speedup Enhanced is always greater than 1.
14. Explain distributed shared memory system. Write its advantages.

Ans: Distributed shared memory (DSM) is a form of memory architecture where physically separated memories can be addressed as one logically shared address space. A distributed-memory system, often called a multicomputer, consists of multiple independent processing nodes with local memory modules, connected by a general interconnection network.

fig: Distributed shared memory system representation

Advantages of distributed shared memory:

• Scales well with a large number of nodes.
• Message passing is hidden from the programmer.
• Can handle complex and large databases without replication or sending the data to processes.
• Generally cheaper than using a multiprocessor system.
• Provides a large virtual memory space.

Disadvantages of distributed shared memory:

• Generally slower to access than non-distributed shared memory.
• Must provide additional protection against simultaneous accesses to shared data.
• May incur a performance penalty.
• Gives the programmer little control over the actual messages being generated.
15. What are parallel reduction algorithms? How are they used in PRAM architecture?

Ans: Parallel Reduction Algorithms

Definition: Parallel reduction algorithms are techniques used to combine multiple values into a
single result using parallel processing. Common operations include summation, finding
maximum/minimum values, and logical operations.

Key Steps in Parallel Reduction:

1. Divide the Input: Split the data into smaller chunks for independent processing.

2. Local Reduction: Each processor computes a local result for its assigned chunk.

3. Global Reduction: Intermediate results from all processors are combined to produce the
final result.

Example: Sum Reduction

1. Input Array: [1, 2, 3, 4, 5, 6, 7, 8]

2. Local Sums:

o Processor 1: 10 (from [1, 2, 3, 4])

o Processor 2: 26 (from [5, 6, 7, 8])

3. Final Sum: 10 + 26 = 36

PRAM Architecture

PRAM (Parallel Random Access Machine): A theoretical model for parallel computing with
multiple processors accessing shared memory simultaneously. It has different models, including
EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write), and CRCW
(Concurrent Read Concurrent Write).

Parallel Reduction in PRAM:

1. Initialization: Assign input data to processors.

2. Local Computation: Each processor computes a local result.

3. Combine Results: Intermediate results are written to shared memory.

4. Final Reduction: Results are combined to produce the final output.
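A serial simulation of the pairwise (tree) combining that a PRAM would perform in O(log n) rounds, as a sketch:

def pram_tree_sum(values):
    # Simulate log-step pairwise reduction: half the active "processors" drop out each round
    vals = list(values)
    stride = 1
    while stride < len(vals):
        # On a PRAM, all additions within one round happen in parallel
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
    return vals[0]

print(pram_tree_sum([1, 2, 3, 4, 5, 6, 7, 8]))  # 36, matching the example above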


16. Discuss different parallel approaches, including data parallelism and control parallelism.

Ans: Parallel computing can be broadly categorized into two main approaches: data
parallelism and control parallelism. Each approach focuses on different aspects of parallel
execution and is suited for different types of problems.

1. Data Parallelism:- Data parallelism involves distributing subsets of the same data across
multiple processors, where the same operation is performed on each subset concurrently.

Key Features:

• Uniform Operations: The same operation (e.g., addition, multiplication) is applied to different pieces of data.

• Data Distribution: Data is partitioned into smaller chunks that can be processed
independently.

• Scalability: Easily scales with the amount of data and the number of processors.

Example Use Cases:

• Image processing (applying filters to pixels).

• Scientific simulations (calculating values for different grid points).

• Machine learning (training models on large datasets).

Implementation: Commonly implemented using frameworks like CUDA for GPUs or parallel
libraries such as OpenMP and MPI.
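In Python, NumPy gives the flavor of data parallelism at the library level: one operation is applied elementwise across a whole array (a sketch; NumPy may vectorize rather than use multiple cores):

import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)
c = a + b       # the same addition is applied to every element pair
print(c[:3])    # [1. 2. 3.]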

2. Control Parallelism:- Control parallelism involves executing different operations or tasks concurrently, where each task may operate on different data and may not necessarily perform the same operation.

Key Features:

• Task Distribution: Different tasks are assigned to different processors, which may
involve different operations.

• Dependency Management: Tasks may have dependencies, requiring careful scheduling and synchronization.

• Flexibility: Suitable for problems where tasks can be executed independently.

Example Use Cases:

• Task scheduling in operating systems.

• Workflow management in data processing pipelines.

• Complex simulations where different modules operate independently.

Implementation: Often implemented using parallel programming models such as threads, message passing, or task-based frameworks.

17. Write down the limitations of Amdahl's law.

Ans: Limitations of the Amdahl's Law:

1. A key flaw in past arguments that Amdahl's law is a fatal limit to the future of parallelism is Gustafson's law: the proportion of the computation that is sequential normally decreases as the problem size increases.

2. Its proof focuses on the steps in a particular algorithm, and does not consider
whether other algorithms with more parallelism may exist.

3. Amdahl's law applies only to 'standard' problems where superlinearity cannot occur.

note: Actual speed ups are always less in Amdahl's law because distributing work to the
parallel processors and collecting the results back together is extra work required in the
parallel version which isn't required in the serial version.
18. Explain the steps of Foster's design methodology.

Ans: Foster's methodology is a well-known design methodology used to architect and implement distributed-memory systems (DMS). It has four foundational steps:

1. Partitioning:

i. In partitioning you break the problem down around its goal and identify the primitive tasks.

ii. A system whose purpose is to sum all the positions of a giant array, has as primitive task
the act of summing the current position to an accumulator variable.

iii. The purpose of the partitioning is to discover as much parallelism as possible.

➢ This first stage, partitioning, is the only chance to do this; the remaining stages typically reduce the amount of parallelism. There are several strategies for partitioning: domain decomposition and functional decomposition; Foster also provides a checklist for evaluating a partitioning.

2. Communication

➢ Once you decided the primitive task, you need to evaluate the need for communication
among them.

➢ Note that the communication pattern should not be too complex.

➢ There are two levels of communication, local communication, where you have a task that
requires data from a set of adjacent tasks (e.g.: to sum position j+1 you need to compute
first position j and pass the accumulated value to the task responsible to compute the
following position),

➢ And global communication, where you have a task producing a value that every other task
requires. You should minimize both kinds of communication (avoiding either extreme).

3. Agglomeration

➢ Agglomeration is the act of grouping the many primitive tasks created by partitioning back into fewer, larger tasks, turning the problem into a more manageable one.

➢ One of the goals is to hide the majority of the communication cost inside a single task,

➢ So instead of having a task adding its position of the array to the accumulator and then
passing the accumulator value to the next task, we agglomerate a bunch of array positions
into the same task and give it to a single core, hence, reducing dramatically the
communication cost.

4. Mapping

➢ Mapping, the final stage of Foster's methodology, is the procedure of assigning each task to a processor.

➢ This mapping problem does not arise on uniprocessors or on shared memory computers
whose operating systems provide automatic scheduling.

Fig: Foster's design methodology graphical representation.


19. Differentiate between data and task parallelism.

Ans: Difference between task and data parallelism:

Data parallelism | Task parallelism

Same operations are performed on different subsets of the same data. | Different operations are performed on the same or different data.

Synchronous computation. | Asynchronous computation.

Speedup is more, as there is only one execution thread operating on all sets of data. | Speedup is less, as each processor executes a different thread or process on the same or a different set of data.

Amount of parallelization is proportional to the input data size. | Amount of parallelization is proportional to the number of independent tasks to be performed.

Designed for optimum load balance on a multiprocessor system. | Load balancing depends on the availability of the hardware and on scheduling algorithms, such as static and dynamic scheduling.

20. Explain the Karp–Flatt metric with a suitable example.


➢ Ans: The Karp–Flatt metric is a measure of parallelization of code in parallel processor
systems.
➢ This metric exists in addition to Amdahl's law and Gustafson's law as an indication of the
extent to which a particular computer code is parallelized.
➢ It was proposed by Alan H. Karp and Horace P. Flatt in 1990.

Description:

Given a parallel computation exhibiting speedup ψ on p processors, where p > 1, the experimentally determined serial fraction e is defined to be the Karp–Flatt metric, viz.

e = (1/ψ − 1/p) / (1 − 1/p)
The lower the value of e, the better the parallelization.
Justification:

➢ There are many ways to measure the performance of a parallel algorithm running on a
parallel processor.
➢ The Karp–Flatt metric defines a metric which reveals aspects of the performance that are not
easily discerned from other metrics.
➢ A pseudo-"derivation" of sorts follows from Amdahl's law, which can be written as:

T(p) = Ts + Tp/p

Where:

1. T(p) is the total time taken for code execution in a p-processor system
2. Ts is the time taken for the serial part of the code to run.
3. Tp is the time taken for the parallel part of the code to run on one processor.
4. p is the number of processors.

With the result obtained by substituting p = 1, viz. T(1) = Ts + Tp, if we define the serial fraction e = Ts/T(1), then the equation can be written as

T(p) = T(1)e + T(1)(1 − e)/p

In terms of the speedup ψ = T(1)/T(p):

1/ψ = e + (1 − e)/p

➢ Solving for the serial fraction, we get the Karp–Flatt metric as above.
➢ Note that this is not a "derivation" from Amdahl's law as the left hand side represents a
metric rather than a mathematically derived quantity.
➢ The treatment above merely shows that the Karp–Flatt metric is consistent with Amdahl's
Law.

Example: Benchmarking a parallel program on 2, 3, ..., 8 processors produces the following speedup results:

p        2     3     4     5     6     7     8
ψ(n,p)   1.82  2.50  3.08  3.57  4.00  4.38  4.71
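Applying the metric to this table, every data point yields essentially the same experimentally determined serial fraction, e ≈ 0.10, which suggests the speedup is limited by the serial fraction rather than by growing parallel overhead. A quick check in Python:

def karp_flatt(psi, p):
    # Experimentally determined serial fraction e = (1/psi - 1/p) / (1 - 1/p)
    return (1 / psi - 1 / p) / (1 - 1 / p)

speedups = {2: 1.82, 3: 2.50, 4: 3.08, 5: 3.57, 6: 4.00, 7: 4.38, 8: 4.71}
for p, psi in speedups.items():
    print(p, round(karp_flatt(psi, p), 3))  # every row gives e close to 0.10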
LONG

1. Explain Cannon's algorithm. Use the Eratosthenes algorithm to find all the prime numbers between 1 and 10.

Ans: Cannon's algorithm:

➢ In computer science, Cannon's algorithm is a distributed algorithm for matrix multiplication on two-dimensional meshes, first described in 1969 by Lynn Elliot Cannon.

➢ It is especially suitable for computers laid out in an N × N mesh.

➢ While Cannon's algorithm works well in homogeneous 2D grids, extending it to heterogeneous 2D grids has been shown to be difficult.

➢ The main advantage of the algorithm is that its storage requirements remain constant and
are independent of the number of processors.

Algorithm:

1. Consider two n × n matrices A and B partitioned into p blocks.

2. A(i, j) and B(i, j) (0 ≤ i, j < √p) of size (n/√p) × (n/√p) each.

3. Process P(i, j) initially stores A(i, j) and B(i, j) and computes block C(i, j) of the result matrix.

4. The initial step of the algorithm regards the alignment of the matrixes.

5. Align the blocks of A and B in such a way that each process can independently start
multiplying its local submatrices.

6. This is done by shifting all submatrices A(i , j) to the left (with wraparound) by i steps and all
submatrices B(i , j) up (with wraparound) by j steps.

7. Perform local block multiplication.

8. Each block of A moves one step left and each block of B moves one step up (again with
wraparound).

9. Perform next block multiplication, add to partial result, repeat until all blocks have been
multiplied.
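A serial NumPy simulation of these steps on a q × q block grid may make the shifting pattern concrete (a sketch; a real implementation distributes the blocks across processes):

import numpy as np

def cannon_matmul(A, B, q):
    # Simulate Cannon's algorithm on a q x q process grid (q must divide n)
    n = A.shape[0]
    b = n // q
    Ab = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bb = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cb = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Initial alignment: shift row i of A left by i, column j of B up by j
    Ab = [[Ab[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bb = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cb[i][j] += Ab[i][j] @ Bb[i][j]   # local block multiply-accumulate
        # Shift every A block one step left and every B block one step up (wraparound)
        Ab = [[Ab[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bb = [[Bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cb)

A = np.random.rand(4, 4); B = np.random.rand(4, 4)
print(np.allclose(cannon_matmul(A, B, 2), A @ B))  # True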

To find all the prime numbers less than or equal to 10, proceed as follows.
The Sieve of Eratosthenes is an efficient algorithm for finding all prime numbers up to a
specified integer. Here’s how the algorithm works to find all prime numbers between 1 and 10:

Steps of the Sieve of Eratosthenes:

1. Initialize a List:

o Create a list of boolean values representing the numbers from 2 to 10. Initially, all
values are set to True, indicating that all numbers are potential primes.

o The list will look like this:

o Index: 0 1 2 3 4 5 6 7 8 9 10

o Values: [F, F, T, T, T, T, T, T, T, T, T]

o Here, the index represents the numbers, and True indicates that the number is a
candidate for being prime.

2. Start with the First Prime (2):

o The first prime number is 2. Start with this number and mark all of its multiples
as False (not prime).

3. Mark Multiples of 2:

o Mark 4, 6, 8, and 10 as False:

o Index: 0 1 2 3 4 5 6 7 8 9 10

o Values: [F, F, T, T, F, T, F, T, F, T, F]

4. Move to the Next Number (3):

o The next number that is still True is 3. Mark all of its multiples as False.

5. Mark Multiples of 3:

o Mark 6 and 9 as False:

o Index: 0 1 2 3 4 5 6 7 8 9 10

o Values: [F, F, T, T, F, T, F, T, F, F, F]

6. Next Number (4):

o The next number is 4, but it is marked False, so we skip it.

7. Next Prime (5):


o The next number is 5. Since it is True, we mark its multiples:

▪ However, the only multiple of 5 within our range is 10, which is already
marked False.

8. Next Number (6):

o Skip 6, as it is marked False.

9. Next Prime (7):

o The next number is 7, which is True. The only multiple of 7 within our range is
14, which is outside our limit.

10. Completion:

o We have now checked all numbers up to the square root of 10 (approximately 3.16). The remaining True values represent prime numbers.

Final List of Primes:

From the boolean list, the prime numbers between 1 and 10 are: 2, 3, 5, 7.
2. What do you mean by combinatorial search? Explain any one example in detail.

Ans: Combinatorial Search

Definition:
Combinatorial search refers to a class of algorithms and techniques used to explore and find
solutions in a search space defined by a set of discrete objects or combinations. This type of
search is often applied in optimization problems, decision-making, and AI, where the goal is to
find the best arrangement or selection from a finite set of possibilities.

Combinatorial search problems can range from simple tasks, like generating permutations of a
list, to complex problems like the traveling salesman problem or the knapsack problem.

Example: The N-Queens Problem

Problem Statement:

Place N queens on an N×N chessboard so that no two queens threaten each other (no two
queens can be in the same row, column, or diagonal).

Steps to Solve Using Backtracking:

1. Representation: Use a 2D array to represent the chessboard.

2. Backtracking Algorithm:

o Place queens row by row.

o For each row, try placing a queen in each column.

o Check for conflicts with previously placed queens.

o If a valid position is found, place the queen and move to the next row.

o If all queens are placed, print the solution.

o If a position leads to no solution, remove the queen (backtrack) and try the next
column.
Example Code (Python):

def is_safe(board, row, col, n):
    # Check the column above the current row
    for i in range(row):
        if board[i][col]:
            return False
    # Check the upper-left diagonal
    for i, j in zip(range(row, -1, -1), range(col, -1, -1)):
        if board[i][j]:
            return False
    # Check the upper-right diagonal
    for i, j in zip(range(row, -1, -1), range(col, n)):
        if board[i][j]:
            return False
    return True

def solve_n_queens_util(board, row, n):
    if row >= n:
        # All queens placed: print one complete solution
        for r in board:
            print(" ".join("Q" if cell else "." for cell in r))
        print()
        return
    for col in range(n):
        if is_safe(board, row, col, n):
            board[row][col] = True                  # place the queen
            solve_n_queens_util(board, row + 1, n)  # recurse into the next row
            board[row][col] = False                 # backtrack

def solve_n_queens(n):
    board = [[False] * n for _ in range(n)]
    solve_n_queens_util(board, 0, n)

# Example usage
solve_n_queens(4)


3. What is pipelining? Describe the speedup gained due to pipelining.

Ans: Pipelining is a technique of dividing one task into multiple subtasks and executing the subtasks in parallel with multiple hardware units.

In a pipelined processor, multiple instructions are executed at various stages of the instruction cycle simultaneously in different sections of the processor.

Ex: In a pipelined processor, S1, S2, S3, ..., Sn are the n sections that form the pipeline. Each section performs a subtask on the input received from the interstage buffer that stores the output of the previous section. All sections can perform their operations simultaneously. The objective of pipeline processing is to increase throughput through the overlap between the executions of consecutive tasks. An instruction pipeline divides the actions of an instruction cycle into multiple steps that are executed one by one in different sections of the processor.

The number of sections in the pipeline is designed by the computer architect. Consider the
instruction cycle is divided into four steps:

a)Instruction Fetch(IF)

b)Instruction Decode(ID)

c)Execute(EX)

d)Write Result(WR)

Assume that there are m instructions and these instructions will be executed in n-sections of
the pipeline processor.

The time taken for the first instruction = n·tc, where tc is the duration of the clock cycle.

Time taken for the remaining (m − 1) instructions = (m − 1)·tc

Total time taken for m instructions = n·tc + (m − 1)·tc = (n + m − 1)·tc

If the processor is non-pipelined, then the total time taken for m instructions is m·n·tc.

Hence the performance gain due to pipelining = m·n·tc / (n + m − 1)·tc = mn / (n + m − 1).
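Plugging in sample numbers, as a quick sketch:

def pipeline_speedup(m, n):
    # Speedup of an n-stage pipeline executing m instructions
    return (m * n) / (n + m - 1)

print(round(pipeline_speedup(100, 4), 2))  # 3.88: approaches n = 4 as m grows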


3. Explain the hypercube algorithm. Write down the features of the hypercube. Explain the forward and reverse phases of prefix computation on a hypercube.

Ans: Definition:
The hypercube algorithm is a parallel computing model based on the structure of a hypercube
graph. Each node in the hypercube represents a processor, and edges between nodes represent
communication links. The hypercube is a powerful architecture for parallel processing due to its high
connectivity and low latency.

Features of Hypercube

1. Topology: A hypercube of dimension d has 2^d nodes. Each node is connected to d other nodes.

2. Scalability: It can easily scale by increasing the dimension d, allowing for more processors.

3. Communication: Nodes can communicate directly with their neighbors, enabling efficient data transfer.

4. Fault Tolerance: The structure allows for some level of fault tolerance; if a node fails, the remaining nodes can still communicate.

5. Parallelism: Supports a high degree of parallelism, making it suitable for a wide range of parallel algorithms.

Prefix Computation on Hypercube

Prefix computation is a common operation in parallel computing, where each processor computes a prefix sum (or other associative operation) over a distributed set of values. The process can be divided into two phases: the forward phase and the reverse phase.

Forward Phase

1. Initialization:

o Each processor starts with its own value.

2. Communication:

o In each step, each processor sends its value to its neighbors along the edges of
the hypercube.

o The communication proceeds in a way that each processor aggregates the values
from its neighbors, effectively computing partial sums.
3. Steps:

o In the first step, processors at a distance of 1 (neighbors) exchange values.

o In the second step, processors at a distance of 2 exchange values, and so on.

o This continues until the values are propagated through the hypercube structure.

Reverse Phase

1. Finalization:

o After the forward phase, each processor holds a partial sum. The reverse phase
is used to compute the final prefix sum.

2. Communication:

o Similar to the forward phase, each processor sends its value to its neighbors.

o However, in this phase, the processors adjust their values based on the values
received from their neighbors.

3. Steps:

o The reverse phase also involves multiple steps. Each processor updates its value
based on the values received from its neighbors, effectively completing the
prefix computation.
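A serial simulation of a hypercube prefix sum on p = 2^d nodes, as a sketch; this common single-pass formulation folds the aggregation and adjustment work described above into one exchange per dimension:

def hypercube_prefix_sum(values):
    # Simulate an inclusive prefix sum on p = 2**d hypercube nodes
    p = len(values)
    d = p.bit_length() - 1                # hypercube dimension; p must be a power of 2
    prefix = list(values)                 # each node's running prefix result
    total = list(values)                  # each node's running subcube total
    for bit in range(d):                  # one pairwise exchange per dimension
        new_prefix, new_total = prefix[:], total[:]
        for node in range(p):
            partner = node ^ (1 << bit)   # neighbor across this dimension
            new_total[node] = total[node] + total[partner]
            if partner < node:            # lower-numbered nodes precede this one
                new_prefix[node] = prefix[node] + total[partner]
        prefix, total = new_prefix, new_total
    return prefix

print(hypercube_prefix_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]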
4. Explain different paradigms of parallel computing. Also discuss the merits and demerits of parallel computing.

Ans: Parallel computing can be categorized into several paradigms based on how tasks are
divided and executed. Here are some of the main paradigms:

1. Data Parallelism:

o Description: Involves distributing subsets of the same data across multiple processors. Each processor performs the same operation on different pieces of data.

o Example: Vector addition, where each element of two vectors is added in parallel.

2. Task Parallelism:

o Description: Involves distributing different tasks or processes across multiple processors. Each processor may perform a different operation.

o Example: In a web server, one processor handles requests while another processes data.

3. Pipeline Parallelism:

o Description: Tasks are divided into stages, and each stage is executed in parallel.
The output of one stage is the input for the next.

o Example: In image processing, one processor might filter an image while another
compresses it.

4. Bit-level Parallelism:

o Description: Involves increasing the word size of the processor to process


multiple bits in a single operation.

o Example: 64-bit processors can handle 64 bits of data in one operation


compared to 32-bit processors.

5. Instruction-level Parallelism (ILP):

o Description: Multiple instructions are executed simultaneously within a single


CPU cycle.
o Example: Superscalar architectures can issue multiple instructions per clock
cycle.

6. Hybrid Parallelism:

o Description: Combines different parallelism approaches (e.g., data and task


parallelism) to optimize performance.

o Example: Using both multi-threading (task parallelism) and vectorization (data


parallelism) in a scientific computation application.
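
As referenced in the data parallelism item above, here is a minimal Python sketch (names and chunk sizes are illustrative, not from the source): the same elementwise addition is applied to different chunks of the data by different worker processes.

# Data parallelism sketch: same operation, different pieces of the data.
from multiprocessing import Pool

def add_chunk(pair):
    a, b = pair
    return [x + y for x, y in zip(a, b)]

if __name__ == "__main__":
    a = list(range(8))
    b = list(range(8, 16))
    # Split the data into chunks; each worker runs the same operation.
    chunks = [(a[i:i + 4], b[i:i + 4]) for i in range(0, 8, 4)]
    with Pool(2) as pool:
        parts = pool.map(add_chunk, chunks)
    result = [x for part in parts for x in part]
    print(result)   # [8, 10, 12, 14, 16, 18, 20, 22]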

Merits of Parallel Computing

1. Increased Performance:- Parallel computing can significantly reduce the time
required to complete large computations by dividing the workload among
multiple processors.

2. Scalability:- Systems can be scaled up by adding more processors, allowing for
handling larger problems more efficiently.

3. Resource Utilization:- Better utilization of available resources, as multiple
processors can work on different tasks simultaneously.

4. Enhanced Capability:- Enables solving complex problems that are infeasible for
sequential computing due to time or memory constraints.

Demerits of Parallel Computing

1. Complexity:- Designing parallel algorithms and managing data dependencies can
be complex and challenging.

2. Overhead:- Communication and synchronization between processors can
introduce overhead, reducing the overall performance gains.

3. Debugging:- Debugging parallel programs is often more difficult than debugging
sequential programs due to non-deterministic behaviors and race conditions.

4. Limited Speedup:- The theoretical speedup may not be achievable in practice
due to factors like Amdahl's Law, which states that the maximum speedup of a
program is limited by its serial portion: with a parallel fraction p running on N
processors, Speedup = 1 / ((1 - p) + p/N).

5. Cost:- High-performance parallel computing systems can be expensive to
develop and maintain.
5. Explain Floyd-Warshall algorithm with suitable example.

Ans: In computer science, the Floyd–Warshall algorithm (also known as Floyd's algorithm, the
Roy–Warshall algorithm, the Roy–Floyd algorithm, or the WFI algorithm) is an algorithm for
finding shortest paths in a directed weighted graph with positive or negative edge weights (but
with no negative cycles). A weighted graph is a graph in which each edge has a numerical value
associated with it.

Floyd-Warshall Algorithm:
n = number of vertices
A = n × n matrix (A^0 holds the initial edge weights)
for k = 1 to n
    for i = 1 to n
        for j = 1 to n
            A^k[i, j] = min(A^(k-1)[i, j], A^(k-1)[i, k] + A^(k-1)[k, j])
return A^n

Example
Consider the following directed graph with 4 vertices (0, 1, 2, 3) and the edges with weights:
• 0 → 1 (3)
• 0 → 2 (5)
• 1 → 2 (1)
• 2 → 3 (2)
• 1 → 3 (6)
Step 1: Initialization
The initial distance matrix D is:
    0  1  2  3
0 [ 0, 3, 5, ∞ ]
1 [ ∞, 0, 1, 6 ]
2 [ ∞, ∞, 0, 2 ]
3 [ ∞, ∞, ∞, 0 ]
Step 2: Relaxation
• For k = 0:
  o No updates, since no vertex has an edge into vertex 0.
• For k = 1:
  o Update distances using vertex 1:
  o D[0][2] = min(D[0][2], D[0][1] + D[1][2]) = min(5, 3 + 1) = 4
  o D[0][3] = min(D[0][3], D[0][1] + D[1][3]) = min(∞, 3 + 6) = 9
• For k = 2:
  o Update distances using vertex 2:
  o D[1][3] = min(D[1][3], D[1][2] + D[2][3]) = min(6, 1 + 2) = 3
  o D[0][3] = min(D[0][3], D[0][2] + D[2][3]) = min(9, 4 + 2) = 6
• For k = 3:
  o No updates needed since vertex 3 has no outgoing edges.
Final Distance Matrix:
After considering all vertices, the final distance matrix D becomes:
    0  1  2  3
0 [ 0, 3, 4, 6 ]
1 [ ∞, 0, 1, 3 ]
2 [ ∞, ∞, 0, 2 ]
3 [ ∞, ∞, ∞, 0 ]
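For reference, a minimal runnable Python version of the pseudocode above (a sketch using in-place relaxation on the distance matrix), applied to the example graph:

INF = float('inf')

def floyd_warshall(dist):
    # dist[i][j] = weight of edge i -> j, INF if the edge is absent.
    n = len(dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Keep the shorter of the current path and the path through k.
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

D = [[0,   3,   5,   INF],
     [INF, 0,   1,   6],
     [INF, INF, 0,   2],
     [INF, INF, INF, 0]]
for row in floyd_warshall(D):
    print(row)
# [0, 3, 4, 6]
# [inf, 0, 1, 3]
# [inf, inf, 0, 2]
# [inf, inf, inf, 0]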
6. Explain Flynn’s classification of computer architecture with its diagram.

Ans: Flynn's taxonomy:

1. Flynn's taxonomy is a classification of computer architectures, proposed by Michael J. Flynn
in 1966.
2. The classification system has stuck, and it has been used as a tool in the design of modern
processors and their functionalities.
3. Since the rise of multiprocessing central processing units (CPUs), a multiprogramming
context has evolved as an extension of the classification system.
4. The four classifications defined by Flynn are based upon the number of concurrent
instruction (or control) streams and data streams available in the architecture.

A. Single instruction stream, single data stream (SISD)

1. A sequential computer which exploits no parallelism in either the instruction or data streams.
2. Single control unit (CU) fetches single instruction stream (IS) from memory.
3. The CU then generates appropriate control signals to direct single processing element (PE) to
operate on single data stream (DS) i.e., one operation at a time.
4. Examples of SISD architecture are the traditional uniprocessor machines like older personal
computers (PCs; by 2010, many PCs had multiple cores) and mainframe computers.

Fig: SISD

B. Single instruction stream, multiple data streams (SIMD)

1. A single instruction operates on multiple different data streams. Instructions can be executed
sequentially, such as by pipelining, or in parallel by multiple functional units.

Fig: SIMD
2. Single instruction, multiple threads (SIMT) is an execution model used in parallel computing
where single instruction, multiple data (SIMD) is combined with multithreading.
3. This is not a distinct classification in Flynn's taxonomy, where it would be a subset of SIMD.
Nvidia commonly uses the term in its marketing materials and technical documents, where it
argues for the novelty of Nvidia architecture.

C. Multiple instruction streams, single data stream (MISD)

1. Multiple instructions operate on one data stream.
2. This is an uncommon architecture which is generally used for fault tolerance.
3. Heterogeneous systems operate on the same data stream and must agree on the result.
4. Examples include the Space Shuttle flight control computer.

Fig: MISD

D. Multiple instruction streams, multiple data streams (MIMD)

1. Multiple autonomous processors simultaneously executing different instructions on different
data.
2. MIMD architectures include multi-core superscalar processors, and distributed systems,
using either one shared memory space or a distributed memory space.

Fig: MIMD
7. Explain the architecture of distributed and shared memory of a computer.

Ans: Distributed shared-memory programming:

➢ Distributed shared memory (DSM) is a form of memory architecture where physically
separated memories can be addressed as one logically shared address space.
➢ A distributed-memory system, often called a multicomputer, consists of multiple independent
processing nodes with local memory modules, connected by a general
interconnection network.

Fig: Distributed shared memory system representation

Shared memory model:

➢ The shared memory in the shared memory model is the memory that can be simultaneously
accessed by multiple processes.
➢ This is done so that the processes can communicate with each other. All POSIX systems, as
well as Windows operating systems use shared memory.

➢ A diagram that illustrates the shared memory model of process communication is shown
above.
➢ In the above diagram, the shared memory can be accessed by Process 1 and Process 2.
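
To make the shared-memory model concrete, here is a minimal Python sketch (an illustration, not from the source): two processes increment one counter that lives in shared memory, with a lock providing the synchronization that concurrent access requires.

# Shared-memory sketch: two processes communicate through one shared counter.
from multiprocessing import Process, Value, Lock

def worker(counter, lock):
    for _ in range(1000):
        with lock:                # synchronize access to the shared location
            counter.value += 1

if __name__ == "__main__":
    counter = Value('i', 0)       # an integer placed in shared memory
    lock = Lock()
    procs = [Process(target=worker, args=(counter, lock)) for _ in range(2)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value)          # 2000 -- both processes saw the same memory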
8. What are the major differences between distributed and shared memory?

Ans: Distributed memory and shared memory are two fundamental paradigms used in parallel
computing to manage memory and facilitate communication between processes. Here are the major
differences between them:

1. Memory Architecture

• Distributed Memory: Each processor has its own local memory, and the memory
is not shared among processors. Communication between processors occurs via
message passing (e.g., using MPI - Message Passing Interface). Example: Cluster
computing systems where each node has its own memory.

• Shared Memory: All processors share a common global memory space.
Processes can directly read from and write to shared memory locations.
Synchronization mechanisms (e.g., mutexes, semaphores) are required to
manage concurrent access to shared data. Example: Multi-core processors
where threads can access the same memory.

2. Communication Mechanism

• Distributed Memory: Communication is explicit; processes must send and
receive messages to exchange data. This can introduce overhead due to the
need for serialization and network latency (a short sketch follows below).

• Shared Memory: Communication is implicit; processes can directly access shared
variables. This can lead to faster communication but requires careful
synchronization to avoid race conditions.
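
Below is a minimal sketch of explicit message passing using mpi4py (assuming mpi4py and an MPI runtime are installed; run with something like mpiexec -n 2 python send_recv.py): rank 0 must explicitly send, and rank 1 must explicitly receive, because neither process can read the other's memory.

# Distributed-memory sketch: explicit message passing with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {'payload': 42}
    comm.send(data, dest=1, tag=11)      # explicit send: serialize + transmit
elif rank == 1:
    data = comm.recv(source=0, tag=11)   # explicit receive from rank 0
    print('rank 1 received', data)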

3. Scalability

• Distributed Memory: Generally more scalable, as adding more nodes
(processors) can be done without significant changes to the architecture.
Suitable for large systems like supercomputers and cloud computing.

• Shared Memory: Scalability is limited by the memory bus and potential
contention for shared resources. As more processors access the shared memory,
performance can degrade due to increased contention.
4. Programming Model

• Distributed Memory: Requires a message-passing model, which can be more
complex to program. Programmers must explicitly manage data distribution and
communication.

• Shared Memory: Easier to program for certain types of applications, as threads
can access shared data without explicit communication. However, programmers
must be aware of synchronization issues.

5. Fault Tolerance

• Distributed Memory: More robust to individual node failures; if one node fails,
others can continue to operate. Fault tolerance mechanisms can be
implemented at the application level.

• Shared Memory: A failure in the shared memory subsystem can affect all
processes, making it less fault-tolerant. Typically requires additional mechanisms
to handle failures.

Summary Table

Feature                    Distributed Memory               Shared Memory

Memory Architecture        Local memory per processor       Global shared memory

Communication Mechanism    Message passing                  Direct access to shared variables

Scalability                More scalable                    Limited scalability

Programming Model          More complex (message passing)   Easier (shared data access)

Fault Tolerance            More robust to node failures     Less fault-tolerant

Conclusion

Both distributed memory and shared memory have their advantages and disadvantages, and
the choice between them depends on the specific requirements of the application, the
architecture of the system, and the desired scalability and performance characteristics.
