QNS. Parallel Computing
Ans: In very simple terms, it is the use of multiple resources, in this case processors, to
solve a problem. This type of programming takes a problem, breaks it down into a series of
smaller steps, delivers the instructions, and has the processors execute the solutions at the
same time. It is also a form of programming that offers the same results as concurrent
programming but in less time and with more efficiency.
Many computers, such as laptops and personal desktops, use this programming in their
hardware to ensure that tasks are quickly completed in the background. This type of
programming can be used for everything from science and engineering to retail and
research. Its most common uses from a societal perspective are in web search engines,
applications, and multimedia technologies.
1. Scientific Research: Researchers in fields such as physics, chemistry, and biology often
use parallel computing to simulate complex systems, run experiments, and analyze large
datasets.
3. Financial Services: The finance industry employs parallel computing for risk analysis,
algorithmic trading, and market simulations, allowing for faster transactions and
analysis.
4. Machine Learning and Artificial Intelligence: Training complex models, such as deep
neural networks, often involves parallel processing to handle large datasets and perform
numerous calculations simultaneously.
5. Image and Signal Processing: Applications such as video encoding, medical imaging, and
real-time signal processing benefit from parallel computation to speed up the
processing time.
7. Large-Scale Data Analysis: Businesses and organizations analyze big data using parallel
computing to gain insights faster and make data-driven decisions.
2. Explain Gustafson–Barsis's law with a suitable example and compare it with
Amdahl's law.
Ans: In computer architecture, Gustafson's law (or Gustafson–Barsis's law) gives the
theoretical speedup in latency of the execution of a task at fixed execution time that can be
expected of a system whose resources are improved.
➢ It is named after computer scientist John L. Gustafson and his colleague Edwin H. Barsis,
and was presented in the article Reevaluating Amdahl's Law in 1988
➢ Gustafson estimated the speedup S gained by using N processors (instead of just one) for a
task with a serial fraction s (which does not benefit from parallelism) as follows:
S = N + (1 - N)s
➢ Using different variables, Gustafson's law can be formulated the following way:
Slatency(s) = 1 - p + sp,
Where,
Slatency = The theoretical speedup in latency of the execution of the whole task.
s = It is the speedup in latency of the execution of the part of the task that benefits from the
improvement of the resources of the system.
p = It is the percentage of the execution workload of the whole task concerning the part that
benefits from the improvement of the resources of the system before the improvement.
Example: For N = 64 processors and a serial fraction s = 0.05:
Solution:
We know that
S = N + (1 - N)s
  = 64 + (1 - 64) * 0.05
  = 64 - 3.15
  = 60.85 ans….
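A minimal Python sketch to check this arithmetic (N = 64 and s = 0.05 are the example values assumed above):

def gustafson_speedup(n, s):
    # Gustafson-Barsis: S = N + (1 - N) * s for N processors and serial fraction s
    return n + (1 - n) * s

print(round(gustafson_speedup(64, 0.05), 2))   # 60.85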
Note: Under Gustafson–Barsis's law the speedup keeps increasing as the number of processors
increases, in contrast to Amdahl's law, so it is considered a better (less pessimistic) law than Amdahl's law.
3. Why is the message passing model used? What is included and not included in the MPI
standard? Also write its advantages and disadvantages.
Ans: In computer science, message passing is a technique for invoking behavior (i.e., running
a program) on a computer.
➢ Message passing model allows multiple processes to read and write data to the message
queue without being connected to each other.
➢ Messages are stored on the queue until their recipient retrieves them.
➢ Message queues are quite useful for interprocess communication and are used by most
operating systems.
➢ A diagram that demonstrates message passing model of process communication is given as
follows
➢ In the above diagram, both the processes P1 and P2 can access the message queue and store
and retrieve data.
The MPI standard specifies:
1. Point-to-point communication
2. Datatypes
3. Collective operations
4. Process groups
5. Communication contexts
6. Process topologies
7. Environment management and inquiry
8. The info object
9. Process creation and management
10. One- sided communication
11. External interfaces
12. Parallel file I/O
13. Language bindings for Fortran and C
14. Tool support.
The standard does not specify:
1. Operations that require more operating system support than currently standard. For
example: interrupt-driven receivers, remote execution, or active message.
2. Program construction tools
3. Debugging facilities.
Advantages of Message Passing Model
Some of the advantages of message passing model are given as follows:
• The message passing model is much easier to implement than the shared memory model.
• It is easier to build parallel hardware using message passing model as it is quite tolerant
of higher communication latencies.
Definition:
Divide and conquer is an algorithm design paradigm used to solve complex problems by
breaking them down into smaller, more manageable subproblems. The key steps involved in
this technique are:
1. Divide: Split the original problem into smaller subproblems that are similar to the
original problem but smaller in size.
2. Conquer: Solve each of the subproblems recursively. If the subproblems are small
enough, solve them directly (base case).
3. Combine: Merge the solutions of the subproblems to form the solution to the original
problem.
Features:
• Recursive Nature: The divide and conquer approach often employs recursion, where a
function calls itself with smaller inputs.
• Efficiency: Many divide and conquer algorithms have improved time complexity
compared to their naive counterparts, often achieving logarithmic or linearithmic time
complexity (e.g., O(n log n)).
• Parallelism: The subproblems can often be solved in parallel, making this approach
suitable for parallel computing environments.
• Base Case: The algorithm must define a base case where the problem is small enough to
be solved directly without further division.
• Examples: Common algorithms that use the divide and conquer technique include
Merge Sort, Quick Sort, Binary Search, and the Fast Fourier Transform (FFT).
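As a minimal illustration of the divide, conquer, and combine steps listed above, here is a sketch of merge sort in Python (names are illustrative):

def merge_sort(a):
    # Divide: split the list in half; Conquer: sort each half recursively;
    # Combine: merge the two sorted halves back together.
    if len(a) <= 1:                       # base case: already sorted
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([9, 3, 7, 1, 8, 2]))     # [1, 2, 3, 7, 8, 9]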
Features of Hypercube
A hypercube is a geometric shape that generalizes the concept of a cube to higher dimensions.
The features of a hypercube include:
3. Symmetry: Hypercubes exhibit a high degree of symmetry. All vertices, edges, and faces
are equivalent, which means they can be transformed into each other through rotations
and reflections.
6. Applications: Hypercubes are utilized in various fields, including computer science (for
parallel processing architectures), mathematics (in topology and combinatorics), and
data structures (for organizing multi-dimensional data).
7. Distance Metric: The distance between any two vertices in a hypercube can be
measured using the Hamming distance, which counts the number of differing bits in
their binary representations.
Ans: NUMA stands for Non-Uniform Memory Access and is a special type of shared memory
architecture where the time to access different memory locations may vary for a single
processor, and the time to access the same memory location may vary across different processors.
Parallel algorithms often require a single process to send identical data to all other processes
or to a subset of them. This operation is known as one-to-all broadcast. Initially, only the source
process has the data of size m that needs to be broadcast. At the termination of the procedure,
there are p copies of the initial data – one belonging to each process. The dual of one-to-all
broadcast is all-to-one reduction. In an all-to-one reduction operation, each of the p
participating processes starts with a buffer M containing m words. The data from all processes
are combined through an associative operator and accumulated at a single destination process
into one buffer of size m. Reduction can be used to find the sum, product, maximum, or
minimum of sets of numbers – the ith word of the accumulated M is the sum, product,
maximum, or minimum of the ith words of each of the original buffers.
One-to-all broadcast and all-to-one reduction are two fundamental communication patterns in
parallel computing and distributed systems. Here’s a concise differentiation between the two:
One-to-All Broadcast
Definition:
In a one-to-all broadcast, a single source node sends a message to all other nodes in the
network or system.
Characteristics:
• Use Case: Useful when a single piece of information needs to be disseminated to all
participants, such as configuration updates or notifications.
All-to-One Reduction
Definition:
In an all-to-one reduction, multiple source nodes send their data to a single destination node,
which aggregates or reduces the data.
Characteristics:
• Sources: There are multiple senders, each providing their own data.
• Use Case: Commonly used for operations that require combining data, such as summing
values, finding maximums, or calculating averages across multiple nodes.
• Communication Pattern: Often involves gathering data from various nodes to a central
node, which may require synchronization and data aggregation.
• Example: Worker nodes in a parallel computation sending their results back to a master
node for final processing.
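As a concrete sketch, both patterns map directly onto MPI's collective operations; the snippet below assumes mpi4py is installed and that the script is launched with mpiexec:

# Run with: mpiexec -n 4 python bcast_reduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# One-to-all broadcast: rank 0 sends the same data to every process.
data = {"step": 1} if rank == 0 else None
data = comm.bcast(data, root=0)

# All-to-one reduction: every process contributes a value; rank 0 receives the sum.
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of ranks:", total)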
6. What is the stored memory model? Explain its use.
Ans: The stored memory model, also known as the stored-program architecture, is a computer
architecture in which both program instructions and data are stored in the same memory
space. This model allows the CPU to access both instructions and data using the same memory
addressing mechanism.
1. Unified Memory Space: - Both program instructions and data reside in the same memory
space, allowing for easier management and access.
2. Program Instructions as Data: - Programs can be treated as data, enabling powerful features
such as self-modifying code and dynamic loading of programs.
3. Sequential Execution: - The CPU fetches instructions sequentially from memory, executes
them, and can modify the instruction pointer to jump to different parts of the program.
4. Flexibility: - The model allows for more flexible programming techniques, such as recursion
and the use of function pointers.
2. Dynamic Program Behavior: - The ability to treat instructions as data allows programs to
modify their own behavior at runtime, which is useful in applications like interpreters, just-in-
time (JIT) compilers, and dynamic programming languages.
3. Operating Systems: - Operating systems utilize the stored memory model to manage tasks,
run applications, and handle system calls, as they can load and execute programs from memory
dynamically.
4. Embedded Systems: - Many embedded systems use the stored memory model for flexibility
in programming and ease of updates, allowing firmware to be modified or updated without
changing hardware.
Ans: A natural extension of the serial model of computation (the Random Access Machine, or
RAM) consists of p processors and a global memory of unbounded size that is uniformly
accessible to all processors. All processors access the same address space. Processors share a
common clock but may execute different instructions in each cycle. This ideal model is also
referred to as a parallel random access machine (PRAM). Since PRAMs allow concurrent access
to various memory locations, depending on how simultaneous memory accesses are handled,
PRAMs can be divided into four subclasses.
1. Exclusive-read, exclusive-write (EREW) PRAM - Access to a memory location is exclusive: no
two processors may read from or write to the same location at the same time. This is the weakest PRAM model.
2. Concurrent-read, exclusive-write (CREW) PRAM - Multiple processors may read a memory
location concurrently, but write access remains exclusive.
3. Exclusive-read, concurrent-write (ERCW) PRAM - Multiple processors may write to a memory
location concurrently, but read access remains exclusive.
4. Concurrent-read, concurrent-write (CRCW) PRAM - This class allows multiple read and write
accesses to a common memory location. This is the most powerful PRAM model.
Allowing concurrent read access does not create any semantic discrepancies in the program.
However, concurrent write access to a memory location requires arbitration. Several protocols
are used to resolve concurrent writes. The most frequently used protocols are as follows:
• Common, in which the concurrent write is allowed if all the values that the processors are
attempting to write are identical.
• Arbitrary, in which an arbitrary processor is allowed to proceed with the write operation and
the rest fail.
• Priority, in which all processors are organized into a predefined prioritized list, and the
processor with the highest priority succeeds and the rest fail.
• Sum, in which the sum of all the quantities is written (the sum-based write conflict resolution
model can be extended to any associative operator defined on the quantities being written).
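As an illustration only (not any standard API), here is a small Python sketch that resolves a toy concurrent write under each protocol; the order of the list is assumed to encode processor priority:

import random

def crcw_write(values, protocol="common"):
    # values[i] is the quantity processor i tries to write to one shared cell;
    # processor 0 is assumed to have the highest priority.
    if protocol == "common":
        if len(set(values)) != 1:
            raise ValueError("COMMON requires all processors to write the same value")
        return values[0]
    if protocol == "arbitrary":
        return random.choice(values)    # any one writer may succeed
    if protocol == "priority":
        return values[0]                # highest-priority processor wins
    if protocol == "sum":
        return sum(values)              # any associative operator could be used here
    raise ValueError("unknown protocol: " + protocol)

print(crcw_write([7, 7, 7], "common"))    # 7
print(crcw_write([1, 2, 3], "priority"))  # 1
print(crcw_write([1, 2, 3], "sum"))       # 6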
8. Differentiate between parallel and serial computing.
Ans: The sieve of Eratosthenes is an ancient algorithm for finding all prime numbers up to any given
limit. It does so by iteratively marking as composite (i.e., not prime) the multiples of each prime,
starting with the first prime number, 2. It is one of the most efficient ways to find all of the smaller
primes. It may be used to find primes in arithmetic progressions.
Algorithm:
Example: To find all the prime numbers less than or equal to 20, proceed as follows.
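A minimal Python sketch of the sieve, run here for the example limit of 20:

def sieve_of_eratosthenes(limit):
    # is_prime[k] stays True while k is still a candidate prime.
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            # Mark every multiple of p, starting from p*p, as composite.
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
    return [k for k, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(20))   # [2, 3, 5, 7, 11, 13, 17, 19]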
Ans: It is an approach that works by using half as many threads as there are elements in the
dataset. Every thread calculates the minimum of its own element and some other element. The
resulting element is forwarded to the next round. The number of threads is then reduced by
half and the process is repeated until just a single element remains, which is the result
of the operation. The types of parallel reduction are as follows:
There are several types of parallel reduction based on the operation being performed. Here are
a few common types:
1. Sum Reduction:
o This is the most common type of reduction, where the goal is to compute the
sum of a set of numbers. For example, summing an array of integers can be done
in parallel by dividing the array into segments, summing each segment, and then
summing the results of those segments.
2. Product Reduction:
o Similar to sum reduction, but instead of adding numbers, the goal is to compute
the product of a set of numbers. This can be useful in various mathematical
computations.
3. Maximum/Minimum Reduction:
o This type of reduction finds the maximum or minimum value in a dataset. Each
processor can find the maximum or minimum in its chunk, and then the results
are combined to find the overall maximum or minimum.
4. Logical Reduction:
o This reduction is used to compute logical operations (AND, OR, etc.) across a set
of boolean values. For example, determining if any element in a boolean array is
true can be done using parallel OR reduction.
5. Custom Reduction:
Here’s a simple example of how sum reduction might work in a parallel computing
environment:
1. Input Array: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
2. Divide the Array: Split into chunks: [1, 2, 3, 4], [5, 6, 7, 8], [9, 10]
o Processor 1: 1 + 2 + 3 + 4 = 10
o Processor 2: 5 + 6 + 7 + 8 = 26
o Processor 3: 9 + 10 = 19
o Final Sum = 10 + 26 + 19 = 55
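A minimal Python sketch of this sum reduction; a thread pool stands in for the three processors (illustrative only — for CPU-bound work a process pool would be used to sidestep the GIL):

from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=3):
    # Split the data into one chunk per worker, sum the chunks concurrently
    # (local reductions), then combine the partial sums (global reduction).
    size = -(-len(data) // workers)                 # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))      # 10, 26, 19 for the example
    return sum(partials)

print(parallel_sum(list(range(1, 11))))   # 55, matching the example above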
The benefits of parallel reduction include:
• Scalability: It can easily scale with the number of processors, allowing for better
resource utilization.
• Reduced Latency: Parallel processing minimizes the time taken to compute results by
leveraging multiple execution units.
11. What are the criteria that are used to evaluate the cost and performance of
static interconnection networks?
Ans: Evaluating the cost and performance of static interconnection networks, which are often used in
parallel computing and distributed systems, involves several criteria. These criteria help determine how
well the network meets the requirements of the applications it serves. Here are the key criteria used for
evaluation:
1. Latency
• Definition: The time it takes for a message to travel from the source node to the
destination node.
• Importance: Low latency is crucial for applications that require real-time processing or
frequent communication between nodes. It impacts the overall responsiveness of the
system.
2. Throughput
• Definition: The amount of data that can be transmitted through the network in a given
time period, usually measured in bits per second (bps).
• Importance: High throughput is essential for applications that involve large data
transfers or require high bandwidth, such as scientific simulations and data analytics.
3. Scalability
• Definition: The ability of the network to maintain performance levels as the number of
nodes increases.
• Importance: A scalable network can efficiently handle the addition of new nodes
without significant degradation in performance, making it suitable for growing
applications.
4. Cost
• Definition: The financial expense associated with building and maintaining the network,
including hardware, software, and operational costs.
5. Reliability
• Definition: The ability of the network to operate correctly and consistently over time,
including the capacity to recover from failures.
• Importance: Reliable networks ensure that data is transmitted without loss and that the
system remains operational, which is vital for mission-critical applications.
6. Fault Tolerance
7. Network Topology
• Definition: The arrangement of nodes and connections in the network, which can affect
performance characteristics.
• Importance: Different topologies (e.g., mesh, torus, tree) have distinct performance
implications, such as routing efficiency and fault tolerance.
8. Bandwidth
9. Congestion Control
1. Encapsulation
• Definition: Encapsulation is the bundling of data (attributes) and methods (functions) that
operate on the data into a single unit called an object.
2. Inheritance
• Definition: Inheritance is a mechanism that allows a new class to inherit properties and
behaviors (methods) from an existing class.
3. Polymorphism
4. Concurrency Control
• Definition: Concurrency control refers to techniques used to manage the execution of multiple
threads or processes to ensure that they do not interfere with each other.
5. Distributed Objects
• Definition: Distributed objects are objects that can communicate and interact with one another
over a network.
• Feature in Parallelism: POOP supports the creation of distributed systems where objects reside
on different nodes in a network. This allows for parallel processing across multiple machines,
leveraging network resources for computation.
6. Task Decomposition:- Definition: Task decomposition involves breaking down a complex problem
into smaller, manageable tasks that can be executed in parallel.
• Feature in Parallelism: In POOP, objects can represent distinct tasks or components of a larger
problem. This modularity allows for easier distribution and parallel execution of tasks among
multiple processors or cores.
7. Data Sharing and Communication :- Definition: Data sharing and communication refer to the
mechanisms by which objects exchange information and synchronize their operations.
9. Frameworks and Libraries:- Definition: Frameworks and libraries provide pre-defined structures and
tools that simplify the development of parallel applications.
10. Dynamic Behavior:- Definition: Dynamic behavior refers to the ability of objects to change their
state or behavior at runtime.
• Feature in Parallelism: In POOP, objects can adapt their parallel processing strategies based on
runtime conditions, such as available resources or workload characteristics. This flexibility
enhances performance and resource utilization.
13. Explain Amdahl's law with a suitable example. Why are actual speedups always less
than Amdahl's law predicts?
Example:
Let a program have 40 percent of its code enhanced (so fe = 0.4) to run 2.3 times
faster (so fi = 2.3). What is the overall system speedup S?
Solution:
Step 1: Set up the equation:
S = 1 / ((1 - fe) + (fe / fi))
Step 2: Plug in the values and solve:
S = 1 / ((1 - 0.4) + (0.4 / 2.3))
  = 1 / (0.6 + 0.174)
  = 1 / 0.774
  = 1.292
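The same calculation as a small Python helper (a sketch using the fe/fi notation from the example):

def amdahl_speedup(fe, fi):
    # Overall speedup when a fraction fe of the work runs fi times faster.
    return 1.0 / ((1 - fe) + fe / fi)

print(round(amdahl_speedup(0.4, 2.3), 3))   # 1.292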
Amdahl’s law uses two factors to find speedup from some enhancement
1. Fraction enhanced:
➢ It is the fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement.
➢ For example: If 10 seconds of the execution time of a program that takes 40
seconds in total can use an enhancement, the fraction is 10/40. This obtained value
is Fraction Enhanced.
➢ Fraction enhanced is always less than 1.
2. Speedup enhanced:
➢ The improvement gained by the enhanced execution mode; that is, how much
faster the task would run if the enhanced mode were used for the entire program.
➢ For example: If the enhanced mode takes, say 3 seconds for a portion of the
program, while it is 6 seconds in the original mode, the improvement is 6/3. This
value is Speedup enhanced.
➢ Speedup Enhanced is always greater than 1.
14. Explain distributed shared memory systems. Write their advantages.
Ans: A distributed shared memory (DSM) system presents physically distributed memories to the
programmer as a single shared address space. Its advantages include:
• Scales well with a large number of nodes.
• Message passing is hidden from the programmer.
• Can handle complex and large databases without replication or sending the data to processes.
• Generally cheaper than using a multiprocessor system.
• Provides a large virtual memory space.
Definition: Parallel reduction algorithms are techniques used to combine multiple values into a
single result using parallel processing. Common operations include summation, finding
maximum/minimum values, and logical operations.
1. Divide the Input: Split the data into smaller chunks for independent processing.
2. Local Reduction: Each processor computes a local result for its assigned chunk.
3. Global Reduction: Intermediate results from all processors are combined to produce the
final result.
Example: sum the array [1, 2, 3, 4, 5, 6, 7, 8] on two processors.
1. Divide the Input: [1, 2, 3, 4] and [5, 6, 7, 8]
2. Local Sums: 10 and 26
3. Final Sum: 10 + 26 = 36
PRAM Architecture
PRAM (Parallel Random Access Machine): A theoretical model for parallel computing with
multiple processors accessing shared memory simultaneously. It has different models, including
EREW (Exclusive Read Exclusive Write), CREW (Concurrent Read Exclusive Write), and CRCW
(Concurrent Read Concurrent Write).
Ans: Parallel computing can be broadly categorized into two main approaches: data
parallelism and control parallelism. Each approach focuses on different aspects of parallel
execution and is suited for different types of problems.
1. Data Parallelism:- Data parallelism involves distributing subsets of the same data across
multiple processors, where the same operation is performed on each subset concurrently.
Key Features:
• Data Distribution: Data is partitioned into smaller chunks that can be processed
independently.
• Scalability: Easily scales with the amount of data and the number of processors.
Implementation: Commonly implemented using frameworks like CUDA for GPUs or parallel
libraries such as OpenMP and MPI.
2. Control Parallelism:- Control parallelism involves executing different tasks or operations
concurrently, possibly on the same or on different data.
Key Features:
• Task Distribution: Different tasks are assigned to different processors, which may
involve different operations.
Limitations of Amdahl's law:
1. A key flaw in past arguments that Amdahl's law is a fatal limit to the future of
parallelism is addressed by Gustafson's law: the proportion of the computation that is
sequential normally decreases as the problem size increases.
2. Its proof focuses on the steps in a particular algorithm, and does not consider
whether other algorithms with more parallelism may exist.
3. Amdahl's law applies only to 'standard' problems where superlinearity cannot occur.
Note: Actual speedups are always less than Amdahl's law predicts because distributing work to the
parallel processors and collecting the results back together is extra work required in the
parallel version that is not required in the serial version.
18. Explain the steps of Foster's design methodology.
Ans: Foster's methodology is a well-known design methodology used to architect and implement
distributed-memory systems (DMS). The methodology is built on four steps:
1. Partitioning:
i. In partitioning, you break the problem down into its most basic units of work, called
primitive tasks.
ii. For example, a system whose purpose is to sum all the positions of a giant array has as its
primitive task the act of adding the current position to an accumulator variable.
➢ This first stage, partitioning, is the only chance to expose this parallelism; the remaining
stages typically reduce the amount of parallelism. There are several strategies for partitioning,
such as domain decomposition and functional decomposition, and Foster provides a checklist for
evaluating a partitioning.
2. Communication
➢ Once you have decided on the primitive tasks, you need to evaluate the need for communication
among them.
➢ There are two levels of communication: local communication, where a task requires data
from a set of adjacent tasks (e.g., to sum position j+1 you need to first compute position j
and pass the accumulated value to the task responsible for the following position),
➢ and global communication, where a task produces a value that every other task requires.
You should minimize both kinds of communication, without going to either extreme.
3. Agglomeration
➢ Agglomeration is the act of grouping the many primitive tasks created by partitioning back
into larger, more manageable tasks.
➢ One of the goals is to hide the majority of the communication cost inside a single task,
➢ So instead of having a task add its position of the array to the accumulator and then
pass the accumulator value to the next task, we agglomerate a bunch of array positions
into the same task and give it to a single core, dramatically reducing the
communication cost.
4. Mapping
➢ Mapping, the final stage of Foster's methodology, is the procedure of assigning each task to
a processor.
➢ This mapping problem does not arise on uniprocessors or on shared memory computers
whose operating systems provide automatic scheduling.
Data parallelism vs. control parallelism:
• In data parallelism, the same operations are performed on different subsets of the same data;
in control parallelism, different operations are performed on the same or different data.
• In data parallelism, speedup is greater, as there is only one execution thread operating on all
sets of data; in control parallelism, speedup is less, as each processor executes a different
thread or process on the same or a different set of data.
Description:
➢ There are many ways to measure the performance of a parallel algorithm running on a
parallel processor.
➢ The Karp–Flatt metric reveals aspects of the performance that are not easily discerned from
other metrics.
➢ A pseudo-"derivation" of sorts follows from Amdahl's Law, which can be written as:
T(p) = Ts + Tp / p
Where:
1. T(p) is the total time taken for code execution in a p-processor system.
2. Ts is the time taken for the serial part of the code to run.
3. Tp is the time taken for the parallel part of the code to run on one processor.
4. p is the number of processors.
With the result obtained by substituting p = 1, viz. T(1) = Ts + Tp, if we define the serial fraction
e = Ts / T(1), then the equation can be written as
T(p) = T(1)e + T(1)(1 - e) / p
In terms of the speedup ψ = T(1) / T(p), this gives
1/ψ = e + (1 - e) / p
➢ Solving for the serial fraction, we get the Karp–Flatt metric: e = (1/ψ - 1/p) / (1 - 1/p).
➢ Note that this is not a "derivation" from Amdahl's law as the left hand side represents a
metric rather than a mathematically derived quantity.
➢ The treatment above merely shows that the Karp–Flatt metric is consistent with Amdahl's
Law.
p 2 3 4 5 6 7 8
𝜓(𝑛, 𝑝) 1.82 2.50 3.08 3.57 4.00 4.38 4.71
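Applying the metric to the measured speedups in the table above (a small Python sketch) shows that the experimentally determined serial fraction stays roughly constant at about 0.1, which is exactly the kind of insight the metric is designed to reveal:

def karp_flatt(speedup, p):
    # e = (1/psi - 1/p) / (1 - 1/p): the experimentally determined serial fraction
    return (1 / speedup - 1 / p) / (1 - 1 / p)

for p, psi in [(2, 1.82), (3, 2.50), (4, 3.08), (5, 3.57),
               (6, 4.00), (7, 4.38), (8, 4.71)]:
    print(p, round(karp_flatt(psi, p), 3))   # each value comes out near 0.1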
LONG
1. Explain Canon’s algorithm. Use Eratostheness algorithm to findall the prime numbers
between 1 to 10.
Ans: Cannon's algorithm is a memory-efficient parallel algorithm for multiplying two n × n
matrices on a square grid of processes.
➢ The main advantage of the algorithm is that its storage requirements remain constant and
are independent of the number of processors.
Algorithm:
3. Process P(i, j) initially stores A(i, j) and B(i, j) and computes block C(i, j) of the result matrix.
4. The initial step of the algorithm concerns the alignment of the matrices.
5. Align the blocks of A and B in such a way that each process can independently start
multiplying its local submatrices.
6. This is done by shifting all submatrices A(i , j) to the left (with wraparound) by i steps and all
submatrices B(i , j) up (with wraparound) by j steps.
8. Each block of A moves one step left and each block of B moves one step up (again with
wraparound).
9. Perform next block multiplication, add to partial result, repeat until all blocks have been
multiplied.
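A serial NumPy sketch of the block shifts described above; it mimics a q × q process grid with ordinary Python lists and assumes the matrix size is divisible by q:

import numpy as np

def cannon_matmul(A, B, q):
    n = A.shape[0]
    b = n // q                                        # block size
    Ablk = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Bblk = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(q)] for i in range(q)]
    Cblk = [[np.zeros((b, b)) for _ in range(q)] for _ in range(q)]
    # Initial alignment: shift block-row i of A left by i, block-column j of B up by j.
    Ablk = [[Ablk[i][(j + i) % q] for j in range(q)] for i in range(q)]
    Bblk = [[Bblk[(i + j) % q][j] for j in range(q)] for i in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                Cblk[i][j] += Ablk[i][j] @ Bblk[i][j]   # local multiply-accumulate
        # Shift every A block one step left and every B block one step up (wraparound).
        Ablk = [[Ablk[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bblk = [[Bblk[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return np.block(Cblk)

A, B = np.random.rand(4, 4), np.random.rand(4, 4)
print(np.allclose(cannon_matmul(A, B, q=2), A @ B))   # True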
To find all the prime numbers less than or equal to 10, proceed as follows.
The Sieve of Eratosthenes is an efficient algorithm for finding all prime numbers up to a
specified integer. Here’s how the algorithm works to find all prime numbers between 1 and 10:
1. Initialize a List:
o Create a list of boolean values representing the numbers from 2 to 10. Initially, all
values are set to True, indicating that all numbers are potential primes.
o Index: 0 1 2 3 4 5 6 7 8 9 10
o Values: [F, F, T, T, T, T, T, T, T, T, T]
o Here, the index represents the numbers, and True indicates that the number is a
candidate for being prime.
o The first prime number is 2. Start with this number and mark all of its multiples
as False (not prime).
3. Mark Multiples of 2:
o Index: 0 1 2 3 4 5 6 7 8 9 10
o Values: [F, F, T, T, F, T, F, T, F, T, F]
o The next number that is still True is 3. Mark all of its multiples as False.
5. Mark Multiples of 3:
o Index: 0 1 2 3 4 5 6 7 8 9 10
o Values: [F, F, T, T, F, T, F, T, F, F, F]
▪ However, the only multiple of 5 within our range is 10, which is already
marked False.
o Skip 6 as it is False.
o The next number is 7, which is True. Its smallest multiple (other than 7 itself) is 14,
which is outside our limit.
10. Completion:
From the boolean list, the prime numbers between 1 and 10 are: 2, 3, 5, 7.
2. What do you mean by combinatorial search? Explain any one example in detail.
Definition:
Combinatorial search refers to a class of algorithms and techniques used to explore and find
solutions in a search space defined by a set of discrete objects or combinations. This type of
search is often applied in optimization problems, decision-making, and AI, where the goal is to
find the best arrangement or selection from a finite set of possibilities.
Combinatorial search problems can range from simple tasks, like generating permutations of a
list, to complex problems like the traveling salesman problem or the knapsack problem.
Problem Statement:
Place N queens on an N×N chessboard so that no two queens threaten each other (no two
queens can be in the same row, column, or diagonal).
2. Backtracking Algorithm:
o Starting from the first row, try placing a queen in each column and check that it is not
attacked by any queen already placed.
o If a valid position is found, place the queen and move to the next row.
o If a position leads to no solution, remove the queen (backtrack) and try the next
column.
Example Code (Python):
def is_safe(board, row, col, n):
    for i in range(row):                # any queen above in the same column or on a diagonal?
        d = row - i
        if board[i][col] or (col - d >= 0 and board[i][col - d]) or (col + d < n and board[i][col + d]):
            return False
    return True

def solve_n_queens_util(board, row, n):
    if row >= n:                        # all rows filled: print one solution
        for r in board:
            print("".join("Q" if c else "." for c in r))
        print()
        return
    for col in range(n):
        if is_safe(board, row, col, n):
            board[row][col] = True      # place queen, recurse, then backtrack
            solve_n_queens_util(board, row + 1, n)
            board[row][col] = False

def solve_n_queens(n):
    board = [[False] * n for _ in range(n)]
    solve_n_queens_util(board, 0, n)
Ans: Pipelining is a technique of dividing one task into multiple subtasks and executing the
subtasks in parallel with multiple hardware units.
Ex: In a pipelined processor, S1, S2, S3, …, Sn are the n sections that form the pipeline. Each
section performs a subtask on the input received from the interstage buffer that stores the
output of the previous section. All sections can perform their operations simultaneously. The
objective of pipeline processing is to increase throughput through the overlap between the
execution of consecutive tasks. An instruction pipeline divides the actions of an instruction
cycle into multiple steps that are executed one by one in different sections of the processor.
The number of sections in the pipeline is designed by the computer architect. Consider an
instruction cycle divided into four steps:
a) Instruction Fetch (IF)
b) Instruction Decode (ID)
c) Execute (EX)
d) Write Result (WR)
Assume that there are m instructions and these instructions will be executed in n-sections of
the pipeline processor.
The time taken for the first instruction = ntc, where tc is the duration of the clock cycle.
If the processor is non-pipelined, then the total time taken for m instructions is mntc.
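A small sketch comparing the two totals; the pipelined total (n + m - 1)·tc is the usual idealized formula and is assumed here, as are the example values m = 100, n = 4, and tc = 1 ns:

def pipeline_times(m, n, tc):
    # Non-pipelined processor: every instruction passes through all n sections serially.
    non_pipelined = m * n * tc
    # Pipelined processor (idealized): the first instruction takes n cycles to fill
    # the pipeline, after which one instruction completes every cycle.
    pipelined = (n + m - 1) * tc
    return non_pipelined, pipelined

print(pipeline_times(100, 4, 1))   # (400, 103) -- times in nanoseconds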
Ans: Definition:
The hypercube algorithm is a parallel computing model based on the structure of a hypercube
graph. Each node in the hypercube represents a processor, and edges between nodes represent
communication links. The hypercube is a powerful architecture for parallel processing due to its high
connectivity and low latency.
Features of Hypercube
2. Scalability:- It can easily scale by increasing the dimension d, allowing for more
processors.
4. Fault Tolerance:- The structure allows for some level of fault tolerance; if a node
fails, the remaining nodes can still communicate.
Forward Phase
1. Initialization:
2. Communication:
o In each step, each processor sends its value to its neighbors along the edges of
the hypercube.
o The communication proceeds in a way that each processor aggregates the values
from its neighbors, effectively computing partial sums.
3. Steps:
o This continues until the values are propagated through the hypercube structure.
Reverse Phase
1. Finalization:
o After the forward phase, each processor holds a partial sum. The reverse phase
is used to compute the final prefix sum.
2. Communication:
o Similar to the forward phase, each processor sends its value to its neighbors.
o However, in this phase, the processors adjust their values based on the values
received from their neighbors.
3. Steps:
o The reverse phase also involves multiple steps. Each processor updates its value
based on the values received from its neighbors, effectively completing the
prefix computation.
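As a concrete reference point, here is a serial Python simulation of one standard single-pass hypercube prefix-sum scheme, in which processors exchange partial sums pairwise along each of the d dimensions (a sketch; the forward/reverse phrasing above is a different but related formulation):

def hypercube_prefix_sum(values, d):
    # values[i] is the input held by processor i; assumes len(values) == 2**d.
    p = 2 ** d
    result = list(values)   # running prefix sum held by each processor
    msg = list(values)      # running subcube total exchanged along each dimension
    for i in range(d):
        received = [msg[pid ^ (1 << i)] for pid in range(p)]   # pairwise exchange
        for pid in range(p):
            msg[pid] = msg[pid] + received[pid]
            if (pid >> i) & 1:      # partner has a lower id: its sum precedes ours
                result[pid] += received[pid]
    return result

print(hypercube_prefix_sum([1, 2, 3, 4, 5, 6, 7, 8], 3))
# [1, 3, 6, 10, 15, 21, 28, 36]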
4. Explain the different paradigms of parallel computing. Also discuss the merits and demerits
of parallel computing.
Ans: Parallel computing can be categorized into several paradigms based on how tasks are
divided and executed. Here are some of the main paradigms:
1. Data Parallelism:
2. Task Parallelism:
3. Pipeline Parallelism:
o Description: Tasks are divided into stages, and each stage is executed in parallel.
The output of one stage is the input for the next.
o Example: In image processing, one processor might filter an image while another
compresses it.
4. Bit-level Parallelism:
6. Hybrid Parallelism:
4. Enhanced Capability:- Enables solving complex problems that are infeasible for
sequential computing due to time or memory constraints.
Ans: In computer science, the Floyd–Warshall algorithm (also known as Floyd's algorithm, the
Roy–Warshall algorithm, the Roy–Floyd algorithm, or the WFI algorithm) is an algorithm for
finding shortest paths in a directed weighted graph with positive or negative edge weights (but
with no negative cycles). A weighted graph is a graph in which each edge has a numerical value
associated with it.
Floyd-Warshall Algorithm:
n = number of vertices
A = matrix of dimension n x n
for k = 1 to n
    for i = 1 to n
        for j = 1 to n
            A^k[i, j] = min(A^(k-1)[i, j], A^(k-1)[i, k] + A^(k-1)[k, j])
return A
Example
Consider the following directed graph with 4 vertices (0, 1, 2, 3) and the edges with weights:
• 0 → 1 (3)
• 0 → 2 (5)
• 1 → 2 (1)
• 2 → 3 (2)
• 1 → 3 (6)
Step 1: Initialization
The initial distance matrix D is:
0 1 2 3
0 [ 0, 3, 5, ∞ ]
1 [ ∞, 0, 1, 6 ]
2 [ ∞, ∞, 0, 2 ]
3 [ ∞, ∞, ∞, 0 ]
Step 2: Relaxation
• For k = 0:
o Update distances using vertex 0 (no changes, since no edges lead into vertex 0).
• For k = 1:
o Update distances using vertex 1:
o D[0][2] = min(D[0][2], D[0][1] + D[1][2]) = min(5, 3 + 1) = 4
o D[0][3] = min(D[0][3], D[0][1] + D[1][3]) = min(∞, 3 + 6) = 9
• For k = 2:
o Update distances using vertex 2:
o D[1][3] = min(D[1][3], D[1][2] + D[2][3]) = min(6, 1 + 2) = 3
o D[0][3] = min(D[0][3], D[0][2] + D[2][3]) = min(9, 4 + 2) = 6
• For k = 3:
o No updates needed since vertex 3 has no outgoing edges.
Final Distance Matrix:
After considering all vertices, the final distance matrix D becomes:
0 1 2 3
0 [ 0, 3, 4, 6 ]
1 [ ∞, 0, 1, 3 ]
2 [ ∞, ∞, 0, 2 ]
3 [ ∞, ∞, ∞, 0 ]
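A minimal Python sketch that reproduces the worked example (inf stands for ∞):

import math

INF = math.inf

def floyd_warshall(D):
    # In-place all-pairs shortest paths on an n x n distance matrix.
    n = len(D)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

# Edges from the example: 0->1 (3), 0->2 (5), 1->2 (1), 2->3 (2), 1->3 (6)
D = [[0, 3, 5, INF],
     [INF, 0, 1, 6],
     [INF, INF, 0, 2],
     [INF, INF, INF, 0]]
for row in floyd_warshall(D):
    print(row)
# [0, 3, 4, 6], [inf, 0, 1, 3], [inf, inf, 0, 2], [inf, inf, inf, 0]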
6. Explain Flynn’s classification of computer architecture with it’s diagram.
1. A sequential computer which exploits no parallelism in either the instruction or data streams.
2. Single control unit (CU) fetches single instruction stream (IS) from memory.
3. The CU then generates appropriate control signals to direct single processing element (PE) to
operate on single data stream (DS) i.e., one operation at a time.
4. Examples of SISD architecture are the traditional uniprocessor machines like older personal
computers (PCs; by 2010, many PCs had multiple cores) and mainframe computers.
Fig: SISD
1. A single instruction operates on multiple different data streams. Instructions can be executed
sequentially, such as by pipelining, or in parallel by multiple functional units.
Fig: SIMD
2. Single instruction, multiple threads (SIMT) is an execution model used in parallel computing
where single instruction, multiple data (SIMD) is combined with multithreading.
3. This is not a distinct classification in Flynn's taxonomy, where it would be a subset of SIMD.
Nvidia commonly uses the term in its marketing materials and technical documents where it
argues for the novelty of Nvidia architecture
Fig: MISD
Fig: MIMD
7. Explain the distributed memory and shared memory architectures of a computer.
➢ The shared memory in the shared memory model is the memory that can be simultaneously
accessed by multiple processes.
➢ This is done so that the processes can communicate with each other. All POSIX systems, as
well as Windows operating systems use shared memory.
➢ A diagram that illustrates the shared memory model of process communication is given
above.
➢ In the above diagram, the shared memory can be accessed by Process 1 and Process 2.
8. What are the major difference between distributed and shared memory?
Ans: Distributed memory and shared memory are two fundamental paradigms used in parallel
computing to manage memory and facilitate communication between processes. Here are the major
differences between them:
1. Memory Architecture
• Distributed Memory: Each processor has its own local memory, and the memory
is not shared among processors. Communication between processors occurs via
message passing (e.g., using MPI - Message Passing Interface). Example: Cluster
computing systems where each node has its own memory.
• Shared Memory: All processors access a single, common address space and communicate
implicitly by reading and writing shared variables. Example: multicore and symmetric
multiprocessor (SMP) systems.
2. Communication Mechanism
3. Scalability
5. Fault Tolerance
• Distributed Memory: More robust to individual node failures; if one node fails,
others can continue to operate. Fault tolerance mechanisms can be
implemented at the application level.
• Shared Memory: A failure in the shared memory subsystem can affect all
processes, making it less fault-tolerant. Typically requires additional mechanisms
to handle failures.
Summary Table
Conclusion
Both distributed memory and shared memory have their advantages and disadvantages, and
the choice between them depends on the specific requirements of the application, the
architecture of the system, and the desired scalability and performance characteristics.