What is Parallel Computing?
Doing many things simultaneously
The same thing or different things
All contributing to the solution of one larger problem
Serial Computing:
The problem is broken into a stream of instructions that are executed sequentially,
one after the other, on a single processor
Only one instruction executes at a time
Parallel Computing:
Problem is broken into parts that can be solved concurrently
Each part is further broken into a stream of instructions
Instructions from different parts execute simultaneously on different processors
Why Parallel Computing?
The real world is parallel: complex, interrelated events happen simultaneously
E.g., galaxies, planetary movements, functioning of the brain, weather, traffic
Parallel computing is better suited to modeling and simulating these processes
Why use parallel computing? To save time and produce results quickly enough to be
useful, e.g., weather modeling, automated stock trading
Many problems that are interesting to scientists and engineers don't fit on a PC; they have
huge memory requirements
Individual processors are not getting any faster, owing to
heat dissipation and power consumption limits
Where is Parallel Computing used?
Scientific Computing
Numerically Intensive Simulations
Database Operations and Information Systems
Web based services, Web search engines, online transaction processing
Client and Inventory Database Management; Data-mining; Online Transaction
Processing; Management Information Systems
Geographic Information Systems; Seismic Data Processing;
Artificial Intelligence, Machine Learning and Deep Learning
Real Time Systems and Control Applications
Hardware and Robotics Control; Speech Processing; Pattern Recognition;
1.1 CLASSIFICATION OF PARALLEL COMPUTERS
Flynn’s Classification
It is a way of organizing multiple-processor systems.
Flynn proposed the most common approach for
categorizing systems with parallel processing capabilities.
Categories of Computer Systems:
SISD – Single Instruction Single Data stream
SIMD – Single Instruction Multiple Data stream
MISD – Multiple Instruction Single Data stream
MIMD – Multiple Instruction Multiple Data stream
FLYNN’S CLASSIFICATION
Flynn’s Classification is the most popular taxonomy of computer
architecture. It was proposed by Michael J. Flynn in 1966 and is based on the
numbers of instruction and data streams.
It is based on the notion of a stream of information. Two types of
information flow into a processor: instructions and data.
The Instruction Stream is defined as the sequence of instructions
executed by the processing unit.
The Data Stream is defined as the sequence of data, including
inputs and partial or temporary results, called for by the instruction stream.
SISD – Single Instruction Single Data stream
In SISD, one instruction at a time runs on one processor, on data stored in a single
memory.
Uniprocessors fall into this category.
Machine instructions are executed in sequential order.
This architecture is also known as the sequential computer.
Many conventional computers follow this architecture.
Instructions and data must be stored in primary memory.
The clock rate determines the speed of processing in SISD.
SIMD – Single Instruction Multiple Data Stream
SIMD executes a single machine instruction on different sets of data on different
processors.
Each processing unit has an associated data memory for executing the instruction.
Much scientific computing is based on the SIMD architecture.
Vector and array processing are examples of this model.
If we want to perform an arithmetic operation on all the elements of an array A,
it can be done by this architecture.
Example: add 10 to all elements, multiply all elements by 5, rotate all the
elements right by 3 bits (a sketch follows below).
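As a rough sketch of these operations in plain C (illustrative only: the array contents are made up, the rotate assumes 32-bit unsigned ints, and a real SIMD machine would apply each operation to all elements at once rather than loop over them):

    #include <stdio.h>
    #define N 8

    int main(void) {
        unsigned int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        for (int i = 0; i < N; i++)
            a[i] += 10;                    /* add 10 to every element */
        for (int i = 0; i < N; i++)
            a[i] *= 5;                     /* multiply every element by 5 */
        for (int i = 0; i < N; i++)        /* rotate every element right by 3 bits */
            a[i] = (a[i] >> 3) | (a[i] << 29);
        for (int i = 0; i < N; i++)
            printf("%u ", a[i]);
        printf("\n");
        return 0;
    }

On a SIMD machine, each of the three loops corresponds to broadcasting a single instruction that every processing element applies to its own array element.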
MISD – Multiple Instruction Single Data Stream
In MISD, multiple processors execute different instructions on a
single data stream.
It is a theoretical model of computer.
No practical system has been
constructed using this organization.
MIMD – Multiple Instruction Multiple Data Stream
MIMD consists of a set of processors that simultaneously execute different
instruction sequences on different sets of data.
This organization refers to a computer system capable of processing several
programs at the same time.
Most multiprocessing and multiprogramming computer systems can be
classified under this organization.
An MIMD system in which each processor has local memory is a Non-Uniform
Memory Access (NUMA) organization.
NUMA is costlier than SMP.
A collection of uniprocessors or SMPs can be used to form a cluster.
Symmetric Multiprocessor (SMP) Characteristics:
There are two or more similar or comparable processors.
Memory access time is the same for all processors.
All the processors share I/O devices through the same channel.
All the processors can perform the same functions (that is why it is
called a symmetric multiprocessor system).
Single Instruction Multiple Data or SIMD Systems
SIMD (Single Instruction Multiple Data) systems are parallel systems.
They operate on multiple data streams by applying the same instruction to multiple data items.
Architecture:
Has a single control unit and multiple ALUs (processing elements).
The instruction is broadcast from the control unit to all ALUs.
Each ALU applies the instruction to its data item or stays idle.
Example – Vector Addition: Given two arrays x and y (each with n elements), we add the elements
of y to x:
for (i = 0; i < n; i++)
x[i] += y[i];
Flynn’s Taxonomy is used to classify computer architectures.
Classification is based on: the number of instruction streams and the number of data streams.
Example: A SISD system executes one instruction and processes one data item at a time
(classical von Neumann model).
Requirement: all ALUs must execute the same instruction or remain idle.
This can degrade SIMD system performance.
Example of Conditional Addition:
for (i = 0; i < n; i++)
    if (y[i] > 0.0) x[i] += y[i];
The addition is performed only if y[i] is positive, so the ALUs must first check each y[i].
Performance implication: if y[i] is positive, the addition is performed; otherwise, the
ALU storing y[i] remains idle. This results in underutilized ALUs.
Synchronous ALU operation: the ALUs must work synchronously; all wait for the next
broadcast instruction.
Instruction storage limitation: the ALUs have no instruction storage, so they
cannot delay execution and have
no ability to store instructions for later.
Vector processors: these operate on arrays or vectors of data, unlike conventional
CPUs, which operate on individual data elements (scalars).
They are widely used in high-performance computing because of their ability to process
large data sets efficiently.
Modern vector processors are designed with the following key components:
Vector Registers:
Special registers that store vectors of operands.
Enable operations to be performed simultaneously on all elements.
Vector length is fixed by the system, typically ranging from 4 to 256 64-bit
elements.
Enable high-speed, parallel computation.
Vectorized and Pipelined Functional Units:
Apply the same operation to each element of a vector (SIMD - Single Instruction, Multiple Data).
Common in operations like addition, multiplication, etc.
Each pair of corresponding elements in vectors can be processed in parallel.
Functional units are pipelined for better throughput and efficiency.
Vector Instructions:
Operate on vectors rather than scalars (single values).
Example: for(i = 0; i < n; i++)
x[i] += y[i];
With vector instructions:
Only one load, one add, and one store are needed per block of vector length.
More efficient than conventional systems, which require separate operations for each element
(a sketch of the idea follows below).
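A minimal sketch of the idea in C (the vector length VLEN below is an assumption; on real vector hardware each inner block would be a single vector load, add, and store rather than a scalar loop):

    #include <stdio.h>
    #define VLEN 4   /* assumed vector length */

    /* Strip-mined vector addition: each pass over an inner block stands for
       ONE vector load, ONE vector add, and ONE vector store. */
    void vec_add(double x[], double y[], int n) {
        for (int i = 0; i < n; i += VLEN) {
            int len = (n - i < VLEN) ? n - i : VLEN;  /* final partial block */
            for (int j = 0; j < len; j++)             /* done in parallel in hardware */
                x[i + j] += y[i + j];
        }
    }

    int main(void) {
        double x[6] = {1, 2, 3, 4, 5, 6}, y[6] = {10, 20, 30, 40, 50, 60};
        vec_add(x, y, 6);
        for (int i = 0; i < 6; i++)
            printf("%.0f ", x[i]);
        printf("\n");
        return 0;
    }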
Interleaved Memory
The memory system has multiple independent memory banks.
Banks can be accessed simultaneously, or sequentially with less delay.
Advantage: distributing vector elements across multiple banks allows faster loading/storing
and minimizes delay in accessing successive elements (see the sketch below).
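A small sketch of the usual round-robin mapping (the four-bank layout and the modulo assignment below are illustrative assumptions):

    #include <stdio.h>
    #define NBANKS 4   /* assumed number of memory banks */

    int main(void) {
        /* With round-robin interleaving, consecutive vector elements land in
           different banks, so successive accesses can overlap. */
        for (int i = 0; i < 8; i++)
            printf("element %d -> bank %d\n", i, i % NBANKS);
        return 0;
    }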
Strided Memory Access and Hardware Scatter/Gather
Strided memory access involves accessing elements at regular intervals in memory.
Example: accessing the 1st, 5th, 9th, ... elements → stride = 4.
Common in matrix and signal processing applications.
Improves performance by enabling predictable memory access patterns.
Vector systems use hardware support to handle these accesses efficiently, as in the sketch below.
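A minimal sketch of strided access in C (the stride of 4 matches the example above; accessing one column of a row-major matrix is a common source of such patterns):

    #include <stdio.h>
    #define N 16
    #define STRIDE 4

    int main(void) {
        double a[N];
        for (int i = 0; i < N; i++)
            a[i] = i;
        /* Strided access: touch every STRIDE-th element, i.e., the
           1st, 5th, 9th, 13th elements (indices 0, 4, 8, 12). */
        for (int i = 0; i < N; i += STRIDE)
            printf("a[%d] = %.0f\n", i, a[i]);
        return 0;
    }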
Scatter/Gather Operations
Gather: Reading data from non-contiguous memory locations into a vector.
Scatter: Writing data from a vector to non-contiguous memory locations.
Example: Accessing 1st, 2nd, 4th, 8th, ... elements.
Useful in irregular data structures, such as graphs or sparse matrices.
Vector processors often include special hardware to accelerate these operations,
as in the sketch below.
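A minimal sketch of gather and scatter in plain C (the index array below is illustrative; vector hardware performs these steps with dedicated instructions rather than scalar loops):

    #include <stdio.h>

    int main(void) {
        double mem[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int    idx[4]  = {0, 1, 3, 7};   /* 1st, 2nd, 4th, 8th elements */
        double v[4];

        for (int i = 0; i < 4; i++)      /* gather: non-contiguous memory -> vector */
            v[i] = mem[idx[i]];

        for (int i = 0; i < 4; i++)      /* scatter: vector -> non-contiguous memory */
            mem[idx[i]] = 2.0 * v[i];

        for (int i = 0; i < 4; i++)
            printf("mem[%d] = %.1f\n", idx[i], mem[idx[i]]);
        return 0;
    }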
Graphics Processing Units (GPUs)
GPUs power real-time graphics, using points, lines, and triangles to render
objects via a graphics pipeline (which converts 3D models into 2D images).
Shader functions (short, parallel, C-like code) program several stages of the
pipeline, performing per-element operations.
These functions benefit from SIMD parallelism, applying the same
operation to many elements simultaneously.
Each GPU core contains many data paths (e.g., 128) that execute in parallel.
GPUs require high data throughput (hundreds of MB per image) and rely on
hardware multithreading to hide memory stalls.
They can keep 100+ threads suspended per active thread; performance depends
on resource usage (e.g., registers).
GPUs are not purely SIMD or MIMD:
A single core may run multiple instruction streams.
Multiple cores can operate independently.
Use shared or distributed memory; many systems combine both.
Widely adopted for general-purpose high-performance computing using
GPU programming languages (e.g., CUDA).
1.3 Multiple Instruction Multiple Data (MIMD)
MIMD systems support multiple simultaneous instruction streams operating on multiple data
streams.
MIMD systems typically consist of a collection of fully independent processing units or cores
each of which has its own control unit and its own ALU.
MIMD systems are usually asynchronous; that is, the processors can operate at their own pace.
In many MIMD systems there is no global clock, and there may be no relation between the
system times on two different processors.
1.3.1 Shared-memory systems
A collection of autonomous processors is connected to a memory system via an interconnection
network and each processor can access each memory location.
The processors usually communicate implicitly by accessing shared data structures
(a sketch of this appears after the discussion below).
The most widely available shared-memory systems use one or more multicore
processors.
A multicore processor has multiple CPUs or cores on a single chip.
The cores have private level 1 caches, while other caches may or may not be
shared between the cores.
If the interconnect connects all the processors directly to main memory, the time to
access any memory location is the same for all the cores. Alternatively,
each processor can have a direct connection to its own block of main memory, and the
processors can access each other's blocks of main memory through special hardware
built into the processors.
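A minimal sketch of implicit communication through a shared data structure, assuming POSIX threads (the names partial_sum and work are illustrative; compile with gcc -pthread):

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4
    double partial_sum[NTHREADS];   /* shared array the threads communicate through */

    void *work(void *arg) {
        long rank = (long) arg;
        partial_sum[rank] = rank + 1.0;   /* each thread writes only its own slot */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long r = 0; r < NTHREADS; r++)
            pthread_create(&tid[r], NULL, work, (void *) r);
        for (int r = 0; r < NTHREADS; r++)
            pthread_join(tid[r], NULL);

        double total = 0.0;
        for (int r = 0; r < NTHREADS; r++)   /* the main thread simply reads what */
            total += partial_sum[r];         /* the others wrote: no explicit     */
        printf("total = %f\n", total);       /* messages are exchanged            */
        return 0;
    }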
1.3.1.A UMA multicore systems
The first type of system is called a uniform memory access or UMA system.
UMA systems are usually easier to program, since the programmer doesn’t need to worry
about different access times for different memory locations.
1.3.1.B NUMA multicore systems
The second type is called a non-uniform memory access or NUMA system.
NUMA systems have the potential to use larger amounts of memory than UMA
systems, but memory locations attached to another core take longer to access.
1.3.2 Distributed-memory systems
Clusters are the most common distributed-memory systems.
Composed of commodity systems (e.g., PCs) connected via commodity networks such as Ethernet.
Each node is typically itself a shared-memory system with one or more multicore processors.
Such systems are often referred to as hybrid systems.
In modern usage, a cluster is usually assumed to have shared-memory nodes.
Grid computing connects geographically distributed computers.
Grids form a unified distributed-memory system across long distances.
Grids are typically heterogeneous; nodes may vary in hardware.
1.4 Interconnection Networks
Communication between the processors and main memory is achieved through
interconnection networks.
The interconnect plays a major role in the performance of both shared- and distributed-memory systems.
Shared Memory Interconnects
➤ Buses
➤ Crossbars
Distributed Memory Interconnects
➤ Direct interconnects
✧ Ring
✧ Two dimensional toroidal mesh
✧ Hypercube
➤ Indirect interconnects
✧ Distributed memory crossbar
✧ Omega network
Shared Memory Interconnects
The most widely used interconnects are:
Buses
Crossbars
1.1 Buses
A bus is a collection of parallel communication wires
together with some hardware that controls access to the
bus.
The key characteristic of a bus is that the
communication wires are shared by the devices
that are connected to it.
Advantages:
Buses can be implemented at low cost.
Good flexibility, since multiple devices can be connected to a bus at little additional
cost.
Disadvantages:
As the number of devices connected to the bus increases, contention for use of the
bus also increases, since the communication wires are shared. This in turn decreases the
expected performance of the bus.
If many processors are connected to a bus, the processors
will frequently have to wait for access to main memory.
1.2 Crossbar Switches:
As the number of connected devices grows, buses are replaced
by switched interconnects.
Switches are used to control the routing of data among the
connected devices.
The most commonly used switched interconnect is the crossbar.
Communication links are bidirectional.
Shared Memory Interconnects (Crossbar)
Advantages:
Much faster than buses.
Performance is good even with simultaneous accesses.
Disadvantages:
The cost of the switches and links is high.
Distributed Memory Interconnects
The distributed-memory interconnects are often divided into two
groups: direct interconnects and indirect interconnects.
Direct Interconnects
In a direct interconnect each switch is directly connected to a
processor-memory pair, and the switches are connected to each other.
The ideal direct interconnect is a fully connected network in which
each switch is directly connected to every other switch.
Most commonly used direct interconnects are:
Ring
Two dimensional toroidal mesh
Hypercube
Fully connected network
2.1.1 Ring
A ring can carry out multiple simultaneous communications.
If there are 'n' processors, the number of links is 2n: each of the n switches has a link
to its processor, and the n switch-to-switch links form the ring.
Advantages:
➤ Simple in nature
➤ Less expensive
Disadvantages:
Some processors may sit idle until other processors
complete their communications.
2.1.2 Toroidal Mesh:
A toroidal mesh uses more complex switches than a ring.
If there are 'n' processors, the number of links is 3n: each switch has one link to its
processor, and each of its four mesh links is shared with a neighboring switch, adding 2n links.
Advantages:
➤ Good connectivity compared with a ring
Disadvantages:
Complex in nature
More expensive
A bisection of a toroidal mesh (figure)
Definitions
Hypercube
Indirect Interconnects
Indirect Interconnects (Crossbar)
Indirect Interconnects (Omega Network)
Latency and Bandwidth
Cache Coherence
Snooping Cache Coherence
Directory-Based Cache Coherence
False Sharing
What is Cache Coherence?
• Ensures all caches in a multiprocessor system reflect the
most recent value of shared variables.
• Critical for consistency in shared-memory architectures.
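As an illustrative sketch (coherence itself is enforced by hardware, so C code can only hint at it; the variables x and done are made up, and C11 atomics are used so the program is well defined; compile with gcc -pthread):

    #include <stdio.h>
    #include <pthread.h>
    #include <stdatomic.h>

    atomic_int x = 2;      /* shared variable, potentially cached by both cores */
    atomic_int done = 0;

    void *writer(void *arg) {
        atomic_store(&x, 7);      /* one core updates the shared value...     */
        atomic_store(&done, 1);   /* ...then signals that the update happened */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        while (!atomic_load(&done))
            ;   /* spin until the writer's update reaches this core */
        /* Cache coherence (together with the atomic operations) guarantees this
           read sees 7, not a stale cached copy of the old value 2. */
        printf("x = %d\n", atomic_load(&x));
        pthread_join(t, NULL);
        return 0;
    }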
Shared-memory vs. Distributed-memory
Shared-memory systems
Processors communicate via shared data structures. Easier for programmers (implicit
coordination).
Limitation: interconnect cost and scalability.
Bus contention increases rapidly with more processors.
Large crossbars are expensive, so they are used only in small systems.
Distributed-memory systems
Processors communicate via explicit message passing (see the sketch after this list).
Interconnects like the hypercube and toroidal mesh are cost-effective.
Scales to thousands of processors.
Better for problems with huge data or computation needs.
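A minimal message-passing sketch using the standard MPI interface (process 0 explicitly sends one integer to process 1; compile with mpicc and run with mpiexec -n 2):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int msg = 42;   /* the sender names the destination (rank 1) explicitly */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;        /* the receiver names the source (rank 0) explicitly */
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
    }

Unlike the shared-memory case, nothing is exchanged unless a send is matched by a receive.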
Parallel Software
SPMD (Single Program Multiple Data)
Coordinating the processes/threads
Writing Parallel Programs
Shared Memory
Nondeterminism
Thread Safety
Distributed Memory
Message Passing