What is Parallel Computing?
Doing many things simultaneously
The same thing or different things
All contributing to the solution of one larger problem
Serial Computing:
The problem is broken into a stream of instructions that are executed sequentially,
one after the other, on a single processor
Only one instruction executes at a time
Parallel Computing:
Problem is broken into parts that can be solved concurrently
Each part is further broken into a stream of instructions
Instructions from different parts execute simultaneously on different processors
Why Parallel Computing?
The real world is parallel: complex, interrelated events happen simultaneously
E.g., galaxies, planetary movements, functioning of the brain, weather, traffic
Parallel computing is better suited to modeling and simulating these processes
Why use parallel computing? To save time and produce results quickly enough to be
useful, e.g., weather modeling, automated stock trading
Many problems that are interesting to scientists and engineers don't fit on a PC; they have
huge memory requirements
Individual processors are not getting any faster, owing to
heat dissipation and power consumption limits
Where is Parallel Computing used?
Scientific Computing
Numerically Intensive Simulations
Database Operations and Information Systems
Web based services, Web search engines, online transaction processing
Client and Inventory Database Management; Data-mining; Online Transaction
Processing; Management Information Systems
Geographic Information Systems; Seismic Data Processing;
Artificial Intelligence, Machine Learning and Deep Learning
Real Time Systems and Control Applications
Hardware and Robotics Control; Speech Processing; Pattern Recognition;
1.1 CLASSIFICATION OF PARALLEL COMPUTERS
Flynn’s Classification
It is a way of organizing multiple-processor systems.
Flynn proposed the most common approach for
categorizing systems with parallel processing capabilities.
Categories of Computer Systems:
SISD – Single Instruction Single Data stream
SIMD – Single Instruction Multiple Data stream
MISD – Multiple Instruction Single Data stream
MIMD – Multiple Instruction Multiple Data stream
FLYNN’S CLASSIFICATION
Flynn’s Classification is the most popular taxonomy of computer
architecture. It was proposed by Michael J. Flynn in 1966 and is based on the
numbers of instruction and data streams.
It is based on the notion of a stream of information. Two types of
information flow into a processor: instructions and data.
The Instruction Stream is defined as the sequence of instructions
executed by the processing unit.
The Data Stream is defined as the sequence of data, including
inputs and partial or temporary results, called for by the instruction stream.
SISD – Single Instruction Single Data stream
In SISD, one instruction at a time runs on one processor, on data stored in a single
memory.
Uniprocessors fall into this category.
Machine instructions are executed in sequential order.
This architecture is also known as the sequential computer.
Many conventional computers follow this architecture.
Instructions and data must be stored in primary memory.
The clock rate determines the speed of processing in SISD.
SIMD – Single Instruction Multiple Data Stream
SIMD executes a single machine instruction on different sets of data on different
processors.
Each processing unit has an associated data memory for executing the instruction.
Much scientific computing is based on the SIMD architecture.
Vector and array processing are examples of this model.
If we want to perform an arithmetic operation on all the elements of an array A,
it can be done by this architecture.
Example: add 10 to all elements, multiply all elements by 5, rotate all the
elements right by 3 bits (a sketch follows below).
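As a rough sketch of these operations in plain C (illustrative only: the array contents are made up, the rotate assumes 32-bit unsigned ints, and a real SIMD machine would apply each operation to all elements at once rather than loop over them):

    #include <stdio.h>
    #define N 8

    int main(void) {
        unsigned int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        for (int i = 0; i < N; i++)
            a[i] += 10;                    /* add 10 to every element */
        for (int i = 0; i < N; i++)
            a[i] *= 5;                     /* multiply every element by 5 */
        for (int i = 0; i < N; i++)        /* rotate every element right by 3 bits */
            a[i] = (a[i] >> 3) | (a[i] << 29);
        for (int i = 0; i < N; i++)
            printf("%u ", a[i]);
        printf("\n");
        return 0;
    }

On a SIMD machine, each of the three loops corresponds to broadcasting a single instruction that every processing element applies to its own array element.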
MISD – Multiple Instruction Single Data Stream
In MISD, multiple processors execute different instructions on a
single data stream.
It is a theoretical model of computer.
No practical system has been
constructed using this organization.
MIMD – Multiple Instruction Multiple Data Stream
MIMD consists of a set of processors that simultaneously execute different
instruction sequences on different sets of data.
This organization refers to a computer system capable of processing several
programs at the same time.
Most multiprocessing and multiprogramming computer systems can be
classified under this organization.
An MIMD system in which each processor has local memory is a Non-Uniform
Memory Access (NUMA) organization.
NUMA is costlier than SMP.
A collection of uniprocessors or SMPs can be used to form a cluster.
Symmetric Multiprocessor (SMP) Characteristics:
There are two or more similar or comparable processors.
Memory access time is the same for all processors.
All the processors share I/O devices through the same channel.
All the processors can perform the same functions (that is why it is
called a symmetric multiprocessor system).
Single Instruction Multiple Data or SIMD Systems
SIMD (Single Instruction Multiple Data) systems are parallel systems.
They operate on multiple data streams by applying the same instruction to multiple data items.
Architecture:
Has a single control unit and multiple ALUs (processing elements).
The instruction is broadcast from the control unit to all ALUs.
Each ALU applies the instruction to its data item or stays idle.
Example – Vector Addition: Given two arrays x and y (each with n elements), we add the elements
of y to x:
for (i = 0; i < n; i++)
x[i] += y[i];
Flynn’s Taxonomy is used to classify computer architectures.
Classification is based on: the number of instruction streams and the number of data streams.
Example: A SISD system executes one instruction and processes one data item at a time
(classical von Neumann model).
Requirement: all ALUs must execute the same instruction or remain idle.
This can degrade SIMD system performance.
Example of Conditional Addition:
for (i = 0; i < n; i++)
    if (y[i] > 0.0) x[i] += y[i];
The addition is performed only if y[i] is positive, so the ALUs must first check each y[i].
Performance implication: if y[i] is positive, the addition is performed; otherwise, the
ALU storing y[i] remains idle. This results in underutilized ALUs.
Synchronous ALU operation: the ALUs must work synchronously; all wait for the next
broadcast instruction.
Instruction storage limitation: the ALUs have no instruction storage, so they
cannot delay execution and have
no ability to store instructions for later.
Vector processors: these operate on arrays or vectors of data, unlike conventional
CPUs, which operate on individual data elements (scalars).
They are widely used in high-performance computing because of their ability to process
large data sets efficiently.
Modern vector processors are designed with the following key components:
Vector Registers:
Special registers that store vectors of operands.
Enable operations to be performed simultaneously on all elements.
Vector length is fixed by the system, typically ranging from 4 to 256 64-bit
elements.
Enable high-speed, parallel computation.
Vectorized and Pipelined Functional Units:
Apply the same operation to each element of a vector (SIMD - Single Instruction, Multiple Data).
Common in operations like addition, multiplication, etc.
Each pair of corresponding elements in vectors can be processed in parallel.
Functional units are pipelined for better throughput and efficiency.
Vector Instructions:
Operate on vectors rather than scalars (single values).
Example: for(i = 0; i < n; i++)
x[i] += y[i];
With vector instructions:
Only one load, one add, and one store are needed per block of vector length.
More efficient than conventional systems, which require separate operations for each element
(a sketch of the idea follows below).
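A minimal sketch of the idea in C (the vector length VLEN below is an assumption; on real vector hardware each inner block would be a single vector load, add, and store rather than a scalar loop):

    #include <stdio.h>
    #define VLEN 4   /* assumed vector length */

    /* Strip-mined vector addition: each pass over an inner block stands for
       ONE vector load, ONE vector add, and ONE vector store. */
    void vec_add(double x[], double y[], int n) {
        for (int i = 0; i < n; i += VLEN) {
            int len = (n - i < VLEN) ? n - i : VLEN;  /* final partial block */
            for (int j = 0; j < len; j++)             /* done in parallel in hardware */
                x[i + j] += y[i + j];
        }
    }

    int main(void) {
        double x[6] = {1, 2, 3, 4, 5, 6}, y[6] = {10, 20, 30, 40, 50, 60};
        vec_add(x, y, 6);
        for (int i = 0; i < 6; i++)
            printf("%.0f ", x[i]);
        printf("\n");
        return 0;
    }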
Interleaved Memory
The memory system has multiple independent memory banks.
Banks can be accessed simultaneously, or sequentially with less delay.
Advantage: distributing vector elements across multiple banks allows faster loading/storing
and minimizes delay in accessing successive elements (see the sketch below).
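A small sketch of the usual round-robin mapping (the four-bank layout and the modulo assignment below are illustrative assumptions):

    #include <stdio.h>
    #define NBANKS 4   /* assumed number of memory banks */

    int main(void) {
        /* With round-robin interleaving, consecutive vector elements land in
           different banks, so successive accesses can overlap. */
        for (int i = 0; i < 8; i++)
            printf("element %d -> bank %d\n", i, i % NBANKS);
        return 0;
    }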
Strided Memory Access and Hardware Scatter/Gather
Strided memory access involves accessing elements at regular intervals in memory.
Example: accessing the 1st, 5th, 9th, ... elements → stride = 4.
Common in matrix and signal processing applications.
Improves performance by enabling predictable memory access patterns.
Vector systems use hardware support to handle these accesses efficiently, as in the sketch below.
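A minimal sketch of strided access in C (the stride of 4 matches the example above; accessing one column of a row-major matrix is a common source of such patterns):

    #include <stdio.h>
    #define N 16
    #define STRIDE 4

    int main(void) {
        double a[N];
        for (int i = 0; i < N; i++)
            a[i] = i;
        /* Strided access: touch every STRIDE-th element, i.e., the
           1st, 5th, 9th, 13th elements (indices 0, 4, 8, 12). */
        for (int i = 0; i < N; i += STRIDE)
            printf("a[%d] = %.0f\n", i, a[i]);
        return 0;
    }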
Scatter/Gather Operations
Gather: Reading data from non-contiguous memory locations into a vector.
Scatter: Writing data from a vector to non-contiguous memory locations.
Example: Accessing 1st, 2nd, 4th, 8th, ... elements.
Useful in irregular data structures, such as graphs or sparse matrices.
Vector processors often include special hardware to accelerate these operations,
as in the sketch below.
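A minimal sketch of gather and scatter in plain C (the index array below is illustrative; vector hardware performs these steps with dedicated instructions rather than scalar loops):

    #include <stdio.h>

    int main(void) {
        double mem[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        int    idx[4]  = {0, 1, 3, 7};   /* 1st, 2nd, 4th, 8th elements */
        double v[4];

        for (int i = 0; i < 4; i++)      /* gather: non-contiguous memory -> vector */
            v[i] = mem[idx[i]];

        for (int i = 0; i < 4; i++)      /* scatter: vector -> non-contiguous memory */
            mem[idx[i]] = 2.0 * v[i];

        for (int i = 0; i < 4; i++)
            printf("mem[%d] = %.1f\n", idx[i], mem[idx[i]]);
        return 0;
    }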
Graphics Processing Units (GPUs)
GPUs power real-time graphics, using points, lines, and triangles to render
objects via a graphics pipeline (which converts 3D models into 2D images).
Shader functions (short, parallel, C-like code) program several stages of the
pipeline, performing per-element operations.
These functions benefit from SIMD parallelism, applying the same
operation to many elements simultaneously.
Each GPU core contains many data paths (e.g., 128) that execute in parallel.
GPUs require high data throughput (hundreds of MB per image) and rely on
hardware multithreading to hide memory stalls.
They can keep 100+ threads suspended per active thread; performance depends
on resource usage (e.g., registers).
GPUs are not purely SIMD or MIMD:
A single core may run multiple instruction streams.
Multiple cores can operate independently.
Use shared or distributed memory; many systems combine both.
Widely adopted for general-purpose high-performance computing using
GPU programming languages (e.g., CUDA).
1.3 Multiple Instruction Multiple Data (MIMD)
MIMD systems support multiple simultaneous instruction streams operating on multiple data
streams.
MIMD systems typically consist of a collection of fully independent processing units or cores
each of which has its own control unit and its own ALU.
MIMD systems are usually asynchronous; that is, the processors can operate at their own pace.
In many MIMD systems there is no global clock, and there may be no relation between the
system times on two different processors.
1.3.1 Shared-memory systems
A collection of autonomous processors is connected to a memory system via an interconnection
network and each processor can access each memory location.
The processors usually communicate implicitly by accessing shared data structures
(a sketch of this appears after the discussion below).
The most widely available shared-memory systems use one or more multicore
processors.
A multicore processor has multiple CPUs or cores on a single chip.
The cores have private level 1 caches, while other caches may or may not be
shared between the cores.
If the interconnect connects all the processors directly to main memory, the time to
access any memory location is the same for all the cores. Alternatively,
each processor can have a direct connection to its own block of main memory, and the
processors can access each other's blocks of main memory through special hardware
built into the processors.
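A minimal sketch of implicit communication through a shared data structure, assuming POSIX threads (the names partial_sum and work are illustrative; compile with gcc -pthread):

    #include <stdio.h>
    #include <pthread.h>

    #define NTHREADS 4
    double partial_sum[NTHREADS];   /* shared array the threads communicate through */

    void *work(void *arg) {
        long rank = (long) arg;
        partial_sum[rank] = rank + 1.0;   /* each thread writes only its own slot */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long r = 0; r < NTHREADS; r++)
            pthread_create(&tid[r], NULL, work, (void *) r);
        for (int r = 0; r < NTHREADS; r++)
            pthread_join(tid[r], NULL);

        double total = 0.0;
        for (int r = 0; r < NTHREADS; r++)   /* the main thread simply reads what */
            total += partial_sum[r];         /* the others wrote: no explicit     */
        printf("total = %f\n", total);       /* messages are exchanged            */
        return 0;
    }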
1.3.1.A UMA multicore systems
The first type of system is called a uniform memory access or UMA system.
UMA systems are usually easier to program, since the programmer doesn’t need to worry
about different access times for different memory locations.
1.3.1.B NUMA multicore systems
The second type is called a non-uniform memory access or NUMA system.
NUMA systems have the potential to use larger amounts of memory than UMA
systems, but memory locations attached to another core take longer to access.
1.3.2 Distributed-memory systems
Clusters are the most common distributed-memory systems.
Composed of commodity systems (e.g., PCs) connected via commodity networks such as Ethernet.
Each node is typically itself a shared-memory system with one or more multicore processors.
Such systems are often referred to as hybrid systems.
In modern usage, a cluster is usually assumed to have shared-memory nodes.
Grid computing connects geographically distributed computers.
Grids form a unified distributed-memory system across long distances.
Grids are typically heterogeneous; nodes may vary in hardware.
1.4 Interconnection Networks
Communication between the processors and main memory is achieved through
interconnection networks.
The interconnect plays a major role in the performance of both shared- and distributed-memory systems.
Shared Memory Interconnects
➤ Buses
➤ Crossbars
Distributed Memory Interconnects
➤ Direct interconnects
✧ Ring
✧ Two dimensional toroidal mesh
✧ Hypercube
➤ Indirect interconnects
✧ Distributed memory crossbar
✧ Omega network
Shared Memory Interconnects
The most widely used interconnects are:
Buses
Crossbars
1.1 Buses
A bus is a collection of parallel communication wires
together with some hardware that controls access to the
bus.
The key characteristic of a bus is that the
communication wires are shared by the devices
that are connected to it.
Advantages:
Buses can be implemented at low cost.
Good flexibility, since multiple devices can be connected to a bus at little additional
cost.
Disadvantages:
As the number of devices connected to the bus increases, contention for use of the
bus also increases, since the communication wires are shared. This in turn decreases the
expected performance of the bus.
If many processors are connected to a bus, the processors
will frequently have to wait for access to main memory.
1.2 Crossbar Switches:
As the number of connected devices grows, buses are replaced
by switched interconnects.
Switches are used to control the routing of data among the
connected devices.
The most commonly used switched interconnect is the crossbar.
Communication links are bidirectional.
Shared Memory Interconnects (Crossbar)
Advantages:
Much faster than buses.
Performance is good even with simultaneous accesses.
Disadvantages:
The cost of the switches and links is high.
Distributed Memory Interconnects
The distributed-memory interconnects are often divided into two
groups: direct interconnects and indirect interconnects.
Direct Interconnects
In a direct interconnect each switch is directly connected to a
processor-memory pair, and the switches are connected to each other.
The ideal direct interconnect is a fully connected network in which
each switch is directly connected to every other switch.
Most commonly used direct interconnects are:
Ring
Two dimensional toroidal mesh
Hypercube
Fully connected network
2.1.1 Ring
A ring can carry out multiple simultaneous communications.
If there are 'n' processors, the number of links is 2n: each of the n switches has a link
to its processor, and the n switch-to-switch links form the ring.
Advantages:
➤ Simple in nature
➤ Less expensive
Disadvantages:
Some processors may sit idle until other processors
complete their communications.
2.1.2 Toroidal Mesh:
A toroidal mesh uses more complex switches than a ring.
If there are 'n' processors, the number of links is 3n: each switch has one link to its
processor, and each of its four mesh links is shared with a neighboring switch, adding 2n links.
Advantages:
➤ Good connectivity compared with a ring
Disadvantages:
Complex in nature
More expensive
A bisection of a toroidal mesh (figure)
Definitions
Hypercube
Indirect Interconnects
Indirect Interconnects (Crossbar)
Indirect Interconnects (Omega Network)
Latency and Bandwidth
Cache Coherence
Snooping Cache Coherence
Directory-Based Cache Coherence
False Sharing
What is Cache Coherence?
• Ensures all caches in a multiprocessor system reflect the
most recent value of shared variables.
• Critical for consistency in shared-memory architectures.
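As an illustrative sketch (coherence itself is enforced by hardware, so C code can only hint at it; the variables x and done are made up, and C11 atomics are used so the program is well defined; compile with gcc -pthread):

    #include <stdio.h>
    #include <pthread.h>
    #include <stdatomic.h>

    atomic_int x = 2;      /* shared variable, potentially cached by both cores */
    atomic_int done = 0;

    void *writer(void *arg) {
        atomic_store(&x, 7);      /* one core updates the shared value...     */
        atomic_store(&done, 1);   /* ...then signals that the update happened */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        while (!atomic_load(&done))
            ;   /* spin until the writer's update reaches this core */
        /* Cache coherence (together with the atomic operations) guarantees this
           read sees 7, not a stale cached copy of the old value 2. */
        printf("x = %d\n", atomic_load(&x));
        pthread_join(t, NULL);
        return 0;
    }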
Shared-memory vs. Distributed-memory
Shared-memory systems
Processors communicate via shared data structures. Easier for programmers (implicit
coordination).
Limitation: interconnect cost and scalability.
Bus contention increases rapidly with more processors.
Large crossbars are expensive, so they are used only in small systems.
Distributed-memory systems
Processors communicate via explicit message passing (see the sketch after this list).
Interconnects like the hypercube and toroidal mesh are cost-effective.
Scales to thousands of processors.
Better for problems with huge data or computation needs.
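A minimal message-passing sketch using the standard MPI interface (process 0 explicitly sends one integer to process 1; compile with mpicc and run with mpiexec -n 2):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int msg = 42;   /* the sender names the destination (rank 1) explicitly */
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;        /* the receiver names the source (rank 0) explicitly */
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
    }

Unlike the shared-memory case, nothing is exchanged unless a send is matched by a receive.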
Parallel Software
SPMD (Single Program Multiple Data)
Coordinating the processes/threads
Writing Parallel Programs
Shared Memory
Nondeterminism
Thread Safety
Distributed Memory
Message Passing